*Introduction*

To compute a transform of size *N*, the traditional “row/column” approach to compute the DFT is used. This factorization assumes the transform size *N* can be written *N*=*N*_{1}*N_{2} and requires computation of two sets of smaller DFTs, *N*_{2} transforms of length *N*_{1} (referred to as “column” transforms) and *N*_{1} transforms of length *N*_{2} (referred to as “row” transforms). The *N*_{1}xN_{2} “DFT matrix” *X* contains input samples *x*_{1},*x*_{2},…,*x*_{N2} on row 1, *x*_{N2+}_{1},*x*_{N}_{2+2},…,*x*_{2*}_{N2}, on row 2, etc. In between column and row transforms it is necessary to multiply each of the *N* points by the usual twiddle factor, *W*_{N}^{i,k}, *i*=0,1,..,*N*_{1}-1, *k*=0,1,..*N*_{2}-1. After the row transforms the DFT output *Z* resides in the matrix in column major order.

To compute the row and column DFTs, a new matrix formulation is used. For this each column and row DFT of size *M* (*M=N*_{2} for the row DFTs and *M=N*_{1} for the column DFTs) is further decomposed as *M=N*_{3}*N_{4}=N_{3}*b. It can then be shown that the DFT of *M* can be computed by the expressions

*Y*_{b}=W_{M}•C_{M1}X_{b}

*Z*_{b}=C_{M2}Y_{b}^{t}

where “*b*” refers to the value of *N*_{4} or “base”. Here *X*_{b} is an *b*x*N*_{3} input data matrix (*M* values) and the DFT output *Z*_{b} is a *b*x*N*_{3} matrix (*M* values). Also, *W*_{M} is a small *N*_{3} x N_{3} coefficient matrix used in an element-by-element multiply and *CM*_{1},CM_{2} are arrays of *b*x*b* radix-*b* butterfly matrices.

For all power-of-two circuits the base *b* is set to 4 so that elements *CM*_{1},CM_{2} contain the only the elements {1,-1,*j*,-*j}*. For non-power-of-two circuits the base must be large enough to include all factors of the desired transform size.

*Architecture*

An abstraction of the architecture (Fig. 1) shows that it consists of two *b*x*b* processing element (PE)
pipelined or "systolic arrays" connected by a *b*x1 array of complex multipliers. Each PE in the left-hand-side (LHS) and the right-hand-side (RHS) systolic arrays contain a few registers, multiplexors, two real adders, and miscellaneous logic as shown in Fig. 2. Additionally, each PE in each systolic array has access to a small dual-port RAM, to which RHS PEs write and LHS PEs read. The systolic matrix-matrix multiplications are carried out using well known mappings of signal flow graphs to PE arrays. FFT input data in *X*_{bc} are all fixed-point *n*-bit 2’s complement words. The FFT computation proceeds in two basic steps as described below.

Fig. 1. Systolic processing flow for column and row DFTs. Subscript “*c*” and “*r*” refer to computing column and row DFTs of the *N*_{1}xN_{2} DFT matrix.

*Step 1: Column DFTs*

An input buffer feeds the LHS SA with column data *X*_{bci}, *i*=1..*N*_{2}, of the *N*_{1}xN_{2} DFT matrix, each array of column elements organized as an *M=N*_{1}=*b*x*N*_{3c} matrix. Also, each PE in the *b*x*b* LHS systolic array contains an element of the *b*x*b* matrix *C*_{M1}. The systolic matrix-matrix multiplication results *C*_{M1} X_{bci} then flow out of the LHS systolic array through the multiplier array shown in Fig. 1 where the coefficient multiplication by *W*_{M}, stored in ROM, produces *Y*^{t}_{bci}. A second systolic matrix-matrix multiplication is then performed by the RHS systolic array with inputs *Y*^{t}_{bci} from the left and *C*_{M2} from the bottom (one *C*_{M2} matrix per column *X*_{bci}), producing the results *Z*_{bci}, which are stored in a distributed fashion in the RHS PE RAMs and become the *X*_{bri} for row DFTs in Fig. 1 Step 2 after the twiddle multiplication by *W*_{N}^{i,k}.

*Step 2: Row DFTs*

The processing in this step with *M=N*_{2}=*b*x*N*_{3r} is often identical to that for the column DFTs with two exceptions. First, there is a juxtaposition of the *C*_{M1} values with the *X*_{bri} row inputs. In this case the row *X*_{bri} values are retrieved from the PE internal RAMs, while the *C*_{M1} values now flow into the LHS SA from the bottom. Second, the *Z*_{bri} FFT outputs are stored in a separate RAM output buffer.

Fig. 2. Functionality of PEs in LHS and RHS arrays of PEs.

By choosing a SA size along with appropriate values of *N*_{1}, *N*_{2}, *N*_{3}, and *b,* there are few restrictions on target transform sizes, hence the circuit is fundamentally programmable. The benefit of a programmable circuit is that the computational SA hardware can be highly optimized and then reused with different control circuitry to do many different computations. The SA size is chosen so that throughput requirements are met.