Technology: Architecture

Introduction

A transform of size N is computed with the traditional “row/column” factorization of the DFT. This factorization assumes the transform size can be written N=N₁*N₂ and requires two sets of smaller DFTs: N₂ transforms of length N₁ (referred to as “column” transforms) and N₁ transforms of length N₂ (referred to as “row” transforms). The N₁xN₂ “DFT matrix” X contains input samples x_1, x_2, …, x_{N₂} on row 1, x_{N₂+1}, x_{N₂+2}, …, x_{2N₂} on row 2, etc. Between the column and row transforms, each of the N points must be multiplied by the usual twiddle factor W_N^{ik}, i=0,1,…,N₁-1, k=0,1,…,N₂-1. After the row transforms the DFT output Z resides in the matrix in column-major order.
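The row/column factorization described above can be checked numerically. The following is a minimal software sketch, not the hardware mapping: it assumes the conventional twiddle definition W_N = exp(-2πj/N) and uses a direct O(M²) DFT for the short transforms. The function names `dft` and `row_column_dft` are illustrative and do not come from the design.

```python
import cmath

def dft(seq):
    """Direct O(M^2) DFT, used for the short transforms and as a reference."""
    m = len(seq)
    return [sum(seq[n] * cmath.exp(-2j * cmath.pi * n * k / m) for n in range(m))
            for k in range(m)]

def row_column_dft(x, n1, n2):
    """Length-N DFT (N = n1 * n2) via column DFTs, twiddles, then row DFTs."""
    n = n1 * n2
    assert len(x) == n
    # Load the n1 x n2 matrix in row-major order: row r holds x[r*n2 : (r+1)*n2].
    mat = [list(x[r * n2:(r + 1) * n2]) for r in range(n1)]
    # n2 column transforms, each of length n1.
    for c in range(n2):
        col = dft([mat[r][c] for r in range(n1)])
        for r in range(n1):
            mat[r][c] = col[r]
    # Twiddle multiplication by W_N^(i*k), i = row index, k = column index.
    for i in range(n1):
        for k in range(n2):
            mat[i][k] *= cmath.exp(-2j * cmath.pi * i * k / n)
    # n1 row transforms, each of length n2.
    mat = [dft(row) for row in mat]
    # The result sits in column-major order: Z[k1 + n1*k2] = mat[k1][k2].
    return [mat[r][c] for c in range(n2) for r in range(n1)]
```

Calling `row_column_dft(x, 3, 4)` on any 12-point sequence should agree with `dft(x)` to within rounding error, which illustrates why the output must be read out of the matrix column by column.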

To compute the row and column DFTs, a new matrix formulation is used. For this, each column and row DFT of size M (M=N₂ for the row DFTs and M=N₁ for the column DFTs) is further decomposed as M=N₃*N₄=N₃*b. It can then be shown that the DFT of length M can be computed by the expressions

Y_b = W_M • (C_M1 X_b)

Z_b = C_M2 Y_b^t

where “b” refers to the value of N₄, or “base”. Here X_b is a bxN₃ input data matrix (M values) and the DFT output Z_b is a bxN₃ matrix (M values). Also, W_M is a small N₃xN₃ coefficient matrix used in an element-by-element multiply (denoted •), and C_M1, C_M2 are arrays of bxb radix-b butterfly matrices.

For all power-of-two circuits the base b is set to 4, so that the matrices C_M1 and C_M2 contain only the elements {1,-1,j,-j}. For non-power-of-two circuits the base must be large enough to include all factors of the desired transform size.
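The practical consequence of the base-4 choice is that a butterfly needs no multipliers. The sketch below assumes that a radix-4 butterfly matrix is the ordinary 4-point DFT matrix with entries (-j)^(r*c); the exact layout of C_M1 and C_M2 in the actual circuit is not given here, and the function names are illustrative.

```python
def radix4_matrix():
    """4-point DFT matrix; entry (r, c) is (-j)**(r*c), so all entries are in {1, -1, j, -j}."""
    powers = [1 + 0j, -1j, -1 + 0j, 1j]
    return [[powers[(r * c) % 4] for c in range(4)] for r in range(4)]

def times_minus_j(z):
    """Multiply by -j with a swap and a sign change: (a + jb)(-j) = b - ja."""
    return complex(z.imag, -z.real)

def radix4_butterfly(v):
    """Multiplier-free 4-point DFT of v = [v0, v1, v2, v3] using only adds and subtracts."""
    s02, d02 = v[0] + v[2], v[0] - v[2]
    s13, d13 = v[1] + v[3], v[1] - v[3]
    rot = times_minus_j(d13)
    return [s02 + s13, d02 + rot, s02 - s13, d02 - rot]
```

`radix4_butterfly` gives the same result as a full matrix-vector product with `radix4_matrix()`, but every operation is an addition, a subtraction, or a real/imaginary swap. This is consistent with PEs that contain adders but no multipliers.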

Architecture

An abstraction of the architecture (Fig. 1) shows that it consists of two bxb pipelined arrays of processing elements (PEs), or "systolic arrays" (SAs), connected by a bx1 array of complex multipliers. Each PE in the left-hand-side (LHS) and right-hand-side (RHS) systolic arrays contains a few registers, multiplexors, two real adders, and miscellaneous logic, as shown in Fig. 2. Additionally, each PE in each systolic array has access to a small dual-port RAM, to which RHS PEs write and from which LHS PEs read. The systolic matrix-matrix multiplications are carried out using well-known mappings of signal flow graphs to PE arrays. FFT input data in X_bc are all fixed-point n-bit 2’s complement words. The FFT computation proceeds in two basic steps, as described below.

Fig. 1. Systolic processing flow for column and row DFTs. Subscripts “c” and “r” refer to computing column and row DFTs of the N₁xN₂ DFT matrix.

Step 1: Column DFTs

An input buffer feeds the LHS SA with column data X_bci, i=1,…,N₂, of the N₁xN₂ DFT matrix, with the M=N₁ elements of each column organized as a bxN_3c matrix. Each PE in the bxb LHS systolic array holds an element of the bxb matrix C_M1. The results of the systolic matrix-matrix multiplication, C_M1 X_bci, flow out of the LHS systolic array through the multiplier array shown in Fig. 1, where the element-by-element multiplication by W_M, stored in ROM, produces Y^t_bci. The RHS systolic array then performs a second systolic matrix-matrix multiplication with inputs Y^t_bci from the left and C_M2 from the bottom (one C_M2 matrix per column X_bci), producing the results Z_bci. These are stored in a distributed fashion in the RHS PE RAMs and, after the twiddle multiplication by W_N^{ik}, become the X_bri inputs for the row DFTs of Step 2 in Fig. 1.

Step 2: Row DFTs

The processing in this step, with M=N₂=bxN_3r, is identical to that for the column DFTs with two exceptions. First, the roles of the C_M1 values and the X_bri row inputs are exchanged: the X_bri values are retrieved from the PE internal RAMs, while the C_M1 values now flow into the LHS SA from the bottom. Second, the Z_bri FFT outputs are stored in a separate RAM output buffer.

Fig. 2. Functionality of the PEs in the LHS and RHS arrays.

With a suitable choice of SA size and of the values N₁, N₂, N₃, and b, there are few restrictions on target transform sizes; hence the circuit is fundamentally programmable. The benefit of a programmable circuit is that the computational SA hardware can be highly optimized and then reused with different control circuitry to do many different computations. The SA size is chosen so that throughput requirements are met.
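As a rough illustration of how the parameters interact, the sketch below lists the factorizations that satisfy only the divisibility constraints stated earlier: N=N₁*N₂, N₁=b*N_3c, and N₂=b*N_3r. The helper `candidate_factorizations` is hypothetical. It does not model throughput, SA size, or the requirement that the base cover all factors of a non-power-of-two size, so it should not be read as the actual selection procedure.

```python
def candidate_factorizations(n, b):
    """List (N1, N2, N3c, N3r) with N = N1*N2, N1 = b*N3c and N2 = b*N3r."""
    out = []
    # N1 must be a multiple of b, and N2 = N / N1 must be at least b.
    for n1 in range(b, n // b + 1, b):
        if n % n1 == 0 and (n // n1) % b == 0:
            n2 = n // n1
            out.append((n1, n2, n1 // b, n2 // b))
    return out

# Example: candidate splits of a 1024-point transform with base b = 4.
for n1, n2, n3c, n3r in candidate_factorizations(1024, 4):
    print(f"N1={n1:4d}  N2={n2:4d}  N3c={n3c:3d}  N3r={n3r:3d}")
```

Under these assumptions a 1024-point transform with b=4 admits several splits, including the square one with N₁=N₂=32 and N₃=8 for both the column and row DFTs.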