To compute the discreet Fourier transform (DFT) the traditional “row/column” approach is used. This factorization assumes the transform size N can be written the product N =N1*N2 and requires computation of two sets of smaller DFTs, N2 transforms of length N1 (referred to as “column” transforms) and N1 transforms of length N2 (referred to as “row” transforms). The N1xN2 “DFT matrix” X contains input samples xi that are arranged x1,x2,…,xN2 on row 1, xN2+1,xN2+2,…,x2*N2, on row 2, etc. In between column and row transforms it is necessary to multiply each of the N points by the usual twiddle factor, WNi,k, i=0,1,..,N1-1, k=0,1,..N2-1. After the row transforms the DFT output Z resides in the matrix in column major order.
It remains to be shown how the N2-point row and N1-point column DFTs are computed. Centar’s approach is very different from the usual mapping of decimation-in-time and decimation-in-frequency “signal flow” graphs onto linear and parallel circuit structures. Rather a new matrix based algorithm is used to compute the DFT which naturally maps into a succession of radix-b matrix-matrix multiplications that can be done at high speed with minimal complexity using small, locally connected systolic arrays. This derivation starts with the definition of the DFT as
(1) |
where M is the transform size (M=N2 for the row DFTs and M=N1 for the column DFTs), X(m) are the time domain input values, Z(k) are the frequency domain outputs and WM=e-j(2π/M). In matrix terms (1) can be represented as
(2) |
where C is a coefficient matrix containing elements WMmk. If M can be factored as M=N3*N4, then using the reindexings m=m1+N3m2 and k=k1+N3k2 with m1=0,1,…,N3-1, k1=0,1,…,N3-1, m2=0,1,…,N4-1, k2=0,1,…,N4-1 , (1) becomes
(3) |
This expression can be considerably simplified by imposing the restriction that N3/N4 be an integer value so that (WN4)m2k2N3=exp(-2πm2k2N3/N4)=1. For any particular value of m1,k1 the value of the inner parenthesis in (3), Y(k1,m1) can be evaluated from the dot product
so that (3) becomes
(4) |
All the (N3)2 dot product values Y(k1,m1) can be collected in the N3xN3 matrix Yb by performing the matrix multiplication
(5) |
where WM is an N3xN3 matrix with elements WM[k1,m1]=WMm1k1, CM1 is an N3xN4 coefficient matrix with elements CM1[k1,m2]=WMm2k1 , Xb is an N4xN3 matrix with elements Xb[m2,m1]=X(m1+N3m2), Yb is a N3xN3 matrix with elements Yb[k1,m1]=Y(k1,m1) and “•” means element-by-element multiply.
In a similar way for a particular k1,k2 the corresponding Z(k1,N3k2) can be calculated from the dot product
(6) |
and by collecting the dot products as before, a matrix expression for calculating Z is obtained as
(7) |
where CM2 is an N4xN3 coefficient matrix with elements CM2[k2,m1]=WMm1k2 , and Zb is an N4xN3 matrix containing the transform outputs Zb[k2,k1]=Z(k1+k2N3).
The character of (7) is determined largely by the value of N4 or the “base” b (b=N4) because this sets the reachable values of M and the structure of the coefficient matrices CM1 and CM2. In (7) CM1 and CM2 contain M/b2 sub-matrices CB=[c1|c2|...|cb] with the form CM1=[CBt|CBt|...|CBt] t and CM2=[CB|CB|...|CB] due to the periodicity of b or WN4. Also, values of M are constrained to be integer multiples of b2, since it was assumed in (3) that N3/N4 is an integer. Although the choice of b is application dependent, for power-of-two designs only “base-4” (b=4) designs are chosen because this choice provides good architectures that are arithmetically efficient. This selection results in
and
(8) |
where CB in (8) is the coefficient matrix for a 4-point DFT and also describes a radix-4 decimation in time butterfly. Therefore in (7) Yb can be seen as resulting from a series of 4-point transforms of a bit-reversed input X followed by a twiddle multiplication and Zb is obtained from summations of the results of 4-point transforms of (Yb)t.