Because the architecture is inherently regular, it lends itself to a variety of implementation options based on different scaling approaches so that circuit throughput can always match the application requirements. (Any allowed transform size N is possible using any array structure, as long as there is sufficient memory.)
With regard to increasing throughput, the two most straightforward strategies are (1) to alter the number of rows/columns in the DFT matrix or (2) to add parallelism by duplicating the array structure. Circuits with throughputs that exceed 10G complex samples per second for a single FFT size are straightforward to construct. Aggregate throughput is limited only by the available FPGA or ASIC resources.
(1) Setting throughputs by choice of DFT matrix size
The basic strategy here is to pick the desired transform time and then choose an array size that makes this achievable. This is possible because the architecture uses the basic “row/column” decomposition of the transform size N such that N = Nr × Nc, where the array length in “PE rows” is Nr/b (b=4 for N = 2^n). Therefore, different resource-speed tradeoffs simply involve changing Nr and Nc. For example, a 1024-point (streaming) transform could be computed using three different choices of Nr and Nc, as shown in Table 1. In general, higher throughput is achieved by extending the array size along the “North/South” direction (Fig. 1) by adding PE rows.
Nr | Nc | Transform Size | Array PE Rows | Throughput (µsec) |
16 | 64 | 1024 | 4 | ~2.0 |
32 | 32 | 1024 | 8 | ~1.5 |
64 | 16 | 1024 | 16 | ~0.90 |
Table 1. Example of estimated performance for scaling options obtained by varying Nr and Nc while keeping N the same.
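The row/column decomposition behind these options can be illustrated in software. The following is a minimal sketch of the generic textbook decomposition with N = Nr × Nc, not the circuit itself. It assumes that the column DFTs have length Nr, that a twiddle multiplication sits between the column and row stages, and that the input is indexed as x[Nc*r + c]; the function names `dft`, `row_column_dft` and `pe_rows` are hypothetical and chosen only for this illustration.

```python
import cmath


def dft(x):
    """Direct O(M^2) DFT of a sequence, used here as a building block and reference."""
    m = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / m) for n in range(m))
        for k in range(m)
    ]


def row_column_dft(x, nr, nc):
    """DFT of length N = nr * nc via column DFTs, twiddles, then row DFTs.

    Assumed index mapping (a standard choice, not taken from the source):
    input  x[nc * r + c]   with r in 0..nr-1, c in 0..nc-1
    output X[k1 + nr * k2] with k1 in 0..nr-1, k2 in 0..nc-1
    """
    n = nr * nc
    assert len(x) == n, "input length must equal nr * nc"

    # Stage 1: nc column DFTs, each of length nr.
    cols = [dft([x[nc * r + c] for r in range(nr)]) for c in range(nc)]

    out = [0j] * n
    for k1 in range(nr):
        # Stage 2: twiddle factors W_N^(c * k1) applied to row k1.
        row = [
            cols[c][k1] * cmath.exp(-2j * cmath.pi * c * k1 / n)
            for c in range(nc)
        ]
        # Stage 3: one row DFT of length nc.
        row_out = dft(row)
        for k2 in range(nc):
            out[k1 + nr * k2] = row_out[k2]
    return out


def pe_rows(nr, b=4):
    """Array length in PE rows, Nr / b, with b = 4 as stated for N = 2^n."""
    return nr // b
```

With the values of Table 1, `pe_rows(16)`, `pe_rows(32)` and `pe_rows(64)` give 4, 8 and 16 PE rows, and `row_column_dft(x, nr, nc)` agrees with `dft(x)` to rounding error for any split with nr * nc equal to the input length.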
Alternatively, larger transform sizes can be mapped to a single array of small fixed size. For example, an array structure with 4 PE rows like that shown in Fig. 1 can be used to compute the DFT for transforms from 128 points to 2048 points, as shown in Table 2. Note that for the number of PE rows to remain fixed, Nr must also be fixed.
Fig. 1 (a) Functional operation of base-b (b=4) architecture and (b) circuit implementation for an M=16-point transform size showing inputs at different times t and internal matrix values. Here X is the transform input and Z is the DFT output. There are 4 “PE rows” in this example.
Nr | Nc | Transform Size | Array PE Rows | Throughput (µsec) |
16 | 8 | 128 | 4 | ~0.25 |
16 | 16 | 256 | 4 | ~0.50 |
16 | 32 | 512 | 4 | ~1.0 |
16 | 64 | 1024 | 4 | ~2.0 |
16 | 128 | 2048 | 4 | ~4.0 |
Table 2. Example of estimated performance for scaling options obtained by varying N and Nc while keeping Nr the same.
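To compare the entries of Table 2 on a common scale, the tabulated times can be converted into a sample rate. The sketch below assumes that the µsec column is the time per transform in streaming operation, so that the rate is N divided by that time. The helper name `samples_per_microsecond` is hypothetical, and because the table values are approximate, the derived rates are approximate as well.

```python
# (Nr, Nc, time in microseconds) copied from Table 2; the times are approximate.
TABLE_2 = [
    (16, 8, 0.25),
    (16, 16, 0.50),
    (16, 32, 1.0),
    (16, 64, 2.0),
    (16, 128, 4.0),
]


def samples_per_microsecond(nr, nc, time_us):
    """Complex samples per microsecond, assuming time_us is the time per transform."""
    return (nr * nc) / time_us


for nr, nc, time_us in TABLE_2:
    rate = samples_per_microsecond(nr, nc, time_us)
    print(f"N={nr * nc:5d}  PE rows={nr // 4}  ~{rate:.0f} complex samples/us")
```

Under that assumption every row works out to about 512 complex samples per microsecond, which is consistent with a fixed 4-row array processing input at a constant rate regardless of transform size.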
(2) Setting throughputs by array parallelism
Here the approach is to duplicate the architecture n times, so that the input is divided into n streams of column DFTs, one going to each of the n array structures. In this way the throughputs listed above can be increased by a factor of n.
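The effect of duplicating the array can be sketched with a few lines of arithmetic. The example below is illustrative only: it assumes the column DFTs are handed out round-robin (the text says only that the input is divided into n streams), that scaling is ideal with no distribution overhead, and that rates are expressed in complex samples per microsecond. All function names are hypothetical.

```python
import math


def split_columns(num_columns, n_arrays):
    """Assign column-DFT indices to n_arrays streams, round-robin (an assumed policy)."""
    return [list(range(a, num_columns, n_arrays)) for a in range(n_arrays)]


def ideal_transform_time_us(single_array_time_us, n_arrays):
    """Transform time with n_arrays copies, assuming ideal factor-of-n scaling."""
    return single_array_time_us / n_arrays


def arrays_needed(target_rate, single_array_rate):
    """Smallest number of arrays whose combined ideal rate reaches target_rate."""
    return math.ceil(target_rate / single_array_rate)


# Example: the 1024-point entry of Table 2 (Nc = 64 column DFTs, ~2.0 usec)
# spread across 4 identical arrays.
streams = split_columns(64, 4)
print([len(s) for s in streams])
print(ideal_transform_time_us(2.0, 4))
```

The same arithmetic can be run in reverse: `arrays_needed(target_rate, single_array_rate)` estimates how many copies a given aggregate rate would require, subject to the ideal-scaling assumption and to the available FPGA or ASIC resources.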
Additionally, it is possible to use separate array structures for row and column DFTs, which improves throughput by another factor of two to three. For example, a 1024-point FFT can be run in this way at speeds of up to 8G complex samples per second.