
Technology: Scaling

Because the architecture is inherently regular, it lends itself to a variety of implementation options based on different scaling approaches so that circuit throughput can always match the application requirements. (Any allowed transform size N is possible using any array structure, as long as there is sufficient memory.)

To increase throughput, the two most straightforward strategies are (1) to alter the number of rows/columns in the DFT matrix or (2) to add parallelism by duplicating the array structure. Circuits with throughputs that exceed 10G complex samples per second for a single FFT size are straightforward to construct. Aggregate throughput is limited only by the available FPGA or ASIC resources.

(1) Setting throughputs by choice of DFT matrix size

The basic strategy here is to pick the desired transform time and then choose an array size that makes this achievable. This is possible because the architecture uses the basic “row/column” decomposition of the transform size N such that N = N_r N_c, where the array length in “PE rows” is N_r/b (b=4 for N=2ⁿ). Therefore, different resource-speed tradeoffs simply involve changing N_r and N_c. For example, a 1024-point (streaming) transform could be computed using three different choices of N_r and N_c values, as shown in Table 1. In general, higher throughput is achieved by extending the array size along the “North/South” direction (Fig. 1) by adding PE rows.
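The row/column decomposition N = N_r N_c can be illustrated in software. The following is a minimal sketch of the textbook decomposition (column DFTs, twiddle multiplication, row DFTs), not of the array circuit itself; the index mapping and the function names `dft` and `row_column_dft` are assumptions made for illustration.

```python
import cmath

def dft(x):
    """Direct O(M^2) DFT of a short sequence (reference implementation)."""
    m = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / m) for n in range(m))
            for k in range(m)]

def row_column_dft(x, n_r, n_c):
    """DFT of length N = n_r * n_c via column DFTs, twiddles, and row DFTs.

    Assumed index mapping (textbook form, not taken from the source):
    input  n = n_c * r + c    (r in 0..n_r-1, c in 0..n_c-1)
    output k = k_r + n_r * k_c
    """
    n = n_r * n_c
    assert len(x) == n
    # Length-n_r DFT down each of the n_c columns; cols[c][k_r].
    cols = [dft([x[n_c * r + c] for r in range(n_r)]) for c in range(n_c)]
    # Twiddle factors W_N^(c * k_r) between the two DFT stages.
    for c in range(n_c):
        for k_r in range(n_r):
            cols[c][k_r] *= cmath.exp(-2j * cmath.pi * c * k_r / n)
    # Length-n_c DFT along each of the n_r rows, written to the output order.
    out = [0j] * n
    for k_r in range(n_r):
        row = dft([cols[c][k_r] for c in range(n_c)])
        for k_c in range(n_c):
            out[k_r + n_r * k_c] = row[k_c]
    return out

# Compare against the direct DFT for a small case (N = 32 = 4 * 8).
x = [complex(i % 7, -(i % 3)) for i in range(32)]
max_err = max(abs(a - b) for a, b in zip(dft(x), row_column_dft(x, 4, 8)))
print(max_err)
```

Changing `n_r` and `n_c` while keeping their product fixed gives the same result, which is the software analogue of the resource-speed tradeoff described above.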

N_r	N_c	Transform Size	Array PE Rows	Transform Time (µsec)
16	64	1024	4	~2.0
32	32	1024	8	~1.5
64	16	1024	16	~0.90

Table 1. Example of estimated performance for scaling options obtained by varying N_r and N_c, keeping N the same.
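The PE-row counts in Table 1 follow directly from the N_r/b relation. A small sketch that reproduces them is given here; the helper names `pe_rows` and `factorizations` are hypothetical, and b=4 is taken from the text for power-of-two N.

```python
def pe_rows(n_r, b=4):
    """Array length in PE rows, N_r / b (b = 4 for power-of-two N per the text)."""
    if n_r % b:
        raise ValueError("N_r must be a multiple of b")
    return n_r // b

def factorizations(n, n_r_choices):
    """(N_r, N_c, PE rows) for each candidate N_r that divides N."""
    return [(n_r, n // n_r, pe_rows(n_r)) for n_r in n_r_choices if n % n_r == 0]

# The three N_r choices listed in Table 1 for a 1024-point transform.
table1 = factorizations(1024, [16, 32, 64])
for n_r, n_c, rows in table1:
    print(n_r, n_c, rows)
```

This covers only the array dimensions; the timing figures in the table are estimates from the source and are not derived here.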

Alternatively, larger transform sizes can be mapped to a single array of small fixed size. For example, an array structure with 4 PE rows like that shown in Fig. 1 below can be used to compute the DFT for transforms from 128 points to 2048 points, as shown in Table 2. Note that for the number of PE rows to remain fixed, N_r must also be fixed.

Fig. 1 (a) Functional operation of base-b (b=4) architecture and (b) circuit implementation for an M=16-point transform size showing inputs at different times t and internal matrix values. Here X is the transform input and Z is the DFT output. There are 4 “PE rows” in this example.

N_r	N_c	Transform Size	Array PE Rows	Transform Time (µsec)
16	8	128	4	~0.25
16	16	256	4	~0.50
16	32	512	4	~1.0
16	64	1024	4	~2.0
16	128	2048	4	~4.0

Table 2. Example of estimated performance for scaling options obtained by varying N and N_c, keeping N_r the same.
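As a rough check on Table 2, the µsec column can be converted to a sample rate. The sketch below assumes that column is the time per transform, so the rate is N divided by that time; the values are the approximate ("~") figures copied from the table, and `samples_per_second` is a hypothetical helper.

```python
# (transform size N, time in microseconds) copied from Table 2.
TABLE2 = [(128, 0.25), (256, 0.50), (512, 1.0), (1024, 2.0), (2048, 4.0)]

def samples_per_second(n, time_us):
    """Complex samples per second, assuming time_us is the time per N-point transform."""
    return n / (time_us * 1e-6)

rates = [samples_per_second(n, t) for n, t in TABLE2]
for (n, t), rate in zip(TABLE2, rates):
    print(n, t, round(rate / 1e6), "M samples/s")
```

Under that assumption, every row works out to about the same sample rate: with N_r fixed, the time grows in proportion to N_c, so a fixed array processes samples at a constant rate regardless of transform size.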

(2) Setting throughputs by array parallelism

Here the approach is to duplicate the architecture n times, so that the input is divided into n streams of column DFTs, one going to each of the n array structures. In this way the throughputs listed above can be increased by a factor of n.
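The factor-of-n scaling can be sketched as follows. The round-robin assignment of column DFTs to arrays is an assumption for illustration (the text only states that the input is divided into n streams), and the timing model is ideal, ignoring any I/O or merge overhead; both function names are hypothetical.

```python
def assign_columns(n_c, n_arrays):
    """Round-robin assignment of column indices to arrays (assumed policy)."""
    return [list(range(a, n_c, n_arrays)) for a in range(n_arrays)]

def parallel_time_us(single_array_time_us, n_arrays):
    """Ideal transform time with n arrays, ignoring I/O and merge overhead."""
    return single_array_time_us / n_arrays

# Example: the N_r=16, N_c=64 case (about 2.0 µsec on one array) split over 4 arrays.
streams = assign_columns(64, 4)
print([len(s) for s in streams])
print(parallel_time_us(2.0, 4))
```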

Additionally, it is possible to use separate array structures for the row and column DFTs, which improves throughput by another factor of two to three. For example, a 1024-point FFT can be run in this way at speeds of up to 8G complex samples per second.