Tools: Summary

Centar uses a tool called the “symbolic parallel algorithm development environment” (SPADE) to produce optimally fast circuits for a given signal processing algorithm.  To obtain high performance, parallel algorithms are required, and SPADE can find the best parallel circuit implementations starting from high-level, formal algorithm descriptions.  These can be used directly to create FPGA or ASIC hardware implementations, or to generate parallel code in an appropriate high-level language.

The algorithms addressed by SPADE are those that can be described by affine iterative statements of the form

y(AI + a) \rightarrow x(BI + b)

where → denotes a general dependency between two algorithm symbols y and x; A and B are integer matrices independent of the algorithm index vector I, and a and b are constant integer vectors.  Here “→” includes commutative and associative operations such as Σ, Π, max, and min.  An example is the simple convolution

y(i)=\sum_{k=1}^{N}h(k)\,x(i-k),\qquad 1\leq i\leq 2N-1
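To make the affine form concrete, here is a plain Python sketch of this convolution.  The zero-padding convention for x outside its defined range is an assumption for illustration (the summary does not fix x's support); the A and B matrices noted in the comments follow directly from matching y(i) against x(i−k).

```python
def conv(h, x):
    """y(i) = sum_{k=1}^{N} h(k) * x(i-k),  1 <= i <= 2N-1.

    In the affine form y(AI+a) -> x(BI+b):  I = (i, k),
    A = (1 0) picks out i, B = (1 -1) yields i - k, a = b = 0.
    h and x are 0-based Python lists; x(j) is read as x[j] for
    0 <= j < len(x) and as 0 otherwise (an assumed padding
    convention, since the summary leaves x's support open).
    """
    N = len(h)
    y = []
    for i in range(1, 2 * N):
        total = 0
        for k in range(1, N + 1):
            j = i - k
            if 0 <= j < len(x):
                total += h[k - 1] * x[j]
        y.append(total)
    return y
```

With h = [1, 2] and x = [3, 4] this returns [3, 10, 8], the coefficients of the polynomial product (1 + 2z)(3 + 4z).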

Such structured algorithms are found in matrix manipulation, linear algebra (least squares, linear systems, factorization, decomposition), digital filtering (convolution, correlation, autoregressive/moving-average, Kalman, IIR), sorting, discrete mathematics, and many other areas of applied mathematics.  Algorithms like these are used in a variety of digital signal and image processing applications.

The parallel hardware abstractions SPADE generates reflect the underlying mathematical structure of affine relationships and are thus structurally simple: regular, pipelined, locally connected, and synchronous.  Consequently, SPADE is especially suited to hardware applications where constraints on size, weight, power, or volume make non-optimal solutions extremely costly, and to software applications where non-localized solutions on highly parallel computers can cause serious memory access conflicts during computation.

The input language is a subset of the Maple™ programming language, so that algorithms can be tested in a symbolic environment.  For example, the linear convolution above might be represented by the code:

h := vector(N);
x := vector(3*N);    # padded so the shifted index below stays in range
y := vector(2*N-1);
for i to 2*N-1 do
   y[i] := add(h[k]*x[i-k+N+1], k=1..N);
end do;
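The loop sizes x at 3N and shifts the index by N+1 so that x[i-k+N+1] never falls outside the array.  A Python rendering of the same buffering scheme (the placement of x's actual samples at offset N+1 is an assumed reading of the summary) reproduces the direct formula's result:

```python
def conv_padded(h, x):
    # Mirrors the Maple loop: embed x in a zero buffer of length 3N,
    # so the shifted index i - k + N + 1 always lands inside it
    # (it ranges over [2, 3N-1] for the given loop bounds).
    # Sample x(j), j = 0..N-1, is stored at buffer slot j + N + 1;
    # this placement is an assumed reading of the summary's buffering.
    N = len(h)
    buf = [0] * (3 * N + 1)          # 1-based like the Maple vector; buf[0] unused
    for j, v in enumerate(x):
        buf[j + N + 1] = v
    y = []
    for i in range(1, 2 * N):
        y.append(sum(h[k - 1] * buf[i - k + N + 1] for k in range(1, N + 1)))
    return y
```

The padding removes the boundary tests that a buffer of length N would require.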

Different ways of expressing the algorithm could lead to different parallel algorithm implementations.

SPADE’s outputs are various forms of “space-time” representations.  That is, the algorithm variable indices are “mapped,” or transformed, so that one corresponds to a time dimension and the others correspond to spatial dimensions.  The transformed space-time indices thus describe when, and where in space, computations occur.  The basic outputs include:

Mathematical:  matrices specifying the transformations from algorithm indices to space-time indices.

Graphical:  2-D or 3-D views of algorithm variable positions in space-time, along with arrows showing data propagation directions.
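As an illustration of the space-time idea, the convolution's index points (i, k) can be transformed by t = i + k, p = k.  This particular mapping is a classic systolic choice used here only for illustration, not necessarily the one SPADE would emit:

```python
def space_time_map(N):
    # Transform each computation point (i, k) of the convolution,
    # 1 <= i <= 2N-1, 1 <= k <= N, into space-time coordinates via
    #   [t]   [1 1] [i]
    #   [p] = [0 1] [k]
    # i.e. time step t = i + k, processor p = k.  A textbook systolic
    # mapping, shown only to make the "mathematical output" concrete.
    T = [[1, 1], [0, 1]]
    return {
        (i, k): (T[0][0] * i + T[0][1] * k, T[1][0] * i + T[1][1] * k)
        for i in range(1, 2 * N)
        for k in range(1, N + 1)
    }
```

Because this map is injective, no two computations collide at the same (t, p); N processors are active, and the latency is max t − min t + 1 = 3N − 2 time steps.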

Past difficulty in obtaining solutions for parallel algorithm implementations has been due to the many complex, interrelated steps involved and the lack of systematic approaches to them.  The basic steps required to map an algorithm from the index space of its representation to a set of coordinates in time and space are:

1.   Algorithm representation:   How a user describes a problem in formal terms.   There are various tradeoffs involved here; for example,

  • one can place restrictions on the input language that make the parallel mapping automation task easier, but that make it much more difficult to express the user’s problem and also result in non-portable code;
  • or one can use very flexible, portable languages such as C and Fortran, at the expense of unnecessary “serialization” of the problem.

2.   Scheduling:   Finding an execution sequence for the pieces of an overall computation (e.g., making sure that all argument variables in a computation are available at the right point in space at the correct time).

3.   Reindexing:   Altering the indexing relationships between different variables to best accommodate parallel computations. 

4.   Localization:   The process of decomposing “broadcast” and “fan-in” operations into sequences of local data movement.

5.   Allocation:   The process of allotting computations to abstract processors at spatial coordinates.

The steps above are not independent, because a choice made at any step can affect those at other steps.  For example, the allocation and scheduling steps would be in conflict if two computations with the same scheduled execution time were allocated to the same point in space.
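Such a conflict is easy to check mechanically.  A small sketch, using the illustrative systolic schedule t = i + k for the convolution (an assumption for this example, not a SPADE output): count how many computations land on the same processor at the same time step for a given allocation.

```python
from collections import Counter

def conflicts(N, alloc):
    # Under the schedule t = i + k, count the excess computations that
    # share a (time, processor) slot for a given allocation function
    # alloc(i, k) -> processor.  A conflict-free mapping returns 0.
    slots = Counter()
    for i in range(1, 2 * N):
        for k in range(1, N + 1):
            slots[(i + k, alloc(i, k))] += 1
    return sum(c - 1 for c in slots.values() if c > 1)
```

Allocating p = k is conflict-free, while forcing every computation onto a single processor (p = 1) conflicts wherever two index points share a time step.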

SPADE is capable of considering all the requirements above simultaneously, in such a way that it can find the solution(s) with the smallest overall computation latency.  It can also use an estimate of spatial (area) extent, array regularity, or bandwidth as a secondary objective function in finding optimal solutions.
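A toy version of this latency optimization: exhaustively search affine schedules t = λ1·i + λ2·k that respect a set of dependence vectors, and keep the one with the smallest latency.  The dependence vectors below are a plausible localized form of the convolution (accumulation along k, reuse of h, movement of x); they are assumptions for this sketch, not taken from the summary, and SPADE's actual search is far more general.

```python
from itertools import product

def best_schedule(N, deps=((0, 1), (1, 0), (1, 1))):
    # Search small integer schedule vectors (l1, l2); a schedule is
    # valid if every dependence d satisfies l . d >= 1 (each consumer
    # runs strictly after its producer).  Latency is the spread of
    # time steps over the index domain 1 <= i <= 2N-1, 1 <= k <= N.
    dom = [(i, k) for i in range(1, 2 * N) for k in range(1, N + 1)]
    best = None
    for l1, l2 in product(range(-3, 4), repeat=2):
        if all(l1 * dx + l2 * dy >= 1 for dx, dy in deps):
            ts = [l1 * i + l2 * k for i, k in dom]
            latency = max(ts) - min(ts) + 1
            if best is None or latency < best[0]:
                best = (latency, (l1, l2))
    return best
```

For these dependences the search recovers the schedule t = i + k, with latency 3N − 2.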

SPADE is unique in that no heuristics are required, so algorithm/application familiarity is not critical.  (Usually the designer must identify special characteristics of the algorithm that can be exploited in order to arrive at good mappings.)   Also, very little parallel mapping expertise is needed, an important consideration because parallel algorithm mapping techniques can be very complicated, and a considerable learning curve is usually required (on an algorithm-by-algorithm basis) to produce reasonable results.