Automatic Generation of Systolic Array Designs For Reconfigurable Computing

Please download to get full document.

View again

of 27
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Automatic Generation of Systolic Array Designs For Reconfigurable Computing. Greg Nash Engineering of Reconfigurable Systems and Algorithms (ERSA '02) International Multiconference in Computer Science Las Vegas, Nevada, June 24, 2002. FPGA-Based Systolic Computing: Rationale.
Automatic Generation of Systolic Array Designs ForReconfigurableComputingGreg NashEngineering of Reconfigurable Systems and Algorithms (ERSA '02)International Multiconference in Computer ScienceLas Vegas, Nevada, June 24, 2002FPGA-Based Systolic Computing: Rationale
  • A large number of parallel “systolic” algorithms exist that are well suited for use in real-time applications such as signal/image processing, communications, and adaptive processing
  • Reconfigurable FPGA hardware is the most cost-effective implementation strategy when “programmable” systolic computations are desired
  • But designing systolic arrays is a complex process for which there are no systematic designmethodologies or commercial systolic CAD tools
  • Centar’s Symbolic Parallel Algorithm Development Environment (SPADE) allows a designer to automatically explore the design space of different systolic array implementations so that system level tradeoffs can be efficiently analyzed
  • Parallel Processing with Systolic Arrays
  • Algorithms
  • – linear algebra – graph theory – computational geometry – string matching – sorting/searching – dynamic programming – discreet mathematics – number-theoretic algorithms
  • Applications (real-time/embedded processing)
  • – communications – seismic analysis – signal/image processing – adaptive processing – arithmetic arrays
  • Architecture
  • – simple processing elements – local interconnects – synchronous – fine-grained – pipelined – small local memory – local control – regular arrays
  • Hardware
  • – FPGA/PLD chips – programmable connections – reconfigurable boards – ASICs Systolic Array: Matrix MultiplySpace-Time MappingSystolic ArrayProject alongtime axisCAD Tool Goals
  • Create parallel array designs with reduced requirements for mapping expertise/heuristics
  • Tool inputs derived directly from mathematical solutions
  • Heuristics applied only from within a high level language
  • User control of architectural features by addition of constraints, e.g., boundary vs internal I/O
  • Strong emphasis on visual feedback
  • Filtering of outptut; find delay-optimal solutions with
  • Minimum area
  • Maximum regularity
  • Minimum network bandwidth
  • Ease of use
  • Built-in simulator with practical architectural model
  • VHDL (prototype) translator to facilitate FPGA implementations
  • High Level Tool Descriptioni,kS,TSimulator,GraphicalOutputsMathematicalAlgorithmInputCodeTransformationSearchfor i to N do for j to N do if j=1 and i>=1 and i<=N then l[i,j]:=a[i,j]; elif i=1 and j>1 and j<=N then u[i,j]:=a[i,j]/l[i,i]; fi; if i>=j and j>1 and i<=N then l[i,j]:=a[i,j]-add(l[i,k]*\ u[k,j],k=1..j-1) fi; if j>i and i>1 and j<=N then u[i,j]:=(a[i,j]-add(l[i,k]*\ u[k,j],k=1..i-1))/l[i,i] fi; odod;S=spatial coordinatesT=temporal coordinatesM=transformation solutionSystolic Array: CAD Design Issues
  • Algorithm representation
  • Scheduling
  • Reindexing
  • Localization
  • Allocation
  • Constraint introduction
  • Finding optimal solutions
  • Automatic operation
  • Efficient hardware
  • Symbolic Parallel Algorithm Development Environment (SPADE)
  • Based on DESCARTES (Baltus and Allen, MIT)
  • Finds optimal minimum latency solutions by “exhaustive” search of all possible solutions
  • Schedules vector elements have limited range
  • Spatial relationships between variables described by unimodular matrices
  • Architectural constraints introduced at abstract level to reduce search space
  • Uses more familiar input language (Maple™)
  • Symbolic computations in Maple
  • Structured computations in Fortran
  • Different search formalism
  • Added constraints
  • Built in simulator
  • Algorithm Domain
  • Multiple statements of the general form
  • Where Ax,By/ax,by are integer matrices/vectors, S is the dimension of the algorithm space and the dependencies include commutative and associative operators: min, max, , 
  • Many “systolic” and fined grained parallel algorithm examples fall in this category
  • Algorithms and solutions based on use of affine transforms
  • Affine transforms are most general that preserve regularity
  • Above form can always be systematically transformed to regular arrays with local interconnections
  • Array Design Solution Strategy
  • Minimum execution time (primary objective function)
  • Minimum area/regularity/bandwidth: secondary objective function
  • Search based approach: for each algorithm variable explore
  • All schedules
  • All allocations
  • All variable-variable indexing relationships
  • Solution Search
  • Find affine transformations, T, that map algorithm indices to space-time indices
  • If find
  • T consists of a time component / (vector/scalar) and spatial component S/s (matrix/vector):
  • SPADE treats , , S and s as “search variables”
  • Matrix S is considered as a single variable and is based on unimodular matrices to force dense mappings and limit the number of possible values
  • The elements of vectors ,s are considered separately as search values
  • Search variables only have a small number of possible values
  • Search contains breadth-first and depth-first components
  • Solution Constraints
  • Space of possible solutions grows exponentially with number of search variables and possible values
  • Impose abstract architectural constraints to make search process feasible, e.g.,
  • Causality between dependences, e.g., if variable c depends on a, require
  • I/O at boundary for variable a imposes “time align” constraint
  • Where na is the normal to the plane of a and ut is a unit time vector
  • Simulator
  • Architectural model intended to facilitate transition to hardware
  • Assumes local FSM’s control data movement and arithmetic
  • Moderate amount of local state data
  • Data transfers
  • Eight nearest neighbor paths
  • Simultaneous transfers of data
  • I/O through edge stacks or data pre-loaded in array
  • Basic array activity sequence
  • Data transfer
  • Intermediate calculations (running sums)
  • Result calculations
  • Visual instrumentation of user selected data and data movement
  • 2-D Mapping Example
  • Algorithm (includes fan-in, fan-out, and single I/O variable)
  • Mathematical expression:
  • Maple code:
  • Converted by parser to
  • Mathematical equivalent:
  • Solutions (N=6)
  • - (a) toptimal = 3N - 4; I/O at array edge- (b) toptimal = 2N - 2; No I/O restrictions for i from 2 to N do a[i]=add(a[j]+b[i],j=1..i-1);od;TimeTimeaqaqbbSpaceSpace(a) Single optimal solution(b) 1 of 2 optimal solutionsFaddeev Algorithm
  • Problem
  • Solution
  • Place matrices in array
  • Add linear combination of rows of A to C
  • Chose WA=C
  • Solution “appears” in lower left hand quadrant
  • Design Example: Linear AlgebraMathematical SolutionInput Codefor j to 2*N do for i to 2*N do if i=1 and j>=1 and j<=N then u[i,j] := a[i,j] fi; if i=1 and j>=1 and j<=N then b[i,j] := a[i,j+N] fi; if j>=i and i>1 and j<=N then u[i,j]:=a[i,j]-add(l[i,k]*u[k,j],k=1..i-1) fi; if j>=1 and i>=1 and j<=N and i<=N then b[i,j]:=a[i,j+N]-add(l[i,k]*b[k,j],k=1..i-1) fi; if j>=1 and i>=1 and j<=N and i<=N then d[i,j] := a[i+N,j+N]-add(l[i+N,k]*b[k,j],k=1..N) fi; if j=1 and i>=1 and i<=2*N then l[i,j]:=a[i,j]/u[j,j]; fi; if i>=j and j>1 and i<=2*N and j<=N then l[i,j]:=(a[i,j]-add(l[i,k]*u[k,j],k=1..j-1))/u[j,j] fi; odod;With substitutionsDesign Example SolutionsMathematicalTransformationsComments
  • Two optimal solutions found
  • Mimimum area, secondary objective function
  • Latency 5N-2
  • Area: ½ N(3N+1)
  • linear array of dividers required
  • Space-time View, N=6(One of Two Solutions)Systolic ArraySingle Divider Solutions
  • Replace variable 1/u[i,j] with u_inv[j]:
  • Constrain u_inv[j] to be time aligned (projects to a point)
  • ...if j<=N and j>=1 then u_inv[j]:=1/u[j,j] fi;if j=1 and i>=1 and i<=2*N then l[i,j]:=a[i,j]*u_inv[j] fi;if i>=j and j>1 and i<=2*N and j<=N then l[i,j]:=(a[i,j]-add(l[i,k]*u[k,j],k=1..j-1))*u_inv[j] fi;...u_invSystolic Arraylud
  • 5N-2 latency
  • Single latency optimal, minimum area solution
  • 4N2 area
  • abOther Code Input Formfor j to 2*N do for i to 2*N do if i=1 and j>=1 and j<=N then u[i,j] := A[i,j] fi; if j>=i and i>1 and j<=N then u[i,j]:=A[i,j]-add(l[i,k]*u[k,j],k=1..i-1) fi; if j>=1 and i>1 and j<=N and i<=N then b[i,j]:=b[i,j]-add(l[i,k]*b[k,j],k=1..i-1) fi; if j>=1 and i>=1 and j<=N and i<=N then d[i,j] := d[i,j]-add(l[i+N,k]*b[k,j],k=1..N) fi; if j=1 and i>=1 and i<=N then l[i,j]:=A[i,j]/u[j,j] fi; if i>=j and j>1 and i<=N and j<=N then l[i,j]:=(A[i,j]-add(l[i,k]*u[k,j],k=1..j-1))/u[j,j] fi; if j=1 and i>N and i<=2*N then l[i,j]:=C[i-N,j]/u[j,j] fi; if i>N and j>1 and i<=2*N and j<=N then l[i,j]:=(C[i-N,j]-add(l[i,k]*u[k,j],k=1..j-1))/u[j,j] fi; odod;With substitutions:and b/d initialized to B/DMaximum Regularity Designs (1)
  • Desire simple interconnection network topology
  • Avoid time aligned variables (introduces O(N) memory per PE)
  • Preference for “close” dependency relations between variables
  • Maximum Regularity Design (2)
  • One optimal solution found
  • Mimimum area, secondary objective function
  • Latency 5N-2
  • Area: 4N2
  • 2-D array of dividers required
  • 25% fewer data flow paths
  • Load/unload cycle required for data
  • Space-Time ViewSystolic ArrayDesign Example: 1D DFTMathematical AlgorithmSPADE OutputsBase-4 TransformationDirect Path from Mathematical InputTo Array DesignSPADE Inputfor j to N/4 do for k to N/4 do Y[j,k] := WM[j,k]*add(CM1[j,i]*X[i,k],i=1..b); od; for k to R do Z[k,j] := add(CM2[k,i]*Y[j,i],i=1..4*b); odod; Patent PendingAltera Stratix FPGA: DFT MappingAffine Recurrence Equation Types
  • Uniform: x(I) depends on y(I-D)
  • D is integer vector
  • Example (matrix-matrix multiplication)
  • Non-uniform: x(AI+a) depends on y(BI+b)
  • A/B are integer matrices; a/b are vectors
  • Example (matrix-matrix multiplication)
  • Lyapunov Matrix Equation
  • Start with abstract problem (e.g., Lyapunov algorithm, solve for X)
  • Convert to mathematical expression
  • Non-uniform recurrenc equation in Maple language
  • for i to N do for j to N do x[i,j] := (c[i,j]-add(a[i,k]*x[k,j],k=1..i-1)- add(b[l,j]*x[i,l],l=1..j-1))/(a[i,i]+b[j,j]); od;od;Lyapunov: Uniform Algorithm*
  • Difficult to represent algorithms in this form and no systematic method for doing this
  • All variables must be heuristically embedded in space-time
  • “Extra” variables created
  • All variables must exist at all points in the index space
  • *(V. Rowchowdhury, PhD thesis, Stanford 1989)More Information
  • “Constraint Directed CAD Tool For Automatic Latency-Optimal Implementation of 1-D and 2-D Fourier Transforms” (SPIE ITCom 2002, Boston, MA, July 29- August 2, 2002.)
  • Use of constraints to define array designs
  • 2-D FFT example
  • Hardware Efficient Base-4 Systolic Architecture for Computing the Discrete Fourier Transform (2002 IEEE Workshop on Signal Processing Systems, San Diego CA, October 16-18)
  • Details of 1D DFT design
  • Mapping to FPGAs
  • (above papers and extended viewgraphs)
  • We Need Your Support
    Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

    Thanks to everyone for your continued support.

    No, Thanks

    We need your sign to support Project to invent "SMART AND CONTROLLABLE REFLECTIVE BALLOONS" to cover the Sun and Save Our Earth.

    More details...

    Sign Now!

    We are very appreciated for your Prompt Action!