A parameterizable SIMD stream processor

Title: A parameterizable SIMD stream processor
Publication Type: Conference Paper
Year of Publication: 2005
Authors: Munshi, A., W. Bishop, A. Wong, S. Braganza, A. Clinton, and M. McCool
Conference Name: 18th Canadian Conference on Electrical and Computer Engineering
Keywords: 2D convolution, Altera FPGA, clock cycle, Computer Graphics, configurable FPU, data encryption, data processing, execution unit array, floating point arithmetic, hardware description languages, IEEE single-precision floating point data, instruction controller, matrix multiplication, media processing, memory interface, memory system, on-chip bandwidth, parallel processing, polygon rendering, resource utilization, routing network, scientific computing, security, SIMD, stream processing, VHDL

Stream processing is a data processing paradigm in which long sequences of homogeneous data records are passed through one or more computational kernels to produce sequences of processed output data. Applications that fit this model include polygon rendering (computer graphics), matrix multiplication (scientific computing), 2D convolution (media processing), and data encryption (security). Computers that exploit stream computations process data faster than conventional microcomputers because they use a memory system and an execution model that increase on-chip bandwidth and deliver high throughput. We have designed a general-purpose, parameterizable SIMD stream processor that operates on IEEE single-precision floating point data. The system is implemented in VHDL and consists of a configurable FPU, an execution unit array, and a memory interface. The FPU supports pipelined operations for multiplication, addition, division, and square root. The data width is configurable. The execution array operates in lock-step with an instruction controller, which issues 32-bit instructions to the execution array. To exploit stream parallelism, the number of execution units and the number of interleaved threads are specified as parameters at compilation time. The memory system allows all execution units to access one element of data from memory in every clock cycle. All memory accesses also pass through a routing network to support conditional reads and writes of stream data. Functional and timing simulations have been performed using a variety of benchmark programs. The system has also been synthesized into an Altera FPGA to verify resource utilization.
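
The execution model described in the abstract can be illustrated with a small software sketch. It is not the authors' VHDL design: the function name `run_kernel_simd`, the `num_units` parameter, and the optional `predicate` are hypothetical names chosen for illustration. The sketch only assumes what the abstract states, namely that a kernel is applied to a stream of homogeneous records, that a configurable number of execution units apply the same operation in lock-step, and that writes of stream data can be conditional. It does not model the FPU pipeline, thread interleaving, the 32-bit instruction format, or the routing network.

```python
from typing import Callable, List, Optional, Sequence


def run_kernel_simd(
    stream: Sequence[float],
    kernel: Callable[[float], float],
    num_units: int,
    predicate: Optional[Callable[[float], bool]] = None,
) -> List[float]:
    """Apply `kernel` to every record of `stream`, `num_units` records per step.

    Hypothetical software model: each loop iteration stands in for one
    lock-step instruction issue across all execution units.
    """
    if num_units < 1:
        raise ValueError("num_units must be at least 1")

    output: List[float] = []
    for base in range(0, len(stream), num_units):
        # Each execution unit reads one element in this step.
        lanes = stream[base:base + num_units]
        # All units execute the same kernel on their own element.
        results = [kernel(x) for x in lanes]
        # Conditional write: a lane's result is kept only if its input
        # satisfies the predicate (all lanes write when no predicate is given).
        for x, y in zip(lanes, results):
            if predicate is None or predicate(x):
                output.append(y)
    return output


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(run_kernel_simd(data, lambda x: x * x + 1.0, num_units=4))
```

In this model the number of execution units changes only how many records are handled per step, not the result, which mirrors the idea of fixing the degree of parallelism as a parameter while leaving the kernel unchanged.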