Leopard processor element for the MALT vector coprocessor has been designed


Photo: maltsystem.com


Development of the processor element for the Leopard vector accelerator has been completed. The architecture was chosen to give maximum flexibility from a programming perspective while delivering high performance and energy efficiency on target tasks. As a result, an architecture based on an ALU tree was selected.


An ALU tree is a tree whose nodes are elementary ALUs. Such a structure makes it possible to merge several simple operations into a single complex operation and thereby increase the number of operations per cycle. The tree has more inputs than a conventional ALU usually has. Some unary operations may be implemented directly at the inputs. Within the ALU tree, intermediate values pass directly from one node to the next, which reduces the number of register file accesses.
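The idea can be illustrated with a minimal Python sketch. It is not the Leopard design: the class names `Leaf` and `Node`, the 32-bit word width, and the particular operations are assumptions made only for illustration. The sketch evaluates one "complex operation" built from three elementary ALUs and counts how often the register file is read.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union

MASK = 0xFFFFFFFF  # assumed 32-bit data path (illustrative)


@dataclass
class Leaf:
    """A tree input: reads one register, optionally applying a unary operation."""
    reg: int
    unary: Optional[Callable[[int], int]] = None


@dataclass
class Node:
    """An elementary two-input ALU whose operands come from its subtrees."""
    op: Callable[[int, int], int]
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def evaluate(tree: Tree, regs: List[int], stats: dict) -> int:
    """Evaluate the tree; intermediate values never touch the register file."""
    if isinstance(tree, Leaf):
        stats["reg_reads"] += 1
        value = regs[tree.reg]
        if tree.unary is not None:
            value = tree.unary(value)  # unary operation applied at the input
        return value & MASK
    a = evaluate(tree.left, regs, stats)
    b = evaluate(tree.right, regs, stats)
    stats["alu_ops"] += 1
    return tree.op(a, b) & MASK


# (r0 + r1) XOR (NOT r2 AND r3), expressed as a single tree
tree = Node(
    lambda a, b: a ^ b,
    Node(lambda a, b: a + b, Leaf(0), Leaf(1)),
    Node(lambda a, b: a & b, Leaf(2, unary=lambda x: ~x), Leaf(3)),
)

regs = [5, 7, 0x0F, 0xFF]
stats = {"reg_reads": 0, "alu_ops": 0}
result = evaluate(tree, regs, stats)
print(result, stats)
```

In this toy model, three ALU operations and one unary operation complete with four register reads and a single result. Executing the same expression as three separate two-operand instructions would need six register reads and two intermediate write-backs.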


Popular target algorithms that make heavy use of table substitutions (lookups) alternate a moderate number of computing operations with data memory accesses. A tree with a small number of nodes, with one memory operation per several lightweight operations, therefore covers most program graphs effectively.
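The pattern described here, a few lightweight operations followed by one table access, might look like the following Python sketch. The table `SBOX`, the function `substitute_word`, and the key-mixing step are hypothetical and chosen only to show the shape of the computation; they do not come from any real cipher or from the Leopard toolchain.

```python
# Hypothetical 256-entry substitution table: an illustrative permutation,
# not a real cipher S-box.
SBOX = [(i * 7 + 3) % 256 for i in range(256)]


def substitute_word(word: int, key: int) -> int:
    """Substitute each byte of a 32-bit word through SBOX after mixing in a key."""
    out = 0
    for shift in (0, 8, 16, 24):
        # Lightweight operations (XOR, shift, mask): candidates for one ALU tree.
        index = ((word ^ key) >> shift) & 0xFF
        # One data memory access for this group of lightweight operations.
        out |= SBOX[index] << shift
    return out


print(hex(substitute_word(0x00000001, 0)))
```

Each loop iteration performs three simple operations to form the index and then a single memory read, which is the ratio a small tree is meant to cover.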


The number of operations executed per instruction by this version of the processor element is comparable to that of processors based on the VLIW architecture, at considerably lower power consumption. Along with the ALUs, each processor element includes a register file and local memory. The command memory holds 1024 96-bit words; the shared memory holds 4096 32-bit words. The processor elements are combined into arrays on the SIMD principle. An array control unit has a set of counter registers for loops, with a nesting depth of up to 8.
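A toy model of such loop control is sketched below in Python. Only the nesting limit of 8 is taken from the text; the `LoopControl` class, its method names, and the `run_nest` helper are assumptions for illustration, and the real control unit may work quite differently. The same loop body is applied to every element of a small list standing in for the SIMD array.

```python
MAX_LOOP_DEPTH = 8  # nesting depth stated in the text


class LoopControl:
    """Toy model of a set of loop counter registers (hypothetical interface)."""

    def __init__(self):
        self.counters = []  # remaining iterations, one entry per active loop level

    def enter(self, iterations: int) -> None:
        """Start a new loop level with the given iteration count."""
        if iterations < 1:
            raise ValueError("iteration count must be at least 1")
        if len(self.counters) >= MAX_LOOP_DEPTH:
            raise OverflowError("loop nesting deeper than 8 levels")
        self.counters.append(iterations)

    def end_of_body(self) -> bool:
        """Return True to repeat the innermost loop, False when it has finished."""
        self.counters[-1] -= 1
        if self.counters[-1] > 0:
            return True
        self.counters.pop()
        return False


def run_nest(ctrl: LoopControl, counts, body) -> None:
    """Run `body` inside a loop nest whose trip counts are given outermost first."""
    if not counts:
        body()
        return
    ctrl.enter(counts[0])
    while True:
        run_nest(ctrl, counts[1:], body)
        if not ctrl.end_of_body():
            break


# Four processor elements, each holding one value; all execute the same body.
pe_values = [0, 1, 2, 3]


def body() -> None:
    for i in range(len(pe_values)):
        pe_values[i] += 1


ctrl = LoopControl()
run_nest(ctrl, [2, 3], body)  # 2 x 3 = 6 executions of the body
print(pe_values)
```

Keeping the counters in the control unit rather than in each processor element means the loop bookkeeping is done once for the whole array, which fits the SIMD arrangement described above.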