The first MALT prototype on FPGA with specialized accelerators has been developed


FPGA Xilinx Virtex 2000T resource allocation for a processor with 49 general-purpose cores and 490 specialized computers (10 specialized computers per core)


Yellow - special processors
Green - daughter cores
Red - a SPB bus
Blue - memory controller, memory controller buffers, bus arbiter
Purple - core master
Brown - super master core
Light blue - a PCIe controller


Performance and energy efficiency are achievable via specialization and easy programming - via universal constructions. We’ve tried to unite these approaches and, as a result, we’ve developed a prototype of MALT multicore processor on FPGA with specialized accelerators.


A universal component has been reduced from a 210-core processor prototype on FPGA, specialized computers have been developed from scratch for pipeline calculations of popular crypto conversions (conversion per cycle) with key selection, which correspond to the given requirements, including cryptomining’s popular tasks. Address space of specialized computers has displayed into the address space of relevant general-purpose cores. All interaction between a computing core and other devices beyond the computing core has been implemented via messages sent over a serial bus.


In such system several specialized computers may be connected to a single universal core. The number of specialized computers may vary in quite wide range. In our tests we’ve used configurations with 4, 5, 10 and 25 accelerators per single general-purpose core. The number of general-purpose cores had been changing from 94 to 15 accordingly. The optimal value of the number of general-purpose cores and the number of specialized computers depended on data exchange intensity and computing complexity of every particular task. On the one hand, the obtained system provided flexibility of a universal multi-threaded processor, on the other hand, the system provided performance of RTL code written for a specific task.


The first serious testing of the prototype has been completed recently. The performance in the key selection tasks using different threads of selected keys has been studied during the tests. The obtained values of selection speed are up to 61.25 GKeys/s, that’s 99,86% of theoretical maximum. The optimal balance between the number of universal cores and the number of specialized computers on various tasks has been estimated and experimentally proven. Thus, almost 100% scalability of the MALT approach has been demonstrated.