Articles

Mikheev N., Antoniuk V., Elizarov S., Lukyanchenko G. Features of multi-core MALT processors in image processing // Computational methods and programming, Russia, 2020.
This article presents an experimental evaluation of the performance and energy efficiency of multi-core MALT processors in image processing, using image filtering with the Sobel operator as the example. The measurements were performed using the low-level MALTemu emulator, a processor prototype on an FPGA, and an experimental VLSI model, MALT-Cv2 Rev1. The results are compared with those for general-purpose processors and CUDA-capable GPUs.
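For readers unfamiliar with the benchmark, the following is a minimal sketch of Sobel filtering in plain Python. It is not the MALT implementation from the article; the function name `sobel_magnitude` and the list-of-lists image layout are assumptions made for illustration only.

```python
import math


def sobel_magnitude(image):
    """Return the Sobel gradient magnitude of a 2D grayscale image.

    `image` is a list of equal-length rows of numbers (an assumed layout).
    Border pixels are left at 0.0 because the 3x3 window does not fit there.
    """
    # Standard 3x3 Sobel kernels for horizontal and vertical gradients.
    gx_kernel = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    gy_kernel = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

    height, width = len(image), len(image[0])
    result = [[0.0] * width for _ in range(height)]

    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = 0
            gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += gx_kernel[dy][dx] * pixel
                    gy += gy_kernel[dy][dx] * pixel
            result[y][x] = math.sqrt(gx * gx + gy * gy)
    return result


if __name__ == "__main__":
    # A small image with a vertical edge between columns 1 and 2.
    edge = [[0, 0, 10, 10] for _ in range(4)]
    for row in sobel_magnitude(edge):
        print(row)
```

Each output pixel depends only on its own 3x3 neighbourhood, so the two outer loops can be split across cores without synchronization, which is why the filter is a common benchmark for many-core processors.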

Brent Bohnenstiehl, Aaron Stillmaker, Jon Pimentel, Timothy Andreas, Bin Liu, Anh Tran, Emmanuel Adeagbo, Bevan Baas, A 5.8 pJ/Op 115 Billion Ops/sec, to 1.78 Trillion Ops/sec 32nm 1000-Processor Array, University of California, Davis, 2016.
1000 programmable processors and 12 independent memory modules capable of simultaneously servicing both data and instruction requests are integrated onto a 32nm PD-SOI CMOS device. At 1.1V, processors operate up to an average of 1.78GHz yielding a maximum total chip computation rate of 1.78 trillion instructions/sec. At 0.84V, 1000 cores execute 1 trillion instructions/sec while dissipating 13.1W.

E. Painkras et al., SpiNNaker: A Multi-Core System-on-Chip for Massively-Parallel Neural Net Simulation, The University of Manchester, United Kingdom, 2012.
This article discusses SpiNNaker chips. SpiNNaker is a massively-parallel computer system designed to model up to a billion spiking neurons in real time. The basic block of the machine is the SpiNNaker multicore System-on-Chip, a Globally Asynchronous Locally Synchronous (GALS) system with 18 ARM968 processor nodes residing in synchronous islands, surrounded by a light-weight, packet-switched asynchronous communications infrastructure.

T. Hruby et al., Keep net working - on a dependable and fast networking stack, Boston, MA, USA, 2012.
This article discusses in general the implications for the design of multiserver systems and covers in detail the implementation and evaluation of a more dependable networking stack. The authors split the single stack into multiple servers that run on dedicated cores and communicate without kernel involvement. They argue that the performance problems should be reconsidered: it is possible to make multiserver systems fast on multicores.

Dongrui Fan, Nan Yuan, Junchao Zhang, et al., Godson-T: An Efficient Many-Core Architecture for Parallel Program Executions. Journal of Computer Science and Technology, Nov. 2009, 24(6):1061-1073.
In this article the authors propose a many-core architecture, Godson-T, to address the challenge of efficient parallel program execution. On the one hand, Godson-T features a region-based cache coherence protocol, asynchronous data transfer agents, and hardware-supported synchronization mechanisms, which together aim at highly efficient use of on-chip resources. On the other hand, Godson-T features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable. Experimental evaluations are conducted on a cycle-accurate simulator of Godson-T.

P.M. Kogge et al., Computer Systems with Lightweight Multithreaded Architectures, U.S. Patent 7,584,332.
The patent describes a method for processing, by one of a memory controller or a memory interface within a lightweight multi-threaded architecture (LIMA) computing system, a request to access a memory location by a program thread being executed by a processor within the LIMA computing system and evaluating, by the one of a memory controller or a memory interface, an extension field of the memory location to determine a state of a value field of the memory location. The method allows programs to actually achieve increases in true concurrency and reductions in latencies. These techniques are ideally suited for implementation using modern VLSI chip technologies such as multi-core chips and PIM chips.

V.Vlassov and C.A.Moritz, Efficient Fine Grained Synchronization Support Using Full/Empty Tagged Shared Memory and Cache Coherency, Department of Teleinformatics, Royal Institute of Technology, Stockholm, Sweden, 2000.
In this report the authors propose a new, efficient way to support fine-grained synchronization mechanisms on multiprocessors. They propose a full/empty tagged memory hierarchy with aggressive hardware support for fine-grained synchronization. The objective is to improve the performance of the full/empty synchronization mechanism, as implemented in the MIT Alewife machine, by integrating a cache coherency mechanism with the full/empty synchronization. To achieve this, synchronization faults are handled in a similar way to cache misses in a lockup-free cache. The design assumes that a full/empty memory operation suspends on a synchronization miss (by analogy to a cache miss), waiting in the memory while the miss is resolved.
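The semantics of a full/empty tagged memory word can be illustrated with a small software analogy. The sketch below is not the hardware mechanism from the report: the class `FullEmptyCell`, its method names, and `run_demo` are hypothetical, and a condition variable stands in for the hardware suspending an operation on a synchronization miss.

```python
import threading


class FullEmptyCell:
    """Software analogy of one memory word with a full/empty tag bit."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False   # the tag bit: False means "empty"
        self._value = None

    def write_when_empty(self, value):
        """Block until the cell is empty, then store the value and mark it full."""
        with self._cond:
            while self._full:
                self._cond.wait()   # analogous to suspending on a synchronization miss
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_and_empty(self):
        """Block until the cell is full, then return the value and mark it empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            value = self._value
            self._value = None
            self._full = False
            self._cond.notify_all()
            return value


def run_demo(count=5):
    """Pass `count` integers from a producer thread to the caller through one cell."""
    cell = FullEmptyCell()

    def producer():
        for i in range(count):
            cell.write_when_empty(i)

    worker = threading.Thread(target=producer)
    worker.start()
    received = [cell.read_and_empty() for _ in range(count)]
    worker.join()
    return received


if __name__ == "__main__":
    print(run_demo())
```

Because every write waits for an empty cell and every read waits for a full one, producer and consumer alternate strictly on a per-word basis without a separate lock or barrier, which is the fine-grained behaviour the tag bit is meant to provide.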

Many-Core Fabricated Chips Information Page.
This page contains a comprehensive listing of key attributes of fabricated programmable many-core chips, such as the number of cores, clock rate, power, and chip area.