The first-generation 96-core processor project has been sent for MPW manufacturing


The first-generation 96-core processor project has been submitted for manufacturing on a multi-project wafer (MPW) run. The chips will be fabricated at the TSMC semiconductor foundry (Taiwan) using the 28 nm TSMC HPC+ (high-performance computing) process. MALT-Cv1 is the first MALT-C chip to be manufactured in silicon. The processor contains 9 general-purpose RISC cores and 96 specialized processor elements organized into 3 SIMD clusters of 32 elements each.



The design of the MALT-Cv2 processor has been started


The design of the second-generation processor has been started. The first version of the processor element, whose architecture is based on an improved Leopard architecture, has been developed. Testing and energy-consumption measurements on target algorithms, with timing delays taken into account, have been completed in Cadence CAD tools at a frequency of 1 GHz for the TSMC 28 nm HPC+ manufacturing process.



Google Reveals Technical Specs and Business Rationale for TPU Processor

Although Google’s Tensor Processing Unit (TPU) has been powering the company’s vast empire of deep learning products since 2015, very little was known about the custom-built processor. This week the web giant published a description of the chip and explained why it’s an order of magnitude faster and more energy-efficient than the CPUs and GPUs it replaces.



Applied Micro Claims Third-Generation ARM Chip Ready to Take on Intel Xeon

Applied Micro announced it is sampling X-Gene 3, its third-generation ARM SoC for servers. According to a report by The Linley Group, the new platform will provide comparable performance to the latest Intel Xeon processors, but at a significantly lower price point.

X-Gene 3's respectable performance profile is a result of its relatively high clock speed and memory bandwidth. The CPU runs its 32 cores at a base frequency of 3.0 GHz, and can achieve 3.3 GHz under turbo mode. To feed those cores with data, the chip includes eight memory channels, which can serve DDR4 devices at up to 2667 MHz, yielding 170 GB/sec of aggregate bandwidth. The SoC also includes 42 lanes of PCIe 3.0 links for external connectivity.
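The 170 GB/sec figure can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes that the quoted 2667 MHz is the DDR4 transfer rate (2667 million transfers per second per channel) and that each channel has the standard 64-bit (8-byte) DDR4 data bus. Neither assumption is stated in the report, and the result is a theoretical peak, not a measured value.

```python
# Peak theoretical memory bandwidth for the configuration quoted above.
CHANNELS = 8                 # memory channels (from the article)
TRANSFERS_PER_SEC = 2667e6   # assumption: "2667 MHz" means 2667 mega-transfers/s
BYTES_PER_TRANSFER = 8       # assumption: standard 64-bit DDR4 data bus per channel


def peak_bandwidth_gb_per_s(channels, transfers_per_sec, bytes_per_transfer):
    """Return peak bandwidth in GB/s, where 1 GB = 10**9 bytes."""
    return channels * transfers_per_sec * bytes_per_transfer / 1e9


bandwidth = peak_bandwidth_gb_per_s(CHANNELS, TRANSFERS_PER_SEC, BYTES_PER_TRANSFER)
print(f"{bandwidth:.1f} GB/s")  # prints "170.7 GB/s"
```

Under these assumptions the product is about 170.7 GB/s, which matches the rounded figure in the report.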

The report states that the X-Gene 3 can handle “a broad range of cloud workloads, including scale-up and scale-out applications.” It should be particularly adept at so-called big data applications like in-memory database processing, thanks to its superior memory bandwidth. Coincidentally (or perhaps not), AMD is touting its upcoming “Naples” x86 chip for very similar memory bandwidth capabilities, based on the same 8-channel per socket design.


The front-end design of the 96-core MALT-Cv1 processor has been completed


The front-end design of the 96-core MALT-Cv1 processor has been completed for VLSI manufacturing on the TSMC 28 nm HPC+ process (Taiwan). This technology base is intended for the development of high-performance integrated circuits (high-performance computing, HPC). The processor belongs to the MALT-C family. It contains 9 general-purpose RISC cores and 96 SIMD processor elements. The estimated chip area is 12 mm², and the estimated power consumption is 1.2 W at a frequency of 0.8 GHz. Estimated date of sample delivery: January 2018.



Japan kicks off AI supercomputer project



Sunway TaihuLight supercomputer



Japan has started a project to build the world's fastest supercomputer by the end of 2017.



Intel Expands Its Comfort Zone with New ARM-Powered FPGAs for Datacenters

Intel announced it is sampling its Stratix 10 FPGAs, the latest family of field programmable gate arrays that are designed to accelerate a number of datacenter workloads. The new devices, which Intel is calling “the most significant FPGA innovations in over a decade,” offer advanced features like embedded 64-bit ARM processors, second-generation High Bandwidth Memory (HBM2), and DSP blocks.

The server applications Intel is targeting with the Stratix 10 family are somewhat tangential to those served by Nvidia's and AMD's newest GPU accelerators, as well as Intel’s own Knights Landing Xeon Phi. However, Intel is aiming at workloads such as signal processing, data compression, data encryption, storage management, and video encoding; in truth, though, the target is practically any server-side application where data throughput is the driving criterion. With the DSP units offering lots of hardwired flops, these devices can also be used for high performance computing.


The debugging toolkit for MALT has been released


The first version of the toolkit for MALT software development and debugging has been released. The kit includes an emulator, a debugger and a profiler. The emulator makes it possible to execute and debug MALT programs on general-purpose computers running Unix-like systems. The emulator, its integrated GDB debugger and the profiler significantly simplify developing and porting programs for the MALT system, and also make it possible to evaluate the efficiency of an algorithm's implementation on MALT without running it on real hardware.



Kilocore - World's First 1,000-Processor Chip


Image: The University of California

A microchip containing 1,000 independent programmable processors has been designed by a team at the University of California, Davis, Department of Electrical and Computer Engineering.



ISC High Performance 2016 Conference


ISC High Performance 2016


The ISC High Performance (ISC 2016) conference, held 19-23 June, attracted 3,092 attendees from 53 countries, as well as 146 companies and research organizations showcasing their technologies and services at the ISC exhibition.



A C compiler for the MALT-Cv1 programmable accelerator has been developed


We’ve developed a C compiler that generates optimized code for the programmable accelerator architecture. On target tasks, the code generated by the compiler reaches 80% of the performance of code written by a programmer in assembly language! The compiler has been developed using a set of domain-specific languages (DSLs) for rapid translator construction. This DSL set makes it possible to describe the main phases of translation. In particular, it includes Prolog-like descriptions of program transformation rules and a combinatorial approach to building traversal strategies for intermediate representation graphs.
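To illustrate the general idea of separating transformation rules from traversal strategies, here is a minimal sketch in Python. It is not the MALT toolchain and does not reproduce its DSLs: the `Node` type, the three rules and the `choice`, `attempt` and `bottom_up` helpers are all hypothetical names chosen for this example, and plain Python functions stand in for the Prolog-like rule descriptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple, Union


@dataclass(frozen=True)
class Node:
    """A node of a toy intermediate representation (hypothetical, for illustration)."""
    op: str
    args: Tuple["Node", ...] = ()
    value: Union[int, str, None] = None  # payload for "const" and "var" nodes


def const(v: int) -> Node:
    return Node("const", value=v)


def var(name: str) -> Node:
    return Node("var", value=name)


# A rule returns a rewritten node, or None when it does not apply.
Rule = Callable[[Node], Optional[Node]]


def mul_by_one(n: Node) -> Optional[Node]:
    # x * 1  ->  x
    if n.op == "mul" and n.args[1] == const(1):
        return n.args[0]
    return None


def add_zero(n: Node) -> Optional[Node]:
    # x + 0  ->  x
    if n.op == "add" and n.args[1] == const(0):
        return n.args[0]
    return None


def fold_add(n: Node) -> Optional[Node]:
    # c1 + c2  ->  (c1 + c2) evaluated at compile time
    if n.op == "add" and all(a.op == "const" for a in n.args):
        return const(sum(a.value for a in n.args))
    return None


def choice(*rules: Rule) -> Rule:
    """Try the rules in order and return the first successful rewrite."""
    def strategy(n: Node) -> Optional[Node]:
        for rule in rules:
            out = rule(n)
            if out is not None:
                return out
        return None
    return strategy


def attempt(rule: Rule) -> Callable[[Node], Node]:
    """Apply a rule if it matches; otherwise keep the node unchanged."""
    def strategy(n: Node) -> Node:
        out = rule(n)
        return n if out is None else out
    return strategy


def bottom_up(strategy: Callable[[Node], Node]) -> Callable[[Node], Node]:
    """Rewrite the children first, then apply the strategy to the rebuilt node."""
    def walk(n: Node) -> Node:
        rebuilt = Node(n.op, tuple(walk(a) for a in n.args), n.value)
        return strategy(rebuilt)
    return walk


simplify = bottom_up(attempt(choice(mul_by_one, add_zero, fold_add)))

# (x * 1) + (2 + 3)  simplifies to  x + 5
expr = Node("add", (Node("mul", (var("x"), const(1))), Node("add", (const(2), const(3)))))
simplified = simplify(expr)
print(simplified == Node("add", (var("x"), const(5))))  # prints "True"
```

The rules know nothing about how the tree is walked, and the traversal knows nothing about the rules, so a different strategy (for example top-down, or repeated until nothing changes) could reuse the same rules unchanged.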



We've started to create the MALT-Cv1 netlist using 28 nanometer TSMC technology


We've started to create the MALT-Cv1 netlist using 28 nanometer HPC+ (high-performance computing) TSMC technology. The planned chip area is 12 mm². This area is the economic optimum for pilot batch manufacturing under MPW (Multi-Project Wafer). Estimated power consumption on a target task is 1 W, which makes it possible to achieve considerably higher computational energy efficiency than on a CPU or GPU.



An assembler and an emulator for the programmable accelerator have been developed as parts of MALT


We’ve developed an assembler that uses an algebraic syntax similar to that of the C language. As a result, a program written for the assembler is also a valid C program. This beneficial side effect of the algebraic notation made it possible to implement high-performance software simulation of the programmable accelerator using an ordinary C compiler.



The Leopard processor element for the MALT vector coprocessor has been designed


The development of a processor element for the Leopard vector accelerator has been completed. The processor element architecture was chosen to meet the requirement of maximum flexibility (from a programming perspective) combined with high performance and energy efficiency on target tasks. As a result, an architecture based on an ALU tree was chosen.



The development of MALT processors with vector and mixed architecture has been started


The development of MALT processors with vector and mixed architecture has been started.


The most complex problems in discrete mathematics, those whose data parallelizes perfectly or almost perfectly and whose procedures are compact (in terms of core size) yet mathematically very complex, can be solved energy-efficiently only on specialized programmable or configurable computing structures implemented in FPGA/VLSI or as blocks within a CPU/GPU.



A 210-core processor on a Xilinx Virtex-7 FPGA has been built


We’ve recently finished assembling and debugging a new monster: a 210-core processor prototype on the Xilinx Virtex-7 2000T FPGA. This is the biggest chip in the 7th generation of Xilinx FPGAs, and our MALT system is the largest array of independent 32-bit RISC cores prototyped on a single FPGA known today.

