We assume you have practical experience with commodity x86, GPU, and ARM systems and with the C/C++ languages. Most likely you have programmed GPUs in specialized languages such as OpenCL or CUDA, and quite possibly you have applied hardware-specific techniques to improve the performance of your code on general-purpose or graphics processors. Like any programmer, you want to spend as little time as possible delving into the hardware and to work in the languages you are most comfortable with, while still squeezing everything you can out of the hardware to get truly fast and/or (depending on the customer's requirements) energy-efficient code.


With all this in mind, and well aware of how difficult it is to start working with any new architecture, we aim to offer a universal approach to programming all types of MALT processors. It lets you begin with C/C++ code, whether it already exists or is written from scratch using the examples we provide, and then, following our recommendations, optimize it to get 100% out of the hardware. A full-featured emulator makes it easy to debug user code, and a companion profiler helps detect and eliminate bottlenecks.


Depending on their preferences and on how far the application code needs to be optimized, a programmer can choose one of the following approaches:


C++ for MALT. The easiest way to start working with MALT. The C++17 standard is supported for programming the scalar cores, and threading is exposed through the standard C++ facilities: std::thread, std::mutex, etc. MALT behaves as an ordinary multicore processor, so existing software and libraries are easy to port.


OpenCL for MALT. The OpenCL standard is used to work with both the scalar and the vector MALT cores. The implementation ships with a library of problem-oriented algorithms optimized for MALT, making it easier for AMD, NVIDIA, and ARM users to switch to MALT.


MALTCC. A functional analog of NVIDIA's NVCC: a dedicated front end for the underlying compilers of the scalar and vector cores. MALTCC exploits the potential of the architecture and considerably simplifies program parallelization for the target problem classes.


Note that some of the programming tools described here are still under development, so please contact us for up-to-date information on tool availability.



C++ for MALT


When the standard C++ threading facilities are used, a MALT system behaves like a typical multicore processor. Language standards up to and including C++17 are supported, and functions for the hardware extensions live in the malt:: namespace.


An example of C++ for MALT:


#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

void load_grayscale_image(uint8_t *dst, int w, int h); // provided elsewhere

int main() {
    const int size_x = 1920;
    const int size_y = 1080;
    std::vector<uint8_t> frame(size_x * size_y), buffer(size_x * size_y);
    load_grayscale_image(frame.data(), size_x, size_y);
    memset(buffer.data(), 0, size_x * size_y * sizeof(uint8_t));

    // One 1/4-1/2-1/4 smoothing pass over [start, end] with stride `shift`;
    // the boundary pixels are copied through unchanged.
    auto f = [](uint8_t *in, uint8_t *out, int start, int end, int shift) {
        out[start] = in[start];
        for (int offset = start + shift; offset < end; offset += shift)
            out[offset] = in[offset - shift] / 4 + in[offset] / 2 + in[offset + shift] / 4;
        out[end] = in[end];
    };

    // Horizontal pass: one thread per row.
    std::vector<std::thread> threads(size_y);
    for (int y = 0; y < (int)threads.size(); y++)
        threads[y] = std::thread(f, frame.data(), buffer.data(),
                                 y * size_x, (y + 1) * size_x - 1, 1);
    for (auto &t : threads) t.join();

    // Vertical pass: one thread per column, stride size_x.
    threads.resize(size_x);
    for (int x = 0; x < (int)threads.size(); x++)
        threads[x] = std::thread(f, buffer.data(), frame.data(),
                                 x, x + (size_y - 1) * size_x, size_x);
    for (auto &t : threads) t.join();
    return 0;
}





MALTCC


MALTCC is a front end that largely automates the distribution of tasks across the scalar and vector cores of a heterogeneous MALT system. It greatly simplifies the development of target applications by hiding the difficulties of thread synchronization in an asynchronous heterogeneous system from the programmer. MALTCC is the easiest route into MALT development for programmers familiar with the NVIDIA® CUDA® SDK and similar toolkits.


An example of a MALTCC program:


// Searching passwords hashed with MD5
#include "maltcc.h"
#include "md5.c"

atomic_t found;

void cpu_check_md5(const char *s) {         // Full check on a scalar core
  Hash h = HashFn(MD5_Id, s);
  for (int i = 0; i < N_HASHES; i++)        // Result check
    if (!memcmp(h.bytes(), hashes[i], 16)) {
      printf("\tFound! %s\n", s);
      found++;
    }
}

__simd __kernel check_words(u32 w[W_LEN]) { // Executed on accelerators
  u32 i, j, r;
  u32 hs[N_HASHES], abc[ABC_SZ];
  __slave on_init() {
    for (int i = 0; i < N_HASHES; i++)      // First words of hashes
      hs[i] = cpu_to_le(hashes[i][0]);      // for express check
    for (unsigned i = 0; i < ABC_SZ; i++)
      abc[i] = alphabet[i];
  }
  j = 0; REPEAT(ABC_SZ) {
    w[0] = (w[0] & 0xFFFF) | abc[j] << 16;  j++;
    // ... MD5 rounds producing the first result word r (elided) ...
    i = 0; REPEAT(N_HASHES) {
      __success |= r == hs[i]; i++;
    }
  }
  __slave on_success() {                    // Callback on success
    u32 ws = le_to_cpu(w[0]);
    char *s = (char *)&ws; s[W_SZ] = 0;
    printf("Found? [w=%s]\n", s);
    for (int i = 0; i < ABC_SZ; i++) {      // Looking for all endings
      s[W_SZ - 1] = alphabet[i];
      cpu_check_md5(s);                     // Full check on a scalar core
    }
  }
}

__slave dict_fragment(int d) {              // Execution on scalar cores
  for (u32 i = 0; i < ABC_SZ; i++) {
    u32 w = (u32)(alphabet[d]) | (u32)(alphabet[i]) << 8;
    check_words(&w);                        // Check series of words on vector accelerators
  }
}

int main(int argc, char *argv[]) {
  for (int i = 0; i < ABC_SZ; i++)
    dict_fragment(i);                       // Execution on all scalar cores
  return found;
}


NVIDIA® and CUDA® are registered trademarks of NVIDIA Corporation.



OpenCL for MALT 


The OpenCL standard is used to work with both the scalar and the vector MALT cores. The implementation ships with a library of problem-oriented algorithms optimized for MALT, making it easier for AMD, NVIDIA, and ARM users to switch to MALT.


An example of an OpenCL program for MALT:


// OpenCL-core example code for marking specific pixels of an image

#define DIM_X   128
#define DIM_Y   128

#define M       9
#define MASK(n) (masks+(n)*M*M)
#define HESS_L  100

int D(__global uchar *img, int x, int y, __constant uchar *mask) {
  x -= M/2; y -= M/2; int v = 0;
  for (int dx = 0; dx < M; dx++)
    for (int dy = 0; dy < M; dy++)
      v += mask[dx + M*dy] * img[(x+dx) + DIM_X*(y+dy)];
  return v;
}

__kernel void mark_ipoints(__global uchar *img, __constant uchar *masks) {
  for (int x = M/2; x < DIM_X - M/2; x++) {
    for (int y = M/2; y < DIM_Y - M/2; y++) {
      int Dxx = D(img, x, y, MASK(0)),
          Dyy = D(img, x, y, MASK(1)),
          Dxy = D(img, x, y, MASK(2));
      // Set the top bit where the Hessian-determinant response exceeds the threshold
      img[x + DIM_X*y] |= (abs(5*Dxx*Dyy - 4*Dxy*Dxy) > HESS_L) << 7;
    }
  }
}