Below you will find pages that utilize the taxonomy term “Parallelcomputing”
Metal parallelization of llm.c
Metal is Apple’s low-level API for GPU programming and llm.c is Andrej Karpathy’s plain C and CUDA implementation of GPT-2. The C version leverages OpenMP to parallelize the layer functions on the CPU cores. The CUDA version is highly optimized for multi-node multi-accelerator parallelization on NVIDIA GPUs using Open MPI and NCCL.
I once ported the C version to Swift and used Grand Central Dispatch (GCD) for CPU parallelization. The Xcode project is in llm.swift. Despite using the -Ounchecked Swift compiler option to generate fast code without bounds checks, the C version compiled with clang runs about 6 times faster than Swift.
Run llm.c in TornadoVM
TornadoVM lets Java programs execute on accelerators. llm.c is a plain C implementation of OpenAI’s GPT-2, a predecessor of the LLM that powered the first ChatGPT. Released in fall ’22, ChatGPT sparked an AI hype that still lasts. TornadoVM and llm.c are not an obvious fit at first glance, but a Java version of llm.c could make them friends, so I tried to bring them together.
Although there was already a Java port of llm.c, I made my own to get (back) into the groove of Java. I defined some obvious classes, turned C functions into Java methods, replaced pointers with array indices, used Java Streams instead of OpenMP to parallelize for-loops, and leveraged the Java Vector API for matrix multiplication (the latter taken from llama2.java, thx for sharing @TheMukel).
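To give an idea of what swapping an OpenMP loop for a Java Stream can look like, here is a minimal sketch of a matrix multiplication in the spirit of the layer functions in llm.c. It is not the code from my port: the class name ParallelMatmul, the method matmulForward and its parameters are made up for illustration, and it uses plain scalar loops instead of the Vector API. Pointers are replaced by offsets into flat float arrays, and the outer loop over the rows runs in parallel.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelMatmul {

    // Hypothetical helper, not taken from llm.c or any port of it.
    // Computes out[r][o] = bias[o] + sum_i inp[r][i] * weight[o][i]
    // on flat, row-major arrays. bias may be null.
    public static void matmulForward(float[] out, float[] inp, float[] weight, float[] bias,
                                     int rows, int inDim, int outDim) {
        // In C the outer loop would carry "#pragma omp parallel for".
        // Here a parallel IntStream distributes the rows over the CPU cores.
        // Each row writes to its own slice of out, so no synchronization is needed.
        IntStream.range(0, rows).parallel().forEach(r -> {
            int inOffset = r * inDim;   // replaces the pointer inp + r * inDim
            int outOffset = r * outDim; // replaces the pointer out + r * outDim
            for (int o = 0; o < outDim; o++) {
                float sum = (bias != null) ? bias[o] : 0.0f;
                int wOffset = o * inDim;
                for (int i = 0; i < inDim; i++) {
                    sum += inp[inOffset + i] * weight[wOffset + i];
                }
                out[outOffset + o] = sum;
            }
        });
    }

    public static void main(String[] args) {
        float[] inp = {1, 2, 3, 4};          // 2 rows with 2 inputs each
        float[] weight = {1, 0, 0, 1, 1, 1}; // 3 outputs with 2 weights each
        float[] bias = {0.5f, 0.5f, 0.5f};
        float[] out = new float[2 * 3];
        matmulForward(out, inp, weight, bias, 2, 2, 3);
        System.out.println(Arrays.toString(out));
    }
}
```

Parallelizing only the outer loop keeps the work per task large enough to be worth the scheduling overhead. For tiny inputs like the one in main a sequential loop would probably be faster.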
Parallel Java with CUDA
An infographic about my first approach to parallelizing Java code in 2017. It worked for me then and probably still does, but now there are tools that are much easier to use and are also much more flexible. One is TornadoVM, which essentially allows the programmer to mark up the code to be parallelized and does all the heavy lifting for execution on popular accelerators (AMD, Intel, NVIDIA) and multiple CPU cores. I created a tutorial on TornadoVM that was published in a German computer magazine: part one and two.
Parallel Java with TornadoVM
A few years ago I wrote a Java app that creates star maps (example). It does this by projecting the coordinates of celestial bodies onto a flat canvas. One of the features is to map images of artistic representations of certain star constellations onto the maps. The approach I took to perform the required calculations turned out to be quite slow.
When I heard about CUDA I was excited by the idea of doing computations on graphics cards. I wondered if my slow sequential Java code could be run much faster in parallel on a GPU.