Articles
From time to time I write articles for computer magazines. Topic ideas come to me by chance, for example by reading news feeds or social media. For me, writing an article with the aim of making a topic understandable to others is a great way to really understand that topic. Click on the cover images below to get to the texts.
Metal parallelization of llm.c
Metal is Apple’s low-level API for GPU programming and llm.c is Andrej Karpathy’s plain C and CUDA implementation of GPT-2. The C version leverages OpenMP to parallelize the layer functions on the CPU cores. The CUDA version is highly optimized for multi-node multi-accelerator parallelization on NVIDIA GPUs using Open MPI and NCCL.
I once ported the C version to Swift and used Grand Central Dispatch (GCD) for CPU parallelization; the Xcode project is in llm.swift. Despite using the -Ounchecked Swift compiler option, which generates fast code without bounds checks, the C version compiled with clang runs about 6 times faster than the Swift port.
Run llm.c in TornadoVM
TornadoVM lets Java programs execute on accelerators. llm.c is a plain C implementation of OpenAI's GPT-2, the LLM that powered the first ChatGPT. Released in fall 2022, it sparked an AI hype that continues to this day. The two are not an obvious fit at first glance, but a Java version of llm.c could make them friends, so I tried to bring them together.
Although there was already a Java port of llm.c, I made my own to get (back) into the groove of Java. I defined some obvious classes, turned C functions into Java methods, replaced pointers with array indices, used Java Streams instead of OpenMP to parallelize for-loops, and leveraged the Java Vector API for matrix multiplication (the latter taken from llama2.java, thx for sharing @TheMukel).
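To give an idea of the OpenMP-to-Streams translation, here is a minimal sketch of a matmul forward pass. The method and array names are simplified illustrations, not the actual llm.c signatures: an `#pragma omp parallel for` over the output rows becomes a parallel `IntStream` over the same index range.

```java
import java.util.stream.IntStream;

public class MatmulSketch {
    // Simplified sketch: out has shape (rows, oc), inp (rows, ic),
    // weight (oc, ic), all flattened into 1D arrays as in llm.c.
    // The outer C loop "#pragma omp parallel for" over rows maps to
    // IntStream.range(...).parallel().forEach(...).
    static void matmulForward(float[] out, float[] inp, float[] weight,
                              float[] bias, int rows, int ic, int oc) {
        IntStream.range(0, rows).parallel().forEach(i -> {
            for (int o = 0; o < oc; o++) {
                float val = (bias != null) ? bias[o] : 0.0f;
                for (int k = 0; k < ic; k++) {
                    val += inp[i * ic + k] * weight[o * ic + k];
                }
                out[i * oc + o] = val;
            }
        });
    }

    public static void main(String[] args) {
        // 2x2 identity weight and no bias: output equals input.
        float[] inp = {1f, 2f, 3f, 4f};
        float[] w = {1f, 0f, 0f, 1f};
        float[] out = new float[4];
        matmulForward(out, inp, w, null, 2, 2, 2);
        System.out.println(out[0] + " " + out[3]); // prints "1.0 4.0"
    }
}
```

Each row index is independent, so the parallel stream needs no synchronization; the Vector API version additionally vectorizes the inner dot-product loop, but that requires the incubator module `jdk.incubator.vector` and is omitted here.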