Posts

async_runtime Part II: Components of an asynchronous runtime

In the previous post we saw how asynchronous programming can be used to write concurrent programs in Rust. In this post, we will start writing our own asynchronous runtime, and we will see an overview of its different components. Our implementation is heavily inspired by this series of posts and the actual code of the smol runtime.

Components

The two main components of an asynchronous runtime are the Executor and the Reactor. The Executor is responsible for scheduling tasks and executing them. It contains the main execution loop of the runtime, which is responsible for polling the available tasks. The Reactor is responsible for registering resources that tasks are waiting for, and for waking the appropriate tasks when those resources are available.

The Executor

Here is the main loop of our Executor. Note that we are purposefully writing a single-threaded runtime, so we allow our Executor to be !Send and !Sync. ...
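To make the Executor's role concrete, here is a minimal single-threaded sketch using only the standard library. It is not the loop from the post: the names `Executor`, `NoopWaker`, `Countdown`, and `demo` are hypothetical, and the sketch assumes a simplification in which there is no Reactor at all, so every pending task is simply re-queued and polled again rather than parked until a resource wakes it.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Assumption: a waker that does nothing. A real runtime would have the
// Reactor call `wake` to reschedule a task; here pending tasks are re-polled.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

// Tasks are not required to be `Send`, so the executor stays single-threaded.
type Task = Pin<Box<dyn Future<Output = ()>>>;

#[derive(Default)]
struct Executor {
    queue: VecDeque<Task>,
}

impl Executor {
    fn spawn(&mut self, fut: impl Future<Output = ()> + 'static) {
        self.queue.push_back(Box::pin(fut));
    }

    // Main loop: pop a task, poll it once, and re-queue it if it is pending.
    fn run(&mut self) {
        let waker = Waker::from(Arc::new(NoopWaker));
        let mut cx = Context::from_waker(&waker);
        while let Some(mut task) = self.queue.pop_front() {
            if task.as_mut().poll(&mut cx).is_pending() {
                self.queue.push_back(task);
            }
        }
    }
}

// Hypothetical hand-written future: returns `Pending` `remaining` times,
// recording its id in a shared log on every poll.
struct Countdown {
    id: u32,
    remaining: u32,
    log: Rc<RefCell<Vec<u32>>>,
}

impl Future for Countdown {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        self.log.borrow_mut().push(self.id);
        if self.remaining == 0 {
            Poll::Ready(())
        } else {
            self.remaining -= 1;
            Poll::Pending
        }
    }
}

// Runs two tasks and returns the order in which they were polled.
fn demo() -> Vec<u32> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let mut executor = Executor::default();
    executor.spawn(Countdown { id: 1, remaining: 1, log: Rc::clone(&log) });
    executor.spawn(Countdown { id: 2, remaining: 0, log: Rc::clone(&log) });
    executor.run();
    let polled = log.borrow().clone();
    polled
}
```

Because the tasks are stored as non-`Send` trait objects and share state through `Rc`, the `Executor` in this sketch is itself !Send and !Sync. Busy re-polling wastes CPU, which is the gap a Reactor fills by waking a task only when its resource is ready.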

async_runtime Part I: Introduction to asynchronous programming in Rust

Asynchronous programming is widely discussed in the Rust community, and is more generally gaining popularity as a way to perform computations in I/O-intensive environments. In this series of posts, we will explore how Rust handles asynchronous computations by implementing our own asynchronous runtime.

Asynchronous programming

Concurrency is the ability of a program to make progress on several tasks over overlapping periods of time. This can be achieved by either: ...

`llm_runner` Part III: Parsing and evaluating GPT-2

In Part I we implemented a Transformer encoder in Rust; in Part II we loaded DistilBERT and ran masked language modeling. This post describes the next step in llm-runner: parsing GPT-2 weights (targeting gpt2-medium), evaluating the model with causal self-attention, and testing the pipeline. This post accompanies PR #2 on the llm-runner repository.

Adjustments to the Rust structures for GPT

Optional normalization in embeddings

DistilBERT’s embedding module includes a LayerNorm after summing token and position embeddings. GPT-2 does not normalize that sum before the first block. ...
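One way to express that difference is to make the normalization an `Option` on the embedding module. The sketch below is only an illustration under that assumption: the struct and field names (`Embeddings`, `LayerNorm`, `token`, `position`, `norm`) are hypothetical and are not taken from the llm-runner code, and plain `Vec<f32>` rows stand in for whatever tensor type the project uses.

```rust
// Hypothetical LayerNorm over a single embedding vector.
struct LayerNorm {
    gamma: Vec<f32>,
    beta: Vec<f32>,
    eps: f32,
}

impl LayerNorm {
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        let n = x.len() as f32;
        let mean = x.iter().sum::<f32>() / n;
        // Population variance, as is usual for layer normalization.
        let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
        let denom = (var + self.eps).sqrt();
        x.iter()
            .zip(self.gamma.iter().zip(&self.beta))
            .map(|(v, (g, b))| (v - mean) / denom * g + b)
            .collect()
    }
}

// Hypothetical embedding module shared by both model families.
struct Embeddings {
    token: Vec<Vec<f32>>,    // one row per vocabulary entry
    position: Vec<Vec<f32>>, // one row per position
    norm: Option<LayerNorm>, // Some(..) for DistilBERT-style, None for GPT-2-style
}

impl Embeddings {
    // Sums token and position embeddings, then normalizes only if `norm` is set.
    fn forward(&self, ids: &[usize]) -> Vec<Vec<f32>> {
        ids.iter()
            .enumerate()
            .map(|(pos, &id)| {
                let sum: Vec<f32> = self.token[id]
                    .iter()
                    .zip(&self.position[pos])
                    .map(|(t, p)| t + p)
                    .collect();
                match &self.norm {
                    Some(ln) => ln.forward(&sum),
                    None => sum,
                }
            })
            .collect()
    }
}
```

With `norm: None` the module returns the raw sum, matching the GPT-2 behavior described above; with `Some(LayerNorm { .. })` it matches the DistilBERT behavior, so the same struct can be filled from either set of weights.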

`llm_runner` Part II: Loading Weights and performing MLM inference

In the previous post, we went through the mathematics and Rust implementation of an encoder architecture. Now, let’s see how we can download a model and parse it into our DistilBert struct. Then, we will test our llm runner by performing masked language modeling (MLM) inference on a user-chosen prompt.

Getting the model from Hugging Face

We target distilbert-base-uncased, a compact encoder-only model trained with the same objectives as BERT (including MLM). Each model on the Hub is a normal Git repository; weights and tokenizer files are stored with Git LFS. Clone it with plain Git (after a one-time git lfs install on your machine): ...

`llm_runner` Part I: Implementing an Encoder in Rust

Recently, I’ve started a journey to “understand how LLMs work”. Implementing something is a good way to understand it, so I’ve started the llm-runner project. A natural starting point was the architecture described in the foundational paper Attention Is All You Need (Vaswani et al., 2017). [Vas17] did not yet use the Transformer in an LLM, but it was the first to introduce the Transformer architecture, which is built entirely on the attention mechanism. ...