Making Models Fit: Quantization and Memory Optimization for Single-GPU Inference
The model weights alone consume 28 GB of a 48 GB GPU. This post explores what happens when you shrink them, and what you can do with the memory you get back.
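The 28 GB figure is consistent with a model of roughly 14 billion parameters stored in 16-bit precision, since weight memory is simply parameter count times bytes per parameter. A quick back-of-envelope sketch (the 14B parameter count is an assumption for illustration):

```python
# Rough weight-memory estimate: params * bytes-per-param.
# 14B parameters is an assumed model size that yields ~28 GB at 16 bits.
params = 14e9

bytes_per_param = {
    "fp16/bf16": 2,    # 16-bit floats
    "int8": 1,         # 8-bit quantization
    "int4": 0.5,       # 4-bit quantization
}

for fmt, nbytes in bytes_per_param.items():
    gb = params * nbytes / 1e9
    print(f"{fmt:>9}: {gb:5.1f} GB")
```

Halving the precision halves the weight footprint, which is the headroom the rest of this post is about.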