Making Models Fit: Quantization and Memory Optimization for Single-GPU Inference
The model weights alone consume 28 GB of a 48 GB GPU. This post explores what happens when you shrink them, and what you can do with the memory you get back.
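The memory figure follows from simple arithmetic. A back-of-the-envelope sketch, assuming the 28 GB corresponds to roughly 14 billion parameters stored in fp16 (2 bytes each); the parameter count is an inference from the stated size, not a figure from the post:

```python
GB = 1e9  # decimal gigabytes, matching the 28 GB figure above

params = 14e9  # hypothetical parameter count implied by 28 GB of fp16 weights
bytes_per_weight = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Weight memory at each precision: params * bytes per weight.
for dtype, width in bytes_per_weight.items():
    print(f"{dtype}: {params * width / GB:.0f} GB")
```

Halving the precision halves the weight footprint, which is where the reclaimed memory for KV cache and batching comes from.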
I’m a software engineer focused on building infrastructure for large-scale AI systems. My work spans distributed systems and machine learning infrastructure.
I’m interested in the engineering challenges behind modern AI systems: how models move from research prototypes to reliable production systems that people rely on every day. I work on evaluation systems, data generation pipelines, and platforms for deploying AI.
I like problems that involve both machine learning and systems engineering. Many of the hard parts of AI today are not about better models, but about building systems that hold up in production. I write here about what I’m learning.
A closer look at what happens inside an LLM inference server, from prompt arrival to streamed tokens, with vLLM as the serving baseline.
That one line of code that loads a model can take minutes. Here is everything that happens behind it: resolving the model ID, downloading sharded weights, building the architecture, and placing tensors on the GPU.
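Those stages can be sketched end to end. This is a schematic, pure-Python sketch of the pipeline; every function name, the two-shard layout, and the in-memory "tensors" are illustrative stand-ins, not the API of any real loading library:

```python
# Schematic sketch of the four loading stages behind a one-line model load.
# All helpers here are hypothetical stand-ins for illustration only.

def resolve(model_id: str) -> dict:
    # 1. Resolve the model ID to a config and a list of weight-shard names.
    return {"config": {"layers": 2}, "shards": ["shard-1", "shard-2"]}

def download_shards(shards: list) -> list:
    # 2. Fetch each sharded weight file (stubbed as in-memory dicts).
    return [{"name": s, "tensors": {f"{s}.weight": [0.0]}} for s in shards]

def build_architecture(config: dict) -> dict:
    # 3. Instantiate the module tree before any weights exist on it.
    return {f"layer{i}": None for i in range(config["layers"])}

def place_on_device(model: dict, shards: list, device: str) -> dict:
    # 4. Copy each shard's tensors into the model on the target device.
    for shard in shards:
        model.update({k: (device, v) for k, v in shard["tensors"].items()})
    return model

def load_model(model_id: str, device: str = "cuda:0") -> dict:
    meta = resolve(model_id)
    return place_on_device(
        build_architecture(meta["config"]),
        download_shards(meta["shards"]),
        device,
    )
```

The minutes of wall-clock time live almost entirely in stages 2 and 4: network transfer of the shards, then host-to-device copies of the tensors.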