Making Models Fit: Quantization and Memory Optimization for Single-GPU Inference
The model weights alone consume 28 GB of a 48 GB GPU. This post explores what happens when you shrink them, and what you can do with the memory you get back.
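The 28 GB figure is consistent with a model of roughly 14 billion parameters stored in 16-bit precision, since weight memory is simply parameter count times bytes per parameter. A quick back-of-envelope sketch (the 14B parameter count is an assumption for illustration):

```python
# Rough weight-memory estimate: params * bytes-per-param.
# 14B parameters is an assumed model size that yields ~28 GB at 16 bits.
params = 14e9

bytes_per_param = {
    "fp16/bf16": 2,    # 16-bit floats
    "int8": 1,         # 8-bit quantization
    "int4": 0.5,       # 4-bit quantization
}

for fmt, nbytes in bytes_per_param.items():
    gb = params * nbytes / 1e9
    print(f"{fmt:>9}: {gb:5.1f} GB")
```

Halving the precision halves the weight footprint, which is the headroom the rest of this post is about.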