The process unfolds in two main stages: prefill and decoding.
https://arxiv.org/pdf/2504.19720v1 Taming the Titans: A Survey of Efficient LLM Inference Serving
https://www.anyscale.com/blog/continuous-batching-llm-inference
Memory bandwidth is the rate at which data can be read from or stored into a memory by a processor.
<aside> 💡
LLM inference is memory bandwidth bound because the GPU spends most of its time transferring large amounts of weights and cached data — not doing computation.
</aside>
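A back-of-the-envelope estimate makes this concrete. All numbers below are illustrative assumptions (a 7B-parameter fp16 model on an A100-class GPU with ~2 TB/s of HBM bandwidth), not measurements from the sources above:

```python
# Rough upper bound on single-sequence decode speed when memory-bandwidth bound.
# Every decode step must stream essentially all weights through the compute cores.
params = 7e9             # 7B-parameter model (assumed)
bytes_per_param = 2      # fp16
weight_bytes = params * bytes_per_param      # ~14 GB of weights

bandwidth = 2.0e12       # ~2 TB/s HBM bandwidth (A100-class, assumed)

time_per_token = weight_bytes / bandwidth    # seconds per decode step
tokens_per_sec = 1 / time_per_token

print(f"~{tokens_per_sec:.0f} tokens/s upper bound at batch size 1")  # ~143
```

Batching N sequences amortizes the same weight traffic over N output tokens per step, which is why throughput scales with batch size until memory runs out.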
FlashAttention achieved significant throughput improvements by reorganizing the attention computation to require less memory I/O.
LLM inference throughput is largely determined by how large a batch you can fit into high-bandwidth GPU memory.
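The batch-size ceiling is largely set by the KV cache. A sketch of the standard per-token estimate — two tensors (K and V) per layer, each of hidden size, in fp16 — using assumed shapes for a 13B-class model (40 layers, 40 heads, head dim 128):

```python
# Per-token KV-cache footprint, with model shapes assumed for a 13B-class model.
n_layers = 40
n_heads = 40
head_dim = 128
bytes_per_value = 2      # fp16

# 2x for the K and V tensors stored at every layer.
kv_bytes_per_token = 2 * n_layers * n_heads * head_dim * bytes_per_value
print(f"{kv_bytes_per_token / 1e6:.2f} MB per token")    # 0.82 MB

# With, say, 20 GB of GPU memory left after the weights, the cache budget
# caps the total tokens resident across the whole batch:
free_bytes = 20e9
max_tokens = free_bytes / kv_bytes_per_token
print(f"~{max_tokens:.0f} tokens across all sequences in the batch")
```

Roughly 0.8 MB per token means a few tens of thousands of tokens total, split across all in-flight sequences — this, not compute, is usually what stops you from growing the batch.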
Static batching: the GPU is underutilized because generation lengths differ across requests in a batch — finished sequences sit idle until the longest one completes.

Continuous batching: once a sequence in a batch completes generation, a new sequence is inserted in its place, yielding higher GPU utilization than static batching.

https://www.usenix.org/system/files/osdi22-yu.pdf Orca paper
https://www.usenix.org/conference/osdi22/presentation/yu
Q: How can the model handle a batch where requests are in different phases?
Continuous batching doesn't merge prefill and decode tokens into a single model input tensor for one pass. Instead, it lets the serving system's scheduler interleave separate prefill computations (for new requests) and decode computations (for ongoing requests). The Orca paper makes this possible through iteration-level scheduling (scheduling decisions happen after every model iteration, not after every request) and selective batching (only the operations that can safely be batched across sequences of different lengths are batched).
This keeps the GPU highly utilized and allows new requests to start processing without waiting for all ongoing requests to fully complete, leading to better overall system performance.
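The scheduling idea above can be sketched as a toy simulation (hypothetical request lengths, no real model is run): a fixed pool of batch slots where a finished sequence immediately frees its slot for a queued request, compared against static batching, which waits for the whole batch to finish:

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode steps used when each batch must fully finish before the next starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        steps += max(batch)           # every slot waits for the longest sequence
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Decode steps used when a finished sequence frees its slot for a queued request."""
    queue = deque(lengths)
    slots = []                        # remaining tokens per active sequence
    steps = 0
    while queue or slots:
        while queue and len(slots) < batch_size:   # refill free slots
            slots.append(queue.popleft())
        steps += 1                    # one decode iteration for all active slots
        slots = [r - 1 for r in slots if r > 1]    # drop finished sequences
    return steps

lengths = [3, 9, 4, 8, 2, 7, 5, 6]    # hypothetical generation lengths
print(static_batching_steps(lengths, 4))       # 16
print(continuous_batching_steps(lengths, 4))   # 14
```

Even in this tiny example the slot-refilling scheduler finishes in fewer iterations, and the gap widens as length variance across requests grows.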
https://huggingface.co/blog/bloom-inference-optimization
https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
https://huggingface.co/docs/transformers/v4.35.2/en/llm_tutorial_optimization
https://huggingface.co/blog/assisted-generation
Memory bandwidth is the limitation in the matrix computations (moving data from VRAM to the GPU compute cores): the bottleneck in the forward pass comes from loading the model's layer weights into the compute cores of your device, not from performing the computations themselves.
Language decoder forward pass, revisited
<aside> 💡
When caching is disabled, the input contains the entire sequence of tokens generated so far and the output contains the logits corresponding to the next token for all positions in the sequence! The logits at position N correspond to the distribution for the next token if the input consisted of the first N tokens, ignoring all subsequent tokens in the sequence.
</aside>
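A tiny numpy sketch of the aside, with a random single-head causal attention layer standing in for the whole decoder (purely illustrative weights and shapes): because of the causal mask, the output at position N is unchanged by any tokens after N, so the logits at position N depend only on the first N tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                     # toy hidden size
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def causal_attention(x):
    """Single-head causal self-attention over x of shape (seq_len, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)   # position i attends only to <= i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

x = rng.normal(size=(6, d))               # a "sequence" of 6 token embeddings
full = causal_attention(x)
prefix = causal_attention(x[:4])          # drop the last two tokens

# The first 4 output rows match exactly: later tokens never affect them.
print(np.allclose(full[:4], prefix))      # True
```

This prefix invariance is exactly what the KV cache exploits: outputs for already-processed positions never change, so their K and V projections can be stored and reused instead of recomputed each step.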