LLM Inference Workflow

The process unfolds in two main stages: prefill and decoding.

Prefill phase: Parallel processing

Decoding Phase: Autoregressive Generation
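A minimal sketch of the two phases, using a hypothetical `toy_forward` in place of a real transformer: prefill processes the whole prompt in one parallel pass (populating the KV cache), then decode generates one token per forward pass, reusing the cache.

```python
# toy_forward is a hypothetical stand-in for a transformer forward pass.
# It processes only tokens not yet in the cache, appends their "KV" entries,
# and returns a deterministic toy "next token".

def toy_forward(tokens, kv_cache):
    """Process only the tokens not already cached; append their KV entries."""
    new_tokens = tokens[len(kv_cache):]
    for t in new_tokens:
        kv_cache.append(t * 2)  # stand-in for a key/value pair
    # Toy next-token "prediction" as a function of the cached state.
    return sum(kv_cache) % 50, kv_cache

def generate(prompt, n_new):
    kv_cache = []
    # Prefill: one pass over the whole prompt in parallel (one cache update).
    next_tok, kv_cache = toy_forward(prompt, kv_cache)
    out = [next_tok]
    # Decode: autoregressive, one new token per forward pass, reusing the cache.
    for _ in range(n_new - 1):
        next_tok, kv_cache = toy_forward(prompt + out, kv_cache)
        out.append(next_tok)
    return out
```

Note that only the decode loop is sequential; the prefill pass touches every prompt token in a single call, which is why prefill is compute-heavy and decode is not.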


Model Parallelization


Overview, Survey

https://arxiv.org/pdf/2504.19720v1 Taming the Titans: A Survey of Efficient LLM Inference Serving

https://www.anyscale.com/blog/continuous-batching-llm-inference

Memory bandwidth is the rate at which a processor can read data from or write data to memory.

<aside> 💡

LLM inference is memory bandwidth bound because the GPU spends most of its time transferring large amounts of weights and cached data — not doing computation.

</aside>
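A back-of-envelope calculation makes this concrete. Assuming illustrative numbers (a 7B-parameter model in fp16 on a card with ~2 TB/s of HBM bandwidth), every decode step must stream all of the weights from memory:

```python
# Back-of-envelope: every decode step must stream all weights from HBM.
# The numbers below are illustrative assumptions, not measurements.
params = 7e9               # 7B-parameter model
bytes_per_param = 2        # fp16
hbm_bandwidth = 2e12       # ~2 TB/s (A100-class HBM)

weight_bytes = params * bytes_per_param          # 14 GB per decode step
min_step_time = weight_bytes / hbm_bandwidth     # seconds; the memory-IO floor
max_tokens_per_s = 1 / min_step_time             # per sequence, at batch size 1

print(f"{min_step_time * 1e3:.1f} ms/step -> at most {max_tokens_per_s:.0f} tok/s")
# ~7 ms per step, so roughly 140 tokens/s per sequence at batch size 1,
# no matter how fast the compute units are.
```

This floor is per sequence, not per batch: the same 14 GB transfer serves every sequence in the batch, which is exactly why batching raises throughput.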

FlashAttention found significant throughput improvements by reorganizing the attention computation to require less memory-IO.

LLM inference throughput is largely determined by how large a batch you can fit into high-bandwidth GPU memory.
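A sketch of why: the weights are a fixed cost, but the KV cache grows with batch size and sequence length, so the free HBM caps the number of concurrent sequences. Model shapes below are Llama-2-7B-like assumptions, not measured values.

```python
# How many sequences fit in HBM? Weights are a fixed cost; the KV cache grows
# with batch size and sequence length. Shapes are Llama-2-7B-like assumptions.
n_layers, n_kv_heads, head_dim = 32, 32, 128
bytes_per_elem = 2                       # fp16
seq_len = 2048

# Per token: 2 (K and V) * layers * heads * head_dim elements.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
kv_bytes_per_seq = kv_bytes_per_token * seq_len   # 1 GiB at 2048 tokens

gpu_mem = 80e9                           # 80 GB card
weight_bytes = 7e9 * 2                   # 14 GB of fp16 weights
free_for_kv = gpu_mem - weight_bytes

max_batch = int(free_for_kv // kv_bytes_per_seq)
print(kv_bytes_per_token, max_batch)
```

At these shapes each token costs about 512 KB of cache, so a full 2048-token context consumes roughly 1 GiB per sequence; the batch size, and hence throughput, is bounded by how many of those fit after the weights.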

Static batching: the GPU is underutilized because generation lengths differ across the requests in a batch; sequences that finish early leave their slots idle until the longest one completes.


Continuous batching: once a sequence in a batch has completed generation, a new sequence can be inserted in its place, yielding higher GPU utilization than static batching.

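A toy simulation of the two policies (hypothetical generation lengths, one slot per sequence) shows the gap: static batching holds every slot until the batch's longest sequence finishes, while continuous batching refills a slot the moment its sequence completes.

```python
import heapq

# Toy schedule comparison: total steps needed to serve the same requests
# under static vs continuous batching. gen_lens are hypothetical lengths.

def static_batching_steps(gen_lens, batch_size):
    """Each batch runs until its longest member finishes, then the next starts."""
    steps = 0
    for i in range(0, len(gen_lens), batch_size):
        steps += max(gen_lens[i:i + batch_size])
    return steps

def continuous_batching_steps(gen_lens, batch_size):
    """A finished sequence's slot is immediately handed to a waiting request."""
    waiting = list(gen_lens)
    slots = []  # min-heap of finish times for sequences currently in the batch
    t = 0
    while waiting or slots:
        while waiting and len(slots) < batch_size:
            heapq.heappush(slots, t + waiting.pop(0))
        t = heapq.heappop(slots)  # advance to the next completion
    return t

lens = [3, 9, 4, 8, 2, 7, 5, 6]
print(static_batching_steps(lens, 4), continuous_batching_steps(lens, 4))
```

With these sample lengths the static schedule takes 16 steps and the continuous one 14; the gap typically grows as the variance in generation lengths grows.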

https://www.usenix.org/system/files/osdi22-yu.pdf Orca paper

https://www.usenix.org/conference/osdi22/presentation/yu

https://www.baseten.co/blog/continuous-vs-dynamic-batching-for-ai-inference/#naive-implementation-for-basic-testing

Q: How can the model handle a batch whose requests are in different phases?

Continuous batching doesn't merge prefill and decode tokens into a single model input tensor for one pass. Instead, it enables the serving system's scheduler to intelligently interleave separate prefill computations (for new requests) and decode computations (for ongoing requests). This is made possible by:

- Iteration-level scheduling (from Orca): the scheduler decides which requests run at the granularity of a single model iteration, instead of committing to a fixed batch until every request in it completes.
- Selective batching (from Orca): only the operations that are insensitive to sequence length (e.g., the linear projections) are batched across requests; attention is computed per request.

This keeps the GPU highly utilized and allows new requests to start processing without waiting for all ongoing requests to fully complete, leading to better overall system performance.
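The scheduling idea can be sketched as follows. This toy models iteration-level scheduling in the spirit of the Orca paper, with made-up request shapes and a simplified admission rule (a request's prompt is prefilled whole in the iteration it is admitted):

```python
from collections import deque

def schedule_iterations(requests, max_batch):
    """requests: list of (prompt_len, gen_len). Returns a per-iteration work log."""
    waiting = deque(requests)
    running = []          # remaining decode tokens for each in-flight request
    log = []
    while waiting or running:
        work = []
        # Decode: every running request contributes exactly one token this pass.
        for _ in running:
            work.append("decode")
        # Prefill: admit waiting requests into free slots (whole prompt at once).
        while waiting and len(running) < max_batch:
            prompt_len, gen_len = waiting.popleft()
            work.append(f"prefill({prompt_len})")
            running.append(gen_len)
        log.append(work)
        # One token was produced per running request; drop finished sequences.
        running = [g - 1 for g in running if g - 1 > 0]
    return log
```

Each iteration's work list mixes decode steps for ongoing requests with prefills for newly admitted ones, so new requests start as soon as a slot opens rather than waiting for the whole batch to drain.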


https://huggingface.co/blog/bloom-inference-optimization

https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/

https://huggingface.co/docs/transformers/v4.35.2/en/llm_tutorial_optimization

https://huggingface.co/blog/assisted-generation

Memory bandwidth is the limiting factor in the matrix computations (moving data from GPU memory, i.e. VRAM, to the GPU compute cores): the bottleneck in the forward pass is loading the model's layer weights into the computation cores of your device, not performing the computations themselves.
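A quick arithmetic-intensity estimate shows when this stops being true. A decode matmul against a [d, d] fp16 weight performs about 2·d·d FLOPs per sequence while moving 2·d·d bytes of weights, i.e. roughly B FLOPs per byte at batch size B; the GPU's own FLOPs-per-byte ratio then gives the batch size at which compute catches up. The hardware numbers below are illustrative A100-class figures.

```python
# Arithmetic intensity of a decode step: ~B FLOPs per weight byte at batch
# size B (2*d*d FLOPs per sequence vs 2*d*d bytes of fp16 weights moved).
# Illustrative A100-class figures: ~312 TFLOP/s fp16, ~2 TB/s HBM bandwidth.
flops_per_byte_gpu = 312e12 / 2e12

# The kernel stays memory-bound until the batch supplies that many FLOPs/byte:
min_compute_bound_batch = flops_per_byte_gpu
print(min_compute_bound_batch)
```

At these figures the crossover is around batch size 156: below it, decode throughput is set by memory bandwidth, which is why small-batch latency barely changes as you add compute.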

Language decoder forward pass, revisited

<aside> 💡

When caching is disabled, the input contains the entire sequence of tokens generated so far and the output contains the logits corresponding to the next token for all positions in the sequence! The logits at position N correspond to the distribution for the next token if the input consisted of the first N tokens, ignoring all subsequent tokens in the sequence.

</aside>
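This can be checked with a toy causal "decoder" whose output at position i depends only on tokens 0..i: the logits the full sequence produces at position N-1 are identical to those from running just the first N tokens.

```python
# Toy stand-in for a no-cache forward pass: one "logit" per input position,
# where position i depends only on tokens[0..i] (causality).
def toy_causal_logits(tokens):
    return [sum(tokens[:i + 1]) % 17 for i in range(len(tokens))]

seq = [5, 11, 2, 7, 3]
full = toy_causal_logits(seq)
for n in range(1, len(seq) + 1):
    prefix = toy_causal_logits(seq[:n])
    # Position N-1 of the full run ignores all tokens after the first N.
    assert prefix[-1] == full[n - 1]
```

This property is exactly what the KV cache exploits: since earlier positions never depend on later tokens, their intermediate state can be computed once and reused.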