The process unfolds in two main stages: prefill and decoding.
https://arxiv.org/pdf/2504.19720v1 Taming the Titans: A Survey of Efficient LLM Inference Serving
https://www.anyscale.com/blog/continuous-batching-llm-inference
Memory bandwidth is the rate at which data can be read from or stored into a memory by a processor.
<aside> 💡
LLM inference is memory bandwidth bound because the GPU spends most of its time transferring large amounts of weights and cached data — not doing computation.
</aside>
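A back-of-the-envelope estimate makes this concrete. All numbers below are illustrative assumptions (a 7B-parameter fp16 model on an A100-class GPU with ~2 TB/s of HBM bandwidth), not measurements from the sources above:

```python
# Rough upper bound on single-sequence decode speed when memory-bandwidth bound.
# Every decode step must stream essentially all weights through the compute cores.
params = 7e9             # 7B-parameter model (assumed)
bytes_per_param = 2      # fp16
weight_bytes = params * bytes_per_param      # ~14 GB of weights

bandwidth = 2.0e12       # ~2 TB/s HBM bandwidth (A100-class, assumed)

time_per_token = weight_bytes / bandwidth    # seconds per decode step
tokens_per_sec = 1 / time_per_token

print(f"~{tokens_per_sec:.0f} tokens/s upper bound at batch size 1")  # ~143
```

Batching N sequences amortizes the same weight traffic over N output tokens per step, which is why throughput scales with batch size until memory runs out.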
FlashAttention achieved significant throughput improvements by reorganizing the attention computation to require less memory I/O.
LLM inference throughput is largely determined by how large a batch you can fit into high-bandwidth GPU memory.
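The batch-size ceiling is largely set by the KV cache. A sketch of the standard per-token estimate — two tensors (K and V) per layer, each of hidden size, in fp16 — using assumed shapes for a 13B-class model (40 layers, 40 heads, head dim 128):

```python
# Per-token KV-cache footprint, with model shapes assumed for a 13B-class model.
n_layers = 40
n_heads = 40
head_dim = 128
bytes_per_value = 2      # fp16

# 2x for the K and V tensors stored at every layer.
kv_bytes_per_token = 2 * n_layers * n_heads * head_dim * bytes_per_value
print(f"{kv_bytes_per_token / 1e6:.2f} MB per token")    # 0.82 MB

# With, say, 20 GB of GPU memory left after the weights, the cache budget
# caps the total tokens resident across the whole batch:
free_bytes = 20e9
max_tokens = free_bytes / kv_bytes_per_token
print(f"~{max_tokens:.0f} tokens across all sequences in the batch")
```

Roughly 0.8 MB per token means a few tens of thousands of tokens total, split across all in-flight sequences — this, not compute, is usually what stops you from growing the batch.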
Static batching: the GPU is underutilized because generation lengths differ across requests in a batch — finished sequences sit idle until the longest one completes.

Continuous batching: once a sequence in a batch completes generation, a new sequence is inserted in its place, yielding higher GPU utilization than static batching.

https://www.usenix.org/system/files/osdi22-yu.pdf Orca paper
https://www.usenix.org/conference/osdi22/presentation/yu
Q: How can the model handle a batch where requests are in different phases?
Continuous batching doesn't merge prefill and decode tokens into a single model input tensor for one pass. Instead, it lets the serving system's scheduler interleave separate prefill computations (for new requests) and decode computations (for ongoing requests). The Orca paper makes this possible through iteration-level scheduling (scheduling decisions happen after every model iteration, not after every request) and selective batching (only the operations that can safely be batched across sequences of different lengths are batched).
This keeps the GPU highly utilized and allows new requests to start processing without waiting for all ongoing requests to fully complete, leading to better overall system performance.
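The scheduling idea above can be sketched as a toy simulation (hypothetical request lengths, no real model is run): a fixed pool of batch slots where a finished sequence immediately frees its slot for a queued request, compared against static batching, which waits for the whole batch to finish:

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode steps used when each batch must fully finish before the next starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        steps += max(batch)           # every slot waits for the longest sequence
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Decode steps used when a finished sequence frees its slot for a queued request."""
    queue = deque(lengths)
    slots = []                        # remaining tokens per active sequence
    steps = 0
    while queue or slots:
        while queue and len(slots) < batch_size:   # refill free slots
            slots.append(queue.popleft())
        steps += 1                    # one decode iteration for all active slots
        slots = [r - 1 for r in slots if r > 1]    # drop finished sequences
    return steps

lengths = [3, 9, 4, 8, 2, 7, 5, 6]    # hypothetical generation lengths
print(static_batching_steps(lengths, 4))       # 16
print(continuous_batching_steps(lengths, 4))   # 14
```

Even in this tiny example the slot-refilling scheduler finishes in fewer iterations, and the gap widens as length variance across requests grows.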
https://huggingface.co/blog/bloom-inference-optimization
https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
https://huggingface.co/docs/transformers/v4.35.2/en/llm_tutorial_optimization
https://huggingface.co/blog/assisted-generation
Memory bandwidth is the limitation in the matrix computations (moving data from VRAM to the GPU compute cores): the bottleneck in the forward pass comes from loading the model's layer weights into the compute cores of your device, not from performing the computations themselves.
Language decoder forward pass, revisited
<aside> 💡
When caching is disabled, the input contains the entire sequence of tokens generated so far and the output contains the logits corresponding to the next token for all positions in the sequence! The logits at position N correspond to the distribution for the next token if the input consisted of the first N tokens, ignoring all subsequent tokens in the sequence.
</aside>
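A tiny numpy sketch of the aside, with a random single-head causal attention layer standing in for the whole decoder (purely illustrative weights and shapes): because of the causal mask, the output at position N is unchanged by any tokens after N, so the logits at position N depend only on the first N tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                     # toy hidden size
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def causal_attention(x):
    """Single-head causal self-attention over x of shape (seq_len, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)   # position i attends only to <= i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

x = rng.normal(size=(6, d))               # a "sequence" of 6 token embeddings
full = causal_attention(x)
prefix = causal_attention(x[:4])          # drop the last two tokens

# The first 4 output rows match exactly: later tokens never affect them.
print(np.allclose(full[:4], prefix))      # True
```

This prefix invariance is exactly what the KV cache exploits: outputs for already-processed positions never change, so their K and V projections can be stored and reused instead of recomputed each step.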