Definition
LLM serving is the operational side of running language models — provisioning, load balancing, autoscaling, request routing — on top of the inference engine itself. It’s where SRE concerns meet model concerns, and where unmeasured tail latency goes to live.
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
- SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
- SnapKV: LLM Knows What You are Looking for Before Generation