Concept · 4 episodes

LLM Serving

Definition

LLM serving is the operational side of running language models: the provisioning, load balancing, autoscaling, and request routing that sit on top of the inference engine itself. It’s where SRE concerns meet model concerns, and where unmeasured tail latency goes to live.

Episodes covering this

179
How DeepSeek Made One User Faster Without Slowing Down the Crowd
DSpark: Confidence-Scheduled Speculative Decoding with
Xin Cheng, Xingkai Yu, Chenze Shao et al. · Peking University / DeepSeek-AI · 23 min · Jun 27, 2026
036
Sparse Attention Was the Wrong Frame. Treat It as Geometry Instead.
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
Dehghankar, Asudeh · University of Illinois Chicago · 24 min · May 11, 2026
027
When AI Agents Build the Serving Stack: A Bet on Bespoke Infrastructure
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
Kamahori, Li, Peter et al. · University of Washington · 30 min · May 08, 2026
016
Why Your Coding Agent Stalls While the GPU Runs Hot
MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems
Wang, Ye, Xu et al. · Duke University · 24 min · May 03, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.