MARS · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A serving system for AI agents that watches both GPU and CPU pressure to decide what runs next.

As stated in the literature

A holistic LLM serving system for agentic workloads featuring AIMD-based admission, multi-level feedback queue scheduling, and unified KV-eviction across CPU and GPU pressure signals.

Why it matters: Agentic workloads behave very differently from chat workloads, so serving systems designed for them can meaningfully outperform general-purpose ones.

For example, when GPU memory gets tight, MARS decides whether to evict cached state from a chatty agent or admit a new short request based on signals from both CPU and GPU pressure.

Heard on the show

“It's called "MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems," out of Duke, posted to arXiv in April.”

Episode 016 — Why Your Coding Agent Stalls While the GPU Runs Hot

Mentioned in 1 episode

016
Why Your Coding Agent Stalls While the GPU Runs Hot

Related terms

agent AIMD CPU multi-level feedback queue