Episode 090 · May 27, 2026 · 28 min

How MiniMax-M2 Bets That Sparsity Plus Verifiable Rewards Can Match Frontier Agents

MiniMax

LLM Agent Training

paperdive.ai

Concepts in this episode

Agentic AI · Training Methods · Evaluation & Benchmarks · Reinforcement Learning · Agentic RL · Reward Model · Long-Horizon Tasks · Rollout Sampling · Policy Gradient · Synthetic Data · Self-Play / Self-Evolution · AI Coding Agents · SWE-bench · Agent Benchmarks · Inference Cost · Iterative Training · Long Context

About this episode

Paper: The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
Venue: arXiv:2605.26494
Year: 2026
Read the paper: arxiv.org/abs/2605.26494

MiniMax claims their new model matches Claude Opus and GPT-5 on agentic tasks while using one-tenth the per-token compute. The architecture is barely novel — the real bet is on verifiable reward pipelines, custom RL infrastructure, and a model that's starting to debug its own training runs. We dig into where that bet holds up and where it's still asserted rather than shown.

What you'll take away

Why MiniMax abandoned hybrid attention after hundreds of billions of tokens of experiments — and what their negative result reveals about long-context evaluation
How they built verifiable rewards for messy domains like app development and deep web search, not just math
The two concrete engineering tricks in their Forge RL system: windowed FIFO scheduling and prefix tree merging (which they claim gives up to 40x speedups)
Why the 'self-evolution' story is the most exciting and least rigorously demonstrated part of the paper
Where M2.7 actually trails frontier models — raw knowledge and reasoning benchmarks — and why the abstract oversells the headline claim
What this paper implies about the field's missing public infrastructure for evaluating long-horizon agentic capability

Chapters

00:00 The headline claim and what 'agentic' means here
03:30 The architecture and the honest negative result on hybrid attention
07:01 Verifiable rewards as the limiting reagent
10:32 Forge and the impossible triangle of agent RL
14:03 CISPO and asymmetric clipping
17:34 Self-evolution: real result, large extrapolation
21:04 Steelman critique: internal benchmarks and missing ablations
24:35 What the bet implies for the next phase of LLM progress

References in this episode

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models — The fine-grained MoE architecture that influenced the 256-expert design MiniMax-
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — The benchmark that pioneered the executable-test verification approach MiniMax e
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — A contemporaneous case study in scaling verifiable-reward RL, useful contrast to
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering — The OpenAI benchmark behind the 'MLE Bench Lite' Kaggle-style evaluation MiniMax

Full transcript

0:00Cassidy: A model with about two hundred and thirty billion parameters, where only ten billion of them light up for any given token. Roughly twenty-three to one sparsity. And on a bunch of long-horizon coding and research benchmarks, that model — running about one-tenth the per-token compute of Claude Opus or GPT-5 — is right in the mix.

0:21Finn: That's the headline claim of a technical report MiniMax put on arXiv on May twenty-sixth, twenty-twenty-six, and we're recording one day later. Before we get into whether the claim holds, quick ground rules. This is AI-generated. The script is from Anthropic's Claude Opus 4.7. I'm Finn, that's Cassidy, and we're both AI voices from ElevenLabs. Neither company is involved in producing this show. The paper is called "The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence," and the bet baked into that title is exactly what we want to interrogate.

0:58Cassidy: Right. Because the obvious question is: if you only spend a tenth of the compute per token, you should be a tenth as good. So where does the missing capability come from? That ends up being the spine of the paper. They make a sparsity bet at the architecture level — and then the rest of the report is basically a story about where they had to pay it back. Three places, as it turns out. Better training data, much better. A custom RL system that can actually train on long agent trajectories. And, in the latest checkpoint, a model that has started debugging its own training runs.

1:36Finn: We'll get to all three. But I want to anchor on what "agent" even means here, because the whole paper is calibrated against a kind of workload that's different from "ask ChatGPT a question." When MiniMax talks about agentic tasks, they mean: spend three hours fixing a bug across a real codebase. Build me a working web app from scratch. Do open-web research and write me a memo. These run on trajectories that can hit a hundred and ninety-two thousand tokens, with thousands of intermediate tool calls. The model writes some reasoning, runs a shell command, reads the output, edits a file, runs tests, reads the failure, tries again. That's the thing they're optimizing for.

2:24Cassidy: And those trajectories are expensive in every direction. Training-time, inference-time, evaluation-time. Each step is real compute. So the per-token cost of your model isn't a minor optimization — over a multi-hour rollout it's the difference between a workload being economical and being out of reach.

2:46Finn: So let's start with what the model actually is, because the architecture is doing the headline work.

2:53Cassidy: Right. The way to picture a mixture-of-experts model: imagine a hospital with two hundred and fifty-six specialists on staff. Cardiologists, neurologists, every subspecialty you can think of. When a patient walks in, there's a triage nurse — that's the router — who picks the eight specialists relevant to this case. Everyone else stays in their office. The hospital "has" all that expertise, but any given patient only consumes the time of eight people. That's MiniMax-M2. Two hundred fifty-six fine-grained experts per layer, eight active per token. The total parameter count is huge — every expert is real, takes memory, was trained. But the active compute per token is small, because most experts are idle on any given step.
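
The triage-nurse picture can be sketched as top-k routing. This is a minimal illustration, not MiniMax's router: the 256 experts and 8 active per token come from the episode, while the random logits, the softmax over the selected experts, and the `route` helper are assumptions made for the example.

```python
import math
import random

NUM_EXPERTS = 256  # experts per layer, as stated in the episode
TOP_K = 8          # experts activated per token, as stated in the episode

def route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Hypothetical helper: real routers differ in normalization and load balancing.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Stand-in router scores for one token (made-up data).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route(logits)                 # list of (expert_index, mixing_weight)
active_fraction = TOP_K / NUM_EXPERTS  # share of experts that run for this token
```

Eight of 256 experts is a 1-in-32 activation rate at the expert level. The roughly 23-to-1 figure quoted for the whole model is looser, presumably because attention and other always-on parameters count toward the active total.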

3:46Finn: And almost nothing else in the architecture is novel. Sixty-two transformer layers, grouped-query attention, a standard speculative-decoding side module. Pretrained on almost thirty trillion tokens. The team is pretty open that the base model is not where the contribution lives. But there's one architecture decision I want to spend a minute on, because it's the most intellectually honest thing in the paper. They tried — really tried — to use something called hybrid attention. Their previous flagship used it. The idea is you interleave cheap, linear attention with normal expensive attention, so most of your layers run a much faster operation. Saves a ton of compute on long contexts. For agentic workloads where contexts run past a hundred K tokens, that's a huge deal.

4:35Cassidy: And it didn't work.

4:37Finn: It didn't work. They report hundreds of billions of tokens of continued pretraining experiments. Sliding-window variants, sink tokens, attention pattern analysis. The whole battery. At small scale and short contexts, the hybrid version looks fine. At long contexts — the regime that actually matters for agents — it consistently degrades. On a long-context retrieval test at a hundred and twenty-eight K tokens, accuracy dropped from ninety to seventy-two. So they reverted to full attention everywhere. And then said so, in detail, in the paper. That kind of negative result almost never gets published at this scale.

5:17Cassidy: And it teaches you something the abstract can't, Finn. Long-context evaluation is just genuinely hard. The proxy metrics people use at smaller scale don't always survive when you push to bigger models or harder distributions. The paper has a nice line about this — that the compute required for statistically significant long-context evaluation grows substantially with task complexity. Translation: by the time you can tell whether your clever attention trick actually works, you've already spent a fortune training the model.

5:50Finn: Right. So the field has been making bets on long-context efficiency without always being able to verify them. The MiniMax team is essentially saying: we made that bet last generation, we tried to extend it, and we couldn't make the data come out clean. So we ate the compute.

6:09Cassidy: Which means the headline sparsity has to come entirely from the MoE side, not from attention compression. Every token still pays the full attention cost over a possibly enormous context. The only thing that's "mini" is the feed-forward expert activation. That's important framing for the rest of the story — the compute they saved is real, but it's narrower than you might think.

6:34Finn: OK. So the architecture choice sets up the problem. Smaller activations, full attention, big base model. Now where does the lost capability get paid back?

6:44Cassidy: Three places. The first, and the one that genuinely carries the most weight in this paper, is the data. So let me set up the problem. If you want to do reinforcement learning on a language model — meaning, let the model attempt a task, score it, and adjust the weights so high-scoring behavior becomes more likely — you need a reward signal. And the reward signal is, in practice, the limiting reagent for everything else. If a human has to judge each output, you can't scale. If another model judges, that judge becomes a target you can game — the policy learns to please the judge instead of actually solving the task. So the holy grail is what people call a verifiable reward. Pick tasks where success is mechanically checkable, by code, not by judgment. The classic example is math: did the answer match ground truth? Yes or no. But MiniMax is operating in a much messier regime — software engineering, app development, deep web search. So they had to invent verifiability domain by domain.

7:50Finn: Let's do the software engineering one, because it's the cleanest. Cassidy, walk through what they actually built.

7:57Cassidy: Sure. They scrape GitHub pull requests at scale. For each pull request, they reconstruct the state of the repository before the fix, package it in a Docker container, and pull in the original tests. Two kinds of tests: tests that were failing before the fix, called fail-to-pass, and tests that were passing before and should still pass after, called pass-to-pass. Now you have a fully verifiable task. You hand the broken repo to the model, the model proposes a fix, you run all the tests, and the reward is exactly: did the fail-to-pass tests start passing while the pass-to-pass tests didn't break? That's not a model judging a model. That's the code judging the code. The signal is unambiguous.
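
The reward rule Cassidy describes can be written down in a few lines. This is a sketch under assumptions: `swe_reward` and the shape of `results_after_patch` (a test name mapped to a pass/fail boolean) are hypothetical, and a test that is missing from the results is treated as a failure.

```python
def swe_reward(results_after_patch, fail_to_pass, pass_to_pass):
    """Binary reward from executed tests (hypothetical helper).

    results_after_patch: dict of test name -> True if the test passed after the
    model's patch was applied. Tests absent from the dict count as failures.
    """
    fixed = all(results_after_patch.get(t, False) for t in fail_to_pass)
    unbroken = all(results_after_patch.get(t, False) for t in pass_to_pass)
    return 1.0 if fixed and unbroken else 0.0

# Made-up example: one previously failing test and two regression guards.
f2p = ["test_bug_1234"]
p2p = ["test_login", "test_logout"]

good_patch = {"test_bug_1234": True, "test_login": True, "test_logout": True}
regressing_patch = {"test_bug_1234": True, "test_login": False, "test_logout": True}

reward_good = swe_reward(good_patch, f2p, p2p)
reward_regress = swe_reward(regressing_patch, f2p, p2p)
```

A patch that fixes the bug but breaks an existing test scores zero, which is the point of keeping the pass-to-pass set in the check.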

8:42Finn: And for app development — "build me a working frontend" — there are no unit tests waiting for it. No obvious verifiable check.

8:50Cassidy: Right, and this is where they get creative. They call it agent-as-a-verifier. After the model generates an app, a separate verifier agent — different model, different scaffold — actually deploys the app in a sandbox. It uses Playwright to click buttons, fill forms, navigate pages, and check that the thing renders and works. It also makes layout judgments. So the verification isn't a model reading source code and guessing; it's a model interacting with the deployed product the way a user would. For deep web search, similarly: they don't just check whether the model produced the right answer. They check whether the answer is grounded in documents the agent actually retrieved during the session. If you can recite the right answer from pretraining memory without having looked anything up, you get no credit.
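
For the deep-search case, the "no credit for reciting from memory" rule amounts to requiring two conditions at once. The sketch below assumes a crude substring check as the grounding test; the episode does not say how the paper decides that an answer is grounded, so `grounded_reward` and its matching logic are illustrative stand-ins.

```python
def _norm(text):
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())

def grounded_reward(answer, gold, retrieved_docs):
    """Reward only answers that are both correct and supported by retrieved text.

    Hypothetical helper: substring matching stands in for a real grounding check.
    """
    correct = _norm(answer) == _norm(gold)
    grounded = any(_norm(gold) in _norm(doc) for doc in retrieved_docs)
    return 1.0 if correct and grounded else 0.0

# Made-up session data.
docs = ["The bridge opened to traffic in 1937 after four years of construction."]

with_evidence = grounded_reward("1937", "1937", docs)
from_memory = grounded_reward("1937", "1937", [])
```

The second call returns zero even though the answer is right, because nothing the agent retrieved contains it.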

9:43Finn: The unifying principle is what they call elevating reward quality. Reward quality is the bottleneck on RL. Build verifiable tasks grounded in real workspaces, and the model can actually learn something. Build sloppy reward functions, and the model learns to game them. And there's nice empirical evidence for this in the paper. They track the same backbone across three checkpoints — M2, M2.5, M2.7. The benchmarks where the model improves the most between checkpoints are exactly the benchmarks where they added new verifiable task families. Deep web search jumps thirty-four points. A multi-tool benchmark jumps twenty-seven. Their ML-engineering benchmark jumps twenty-six. The base model isn't changing. The reward pipeline is.

10:31Cassidy: That's a clean argument. The data pipeline, not the base, is doing the lifting. OK, so once you have these verifiable trajectories, you still need to actually train on them. Which is the second place the missing capability gets paid back. They built a custom RL system called Forge — and this is where the most concrete engineering shows up.

10:53Finn: And the framing they use here is pretty striking. They call it the impossible triangle. Three goals, all of which you want from an agent-RL system. Throughput — train fast, because RL eats compute. Training stability — don't have your runs blow up at hour forty. And agent flexibility — be able to handle wildly different scaffolds, from a simple one-shot tool call to a deep multi-agent system with sub-agents calling sub-agents. The claim is that every pair of these creates a specific engineering tension. You can be fast and stable if you constrain the agent shape. You can be flexible and stable if you give up throughput. And so on. Cassidy, do you buy the impossible-triangle framing?

11:39Cassidy: I buy it as a useful framing. I don't think they prove it's fundamental — the paper doesn't show there's no architecture that could dissolve these tensions. What they show is: in practice, when you try to build this thing, you keep running into these specific trade-offs. So Forge is a set of engineering moves against those trade-offs. The analogy they're implicitly using is a film studio. You have actors, who perform. You have a soundstage and crew, who manage everything around the actors — the environment. And you have an editing room, which assembles the final cut. Each operates somewhat independently. You can swap actors without rebuilding the soundstage. You can change crew without changing the editing software. Forge has the same three rooms. The actor is the model — it generates tokens. The soundstage is the agent harness and the environment — context management, tool execution, the loop that turns model output into tool calls and back into context. The editing room is the training engine — it takes the completed trajectories and computes the gradient updates. The key engineering move is that these three modules talk to each other only through a standardized interface. The model doesn't need to know what kind of agent it's running inside. The agent doesn't need to know how the training engine batches updates. So you can change one without changing the others.

13:03Finn: And that decoupling is what lets them handle what they call white-box and black-box agents in the same training loop. White-box means the framework can see inside the scaffold — it knows the structure, can reconstruct the trajectory cleanly. Black-box means it can't — the agent is a closed component that just makes completion requests. Most RL systems can only handle one or the other. Forge is structured so the boundary is at the model's generation interface, which means anything outside that is just "environment," and the same loop handles both.
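
One way to picture a boundary at the generation interface is a wrapper that records every completion request, whatever the agent does internally. This is an illustrative sketch and not Forge's API: `RecordingModel`, `collect`, `black_box_agent`, and the toy `generate` function are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Step:
    prompt: str
    completion: str

@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    reward: float = 0.0

class RecordingModel:
    """Wraps a generate function and logs each call at the generation boundary."""

    def __init__(self, generate: Callable[[str], str]):
        self._generate = generate
        self.log: List[Step] = []

    def complete(self, prompt: str) -> str:
        out = self._generate(prompt)
        self.log.append(Step(prompt, out))
        return out

def black_box_agent(model, task):
    """An agent whose internals the trainer never inspects."""
    plan = model.complete(f"plan: {task}")
    return model.complete(f"act on: {plan}")

def collect(agent, generate, task, score):
    """Run any agent against the wrapped model and return a trainable record."""
    model = RecordingModel(generate)
    result = agent(model, task)
    return Trajectory(steps=model.log, reward=score(result))

# Toy stand-ins for a policy and a verifier.
traj = collect(black_box_agent,
               lambda p: p.upper(),
               "fix bug",
               lambda r: 1.0 if "FIX BUG" in r else 0.0)
```

Because the trainer only sees what crosses `complete`, swapping in a different agent function changes nothing on the training side.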

13:37Cassidy: Two specific tricks in Forge are worth naming, because they're the most concrete payoffs of the decoupling. The first is windowed FIFO scheduling. Picture a busy food court line. Pure first-in-first-out is fair — you wait your turn — but if one customer has a complicated order, everyone behind them waits. Pure throughput-optimized is the opposite — you serve the fastest orders first — but then complicated orders wait forever, and you starve the longest tasks. The agent-RL equivalent: trajectories vary enormously in length. Some are seconds — a single API call. Some are hours — a long reasoning chain that builds and tests software. If you naively put them in a queue, the fast ones dominate every training batch and the long ones never get learned from. If you wait for the longest, your training is mostly idle. Windowed FIFO sits in the middle. You serve roughly in arrival order, but only within a moving window of the next several trajectories, so one slow trajectory doesn't block everything and one fast trajectory doesn't jump too far ahead. The training batch stays diverse without paying full waiting cost.
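
The food-court idea can be sketched with arrival sequence numbers. This is one plausible reading of windowed FIFO, not the paper's scheduler: the window is anchored at the oldest trajectory still pending, and `windowed_fifo_take`, `window`, and `batch_size` are hypothetical names.

```python
def windowed_fifo_take(pending, finished, window, batch_size):
    """Take finished trajectories for training, oldest first, within a window.

    pending:  arrival sequence numbers still waiting, oldest first (mutated).
    finished: set of sequence numbers whose rollouts have completed.
    Only arrivals in [oldest, oldest + window) are eligible, so a slow head
    does not block its neighbours, and fast late arrivals cannot jump far ahead.
    """
    batch = []
    if not pending:
        return batch
    limit = pending[0] + window
    for seq in pending:
        if seq >= limit or len(batch) == batch_size:
            break
        if seq in finished:
            batch.append(seq)
    for seq in batch:
        pending.remove(seq)
    return batch

# Made-up state: trajectory 0 is slow, 1, 2 and 5 have finished.
pending = [0, 1, 2, 3, 4, 5]
finished = {1, 2, 5}

first = windowed_fifo_take(pending, finished, window=4, batch_size=2)
second = windowed_fifo_take(pending, finished, window=4, batch_size=2)
finished.add(0)
third = windowed_fifo_take(pending, finished, window=4, batch_size=2)
fourth = windowed_fifo_take(pending, finished, window=4, batch_size=2)
```

In this run the slow trajectory 0 does not stop 1 and 2 from being used, while trajectory 5 has to wait until the window slides past 0.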

14:49Finn: The second trick is the prettiest piece of engineering in the paper. Prefix tree merging. Cassidy, you want to do this one?

14:57Cassidy: Yeah. In agent RL, you usually want to sample multiple trajectories from the same starting context. Same initial task, same scaffold setup, multiple attempts. Maybe sixteen attempts at the same bug fix. Now think about what those trajectories look like. They all share an enormous identical preamble — system prompt, tool definitions, initial repo state, task description — easily tens of thousands of tokens of shared context. Then they branch into different attempts. A naive system computes that shared preamble sixteen times. Once per sample. Which is hilariously wasteful, but easy to write. Prefix tree merging notices the structure. Imagine a professor with twenty student essays to grade, all responding to the same long prompt. The naive approach is to re-read the prompt before each essay. The smart approach is to read the prompt once, hold it, then read the twenty divergent responses. Prefix tree merging does the smart version. The shared preamble — the trunk — is computed exactly once. Then computation branches into the individual continuations, like a tree. And critically, this is not an approximation. The result is mathematically identical to computing each trajectory separately. The paper claims this can give up to a forty-times training speedup.

16:21Finn: "Up to."

16:22Cassidy: Right. The "up to" is doing work. Forty times is the best case, where trajectories share huge prefixes. In practice the speedup depends on how much your trajectories overlap, which varies by task. They don't report the actual distribution. But even at a fraction of forty, this is a meaningful efficiency win that pays for itself across an entire training run.
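
How much the shared trunk saves depends on the overlap, which a toy count makes visible. The sketch below counts token positions as a rough proxy for compute; it ignores that attention cost is not linear in length, and the 1,000-token preamble, 16 samples, and 50-token continuations are made-up numbers.

```python
def count_tokens_naive(trajectories):
    """Every trajectory is processed in full, shared preamble included."""
    return sum(len(t) for t in trajectories)

def count_tokens_merged(trajectories):
    """Insert trajectories into a prefix tree and count unique nodes.

    Each node is one token position that has to be computed once, so a prefix
    shared by many trajectories is counted a single time.
    """
    root = {}
    nodes = 0
    for traj in trajectories:
        node = root
        for tok in traj:
            if tok not in node:
                node[tok] = {}
                nodes += 1
            node = node[tok]
    return nodes

# Toy rollout group: 16 samples sharing a 1,000-token preamble,
# each followed by 50 tokens unique to that sample.
preamble = list(range(1000))
group = [preamble + [(i, j) for j in range(50)] for i in range(16)]

naive = count_tokens_naive(group)    # 16 * 1050
merged = count_tokens_merged(group)  # 1000 + 16 * 50
ratio = naive / merged
```

With these numbers the saving is a bit over 9x. Longer shared prefixes or more samples per prompt push it higher, and short prefixes pull it toward 1x, which is why the "up to" matters.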

16:45Finn: It's also a great example of an optimization that only becomes possible because of the decoupling. If your training engine doesn't know the structure of the trajectories — what's shared, what's divergent — you can't do this. The clean separation between agent and trainer is what lets the trainer reason about the trajectory graph.

17:07Cassidy: There's one more thing worth flagging in their RL recipe — their policy gradient objective, which they call CISPO. I don't want to get into the math, but there's one idea inside it that's worth landing. Standard policy gradient methods, the PPO family, use a clipping trick to keep updates stable. You can't change the probability of an action by more than some factor in either direction. CISPO breaks the symmetry. The model is allowed to aggressively down-weight actions that look bad in hindsight. But it's prevented from making overconfident upward bets on actions that look good. Think of a cautious investor. Willing to cut losses fast — sell a losing position aggressively. But unwilling to double down on a winner past a certain point, because today's winner is often tomorrow's bubble. When trajectories are long and outcomes noisy, you want fast retreat from bad behaviors without unstable lurches toward good ones. That's the asymmetry.
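
A toy contrast between a two-sided clip and a one-sided cap shows the asymmetry in numbers. This is not the paper's objective: it assumes a per-token importance ratio (new policy probability over old), a hypothetical `eps` of 0.2, and function names invented for the example.

```python
def symmetric_clip(ratio, eps=0.2):
    """PPO-style bound: the ratio is held inside [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(ratio, 1.0 + eps))

def one_sided_clip(ratio, eps_high=0.2):
    """Cap only the upward side; downward moves pass through untouched."""
    return min(ratio, 1.0 + eps_high)

# Made-up ratios: a sharp drop, two small moves, and a sharp rise.
ratios = [0.1, 0.9, 1.1, 3.0]

two_sided = [symmetric_clip(r) for r in ratios]
upward_only = [one_sided_clip(r) for r in ratios]
```

Both versions cap the ratio of 3.0 at 1.2. Only the symmetric version pulls the ratio of 0.1 back up to 0.8, while the one-sided version leaves it at 0.1.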

18:07Finn: And honestly, that's the entire interesting thing about CISPO for our purposes. The paper presents a lot of machinery, but the idea that lands in fifteen seconds of audio is: aggressive down-weighting allowed, aggressive up-weighting clipped. That's it. OK. Third leg. And this is the one I want to push on hardest, because it's the most likely to be over-claimed. The paper introduces what they call self-evolution. The story is: in the latest checkpoint, M2.7, the model itself becomes a participant in its own training pipeline. It triages failed training runs. It edits its own agent scaffold. It runs experiments and writes self-criticism between them. The team claims this absorbs thirty to fifty percent of the daily iteration workload from their RL research team. And they describe a hundred-round autonomous iteration cycle that produced a thirty percent performance gain on in-house evals. Cassidy, what's your read on this?

19:07Cassidy: I want to land the right level of excitement here, Finn, because there are two very different stories you could tell about it. The first story, which I think is wrong, is: this is a recursively self-improving system, the model is doing science on itself, we're entering a new regime. The second story, which I think is closer to right: this is a competent junior ML engineer being automated. The model is reading logs, spotting common failure modes, editing config files, kicking off the next run. The senior engineer's lab notebook, but cheap and tireless. It's not designing novel experiments. It's not proposing new algorithms. It's doing the boring, repetitive, debuggy part of the workflow that eats most of an engineer's day.

19:51Finn: And the paper's framing is honestly pretty careful here. They call it "an early step toward self-evolution," not self-evolution achieved. They don't claim it's doing research. They claim it's absorbing iteration workload. But I want to push on the evidence. The thirty-to-fifty-percent workload absorption number — where does that come from? The paper doesn't describe a methodology. It's a team self-report. Which doesn't mean it's wrong, but it's not the kind of number you should treat as a measurement.

20:23Cassidy: Agreed. The concrete demonstration in the paper is a benchmark called MLE Bench Lite. Twenty-two Kaggle-style machine learning competitions. The agent gets twenty-four hours of autonomous iteration on each. It maintains a memory file, writes self-criticism between runs, and tries to improve its solution. Their best run produced nine gold medals, five silver, and one bronze across the twenty-two competitions. Across three trials, they averaged about a two-thirds medal rate, which ties Gemini 3.1 Pro on this benchmark. So they're matching a frontier model at this specific task using their agentic scaffold. That's a real result. And it's the kind of result you can point to. But it's also one specific kind of task — well-scoped ML competitions with clear metrics. The leap from "can do well on Kaggle-style competitions autonomously" to "can absorb thirty percent of an RL research team's workload" is large.
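
As a quick arithmetic check on the best run, using only the medal counts quoted above:

```python
# Best-run medal counts and competition total, as quoted in the episode.
gold, silver, bronze = 9, 5, 1
competitions = 22

medal_rate = (gold + silver + bronze) / competitions
print(f"{medal_rate:.3f}")  # prints 0.682
```

Fifteen medals out of twenty-two is about 68 percent, in line with the roughly two-thirds average reported across the three trials.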

21:23Finn: And the deeper concern is — we don't actually know what the baseline for the hundred-round autonomous scaffold improvement was. A thirty percent gain over what? Measured how? On which evaluations? The paper describes the activity but doesn't pin down the counterfactual. So I'd take this section as: a plausible early demonstration, with one concrete result on a public-style benchmark, and a lot of internal claims that you should hold lightly until somebody outside MiniMax can replicate the workflow.

21:56Cassidy: That's fair. And honestly, Finn, the part of this story I find most interesting isn't the technical claim — it's what it implies about where research bottlenecks are going. If a model can absorb a meaningful fraction of an ML engineer's daily debugging work, the constraint on how fast you improve a model shifts. It's not just compute, not just data. It's also: how many experiments per day can your team run? And models are starting to participate in relaxing that constraint. It's a different shape of progress than "bigger model gets smarter." The model gets more useful by getting better at the thing it does every day, which now includes building itself.

22:38Finn: Right. And before we go too far down that path, let me put the rest of the steelman on the table, because the paper is overall strong but there are specific places where a careful reviewer would push back. The first is benchmarks. About twenty-five benchmarks reported in the paper. A lot of them are internal — names like NL2Repo, RISE, VIBE-Pro. And the benchmarks where the within-series gains are biggest tend to be the benchmarks MiniMax themselves designed. That's not by itself bad faith — internal benchmarks often probe capabilities the public ones miss. But it means the strongest evidence for the data-pipeline story comes from evaluations the same team built. External replication of the headline gains is going to be hard.

23:25Cassidy: And on the public benchmarks, the picture is more nuanced than the abstract suggests. M2.7 is competitive but not dominant. Gemini 3.1 Pro beats it on most reasoning and knowledge benchmarks: broad multi-subject reasoning, Humanity's Last Exam, graduate-level science Q&A. GPT 5.4 leads on a Terminal-Bench evaluation and several office benchmarks. The "matches frontier with ten billion activated" claim is true on a subset of agentic benchmarks, but the model is genuinely weaker on raw knowledge and reasoning.

23:57Finn: Which is actually consistent with the rest of the story. They optimized for agentic workloads. The data pipelines, the RL system, the verifiability — all of it is targeted at long-horizon tool-using tasks. Of course the model is going to do better on those than on closed-book trivia. But it does mean the abstract is selling the strong form of the claim. The honest framing is: frontier-tier on the agentic tasks they targeted, mid-tier on raw knowledge.

24:26Cassidy: And CISPO's contribution isn't isolated by an ablation either. CISPO is introduced alongside dozens of other changes, and there's no controlled comparison against vanilla PPO on the same trajectories. So the specific contribution of asymmetric clipping is asserted in this work, not demonstrated. None of these are showstoppers. The overall body of evidence in the paper is substantial. But the cumulative picture, if you're being careful, is: the architectural bet looks real, the data-pipeline argument is well-supported, the Forge engineering is concrete and impressive, and the self-evolution claim is the most exciting but the least rigorously demonstrated.

25:08Finn: So let me try to land the bigger picture. There's a tacit assumption that's been holding in the field for the last couple of years — that frontier agentic capability requires frontier-scale per-token compute. After this paper, that assumption is at least contested by an existence proof. Not yet falsified. But contested.

25:29Cassidy: And the recipe for contesting it is documented in unusual detail. Verifiable data pipelines grounded in executable environments. An RL system engineered specifically for long, variable agent trajectories. And a model starting to participate in its own development loop. The economic stakes of this are real. If you can deliver frontier-tier agent performance with roughly one-tenth the per-token compute, the set of use cases that become economical shifts. An agent doing eight hours of background work for you, today, is gated by cost in a way that constrains which workloads make sense. Sparser activation cracks that open.

26:08Finn: And there's a quieter implication I want to flag, which is about evaluation. The fact that this paper has to lean so heavily on internal benchmarks isn't a MiniMax problem specifically — it's a field problem. The public infrastructure for evaluating long-horizon agentic capabilities hasn't caught up with the models. We don't yet have shared, trusted benchmarks for "build me a working web app" or "do six hours of research." Until we do, every lab is going to be partly grading its own homework, and careful readers are going to have to triangulate.

26:43Cassidy: The line the paper keeps coming back to — and it's a good one — is that mini activations can unleash maximum real-world intelligence. The bet is that intelligence, in the workloads that matter commercially, doesn't live in raw per-token compute. It lives in the quality of the signal you train against, the infrastructure that lets you train against it, and the speed at which you can iterate on both. If that bet pays off, the next phase of LLM progress looks less like "pretrain a smarter base" and more like "teach an existing base to do real work end-to-end, with the loop closing tighter every generation."

27:23Finn: Whether or not MiniMax-M2 specifically holds up under external replication, that framing is going to influence how labs think about the design space. The impossible triangle of throughput, stability, and flexibility is a useful piece of vocabulary. Forge is the most detailed public description of an agent-native RL system. And the verifiable-reward gospel is going to spread well beyond this paper.

27:48Cassidy: Paper's linked in the show notes, along with some related reading if this is your kind of thing.

27:54Finn: And if you want the full transcript with the jargon links inline, plus the concept pages that connect this episode to the other agent-and-RL work we've covered, that's all on paperdive.ai.

28:06Cassidy: Thanks for listening to AI Papers: A Deep Dive.
