Episode 090 · May 27, 2026 · 28 min

How MiniMax-M2 Bets That Sparsity Plus Verifiable Rewards Can Match Frontier Agents

MiniMax

LLM Agent Training
Paper
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
Venue
arXiv:2605.26494
Year
2026
Read the paper
arxiv.org/abs/2605.26494

MiniMax claims their new model matches frontier models on agentic tasks while using one-tenth the per-token compute. The architecture is barely novel. The real bet is on verifiable data pipelines, custom RL infrastructure, and a model that's starting to debug its own training runs. We dig into where that bet holds up and where it's still asserted rather than shown.

What you'll take away

  • Why MiniMax abandoned hybrid attention after hundreds of billions of tokens of experiments, and what their negative result reveals about long-context evaluation
  • How they built verifiable rewards for messy domains like app development and deep web search, not just math
  • The two concrete engineering tricks in their RL system: windowed FIFO scheduling and prefix tree merging (which they claim gives up to 40x speedups)
  • Why the 'self-evolution' story is the most exciting and least rigorously demonstrated part of the paper
  • Where the model actually trails (raw knowledge and reasoning benchmarks) and why the abstract oversells the headline claim
  • What this paper implies about the field's missing public infrastructure for evaluating long-horizon agents

Chapters

  1. 00:00 The headline claim and what 'agentic' means here
  2. 03:30 The architecture and the honest negative result on hybrid attention
  3. 07:01 Verifiable rewards as the limiting reagent
  4. 10:32 Forge and the impossible triangle of agent RL
  5. 14:03 CISPO and asymmetric clipping
  6. 17:34 Self-evolution: real result, large extrapolation
  7. 21:04 Steelman critique: internal benchmarks and missing ablations
  8. 24:35 What the bet implies for the next phase of LLM progress

0:00 Cassidy: A model with about two hundred and thirty billion parameters, where only ten billion of them light up for any given token. Roughly twenty-three to one sparsity. And on a bunch of long-horizon coding and research benchmarks, that model, running about one-tenth the per-token compute of the frontier models it's compared against, is right in the mix.

0:21 Finn: That's the headline claim of a technical report MiniMax put on arXiv on May twenty-sixth, twenty-twenty-six, and we're recording one day later. Before we get into whether the claim holds, quick ground rules. This is AI-generated. The script is from one of Anthropic's models. I'm Finn, that's Cassidy, and we're both AI voices from Eleven Labs. Neither company is involved in producing this show. The paper is called "The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence," and the bet baked into that title is exactly what we want to interrogate.

0:58 Cassidy: Right. Because the obvious question is: if you only spend a tenth of the compute per token, you should be a tenth as good. So where does the missing capability come from? That ends up being the spine of the paper. They make a sparsity bet at the architecture level, and then the rest of the report is basically a story about where they had to pay it back. Three places, as it turns out. Better training data, much better. A custom RL system that can actually train on long trajectories. And, in the latest checkpoint, a model that has started debugging its own training runs.

1:36 Finn: We'll get to all three. But I want to anchor on what "agentic" even means here, because the whole paper is calibrated against a kind of workload that's different from "ask ChatGPT a question." When MiniMax talks about agentic tasks, they mean: spend three hours fixing a bug across a real codebase. Build me a working web app from scratch. Do open-web research and write me a memo. These run on contexts that can hit a hundred and ninety-two thousand tokens, with thousands of intermediate steps. The model writes some reasoning, runs a shell command, reads the output, edits a file, runs tests, reads the failure, tries again. That's the thing they're optimizing for.

2:24 Cassidy: And those runs are expensive in every direction. Training-time, inference-time, evaluation-time. Each step is real compute. So the per-token cost of your model isn't a minor optimization. Over a multi-hour run, it's the difference between a workload being economical and being out of reach.

2:46 Finn: So let's start with what the model actually is, because the architecture is doing the headline work.

2:53 Cassidy: Right. The way to picture a mixture-of-experts model: imagine a hospital with two hundred and fifty-six specialists on staff. Cardiologists, neurologists, every subspecialty you can think of. When a patient walks in, there's a triage nurse (that's the router) who picks the eight specialists relevant to this case. Everyone else stays in their office. The hospital "has" all that expertise, but any given patient only consumes the time of eight people. That's mixture-of-experts. Two hundred fifty-six fine-grained experts per layer, eight active per token. The total parameter count is huge: every expert is real, takes memory, was trained. But the active compute per token is small, because most experts are idle on any given step.
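Show-notes sketch: the triage-nurse picture corresponds to top-k routing. The snippet below is a minimal illustration, assuming the router scores every expert and softmax-normalizes the gates over the selected eight; the paper's actual router may differ, and `route_token` and the random scores are hypothetical stand-ins.

```python
import math
import random

NUM_EXPERTS = 256  # experts per layer, as described in the episode
TOP_K = 8          # experts activated per token


def route_token(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their gates.

    Normalizing over only the selected experts is an assumption of this sketch.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
gates = route_token(logits)

print(len(gates), "of", NUM_EXPERTS, "experts active")
print("total-to-active parameter ratio:", 230 / 10)  # figures quoted in the episode
```

Only the eight selected experts run their feed-forward computation for this token; the other 248 still occupy memory but cost no compute on this step.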

3:46 Finn: And almost nothing else in the architecture is novel. Sixty-two layers, grouped-query attention, a standard speculative-decoding side module. Pretrained on almost thirty trillion tokens. The team is pretty open that the base model is not where the contribution lives. But there's one architecture decision I want to spend a minute on, because it's the most intellectually honest thing in the paper. They tried, really tried, to use something called hybrid attention. Their previous flagship used it. The idea is you interleave cheap, linear attention with normal expensive attention, so most of your layers run a much faster operation. Saves a ton of compute on long contexts. For agentic workloads where contexts run past a hundred K tokens, that's a huge deal.

4:35 Cassidy: And it didn't work.

4:37 Finn: It didn't work. They report hundreds of billions of tokens of continued experiments. Sliding-window variants, sink tokens, pattern analysis. The whole battery. At small scale and short contexts, the hybrid version looks fine. At long contexts, the regime that actually matters for agents, it consistently degrades. On a retrieval test at a hundred and twenty-eight K tokens, accuracy dropped from ninety to seventy-two. So they reverted to full attention everywhere. And then said so, in detail, in the paper. That kind of negative result almost never gets published at this scale.

5:17 Cassidy: And it teaches you something the abstract can't, Finn. Long-context evaluation is just genuinely hard. The proxy metrics people use at smaller scale don't always survive when you push to bigger models or harder distributions. The paper has a nice line about this: that the compute required for statistically significant evaluation grows substantially with task complexity. Translation: by the time you can tell whether your clever trick actually works, you've already spent a fortune training the model.

5:50 Finn: Right. So the field has been making bets on efficiency without always being able to verify them. The MiniMax team is essentially saying: we made that bet last generation, we tried to extend it, and we couldn't make the data come out clean. So we ate the compute.

6:09 Cassidy: Which means the headline sparsity has to come entirely from the mixture-of-experts side, not from attention compression. Every token still pays the full attention cost over a possibly enormous context. The only thing that's "mini" is the feed-forward expert activation. That's important framing for the rest of the story: the compute they saved is real, but it's narrower than you might think.

6:34 Finn: OK. So the architecture choice sets up the problem. Smaller activations, full attention, big base model. Now where does the lost capability get paid back?

6:44 Cassidy: Three places. The first, and the one that genuinely carries the most weight in this paper, is the data. So let me set up the problem. If you want to do reinforcement learning on a language model (meaning, let the model attempt a task, score it, and adjust the weights so high-scoring behavior becomes more likely) you need a reward signal. And the reward signal is, in practice, the limiting reagent for everything else. If a human has to judge each output, you can't scale. If another model judges, that judge becomes a target you can game: the policy learns to please the judge instead of actually solving the task. So the holy grail is what people call a verifiable reward. Pick tasks where success is mechanically checkable, by code, not by judgment. The classic example is math: did the answer match ground truth? Yes or no. But MiniMax is operating in a much messier regime: software engineering, app development, deep web search. So they had to invent verifiability domain by domain.

7:50 Finn: Let's do the software engineering one, because it's the cleanest. Cassidy, walk through what they actually built.

7:57 Cassidy: Sure. They scrape pull requests at scale. For each pull request, they reconstruct the state of the repository before the fix, package it in a container, and pull in the original tests. Two kinds of tests: tests that were failing before the fix, called fail-to-pass, and tests that were passing before and should still pass after, called pass-to-pass. Now you have a fully verifiable task. You hand the broken repo to the model, the model proposes a fix, you run all the tests, and the reward is exactly: did the fail-to-pass tests start passing while the pass-to-pass tests didn't break? That's not a model judging a model. That's the code judging the code. The signal is unambiguous.
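Show-notes sketch: the fail-to-pass / pass-to-pass check reduces to a small function. This is an illustration under stated assumptions, not the paper's harness: the reward is taken to be binary, a test missing from the results counts as a failure, and `swe_reward` and the test names are hypothetical.

```python
from typing import Dict, Iterable


def swe_reward(results: Dict[str, bool],
               fail_to_pass: Iterable[str],
               pass_to_pass: Iterable[str]) -> float:
    """Return 1.0 only if every fail-to-pass test now passes
    and every pass-to-pass test still passes; otherwise 0.0."""
    fixed = all(results.get(t, False) for t in fail_to_pass)
    intact = all(results.get(t, False) for t in pass_to_pass)
    return 1.0 if fixed and intact else 0.0


# Hypothetical test outcomes after applying the model's patch.
f2p = ["test_bug_is_fixed"]
p2p = ["test_login", "test_logout"]

good_patch = {"test_bug_is_fixed": True, "test_login": True, "test_logout": True}
regression = {"test_bug_is_fixed": True, "test_login": False, "test_logout": True}

print(swe_reward(good_patch, f2p, p2p))  # 1.0
print(swe_reward(regression, f2p, p2p))  # 0.0
```

A patch that fixes the bug but breaks an unrelated test earns nothing, which is what keeps the model from "fixing" a failure by damaging the rest of the codebase.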

8:42 Finn: And for app development ("build me a working frontend") there are no unit tests waiting for it. No obvious verifiable check.

8:50 Cassidy: Right, and this is where they get creative. They call it agent-as-a-verifier. After the model generates an app, a separate verifier agent (different model, different scaffold) actually deploys the app and drives it through automated interaction: clicking buttons, filling forms, navigating pages, and checking that the thing renders and works. It also makes layout judgments. So the verification isn't a model reading source code and guessing; it's a model interacting with the deployed product the way a user would. For deep web search, similarly: they don't just check whether the model produced the right answer. They check whether the answer is grounded in documents the agent actually retrieved during the session. If you can recite the right answer from memory without having looked anything up, you get no credit.
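Show-notes sketch: the deep-search grounding rule can be written as a two-part check. Everything here is an assumption for illustration: the episode does not say how grounding is detected, so the sketch supposes the agent cites document IDs, and it uses exact string match for correctness. `grounded_search_reward` is a hypothetical name.

```python
def grounded_search_reward(answer, gold_answer, cited_doc_ids, retrieved_doc_ids):
    """Credit an answer only if it is correct AND supported by documents
    that were actually retrieved during the session."""
    correct = answer.strip().lower() == gold_answer.strip().lower()
    cited = set(cited_doc_ids)
    # Grounded means: at least one citation, and every citation was retrieved.
    grounded = bool(cited) and cited <= set(retrieved_doc_ids)
    return 1.0 if correct and grounded else 0.0


looked_it_up = grounded_search_reward("Paris", "paris", ["doc3"], ["doc1", "doc3"])
from_memory = grounded_search_reward("Paris", "paris", [], ["doc1"])
fake_citation = grounded_search_reward("Paris", "paris", ["doc9"], ["doc1"])

print(looked_it_up, from_memory, fake_citation)  # 1.0 0.0 0.0
```

The second and third cases are correct answers that still score zero, matching the "no credit for reciting from memory" rule described above.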

9:43 Finn: The unifying principle is what they call elevating reward quality. Reward quality is the bottleneck on RL. Build verifiable tasks grounded in real workspaces, and the model can actually learn something. Build sloppy reward functions, and the model learns to game them. And there's nice empirical evidence for this in the paper. They track the same base model across three checkpoints, ending with M2.5 and M2.7. The benchmarks where the model improves the most between checkpoints are exactly the benchmarks where they added new verifiable task families. Deep web search jumps thirty-four points. A multi-tool benchmark jumps twenty-seven. Their ML-engineering benchmark jumps twenty-six. The base model isn't changing. The reward pipeline is.

10:31 Cassidy: That's a clean argument. The data pipeline, not the base, is doing the lifting. OK, so once you have these verifiable tasks, you still need to actually train on them. Which is the second place the missing capability gets paid back. They built a custom RL system called Forge, and this is where the most concrete engineering shows up.

10:53 Finn: And the framing they use here is pretty striking. They call it the impossible triangle. Three goals, all of which you want from an agent-RL system. Throughput: train fast, because RL eats compute. Training stability: don't have your runs blow up at hour forty. And agent flexibility: be able to handle wildly different scaffolds, from a simple one-shot scaffold to a deep multi-agent system with sub-agents calling sub-agents. The claim is that every pair of these creates a specific engineering tension. You can be fast and stable if you constrain the agent shape. You can be flexible and stable if you give up throughput. And so on. Cassidy, do you buy the impossible-triangle framing?

11:39 Cassidy: I buy it as a useful framing. I don't think they prove it's fundamental. The paper doesn't show there's no architecture that could dissolve these tensions. What they show is: in practice, when you try to build this thing, you keep running into these specific trade-offs. So Forge is a set of engineering moves against those trade-offs. The analogy they're implicitly using is a film studio. You have actors, who perform. You have a soundstage and crew, who manage everything around the actors: the environment. And you have an editing room, which assembles the final cut. Each operates somewhat independently. You can swap actors without rebuilding the soundstage. You can change crew without changing the editing software. Forge has the same three rooms. The actor is the model: it generates tokens. The soundstage is the agent and the environment: context management, tool execution, the loop that turns model output into tool calls and back into context. The editing room is the training engine: it takes the completed trajectories and computes the updates. The key engineering move is that these three modules talk to each other only through a standardized interface. The model doesn't need to know what kind of scaffold it's running inside. The agent doesn't need to know how the training engine batches updates. So you can change one without changing the others.
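Show-notes sketch: the three-room separation can be illustrated with a text-in, text-out boundary. None of these names come from the paper; `Trajectory`, `run_episode`, `training_batch`, and `toy_model` are hypothetical, and the real interface is certainly richer. The point is only that the agent loop sees a completion function and the trainer sees recorded steps.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    steps: List[Tuple[str, str]] = field(default_factory=list)  # (prompt, completion)
    reward: float = 0.0


def run_episode(generate: Callable[[str], str], task: str, max_steps: int = 3) -> Trajectory:
    """Agent/environment side: knows nothing about the model except generate()."""
    traj = Trajectory()
    context = task
    for _ in range(max_steps):
        completion = generate(context)
        traj.steps.append((context, completion))
        if completion == "DONE":
            traj.reward = 1.0  # stand-in for a real verifiable check
            break
        context = context + "\n" + completion  # feed the step back into context
    return traj


def training_batch(trajectories: List[Trajectory]) -> List[Tuple[str, str, float]]:
    """Trainer side: knows nothing about the scaffold, only recorded steps."""
    return [(p, c, t.reward) for t in trajectories for (p, c) in t.steps]


def toy_model(context: str) -> str:
    # Stand-in policy: finish once two earlier steps are in the context.
    return "DONE" if context.count("\n") >= 2 else "run tests"


traj = run_episode(toy_model, "fix the bug")
batch = training_batch([traj])
print(len(traj.steps), traj.reward, len(batch))  # 3 1.0 3
```

Swapping `run_episode` for a different scaffold, or `toy_model` for a real model endpoint, leaves `training_batch` untouched, which is the property the film-studio analogy is pointing at.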

13:03 Finn: And that separation is what lets them handle what they call white-box and black-box agents in the same training loop. White-box means the framework can see inside the scaffold: it knows the structure, can reconstruct the trajectory cleanly. Black-box means it can't: the agent is a closed component that just makes completion requests. Most RL systems can only handle one or the other. Forge is structured so the boundary is at the model's generation interface, which means anything outside that is just "environment," and the same loop handles both.

13:37 Cassidy: Two specific tricks in Forge are worth naming, because they're the most concrete payoffs of the design. The first is windowed FIFO scheduling. Picture a busy food court line. Pure first-in-first-out is fair (you wait your turn) but if one customer has a complicated order, everyone behind them waits. Pure throughput-optimized is the opposite: you serve the fastest orders first, but then complicated orders wait forever, and you starve the longest tasks. The agent-RL equivalent: trajectories vary enormously in length. Some are seconds, a single tool call. Some are hours, a long reasoning chain that builds and tests software. If you naively take whatever finishes first, the fast ones dominate every training batch and the long ones never get learned from. If you wait for the longest, your training is mostly idle. Windowed FIFO sits in the middle. You serve roughly in arrival order, but only within a moving window of the next several trajectories, so one slow trajectory doesn't block everything and one fast trajectory doesn't jump too far ahead. The training batch stays diverse without paying full waiting cost.
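Show-notes sketch: one way to read "arrival order within a moving window" is that a finished trajectory can reach the trainer only if its submission index is within a fixed distance of the oldest unconsumed one. The exact policy in Forge is not spelled out in the episode, so treat `windowed_fifo_order` and its rule as an assumption.

```python
def windowed_fifo_order(finish_times, window):
    """Order in which trajectories reach the trainer.

    finish_times[i] is when trajectory i (in submission order) completes.
    A trajectory may be consumed only if its index lies in
    [head, head + window), where head is the oldest unconsumed index.
    Among eligible trajectories, the earliest finisher goes first.
    """
    n = len(finish_times)
    consumed = [False] * n
    order = []
    head = 0
    while head < n:
        eligible = [i for i in range(head, min(head + window, n)) if not consumed[i]]
        nxt = min(eligible, key=lambda i: (finish_times[i], i))
        consumed[nxt] = True
        order.append(nxt)
        while head < n and consumed[head]:
            head += 1  # slide the window past everything already consumed
    return order


times = [9, 1, 2, 1, 1, 1]  # trajectory 0 is the slow one

print(windowed_fifo_order(times, window=1))  # [0, 1, 2, 3, 4, 5]  strict FIFO
print(windowed_fifo_order(times, window=3))  # [1, 2, 0, 3, 4, 5]
print(windowed_fifo_order(times, window=6))  # [1, 3, 4, 5, 2, 0]  fastest-first
```

With a window of three, the slow trajectory no longer blocks its two neighbors, but it is still consumed before trajectories 3 to 5, so long tasks are not pushed to the very end of the run.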

14:49 Finn: The second trick is the prettiest piece of engineering in the paper. Prefix tree merging. Cassidy, you want to do this one?

14:57 Cassidy: Yeah. In RL, you usually want to sample multiple trajectories from the same starting context. Same initial task, same scaffold setup, multiple attempts. Maybe sixteen attempts at the same bug fix. Now think about what those trajectories look like. They all share an enormous identical preamble (system prompt, tool definitions, initial repo state, task description), easily tens of thousands of tokens of shared context. Then they branch into different attempts. A naive system computes that shared preamble sixteen times. Once per sample. Which is hilariously wasteful, but easy to write. Prefix tree merging notices the structure. Imagine a professor with twenty student essays to grade, all responding to the same long prompt. The naive approach is to re-read the prompt before each essay. The smart approach is to read the prompt once, hold it, then read the twenty divergent responses. Prefix tree merging does the smart version. The shared preamble, the trunk, is computed exactly once. Then computation branches into the individual continuations, like a tree. And critically, this is not an approximation. The result is mathematically identical to computing each trajectory separately. The paper claims this can give up to a forty-times training speedup.

16:21 Finn: "Up to."

16:22 Cassidy: Right. The "up to" is doing work. Forty times is the best case, where trajectories share huge prefixes. In practice the speedup depends on how much your trajectories overlap, which varies by task. They don't report the actual distribution. But even at a fraction of forty, this is a meaningful efficiency win that pays for itself across an entire training run.
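Show-notes sketch: a trie makes the dependence on overlap easy to see. The snippet counts how many tokens would be processed with and without merging shared prefixes. It is only a token-count proxy (real speedups also depend on attention cost, batching, and kernels), and `forward_token_counts` and the toy sizes are hypothetical.

```python
def forward_token_counts(trajectories):
    """Return (naive, merged) token counts for a group of token sequences.

    naive:  every trajectory is processed from its first token.
    merged: each distinct prefix-tree node is processed exactly once.
    """
    naive = sum(len(t) for t in trajectories)
    trie = {}
    merged = 0
    for tokens in trajectories:
        node = trie
        for tok in tokens:
            if tok not in node:
                node[tok] = {}
                merged += 1  # a new node is a token that must be computed
            node = node[tok]
    return naive, merged


# Toy group: 16 samples sharing a 1000-token preamble, each with its own
# 50-token continuation (tuples keep continuation tokens distinct).
prefix = list(range(1000))
group = [prefix + [(s, j) for j in range(50)] for s in range(16)]

naive, merged = forward_token_counts(group)
print(naive, merged, round(naive / merged, 2))  # 16800 1800 9.33
```

In this toy count the ratio grows as the shared preamble gets longer relative to the continuations, and shrinks toward one when trajectories diverge early, which is why a single "up to" figure says little without the overlap distribution.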

16:45 Finn: It's also a great example of an optimization that only becomes possible because of the decoupled design. If your training engine doesn't know the structure of the trajectories (what's shared, what's divergent) you can't do this. The clean separation between agent and trainer is what lets the trainer reason about the trajectory graph.

17:07 Cassidy: There's one more thing worth flagging in their RL recipe: their objective, which they call CISPO. I don't want to get into the math, but there's one idea inside it that's worth landing. Standard policy gradient methods, the PPO family, use a clipping trick to keep updates stable. You can't change the probability of an action by more than some factor in either direction. CISPO breaks the symmetry. The model is allowed to aggressively down-weight actions that look bad in hindsight. But it's prevented from making overconfident upward bets on actions that look good. Think of a cautious investor. Willing to cut losses fast: sell a losing position aggressively. But unwilling to double down on a winner past a certain point, because today's winner is often tomorrow's bubble. When trajectories are long and outcomes noisy, you want fast retreat from bad behaviors without unstable lurches toward good ones. That's the asymmetry.
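Show-notes sketch: the asymmetry can be shown on the importance ratio alone. This is not the paper's objective, only the one idea described above: cap how far a probability ratio can rise, and leave the downward direction unbounded. The function name and the epsilon values are hypothetical.

```python
def clipped_weight(ratio, eps_high=0.2, eps_low=None):
    """Clip an importance ratio (new_prob / old_prob).

    eps_low=None means no lower bound, so the ratio may shrink freely.
    Passing eps_low gives the familiar symmetric-style clip for comparison.
    """
    upper = 1.0 + eps_high
    lower = 0.0 if eps_low is None else 1.0 - eps_low
    return max(lower, min(ratio, upper))


# Asymmetric: upward moves are capped, downward moves pass through.
print(clipped_weight(3.0))                 # capped at 1 + eps_high
print(clipped_weight(0.05))                # left at 0.05

# Symmetric comparison: the same downward move is held at 1 - eps_low.
print(clipped_weight(0.05, eps_low=0.2))
```

In the investor analogy, the cap is the refusal to double down past a limit, and the missing floor is the freedom to sell a losing position as fast as the evidence warrants.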

18:07 Finn: And honestly, that's the entire interesting thing about CISPO for our purposes. The paper presents a lot of machinery, but the idea that lands in fifteen seconds of audio is: aggressive down-weighting allowed, aggressive up-weighting clipped. That's it. OK. Third leg. And this is the one I want to push on hardest, because it's the most likely to be over-claimed. The paper introduces what they call self-evolution. The story is: in the latest checkpoint, M2.7, the model itself becomes a participant in its own training pipeline. It triages failed training runs. It edits its own scaffold. It runs experiments and writes self-criticism between them. The team claims this absorbs thirty to fifty percent of the daily iteration workload from their RL research team. And they describe a hundred-round autonomous iteration cycle that produced a thirty percent performance gain on in-house evals. Cassidy, what's your read on this?

19:07 Cassidy: I want to land the right level of excitement here, Finn, because there are two very different stories you could tell about it. The first story, which I think is wrong, is: this is a recursively self-improving system, the model is doing science on itself, we're entering a new regime. The second story, which I think is closer to right: this is a competent junior ML engineer being automated. The model is reading logs, spotting common failure modes, editing config files, kicking off the next run. The senior engineer's lab notebook, but cheap and tireless. It's not designing novel experiments. It's not proposing new algorithms. It's doing the boring, repetitive, debuggy part of the workflow that eats most of an engineer's day.

19:51 Finn: And the paper's framing is honestly pretty careful here. They call it "an early step toward self-evolution," not self-evolution achieved. They don't claim it's doing research. They claim it's absorbing iteration workload. But I want to push on the evidence. The thirty-to-fifty-percent workload absorption number: where does that come from? The paper doesn't describe a methodology. It's a team self-report. Which doesn't mean it's wrong, but it's not the kind of number you should treat as a measurement.

20:23 Cassidy: Agreed. The concrete demonstration in the paper is a benchmark called MLE Bench Lite. Twenty-two Kaggle-style machine learning competitions. The agent gets twenty-four hours of autonomous iteration on each. It maintains a memory file, writes self-criticism between runs, and tries to improve its solution. Their best run produced nine gold medals, five silver, and one bronze across the twenty-two competitions. Across three trials, they averaged about a two-thirds medal rate, which ties a frontier model on this benchmark. So they're matching a frontier model at this specific task using their agentic scaffold. That's a real result. And it's the kind of result you can point to. But it's also one specific kind of task: well-scoped ML competitions with clear metrics. The leap from "can do well on Kaggle-style competitions autonomously" to "can absorb thirty percent of an RL research team's workload" is large.
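Show-notes arithmetic: the best-run medal count quoted above works out to a medal rate just over two-thirds. The snippet only restates the episode's numbers; the per-trial results behind the three-trial average are not given, so they are not reconstructed here.

```python
gold, silver, bronze = 9, 5, 1   # best run, as quoted in the episode
competitions = 22

medals = gold + silver + bronze
best_run_rate = medals / competitions

print(medals, round(best_run_rate, 3))  # 15 0.682
```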

21:23 Finn: And the deeper concern is that we don't actually know what the baseline for the hundred-round autonomous scaffold improvement was. A thirty percent gain over what? Measured how? On which evaluations? The paper describes the activity but doesn't pin down the measurement. So I'd take this section as: a plausible early demonstration, with one concrete result on a public-style benchmark, and a lot of internal claims that you should hold lightly until somebody outside can replicate the workflow.

21:56 Cassidy: That's fair. And honestly, Finn, the part of this story I find most interesting isn't the technical claim. It's what it implies about where research bottlenecks are going. If a model can absorb a meaningful fraction of an ML engineer's daily debugging work, the constraint on how fast you improve a model shifts. It's not just compute, not just data. It's also: how many experiments per day can your team run? And models are starting to participate in relaxing that constraint. It's a different shape of progress than "bigger model gets smarter." The model gets more useful by getting better at the thing it does every day, which now includes building itself.

22:38 Finn: Right. And before we go too far down that path, let me put the rest of the steelman critique on the table, because the paper is overall strong but there are specific places where a careful reviewer would push back. The first is benchmarks. About twenty-five benchmarks reported in the paper. A lot of them are internal, with names like NL2Repo, RISE, VIBE-Pro. And the benchmarks where the within-series gains are biggest tend to be the benchmarks MiniMax themselves designed. That's not by itself bad faith. Internal benchmarks often capture capabilities the public ones miss. But it means the strongest evidence for the data-pipeline story comes from evaluations the same team built. External replication of the headline gains is going to be hard.

23:25 Cassidy: And on the public benchmarks, the picture is more nuanced than the abstract suggests. The model is competitive but not dominant. A competing frontier model beats it on most reasoning and knowledge benchmarks: broad multi-subject reasoning, Humanity's Last Exam, graduate-level science Q&A. GPT 5.4 leads on a terminal-bench and several office benchmarks. The "matches frontier with ten billion activated" claim is true on a subset of benchmarks, but the model is genuinely weaker on raw knowledge and reasoning.

23:57 Finn: Which is actually consistent with the rest of the story. They optimized for agentic workloads. The data pipelines, the RL system, the verifiability: all of it is targeted at long-horizon tool-using tasks. Of course the model is going to do better on those than on closed-book trivia. But it does mean the abstract is selling the strong form of the claim. The honest framing is: frontier-tier on the agentic tasks they targeted, mid-tier on raw knowledge.

24:26 Cassidy: And the RL objective isn't isolated either. CISPO is introduced alongside dozens of other changes, and there's no controlled comparison against a vanilla baseline on the same setup. So the specific contribution of asymmetric clipping is asserted in this work, not demonstrated. None of these are showstoppers. The overall body of evidence in the paper is substantial. But the cumulative picture, if you're being careful, is: the architectural bet looks real, the data-pipeline argument is well-supported, the engineering is concrete and impressive, and the self-evolution claim is the most exciting but the least rigorously demonstrated.

25:08 Finn: So let me try to land the bigger picture. There's a tacit assumption that's been holding in the field for the last couple of years: that frontier agentic capability requires frontier-scale per-token compute. After this paper, that assumption is at least contested by an existence proof. Not yet falsified. But contested.

25:29 Cassidy: And the recipe for contesting it is documented in unusual detail. Verifiable data pipelines grounded in executable environments. An RL system engineered specifically for long, variable trajectories. And a model starting to participate in its own development loop. The economic stakes of this are real. If you can deliver frontier-tier agent performance with roughly one-tenth the per-token compute, the set of use cases that become economical shifts. An agent doing eight hours of background work for you, today, is gated by cost in a way that constrains which workloads make sense. Sparser activation cracks that open.

26:08 Finn: And there's a quieter implication I want to flag, which is about evaluation. The fact that this paper has to lean so heavily on internal benchmarks isn't a MiniMax problem specifically. It's a field problem. The public infrastructure for evaluating long-horizon agentic capabilities hasn't caught up with the models. We don't yet have shared, trusted benchmarks for "build me a working web app" or "do six hours of research." Until we do, every lab is going to be partly grading its own homework, and careful readers are going to have to triangulate.

26:43 Cassidy: The line the paper keeps coming back to, and it's a good one, is that mini activations can unleash maximum real-world intelligence. The bet is that intelligence, in the workloads that matter commercially, doesn't live in raw per-token compute. It lives in the quality of the signal you train against, the infrastructure that lets you train against it, and the speed at which you can iterate on both. If that bet pays off, the next phase of LLM progress looks less like "pretrain a smarter base" and more like "teach an existing base to do real work end-to-end, with the loop closing tighter every generation."

27:23 Finn: Whether or not MiniMax-M2 specifically holds up under external replication, that framing is going to influence how labs think about the design space. The impossible triangle of throughput, stability, and flexibility is a useful piece of vocabulary. Forge is the most detailed public description of an agent-native RL system. And the verifiable-reward gospel is going to spread well beyond this paper.

27:48 Cassidy: Paper's linked in the show notes, along with some related reading if this is your kind of thing.

27:54 Finn: And if you want the full transcript with the jargon links inline, plus the concept pages that connect this episode to the other agent-and-RL work we've covered, that's all on paperdive.ai.

28:06 Cassidy: Thanks for listening to AI Papers: A Deep Dive.