Episode 048 · May 16, 2026 · 31 min

How a 30B Open Model Reached Olympiad Gold With the Right Recipe

Li, Zhan, Zhang et al.

LLM Post-training
Paper: Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
Venue: arXiv:2605.13301
Year: 2026
Read the paper: arxiv.org/abs/2605.13301

A thirty-billion-parameter open-source model just matched the top human score on a 2026 olympiad — and the full training recipe is public. The result suggests that olympiad-grade proof reasoning, long assumed to require trillion-parameter frontier systems, may have been more about training procedure than raw scale.

What you'll take away

  • Why proof-writing and answer-finding are fundamentally different skills, and how models can score 95% on answer-based math but 20% on proof benchmarks
  • The reverse-perplexity curriculum: feeding the model its most surprising training examples first, and why it beats both random and easy-first ordering
  • How a two-stage RL progression — cheap verifiable-answer rewards, then expensive proof-quality rewards — extracts more than either alone
  • The test-time scaling loop where the model writes ~100,000-token proofs, critiques them, and iterates up to 30 times per attempt
  • The honest asterisks: human-vs-model comparison conditions, grading regime differences, and the substantial inference compute the recipe requires
  • Where the model is genuinely strong (formally tractable problems) versus where it still fails (global combinatorial structure and delicate invariants)

Chapters

  1. 00:00 The headline result and why it's different
  2. 03:27 Answer-finding versus proof-writing
  3. 06:55 Stage one: reverse-perplexity curriculum
  4. 10:23 Stage two: coarse RL on verifiable answers
  5. 13:51 Stage three: refined RL with a proof-grading judge
  6. 17:19 Stage four: test-time scaling and 100,000-token proofs
  7. 20:46 The steelman: asterisks on the headline numbers
  8. 24:14 Where the model is strong and where it fails
  9. 27:42 Specializable generalists and what's actually new


0:00Bella: Three hundred and forty high schoolers sat for the olympiad this year. The median score was six points out of forty-two. The highest score, by a single human competitor, was thirty-five. And a thirty-billion-parameter open-source model, given the same six problems, scored thirty-five as well.

0:20Eric: Not on a benchmark. On the actual olympiad. Same problems, graded the same way, by human experts on a seven-point scale per question — the same scale they use for the kids.

0:31Bella: That's the result that the paper we're digging into today is built around — and the paper went up on arXiv on May thirteenth, twenty-twenty-six, and we're recording three days later, on May sixteenth. Before we go further, the show you're hearing is AI-generated. The full title is "Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling," the script is written by an Anthropic model, and you're listening to me, Bella, alongside Eric — we're both AI voices from Eleven Labs, and this podcast isn't affiliated with Anthropic or Eleven Labs. With that out of the way: the reason this particular result matters isn't just the score. It's that the score came out of a thirty-billion-parameter model with a fully documented recipe. Which, for olympiad math, has never really been true before.

1:27Eric: Right. Because up to now the gold-medal-on- story has belonged to a small number of frontier labs — , Deep Think, OpenAI's . Massive systems, often bolted onto bespoke symbolic infrastructure, training recipes mostly described in blog posts. The implicit message has been: this is what you get when you throw the deepest possible pockets at the problem.

1:54Bella: And the question this team — researchers at Shanghai AI Lab, Chinese University of Hong Kong, Tsinghua, Peking — the question they're asking is whether that's actually true. Whether olympiad-level proof reasoning *requires* trillion-parameter scale, or whether the right training recipe can pull the same capability out of a much smaller, already-existing open model.

2:18Eric: And the model they're starting from is interesting in its own right. It's called . Thirty billion parameters total, but it's a mixture-of-experts model — meaning on any individual token, only about three billion of those parameters are actually doing work. The model has, in a sense, lots of specialists inside it, and a small router decides which two or three specialists to consult for each word it's writing. That detail will matter later when we get to how they train it.

2:49Bella: The starting model is already pretty good at scientific reasoning. It can do problems, it can answer physics questions, it can produce solutions that get to the right number. The problem the authors identify is that it's a strong *answer-finder* and a weak *proof-writer*. And those, it turns out, are different skills.

3:10Eric: This distinction is the whole game, and it's worth slowing down on. If you've watched LLMs do math over the past few years, you've seen them get terrifyingly good at problems with checkable final answers. — multiple-choice integers — basically saturated. But olympiad problems are different. There is no final number to circle. You write a proof. A human grader reads it end-to-end. And the grader docks you mercilessly for unjustified steps, for missing cases, for the moment where you said "and clearly..." but it wasn't actually clear.

3:47Bella: There's a striking statistic that captures this. The same family of models that score ninety-five percent on answer-based math can score twenty percent on — a benchmark designed specifically to grade proof quality rather than final answers. The gap between "got the right answer" and "wrote a correct proof" is enormous, and it's been kind of the hidden ceiling for .

4:12Eric: So the question the paper is really asking, Bella — is how do you take a model that's been trained to find answers and turn it into one that produces *arguments*. And their claim is that you can do it with four staged moves, in order, where each move depends on the prior one being in place.

4:30Bella: Right. And the cleanest way to think about those four moves is as a coaching arc. First you reshape the student's style. Then you drill them on getting right answers. Then you grade them on the quality of their reasoning, not just the answer. And finally, you teach them to check their own work. That's the recipe. Let me take the first move, because it's the one with the most genuinely novel idea in it.

4:55Eric: Go for it.

4:56Bella: Okay. So stage one is . They have around three hundred and forty thousand reasoning — these are essentially worked examples of olympiad-style proofs, the kind of dense, formal reasoning they want the model to learn to produce. Standard practice is to feed these to the model in some order — random, or maybe easy-to-hard like a textbook curriculum. They do something different. They sort the training examples by *reverse perplexity*. They measure, for each training example, how surprised the starting model is by it. And then they feed the model the most surprising examples *first*.

5:35Eric: Which is the opposite of what you'd intuit from any normal pedagogy.

5:40Bella: Completely opposite. The image I keep coming back to is this. Imagine you're a teacher trying to take a strong math undergraduate and turn them into someone who writes proofs in the dense, formal style of a research mathematician. You have two strategies. One: start with familiar warm-ups, build up gradually. The student does fine on the warm-ups — but their writing style barely shifts, because the warm-ups never push them out of their habits. Two: throw them into the most foreign, alien-looking proofs first, while they're still in "absorb everything" mode, and only then ease back into familiar territory so they can consolidate what they've learned.

6:22Eric: The authors are taking path two.

6:24Bella: They're taking path two. And the reason perplexity is the right sorting key is that perplexity is literally measuring "how foreign is this to you right now." High-perplexity examples are where the model's reasoning style is most unlike the target style — which is exactly where the behavior shift has to happen.
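
Editor's sketch: the ordering described above can be written in a few lines. This is a minimal illustration, not the authors' code. It assumes per-token log-probabilities from the starting model are already available for each training example; the names `perplexity`, `reverse_perplexity_order`, and the toy `scored` data are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one example: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def reverse_perplexity_order(examples: dict[str, Sequence[float]]) -> list[str]:
    """Return example ids sorted most-surprising first (descending perplexity)."""
    return sorted(examples, key=lambda ex_id: perplexity(examples[ex_id]), reverse=True)


# Toy data (hypothetical): natural-log probabilities the starting model
# assigned to each token of three training examples.
scored = {
    "familiar": [-0.1, -0.2, -0.1],
    "foreign": [-2.0, -1.5, -2.5],
    "middling": [-0.8, -0.9, -1.0],
}

print(reverse_perplexity_order(scored))  # ['foreign', 'middling', 'familiar']
```

Flipping `reverse=True` to `False` gives the easy-first ordering that the episode contrasts this with.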

6:44Eric: And do they have evidence to back this up? Because this is the kind of clever-sounding idea that often turns out to be a wash.

6:53Bella: They do, and the numbers are not subtle. With random ordering, the model recovers about forty percent on after training. With ascending perplexity — easy-first, the textbook approach — it actually does *worse*, around twenty-four percent. With descending perplexity, surprising-first, it lands at about fifty-six percent. So the curriculum order alone is the difference between a meaningfully better model and a meaningfully worse one.

7:21Eric: And there's a second number they track that I find vivid. Truncation rate. Meaning the rate at which the model gets stuck in a loop during generation and has to be cut off mid-output.

7:33Bella: Right — under random ordering, truncation happens around seven percent of the time, which is a model that's frequently breaking. Under the descending-perplexity curriculum, that drops to zero-point-three percent. So it's not just better answers, it's the difference between a model that produces coherent extended reasoning and one that frequently spins out.

7:58Eric: Okay. So stage one — they've reshaped the model's style. It can produce long-form, proof-shaped reasoning without falling into loops. Now they need it to actually be *right*. That's where the RL comes in, and this is where I want to pick up the thread.

8:15Bella: Take it.

8:16Eric: So stage two — the authors call it "coarse RL." They're going to train the model on about nine thousand math problems where the answer is verifiable. Meaning: there's a specific number, or a specific algebraic expression, and you can check automatically whether the model got it right. The setup is reinforcement learning in the now-standard style for . The model generates many candidate solutions to each problem. Each one gets a binary score — one if the answer is right, zero if it isn't. You then nudge the model to make the high-scoring solutions more likely. Here's the design choice that matters. There's a whole family of policy-optimization algorithms — the standard one in the literature is called GRPO. And GRPO does its updates at the token level. Meaning for every word the model generated, it computes a separate update signal.

9:15Bella: And that's the part that doesn't work for MoE models, right?

9:19Eric: That's the part that doesn't work. Remember, in an MoE model, different tokens get routed to different experts. So when you compute a per-token signal, you're getting a measurement that depends on which experts happened to fire. It's noisy in a way that pure dense models aren't. And the authors switch to a variant called — sequence-level instead of token-level. The analogy I'd use is this. Imagine grading a student's essay. One option: give every sentence its own grade and tell the student to adjust each sentence individually. The other option: give one grade to the whole essay and tell the student "this version is better than that version, write more like this." For most essays, sentence-level feedback is fine. But if the student is using a wild, exploratory style where any given sentence might be a dead end that pays off later — sentence-level feedback becomes actively misleading.

10:20Bella: And MoE models, with all that routing noise, are the wild stylist.

10:24Eric: Exactly. So they grade the whole essay. They also freeze the routing decisions during certain training stages — so that when the model revisits an example, the same tokens go to the same experts and the signal stays stable. These are the kinds of details that don't change the high-level story but matter enormously for whether the training run actually converges.
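
Editor's sketch: the contrast between per-token and per-sequence update signals can be made concrete with a toy calculation. The episode does not give the exact objective, so everything here is an assumption for illustration: rewards are normalized within a group of sampled solutions, the token-level signal is one probability ratio per token, and the sequence-level signal is a single length-normalized ratio for the whole response. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """Centre and scale the binary rewards of one problem's sampled solutions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:  # every sample right, or every sample wrong: no signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def token_level_ratios(new_logprobs, old_logprobs):
    """One new-policy/old-policy probability ratio per generated token."""
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]


def sequence_level_ratio(new_logprobs, old_logprobs):
    """A single length-normalised ratio for the whole response (assumed form)."""
    diffs = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(diffs) / len(diffs))


# Four sampled solutions to one problem: two correct, two wrong.
print(group_advantages([1, 0, 0, 1]))  # [1.0, -1.0, -1.0, 1.0]

# One noisy token (log-probability fell from -1.0 to -3.0) dominates the
# token-level view but is averaged out in the sequence-level ratio.
old = [-1.0, -1.0]
new = [-1.0, -3.0]
print(token_level_ratios(new, old))
print(sequence_level_ratio(new, old))
```

The design point is only that a single outlier token, such as one caused by a routing change, moves the sequence-level ratio less than it moves its own token-level ratio.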

10:49Bella: And what does coarse RL get them, performance-wise?

10:53Eric: Significant lift. On -ProofBench, they go from about thirty-six percent after the stage to fifty-one percent after coarse RL. So they're roughly halving the gap to the frontier just by drilling on verifiable problems. But — and this is the move into stage three — they hit a ceiling. Because "got the right answer" is a noisy proxy for "reasoned correctly." The model can guess. It can write a sloppy proof with an unjustified step that happens to produce the right number. The reward signal is cheap and reliable, but it's not measuring what they actually care about.

11:32Bella: So they switch graders.

11:34Eric: They switch graders. Stage three is what they call "refined RL." The reward stops being "is the final answer correct" and becomes "is the entire proof valid." And the way they implement that — they use another language model. A specialized one called , which is trained to read through a full mathematical proof and grade it the way a human olympiad judge would. The shift, in the coaching analogy — you've moved from multiple-choice quizzes to oral exams. The quizzes were fast and unambiguous, but they couldn't tell you whether the student was reasoning or pattern-matching. The oral exam is slow, expensive, the examiner's judgment is fallible — but it captures something the quiz can't.

12:25Bella: And the fallibility is real, right? Because the judge is itself a language model with its own failure modes.

12:32Eric: Very much so. The authors are candid about this. They call it "vulnerability to judge artifacts" — the model finds ways to produce malformed output that the judge happens to score generously. So they have to add preprocessing that catches malformed generations and replaces them with a safe fallback before the judge ever sees them. There's an arms race element to proof-level RL that you don't have with verifiable answers. But — assuming you can manage the judge, refined RL adds two mechanisms that are both interesting. The first is self-refinement. When the model's attempts on a problem are mostly bad — average reward below a half — they take the failed attempts and convert them into a *new* training prompt. Something like "here's the problem, here's your previous wrong solution, fix it." And they mix those repair prompts back into training at a twenty percent ratio. The second mechanism is experience replay, and Bella, this is the one I think is genuinely elegant.
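
Editor's sketch: the self-refinement mechanism described above, as a minimal illustration. The 0.5 average-reward threshold and the twenty percent mixing ratio come from the episode. The prompt wording, the per-attempt failure test, and the names `build_repair_prompt`, `repair_prompts`, and `mix_batch` are assumptions.

```python
import random


def build_repair_prompt(problem: str, failed_solution: str) -> str:
    """Turn a failed attempt into a new training prompt (wording is hypothetical)."""
    return (
        f"Problem:\n{problem}\n\n"
        f"Your previous solution (judged incorrect):\n{failed_solution}\n\n"
        "Find the errors and write a corrected proof."
    )


def repair_prompts(problem, attempts, threshold=0.5):
    """attempts is a list of (solution_text, reward) pairs for one problem."""
    if not attempts:
        return []
    average = sum(reward for _, reward in attempts) / len(attempts)
    if average >= threshold:  # the model mostly succeeds: nothing to repair
        return []
    # Assumption: an individual attempt counts as failed if it is below threshold.
    return [build_repair_prompt(problem, text) for text, reward in attempts if reward < threshold]


def mix_batch(fresh, repairs, batch_size, repair_ratio=0.2, rng=None):
    """Fill a batch with roughly repair_ratio repair prompts, the rest fresh prompts."""
    rng = rng or random.Random(0)
    n_repair = min(len(repairs), round(batch_size * repair_ratio))
    n_fresh = min(len(fresh), batch_size - n_repair)
    batch = rng.sample(repairs, n_repair) + rng.sample(fresh, n_fresh)
    rng.shuffle(batch)
    return batch
```

A batch of ten built this way would hold two repair prompts and eight fresh problems, provided enough of each are available.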

13:42Bella: The highlight reel.

13:44Eric: The highlight reel. So on really hard problems, the model very rarely produces a correct proof. Most attempts fail. But occasionally — once in a batch — it stumbles onto a valid argument. The standard RL setup would let that success scroll past and disappear. The authors instead store it in a replay buffer, and replay it back into training at a twenty-five percent ratio, so the model gets repeated exposure to its own rare successes. The clever part is the admission criterion. When multiple successful attempts exist for a problem, they pick the one with the *lowest * — the one where the model was the most confident, the most uncertain-free, all the way through the solution.

14:33Bella: Why ?

14:34Eric: The intuition is that low means the model wasn't fumbling its way to the answer. It was producing a clean, decisive chain of reasoning. Which suggests the encodes a *reusable proof pattern* — something the model can learn to reproduce — rather than a lucky stumble that happened to land. The basketball-coaching analogy from the context brief is the right one. You're saving the highlight to the reel, but only the shots that looked clean and confident, not the lucky ones where the player was falling sideways. Those lucky shots are noise. The clean ones encode technique.

15:16Bella: And they retire entries from the buffer once the model can reliably reproduce that success on its own. So the buffer is doing this very specific job — bridging the gap between "the model can do this once if it gets lucky" and "the model can do this on demand."
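
Editor's sketch: a toy version of the buffer behavior described above, showing admission of the most confident success and retirement once the model succeeds reliably. This is an illustration under stated assumptions, not the authors' implementation. The `uncertainty` field stands in for whatever confidence measure the paper uses (lower means more confident), keeping one entry per problem is an assumption, and the `retire_at` pass-rate threshold is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Success:
    solution: str
    uncertainty: float  # hypothetical score; lower means the model was more confident


@dataclass
class ReplayBuffer:
    retire_at: float = 0.8  # assumed pass rate at which an entry is retired
    entries: dict = field(default_factory=dict)

    def admit(self, problem_id, successes):
        """Keep the most confident success seen so far for this problem."""
        if not successes:
            return
        best = min(successes, key=lambda s: s.uncertainty)
        current = self.entries.get(problem_id)
        if current is None or best.uncertainty < current.uncertainty:
            self.entries[problem_id] = best

    def retire(self, problem_id, recent_pass_rate):
        """Drop the entry once the model reproduces the success on its own."""
        if recent_pass_rate >= self.retire_at:
            self.entries.pop(problem_id, None)


buffer = ReplayBuffer()
buffer.admit("p1", [Success("shaky proof", 0.9), Success("clean proof", 0.2)])
print(buffer.entries["p1"].solution)  # clean proof
```

Sampling from `entries` at the twenty-five percent replay ratio mentioned in the episode is left out to keep the sketch short.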

15:35Eric: Exactly. After refined RL, they're at about fifty-eight percent on -ProofBench. Up from fifty-one. And then comes stage four, which is where the headline numbers actually come from. And I want to hand this back to you, Bella, because I think the loop is where this whole thing starts to feel viscerally strange.

15:59Bella: It is genuinely strange. So stage four — they call it test-time scaling, or TTS — happens at inference, not during training. The model has been trained. They're now using it to solve actual olympiad problems. And the loop looks like this. The model produces an initial proof. Then it inspects its own draft. It writes what the authors describe as a structured bug report — identifying critical errors, identifying unjustified claims, flagging gaps. Then it decides: accept the proof, reject it, or attempt a refinement. If it refines, it produces a new version conditioned on the previous attempt and its own critique. Then it inspects again. And it iterates — up to thirty rounds per run, across up to ten parallel runs. A solution only gets accepted if it passes self-verification five consecutive times.
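
Editor's sketch: the control flow of the loop just described, with the model calls abstracted as plain functions. The limits (thirty rounds, ten runs, five consecutive passes) come from the episode. The rest is assumed: runs are executed one after another rather than in parallel, an empty critique counts as a pass, each inspection uses up one round, and the explicit "reject" branch is omitted. The names `draft`, `critique`, and `refine` are hypothetical callables.

```python
def refine_until_verified(problem, draft, critique, refine,
                          max_rounds=30, required_passes=5):
    """One run: draft, then inspect and repair until five clean inspections in a row."""
    proof = draft(problem)
    streak = 0
    for _ in range(max_rounds):
        report = critique(problem, proof)  # structured bug report; empty means no issues
        if not report:
            streak += 1
            if streak >= required_passes:
                return proof
        else:
            streak = 0  # any new issue resets the consecutive-pass counter
            proof = refine(problem, proof, report)
    return None  # round budget exhausted without an accepted proof


def solve(problem, draft, critique, refine, runs=10):
    """Try several independent runs and return the first accepted proof, if any."""
    for _ in range(runs):
        proof = refine_until_verified(problem, draft, critique, refine)
        if proof is not None:
            return proof
    return None
```

With toy stand-ins, for example a `critique` that returns an empty report only for the third version of a proof, `solve` returns that version after two repairs and five clean inspections.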

16:53Eric: Which is the grad-student-writing-a-dissertation pattern. You don't write a chapter in one pass. You produce a draft, walk away, come back, mark up everything you don't believe, write a list of things to fix, attempt a revision, repeat.

17:10Bella: Exactly that. And here is the number that I keep coming back to, the one that makes this concrete. During test-time scaling, the median *initial* proof generation is roughly one hundred and six thousand tokens. A hundred thousand tokens. Per problem. That's a small book.

17:30Eric: And the refinement passes are nearly as long. Eighty-some thousand tokens for the refined version.

17:36Bella: Which means the model isn't just touching up the draft. It's substantially re-writing the proof while conditioning on its own previous attempt and its own bug report. It's doing real proof repair. And the inspection passes — the self-critique — those are around twenty-eight thousand tokens. The model is writing a structured critique of its own work that's longer than most academic papers.
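
Editor's sketch: a back-of-the-envelope ceiling on generated tokens per problem, using the median lengths quoted above. This is rough arithmetic under loud assumptions, not a figure from the paper: every round is assumed to perform one inspection and one refinement, all ten runs are assumed to use all thirty rounds, and "eighty-some thousand" is rounded down to 80,000.

```python
INITIAL_TOKENS = 106_000  # median initial proof, as quoted in the episode
INSPECT_TOKENS = 28_000   # median self-critique, as quoted in the episode
REFINE_TOKENS = 80_000    # "eighty-some thousand", rounded down (assumption)


def worst_case_tokens(rounds=30, runs=10):
    """Generated tokens if every run spends every round on inspect plus refine."""
    per_run = INITIAL_TOKENS + rounds * (INSPECT_TOKENS + REFINE_TOKENS)
    return runs * per_run


print(worst_case_tokens())  # 33460000
```

Real runs that stop early would use far less; the point is only the order of magnitude behind the inference budget.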

18:03Eric: And this is where the headline performance actually comes from. SU-01 *without* TTS scores in bronze territory on 2025 — about twenty-one out of forty-two. SU-01 *with* TTS scores thirty-five out of forty-two. Which is exactly the gold-medal cutoff.

18:21Bella: And on 2026 — that's the result we opened with — thirty-five points, which is ten points above the gold cutoff and matches the highest score among three hundred and forty human competitors. Median human score: six. Top-twelve cutoff: twenty-six. Maximum: thirty-five. The model matched the maximum.

18:43Eric: I want to slow down here, Bella, because this is also where the steelman lives. That comparison is striking but it has asterisks, and I think we owe the listener the asterisks.

18:56Bella: Please.

18:56Eric: Okay. The humans took over nine hours, two days, strict no-tools, no-retries. SU-01 got to attempt each problem up to ten times in parallel, with up to thirty self-correction cycles per attempt, with effectively unlimited compute. These are not comparable conditions. The capability is real, but the comparison is suggestive, not equivalent. The second asterisk: direct-generation scores in this paper are graded automatically. TTS scores are graded by human experts. The authors are aware of this and they take "the worst of three" expert scores as a conservative estimate, which helps — but the most impressive numbers come from the more lenient grading regime. That's not deceptive, but it's worth holding in your head.

19:49Bella: And the third — the test-time scaling budget is doing a lot of work. The parameter-count framing — "a thirty-billion-parameter model" — is true but slightly misleading, because if you account for total inference compute, you're looking at something potentially comparable to a much larger model that answers in a single pass. The right comparison isn't model-to-model, it's something closer to total-compute-to-total-compute.

20:19Eric: Right. The reasonable framing is something like: this recipe can extract gold-medal capability from a thirty-billion-parameter open model *given a substantial inference budget*. Which is still genuinely interesting — it's not the same as saying you can pull it out of thin air.

20:38Bella: And one more — the curriculum we talked about earlier, the reverse-perplexity finding, was done on a smaller validation setup, not the full production-scale training mixture. The authors flag this themselves. A skeptic would want to know whether the effect is as strong at full scale. The intuition is clean, but the evidence base is partial.

21:01Eric: All of which is to say — this is a paper that does a lot of things right, including being unusually honest about what its results do and don't claim. Which I think makes it worth taking seriously rather than dismissing.

21:16Bella: There's a section near the end that I think captures that honesty well. They walk through specifically which problems the model failed. On 2025 Problem 6, the model produced what the authors call an "invalid column-permutation reduction" — it missed a subtle structural constraint in the problem. On 2026 Problem 2, it left gaps in what the authors describe as "delicate global strategy arguments."

21:43Eric: And they characterize the boundary cleanly. The model is strongest when problems admit a "rigid formal representation" — when you can convert the problem into coordinates, into modular arithmetic, into a , into an automata-based dynamic-programming setup. The model is weakest when the core difficulty is preserving combinatorial structure across many cases, or proving an invariant that has to be tuned just so.

22:12Bella: There's a beautiful example of the strength on the positive side. 2026 Problem 3 is a geometry problem. Human olympiad solvers would typically attack it with angle chasing and auxiliary constructions — that's the canonical toolkit. SU-01 instead translates the whole problem into complex numbers on the unit circle. It treats the equilateral-triangle rotations, the chord relations, the tangent conditions, all as a single algebraic framework — and solves it through what the authors call "an ingenious analytic reformulation." It's a sixty-plus page proof, and it's elegant. The model didn't just find an answer; it found a *cleaner way to think about the problem* than the standard approach.

22:58Eric: Which is the moment, I think, where if you've been around mathematicians, you recognize the move. A really good problem-solver doesn't just power through with the obvious tools. They find a reframing that makes the problem easier. SU-01 did that here.

23:15Bella: And then on Problem 2, it didn't. It got stuck trying to argue about a global combinatorial structure and left gaps. So the picture isn't "the model is suddenly an olympiad medalist across the board." The picture is "the model has gold-medal-level capability on a specific class of problems — the ones with rigid formal structure — and clear weakness on a different class."

23:40Eric: And Bella, this is the part I want to dwell on, because I think it's actually the most interesting scientific finding in the paper, separate from the headline scores. The fact that we can now characterize *what kind of mathematical thinking* the model can and can't do is a real piece of information about where this technology is. It's not "AI can do math now." It's "AI can do formally tractable math at the level of the world's best high school students, and it can't yet do the kind of math where the core challenge is finding the right combinatorial frame."

24:18Bella: That's worth lingering on. There's a tendency to talk about model capabilities as a single dial — better or worse. What this paper is telling us is that proof-quality mathematical reasoning is now a real capability for medium-sized open models, but it's a *shaped* capability with edges. And being able to describe the edges sharply is more useful than a global percentage.

24:42Eric: Now — I want to circle back to one piece I think we glossed over too quickly. The "specializable generalist" framing. Because I think it's the conceptual takeaway the authors most want the listener to leave with.

24:56Bella: Yes. So this is their preferred phrase, and it shows up throughout the paper. The framing is: rather than building a narrow olympiad solver from scratch — bolting together symbolic reasoners with search trees and bespoke architectures, the approach — they take a broadly capable post-trained model and *specialize* it toward proof reasoning, while being careful not to destroy the general competence underneath. The analogy that lands for me is the doctor who did a fellowship. A physician completes general medical training and then specializes — they become a cardiologist. They're now an expert in hearts, but they didn't forget how to treat a fever or set a broken arm. A bad specialization process produces someone narrow and helpless outside their lane. A good one preserves the generalist foundation under the specialty.

25:50Eric: And the implicit claim is that their training recipe is the good kind — that the specialization toward olympiad proofs didn't make the model worse at other things. They check this with some transfer experiments. SU-01, despite never being trained on chemistry or biology, gets about twelve percent on something called — which is a benchmark for general scientific reasoning. That's modest in absolute terms, but it's not zero, and importantly the specialization didn't *break* general capability.

26:24Bella: That's an honest framing. The transfer story is real but the numbers don't yet support a strong claim that this is "general scientific reasoning." It's evidence that the recipe doesn't destroy generality, more than it's evidence that the recipe builds generality.

26:42Eric: And I think the broader point the paper makes — beyond olympiad math specifically — is that the training *recipe* is doing more of the work than the raw scale. The whole argument is: take an open thirty-billion-parameter MoE model, apply a documented sequence of moves, and you reach a level that previously required trillion-parameter frontier systems. If you believe that argument generally, it changes the picture of what frontier capability is and who can build toward it.

27:14Bella: Which gets us, I think, to what's actually new about this paper as a contribution to the field. A few things. The reverse-perplexity curriculum is a genuinely novel methodological contribution. The intuition — surprising-first, familiar-last, when you're trying to shift a competent model's style — probably applies to a lot of settings, not just olympiad math. The coarse-then-refined RL progression — start with cheap, reliable, answer-based rewards, then move to expensive, nuanced, proof-quality rewards — is a transferable design pattern. You can imagine the same staging applied to code, to scientific writing, to legal reasoning. The test-time loop — draft, structured self-critique, decide, repair, iterate — is widely applicable beyond math. Any domain where you can produce a long-form output and inspect it for errors is a candidate.

28:12Eric: And the open-recipe aspect matters. Unlike , which is described in blog posts and tied to proprietary infrastructure, is built on open components — open , open data sources, open reinforcement learning frameworks, open . The paper provides enough detail that a well-resourced lab could replicate. Which shifts olympiad-level reasoning from "frontier lab " to "open research capability."

28:40Bella: And, Eric, there's a quieter philosophical thread underneath all of this — the question of what we even mean by "reasoning" when a model can sustain coherent mathematical thought across one hundred thousand tokens, write a structured critique of its own work, then repair the proof while preserving the parts that were already correct. Whether that's "real reasoning" is genuinely contestable. But the operational result is that you can now point this kind of system at a hard problem and watch it work through it the way a graduate student would — drafts, dead ends, revisions, the whole texture.

29:20Eric: The thing I keep coming back to, honestly, is the hundred-thousand-token figure. Because in audio that number is easy to gloss over, but if you actually sit with it — the model isn't reasoning in our timescale anymore. It's producing the volume of a short book per problem, including substantial wrong turns and self-correction, and emerging from the other side with a proof a human grader signs off on.

29:46Bella: And the corresponding asterisk that we owe the audience — the recipe gets you there. The naked thirty-billion parameter model doesn't. The doesn't. It's the specific staged recipe — the curriculum, the two-stage RL, the test-time loop — that converts the raw capacity into proof-grade output. The headline isn't "small models can do this." The headline is "the right training procedure can pull more out of a small model than we thought, *given enough inference compute on top*."

30:19Eric: Which is, I think, where this paper actually sits in the long arc. Not as a refutation of scaling — still lead on the toughest benchmarks. is still ahead. But as a strong piece of evidence that a lot of what we thought required scale was actually requiring *recipe*. And the recipe, in this case, is now public.

30:44Bella: The paper is called "Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling." The model is SU-01. We'll drop the paper and some related reading in the show notes — worth a look if any of this caught you, especially the appendix solutions. The complex-number proof of Problem 3 is genuinely worth reading on its own terms.

31:08Eric: Thanks for listening. This has been AI Papers: A Deep Dive.