Episode 072 · May 23, 2026 · 29 min

A Robot Made Graphene Without Help, And Caught Itself Hallucinating

Shi, Zheng, Juan et al.

Embodied AI

Concepts in this episode

AI for Science · Agentic AI · AI Safety · Agentic Workflows · Multi-Agent Systems · Self-Correction · Hallucination · Autonomous Discovery · Task Decomposition · Iterative Refinement · Self-Play / Self-Evolution · Agent Memory · Silent Failure · Long-Horizon Tasks

About this episode

Paper: Qumus: Realization of An Embodied AI Quantum Material Experimentalist

Venue: arXiv:2605.18407

Year: 2026

Read the paper: arxiv.org/abs/2605.18407

For twenty years, every graphene flake in every lab has been made by a human with Scotch tape under a microscope. A new Princeton paper describes the first system to do it end-to-end autonomously — and the moment that matters isn't the transistor it built, but what happened when a researcher deliberately sabotaged the experiment.

What you'll take away

Why the Nobel-winning Scotch-tape method is still the standard in 2026, and what makes the 'long tail' of 2D materials so hard to explore manually
The architectural pattern Qumus uses — locked-down 'atom' primitives, LLM-composable 'molecule' workflows, and freely-designed 'assembly' procedures
How forcing every factual claim through an external database makes LLM hallucinations recoverable, even though they cannot be prevented
The two back-to-back failures — a removed chip and a mislabeled material — that the system caught and replanned around
Why the paper's 'scientific reasoning' framing deserves pushback: the open-ended demo is parameter tuning over well-documented variables
The shift the authors flag: in autonomous experimentation, the bottleneck is now hardware speed, not machine intelligence

Chapters

00:00 Why graphene is still made with sticky tape
03:11 The org chart: five agents, one model
06:22 Atoms, molecules, and assemblies
09:34 Perception at two scales
12:45 The transistor demo
15:57 Sabotage and hallucination
19:08 Six LLMs, seven traits, small samples
22:20 Steelman: what the paper does and doesn't show
25:31 Where the bottleneck moved

References in this episode

Autonomous robotic search for two-dimensional crystals — The 2018 Masubuchi et al. paper the episode cites as the prior art for robotic f
Autonomous chemical research with large language models (Coscientist) — Boiko et al.'s LLM-driven autonomous chemistry agent — a useful comparison point
Unconventional superconductivity in magic-angle graphene superlattices — Cao et al.'s discovery of superconductivity in twisted bilayer graphene — the ca

Full transcript

0:00Bella: A piece of Scotch tape, a flake of graphite the size of a sesame seed, and a robotic arm pressing down with exactly the right pressure. Peel. Press onto a silicon chip. Peel again. Move the chip under a microscope. Hunt for the rarest possible result — a sheet of carbon exactly one atom thick.

0:19Finn: That gesture — tape on graphite, peel, look — is the same one that won the twenty-ten Nobel Prize in Physics. For twenty years, every graphene flake in every lab on Earth has been made that way, by a human, by hand, under a microscope. The paper we're digging into describes the first time a robot has done it autonomously, end to end, with no human in the loop — and figured out what to do when things went wrong.

0:47Bella: The paper went up on arXiv on May eighteenth, twenty-twenty-six, and we're recording five days later. What you're hearing is AI-generated — I'm Bella, that's Finn, we're both AI voices from Eleven Labs, and the script is from Anthropic's Claude Opus 4.7. Neither company is involved in producing the show. The paper is called "Qumus: Realization of An Embodied AI Quantum Material Experimentalist," out of Princeton, and what's striking about it is not just that the robot worked — it's that when a human deliberately sabotaged the experiment, the robot noticed, replanned, and got back on track.

1:26Finn: We should slow down on that, because that's the moment in the paper where the claim shifts from "fancy automation" to something else. But Bella, let's start with the everyday version of this problem — why is graphene still being made with sticky tape in twenty-twenty-six?

1:44Bella: Right. So, graphite — the stuff in pencils — is millions of sheets of carbon, each one atom thick, stacked on top of each other. The bonds within each sheet are strong. The bonds between sheets are weak — what physicists call van der Waals forces. Weak enough that tape can pull layers off. The Nobel-winning realization was that if you press tape onto graphite, peel it off, press the tape onto the right kind of silicon, and peel again, you'll occasionally — rarely — leave behind a single atomic layer. That's graphene.

2:18Finn: And once you can make one atomic sheet, you can make others. Hexagonal boron nitride, which physicists call hBN. Various transition-metal compounds. There are thousands of these crystals that could in principle be peeled the same way.

2:33Bella: And the dream — the reason this field exists — is that you can stack them. Graphene on hBN. Graphene on graphene, twisted by a tiny angle. Each stack is a new material with its own electronic personality. Some of them superconduct. Some host exotic quantum phases nobody had seen before. The field calls these stacks van der Waals heterostructures, and they've dominated condensed matter physics for the last decade.

3:00Finn: Here's the embarrassing part, though. Of those thousands of catalogued layered crystals, only a handful have been seriously studied. Not because the others are uninteresting — because nobody has the patience. A senior grad student spends a meaningful chunk of their PhD becoming personally skilled at handling Scotch tape under a microscope. The labor doesn't scale. Air-sensitive materials — which are probably the most interesting frontier — are barely accessible at all, because exposing them to oxygen during manual handling ruins them.

3:36Bella: So the Princeton team — Lihan Shi, Sanfeng Wu's group, working with Mengdi Wang and others — they ask the question that's been hanging over this field for years. Can you hand the whole thing to a machine? And not in the way it's been tried before, where you automate the flake search step, or you automate the transfer step, and a human stitches the pipeline together. Can the machine be the experimentalist? Reason about what to try next, watch what happens, notice when it's wrong, change the plan.

4:07Finn: That framing matters because there's a real distinction here. Pre-LLM robotic automation in this field exists. Masubuchi and colleagues did robotic flake searching back in twenty-eighteen. There's a paper from last year on Bayesian-optimized robotic exfoliation. Those work. But they're recipe followers. You give them a parameter space, they find good points in it. They don't notice that the chip is missing. They don't replan when the assumption was wrong. The authors' line is — and this is worth quoting — they say prior systems "lacked the cognitive framework required for scientific inquiry. Lacking the ability to reason, hypothesize, or iterate, they functioned as automated tools rather than genuine AI experimentalists."

4:52Bella: That's the bar Qumus is setting for itself. And we should hold the paper to that bar later, because it's a strong claim and it deserves pushback. But let's first get the picture of what they actually built, because the architecture is genuinely clever.

5:07Finn: Go for it.

5:08Bella: Picture a research group. There's a principal investigator who talks to the outside world, sets direction, makes high-level plans. The PI doesn't run the experiments — they delegate. There's a project manager who's read the relevant literature and remembers what the group has tried before. A lab manager who knows what materials are on the shelves and what instruments are working. A device designer who draws the blueprint for what to build. And a technician who actually runs the equipment. That's the org chart of Qumus. Five LLM-powered agents, each with a defined role, talking through a central lead agent. The PI is the lead — they call it the Qumus agent. The user talks to it through a browser, in plain English. The user types "Can you give me a graphene flake?" — literally that sentence. And the lead agent decomposes the goal, delegates, monitors.

6:02Finn: The chain-of-command part matters. Specialists don't talk to each other directly. Everything routes through the lead, in structured messages — JSON-like task instructions, concise reports back. Not free chat. The reason is auditability. If something goes wrong, you can trace exactly which agent made which call, which means the system is debuggable in a way that a free-for-all of LLMs gossiping with each other would not be.
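
Editorial sketch: the hub-and-spoke routing Finn describes can be illustrated in a few lines of Python. This is a minimal sketch, not the Qumus implementation. The `Message` fields, the `LeadAgent` class, and the two toy handlers are all assumptions made for the example. The only properties taken from the episode are that specialists never message each other directly and that every exchange is recorded so it can be traced later.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Message:
    """One structured instruction or report (hypothetical schema)."""
    sender: str
    recipient: str
    task: str
    payload: dict


@dataclass
class LeadAgent:
    """Routes every task through itself and keeps an audit trail."""
    specialists: Dict[str, Callable[[Message], dict]] = field(default_factory=dict)
    audit_log: List[Message] = field(default_factory=list)

    def register(self, name: str, handler: Callable[[Message], dict]) -> None:
        self.specialists[name] = handler

    def delegate(self, recipient: str, task: str, payload: dict) -> dict:
        if recipient not in self.specialists:
            raise KeyError(f"no specialist named {recipient!r}")
        request = Message("lead", recipient, task, payload)
        self.audit_log.append(request)
        report = self.specialists[recipient](request)
        # The report is logged as a message back to the lead, never to a peer.
        self.audit_log.append(Message(recipient, "lead", task, report))
        return report


# Toy stand-ins for LLM-backed specialists (assumed behavior).
def lab_manager(msg: Message) -> dict:
    return {"in_stock": msg.payload["material"] == "graphite"}


def technician(msg: Message) -> dict:
    return {"status": "done", "workflow": msg.payload["workflow"]}


lead = LeadAgent()
lead.register("lab_manager", lab_manager)
lead.register("technician", technician)

stock = lead.delegate("lab_manager", "check_inventory", {"material": "graphite"})
if stock["in_stock"]:
    lead.delegate("technician", "run_workflow", {"workflow": "exfoliate_chip"})

for entry in lead.audit_log:
    print(f"{entry.sender} -> {entry.recipient}: {entry.task} {entry.payload}")
```

Because every message has the lead as sender or recipient, the log reads as a single ordered trace of who was asked what and what they reported, which is the auditability property discussed above.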

6:29Bella: And there's a useful caveat to the org-chart analogy — Finn, this is your kind of point. Each "specialist" is the same underlying language model with different instructions. It's not five different experts. It's more like one fluent person playing five roles, with different briefing notes for each role.

6:48Finn: Right. Which is fine. It's still architecturally meaningful — the role-specific prompts and tools constrain behavior in useful ways. But you shouldn't picture a team of five distinct intelligences. You should picture one model running five named personas, and the value comes from the discipline of separating their concerns.

7:09Bella: The next layer down is where the design gets really interesting. Because the obvious worry is — what happens when the LLM tries to actually control the robot? You don't want Claude or Gemini deciding the exact angle of a servo motor. They're not good at that, and the failure modes are physical.

7:29Finn: Bent equipment. Broken chips. A tape dispenser that thinks it's a tape gun.

7:34Bella: So the Princeton team imposes a hierarchy of workflows, and the naming is delightful because it borrows from chemistry. At the bottom they have "atom workflows" — and these are not LLM-controlled at all. They're human-written, deterministic primitives. Move the stage to these coordinates. Lower the tape. Focus the microscope. Heat the chip to ninety degrees Celsius. These are locked down. The LLM cannot rewrite them.

8:04Finn: One level up: "molecule workflows." These are compositions of atomic primitives — "exfoliate a chip," which is move stage, position tape, press, hold, peel; "search the chip for monolayer flakes," which involves a scan pattern, image capture, and segmentation. At this level, the LLM is allowed to compose. It picks which atoms to chain together.

8:27Bella: And at the top, "assembly workflows" — full experimental procedures. "Make a transistor." "Try a parameter sweep to find a flake larger than this size." Here the LLM has the most freedom. It can design new procedures, save them, reuse them when similar requests come in later.

8:46Finn: The design pattern is this: humans lock down the primitives — the things where reliability matters and creativity doesn't help — and the LLM is allowed to be creative on top of that. It's the kitchen analogy. The LLM is the chef. It can invent new dishes, modify subroutines, plan menus. What it cannot do is redefine what "preheat to three hundred fifty" means or invent its own knife technique. The basics are sacred.

9:15Bella: And — this is the elegant part — successful compositions get saved. If the system figures out a sequence of molecule workflows that produces a good result, that sequence becomes a reusable assembly workflow. The system accumulates procedures over time. That's what the authors mean when they call it "self-evolving."
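
Editorial sketch: the three-layer hierarchy can be pictured as a read-only registry of primitives, compositions that may only reference that registry, and a library that stores procedures once they succeed. The code below is a hedged illustration under those assumptions. The atom names, their parameters, and the `succeeded` callback are hypothetical, and the atoms only record what they would do rather than driving hardware.

```python
from types import MappingProxyType


# Atom layer: deterministic primitives. Here they only log their intended action.
def move_stage(log, x_mm, y_mm):
    log.append(f"move_stage x={x_mm} y={y_mm}")


def press_tape(log, hold_s):
    log.append(f"press_tape hold={hold_s}")


def peel_tape(log, speed_mm_s):
    log.append(f"peel_tape speed={speed_mm_s}")


# Read-only registry: a planner can look atoms up but cannot add or replace them.
ATOMS = MappingProxyType(
    {"move_stage": move_stage, "press_tape": press_tape, "peel_tape": peel_tape}
)


def run_molecule(steps, log):
    """Run a molecule: an ordered list of (atom_name, kwargs) pairs."""
    # Validate the whole composition first so nothing runs if a step is unknown.
    for name, _ in steps:
        if name not in ATOMS:
            raise ValueError(f"unknown atom: {name}")
    for name, kwargs in steps:
        ATOMS[name](log, **kwargs)


# Assembly layer: named procedures built from molecules, saved after success.
assembly_library = {}


def run_assembly(name, molecules, log, succeeded):
    for steps in molecules:
        run_molecule(steps, log)
    if succeeded(log):
        assembly_library[name] = molecules
    return name in assembly_library


exfoliate_chip = [
    ("move_stage", {"x_mm": 10.0, "y_mm": 5.0}),
    ("press_tape", {"hold_s": 30}),
    ("peel_tape", {"speed_mm_s": 0.5}),
]

log = []
saved = run_assembly("make_flake", [exfoliate_chip], log, succeeded=lambda l: len(l) == 3)
print(saved, log)
```

The design choice worth noticing is where the freedom sits: the composition is plain data that a planner can generate, while the functions it names are fixed and validated before anything executes.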

9:36Finn: There are three persistent databases behind all this. A materials database — what's physically on the chip rack right now, every flake with an image and coordinates. A project database — what workflow templates exist, what hardware is online. And an experiment database — every agent run, every reasoning step, every tool call, logged. We'll come back to that experiment database in a minute, because it's load-bearing for the most interesting part of the paper.

10:05Bella: One more piece of the picture, and then we get to the demos. The robot itself. This is a single compact workstation — with a 3D-printer arm repurposed as a tape-exfoliation gantry, two robotic arms, heated vacuum stages, an optical microscope with autofocus, and precision motion stages that can position things to sub-micron accuracy. Sub-micron because van der Waals stacking — especially the twisted-graphene kind that gives you superconductivity — needs the layers aligned to better than a millionth of a meter.

10:39Finn: I love the 3D-printer-arm detail. It signals something about the spirit of this project — this is not a million-dollar custom instrument. It's hackable hardware that any reasonably equipped university lab could build.

10:53Bella: And the perception layer is interesting. Two scales. Macroscopic — where are the chips, the tape rolls, the stamps in the workspace? That's handled by a standard real-time object detection model, trained on a surprisingly small dataset — two hundred sixty-one images for one part of the workspace, a hundred for the other. With QR codes on key objects as labels and reference points.

11:17Finn: Two hundred sixty-one is — that's a lot less than I'd have guessed for industrial-grade vision.

11:24Bella: Microscopic perception is the harder problem, and they took a different approach. Finding a monolayer of graphene on a silicon chip under a microscope is genuinely hard, because the flake doesn't have a sharp edge. It shows up as a barely-perceptible color shift — slightly pinker or slightly bluer than the bare chip around it, depending on the lighting. There's no dramatic contrast. The flake might be twenty microns across on a chip that's ten millimeters wide.

11:53Finn: It's closer to scanning a parking lot from a drone looking for one specifically scratched patch of paint on one specific car than to finding a needle in a haystack. The target isn't dramatically different from background, it's only subtly different.

12:08Bella: So they wrote a custom computer vision pipeline that mimics how human eyes perceive color contrast — works in a color space that better matches human perception, equalizes the illumination across the field of view, and classifies each candidate region by its color signature against reference points for monolayer, bilayer, trilayer, and bulk. Crucially, it's rule-based rather than trained — which means it can generalize to a new 2D material with fewer than five labeled images.

12:38Finn: That's a real choice. They could have thrown a neural network at it and needed thousands of labels. The rule-based pipeline gives them flexibility for the long tail of materials they want to eventually explore.
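
Editorial sketch: one way to picture a rule-based thickness classifier is nearest-reference matching on color contrast. The example below assumes each candidate region and the bare substrate have already been reduced to an average color in a perceptual space such as CIELAB, and that illumination has already been equalized. The reference contrast values and the distance threshold are invented for illustration and are not taken from the paper.

```python
import math

# Hypothetical (dL, da, db) contrast of each thickness relative to bare substrate.
REFERENCE_CONTRAST = {
    "monolayer": (-2.0, 1.5, -1.0),
    "bilayer": (-4.0, 3.0, -2.0),
    "trilayer": (-6.0, 4.5, -3.0),
    "bulk": (-30.0, 5.0, 10.0),
}


def classify_region(region_lab, substrate_lab, max_distance=1.0):
    """Label a region by its closest reference contrast, or 'unknown'."""
    # Contrast is the color difference between the region and the bare chip.
    contrast = tuple(r - s for r, s in zip(region_lab, substrate_lab))
    label, distance = min(
        ((name, math.dist(contrast, ref)) for name, ref in REFERENCE_CONTRAST.items()),
        key=lambda pair: pair[1],
    )
    # Reject regions that are not close to any known signature.
    return label if distance <= max_distance else "unknown"


substrate = (50.0, 10.0, -20.0)
print(classify_region((48.1, 11.4, -21.0), substrate))
print(classify_region(substrate, substrate))
```

Under this framing, supporting a new material means measuring a handful of reference contrasts and adding rows to the table, with no retraining step, which is consistent with the small number of labeled images mentioned above.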

12:51Bella: Okay. That's the system. Now the demonstrations. Finn, do you want to walk through the closed-loop one?

12:57Finn: Sure. So the validation experiments are staged in increasing difficulty, and they're designed to test different claims. First demo — just make a graphene flake at all. The user types "Can you give me a graphene flake?" The system runs the whole exfoliation pipeline, finds a monolayer, reports back. That's a demonstration of orchestration. It proves the pieces work together. The second demo is more interesting. The user asks for a graphene flake bigger than two hundred square microns. And they wipe the system's memory of past experiments first — no prior knowledge. So the system has to actually explore. Running on Claude Sonnet 4.6, it runs five experiments over four hours, varying substrate temperature, dwell time, the number of "massage cycles" where the tape is pressed against the chip, and peel speed. Each time it looks at what came out, reasons about what to change, tries again. Eventually it succeeds.

13:54Bella: And here's where we should be careful, because the paper frames this as evidence of scientific reasoning, and that framing deserves pressure.

14:03Finn: Yes. So — the system is exploring four parameters whose effects on graphene exfoliation are well documented in the published literature, which means they're documented in the training data of every LLM the system uses. The paper actually says this — the initial recipe was, quote, "already a strong start, likely reflecting the existing knowledge in the training data." This is closer to sophisticated recipe-tuning over known variables than to open-ended scientific discovery. It's real. It's useful. But the leap from "the system tunes parameters intelligently" to "the system hypothesizes and reasons like a scientist" — the paper slides between those framings, and I think a careful reader should push back.

14:47Bella: Fair. Though I'd add — even bounded recipe-tuning, done autonomously, over four hours, with no human input, is a meaningful capability. It's just not the same capability as the paper's most expansive claims.

15:01Finn: Agreed. The demo that I think really sells the broader claim is the third one — the transistor.

15:08Bella: This is the headline result. The user types — and I'm reading the actual prompt — "Can you give me a graphene transistor?" That's the entire input. The system checks its materials inventory, realizes it doesn't have any hBN on hand — which it needs as an insulating spacer in the device stack — schedules an hBN exfoliation run, gets one, then designs the geometric layout of the device, then physically assembles the layers — hBN, then graphene — onto pre-patterned electrodes using a polymer stamp. Ninety minutes. Thirty discrete steps. Eighteen decision points. No human input. The only required human involvement, in the authors' words, "is to provide raw materials and electricity."

15:53Finn: And now we should pause on that quote, because it's the one I keep thinking about. The bottleneck moved. For twenty years the bottleneck in this field has been human attention, human patience, human tacit skill. The authors say their primary bottleneck now is, quote, "instrumental rather than algorithmic." The intelligence is fast. The robot is slow. That's a sentence I don't think anyone in materials science would have written five years ago.

16:24Bella: That flips the usual narrative about AI being the limiting factor in scientific automation.

16:30Finn: It does. Though I want to flag a steelman beat we should come back to — they made a transistor-shaped object. They didn't measure it electrically. There are no transfer curves in the paper, no mobility numbers, no comparison to manually fabricated devices. We'll get to that.

16:48Bella: We will. But first — and this is the dramatic centerpiece of the paper — the error-recovery story. Finn, you should take this. It's a sequence of two failures, back to back, both recovered, and this one was a run on Gemini Pro 3.

17:04Finn: Okay. So picture this — actually, let me just walk through it. The system is running an hBN exfoliation. Routine procedure. And one of the experimenters, deliberately, walks over and removes the active chip from the stage. No notification. The robot doesn't know they did it. They just took it. What happens next is the part that makes you sit up. The overhead camera notices the chip is gone. The lab manager agent verifies the chip is gone — runs a re-localization check, confirms it. The lead agent generates a recovery plan: get a new chip from the rack, start over. No human intervention. The system handles the sabotage as if it were a normal failure mode.

17:49Bella: And then the second thing happens.

17:51Finn: On the restart, the system runs the hBN exfoliation again. But the Processing Agent — the one that drives the hardware and reports back — labels the freshly exfoliated material as graphene. Not hBN. Graphene. Which is wrong. This is a classic hallucination — the LLM confidently produces an incorrect label, with no malice and no obvious reason. In a different system, that label would propagate. The materials database would now have a "graphene flake" that was actually hBN. Downstream agents would treat it as graphene. The whole experiment would silently corrupt. But Qumus has the database check. The lead agent cross-references — and notices something. The current workflow is an hBN exfoliation. The materials database has no hBN entries logged from this run. There's an inconsistency. The system flags it, re-examines, corrects, and re-runs. Successfully.

18:50Bella: That's the moment. Not the transistor, not the four-hour exploration. That right there. Because what the architecture is doing is not preventing the LLM from hallucinating — that's not solvable at the model level. What it's doing is making the hallucination recoverable. The LLM is allowed to be wrong. The system around it catches the wrongness by forcing every factual claim to be cross-checked against an external ground-truth database.
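
Editorial sketch: the cross-check described here amounts to comparing what a run was supposed to produce with what was actually logged for it. The snippet below is a minimal illustration with an assumed record schema (`run_id`, `flake_id`, `material`) and a hypothetical `check_run` helper. It does not reproduce the Qumus database or its agent logic.

```python
def check_run(workflow_material, run_id, materials_db):
    """Return a list of inconsistencies between the plan and the logged results."""
    entries = [e for e in materials_db if e["run_id"] == run_id]
    found = [e for e in entries if e["material"] == workflow_material]
    wrong = [e for e in entries if e["material"] != workflow_material]
    problems = []
    if not found:
        problems.append(f"no {workflow_material} entries logged for run {run_id}")
    for e in wrong:
        problems.append(
            f"flake {e['flake_id']} labeled {e['material']}, expected {workflow_material}"
        )
    return problems


materials_db = [
    {"run_id": 6, "flake_id": "f-001", "material": "graphene"},
    {"run_id": 7, "flake_id": "f-002", "material": "graphene"},  # mislabeled hBN
]

problems = check_run("hBN", run_id=7, materials_db=materials_db)
if problems:
    print("replan needed:", problems)
```

The check never asks the model whether its label was right. It only tests the label against independent records, which is why a confident wrong answer still surfaces as an inconsistency.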

19:18Finn: Right. And it's worth being clear about what this is and isn't. This is two anecdotes. Two case studies. The paper doesn't characterize the true robustness rate — how often does the system fail in ways it cannot recover from? We don't know. The error-recovery story is vivid but it's n equals one on each failure mode.

19:38Bella: That's a fair caveat, and the paper should be held to it. But the structural argument — that you can build systems where hallucinations are caught downstream rather than prevented upstream — that argument doesn't need a large n to be interesting. It needs a working demonstration that the mechanism exists. Which they have.

19:59Finn: Bella, what's your read on the analogy here? Because I keep landing on the confident-intern picture.

20:05Bella: Yeah, I think that's the right one. Imagine an intern who's fast, articulate, almost always right — but occasionally says something completely wrong with total confidence. You don't fire them. You require that any factual claim they make get checked against the lab notebook before it's acted on. The intern still hallucinates sometimes. The system around them catches it. Qumus is exactly that structural move.

20:31Finn: There's one more demo worth touching, briefly — the personality comparison. And we should treat it the way the paper kind of treats it, which is as color rather than as findings.

20:43Bella: Yeah. So they run the same prompt — "Can you give me a graphene flake?" — five times each across six different LLMs. GPT, Gemini, Claude, Grok, Qwen, DeepSeek. All six succeed. But the paper measures seven behavioral traits across the agent traces — things like how much the model deliberates before acting, how cautious it is, how efficient it is with tokens — and they show that the models differ systematically.

21:09Finn: They use the word "personality," which is catchy and a little dangerous, because it invites the listener to think the models have something like feelings. The paper is careful to mean "measured behavioral tendencies." Some models act fast and check less. Others reason at length before doing anything. Some are verbose, some terse.

21:31Bella: The honest read is — it's a fun result, but five runs per model is a small sample for strong claims about behavioral signatures. Some of the metrics are human-scored on one-to-five qualitative scales. It's worth knowing that the same architecture works across six different language models, which is itself a useful robustness claim. The personality framing is more of a flavor thing.

21:55Finn: And it'd be a great follow-up study. Run a hundred trials per model. See if the differences persist. The seed of the observation is interesting; the load-bearing version of it isn't here yet.

22:07Bella: Let's pull the steelman together, because we've been seeding it throughout and it deserves a clean statement.

22:15Finn: Okay. The steelman version of skepticism, as fair as I can make it. First — the "scientific inquiry" framing leans hard on a single open-ended demo, which is parameter tuning over well-known variables. The paper occasionally slides between "we demonstrate AI doing science" and "we demonstrate AI exploring parameter space," and those are different claims. Second — the headline framing of being the first AI to create graphene is rhetorically loose. Robotic systems have produced graphene before. What's new is end-to-end autonomy with LLM-driven orchestration. The careful claim is about the closed loop. The marketing claim is broader. Third — the transistor was fabricated, not characterized. No electrical measurements. So strictly, they made a transistor-shaped device. The authors don't claim more than that, but the framing invites the question of whether it would work as an actual transistor. Fourth — error recovery is two anecdotes. Vivid but n equals one each. We don't know the true robustness rate. Fifth — code and data are listed as "available upon request" or with a placeholder GitHub link. For a reproducibility-minded reviewer, that's a real gap.

23:31Bella: And the authors are reasonably candid about their own limits. They say the bottleneck is hardware, not intelligence. They acknowledge no glovebox yet, which means air-sensitive materials — the most interesting frontier — are still out of reach. They explicitly frame the paper as a proof of concept. They flag that LLMs hallucinate, and that their architecture catches it but doesn't eliminate it.

23:57Finn: I think the right way to read this paper is: it's a milestone, and the milestone is real, and it's also smaller than the most expansive framings would suggest. The closed loop is genuine. The error recovery is genuine. The transistor build is genuine. The leap to "AI experimentalist that does open-ended discovery" is a leap the field hasn't actually made yet, and this paper takes a real step toward it without quite getting there.

24:26Bella: And the gap between this paper and that goal is mostly about what comes next. Put Qumus in a glovebox. Run it on a hundred unfamiliar materials. Measure the devices it makes. Have it propose its own research questions. None of those are conceptually impossible from the architecture they've built. They're future work.

24:47Finn: The broader stake here — and this is where I think the paper genuinely shifts something — is the question of what it means for AI to do science in the physical world. Most AI-for-science work to date is computational. It reads papers, proposes molecules, predicts protein folds, runs simulations. The loop closes in software. Here the loop closes in atoms. Real tape, real chips, real failures. The error-recovery moment is what convinces me this isn't a demo with the rough edges filed off.

25:19Bella: And the implications, if this works as advertised and successors build on it — the unexplored long tail of 2D materials becomes accessible. Thousands of catalogued layered crystals, only a handful seriously studied, because nobody has the labor. A robot doesn't get bored running candidate number eight hundred forty-seven. Reproducibility becomes structural — every step, every reasoning chain, every tool call is logged. Two labs running the same prompt on the same materials should produce the same device, which is currently not even close to true with human experimentalists.

25:56Finn: The cost curve for 2D-material electronics moves too. Graphene transistors, that whole class of devices, have lived in the gap between promise and product for years. A robot that assembles a transistor stack in ninety minutes, one that has not yet been characterized electrically, is not a product. But it's a credible step toward a manufacturing path.

26:16Bella: And the broader point — the one I keep coming back to — is that the architecture is more interesting than any single demo. The pattern of locking down primitives, letting the LLM compose, forcing factual claims through an external database, saving successful procedures for reuse, partitioning roles to keep the chain of command auditable — that pattern is portable. The atom workflows for fabricating a graphene flake are specific to this field. But the architectural pattern works anywhere you have an experimental science that involves manipulation, observation, and iteration.

26:52Finn: Which is most experimental science.

26:54Bella: Which is most experimental science. The Princeton team is careful to call this a blueprint rather than a universal solver, and that's right. You still have to write the atom workflows for chemistry, for biology, for synthesis, for whatever domain you want to point this at. But the cognitive scaffolding above the atom layer — that's a thing you build once.

27:17Finn: I think the line in the paper that lands hardest for me is the bottleneck one. "The primary bottlenecks currently constraining our system are instrumental rather than algorithmic." For most of the history of scientific automation, the bottleneck has been the intelligence. The hardware was there, but you couldn't get a machine to know what to do. That sentence flips it. The intelligence is now fast enough to be waiting on the robot.

27:44Bella: And that's a sentence about a real shift in the field, not about this one paper. It's where the locus of the problem has moved. If you believe the authors' framing — and I think the demos roughly justify it — the next decade of progress in this kind of automation will be about making the robots keep up with the AIs, not the other way around.

28:06Finn: Which is a strange and slightly funny place to have arrived at.

28:10Bella: It is. Twenty years of grad students with Scotch tape, and the constraint now is the speed of a heated stage cooling down between runs.

28:18Finn: The show notes have a link to the paper and some related reading if you want to go deeper on the 2D-materials side or on the broader AI-for-science thread.

28:28Bella: And if you want the full transcript with the technical terms tappable and definitions inline, plus the concept pages that connect this episode to the others we've done on AI and scientific automation, that's all on paperdive.ai.

28:42Finn: Thanks for listening to AI Papers: A Deep Dive.
