Concept · 10 episodes

Multimodal Models

Definition

Multimodal models handle more than one modality (text, images, audio, video, action streams), usually by projecting each into a shared representation space. The frontier question is how cleanly capabilities transfer from one modality to another.
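
To make "projecting into a shared representation space" concrete, here is a minimal sketch in plain Python. It is illustrative only: the feature vectors, the dimensions, and the helper names (make_projection, project, cosine) are assumptions made for this example, and the random matrices merely stand in for projection weights that a real model would learn.

```python
import math
import random


def make_projection(in_dim, out_dim, rng):
    """Return a random out_dim x in_dim matrix standing in for a learned projection."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, features):
    """Map a modality-specific feature vector into the shared space and L2-normalize it."""
    out = [sum(w * x for w, x in zip(row, features)) for row in matrix]
    norm = math.sqrt(sum(v * v for v in out)) or 1.0
    return [v / norm for v in out]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
SHARED_DIM = 8  # assumed size of the shared space

# Hypothetical encoder outputs: the two modalities have different native sizes.
text_features = [rng.gauss(0.0, 1.0) for _ in range(16)]
image_features = [rng.gauss(0.0, 1.0) for _ in range(32)]

# One projection per modality; in a trained model these weights would be learned.
text_proj = make_projection(16, SHARED_DIM, rng)
image_proj = make_projection(32, SHARED_DIM, rng)

text_vec = project(text_proj, text_features)
image_vec = project(image_proj, image_features)

# Both vectors now live in the same 8-dimensional space and can be compared directly.
similarity = cosine(text_vec, image_vec)
print(len(text_vec), len(image_vec), round(similarity, 3))
```

Because the weights here are random, the similarity value itself carries no meaning. The point is only that inputs of different native sizes become comparable once each is mapped into the same space; training is what would pull matching text and image pairs close together.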

Episodes covering this

209
How 2.6 Billion Doodles Exposed the Culture Words Quietly Delete
Billions of Sketches Reveal Hidden Cultural Variation in Human Concepts
15 min·Jul 09, 2026
191
How One Researcher Beat GPT-5.2 and Gemini 3 by Judging Their Answers, Not Improving Them
Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2
Land · Independent Researcher·26 min·Jul 02, 2026
190
The Skill Every AI Manager Is Missing: Handing Out Exactly the Right Keys
ClawArena-Team: Benchmarking Subagent Orchestration and Dynamic Workflows in Language-Model Agents
Xiong, Ji, Qiu et al. · UNC Chapel Hill·21 min·Jul 02, 2026
156
Why More Human Demonstrations Made a Computer-Use Agent Worse
ProCUA-SFT Technical Report
Jung, Lu, Cui et al. · NVIDIA / University of Washington·20 min·Jun 18, 2026
154
How a 7B Model Out-Investigates a 72B One by Choosing What to Look At
Native Active Perception as Reasoning for Omni-Modal Understanding
Xing, Xu, Wang et al. · The Chinese University of Hong Kong·21 min·Jun 18, 2026
115
Teaching a Phone Agent to Reason Silently, And Keeping It Honest
MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
Yang, Hu, Hao et al. · Beihang University·24 min·Jun 04, 2026
111
How a 4B Web Agent Beat Models 60x Its Size on 500 Demonstrations
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
Yang, Wu, Chen et al. · UIUC·24 min·Jun 03, 2026
062
Treating Hallucinations as Exploits: A Gate-Based Architecture for Agent Safety
Hallucination as Exploit: Evidence-Carrying Multimodal Agents
Zhang, Zheng, Yang · Shenzhen University·24 min·May 20, 2026
047
When Agent Benchmarks Lie: The Harness Problem in Open-Source AI
Orchard: An Open-Source Agentic Modeling Framework
Peng, Yao, Wu et al. · Microsoft Research·28 min·May 15, 2026
027
When AI Agents Build the Serving Stack: A Bet on Bespoke Infrastructure
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
Kamahori, Li, Peter et al. · University of Washington·30 min·May 08, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.