Definition
Multimodal models handle more than one modality (text, images, audio, video, action streams), usually by encoding each modality separately and projecting the results into a shared representation space. The frontier question is how cleanly capabilities learned in one modality transfer to another.
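To make the idea of a shared representation space concrete, here is a minimal sketch in plain Python. It is illustrative only: the dimensions, the names `make_projection`, `project`, and `cosine`, and the random weights are all assumptions standing in for learned encoders and projection layers, not a description of any particular model.

```python
import math
import random


def make_projection(in_dim, out_dim, rng):
    """Return a random out_dim x in_dim matrix (a stand-in for learned weights)."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, features):
    """Apply a linear map, then L2-normalise so vectors are comparable."""
    out = [sum(w * x for w, x in zip(row, features)) for row in matrix]
    norm = math.sqrt(sum(v * v for v in out)) or 1.0
    return [v / norm for v in out]


def cosine(a, b):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
SHARED_DIM = 8                 # hypothetical size of the shared space
TEXT_DIM, IMAGE_DIM = 16, 32   # hypothetical per-modality feature sizes

# One projection per modality; in a real model these would be trained.
text_proj = make_projection(TEXT_DIM, SHARED_DIM, rng)
image_proj = make_projection(IMAGE_DIM, SHARED_DIM, rng)

# Placeholder encoder outputs with different dimensionality per modality.
text_features = [rng.random() for _ in range(TEXT_DIM)]
image_features = [rng.random() for _ in range(IMAGE_DIM)]

text_vec = project(text_proj, text_features)
image_vec = project(image_proj, image_features)

# Both vectors now live in the same space and can be compared directly.
similarity = cosine(text_vec, image_vec)
print(f"shared dim: {len(text_vec)}, similarity: {similarity:.3f}")
```

With random weights the similarity score carries no meaning. The point is only that inputs of different sizes end up as vectors of the same size that can be compared directly. In a trained model, the projections would be fitted so that matching inputs (for example, an image and its caption) land close together.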