Definition
AI systems that handle more than one kind of input, such as text and images together, rather than a single kind.
Models trained on or operating over multiple data modalities (text, image, audio, video, action), often by mapping inputs from each modality into a single representation space shared across modalities.
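
As a rough illustration of a unified representation space, the sketch below projects a text feature vector and two image feature vectors into the same 2-dimensional space and compares them by cosine similarity. Everything in it is an assumption made for illustration: the function names (`embed_text`, `embed_image`), the fixed projection weights, and the toy feature values are hypothetical, and a real model would learn its projections from paired data rather than hard-code them.

```python
import math

def project(features, weights):
    """Apply a linear map: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical, hand-picked projection weights (learned in a real system).
TEXT_TO_SHARED = [[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]]           # 3-dim text features -> 2-dim shared space
IMAGE_TO_SHARED = [[0.5, 0.5, 0.0, 0.0],
                   [0.0, 0.0, 0.5, 0.5]]     # 4-dim image features -> 2-dim shared space

def embed_text(features):
    return normalize(project(features, TEXT_TO_SHARED))

def embed_image(features):
    return normalize(project(features, IMAGE_TO_SHARED))

# Toy inputs: raw features differ in size per modality,
# but the embeddings live in the same space and can be compared directly.
text_vec = embed_text([2.0, 0.0, 1.0])
image_a = embed_image([1.0, 1.0, 0.0, 0.0])
image_b = embed_image([0.0, 0.0, 1.0, 1.0])

print(cosine(text_vec, image_a))  # high similarity: same direction in shared space
print(cosine(text_vec, image_b))  # low similarity: orthogonal in shared space
```

The point of the design is that each modality keeps its own encoder, while downstream comparison or retrieval needs only the shared space.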