Definition
A unified vision-language model that interleaves text and image generation.
A unified multimodal model that interleaves autoregressive text generation with diffusion-based image generation. VibeServe uses it as a long-tail serving target.