Canvas-to-Image: Compositional Image Generation with Multimodal Controls

Abstract

While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control. We introduce Canvas-to-Image, a unified framework that consolidates heterogeneous control signals into a single canvas interface. Our key idea is to encode diverse control signals, including subject references, bounding boxes, and pose skeletons, into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning.
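To make the canvas idea concrete, here is a minimal sketch of how heterogeneous controls might be rasterized onto one composite canvas image. This is an illustration only, not the paper's implementation: the function name `compose_canvas`, the color conventions (red boxes, green skeletons), and the input layout are all hypothetical.

```python
import numpy as np

def compose_canvas(size, subjects, boxes, skeletons):
    """Paint subject crops, bounding boxes, and pose skeletons onto one canvas.

    size:      (H, W) of the output canvas
    subjects:  list of (patch, (y, x)) -- RGB arrays pasted with top-left at (y, x)
    boxes:     list of (y0, x0, y1, x1) drawn as red rectangle outlines
    skeletons: list of joint chains [(y, x), ...] connected by green segments
    """
    H, W = size
    canvas = np.full((H, W, 3), 255, dtype=np.uint8)  # white background

    # Paste each subject reference crop at its anchor position.
    for patch, (y, x) in subjects:
        h, w = patch.shape[:2]
        canvas[y:y + h, x:x + w] = patch

    # Outline each bounding box in red (hypothetical color convention).
    for y0, x0, y1, x1 in boxes:
        canvas[y0:y1, x0] = canvas[y0:y1, x1 - 1] = (255, 0, 0)
        canvas[y0, x0:x1] = canvas[y1 - 1, x0:x1] = (255, 0, 0)

    # Draw each pose skeleton as green segments between consecutive joints.
    for chain in skeletons:
        for (y0, x0), (y1, x1) in zip(chain, chain[1:]):
            n = max(abs(y1 - y0), abs(x1 - x0)) + 1
            ys = np.linspace(y0, y1, n).round().astype(int)
            xs = np.linspace(x0, x1, n).round().astype(int)
            canvas[ys, xs] = (0, 255, 0)

    return canvas
```

A canvas built this way bundles all the control signals into one image, so a diffusion model conditioned on it can, in principle, read subject identity, placement, and pose from a single visual input.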

Publication
arXiv preprint, 2025