Gordon Qian is a research scientist on the Creative Vision team at Snap Research, working on generative models. He earned his Ph.D. in Computer Science from KAUST, where he was fortunate to be advised by Prof. Bernard Ghanem.
He has authored 19 top-tier conference and journal papers, including one first-authored work with over 1,100 citations and three first-authored works with over 300 citations each.
In total, his publications have received over 3,200 citations, and his current h-index is 19.
His representative works include PointNeXt (NeurIPS, >1,100 citations, >900 GitHub stars), Magic123 (ICLR, >450 citations, >1.6K GitHub stars), and Omni-ID (CVPR'25, integrated into Snapchat products).
He has also served as an Area Chair for ICLR since 2025.
If you are interested in working with me on image/video generative models, please reach out at guocheng.qian [at] outlook.com.

Ph.D. in CS
KAUST, 2019 - 2023

B.Eng in ME
XJTU, 2014 - 2018
Selected projects are listed below; * / † denote equal contribution / corresponding author. See the full publication list.
This work isolates a specific attribute from any image and merges the selected attributes from multiple images into a coherent generation.
Canvas-to-Image introduces a unified framework that consolidates heterogeneous controls (subject references, bounding boxes, pose skeletons) into a single canvas interface for high-fidelity compositional image generation.
LayerComposer enables Photoshop-like control for multi-subject text-to-image generation, allowing users to naturally compose scenes by intuitively placing, resizing, and locking elements in a layered canvas with high fidelity.
We prevent shortcuts in adapter training by explicitly providing the shortcuts during training, forcing the model to learn more robust representations.
ComposeMe is a human-centric generative model that enables disentangled control over multiple visual attributes — such as identity, hair, and garment — across multiple subjects, while also supporting text-based control.
ThinkDiff enables multimodal in-context reasoning in diffusion models by aligning vision-language models to LLM decoders, transferring reasoning capabilities without requiring complex reasoning-based datasets.
Omni-ID is a novel facial representation tailored for generative tasks, encoding identity features from unstructured images into a fixed-size representation that captures diverse expressions and poses.
WonderLand is a video-latent-based approach for single-image 3D reconstruction of large-scale scenes.
AC3D studies when and how camera signals should be conditioned into a video diffusion model for better camera control and higher video quality.
Magic123 is a coarse-to-fine image-to-3D pipeline that produces high-quality, high-resolution 3D content from a single unposed image under the guidance of both 2D and 3D priors.
Pix4Point shows that image pretraining significantly improves point cloud understanding.
PointNeXt boosts PointNet++ to state-of-the-art performance through improved training and scaling strategies.
ASSANet makes PointNet++ faster and more accurate.