Welcome! I am a research scientist on the Creative Vision team at Snap Research, working on Generative AI. I earned my Ph.D. in Computer Science from KAUST, where I was fortunate to be advised by Prof. Bernard Ghanem.
Prior to that, I received my B.Eng degree from Xi'an Jiaotong University (XJTU), China, with the university's highest undergraduate honor.
My primary research interests lie in computer vision and generative models.
My representative work includes PointNeXt (NeurIPS, >1000 cites, >900 GitHub stars), Magic123 (ICLR, >400 cites, >1.6K GitHub stars), and Omni-ID (CVPR'25, integrated into Snapchat products).
I have served as an area chair for ICLR since 2025.
If you are interested in working with us on generative models (video, MLLM, RL, agents), please drop me a message at guocheng.qian [at] outlook.com.

Ph.D. in CS
KAUST, 2019 - 2023

B.Eng in ME
XJTU, 2014 - 2018
Selected projects below; * / † denote equal contribution / corresponding author. See the full publication list.
Canvas-to-Image introduces a unified framework that consolidates heterogeneous controls (subject references, bounding boxes, pose skeletons) into a single canvas interface for high-fidelity compositional image generation.
LayerComposer enables Photoshop-like control for multi-subject text-to-image generation, allowing users to compose scenes with high fidelity by intuitively placing, resizing, and locking elements on a layered canvas.
We prevent shortcuts in adapter training by explicitly providing the shortcuts during training, forcing the model to learn more robust representations.
ComposeMe is a human-centric generative model that enables disentangled control over multiple visual attributes — such as identity, hair, and garment — across multiple subjects, while also supporting text-based control.
WonderLand is a video-latent-based approach for reconstructing large-scale 3D scenes from a single image.
Omni-ID is a novel facial representation tailored for generative tasks, encoding identity features from unstructured images into a fixed-size representation that captures diverse expressions and poses.
AC3D studies when and how to inject camera conditioning into a video diffusion model for better camera control and higher video quality.
AToM trains a single text-to-mesh model on many prompts using 2D diffusion without 3D supervision, yields high-quality textured meshes in under a second, and generalizes to unseen prompts.
Magic123 is a coarse-to-fine image-to-3D pipeline that produces high-quality, high-resolution 3D content from a single unposed image, guided by both 2D and 3D priors.
Pix4Point shows that image pretraining significantly improves point cloud understanding.
ZeroSeg trains open-vocabulary zero-shot semantic segmentation models using only the CLIP vision encoder.
PointNeXt boosts PointNet++ to state-of-the-art performance with improved training and scaling strategies.
ASSANet makes PointNet++ faster and more accurate.