Gordon Qian is a research scientist on the Creative Vision team at Snap Research, where he leads unified multimodal generation and its end-to-end R&D. He earned his Ph.D. in Computer Science from KAUST, where he was fortunate to be advised by Prof. Bernard Ghanem.
He has authored 19 top-tier conference and journal papers; his publications have received over 3,300 citations, and his current h-index is 19.
His representative works include PointNeXt (NeurIPS, >1100 citations, >900 GitHub stars), Magic123 (ICLR, >450 citations, >1.6K GitHub stars), and Omni-ID (CVPR'25). Papers he led, including SR-Training (NeurIPS), Omni-ID, and ComposeMe (SIGGRAPH Asia), have been integrated into Snapchat products serving 400 million monthly active users, with 6 filed patents. He also serves as an area chair for ICLR 2026.
If you are interested in working on image/video generative models with me, please reach out at guocheng.qian [at] outlook.com.

Ph.D. in CS
KAUST, 2019 - 2023

B.Eng in ME
XJTU, 2014 - 2018
Selected projects below; * / † denote equal contribution / corresponding author. See the full publication list.
EasyV2V: A high-quality instruction-based video editing framework that enables intuitive video manipulation through natural language instructions.
Omni-Attribute can isolate a specific attribute, whether concrete or abstract, from any image, and merge selected attributes from multiple images into a coherent generation.
We prevent shortcut learning in adapter training by explicitly providing the shortcut signals during training, forcing the model to learn more robust representations.
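To make the idea concrete, here is a minimal, hypothetical PyTorch sketch (names and dimensions are illustrative, not the released training code): the potential shortcut signal, e.g., pose extracted from the reference image, is handed to the model directly, so the adapter gains nothing by smuggling it through its features.

```python
import torch
import torch.nn as nn

class ShortcutAwareAdapter(nn.Module):
    """Toy adapter that conditions a generator on reference-image features.

    Hypothetical illustration: the shortcut signal (e.g., pose) is provided
    explicitly as a separate input, so the adapter is pushed to encode the
    attribute we actually care about (e.g., identity) rather than copy pose.
    """

    def __init__(self, feat_dim: int, shortcut_dim: int, hidden: int = 256):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )
        # The shortcut is projected and supplied directly, not inferred.
        self.shortcut_proj = nn.Linear(shortcut_dim, hidden)

    def forward(self, ref_feats: torch.Tensor, shortcut: torch.Tensor):
        # Conditioning = adapter features + the explicitly provided shortcut.
        return self.adapter(ref_feats) + self.shortcut_proj(shortcut)
```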
ComposeMe is a human-centric generative model that enables disentangled control over multiple visual attributes — such as identity, hair, and garment — across multiple subjects, while also supporting text-based control.
ThinkDiff enables multimodal in-context reasoning in diffusion models by aligning vision-language models to LLM decoders, transferring reasoning capabilities without requiring complex reasoning-based datasets.
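A rough sketch of the alignment, under the assumption (from the description above) that the diffusion decoder consumes the same encoder feature space as an LLM decoder; all names here are illustrative:

```python
import torch.nn as nn

class Aligner(nn.Module):
    """Hypothetical aligner: maps VLM output tokens into the shared
    feature space consumed by an LLM decoder and a diffusion decoder."""

    def __init__(self, vlm_dim: int, dec_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, dec_dim), nn.GELU(), nn.Linear(dec_dim, dec_dim)
        )

    def forward(self, vlm_tokens):
        return self.proj(vlm_tokens)

# Training sketch: freeze the VLM and the LLM decoder; supervise the aligner
# with an ordinary captioning loss so the aligned features "read" like
# encoder features to the LLM decoder. At inference, the same aligned
# features are fed to the diffusion decoder, which inherits the VLM's
# multimodal reasoning without ever seeing a reasoning-annotated dataset.
```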
Omni-ID is a novel facial representation tailored for generative tasks, encoding identity features from unstructured images into a fixed-size representation that captures diverse expressions and poses.
WonderLand is a video-latent-based approach for single-image 3D reconstruction of large-scale scenes.
AC3D studies when and how camera signals should be conditioned into a video diffusion model for better camera control and higher video quality.
Magic123 proposes a hybrid score distillation algorithm and a coarse-to-fine image-to-3D pipeline that produces high-quality, high-resolution 3D content from a single unposed image.
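In spirit, the hybrid objective balances a 2D text-to-image prior (imagination and texture detail) against a 3D-aware novel-view prior (multi-view consistency). A hedged sketch, with hypothetical prior interfaces and illustrative weights rather than the paper's exact values:

```python
def hybrid_sds_loss(render, prompt, camera, prior_2d, prior_3d,
                    lambda_2d=1.0, lambda_3d=40.0):
    """Hypothetical sketch of a hybrid score-distillation objective.

    prior_2d: a text-conditioned diffusion prior (e.g., Stable Diffusion);
    prior_3d: a view-conditioned diffusion prior (e.g., Zero-1-to-3).
    `.sds(...)` is a placeholder for each prior's score-distillation loss.
    Raising lambda_3d trades 2D detail for stronger 3D consistency.
    """
    loss_2d = prior_2d.sds(render, prompt)   # encourages plausible appearance
    loss_3d = prior_3d.sds(render, camera)   # enforces view consistency
    return lambda_2d * loss_2d + lambda_3d * loss_3d
```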
Pix4Point shows that image pretraining significantly improves point cloud understanding.
PointNeXt boosts the performance of PointNet++ to the state-of-the-art level with improved training and scaling strategies.
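One of the scaling ingredients is an inverted-residual MLP block; a simplified, hypothetical sketch follows (the actual block also performs neighborhood grouping and local aggregation, omitted here):

```python
import torch.nn as nn

class InvResMLP(nn.Module):
    """Simplified sketch of an inverted-residual point MLP block.

    Expanding channels 4x, projecting back, and adding a residual
    connection is the kind of block that lets a PointNet++-style
    backbone scale in depth and width without optimization issues.
    """

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(dim, dim * expansion, kernel_size=1),
            nn.BatchNorm1d(dim * expansion),
            nn.ReLU(inplace=True),
            nn.Conv1d(dim * expansion, dim, kernel_size=1),
            nn.BatchNorm1d(dim),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                  # x: (B, C, N) per-point features
        return self.act(x + self.mlp(x))   # residual keeps deep stacks stable
```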
ASSANet makes PointNet++ faster and more accurate.

