Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors

Abstract

We present “Magic123”, a two-stage coarse-to-fine solution for high-quality, textured 3D mesh generation from a single unposed image in the wild using both 2D and 3D priors. In the first stage, we optimize a coarse neural radiance field and focus on learning geometry. In the second stage, a memory-efficient differentiable mesh representation is adopted to yield a high-resolution mesh with a visually appealing texture. In both stages, the 3D content is learned through reference-view supervision and novel views guided by both 2D and 3D diffusion priors. A tradeoff parameter between the 2D and 3D priors controls the balance between exploration (more imaginative geometry) and exploitation (more precise geometry). We further leverage textual inversion to encourage consistent appearances across views. Monocular depth estimation is used to constrain the 3D reconstruction and avoid collapsed solutions, e.g., flat geometry. Our Magic123 approach outperforms prior image-to-3D techniques by a large margin, as demonstrated through extensive experiments on various real images in the wild and on synthetic benchmarks. Our code, models, and generated 3D assets are available at https://guochengqian.github.io/project/magic123/ .
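As a minimal sketch of how the two priors could be combined under the tradeoff the abstract describes (the loss names and weights $\lambda_{2D}$, $\lambda_{3D}$ are illustrative assumptions, not notation taken from the paper), the novel-view guidance can be viewed as adding weighted 2D and 3D score-distillation terms to the reference-view reconstruction loss:

$$\mathcal{L} \;=\; \mathcal{L}_{\text{ref}} \;+\; \lambda_{2D}\,\mathcal{L}_{\text{SDS-2D}} \;+\; \lambda_{3D}\,\mathcal{L}_{\text{SDS-3D}},$$

where shifting weight toward the 2D prior encourages more imaginative (exploratory) geometry, and shifting weight toward the 3D prior yields more precise (exploitative) reconstructions.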

Publication
International Conference on Learning Representations, 2024