EasyV2V: A High-quality Instruction-based Video Editing Framework

EasyV2V is a high-quality instruction-based video editing framework that enables intuitive and precise video manipulation through natural language instructions. Our approach combines state-of-the-art video generation models with advanced …

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which …

Preventing Shortcuts in Adapter Training via Providing the Shortcuts

Adapter modules have emerged as a parameter-efficient method for fine-tuning large pre-trained models to downstream tasks. However, adapter training can suffer from shortcut learning, where the model exploits spurious correlations in the training …
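For context on the first sentence, below is a minimal sketch of a standard bottleneck adapter added residually after a frozen transformer sub-layer, in the spirit of common parameter-efficient fine-tuning recipes. The module names and dimensions are illustrative assumptions, not this paper's implementation.

    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        """Illustrative bottleneck adapter: down-project, non-linearity, up-project,
        added residually to the frozen backbone's hidden states."""
        def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
            super().__init__()
            self.down = nn.Linear(hidden_dim, bottleneck_dim)
            self.up = nn.Linear(bottleneck_dim, hidden_dim)
            self.act = nn.GELU()
            nn.init.zeros_(self.up.weight)  # start close to an identity mapping
            nn.init.zeros_(self.up.bias)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            return hidden_states + self.up(self.act(self.down(hidden_states)))

    # Only adapter parameters are trained; the pre-trained backbone stays frozen.
    adapter = BottleneckAdapter()
    x = torch.randn(2, 16, 768)  # (batch, tokens, hidden)
    y = adapter(x)

Because only the small down/up projections receive gradients, the adapter adds a tiny fraction of trainable parameters, which is also what makes it prone to latching onto spurious training-set correlations.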

ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation

Generating high-fidelity images of humans with fine-grained control over attributes such as hairstyle and clothing remains a core challenge in personalized text-to-image synthesis. While prior methods emphasize identity preservation from a reference …

I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

This paper presents ThinkDiff, a novel alignment paradigm that enables multimodal in-context understanding and reasoning in text-to-image diffusion models by integrating the capabilities of vision-language models (VLMs). Directly …
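The abstract describes aligning a VLM with a text-to-image diffusion model; one common way such an alignment can be set up is to train a small projector that maps VLM token features into the diffusion model's conditioning space. The sketch below follows that generic recipe, and every module name and dimension in it is an assumption, not ThinkDiff's design.

    import torch
    import torch.nn as nn

    class VLMToDiffusionProjector(nn.Module):
        """Hypothetical projector mapping VLM hidden states (dim 4096) to the
        diffusion model's text-conditioning space (dim 2048)."""
        def __init__(self, vlm_dim: int = 4096, cond_dim: int = 2048):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vlm_dim, cond_dim),
                nn.GELU(),
                nn.Linear(cond_dim, cond_dim),
            )

        def forward(self, vlm_tokens: torch.Tensor) -> torch.Tensor:
            return self.proj(vlm_tokens)

    # Alignment target: match the frozen text encoder's embeddings for paired captions,
    # so the VLM's multimodal reasoning can later condition the frozen diffusion decoder.
    projector = VLMToDiffusionProjector()
    vlm_tokens = torch.randn(2, 77, 4096)   # frozen VLM features (assumed shape)
    text_embeds = torch.randn(2, 77, 2048)  # frozen text-encoder features (assumed shape)
    loss = nn.functional.mse_loss(projector(vlm_tokens), text_embeds)
    loss.backward()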

Omni-ID: Holistic Identity Representation Designed for Generative Tasks

We introduce Omni-ID, a novel facial representation designed specifically for generative tasks. Omni-ID encodes holistic information about an individual's appearance across diverse expressions and poses within a fixed-size representation. It …
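As an illustration of compressing a variable number of face observations into a fixed-size representation, here is a sketch of learned-query cross-attention pooling. This is a generic construction under assumed shapes, not Omni-ID's encoder.

    import torch
    import torch.nn as nn

    class FixedSizeIdentityPool(nn.Module):
        """Generic sketch: pool features from N face images (varying expressions
        and poses) into K learned identity tokens via cross-attention."""
        def __init__(self, feat_dim: int = 768, num_tokens: int = 16, num_heads: int = 8):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_tokens, feat_dim))
            self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

        def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
            # image_feats: (batch, n_images * patches, feat_dim); n_images may vary
            q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
            pooled, _ = self.attn(q, image_feats, image_feats)
            return pooled  # (batch, num_tokens, feat_dim): fixed size regardless of input count

    pool = FixedSizeIdentityPool()
    feats = torch.randn(2, 5 * 49, 768)  # e.g. 5 images x 49 patch tokens each
    identity = pool(feats)               # always (2, 16, 768)

The point of the sketch is only that the output size is independent of how many observations are pooled, which is what a fixed-size identity representation requires.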

Wonderland: Navigating 3D Scenes from a Single Image

This paper addresses a challenging question: How can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene …

AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

In this work, we analyze camera motion from a first principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality. First, we determine that motion induced by camera movements in videos …

AToM: Amortized Text-to-Mesh using 2D Diffusion

We introduce Amortized Text-to-Mesh (AToM), a feed-forward text-to-mesh framework optimized across multiple text prompts simultaneously. In contrast to existing text-to-3D methods that often entail time-consuming per-prompt optimization and commonly …
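To make the amortized-versus-per-prompt contrast concrete, here is a schematic training loop in which a single shared generator is optimized across a batch of prompts, so inference for a new prompt is a single forward pass. All components and the loss are placeholders standing in for a text encoder, mesh decoder, and render-based guidance, not AToM's architecture.

    import torch

    # Placeholder components; in practice these would be a text encoder,
    # a feed-forward mesh generator, and a 2D-diffusion-based guidance loss.
    text_encoder = torch.nn.Linear(128, 256)
    mesh_generator = torch.nn.Linear(256, 1024)  # stands in for a mesh/triplane decoder
    optimizer = torch.optim.Adam(
        list(text_encoder.parameters()) + list(mesh_generator.parameters()), lr=1e-4
    )

    def guidance_loss(mesh_params: torch.Tensor) -> torch.Tensor:
        # Placeholder for a render-and-score objective (e.g. score distillation).
        return mesh_params.pow(2).mean()

    # Amortized: one set of weights is optimized across many prompts simultaneously,
    # instead of running a separate optimization for each individual prompt.
    for step in range(100):
        prompt_embeddings = torch.randn(8, 128)  # a batch of 8 text prompts
        meshes = mesh_generator(text_encoder(prompt_embeddings))
        loss = guidance_loss(meshes)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()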

Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors

We present “Magic123”, a two-stage coarse-to-fine solution for high-quality, textured 3D mesh generation from a single unposed image in the wild using both 2D and 3D priors. In the first stage, we optimize a coarse neural radiance field and focus …
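A minimal sketch of the coarse stage as described here: a radiance field optimized under joint guidance from a 2D and a 3D diffusion prior, with a trade-off weight between the two. The functions below are placeholders for the real renderer and frozen priors and the weight value is an assumption, not Magic123's code.

    import torch

    # Placeholder NeRF parameters and stand-in loss terms; a real implementation
    # would render the radiance field and score renders with frozen diffusion priors.
    nerf_params = torch.nn.Parameter(torch.randn(1024, 32))
    optimizer = torch.optim.Adam([nerf_params], lr=1e-2)

    def sds_loss_2d(render: torch.Tensor) -> torch.Tensor:
        # Stand-in for score distillation against a 2D text-to-image diffusion prior.
        return render.pow(2).mean()

    def sds_loss_3d(render: torch.Tensor) -> torch.Tensor:
        # Stand-in for guidance from a 3D-aware (view-conditioned) diffusion prior.
        return (render - 1.0).pow(2).mean()

    lambda_3d = 0.4  # assumed trade-off weight between the 2D and 3D priors

    for step in range(200):              # coarse stage: optimize the radiance field
        render = nerf_params.tanh()      # placeholder for a rendered novel view
        loss = sds_loss_2d(render) + lambda_3d * sds_loss_3d(render)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The weight lets the 2D prior supply appearance detail while the 3D prior keeps the geometry view-consistent; how the actual method balances the two is described in the paper, not here.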