MVGaussian: High-Fidelity text-to-3D Content Generation with Multi-View Guidance and Surface Densification
Under submission
Abstract
The field of text-to-3D content generation has made significant progress in generating realistic 3D objects, with existing methodologies like Score Distillation Sampling (SDS) offering promising guidance. However, these methods often encounter the "Janus" problem: multi-face ambiguities caused by imprecise guidance. Additionally, while recent advances in 3D Gaussian splatting have shown its efficacy in representing 3D volumes, the optimization of this representation remains largely unexplored. This paper introduces a unified framework for text-to-3D content generation that addresses these critical gaps. Our approach uses multi-view guidance to iteratively form the structure of the 3D model, progressively enhancing detail and accuracy. We also introduce a novel densification algorithm that aligns Gaussians close to the surface, improving the structural integrity and fidelity of the generated models. Extensive experiments validate our approach, demonstrating that it produces high-quality visual outputs with minimal time cost. Notably, our method achieves high-quality results within half an hour of training, a substantial efficiency gain over most existing methods, which require hours of training to achieve comparable results.
Architecture overview
Overview of our MVGaussian framework: Our approach begins by randomly initializing Gaussians within a unit sphere, which are then refined iteratively using an SDS-based optimization strategy. Gaussians are optimized to lie near the true surface: they move toward the pseudo surface, while those farther away are pruned. Each iteration renders four views with random azimuth angles and encodes them into the latent space. Gaussian noise is added and then denoised using a U-Net model to compute the loss \(\mathcal{L}_{sds}\). The gradient \(\nabla \mathcal{L}_{sds}\) updates the Gaussians, and a feedback loop with fused point cloud data and voxel downsampling enhances accuracy.
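The geometric steps in this overview (initialization inside a unit sphere, voxel downsampling of a fused point cloud, and pruning of Gaussians far from the pseudo surface) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the voxel size, the distance threshold, and the synthetic sphere used in place of a real fused point cloud are all assumptions made for illustration. The SDS update and the step that moves Gaussians toward the pseudo surface are omitted.

```python
import numpy as np

def init_gaussian_centers(n, rng):
    """Sample n center positions uniformly inside the unit sphere."""
    directions = rng.normal(size=(n, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # The cube root of a uniform variable gives uniform density in volume.
    radii = rng.uniform(size=(n, 1)) ** (1.0 / 3.0)
    return directions * radii

def voxel_downsample(points, voxel_size):
    """Replace all points falling in the same voxel by their centroid."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flatten for consistency across NumPy versions
    n_voxels = inverse.max() + 1
    sums = np.zeros((n_voxels, 3))
    np.add.at(sums, inverse, points)  # accumulate point coordinates per voxel
    counts = np.bincount(inverse, minlength=n_voxels)
    return sums / counts[:, None]

def prune_far_from_surface(centers, surface_points, max_dist):
    """Keep only centers within max_dist of some pseudo-surface point.

    Brute-force nearest-neighbor search; adequate for small illustrative inputs.
    """
    diffs = centers[:, None, :] - surface_points[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)
    return centers[nearest <= max_dist]

rng = np.random.default_rng(0)
centers = init_gaussian_centers(1000, rng)

# Hypothetical stand-in for the fused point cloud: points on a sphere of radius 0.5.
cloud = rng.normal(size=(5000, 3))
cloud = 0.5 * cloud / np.linalg.norm(cloud, axis=1, keepdims=True)

pseudo_surface = voxel_downsample(cloud, voxel_size=0.1)   # assumed voxel size
kept = prune_far_from_surface(centers, pseudo_surface, max_dist=0.1)  # assumed threshold
print(len(centers), len(pseudo_surface), len(kept))
```

In this sketch the voxel size controls how coarse the pseudo surface is, and the distance threshold controls how aggressively off-surface Gaussians are removed. Both values are illustrative rather than taken from the paper.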
Generated 3D assets from textual prompts
"An armored green-skin orc warrior riding a vicious hog."
"A forbidden castle high up in the mountains."
"A flying dragon, highly detailed, realistic, majestic."
"A 3D model of an adorable cottage with a thatched roof"
"A blue jay sitting on a willow basket of macarons"
"Medieval soldier with shield and sword, fantasy, game, character, highly detailed, photorealistic, 4K, HD"
"Jack Sparrow wearing sunglasses, head, photorealistic, 8k, HD, raw."
"A peacock standing on a surfing board, highly detailed, majestic."
Additional results from MVGaussian