The problem
Two existing ways to predict 3D geometry. Both blur the world.
A single image maps to many plausible 3D scenes. The two dominant paradigms each pay a price for that ambiguity, and the cost shows up as lost geometric detail.
Same image, three predictions.
Watch the chair spindles: the baselines smear them, PointDiT keeps them.
MoGe-2 averages the geometry; GeometryCrafter loses it in the latent space; PointDiT keeps the spindles crisp, in both the 3D point map and the depth.
Why the baselines blur.
Regression: predicts the mean.
An encoder and a regression head map the image straight to a point cloud. Facing ambiguity, the network averages plausible answers, smoothing thin structures and transparent objects away.
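A toy illustration of the averaging effect, assuming the regressor is trained with a squared-error loss (the loss is an assumption here, and the depths are made-up numbers). One pixel has two equally plausible depths, and the prediction that minimizes the expected error lands between them, on a surface that exists in neither scene.

```python
import numpy as np

# Hypothetical toy setup: one pixel whose depth is ambiguous. Half of the
# plausible scenes put a thin spindle at 1.0, the other half show the wall
# behind it at 3.0.
plausible_depths = np.array([1.0, 3.0])


def expected_squared_error(prediction, targets):
    """Mean squared error of one prediction against all plausible targets."""
    return float(np.mean((prediction - targets) ** 2))


# Scan candidate predictions and keep the one with the lowest expected error.
candidates = np.linspace(0.0, 4.0, 401)
errors = np.array([expected_squared_error(c, plausible_depths) for c in candidates])
best_prediction = float(candidates[np.argmin(errors)])

print(f"loss-minimizing prediction: {best_prediction:.2f}")
```

The minimizer is the mean of the plausible depths, which matches neither the spindle nor the wall. A generative model can instead commit to one of the plausible answers.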
Latent diffusion: compresses through a lossy VAE.
Point maps are first squeezed into a latent space by a VAE, and diffusion runs there. The VAE is lossy, so fine detail is gone before generation even begins.
The latent bottleneck, in 3D.
Stage 1 of latent diffusion on its own: encode a ground-truth point cloud and decode it straight back. Nothing is generated yet, and detail is already gone.
The reconstruction already rounds off edges and destroys fine structures, so the diffusion model is asked to recover from a degraded starting point.
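The round trip can be mimicked with a deliberately crude stand-in. The sketch below is not a VAE: it replaces the learned encoder and decoder with fixed average pooling and nearest-neighbour upsampling (both assumptions chosen for illustration), just to show how a spatial bottleneck erases a one-pixel-wide structure before any generation happens.

```python
import numpy as np


def encode(depth, factor):
    """Stand-in 'encoder': average-pool each factor x factor block."""
    h, w = depth.shape
    return depth.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def decode(latent, factor):
    """Stand-in 'decoder': nearest-neighbour upsampling back to full size."""
    return np.repeat(np.repeat(latent, factor, axis=0), factor, axis=1)


# Toy depth map: a wall at depth 3.0 with a one-pixel-wide spindle at 1.0.
depth = np.full((32, 32), 3.0)
depth[:, 10] = 1.0

# Encode and decode straight back; nothing is generated in between.
reconstruction = decode(encode(depth, factor=8), factor=8)

spindle_depth_after = float(reconstruction[0, 10])
max_error = float(np.abs(reconstruction - depth).max())
print(spindle_depth_after, max_error)
```

After the round trip the spindle has merged into the wall around it. A learned VAE is far better than this pooling stand-in, but the failure mode is the same in kind.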
So we removed the bottleneck. Diffusion runs straight on raw point maps.
The method
A plain ViT, denoising point maps in data space.
The noisy point map is patchified into tokens, exactly like an image but with XYZ coordinates as the channels instead of RGB. The clean input image is encoded by a frozen DINOv3. We combine the noisy point tokens with the clean image tokens, then a plain Transformer denoises them and recovers the clean point map. No VAE, no two-stage training.
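A minimal sketch of the tokenization step, under stated assumptions. The patch size, resolution, and model width are hypothetical. The image features are random stand-ins (no DINOv3 model is loaded). The fusion shown here, projecting both streams to a shared width and concatenating along the sequence axis, is one plausible reading of "combine", not a confirmed detail of PointDiT.

```python
import numpy as np


def patchify(point_map, patch):
    """Split an (H, W, 3) XYZ point map into (N, patch * patch * 3) tokens."""
    h, w, c = point_map.shape
    gh, gw = h // patch, w // patch
    x = point_map.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def unpatchify(tokens, patch, height, width):
    """Inverse of patchify: fold (N, patch * patch * 3) tokens back to (H, W, 3)."""
    gh, gw = height // patch, width // patch
    x = tokens.reshape(gh, gw, patch, patch, 3)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, patch, gw, patch, 3)
    return x.reshape(height, width, 3)


rng = np.random.default_rng(0)
height, width, patch, model_dim = 64, 64, 16, 32  # hypothetical sizes

noisy_point_map = rng.standard_normal((height, width, 3))
point_tokens = patchify(noisy_point_map, patch)  # (16, 768)

# Stand-in for frozen image-encoder features; no real encoder is called here.
image_tokens = rng.standard_normal((16, 48))

# Assumed fusion: project both streams to model_dim, then concatenate
# along the sequence axis so a plain Transformer can attend across them.
point_proj = rng.standard_normal((point_tokens.shape[1], model_dim)) * 0.02
image_proj = rng.standard_normal((image_tokens.shape[1], model_dim)) * 0.02
sequence = np.concatenate(
    [point_tokens @ point_proj, image_tokens @ image_proj], axis=0
)

restored = unpatchify(point_tokens, patch, height, width)
print(point_tokens.shape, sequence.shape)
```

Patchifying is a pure reshape, so `unpatchify` recovers the point map exactly. Unlike a VAE encoder, this step discards no information.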
Results
Explore the reconstructions in 3D.
Watch the thin structures, transparent objects, and relative global scales: the baselines smooth or distort them, while PointDiT stays closest to the ground truth, in both the point map and the depth.
More comparisons are provided in the gallery.
Controlled comparison
Generative flow matching vs. deterministic regression.
Same architecture, same data, same training. Only the formulation changes.
Existing methods differ in training data, architecture, and implementation, so a direct comparison cannot isolate the effect of the generative formulation. We hold all of that fixed and change only the formulation: replacing PointDiT's noise and timestep with deterministic zeros turns it into a one-pass deterministic regressor, which we compare against flow matching under identical conditions.
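The switch between the two formulations can be sketched as follows. The interpolation convention (x_t = (1 - t) * noise + t * clean, with velocity target clean - noise) is an assumption, since the text does not state which one PointDiT uses. `make_training_pair` and `euler_sample` are hypothetical helper names, and the oracle velocity stands in for a trained network.

```python
import numpy as np


def make_training_pair(clean, rng, deterministic=False):
    """Return (network input, timestep, velocity target) for one sample.

    Assumed convention: x_t = (1 - t) * noise + t * clean, target = clean - noise.
    """
    if deterministic:
        # Regression variant: noise and timestep are replaced with zeros.
        noise = np.zeros_like(clean)
        t = 0.0
    else:
        noise = rng.standard_normal(clean.shape)
        t = float(rng.uniform(0.0, 1.0))
    x_t = (1.0 - t) * noise + t * clean
    return x_t, t, clean - noise


def euler_sample(velocity_fn, start, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = start.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x


rng = np.random.default_rng(0)
clean = rng.standard_normal((4, 4, 3))  # toy "point map"

# Generative training sample: random noise and a random timestep.
x_gen, t_gen, v_gen = make_training_pair(clean, rng)

# Deterministic sample: the input carries no information, the target is the
# clean point map itself, so the model becomes a one-pass regressor.
x_det, t_det, v_det = make_training_pair(clean, rng, deterministic=True)


# Oracle velocity for a single known target, standing in for a trained model.
def oracle_velocity(x, t):
    return (clean - x) / (1.0 - t)


sample = euler_sample(oracle_velocity, rng.standard_normal(clean.shape), steps=8)
one_pass = euler_sample(lambda x, t: v_det, x_det, steps=1)
```

Under this convention the deterministic variant is the same code path with a zero input and a single step, which is what makes the comparison controlled: the architecture and the loss are unchanged.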
The deterministic regressor converges faster at first but soon overfits, while the generative model trains stably and reaches lower error.
Figure panels: structural details; transparent objects.
The generative model recovers sharper boundaries, thin structures, and transparent objects than the deterministic regressor. Overall, the generative formulation improves the boundary metric BF1 from 10.90 to 13.92 under this controlled comparison.
In summary
Takeaways
01. Simplicity is enough
A plain ViT denoising raw point-map patches matches or beats complex hybrid regressors and latent-diffusion models. No VAE, no hybrid architectures.
02. Stay in data space
Diffusing directly on point-map patches skips the lossy VAE compression that blurs fine geometry, so thin structures and sharp boundaries survive into the output.
03. Unified formulation
Casting geometry prediction as generation recovers the sharp detail that regressors average away. The formulation is general: one generative model can unify reconstruction and generation.
Looking ahead
Since the backbone is just a ViT operating in data space, the same recipe should extend with minimal change: jointly predicting appearance alongside geometry, and richer conditioning such as camera parameters or multiple views. We see pixel-space diffusion as a promising step toward VAE-free, end-to-end 3D and 4D reconstruction and generation.