PointDiT

Pixel-Space Diffusion for Monocular Geometry Estimation

Haofei Xu¹,²,³ Rundi Wu¹ Philipp Henzler¹ Nikolai Kalischek¹ Michael Oechsle¹
Fabian Manhardt¹ Marc Pollefeys²,⁴ Andreas Geiger³,⁵ Federico Tombari¹,⁶ Michael Niemeyer¹

ICML 2026

¹Google ²ETH Zurich ³University of Tübingen, Tübingen AI Center ⁴Microsoft ⁵KE:SAI ⁶TUM


Teaser (interactive 3D viewers): Gaussian noise z₀ ~ N(0, I) → PointDiT, 1 step → 3D point map

From a single image and pure Gaussian noise, PointDiT generates a dense 3D point map in one step.

Interactive viewer: predicted depth map and 3D point map at the selected number of sampling steps (1 to 4)

Running more sampling steps with the same network refines fine detail and sharpens the geometry.
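A minimal sketch of what "more steps with the same network" could look like, assuming the model is sampled as a flow from noise (t = 0) to data (t = 1) with uniform Euler steps. The names `euler_sample` and `toy_velocity` are hypothetical, and the toy velocity field stands in for the trained network; the page does not state the exact parameterization or sampler.

```python
import numpy as np

def euler_sample(velocity_fn, z0, num_steps):
    """Integrate dz/dt = velocity_fn(z, t) from t=0 (noise) to t=1 (data)
    with num_steps uniform Euler steps. The same velocity_fn is reused
    regardless of how many steps are taken."""
    z = z0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        z = z + dt * velocity_fn(z, t)
    return z

# Hypothetical stand-in for the trained network: the exact velocity of a
# straight path toward a fixed target point map, v = (target - z) / (1 - t).
target = np.ones((4, 4, 3))

def toy_velocity(z, t):
    return (target - z) / (1.0 - t)

rng = np.random.default_rng(0)
z0 = rng.standard_normal(target.shape)  # z0 ~ N(0, I)

one_step = euler_sample(toy_velocity, z0, num_steps=1)
four_step = euler_sample(toy_velocity, z0, num_steps=4)
```

In this toy the path is exactly straight, so one step already lands on the target. A learned velocity field is generally not perfectly straight, which is where additional steps can refine the result.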


The problem

Two existing ways to predict 3D geometry. Both blur the world.

A single image maps to many plausible 3D scenes. The two dominant paradigms each pay a price for that ambiguity, and the cost shows up as lost geometric detail.

Same image, three predictions.

Watch the chair spindles: the baselines smear them, PointDiT keeps them.

Input image: RGB image of a bedroom desk and chair.

MoGe-2 (regression): point map and depth. The predicted depth is over-smoothed.

GeometryCrafter (latent diffusion): point map and depth. Detail is lost through the VAE.

PointDiT (pixel-space diffusion): point map and depth. Sharp detail is preserved.

MoGe-2 averages the geometry; GeometryCrafter loses it in the latent space; PointDiT keeps the spindles crisp, in both the 3D point map and the depth.

Why the baselines blur.

Deterministic regression pipeline: image to encoder to head to regressed point cloud


Deterministic regression · MoGe-2

Predicts the mean.

An encoder and a regression head map the image straight to a point cloud. Facing ambiguity, the network averages plausible answers, smoothing thin structures and transparent objects away.
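The averaging effect can be illustrated with a toy calculation, assuming a squared-error training loss (an assumption for illustration; the page does not state which loss the baseline uses). The depth values are made up: one pixel could plausibly belong to a thin foreground structure at 1 m or to the wall behind it at 3 m.

```python
import numpy as np

# Two equally plausible depths (in metres) for the same pixel (made-up values).
plausible_depths = np.array([1.0, 3.0])

def expected_squared_error(prediction):
    """Mean squared error of a single prediction against both plausible depths."""
    return float(np.mean((plausible_depths - prediction) ** 2))

# Scan candidate predictions and keep the one with the lowest expected error.
candidates = np.linspace(0.0, 4.0, 401)
errors = [expected_squared_error(c) for c in candidates]
best = float(candidates[int(np.argmin(errors))])
```

The error-minimizing prediction is the mean of the two depths, which lies on neither surface. A generative model can instead sample one of the plausible answers.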

Latent diffusion pipeline: a VAE reconstruction stage and a latent diffusion generation stage


Latent diffusion · GeometryCrafter

Compresses through a lossy VAE.

Point maps are first squeezed into a latent space by a VAE, and diffusion runs there. The VAE is lossy, so fine detail is gone before generation even begins.

The latent bottleneck, in 3D.

Stage 1 of latent diffusion on its own: encode a ground-truth point cloud and decode it straight back. Nothing is generated yet, and detail is already gone.

Ground truth: input point cloud.

VAE reconstruction: encode → decode, no diffusion.

The reconstruction already rounds off edges and destroys fine structures, so the diffusion model is asked to recover from a degraded starting point.

So we removed the bottleneck. Diffusion runs straight on raw point maps.

The method

A plain ViT, denoising point maps in data space.

The noisy point map is patchified into tokens, exactly like an image but with XYZ coordinates as the channels instead of RGB. The clean input image is encoded by a frozen DINOv3. We combine the noisy point tokens with the clean image tokens, then a plain Transformer denoises them and recovers the clean point map. No VAE, no two-stage training.

Figure 1. PointDiT architecture: the noisy point map is patchified into tokens and the input image is encoded by a frozen DINOv3. Both sets of tokens are linearly embedded and processed by Transformer blocks, and the predicted tokens are unpatchified into the clean point map.
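The patchify and unpatchify steps described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the released code: the function names, the patch size of 16, and the (H, W, 3) layout are assumptions, and how the point tokens are combined with the DINOv3 image tokens is not shown.

```python
import numpy as np

def patchify(point_map, patch):
    """Split an (H, W, C) point map into non-overlapping patch tokens.

    Returns an array of shape (num_tokens, patch * patch * C), where each
    token holds the XYZ values of one patch x patch block.
    """
    h, w, c = point_map.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    gh, gw = h // patch, w // patch
    x = point_map.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def unpatchify(tokens, height, width, patch):
    """Inverse of patchify: reassemble tokens into an (H, W, C) map."""
    gh, gw = height // patch, width // patch
    c = tokens.shape[1] // (patch * patch)
    x = tokens.reshape(gh, gw, patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, patch, gw, patch, c)
    return x.reshape(height, width, c)

rng = np.random.default_rng(0)
noisy_point_map = rng.standard_normal((32, 48, 3))  # XYZ channels instead of RGB
tokens = patchify(noisy_point_map, patch=16)
restored = unpatchify(tokens, 32, 48, patch=16)
```

Because the two functions are exact inverses, no information is lost between the point map and its token representation, unlike a learned VAE encoder and decoder.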

Results

Explore the reconstructions in 3D.

Interactive 3D comparison for a selected sample: input image, ground truth (reference), MoGe-2 (regression), GeometryCrafter (latent diffusion), and PointDiT (pixel-space diffusion) point maps.

Watch the thin structures, transparent objects, and relative global scales: the baselines smooth or distort them, while PointDiT stays closest to the ground truth, in both the point map and the depth.

More comparisons are provided in the gallery.

Controlled comparison

Generative flow matching vs. deterministic regression.

Same architecture, same data, same training. Only the formulation changes.

Existing methods differ in training data, architecture, and implementation, so a direct comparison cannot isolate the effect of the generative formulation. We hold all of that fixed and change only the formulation: replacing PointDiT's noise and timestep with deterministic zeros turns it into a one-pass deterministic regressor, which we compare against flow matching under identical conditions.
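One way to picture this switch is the sketch below, which builds the network input and training target for both variants from the same code path. It assumes a linear flow-matching path with noise at t = 0, data at t = 1, and a velocity target; these conventions and the name `make_training_inputs` are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def make_training_inputs(clean_points, rng, generative=True):
    """Return (network_input, t, target) for one training example.

    generative=True : sample noise and a timestep (flow matching).
    generative=False: replace both with deterministic zeros (regression).
    """
    if generative:
        noise = rng.standard_normal(clean_points.shape)
        t = float(rng.uniform(0.0, 1.0))
    else:
        noise = np.zeros_like(clean_points)
        t = 0.0
    # Assumed linear path between noise (t=0) and data (t=1).
    network_input = (1.0 - t) * noise + t * clean_points
    # Assumed velocity target along that path.
    target = clean_points - noise
    return network_input, t, target

rng = np.random.default_rng(0)
clean = rng.standard_normal((8, 8, 3))  # stand-in for a ground-truth point map

gen_input, gen_t, gen_target = make_training_inputs(clean, rng, generative=True)
det_input, det_t, det_target = make_training_inputs(clean, rng, generative=False)
```

Under these assumptions the deterministic variant receives an all-zero input and is trained to output the clean point map directly, so architecture, data, and training loop stay identical and only the inputs differ.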

Figure: validation curves.

The deterministic regressor converges faster at first but soon overfits, while the generative model trains stably and reaches lower error.

Input image: a classroom wall with a slatted air vent

Deterministic regression depth: the thin vent slats are smoothed away

The generative model recovers sharper boundaries, thin structures, and transparent objects than the deterministic regressor. Overall, the generative formulation improves the boundary metric BF1 from 10.90 to 13.92 under this controlled comparison.

In summary

Takeaways

01 · Simplicity is enough

A plain ViT denoising raw point-map patches matches or beats complex hybrid regressors and latent-diffusion models. No VAE, no hybrid architectures.

02 · Stay in data space

Diffusing directly on point-map patches skips the lossy VAE compression that blurs fine geometry, so thin structures and sharp boundaries survive into the output.

03 · Unified formulation

Casting geometry prediction as generation recovers the sharp detail that regressors average away. The formulation is general: one generative model can unify reconstruction and generation.

Looking ahead

Since the backbone is just a ViT operating in data space, the same recipe should extend with minimal change to jointly predicting appearance alongside geometry and to richer conditioning such as camera parameters or multiple views. We see pixel-space diffusion as a promising step toward VAE-free, end-to-end 3D and 4D reconstruction and generation.
