PointDiT

Pixel-Space Diffusion for Monocular Geometry Estimation

ICML 2026

¹Google   ²ETH Zurich   ³University of Tübingen, Tübingen AI Center   ⁴Microsoft   ⁵KE:SAI   ⁶TUM

[Interactive figure: an input image (condition) and Gaussian noise z₀ ~ N(0, I) pass through PointDiT in a single step to produce a 3D point map]

From a single image and pure Gaussian noise, PointDiT generates a dense 3D point map in one step.

[Interactive figure: predicted depth map, zoomed-in detail, and 3D point map at a selectable number of sampling steps (1 to 4)]

Running more sampling steps with the same network refines fine detail and sharpens the geometry.
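
The relationship between one-step and multi-step sampling can be sketched with a generic Euler sampler for a flow model. This is a minimal illustration, not PointDiT's implementation: `make_toy_velocity` is a hypothetical stand-in for the trained network, the time convention (t = 0 is noise, t = 1 is data) and the uniform step schedule are assumptions, and image conditioning is omitted.

```python
import numpy as np

def euler_sample(velocity_fn, z0, num_steps):
    """Integrate dz/dt = velocity_fn(z, t) from t=0 to t=1 with uniform Euler steps."""
    z = np.array(z0, dtype=np.float64)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # the last step starts at t = 1 - dt, so t never reaches 1
        z = z + dt * velocity_fn(z, t)
    return z

def make_toy_velocity(target):
    """Hypothetical stand-in for the network: a field that points at one fixed target."""
    def velocity_fn(z, t):
        # Velocity that carries z to `target` exactly at t = 1 along a straight line.
        return (target - z) / (1.0 - t)
    return velocity_fn

rng = np.random.default_rng(0)
height, width = 4, 6
target = np.ones((height, width, 3))           # toy "clean" point map (XYZ per pixel)
z0 = rng.standard_normal((height, width, 3))   # z0 ~ N(0, I)

velocity_fn = make_toy_velocity(target)
one_step = euler_sample(velocity_fn, z0, num_steps=1)
four_steps = euler_sample(velocity_fn, z0, num_steps=4)
```

In this toy field the path is perfectly straight, so a single step already lands on the target. A trained network's field is only approximately straight, which is one plausible reason extra steps can sharpen the result.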


The problem

Two existing ways to predict 3D geometry. Both blur the world.

A single image maps to many plausible 3D scenes. The two dominant paradigms each pay a price for that ambiguity, and the cost shows up as lost geometric detail.

Same image, three predictions.

Watch the chair spindles: the baselines smear them, PointDiT keeps them.

[Interactive figure: input image of a bedroom desk and chair, with the predicted point map and depth from MoGe-2 (regression), GeometryCrafter (latent diffusion), and PointDiT (pixel-space diffusion)]

MoGe-2 averages the geometry; GeometryCrafter loses it in the latent space; PointDiT keeps the spindles crisp, in both the 3D point map and the depth.

Why the baselines blur.

[Figure: deterministic regression pipeline, image → encoder → head → regressed point cloud]
Deterministic regression · MoGe-2

Predicts the mean.

An encoder and a regression head map the image straight to a point cloud. Facing ambiguity, the network averages plausible answers, smoothing thin structures and transparent objects away.
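
A toy calculation illustrates the averaging effect. Suppose, purely as an assumed example, that a pixel near a thin structure is equally likely to lie at depth 1.0 (the structure) or 3.0 (the background). The prediction that minimizes expected squared error is the midpoint, which matches neither hypothesis. The depths and the squared-error loss are illustrative choices, not details taken from MoGe-2.

```python
import numpy as np

# Two equally likely depth hypotheses for the same pixel (assumed values).
plausible_depths = np.array([1.0, 3.0])

# Scan candidate predictions and score each by its expected squared error.
candidates = np.linspace(0.0, 4.0, 401)
errors = (candidates[:, None] - plausible_depths[None, :]) ** 2
expected_mse = errors.mean(axis=1)

# The minimizer sits between the two hypotheses, not on either of them.
best_prediction = candidates[np.argmin(expected_mse)]
print(best_prediction)
```

The minimizer is the mean of the two hypotheses, so a regressor trained this way places the pixel in empty space between the structure and the background.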

[Figure: latent diffusion pipeline, a VAE reconstruction stage followed by a latent diffusion generation stage]
Latent diffusion · GeometryCrafter

Compresses through a lossy VAE.

Point maps are first squeezed into a latent space by a VAE, and diffusion runs there. The VAE is lossy, so fine detail is gone before generation even begins.

The latent bottleneck, in 3D.

Stage 1 of latent diffusion on its own: encode a ground-truth point cloud and decode it straight back. Nothing is generated yet, and detail is already gone.

[Interactive figure: ground-truth input point cloud next to its VAE reconstruction (encode → decode, no diffusion)]

The reconstruction already rounds off edges and destroys fine structures, so the diffusion model is asked to recover from a degraded starting point.

So we removed the bottleneck. Diffusion runs straight on raw point maps.

The method

A plain ViT, denoising point maps in data space.

The noisy point map is patchified into tokens, exactly like an image but with XYZ coordinates as the channels instead of RGB. The clean input image is encoded by a frozen DINOv3. We combine the noisy point tokens with the clean image tokens, then a plain Transformer denoises them and recovers the clean point map. No VAE, no two-stage training.
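
The patchify step can be sketched in a few lines of NumPy. The patch size of 16 and the 224 by 224 resolution are assumptions chosen for illustration, since the page does not state them. The linear projection to the model width and the combination with DINOv3 image tokens are left out.

```python
import numpy as np

def patchify(point_map, patch_size):
    """Split an (H, W, 3) XYZ map into flattened, non-overlapping patch tokens."""
    height, width, channels = point_map.shape
    assert height % patch_size == 0 and width % patch_size == 0
    rows, cols = height // patch_size, width // patch_size
    patches = point_map.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (rows, cols, p, p, C)
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

def unpatchify(tokens, height, width, patch_size, channels=3):
    """Inverse of patchify: reassemble tokens into an (H, W, C) map."""
    rows, cols = height // patch_size, width // patch_size
    patches = tokens.reshape(rows, cols, patch_size, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (rows, p, cols, p, C)
    return patches.reshape(height, width, channels)

rng = np.random.default_rng(0)
noisy_point_map = rng.standard_normal((224, 224, 3))  # channels are X, Y, Z
tokens = patchify(noisy_point_map, patch_size=16)      # one token per 16x16 patch
restored = unpatchify(tokens, 224, 224, patch_size=16)
```

Under these assumed sizes the map becomes 196 tokens of 768 values each, and `unpatchify` recovers the original array exactly, so the tokenization itself loses no information.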


Results

Explore the reconstructions in 3D.

[Interactive figure: for a selected sample, the input image alongside the point map and depth from the ground truth (reference), MoGe-2 (regression), GeometryCrafter (latent diffusion), and PointDiT (pixel-space diffusion)]

Watch the thin structures, transparent objects, and relative global scales: the baselines smooth or distort them, while PointDiT stays closest to the ground truth, in both the point map and the depth.

More comparisons are provided in the gallery.

Controlled comparison

Generative flow matching vs. deterministic regression.

Same architecture, same data, same training. Only the formulation changes.

Existing methods differ in training data, architecture, and implementation, so a direct comparison cannot isolate the effect of the generative formulation. We hold all of that fixed and change only the formulation: replacing PointDiT's noise and timestep with deterministic zeros turns it into a one-pass deterministic regressor, which we compare against flow matching under identical conditions.
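
One way to read this switch, as a hedged sketch: under a linear flow-matching path with a velocity target (an assumption, since the page does not state the parameterization), setting the noise and timestep to zero leaves the network with an all-zero geometric input and a target equal to the clean point map. That is ordinary regression from the image alone. The function names below are hypothetical.

```python
import numpy as np

def flow_matching_example(clean, noise, t):
    """Build (network input, timestep, velocity target) on a linear noise-to-data path."""
    noisy = (1.0 - t) * noise + t * clean  # t = 0 is pure noise, t = 1 is clean data
    target_velocity = clean - noise        # constant velocity along the straight path
    return noisy, t, target_velocity

def training_example(clean, rng, deterministic):
    """Generative and deterministic variants differ only in the noise and timestep."""
    if deterministic:
        noise = np.zeros_like(clean)       # noise replaced by zeros
        t = 0.0                            # timestep replaced by zero
    else:
        noise = rng.standard_normal(clean.shape)
        t = float(rng.uniform(0.0, 1.0))
    return flow_matching_example(clean, noise, t)

rng = np.random.default_rng(0)
clean_point_map = rng.uniform(-1.0, 1.0, size=(8, 8, 3))  # toy ground-truth XYZ map

gen_input, gen_t, gen_target = training_example(clean_point_map, rng, deterministic=False)
reg_input, reg_t, reg_target = training_example(clean_point_map, rng, deterministic=True)
```

In the deterministic branch the input carries no information and the target is the point map itself, so everything else in the training setup can stay unchanged while the formulation switches from generation to regression.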

[Figure: validation curves]

The deterministic regressor converges faster at first but soon overfits, while the generative model trains stably and reaches lower error.

The generative model recovers sharper boundaries, thin structures, and transparent objects than the deterministic regressor. Overall, the generative formulation improves the boundary metric BF1 from 10.90 to 13.92 under this controlled comparison.

In summary

Takeaways

  1. Simplicity is enough

    A plain ViT denoising raw point-map patches matches or beats complex hybrid regressors and latent-diffusion models. No VAE, no hybrid architectures.

  2. Stay in data space

    Diffusing directly on point-map patches skips the lossy VAE compression that blurs fine geometry, so thin structures and sharp boundaries survive into the output.

  3. Unified formulation

    Casting geometry prediction as generation recovers the sharp detail that regressors average away. The formulation is general: one generative model can unify reconstruction and generation.

Looking ahead

Since the backbone is just a ViT operating in data space, the same recipe should extend with minimal change to jointly predicting appearance alongside geometry and to richer conditioning such as camera parameters or multiple views. We see pixel-space diffusion as a promising step toward VAE-free, end-to-end 3D and 4D reconstruction and generation.