University of California, Berkeley
Berkeley Artificial Intelligence Research
CVPR 2026 · Highlight

Video-GMAE: point tracking from Gaussian motion

Tracking by Predicting 3-D Gaussians Over Time

Tanish Baranwal, Himanshu Gaurav Singh, Jathushan Rajasegaran, Jitendra Malik
Tracking emerges for free.

A masked autoencoder learns video as moving 3-D Gaussians, and a point tracker falls out with zero tracking labels.

Video-GMAE overview figure
Zero-shot tracks from Gaussian motion on DAVIS.
95% masked video tokens
16 frames per clip
256 Gaussians per frame
0 tracking labels
Kinetics · Kubric · DAVIS unlabeled pretraining video

01 The idea

Video as moving 3-D Gaussians

Each primitive keeps its identity

Motion is modeled as a residual ΔG applied to a persistent set of primitives, so the Gaussian at frame t is the same Gaussian at frame t+1.

Correspondence as a 3-D prior

Encoding video as moving Gaussians makes the SSL task harder and pushes latents toward long-range correspondence.

Only part of each primitive moves

The 3-D mean μ and color r receive per-frame residuals; scale, rotation, and opacity stay fixed.
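A minimal sketch of this parameterization, assuming the residuals accumulate frame to frame (as in \(\mu_i^{(t+1)}=\mu_i^{(t)}+\Delta\mu_i^{(t)}\)) and that color follows the same rule. The function name and array layout are hypothetical, not taken from the released code.

```python
import numpy as np

def rollout_gaussians(mu0, color0, d_mu, d_color):
    """Accumulate per-frame residuals into per-frame means and colors.

    mu0:     (N, 3) 3-D means at frame 0
    color0:  (N, 3) colors at frame 0
    d_mu:    (T-1, N, 3) mean residuals; d_mu[t] moves frame t to frame t+1
    d_color: (T-1, N, 3) color residuals (assumed to accumulate the same way)

    Returns mu and color with shape (T, N, 3). Row i refers to the same
    primitive in every frame, which is what makes correspondence readable.
    """
    mu = np.concatenate([mu0[None], mu0[None] + np.cumsum(d_mu, axis=0)], axis=0)
    color = np.concatenate(
        [color0[None], color0[None] + np.cumsum(d_color, axis=0)], axis=0
    )
    return mu, color
```

Scale, rotation, and opacity would be stored once per primitive, for example as arrays of shape `(N, 3)`, `(N, 4)`, and `(N,)`, and reused unchanged for all frames.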

02 Method pipeline

Pretrain, read out, fine-tune

Pretraining pipeline with Gaussian splatting
Video masked autoencoding through Gaussian splatting: the decoder predicts a moving Gaussian scene, then reconstruction closes the loop.
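As a rough illustration of the 95% masking ratio, the sketch below draws a random token mask. It assumes uniform random masking over space-time tokens; the page does not state the masking strategy or the token count, so `num_tokens` and the function name are hypothetical.

```python
import numpy as np

def random_mask(num_tokens, mask_ratio=0.95, rng=None):
    """Pick which tokens the encoder sees under a given mask ratio.

    Returns (keep, mask): sorted indices of visible tokens, and a boolean
    array that is True where a token is masked.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_keep = max(1, int(round(num_tokens * (1.0 - mask_ratio))))
    keep = np.sort(rng.permutation(num_tokens)[:num_keep])
    mask = np.ones(num_tokens, dtype=bool)
    mask[keep] = False
    return keep, mask
```

With a hypothetical 1568 tokens per clip, this leaves 78 visible tokens for the encoder; the decoder has to predict the moving Gaussian scene from those alone.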
Zero-shot tracking schematic
Render Gaussian displacement into image-plane flow, then advect query points.
Zero-shot track: no labels · no flow · no boxes
Project: \(x_i^{(t)} = \Pi(K[R|t], \mu_i^{(t)})\)
Displace: \(\Delta x_i^{(t+1)} = x_i^{(t+1)} - x_i^{(t)}\)
Render flow: \(F^{(t)}(u) = \sum_{i=1}^{n}\alpha_i^{(t)}(u)\cdot\Delta x_i^{(t)}\)
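The project, displace, and render-flow formulas can be sketched in NumPy as follows. This is an illustration under stated assumptions: \(\Pi\) is taken to be a standard pinhole projection, and `splat_weights` is a simplified stand-in for the renderer's per-pixel weights \(\alpha_i(u)\) (isotropic footprints, normalized per pixel, no depth ordering). All function names are hypothetical.

```python
import numpy as np

def project(K, R, t, mu):
    """Pinhole projection of (N, 3) world points to (N, 2) pixel coordinates."""
    cam = mu @ R.T + t            # world frame -> camera frame
    pix = cam @ K.T               # camera frame -> homogeneous pixels
    return pix[:, :2] / pix[:, 2:3]

def splat_weights(x, sigma, opacity, H, W):
    """Stand-in for alpha_i(u): isotropic 2-D Gaussian footprints, (N, H, W).

    Assumption: weights are normalized per pixel and ignore depth ordering,
    unlike a real Gaussian-splatting rasterizer.
    """
    ys, xs = np.mgrid[0:H, 0:W]
    d2 = (xs[None] - x[:, 0, None, None]) ** 2 + (ys[None] - x[:, 1, None, None]) ** 2
    w = opacity[:, None, None] * np.exp(-0.5 * d2 / sigma**2)
    return w / np.maximum(w.sum(axis=0, keepdims=True), 1e-8)

def render_flow(x_t, x_next, sigma, opacity, H, W):
    """Splat each Gaussian's 2-D displacement to get dense flow (H, W, 2)."""
    dx = x_next - x_t                                   # displacement from t to t+1
    alpha = splat_weights(x_t, sigma, opacity, H, W)    # weights at frame t
    flow = np.einsum("nhw,nc->hwc", alpha, dx)          # sum_i alpha_i(u) * dx_i
    return flow, alpha
```

For example, with focal length 100 px and Gaussians at depth 2, shifting every mean by 0.1 along the world x-axis yields a flow of about 5 px in x wherever the Gaussians have support.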
  1. Predict Gaussian means through time with \(\mu_i^{(t+1)}=\mu_i^{(t)}+\Delta\mu_i^{(t)}\), then project the same primitives at adjacent frames into image coordinates.
  2. Replace each Gaussian's color with its 2-D displacement \((\Delta x_{i,x}^{(t)}, \Delta x_{i,y}^{(t)}, 0)\) and splat those values with the renderer's opacity weights to obtain dense flow \(F^{(t)}\).
  3. For each query point, bilinearly sample flow and opacity at \(p^{(t)}\). The pure flow proposal is \(a^{(t)}=p^{(t)}+F^{(t)}(p^{(t)})\).
  4. Fix a top-k anchor set \(\mathcal{S}\) from frame 0, compute anchor mass \(\omega^{(t)}=\sum_{i\in\mathcal{S}}\alpha_i^{(t)}(p^{(t)})\), and renormalize anchor weights \(\tilde{\pi}_i^{(t)}\).
  5. Use the anchor proposal \(s^{(t+1)}=\sum_{i\in\mathcal{S}}\tilde{\pi}_i^{(t)}(x_i^{(t)}+\Delta x_i^{(t+1)})\) to preserve identity around occlusion.
  6. If \(\omega^{(t)}\ge\tau_{\mathrm{vis}}\), update \(p^{(t+1)}=(1-\beta)a^{(t)}+\beta s^{(t+1)}\) and mark visible; otherwise set \(p^{(t+1)}=s^{(t+1)}\) and mark occluded.
\(k = 8\) · \(\tau_{\mathrm{vis}} = 0.5\) · \(\beta = 0.3\)
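The per-frame update in the steps above can be sketched as below. It takes an already rendered flow field and per-Gaussian weight maps as inputs. Two details are assumptions rather than statements from the page: the renormalized weights are taken to be \(\tilde{\pi}_i = \alpha_i / \omega\), and a uniform fallback is used when the anchor mass is zero. Function names are hypothetical.

```python
import numpy as np

def bilinear(img, p):
    """Bilinearly sample img (H, W, C) at p = (x, y) in pixel coordinates."""
    H, W = img.shape[:2]
    x = float(np.clip(p[0], 0, W - 1))
    y = float(np.clip(p[1], 0, H - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x1]
    bot = (1 - fx) * img[y1, x0] + fx * img[y1, x1]
    return (1 - fy) * top + fy * bot

def select_anchors(alpha0, p0, k=8):
    """Indices of the top-k Gaussians by weight at the query point in frame 0."""
    w = bilinear(np.moveaxis(alpha0, 0, -1), p0)          # (N,)
    return np.argsort(w)[::-1][:k]

def track_step(p, flow, alpha, x_t, x_next, anchors, tau_vis=0.5, beta=0.3):
    """Advance one query point from frame t to t+1; returns (p_next, visible).

    flow:  (H, W, 2) rendered flow at frame t
    alpha: (N, H, W) per-Gaussian weights at frame t
    x_t, x_next: (N, 2) projected Gaussian centers at frames t and t+1
    """
    a = p + bilinear(flow, p)                             # pure flow proposal
    w = bilinear(np.moveaxis(alpha, 0, -1), p)[anchors]   # anchor weights at p
    omega = float(w.sum())                                # anchor mass
    if omega > 1e-8:
        pi = w / omega                                    # assumed renormalization
    else:
        pi = np.full(len(anchors), 1.0 / len(anchors))    # assumed fallback
    dx = x_next - x_t
    s = pi @ (x_t[anchors] + dx[anchors])                 # anchor proposal
    if omega >= tau_vis:
        return (1 - beta) * a + beta * s, True            # blend, mark visible
    return s, False                                       # anchors only, mark occluded
```

Calling `select_anchors` once on frame 0 and then `track_step` for each consecutive frame pair produces a trajectory with a per-frame visibility flag.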
Fine-tuning cross-attention readout
Fine-tuning adds a supervised cross-attention readout over frozen or trainable encoder latents to sharpen localization and occlusion handling.
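For orientation only, a single-head cross-attention readout could look like the sketch below. The page does not describe the readout architecture, so the query construction, the projection matrices, and the three-channel output (x, y, visibility logit) are all hypothetical.

```python
import numpy as np

def cross_attention_readout(queries, latents, Wq, Wk, Wv, Wout):
    """Single-head cross-attention from query embeddings to encoder latents.

    queries: (Q, d_model) one embedding per (point, frame) query (hypothetical)
    latents: (M, d_model) encoder latents, frozen or trainable
    Wq, Wk, Wv: (d_model, d) projections; Wout: (d, 3) output head

    Returns (out, attn): out is (Q, 3), read here as x, y, and a visibility
    logit; attn is the (Q, M) attention matrix.
    """
    q = queries @ Wq
    k = latents @ Wk
    v = latents @ Wv
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    return (attn @ v) @ Wout, attn
```

With a frozen encoder, only the readout weights would receive gradients from the Kubric labels; with a trainable encoder, the latents would be updated as well.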

03 Results map

Coherent across motion and occlusion

Zero-shot tracks emerge directly from Gaussian motion, and light supervision sharpens the readout.

Zero-shot

No labels, no flow, no boxes

Stable tracks emerge directly from Gaussian motion.

Comparisons

Stable over time

Video-GMAE tends to preserve temporal coherence; GMRW-C can keep tiny fast details tighter.

Fine-tuned

Supervised readout

Kubric labels sharpen trajectories and occlusion handling.

Reconstructions

What the Gaussians capture

Rendered pretraining outputs show coarse structure, motion, and the limits of the Gaussian budget.


04 Tracks across benchmarks

Zero-shot tracking

Correspondence emerges directly from the moving Gaussians. The tracker below uses no tracking labels.

DAVIS zero-shot: animals, crowds, vehicles, and occlusions.

Kinetics zero-shot: everyday action clips with varied motion.

Failure cases: camera motion and fine detail can stress the Gaussian budget.

05 Comparisons

Video-GMAE vs. GMRW-C

Video-GMAE is more temporally stable in many clips; GMRW-C can better preserve tiny, fast details.

TAP-Vid Kinetics comparison.

TAP-Vid DAVIS comparison.

06 Fine-tuning

Fine-tuned point tracks

Tuning on labeled tracks from Kubric improves tracking further.

Fine-tuned: TAP-Vid DAVIS, two looped sequences.

Fine-tuned: TAP-Vid Kinetics, two looped sequences.

07 Reconstructions

Pretraining renders

Rendered Gaussian trajectories during pretraining capture coarse structure and motion; fine detail is limited by the 256-Gaussian budget.

Dynamic reconstruction from Gaussians.

08 Benchmarks

Zero-shot and fine-tuned results

Reading motion straight off the Gaussians tops prior SSL trackers; Kubric labels lift the readout into supervised tracker range.

AJ ↑ · no labels
Dataset    Prior SSL   Ours   Δ OA
Kubric     54.2        54.3   +9.3
DAVIS      41.8        41.3   +6.9
Kinetics   46.9        60.1   +8.9

Zero-shot: 60.1 AJ on Kinetics with no trained tracking readout.

Frozen encoder, Kinetics AJ · supervised readout
Encoder      AJ
MAE-ST       42.3
VideoMAE     46.9
Video-GMAE   65.1

Fine-tuned: reaches 74.0 / 75.1 AJ on Kubric / Kinetics, matching state-of-the-art supervised trackers.

09 Limits

Where it breaks

10 Citation

BibTeX

@InProceedings{Baranwal_2026_CVPR,
  author    = {Baranwal, Tanish and Singh, Himanshu Gaurav and Rajasegaran, Jathushan and Malik, Jitendra},
  title     = {Tracking by Predicting 3-D Gaussians Over Time},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026},
  pages     = {42527-42537}
}