Supplementary Videos:
Tracking by Predicting 3-D Gaussians Over Time

Authors are anonymous for CVPR submission.

Zero-shot Tracking

These clips highlight qualitative behavior: stable tracks that persist once visible and conservative occlusion handling that avoids flicker.

Small, fast-moving structures expose the trade-off of our 256-Gaussian budget: Video-GMAE stays smooth but can miss fine detail, whereas GMRW-C preserves tiny parts better at the cost of more jitter and occasional early occlusions or re-activations.

Failure modes: complex, moving backgrounds with camera motion absorb Gaussian capacity and reduce foreground fidelity.

Zero-shot tracking on TAP-Vid DAVIS; two sequences play in succession and loop.

Zero-shot tracking on TAP-Vid Kinetics; two sequences play in succession and loop.

Zero-shot tracking failure cases in complex scenes.

Qualitative Comparisons vs. GMRW-C

Hand-picked TAP-Vid Kinetics and TAP-Vid DAVIS sequences contrast Video-GMAE zero-shot (green border) against GMRW-C. Our method shows smoother, longer-lived tracks with fewer identity flips, while GMRW-C can better maintain very small, detailed regions that move quickly.

On DAVIS, Video-GMAE's visibility predictions remain more stable even when its spatial precision is slightly softer on tiny details. This is consistent with our Average Jaccard (AJ) being similar to that of GMRW-C: AJ is occlusion-aware, so matching AJ while achieving better occlusion accuracy implies that Video-GMAE's positional precision is slightly lower.
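To make the interplay between AJ, occlusion accuracy, and positional precision concrete, here is a minimal NumPy sketch of the three TAP-Vid-style metrics. It is an illustration, not our evaluation code: the function name `tapvid_metrics`, the array layout, and the toy inputs are assumptions, and it omits details of the official protocol such as excluding query frames and rescaling coordinates to 256x256.

```python
import numpy as np

THRESHOLDS_PX = (1, 2, 4, 8, 16)  # pixel thresholds used by the TAP-Vid metrics


def tapvid_metrics(gt_xy, gt_visible, pred_xy, pred_visible):
    """Simplified TAP-Vid-style metrics (hypothetical helper).

    gt_xy, pred_xy: (N, T, 2) point positions in pixels.
    gt_visible, pred_visible: (N, T) boolean visibility flags.
    Assumes at least one ground-truth-visible point.
    """
    gt_visible = np.asarray(gt_visible, dtype=bool)
    pred_visible = np.asarray(pred_visible, dtype=bool)
    dist = np.linalg.norm(
        np.asarray(pred_xy, dtype=float) - np.asarray(gt_xy, dtype=float), axis=-1
    )

    # Occlusion accuracy: fraction of points whose visibility flag is correct.
    occlusion_accuracy = np.mean(pred_visible == gt_visible)

    n_gt_visible = gt_visible.sum()  # equals true positives + false negatives
    frac_within, jaccard = [], []
    for thr in THRESHOLDS_PX:
        within = dist < thr
        # Positional accuracy ignores predicted visibility.
        frac_within.append((within & gt_visible).sum() / n_gt_visible)
        # True positive: predicted visible, truly visible, and within thr.
        tp = (within & gt_visible & pred_visible).sum()
        # False positive: predicted visible but truly occluded or too far away.
        fp = (pred_visible & (~gt_visible | ~within)).sum()
        jaccard.append(tp / (n_gt_visible + fp))

    return {
        "occlusion_accuracy": float(occlusion_accuracy),
        "delta_avg": float(np.mean(frac_within)),
        "average_jaccard": float(np.mean(jaccard)),
    }


# Toy data: one point over four frames, occluded in the last frame.
gt_xy = np.zeros((1, 4, 2))
gt_vis = np.array([[True, True, True, False]])

# Tracker A: perfect visibility, but positions are off by 3 px.
a = tapvid_metrics(gt_xy, gt_vis, gt_xy + np.array([3.0, 0.0]), gt_vis)
# Tracker B: exact positions, but it misses the occlusion in the last frame.
b = tapvid_metrics(gt_xy, gt_vis, gt_xy, np.ones((1, 4), dtype=bool))

print(a)  # occlusion_accuracy 1.0, delta_avg 0.6, average_jaccard 0.6
print(b)  # occlusion_accuracy 0.75, delta_avg 1.0, average_jaccard 0.75
```

In this toy case, tracker A has perfect occlusion accuracy yet a lower AJ than tracker B, because AJ penalizes both visibility errors and position errors. By the same logic, two trackers with equal AJ must differ in positional precision if one of them has better occlusion accuracy.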

TAP-Vid Kinetics comparison. Green frame: Video-GMAE zero-shot; neighboring views: GMRW-C.

TAP-Vid DAVIS comparison. Occlusion-aware AJ matches GMRW-C while visibility is more stable for Video-GMAE.

Fine-tuned Tracking Performance

After fine-tuning on small-scale labeled datasets, Video-GMAE outperforms current baselines on various point-tracking benchmarks. The following videos showcase the precision and robustness of our fine-tuned models.

Fine-tuned tracking on TAP-Vid DAVIS; two sequences play in succession and loop.

Fine-tuned tracking on TAP-Vid Kinetics; two sequences play in succession and loop.

Video-GMAE Pre-training Process Samples

Our model is pre-trained by reconstructing videos from masked patches using dynamic Gaussian primitives. These visualizations illustrate the model's ability to predict and render Gaussian trajectories over time, which forms the basis of its correspondence understanding.

Reconstruction of dynamic scenes with predicted Gaussians.

Acknowledgements: We borrow this website template from MonST3R, which itself was inspired by templates like SD+DINO, and originally DreamBooth. We thank the creators of these templates for making their work publicly available.