Supplementary Videos:
Tracking by Predicting 3-D Gaussians Over Time

Authors are anonymized for the NeurIPS submission.

Zero-shot Tracking Examples

Video-GMAE learns strong correspondence priors, enabling impressive zero-shot tracking without any task-specific labels. Observe below how points are tracked consistently across a variety of dynamic scenes and object deformations. Zero-shot tracking generally fails when a complex background is combined with camera motion: the model is pretrained under the assumption of a fixed, static camera, and only 256 Gaussians are used during pretraining. In scenes with high-frequency backgrounds, many of these Gaussians are spent modeling the background, leaving insufficient capacity to model the object's motion, so the model loses track of the object.
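The zero-shot tracking idea above can be sketched in a few lines: attach a query point to the nearest predicted Gaussian in the first frame and let it ride that Gaussian's trajectory. This is a minimal illustrative sketch, not our actual inference code; the function name, the `(T, N, 2)` layout of projected Gaussian centers, and the nearest-neighbor assignment are assumptions made for illustration.

```python
import numpy as np

def track_point_zero_shot(gaussian_means, query_xy):
    """Track a query point by attaching it to the nearest Gaussian.

    gaussian_means: (T, N, 2) projected 2-D centers of N Gaussians
                    over T frames (hypothetical layout).
    query_xy:       (2,) pixel coordinate of the point in frame 0.

    Returns a (T, 2) trajectory: the point follows the Gaussian whose
    center is closest to it in the first frame, keeping its offset.
    """
    dists = np.linalg.norm(gaussian_means[0] - query_xy, axis=-1)
    g = int(np.argmin(dists))                  # nearest Gaussian at t=0
    offset = query_xy - gaussian_means[0, g]   # fixed relative offset
    return gaussian_means[:, g] + offset       # ride the Gaussian over time

# Toy example: 3 Gaussians drifting right by 1 px per frame.
T = 4
base = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]])
means = np.stack([base + [t, 0.0] for t in range(T)])  # (T, 3, 2)
traj = track_point_zero_shot(means, np.array([9.0, 1.0]))
# The query point locks onto the middle Gaussian and drifts with it.
```

The failure mode described above also falls out of this sketch: if most of the 256 Gaussians are consumed by a busy background, the nearest Gaussian to a foreground query may not actually follow the object.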

Zero-shot tracking in scenes from the DAVIS dataset.

Zero-shot tracking in scenes from the Kinetics dataset.

Zero-shot tracking failure modes.

Fine-tuned Tracking Performance

After fine-tuning on small-scale labeled datasets, Video-GMAE outperforms current baselines on various point-tracking benchmarks. The following videos showcase the precision and robustness of our fine-tuned models.

High-precision tracking on the DAVIS dataset.

Tracking challenging sequences from Kinetics.

Video-GMAE Pre-training Process Samples

Our model is pretrained by reconstructing videos from masked patches using dynamic Gaussian primitives. These visualizations illustrate the model's ability to predict and render Gaussian trajectories over time, forming the basis of its correspondence understanding.
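The masking step of this pretraining objective can be illustrated with a small sketch: hide most video patches and keep only a few visible, from which the model must predict Gaussians that re-render the full clip. This is a toy sketch, not our training code; the function name, the patch-grid sizes, and the 0.9 mask ratio are illustrative assumptions.

```python
import numpy as np

def random_patch_mask(t, h, w, mask_ratio=0.9, seed=0):
    """Sample a random binary mask over video patches (True = masked).

    t, h, w: number of patches along time, height, and width
             (hypothetical grid sizes). The encoder sees only the
    visible (False) patches; the decoder must predict Gaussian
    parameters whose rendering reconstructs the whole clip.
    """
    rng = np.random.default_rng(seed)
    n = t * h * w
    n_mask = int(round(mask_ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_mask, replace=False)] = True
    return mask.reshape(t, h, w)

mask = random_patch_mask(t=8, h=14, w=14, mask_ratio=0.9)
visible = int((~mask).sum())  # patches the encoder actually sees
```

A high mask ratio makes reconstruction impossible from appearance alone, which is what pushes the predicted Gaussian trajectories to carry temporal correspondence.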

Reconstruction of dynamic scenes with predicted Gaussians.

Acknowledgements: We borrow this website template from MonST3R, which itself was inspired by templates like SD+DINO, and originally DreamBooth. We thank the creators of these templates for making their work publicly available.