TAPVid-360
Tracking Any Point in 360 from Narrow Field of View Video

University of York

Abstract


Humans excel at constructing panoramic mental models of their surroundings, maintaining object permanence and inferring scene structure beyond visible regions. In contrast, current artificial vision systems struggle with persistent, panoramic understanding, often processing scenes egocentrically on a frame-by-frame basis. This limitation is pronounced in the Track Any Point (TAP) task, where existing methods fail to track 2D points outside the field of view. To address this, we introduce TAPVid-360, a novel task that requires predicting the 3D direction to queried scene points across a video sequence, even when they lie far outside the narrow field of view of the observed video. This task fosters learning allocentric scene representations without needing dynamic 4D ground truth scene models for training. Instead, we exploit 360° videos as a source of supervision, resampling them into narrow field-of-view perspectives while computing ground truth directions by tracking points across the full panorama using a 2D pipeline. We introduce a new dataset and benchmark, TAPVid360-10k, comprising 10k perspective videos with ground truth directional point tracks. Our baseline adapts CoTracker3 to predict per-point rotations for direction updates, outperforming existing TAP and TAPVid-3D methods.

Task


We introduce the TAPVid-360 task. Given query points as pixel coordinates in the first frame, the goal is to track the 3D direction (in the camera coordinate frame) to the scene point corresponding to each query point. Intuitively, we are asking the model to persistently predict in which direction a point lies but (unlike TAPVid-3D) not its distance. This corresponds to a human being able to approximately point to where they believe a chair is behind them without knowing exactly how far away it is.
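To make the target representation concrete, the sketch below shows how a query pixel maps to a unit direction in the camera coordinate frame under a standard pinhole model. The helper name pixel_to_direction and the example intrinsic matrix K are illustrative assumptions, not part of our released code.

```python
import numpy as np

def pixel_to_direction(uv, K):
    """Convert a pixel coordinate to a unit 3D direction in the camera frame.

    uv: (2,) pixel coordinate (u, v) of the query point.
    K:  (3, 3) pinhole intrinsic matrix (assumed known for the input video).
    """
    # Back-project through the inverse intrinsics to a ray through the pixel.
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    # Normalise: the task asks only for direction, not distance.
    return ray / np.linalg.norm(ray)

# Example: the principal point of a 640x480 camera maps to [0, 0, 1] (straight ahead).
K = np.array([[320.0,   0.0, 320.0],
              [  0.0, 320.0, 240.0],
              [  0.0,   0.0,   1.0]])
d = pixel_to_direction(np.array([320.0, 240.0]), K)
```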

TAPVid360-10k Dataset

Examples


Perspective Video (What the Model Sees)
Equirectangular With Tracking Points For Visualisation (Full 360 View)

Creation Methodology

We curated the TAPVid360-10k dataset with an automated pipeline. The pipeline begins by applying Lang-SAM to the initial frame of a pre-filtered 360° video, using specific text prompts (e.g., 'person', 'car') to segment dynamic objects of interest. These segmentation masks are then temporally propagated across the full video sequence via SAM 2. We subsequently extract dynamic 2D perspective viewports centred on each object mask, within which CoTracker3 is applied to generate robust point tracks. These 2D tracks are then back-projected onto the 360° spherical domain. Finally, we simulate novel camera trajectories to generate new 2D perspective sequences, using the associated ground truth 3D directional rays as training data for our model.
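As an illustration of the back-projection step, the sketch below lifts a tracked viewport pixel onto the unit sphere and converts it to equirectangular longitude/latitude. The function name, axis conventions, and the viewport-to-panorama rotation R_view are assumptions chosen for clarity rather than the exact code of our pipeline.

```python
import numpy as np

def track_point_to_sphere(uv, K, R_view):
    """Lift a 2D track point from a perspective viewport onto the 360° sphere.

    uv:     (2,) pixel coordinate of the tracked point in the viewport.
    K:      (3, 3) intrinsics of the virtual perspective camera.
    R_view: (3, 3) rotation from the viewport frame to the panorama frame.
    Returns the unit direction in panorama coordinates and (lon, lat) in radians.
    """
    # Pixel -> unit ray in the viewport camera frame.
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    ray = ray / np.linalg.norm(ray)
    # Express the ray in the shared panorama frame.
    d = R_view @ ray
    # Convert to equirectangular angles (y up, z forward assumed).
    lon = np.arctan2(d[0], d[2])
    lat = np.arcsin(np.clip(d[1], -1.0, 1.0))
    return d, (lon, lat)
```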


CoTracker360: A TAPVid-360 Baseline


As a baseline model, we modify the recent CoTracker3 method to predict directions instead of point estimates. The original CoTracker3 predicts point displacements relative to the first frame, a representation that is easier for the model to reason about than directly regressing absolute point positions at each frame. We follow the same approach, except that we apply a rotation to the direction at the first frame. Accordingly, we replace the last layer of the CoTracker3 decoder with a linear layer with 9 outputs and linear activation. We reshape this output to a 3 × 3 matrix and project it to the closest rotation matrix using special orthogonal Procrustes orthonormalization. To produce the output directions, we convert the initial query point positions from pixel coordinates to directions using the intrinsic parameters of the camera. These unit direction vectors are then rotated by the rotation matrices predicted for each point at each frame.
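The sketch below illustrates this rotation head, assuming PyTorch tensors: raw 9-dimensional decoder outputs are projected to the nearest rotation matrices via SVD-based special orthogonal Procrustes, then applied to the frame-one query directions. Function names and tensor shapes are illustrative, not the released implementation.

```python
import torch

def project_to_rotation(m9):
    """Project raw 9-dim outputs to the nearest rotation matrices (SO(3) Procrustes)."""
    M = m9.reshape(-1, 3, 3)
    U, _, Vt = torch.linalg.svd(M)
    # Flip the sign of the last singular vector where needed so that det(R) = +1.
    det = torch.det(U @ Vt)
    ones = torch.ones_like(det)
    S = torch.diag_embed(torch.stack([ones, ones, det], dim=-1))
    return U @ S @ Vt

def predict_directions(d0, m9):
    """d0: (N, 3) unit query directions from the first frame (via camera intrinsics).
    m9: (T, N, 9) raw decoder outputs over T frames.
    Returns (T, N, 3) predicted unit directions, one per point per frame."""
    T, N, _ = m9.shape
    R = project_to_rotation(m9).reshape(T, N, 3, 3)
    return torch.einsum('tnij,nj->tni', R, d0)
```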

Benchmark Results

We benchmark TAPVid-360 using our CoTracker360 baseline and show that it outperforms adapted state-of-the-art TAP and TAPVid-3D methods.


BibTeX

@inproceedings{hudson2025tapvid,
  title={TAPVid-360: Tracking Any Point in 360 from Narrow Field of View Video},
  author={Hudson, Finlay and Gardner, James AD and Smith, William AP},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2025}
}