Abstract
Humans excel at constructing panoramic mental models of their surroundings, maintaining object
permanence and inferring scene structure beyond visible regions. In contrast, current artificial
vision systems struggle with persistent, panoramic understanding, often processing scenes
egocentrically on a frame-by-frame basis. This limitation is pronounced in the Track Any Point (TAP)
task, where existing methods fail to track 2D points once they move outside the field of view. To
address this, we introduce TAP-Vid 360, a novel task that requires predicting the 3D direction to
queried scene points throughout a video sequence, even when those points lie far outside the narrow
field of view of the observed video. This task encourages learning allocentric scene representations
without requiring dynamic 4D ground-truth scene models for training. Instead, we exploit 360° videos
as a source of supervision, resampling them into narrow field-of-view perspective videos while
computing ground-truth directions by
tracking points across the full panorama with a 2D tracking pipeline. We also introduce a new
dataset and benchmark, TAP360-10k, comprising 10k perspective videos with ground-truth directional
point tracks. Our baseline adapts CoTracker3 to predict per-point rotations for direction updates,
outperforming existing TAP and TAP-Vid 3D methods.
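The ground-truth directions come from point tracks on the full panorama. As a minimal illustrative sketch (not the paper's implementation), the snippet below shows how equirectangular pixel coordinates could be mapped to unit direction vectors; the function name, coordinate conventions, and panorama resolution are assumptions made here for illustration.

```python
import numpy as np

def equirect_to_direction(u, v, width, height):
    """Map equirectangular pixel coordinates (u, v) to unit 3D directions.

    Assumed convention: u in [0, width) spans longitude [-pi, pi),
    v in [0, height) spans latitude from +pi/2 (top) to -pi/2 (bottom).
    """
    lon = (u / width) * 2.0 * np.pi - np.pi       # azimuth angle
    lat = np.pi / 2.0 - (v / height) * np.pi      # elevation angle
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    d = np.stack([x, y, z], axis=-1)
    return d / np.linalg.norm(d, axis=-1, keepdims=True)

# Example: a 2D point track on a hypothetical 1920x960 panorama becomes a
# per-frame unit-direction track, i.e. the kind of directional ground truth
# the abstract describes.
track_uv = np.array([[960.0, 480.0], [1010.0, 470.0]])  # (T, 2) pixel track
directions = equirect_to_direction(track_uv[:, 0], track_uv[:, 1], 1920, 960)
print(directions.shape)  # (2, 3)
```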