Extreme Rotation Estimation using
Dense Correlation Volumes
How can we estimate the relative rotation between images in extreme non-overlapping cases?
Above we show two non-overlapping image pairs capturing an urban street scene (left) and a church (right). Possible cues revealing their relative geometric relationship include sunlight and the direction of shadows in outdoor scenes, and parallel lines and vanishing points in man-made scenes.
In this work, we present an approach for reasoning about such "hidden" cues for estimating the relative rotation between a pair of (possibly) non-overlapping images.
We present a technique for estimating the relative 3D rotation of an RGB image pair in an extreme setting, where the images have little or no overlap. We observe that, even when images do not overlap, there may be rich hidden cues as to their geometric relationship, such as light source directions, vanishing points, and symmetries present in the scene. We propose a network design that can automatically learn such implicit cues by comparing all pairs of points between the two input images. Our method therefore constructs dense feature correlation volumes and processes these to predict relative 3D rotations. Our predictions are formed over a fine-grained discretization of rotations, bypassing difficulties associated with regressing 3D rotations. We demonstrate our approach on a large variety of extreme RGB image pairs, including indoor and outdoor images captured under different lighting conditions and geographic locations. Our evaluation shows that our model can successfully estimate relative rotations among non-overlapping images without compromising performance over overlapping image pairs.
Overview of our Method
Given a pair of images, a shared-weight Siamese encoder extracts feature maps. We compute a 4D correlation volume using the inner product of features, from which our model predicts the relative rotation (here, as distributions over Euler angles).
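Framing rotation prediction as classification over discretized Euler angles means each ground-truth angle must be mapped to a bin index, and a predicted bin back to an angle. The sketch below illustrates this conversion; the bin count of 360 (1° bins) and the [-180°, 180°) range are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def angle_to_bin(angle_deg, num_bins=360):
    """Map an angle in [-180, 180) degrees to a discrete class index.

    The bin count is an assumed hyperparameter for illustration.
    """
    bin_width = 360.0 / num_bins
    return int((angle_deg + 180.0) // bin_width) % num_bins

def bin_to_angle(index, num_bins=360):
    """Recover the center angle (in degrees) of a given bin."""
    bin_width = 360.0 / num_bins
    return -180.0 + (index + 0.5) * bin_width
```

Training then reduces to a cross-entropy loss over these bins for each Euler angle, sidestepping the discontinuities that make direct regression of 3D rotations difficult.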
A 4D correlation volume is calculated from a pair of image feature maps. Given a feature vector from Image 1, we compute the dot product with all feature vectors in Image 2, and build up a 2D slice of size H x W. Combining all 2D slices over all feature vectors in Image 1, we obtain a 4D correlation volume of size H x W x H x W.
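The construction above amounts to taking all pairwise inner products between the two feature maps. A minimal NumPy sketch (function name and channel-last layout are our assumptions, not the paper's implementation):

```python
import numpy as np

def correlation_volume(f1, f2):
    """Build a 4D correlation volume from two feature maps.

    f1, f2: arrays of shape (H, W, C), one feature vector per spatial
    location. Entry [i, j, k, l] is the dot product between the feature
    at (i, j) in image 1 and the feature at (k, l) in image 2.
    Returns an array of shape (H, W, H, W).
    """
    H, W, C = f1.shape
    # Flatten spatial dimensions, take all pairwise inner products in
    # one matrix multiply, then restore the 4D layout.
    corr = f1.reshape(H * W, C) @ f2.reshape(H * W, C).T
    return corr.reshape(H, W, H, W)
```

Fixing the first two indices of the result yields exactly the H x W slice described above for one feature vector of Image 1.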
Our correlation volumes are implicitly assigned a dual role which emerges through training on both overlapping and non-overlapping pairs. When the input image pair contains significant overlap, pointwise correspondence can be computed and transferred onward to the rotation prediction module. When the input image pair contains little to no overlap, the correlation volume can assume the novel role of detecting implicit cues.
Predicted Rotation Results on Indoor Scenes
The full panoramas are shown with the ground-truth perspective images marked in red. We show our predicted viewpoints (in yellow) alongside the result obtained by a regression model predicting a continuous 6D rotation representation (in blue).
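For context on the regression baseline: the continuous 6D representation maps a 6-vector (the first two columns of a rotation matrix, possibly unnormalized) to a valid rotation via Gram-Schmidt orthogonalization. A minimal sketch of that mapping (function name is ours):

```python
import numpy as np

def rotmat_from_6d(x):
    """Gram-Schmidt map from a 6D vector to a 3x3 rotation matrix.

    x: array of shape (6,), interpreted as two (possibly unnormalized)
    3D column vectors. Returns a matrix with orthonormal columns and
    determinant +1.
    """
    a1, a2 = x[:3], x[3:]
    b1 = a1 / np.linalg.norm(a1)          # normalize first column
    b2 = a2 - np.dot(b1, a2) * b1         # remove component along b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)                 # third column completes the frame
    return np.stack([b1, b2, b3], axis=1)
```

Because this map is continuous, it avoids the discontinuities of Euler-angle or quaternion regression; our method instead sidesteps regression entirely by predicting distributions over discretized angles.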
Predicted Rotation Results on Outdoor Scenes and Generalization to New Cities
We show results on images from unseen panoramas in Manhattan, Pittsburgh and London, all obtained from a model trained on images from Manhattan only.
This work was supported in part by the National Science Foundation (IIS-2008313) and by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program and the Zuckerman STEM leadership program.