2016 IEEE Winter Conference on Applications of Computer Vision (WACV)

Abstract

Where do we focus our attention in an image? Humans have an amazing ability to cut through the clutter to the parts of an image most relevant to the task at hand. Consider the task of geo-localizing tourist photos by retrieving other images taken at the same location. Such photos naturally contain friends and family, and a selfie may be nearly filled by a person's face. Humans have no trouble ignoring these ‘distractions’ and recognizing the parts that are indicative of location (e.g., the towers of Neuschwanstein Castle rather than a friend's face, a tree, or a car). In this paper, we investigate learning this ability automatically. At training time, we learn how informative a region is for localization. At test time, we use this learned model to determine which parts of a query image to use for retrieval. We introduce a new dataset, People at Landmarks, whose query images contain large amounts of clutter. On this dataset, our system outperforms the existing state-of-the-art approach to retrieval by more than 10% mAP, and it also improves results on a standard dataset without heavy occluders (Oxford5K).
