Abstract
Many computer vision tasks, such as large-scale image retrieval and nearest-neighbor classification, perform similarity searches using Approximate Nearest Neighbor (ANN) indexes. The success of these applications depends on the quality of ANN retrieval. Popular indexing methods for ANN queries include forests of kd-trees (KDT) and hierarchical k-means (HKM). The dominance of these two methods has led to their implementation in many popular computer vision toolkits, perhaps giving the impression that ANN indexing is a solved problem. This paper introduces a new method, the Proximity Forest, which is based on a forest of randomized metric trees. We present evidence that forests of metric trees significantly outperform KDT and HKM when evaluated on real-world data, controlling for dimensionality, data set size, distance function, and the number of neighbors returned by the query. Randomized metric trees are simple to implement, are generic with respect to the choice of distance function, and can be applied to a wider class of data representations than KDT or HKM.
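To illustrate why a randomized metric tree is simple to implement and agnostic to the distance function, the sketch below builds a vantage-point-style tree around a randomly chosen pivot and answers queries by greedy descent. This is only a minimal illustration under assumed conventions, not the Proximity Forest implementation described in this paper; the function names (build_tree, query), the leaf size, and the median-radius split are hypothetical choices made for brevity.

    import random

    def build_tree(points, dist, leaf_size=8):
        # Illustrative sketch only, not the authors' implementation.
        # A pivot is chosen at random; remaining points are split by
        # their median distance to the pivot.
        if len(points) <= leaf_size:
            return {"leaf": points}
        pivot = random.choice(points)
        others = [p for p in points if p is not pivot]
        dists = sorted(dist(pivot, p) for p in others)
        radius = dists[len(dists) // 2]  # median split radius
        inner = [p for p in others if dist(pivot, p) <= radius]
        outer = [p for p in others if dist(pivot, p) > radius]
        return {"pivot": pivot, "radius": radius,
                "inner": build_tree(inner, dist, leaf_size),
                "outer": build_tree(outer, dist, leaf_size)}

    def query(tree, q, dist):
        # Greedy (approximate) descent: follow the side of the split
        # that q falls on, then scan the leaf for the closest point.
        if "leaf" in tree:
            leaf = tree["leaf"]
            return min(leaf, key=lambda p: dist(q, p)) if leaf else None
        side = "inner" if dist(tree["pivot"], q) <= tree["radius"] else "outer"
        best = query(tree[side], q, dist)
        # The pivot itself remains a candidate answer.
        if best is None or dist(q, tree["pivot"]) < dist(q, best):
            best = tree["pivot"]
        return best

Because the only requirement on dist is that it compare a pair of items, the same code works for any distance function and data representation. A forest in this style would build several such trees, each with different random pivots, query all of them, and keep the best candidates found across trees.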