Abstract
Most Web videos are captured in uncontrolled environments (e.g., by freely moving cameras at low resolution), which makes automatic video annotation very difficult. To address this problem, we present a robust moving foreground object detection method, followed by the integration of features collected from heterogeneous domains. We extend SIFT feature matching and present a probabilistic framework to construct consensus foreground object templates (CFOT). The CFOT detect moving foreground objects of interest across video frames, allowing us to extract visual features from foreground regions of interest. Combined with audio features, these visual features improve the resulting annotation accuracy. We conduct experiments on a Web video dataset collected from YouTube and achieve promising results.