Abstract
In this paper, we propose a framework that labels soccer video shots with pre-defined visual keywords to bridge the semantic gap between low-level features and semantic understanding. These visual keywords facilitate further structure analysis and event detection. In our system, an MPEG-1 soccer video stream is partitioned into shots, and each P-frame in a shot is converted into color-based and edge-based binary maps. We then detect the playing field and segment the regions of interest (ROIs) inside it. Finally, two support vector machine classifiers and a set of decision rules are applied to the properties of the ROIs (such as size, position, and texture ratio) and to the position of the playing field to label each video shot with a visual keyword. We applied the proposed method to 3495 soccer video shots and achieved an average precision of 94.3% and an average recall of 97.2%.
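To make the color-based binary map and playing-field detection steps concrete, the following is a minimal illustrative sketch, not the implementation used in this work. It assumes an already decoded RGB frame stored as nested lists, and it uses a simple green-dominance test with a hypothetical `margin` threshold; the function names, the threshold value, and the dominance rule are all assumptions introduced here for illustration.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def color_binary_map(frame: List[List[Pixel]], margin: int = 20) -> List[List[int]]:
    """Mark a pixel 1 when green exceeds red and blue by `margin`.

    The green-dominance rule and the default margin are assumptions,
    not values taken from the paper.
    """
    return [
        [1 if g > r + margin and g > b + margin else 0 for (r, g, b) in row]
        for row in frame
    ]


def field_ratio(binary_map: List[List[int]]) -> float:
    """Fraction of pixels marked as playing field."""
    total = sum(len(row) for row in binary_map)
    return sum(sum(row) for row in binary_map) / total if total else 0.0


def field_bounding_box(binary_map: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) of the field pixels, or None if there are none."""
    coords = [
        (y, x)
        for y, row in enumerate(binary_map)
        for x, value in enumerate(row)
        if value
    ]
    if not coords:
        return None
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    return (min(ys), min(xs), max(ys), max(xs))


# Toy 4x4 frame: one gray "crowd" row above three green "grass" rows.
GRASS: Pixel = (40, 160, 50)
CROWD: Pixel = (120, 120, 120)
frame = [[CROWD] * 4] + [[GRASS] * 4 for _ in range(3)]

field_map = color_binary_map(frame)
print(field_ratio(field_map))         # 0.75
print(field_bounding_box(field_map))  # (1, 0, 3, 3)
```

In a sketch like this, the field ratio and the bounding box stand in for the "position of the playing field" property; ROI properties such as size and texture ratio would be computed in a similar way from the non-field pixels inside that box before being passed to a classifier.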