Abstract
We propose a method for describing human activities from video images by tracking human skin regions: facial and hand regions. To detect skin regions robustly, three kinds of probabilistic information are extracted and integrated using Dempster-Shafer theory. On the other hand, main difficulty in transforming video images into textual descriptions is how to bridge a semantic gap between them. By associating visual features of head and hand motion with natural language concepts, appropriate syntactic components such as verbs, objects, etc. are determined and translated into natural language.