Abstract
This contribution presents an approach to creating smooth and coherent natural language descriptions of video streams. First, conventional image processing techniques are applied to extract high-level features from individual video frames, and a natural language description of each frame's contents is produced from these features. To extend the approach to video streams, we introduce units of features and give an overview of how these units can be used to produce coherent, smooth, and well-phrased descriptions by incorporating spatial and temporal information. The approach is evaluated by calculating an overlap similarity score between human-authored and machine-generated descriptions.
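
To make the evaluation step concrete, the following is a minimal sketch of one common overlap measure, the overlap coefficient computed over word sets. The abstract does not state the exact formula, so the choice of measure, the tokenization, and the names `tokenize` and `overlap_similarity` are illustrative assumptions and may differ from the score actually used in this work.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def overlap_similarity(reference, candidate):
    """Overlap coefficient |A & B| / min(|A|, |B|) over word sets.

    This is an assumed definition; the source only names an
    "overlap similarity score" without giving the formula.
    """
    a, b = tokenize(reference), tokenize(candidate)
    if not a or not b:
        return 0.0  # no words to compare on at least one side
    return len(a & b) / min(len(a), len(b))


# Hypothetical example descriptions, not taken from the paper's data.
human = "A man is walking across the street"
machine = "A man walks on the street"
score = overlap_similarity(human, machine)
```

Under this definition the score ranges from 0 (no shared words) to 1 (one description's vocabulary is fully contained in the other's). It ignores word order, so it measures content overlap and says nothing about fluency.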