2017 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)

Abstract

One of the challenges of using techniques such as convolutional neural networks and deep learning for automated object recognition in images and video is generating sufficient quantities of labeled training image data in a cost-effective way. Hundreds of thousands of tagged frames are generally preferred for each category or label, and a human tagging images frame by frame might expect to spend hundreds of hours creating such a training set. One alternative is to use video as a source of training images. A human tagger notes the start and stop time of each appearance of an object of interest in each clip. The video is then broken down into component frames using software such as ffmpeg. Frames that fall within the noted time intervals are labeled as “targets,” and the remaining frames are labeled as “non-targets”; this separation of categories can be automated. The time required by a human viewer using this method is around ten hours, at least 1–2 orders of magnitude lower than that of a human tagger labeling frame by frame. The false alarm rate and target detection rate can be optimized by providing the system with unambiguous training examples.
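The interval-based labeling workflow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the file names, the 1 fps extraction rate, and the `label_frames` helper are all assumptions introduced here for clarity.

```python
import subprocess

# Hypothetical extraction step: split a clip into frames at 1 fps with ffmpeg.
# (File names and frame rate are illustrative assumptions.)
# subprocess.run(["ffmpeg", "-i", "clip.mp4", "-vf", "fps=1", "frames/%06d.png"])

def label_frames(frame_times, target_intervals):
    """Label each frame timestamp 'target' if it falls inside any
    human-tagged (start, stop) interval, otherwise 'non-target'."""
    labels = []
    for t in frame_times:
        hit = any(start <= t <= stop for start, stop in target_intervals)
        labels.append("target" if hit else "non-target")
    return labels

# Example: one frame per second of a 10 s clip, with an object of
# interest visible from 2 s to 5 s.
print(label_frames(range(10), [(2.0, 5.0)]))
```

The human tagger supplies only the `(start, stop)` pairs per clip; the per-frame separation into targets and non-targets is then fully automatic, which is where the 1–2 order-of-magnitude reduction in labeling time comes from.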