Extracting Text from WWW Images
In this paper, we examine the problem of locating and extracting text from in-line images of World Wide Web pages. We describe a text detection algorithm based on color clustering and connected component analysis. The algorithm first quantizes the color space of the input image into a number of color classes using a parameter-free clustering procedure. It then identifies text-like connected components in each color class based on their shapes. Finally, a post-processing procedure aligns text-like components into text lines. The experimental results show that our text extraction algorithm works well on a variety of test images.
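The three stages named in the abstract (color quantization, shape-based filtering of connected components, and alignment into text lines) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The parameter-free clustering is replaced here by plain uniform quantization of each RGB channel, and all function names, thresholds (`levels`, `min_size`, `max_size`, `max_aspect`), and the vertical-overlap grouping rule are assumptions chosen only for illustration.

```python
from collections import deque


def quantize(image, levels=4):
    """Map each RGB pixel to a coarse color class.

    Assumption: uniform per-channel quantization stands in for the
    parameter-free clustering procedure described in the abstract.
    """
    step = 256 // levels
    return [
        [tuple(min(v // step, levels - 1) for v in pixel) for pixel in row]
        for row in image
    ]


def connected_components(classes):
    """Return 4-connected regions of equal color class with their bounding boxes."""
    h, w = len(classes), len(classes[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if seen[y][x]:
                continue
            label = classes[y][x]
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right, count = y, x, y, x, 0
            while queue:  # breadth-first flood fill
                cy, cx = queue.popleft()
                count += 1
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and classes[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            comps.append({"color": label,
                          "box": (top, left, bottom, right),
                          "pixels": count})
    return comps


def is_text_like(comp, min_size=2, max_size=40, max_aspect=5.0):
    """Hypothetical shape test: character-sized box with a moderate aspect ratio."""
    top, left, bottom, right = comp["box"]
    height, width = bottom - top + 1, right - left + 1
    if not (min_size <= height <= max_size and width <= max_size):
        return False
    return max(height, width) / min(height, width) <= max_aspect


def group_into_lines(comps):
    """Group same-color components whose vertical extents overlap (assumed rule)."""
    lines = []
    for comp in sorted(comps, key=lambda c: c["box"][1]):  # left to right
        top, _, bottom, _ = comp["box"]
        for line in lines:
            if (line["color"] == comp["color"]
                    and top <= line["bottom"] and bottom >= line["top"]):
                line["members"].append(comp)
                line["top"] = min(line["top"], top)
                line["bottom"] = max(line["bottom"], bottom)
                break
        else:
            lines.append({"color": comp["color"], "top": top,
                          "bottom": bottom, "members": [comp]})
    return lines


def extract_text_lines(image):
    """Run all three stages; keep only lines with at least two components."""
    classes = quantize(image)
    candidates = [c for c in connected_components(classes) if is_text_like(c)]
    return [ln for ln in group_into_lines(candidates) if len(ln["members"]) >= 2]
```

Here `image` is assumed to be a list of rows of `(r, g, b)` tuples with 8-bit channels. Each returned line records its color class, its vertical extent, and its member components. A real system would need a more careful clustering step and richer shape features than the bounding-box test used in this sketch.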
Index Terms:
text detection, information retrieval, World Wide Web.
Citation:
Jiangying Zhou, Daniel Lopresti, "Extracting Text from WWW Images," Fourth International Conference on Document Analysis and Recognition (ICDAR'97), p. 248, 1997