Abstract
Text zone classification is a vital step in the digitization process; without it, OCR systems perform poorly. Prior approaches to document zone classification have relied on large sets of hand-crafted features for training zone classifiers. Such features are usually database-dependent, and their computation is time-consuming. In this work, we propose a novel method for text zone classification based on unsupervised feature learning. In our method, feature vectors of document zones are learned automatically through patch extraction, encoding, and pooling, where feature encoding is based on a codebook of visual words. The training phase of the text classifier takes into account the imbalance between text zones and non-text zones of all types. The proposed method has been tested on publicly available standard databases and achieved competitive or better results compared to state-of-the-art methods. The results show that our approach is well suited to the task of text zone classification and is robust to zone shape, orientation, and size.
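As a rough illustration of the patch extraction, encoding, and pooling pipeline, the following minimal sketch builds a fixed-length feature vector for a zone of arbitrary size. It rests on assumptions not stated in the abstract: the codebook is learned with plain k-means, each patch is encoded by hard assignment to its nearest visual word, and pooling is a normalized histogram. All function names and parameter values are hypothetical.

```python
import random

def extract_patches(zone, size, stride):
    """Flatten every size x size patch of a 2D zone (a list of pixel rows)."""
    h, w = len(zone), len(zone[0])
    return [
        [zone[r + i][c + j] for i in range(size) for j in range(size)]
        for r in range(0, h - size + 1, stride)
        for c in range(0, w - size + 1, stride)
    ]

def sq_dist(a, b):
    """Squared Euclidean distance between two flattened patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(patch, codebook):
    """Index of the visual word closest to the patch (hard assignment)."""
    return min(range(len(codebook)), key=lambda k: sq_dist(patch, codebook[k]))

def learn_codebook(patches, k, iters=10, seed=0):
    """Assumed codebook learner: plain k-means over unlabeled patches."""
    rng = random.Random(seed)
    codebook = [list(p) for p in rng.sample(patches, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in patches:
            groups[nearest(p, codebook)].append(p)
        for idx, group in enumerate(groups):
            if group:  # keep the previous word if no patch was assigned to it
                codebook[idx] = [sum(col) / len(group) for col in zip(*group)]
    return codebook

def zone_feature(zone, codebook, size, stride):
    """Encode each patch, then pool the codes into a normalized histogram."""
    patches = extract_patches(zone, size, stride)
    hist = [0.0] * len(codebook)
    for p in patches:
        hist[nearest(p, codebook)] += 1.0
    return [count / len(patches) for count in hist]

# Toy usage on random "zones"; real input would be grayscale zone images.
rng = random.Random(1)
train_zone = [[rng.random() for _ in range(12)] for _ in range(12)]
codebook = learn_codebook(extract_patches(train_zone, 4, 2), k=5)
small_zone = [[rng.random() for _ in range(8)] for _ in range(6)]
feature = zone_feature(small_zone, codebook, 4, 2)
```

Because pooling collapses all patch codes into one histogram over the codebook, the feature length depends only on the number of visual words, not on the zone dimensions. This is one plausible way a pipeline of this kind can handle zones of different shapes and sizes.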