Text and Layout Information Extraction from Document Files of Various Formats Based on the Analysis of Page Description Language
Ninth International Conference on Doc ...
Takashi Hirano, Mitsubishi Electric Corporation, Information Technology R&D Center
Yuichi Okano, Mitsubishi Electric Corporation, Information Technology R&D Center
Yasuhiro Okada, Mitsubishi Electric Corporation, Information Technology R&D Center
Fumio Yoda, Mitsubishi Electric Corporation, Information Technology R&D Center
We propose a document analysis method that extracts text and layout information from document files of various formats. The method analyzes the Page Description Language (PDL) data generated when a document is printed. Because any printable document can be converted to PDL data, the method does not depend on the source file format. Graphic elements in the PDL data (text objects, image objects, and path objects) are analyzed to extract text and layout information: character size, character position, and table position. Applying OCR to the image objects and path objects converts text images in source documents and vectorized font characters in engineering drawings to text. Tables are detected by analyzing path objects. The full content information can therefore be extracted from document files of various formats, as long as the document is printable.
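The abstract does not say how path objects are analyzed to find tables. As a rough illustration only, the Python sketch below assumes the path objects have already been reduced to straight line segments in page coordinates, groups horizontal and vertical rules that touch each other, and reports the bounding box of any group with at least two rules in each direction. The names `Segment`, `find_table_boxes`, and the tolerance `tol` are hypothetical, and this heuristic is an assumption rather than the authors' algorithm.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Segment:
    """A straight stroke taken from a path object (hypothetical representation)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def orientation(self, tol: float) -> Optional[str]:
        """Return 'h', 'v', or None for strokes that are not axis-aligned."""
        dx, dy = abs(self.x1 - self.x0), abs(self.y1 - self.y0)
        if dy <= tol and dx > dy:
            return "h"
        if dx <= tol and dy > dx:
            return "v"
        return None


def _touches(h: Segment, v: Segment, tol: float) -> bool:
    """True if horizontal rule h and vertical rule v cross or meet within tol."""
    hx_lo, hx_hi = sorted((h.x0, h.x1))
    vy_lo, vy_hi = sorted((v.y0, v.y1))
    return (hx_lo - tol <= v.x0 <= hx_hi + tol) and (vy_lo - tol <= h.y0 <= vy_hi + tol)


def find_table_boxes(segments: List[Segment], tol: float = 1.0) -> List[Tuple[float, float, float, float]]:
    """Return bounding boxes (x_min, y_min, x_max, y_max) of table candidates."""
    # Keep only axis-aligned rules; other strokes are ignored in this sketch.
    rules = [(s, s.orientation(tol)) for s in segments if s.orientation(tol)]
    parent = list(range(len(rules)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every horizontal/vertical pair that touches into one group.
    for i, (a, oa) in enumerate(rules):
        for j in range(i + 1, len(rules)):
            b, ob = rules[j]
            if oa == ob:
                continue
            h, v = (a, b) if oa == "h" else (b, a)
            if _touches(h, v, tol):
                parent[find(i)] = find(j)

    groups: Dict[int, List[Tuple[Segment, str]]] = {}
    for i, rule in enumerate(rules):
        groups.setdefault(find(i), []).append(rule)

    boxes = []
    for group in groups.values():
        n_h = sum(1 for _, o in group if o == "h")
        n_v = sum(1 for _, o in group if o == "v")
        # A grid needs at least two rules in each direction.
        if n_h >= 2 and n_v >= 2:
            xs = [x for s, _ in group for x in (s.x0, s.x1)]
            ys = [y for s, _ in group for y in (s.y0, s.y1)]
            boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return sorted(boxes)
```

With this heuristic, an isolated rule such as an underline forms a group of one and is not reported as a table. Real PDL output may also draw table borders as filled thin rectangles or curves, which this sketch does not handle.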
Citation:
Takashi Hirano, Yuichi Okano, Yasuhiro Okada, Fumio Yoda, "Text and Layout Information Extraction from Document Files of Various Formats Based on the Analysis of Page Description Language," Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Vol. 1, pp. 262-266, 2007.