9th International Database Engineering & Application Symposium (IDEAS'05)
pp. 105-114
Automatically Maintaining Wrappers for Web Sources
Juan Raposo, University of A Coruña
Alberto Pan, University of A Coruña
Manuel Álvarez, University of A Coruña
Justo Hidalgo, Denodo Technologies Inc.
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/IDEAS.2005.13
Abstract
A substantial subset of web data follows some
kind of underlying structure. Nevertheless, HTML does
not contain any schema or semantic information about
the data it represents. A program able to provide
software applications with a structured view of those
semi-structured web sources is usually called a
wrapper. Wrappers are able to accept a query against
the source and return a set of structured results, thus
enabling applications to access web data in a similar
manner to that of information from databases. A
significant problem in this approach arises because
web sources may undergo changes that invalidate
the current wrappers. In this paper, we present novel
heuristics and algorithms to address this problem. Our
approach is based on collecting some query results
during wrapper operation. Then, when the source
changes, these results are used to generate a set of labeled
examples that are then provided as input to a wrapper
induction algorithm able to regenerate the wrapper.
We have tested our methods in several real-world web
data extraction domains, obtaining high accuracy in
all the steps of the process.
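
The maintenance loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not specify the heuristics, so plain substring matching stands in for them, and the names `ResultStore` and `label_examples`, the sample query, and the sample page are all hypothetical. The sketch stores records extracted while the wrapper still works, then looks for those records in a changed page to produce labeled spans that a wrapper induction algorithm could take as input.

```python
from dataclasses import dataclass, field


@dataclass
class ResultStore:
    """Keeps records the wrapper extracted while it was still working."""
    results: dict = field(default_factory=dict)  # query -> list of records

    def record(self, query, records):
        self.results.setdefault(query, []).extend(records)


def label_examples(page_text, stored_records):
    """Return (field, value, start, end) spans for every stored record
    whose field values all still appear in the changed page.

    Assumption: naive substring search replaces the paper's heuristics.
    """
    examples = []
    for rec in stored_records:
        spans = []
        for name, value in rec.items():
            pos = page_text.find(value)
            if pos < 0:
                break  # record is not fully present any more; skip it
            spans.append((name, value, pos, pos + len(value)))
        else:
            # Reached only when no field was missing.
            examples.extend(spans)
    return examples


# Collected during normal wrapper operation (hypothetical data).
store = ResultStore()
store.record("database books", [
    {"title": "Database Systems", "price": "45.00"},
    {"title": "Out Of Print", "price": "10.00"},
])

# The source changed its markup; the old wrapper no longer applies.
new_page = "<div><b>Database Systems</b> <i>EUR 45.00</i></div>"
examples = label_examples(new_page, store.results["database books"])
```

Here `examples` holds the labeled spans for the one record that survives in the new page. A real system would also have to handle values that occur more than once on a page or that changed slightly between visits, which this sketch ignores.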
Additional Information
Citation:
Juan Raposo, Alberto Pan, Manuel Álvarez, Justo Hidalgo,
"Automatically Maintaining Wrappers for Web Sources,"
9th International Database Engineering & Application Symposium (IDEAS'05),
pp. 105-114, 2005.