Fourth Latin American Web Congress (LA-WEB'06)
pp. 127-134
Where and How Duplicates Occur in the Web
Alvaro Pereira Jr, Federal Univ. of Minas Gerais, Brazil
Ricardo Baeza-Yates, Yahoo! Research, Spain & Chile
Nivio Ziviani, Federal Univ. of Minas Gerais, Brazil
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/LA-WEB.2006.39
Abstract
In this paper we study duplicates on the Web, using collections
containing documents from all sites under the .cl domain,
which serve as accurate and representative subsets of
the Web. We identify duplicate and near-duplicate documents
in our collections and study the distribution of documents
across clusters of duplicates. We also study the occurrence
of duplicates in both parts of our Web graphs, the
connected and the disconnected components, aiming to identify
where duplicates occur more frequently. We show, for the first
time, that the number of duplicates in the Web is significantly
greater than the number of duplicates in the connected
component of the Web graph. Previous works that
estimated the number of duplicates in the Web used collections
drawn from connected components of the Web, so in those cases
the sample of the Web was biased.
Additional Information
Citation:
Alvaro Pereira Jr, Ricardo Baeza-Yates, Nivio Ziviani,
"Where and How Duplicates Occur in the Web,"
Fourth Latin American Web Congress (LA-WEB'06),
pp. 127-134, 2006.