Fourth Latin American Web Congress (LA-WEB'06)   pp. 127-134
Where and How Duplicates Occur in the Web

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/LA-WEB.2006.39

Abstract
In this paper we study duplicates on the Web, using collections that contain documents from all sites under the .cl domain and that are accurate and representative subsets of the Web. We identify duplicate and near-duplicate documents in our collections and study the distribution of documents across clusters of duplicates. We also study the occurrence of duplicates in both parts of our Web graphs, the connected and the disconnected components, aiming to identify where duplicates occur more frequently. As an original result, we show that the number of duplicates in the Web is significantly greater than the number of duplicates in the connected component of the Web graph. Previous works that estimated the number of duplicates in the Web used collections drawn from connected components of the Web, so in those cases the sample of the Web was biased.
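
The abstract does not say how duplicates are detected or clustered. As an illustration only, the idea of grouping documents into clusters of duplicates and counting them can be sketched in Python as below. The function names, the whitespace and case normalization, and the use of SHA-1 over the full text are assumptions made for this sketch, not the authors' method, and the sketch covers exact duplicates only, not near-duplicates.

```python
import hashlib
from collections import Counter, defaultdict


def cluster_exact_duplicates(docs):
    """Group document ids whose normalized text is identical.

    `docs` maps a document id to its text. Normalization (collapse
    whitespace, lowercase) is an assumption made for this sketch.
    """
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)
    # A cluster of duplicates needs at least two documents.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def duplicate_count(clusters):
    """Count every document beyond the first in each cluster as a duplicate."""
    return sum(len(cluster) - 1 for cluster in clusters)


def cluster_size_distribution(clusters):
    """Map cluster size to the number of clusters of that size."""
    return Counter(len(cluster) for cluster in clusters)


if __name__ == "__main__":
    # Hypothetical toy collection, not data from the paper.
    sample = {
        "a": "Hello  world",
        "b": "hello world",
        "c": "other page",
        "d": "HELLO WORLD",
        "e": "other page",
        "f": "unique page",
    }
    found = cluster_exact_duplicates(sample)
    print(found)
    print(duplicate_count(found))
    print(dict(cluster_size_distribution(found)))
```

Running the same counting separately on the documents of the connected component and on the whole collection would give the two figures that the abstract compares.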
Additional Information

Citation:  Alvaro Pereira Jr, Ricardo Baeza-Yates, Nivio Ziviani, "Where and How Duplicates Occur in the Web," la-web, pp. 127-134,  Fourth Latin American Web Congress (LA-WEB'06),  2006
