Abstract
Search services on hyperlinked data are becoming popular among users because of the huge amount of data available and the consequent difficulty of retrieving and filtering relevant documents. Traditional term-based search engines are of limited use for this purpose, since the resulting ranking depends on how precisely the user expresses the query. Current research instead takes a different approach, called topic distillation, which consists of finding documents that are related to the query topic but do not necessarily contain the query string. Current algorithms for topic distillation first compute a base set containing all the relevant pages and then apply an iterative procedure to obtain the authoritative pages. In this paper we present STED, a system for topic distillation and enumeration (i.e., identification of different communities) of web documents. The system is based on a technique that computes the authoritative pages by analyzing the structure of the base set. More specifically, the system applies a statistical approach to the co-citation matrix associated with the base set to find the most co-cited pages, and it analyzes both the link structure and the content of the pages. Several experiments have demonstrated the effectiveness and efficiency of the system.
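To make the notion of a co-citation matrix concrete, the following is a minimal sketch in Python, not the STED implementation. It assumes the base set is given as a hypothetical dictionary `links` mapping each page index to the set of page indices it links to, and it treats entry (i, j) of the matrix as the number of pages that link to both i and j. The function names, the exclusion of self-links, and the simple ranking of pairs by count are illustrative assumptions; the statistical procedure applied by the system is not specified here and is not reproduced.

```python
from itertools import combinations


def cocitation_matrix(links, n):
    """Build an n x n co-citation matrix from a link dictionary.

    links: dict mapping a page index to the set of page indices it cites.
    Entry c[i][j] (i != j) counts the pages that cite both i and j;
    the diagonal entry c[i][i] is the number of pages citing i.
    Self-links are ignored (an assumption made for this sketch).
    """
    c = [[0] * n for _ in range(n)]
    for src, targets in links.items():
        cited = sorted(t for t in set(targets) if t != src)
        for t in cited:
            c[t][t] += 1
        # Every pair of pages cited by the same source is co-cited once.
        for a, b in combinations(cited, 2):
            c[a][b] += 1
            c[b][a] += 1
    return c


def most_cocited_pairs(c, top=3):
    """Return up to `top` page pairs with the highest co-citation count."""
    n = len(c)
    pairs = [((i, j), c[i][j])
             for i in range(n) for j in range(i + 1, n) if c[i][j] > 0]
    # Sort by descending count, breaking ties by page indices.
    pairs.sort(key=lambda p: (-p[1], p[0]))
    return pairs[:top]


if __name__ == "__main__":
    # Hypothetical base set of six pages.
    links = {0: {2, 3}, 1: {2, 3}, 4: {2, 3, 5}, 5: {3}}
    c = cocitation_matrix(links, 6)
    print(most_cocited_pairs(c, top=1))
```

In this toy base set, pages 2 and 3 are cited together by pages 0, 1, and 4, so they form the most co-cited pair. A real system would go on to apply a statistical analysis to the matrix rather than a plain sort.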