Proximity Estimation and Hardness of Short-Text Corpora
In this work, we investigate the relative hardness of short-text corpora in clustering problems and how this hardness relates to traditional similarity measures. Our approach attempts to establish a connection between the hardness of a corpus and the precision level exhibited by similarity measures, according to the results obtained with different cluster validity measures on the "ideal" clustering of each corpus. We also propose a new validity measure, named contiguity error, that allowed us to observe this connection consistently across all the collections considered.
Index Terms:
clustering, short-text corpora, proximity estimation, cluster validity measures
Citation:
Marcelo Luis Errecalde, Diego Ingaramo, Paolo Rosso, "Proximity Estimation and Hardness of Short-Text Corpora," dexa, pp. 15-19, 2008 19th International Conference on Database and Expert Systems Application, 2008