Improving Web Clustering by Cluster Selection
Title: Improving Web Clustering by Cluster Selection
Publication: The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05)
Authors: Daniel Crabtree, Xiaoying Gao, Peter Andreae
Date: 19 – 22 September 2005
Venue: Compiègne, France
Pages: 172 – 178
Files: PDF (139 kB), Data Set (4.64 MB)
Web page clustering is a technology that puts semantically related web pages into groups and is useful for categorizing, organizing, and refining search results. When clustering using only textual information, Suffix Tree Clustering (STC) outperforms other clustering algorithms by making use of phrases and allowing clusters to overlap. One problem with STC and similar algorithms is how to select a small set of clusters to display to the user from a very large set of generated clusters. The cluster selection method used in STC is flawed in that it does not handle overlapping clusters appropriately. This paper introduces a new cluster scoring function and a new cluster selection algorithm to overcome the problems with overlapping clusters; combined with STC, they form a new clustering algorithm, ESTC. This paper’s experiments show that ESTC significantly outperforms STC and that, even with less data, ESTC performs similarly to a commercial clustering search engine.
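To make the overlap problem concrete, the following is a minimal sketch of an overlap-aware selection step. It is not the scoring function or selection algorithm from the paper; it is a generic greedy coverage heuristic, and the function name `select_clusters`, the cluster labels, and the document ids are all made up for illustration.

```python
from typing import Dict, List, Set


def select_clusters(clusters: Dict[str, Set[int]], k: int) -> List[str]:
    """Greedily pick up to k clusters, each time taking the cluster that
    adds the most documents not yet covered by earlier picks.

    NOTE: a generic greedy heuristic for illustration only, not the
    ESTC selection algorithm described in the paper.
    """
    covered: Set[int] = set()
    chosen: List[str] = []
    remaining = dict(clusters)
    while remaining and len(chosen) < k:
        # Marginal gain = number of documents this cluster would newly cover.
        # The cluster name is a tie-breaker so the result is deterministic.
        best = max(remaining, key=lambda name: (len(remaining[name] - covered), name))
        if not remaining[best] - covered:
            break  # every remaining cluster is fully redundant
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical clusters: label -> set of document ids.
clusters = {
    "jaguar car": {1, 2, 3, 4},
    "jaguar cars": {1, 2, 3},  # near-duplicate of "jaguar car"
    "jaguar cat": {5, 6},
    "mac os": {7},
}
selected = select_clusters(clusters, k=3)
print(selected)
```

Ranking these hypothetical clusters by size alone would display both "jaguar car" and "jaguar cars", which cover almost the same documents. Counting only newly covered documents skips the near-duplicate and leaves room for a distinct topic.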
Data Set

This paper was accompanied by two test data sets, which can be downloaded here (4.64 MB).

The “jaguar” data set consists of 210 documents assigned to 34 topics, and the “salsa” data set consists of 198 documents assigned to 21 topics. The data sets are provided as MySQL SQL statements. For each document they include the search engine snippet, the full document text, the document's position in the search results, the document's URL, and the document's manually assigned topic.
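Once the SQL statements have been loaded, each row could be held in a small record type like the sketch below. The class name `Document`, its field names, and the sample rows are assumptions for illustration; the actual table and column names in the SQL files may differ.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Document:
    # Field names are hypothetical; they mirror the five pieces of
    # information the data sets are described as containing.
    snippet: str    # search engine snippet
    full_text: str  # full document text
    position: int   # position in the search results
    url: str        # document URL
    topic: str      # manually assigned topic


def topic_sizes(docs: Iterable[Document]) -> Counter:
    """Count how many documents were assigned to each topic."""
    return Counter(doc.topic for doc in docs)


# Made-up rows, not taken from the real data sets.
docs = [
    Document("snippet a", "text a", 1, "http://example.com/a", "car"),
    Document("snippet b", "text b", 2, "http://example.com/b", "animal"),
    Document("snippet c", "text c", 3, "http://example.com/c", "car"),
]
sizes = topic_sizes(docs)
print(sizes.most_common())
```

Applied to the real data, a count like this should come to 34 topics for “jaguar” and 21 for “salsa” if the rows were loaded completely.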

If you make use of these data sets, please cite this paper.

    AUTHOR =    {Daniel Crabtree and Xiaoying Gao and Peter Andreae},
    TITLE =     {Improving Web Clustering by Cluster Selection},
    BOOKTITLE = {The 2005 IEEE/WIC/ACM International Conference 
                 on Web Intelligence (WI'05)},
    YEAR =      {2005},
    PAGES =     {172--178}