Daniel Crabtree - PhD Thesis

Techniques for Improving Web Search by Understanding Queries

Title	Techniques for Improving Web Search by Understanding Queries
Publication	PhD Thesis
Author	Daniel Crabtree
Date	May 2011
Institution	School of Engineering and Computer Science Victoria University of Wellington New Zealand
Pages	379
Files	Download PDF (64.9 MB), Download QC4 Code (16 kB)

Abstract

This thesis investigates the refinement of web search results with a special focus on the use of clustering and the role of queries. It presents a collection of new methods for evaluating clustering methods, performing clustering effectively, and for performing query refinement.

The thesis identifies different types of query, the situations where refinement is necessary, and the factors affecting search difficulty. It then analyses hard searches and argues that many of them fail because users and search engines have different query models.

The thesis identifies best practice for evaluating web search results and search refinement methods. It finds that none of the commonly used evaluation measures for clustering meet all of the properties of good evaluation measures. It then presents new quality and coverage measures that satisfy all the desired properties and that rank clusterings correctly in all web page clustering situations.

The thesis argues that current web page clustering methods work well when different interpretations of the query have distinct vocabulary, but still have several limitations and often produce incomprehensible clusters. It then presents a new clustering method that uses the query to guide the construction of semantically meaningful clusters. The new clustering method significantly improves performance.

Finally, the thesis explores how searches and queries are composed of different aspects and shows how to use aspects to reduce the distance between the query models of search engines and users. It then presents fully automatic methods that identify query aspects, identify underrepresented aspects, and predict query difficulty. Used in combination, these methods have many applications — the thesis describes methods for two of them. The first method improves the search results for hard queries with underrepresented aspects by automatically expanding the query using semantically orthogonal keywords related to the underrepresented aspects. The second method helps users refine hard ambiguous queries by identifying the different query interpretations using a clustering of a diverse set of refinements. Both methods significantly outperform existing methods.

Errata

p147 — QC4 also ignores outliers when computing coverage.

The last sentence in section 4.6.1 should read:

QC4 ignores the outlier topic and ignores clusters that contain a majority of outliers when computing quality and coverage. Specifically, the outlier topic is ignored in AC (p148), WC (p148), and L (p149), and clusters that contain a majority of outliers are ignored in AQ (p147), WQ (p147), and IDC (p172).

p174 — The formula and definition of rank(n) is incorrect.

The last paragraph of section 4.8.5.1 should read:

The rank of a tranche, rank(n), is defined to be the rank of its smallest cluster when all clusters are ordered by size. For example, if tranche one (with the smallest clusters) had 10 clusters, tranche two had 15 clusters, and tranche three (with the largest clusters) had 5 clusters, then the rank of tranche three would be 5, the rank of tranche two would be 20, and the rank of tranche one would be 30. $$rank(n) = \sum_{i=n}^{\infty} |\{ c \in C | tranche(c) = i \}|$$

p299 — The column labels have been transposed.

In table 6.11, column 1 should be titled Spearman, column 2 should be titled Pearson, and column 3 should be titled Kendall.

QC4 Implementation Code

I have released a Java implementation of the QC4 evaluation method. It can be downloaded here (16 kB).

The code includes a demonstration of how to use the qc4 package to evaluate different clusterings. In particular, it constructs all of the clusterings and ideal clusterings used in the synthetic tests from chapter 4 of my PhD thesis, evaluates those clusterings using QC4, and outputs the results to System.out. It also demonstrates how to specify outlier topics.

If you make use of this code, please cite my PhD Thesis.

BibTeX

@PHDTHESIS{IWS11,
    AUTHOR =  {Daniel Crabtree},
    TITLE =   {Techniques for Improving Web Search 
               by Understanding Queries},
    SCHOOL =  {School of Engineering and Computer Science,
               Victoria University of Wellington},
    ADDRESS = {Wellington, New Zealand},
    MONTH =   {May},    
    YEAR =    {2011}
}