CMU 11642 的 project。

项目介绍

简介

数据：ClueWeb09 dataset，共 553,202 篇文档，用 Lucene 建立的索引。
部分框架是现成的，有 api 文档
我们要做的是实现部分 operator 以及 ranking algorithm。

实现一个个性化的搜索引擎，具有以下能力：

diversification
query expansion
learning to rank

支持的 operator:

1	#OR, #AND, #SYN, #NEAR/n, #WINDOW/n, #SUM #AND, #WAND, #WSUM, #WINDOW

支持的 fields:

1	'url', 'keywords' , 'title', 'body', 'inlink'

支持的 ranking algorithm
Unranked/Ranked boolean, Okapi BM25, Indri, Le2R(use SVM), diversification algorithm(xQuAD & PM25), and etc.

输入

程序输入: one parameter (name of parameter file)

parameter file 必须包括以下参数：

- queryFilePath= The path to the query file.
- indexPath= The path to the Lucene index directory. Typically this will be something like "indexPath=index".
- trecEvalOutputPath= The path to the file where your software will write its output for trec_eval.
- retrievalAlgorithm= "UnrankedBoolean" ／ "RankedBoolean" ／ "BM25" / "Indri"

可选参数
用于 “BM25” / “Indri” 模型。

- BM25:k_1= Values are real numbers >= 0.0.
- BM25:b= Values are real numbers between 0.0 and 1.0.
- BM25:k_3= Values are real numbers >= 0.0.
- Indri:mu= Values are integers >= 0.
- Indri:lambda= Values are real numbers between 0.0 and 1.0

用于 query expansion。

- fb= Acceptable values are "true" and "false". This value controls whether query expansion is performed (fb=true).
- fbDocs= Acceptable values are integers > 0. This value determines the number of documents to use for query expansion.
- fbTerms= Acceptable values are integers > 0. This value determines the number of terms that are added to the query.
- fbMu= Acceptable values are integers >= 0. This value determines the amount of smoothing used to calculate p(r|d).
- fbOrigWeight= Acceptable values are between 0.0 and 1.0. This value determines the weight on the original query. The weight on the expanded query is (1-fbOrigWeight).
- fbInitialRankingFile= The value is a string that contains the name of a file (in trec_eval input format) that contains an initial document ranking for the query.
- fbExpansionQueryFile= The value is a string that contains the name of a file where your software must write its expansion query. The file format is described below.

用于 Le2R:（解释待修正）

letor:trainingQueryFile= HW4-train.qry.txt
letor:trainingQrelsFile= HW4-train.qrel.txt
letor:trainingFeatureVectorsFile= HW4-train.vec.txt
letor:pageRankFile= HW4.pk.txt
letor:svmRankLearnPath= svm_rank/svm_rank_learn
letor:svmRankClassifyPath= svm_rank/svm_rank_classify
letor:svmRankParamC= 0.001
letor:svmRankModelFile= HW4-svm.model.txt
letor:testingFeatureVectorsFile= HW4-test.vec.txt
letor:testingDocumentScores= HW4-test.rank.txt

用于 diversification:（解释待修正）

diversity=true
diversity:maxInputRankingsLength=100
diversity:maxResultRankingLength=50
diversity:algorithm=xquad
diversity:intentsFile=q.intents.txt
diversity:lambda=1

输出

程序输出：
在 trecEvalOutputPath 指定的文件中：

QueryID	Q0	DocID	                    Rank	Score	RunID
10	    Q0	clueweb09-enwp03-35-1378	1	    16	  run-1
10	    Q0	clueweb09-enwp00-78-1360	2	    11	  run-1
10	    Q0	clueweb09-enwp00-67-0958	3	    9	    run-1
:	      :	  :	                        :	    :	    :
11	    Q0	clueweb09-enwp00-63-1141	1	    18	  run-1

如果有 query expansion，在 fbExpansionQueryFile 指定的文件中：

qid1: query1
qid2: query2
 :      :
eg.
1: #wand (0.73 obama 0.43 family 0.40 white 0.65 tree 0.33 politics ...)
2: #wand (0.69 french 0.83 lick 0.76 indiana ...)

基本策略

要求

从 query file 中逐条读取 query
将 query parse 为 query tree，internal nodes 是 operators，leaves 是 index terms
- 如果一个 query 没有 explicit operator，默认为 #OR
- 如果一个 query 没有 explicit field，默认为 body
- 对 query term 进行 stemming 和 stopwords 处理
评估 query，用 DAAT 策略，对 leaf node 的 evaluation 就是如果这个 term 的 inverted list 存在，就获取它，注意有些 query term 是没有 inverted list 的。
对所有文档按文档分数降序排序，如果分数相同，按 external document id 升序排序。

%E9%A1%B9%E7%9B%AE%E5%AE%9E%E6%88%98-%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E/query_tree.jpg

Operator

不同模型支持的 operator 各有不同
系统需要支持的 Operator 有 #OR, #AND, #SYN, #NEAR/n, #WINDOW/n, 对 BM25 模型来说，还需要支持 #SUM，对 Indri 模型来说，还需要支持 #AND, #WAND, #WSUM, #WINDOW

Fields

系统支持的 fields 有 ‘url’, ‘keywords’ (from the html ‘meta’ tag), ‘title’, ‘body’, 和 ‘inlink’ 5 种，query 形式为 apple.title。

Query

#Operator( term_1.field term_2.field ... term_n.field )
apples
#AND (apple bananas)
#OR (apple bananas)
#NEAR/3 (apple pie)
#NEAR/5 (pie apple)

排序模型

Exact-match

Boolean retrieval 需要支持的 Operator 有 #OR, #AND, #SYN, #NEAR/n, #WINDOW/n

#OR 只要有一个 query term 在文档中出现，就算 match，在 ranked boolean retrieval 中分数为所有匹配的 query term 的 tf 的最大值。
#AND 只有在所有 query term 都在文档中出现时，才算 match，在 ranked boolean retrieval 中分数为所有 query term 的 tf 的最小值。
#NEAR/n 如果每对相邻两个 query term 之间的距离小于 n，才算 match，在 ranked boolean retrieval 中分数为 match 的次数。（For example, #NEAR/2(a b c) matches “a b c”, “a x b c”, “a b x c”, and “a x b x c”, but not “a x x b c”）。
#WINDOW/n 和 #NEAR/n 类似，但是不要求顺序。

Search Engines笔记 - Exact-match retrieval