Search Engines Notes - Pseudo Relevance Feedback

Course notes for CMU 11642. How do we generate better queries that retrieve more relevant documents? From the user's perspective, a user typically starts with a short query, then refines it after seeing the result documents by adding or removing terms and adjusting term weights. From the system's perspective, the way to automatically produce better queries is a machine learning algorithm.

Relevance feedback is essentially a supervised machine learning problem. Ideally we would learn f(document) → {relevant, not relevant}, but in practice we usually learn f(document) → score. As for training-set size, 10-20 judged documents is good; 100-200 is great.

Relevance feedback is not used very often. For one thing, users dislike giving judgments (so the training data is small and accuracy is not guaranteed); for another, the judgments are risky: when only a few documents are judged, the results are highly variable, and stability and consistency may suffer. So in practice we usually use pseudo-relevance feedback, an unsupervised machine learning method.

Pseudo-relevance feedback

The basic idea is to treat the original query as a classifier: use it to label part of the data, yielding the top-ranked documents, then use that labeled data to produce a better classifier. Basic procedure (a minimal sketch follows the list):

  1. Retrieve documents with the original query
  2. Treat the top N result documents as the training set. These documents may not all be relevant, but the goal is to learn vocabulary patterns.
  3. Apply a relevance feedback algorithm to select terms and term weights
  4. Form a new query and retrieve documents again
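
A minimal sketch of this loop in Java (runQuery, scoreExpansionTerms, and buildWandQuery are hypothetical helpers, not part of the course API; the concrete Indri version appears later in these notes):

// Pseudo-relevance feedback driver -- a sketch, not the course API.
static ScoreList pseudoRelevanceFeedback(String originalQuery, int fbDocs, int fbTerms)
    throws IOException {
  ScoreList initial = runQuery(originalQuery);        // 1. retrieve with the original query
  initial.sort();
  int n = Math.min(fbDocs, initial.size());           // 2. top N documents act as the training set
  Map<String, Double> termScores =
      scoreExpansionTerms(initial, n);                // 3. select terms and term weights
  String learnedQuery = buildWandQuery(termScores, fbTerms);
  return runQuery(learnedQuery);                      // 4. retrieve with the new query
}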

Okapi BM25

Procedure:

  1. Retrieve documents with the original query
  2. Take the terms of the top N documents as potential expansion terms
  3. Compute a score for each potential expansion term (a standard scoring function is sketched after this list)
  4. Build a new $query_{learned}$ from the top m terms
  5. Retrieve documents with the new query
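
The notes leave the scoring function of step 3 unspecified. A classic choice in the Okapi line of work (stated here as background, not necessarily the exact formula used in the course) is the Robertson Selection Value, built on the Robertson/Sparck Jones weight:

$$RSV(t) = r_t \cdot \log \frac{(r_t + 0.5)(N - n_t - R + r_t + 0.5)}{(n_t - r_t + 0.5)(R - r_t + 0.5)}$$

where $R$ is the number of feedback documents, $r_t$ the number of feedback documents that contain $t$, $N$ the number of documents in the collection, and $n_t$ the document frequency of $t$.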

Inference networks (Indri)

Procedure:

  1. Retrieve documents with the original query
  2. Take the terms of the top N documents as potential expansion terms
  3. Compute a score for each potential expansion term
  4. Build a new $Q_{learned}$ from the top m terms
  5. Combine $Q_{original}$ and $Q_{learned}$ into $Q_{expanded}$
  6. Retrieve documents with the new query

For each candidate expansion term $t$, compute

$$p(t|I) = \sum_{d \in D_{top}} p(t|d)\, p(I|d), \qquad p(t|d) = \frac{tf_{t,d} + \mu\, p(t|C)}{|d| + \mu}$$

where $p(I|d)$ is the document's Indri score for the original query and $p(t|C)$ is the term's MLE probability in the collection. This alone does not penalize terms that are common across the whole collection, so an idf-like factor is multiplied in:

$$score(t) = p(t|I) \cdot \log\frac{1}{p(t|C)}$$

The final expanded query is
$$Q_{expanded} = \#wand(w\,Q_{original}\;\;(1-w)\,Q_{learned})$$
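
For example, with $w = 0.5$ the final query string might look like this (terms and weights are purely illustrative):

#wand ( 0.5 #and( obama family tree ) 0.5 #wand ( 0.9012 obama 0.7543 genealog 0.6011 ancestri ) )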

Required parameters (a sample parameter file follows the list):

  • fbDocs: number of judged documents
  • fbTerms: number of terms to add to the query; Indri's default is 10
  • $\mu$: smoothing weight to use for new terms; Indri's default is 0
  • $w$: weight of the original query; Indri's default is 0.5
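
A parameter-file fragment for these options, using the keys that the code below reads (values are illustrative; fbInitialRankingFile is optional, and if it is absent the system retrieves the initial ranking itself):

fb=true
fbDocs=10
fbTerms=10
fbMu=0
fbOrigWeight=0.5
fbInitialRankingFile=initial.teIn
fbExpansionQueryFile=expanded.qry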

How many terms is enough
The standard answer: it depends! It varies from query to query.

Corpus
The initial query and the final expanded query can actually be run against different corpora: for example, run the initial query against Wikipedia to produce high-quality expansion terms, then run the expanded query against the web. This can significantly improve MAP and P@10.

On to the code:

/**
 * Learn an expanded #wand query from the top-ranked documents.
 *
 * @param score_list the initial ranking, sorted by score
 * @return the learned query string
 * @throws IOException if the index cannot be read
 */
private static String expandQuery(ScoreList score_list) throws IOException {
  double fbMu = Double.parseDouble(parameters.get("fbMu"));
  int fbDocs = Integer.parseInt(parameters.get("fbDocs"));
  int fbTerms = Integer.parseInt(parameters.get("fbTerms"));
  int docNum = Math.min(fbDocs, score_list.size());
  // term -> ids of the feedback documents that contain it
  Map<String, ArrayList<Integer>> invertedList = new HashMap<>();
  // term -> accumulated score
  Map<String, Double> termScore = new HashMap<>();
  // gather and score candidate expansion terms
  for (int i = 0; i < docNum; i++) {
    int doc_id = score_list.getDocid(i);
    TermVector vec = new TermVector(doc_id, "body");
    double docScore = score_list.getDocidScore(i);
    double docLen = Idx.getFieldLength("body", doc_id);
    // stem 0 is a stopword placeholder, so start from 1
    for (int j = 1; j < vec.stemsLength(); j++) {
      String term = vec.stemString(j);
      // ignore any candidate expansion term that contains a period
      // ('.') or a comma (',')
      if (term.contains(".") || term.contains(",")) {
        continue;
      }
      // update the inverted list for the current term
      invertedList.computeIfAbsent(term, k -> new ArrayList<>()).add(doc_id);
      // score the potential expansion term for the current doc:
      // p(t|d) is the mu-smoothed MLE; log(1/p(t|C)) penalizes terms
      // that are common in the whole collection
      long tf = vec.stemFreq(j);
      long ctf = vec.totalStemFreq(j);
      double mle = ctf / (double) Idx.getSumOfFieldLengths("body");
      double Ptd = (tf + fbMu * mle) / (docLen + fbMu);
      double idf = Math.log(1 / mle);
      double cur_doc_score = Ptd * docScore * idf;
      termScore.merge(term, cur_doc_score, Double::sum);
    }
  }
  // max-heap over term scores, so poll() yields the best remaining term
  PriorityQueue<Map.Entry<String, Double>> termScorePq = new PriorityQueue<>(
      Math.max(1, termScore.size()),
      (m1, m2) -> m2.getValue().compareTo(m1.getValue()));
  termScorePq.addAll(termScore.entrySet());
  // build the learned query from the top fbTerms terms
  StringBuilder learnedQuery = new StringBuilder("#wand (");
  for (int i = 0; i < fbTerms && !termScorePq.isEmpty(); i++) {
    Map.Entry<String, Double> entry = termScorePq.poll();
    learnedQuery.append(String.format(" %.4f %s", entry.getValue(), entry.getKey()));
  }
  learnedQuery.append(" )");
  System.out.println("learnedQuery " + learnedQuery);
  return learnedQuery.toString();
}

Processing the query file:

/**
 * Process the query file.
 *
 * @param queryFilePath path to the query file, one "qid:query" per line
 * @param trecEvalOutputPath path of the trec_eval output file
 * @param model the retrieval model
 * @throws Exception if a required parameter is missing or a query id
 *         is absent from the initial ranking file
 */
static void processQueryFile(String queryFilePath, String trecEvalOutputPath, RetrievalModel model)
    throws Exception {
  BufferedReader input = null;
  BufferedWriter output = null;
  BufferedWriter bw = null;
  try {
    String qLine = null;
    input = new BufferedReader(new FileReader(queryFilePath));
    output = new BufferedWriter(new FileWriter(trecEvalOutputPath));
    bw = new BufferedWriter(new FileWriter(parameters.get("fbExpansionQueryFile")));
    // Each pass of the loop processes one query.
    while ((qLine = input.readLine()) != null) {
      int d = qLine.indexOf(':');
      if (d < 0) {
        throw new IllegalArgumentException("Syntax error: Missing ':' in query line.");
      }
      printMemoryUsage(false);
      String qid = qLine.substring(0, d);
      String query = qLine.substring(d + 1);
      System.out.println("Query " + qLine);
      ScoreList r = null;
      String defaultOp = model.defaultQrySopName();
      query = defaultOp + "(" + query + ")";
      if (!(parameters.containsKey("fb") && parameters.get("fb").equals("true"))) {
        // no query expansion
        r = processQuery(query, model);
      } else {
        // query expansion: check the required parameters first
        if (!(parameters.containsKey("fbDocs") && parameters.containsKey("fbTerms")
            && parameters.containsKey("fbMu") && parameters.containsKey("fbOrigWeight")
            && parameters.containsKey("fbExpansionQueryFile"))) {
          throw new IllegalArgumentException("Required parameters were missing from the parameter file.");
        }
        // use the initial ranking file if one was provided, otherwise
        // retrieve an initial ranking with the original query
        if (!parameters.containsKey("fbInitialRankingFile")) {
          r = processQuery(query, model);
          r.sort();
        } else {
          Map<Integer, ScoreList> score_list_map = readRankingFile(
              parameters.get("fbInitialRankingFile"));
          if (!score_list_map.containsKey(Integer.parseInt(qid))) {
            throw new Exception("No query " + qid + " in ranking file!");
          }
          r = score_list_map.get(Integer.parseInt(qid));
        }
        String expandedQuery = expandQuery(r);
        printExpandedQuery(bw, qid, expandedQuery);
        // Q_expanded = #wand(w Q_original (1-w) Q_learned)
        double fbOrigWeight = Double.parseDouble(parameters.get("fbOrigWeight"));
        String newQuery = "#wand (" + fbOrigWeight + " " + query + " "
            + (1 - fbOrigWeight) + " " + expandedQuery + " )";
        r = processQuery(newQuery, model);
      }
      if (r != null) {
        printResults(qid, r, output);
      }
    }
  } catch (IOException ex) {
    ex.printStackTrace();
  } finally {
    // the streams may be null if opening a file failed
    if (input != null) input.close();
    if (output != null) output.close();
    if (bw != null) bw.close();
  }
}

If there is an initial ranking file:
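
The file is in trec_eval input format; readRankingFile below uses fields 0 (query id), 2 (external document id), and 4 (score). An illustrative line:

151 Q0 clueweb09-en0000-48-02997 1 -4.2765 run-1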

/**
 * Read an initial ranking file in trec_eval input format.
 *
 * @param fbInitialRankingFile path to the ranking file
 * @return a map from query id to its ScoreList
 */
private static Map<Integer, ScoreList> readRankingFile(String fbInitialRankingFile) {
  Map<Integer, ScoreList> scoreList_map = new HashMap<>();
  try (BufferedReader br = new BufferedReader(new FileReader(fbInitialRankingFile))) {
    String str;
    int last_qry = -1;
    ScoreList score_list = new ScoreList();
    while ((str = br.readLine()) != null) {
      // fields: qid Q0 externalDocid rank score runId
      String[] data = str.trim().split("\\s+");
      int cur_qry = Integer.parseInt(data[0]);
      if (last_qry == -1) {
        last_qry = cur_qry;
      }
      if (cur_qry != last_qry) {
        // a new query starts; save the previous one
        scoreList_map.put(last_qry, score_list);
        last_qry = cur_qry;
        score_list = new ScoreList();
      }
      score_list.add(Idx.getInternalDocid(data[2]), Double.parseDouble(data[4]));
    }
    // add the last query and its score list
    scoreList_map.put(last_qry, score_list);
  } catch (Exception e) {
    e.printStackTrace();
  }
  return scoreList_map;
}

Effectiveness

  • Query expansion improves MAP by about 20% on average
  • But it can also annoy about one third of users

So query expansion is typically used where recall matters, or where average performance matters, e.g. legal retrieval, TREC, and research-paper search.
