Search Engines Notes - Pseudo Relevance Feedback

Course notes for CMU 11642. How do we generate better queries that retrieve more relevant documents? From the user's perspective, a user typically starts with a short query, then refines it after seeing the result documents by adding or removing terms and adjusting term weights. From the system's perspective, the way to automatically produce better queries is a machine learning algorithm.

Relevance feedback is essentially a supervised machine learning problem. Ideally we would learn f(document) → {relevant, not relevant}, but in practice we usually learn f(document) → score. As for training-set size, 10-20 judged documents is good; 100-200 is great.

Relevance feedback is not used very often. For one thing, users dislike giving judgments (so the training data is small and accuracy is not guaranteed); for another, the judgments are risky: when only a few documents are judged, the results are highly variable, and stability and consistency may suffer. So in practice we usually use pseudo-relevance feedback, an unsupervised machine learning method.

Pseudo-relevance feedback

The basic idea is to treat the original query as a classifier: use it to label part of the data, yielding the top-ranked documents, then use that labeled data to produce a better classifier. Basic procedure (a minimal sketch follows the list):

  1. Retrieve documents with the original query
  2. Treat the top N result documents as the training set. These documents may not all be relevant, but the goal is to learn vocabulary patterns.
  3. Apply a relevance feedback algorithm to select terms and term weights
  4. Form a new query and retrieve documents again
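
A minimal sketch of this loop in Java (runQuery, scoreExpansionTerms, and buildWandQuery are hypothetical helpers, not part of the course API; the concrete Indri version appears later in these notes):

// Pseudo-relevance feedback driver -- a sketch, not the course API.
static ScoreList pseudoRelevanceFeedback(String originalQuery, int fbDocs, int fbTerms)
    throws IOException {
  ScoreList initial = runQuery(originalQuery);        // 1. retrieve with the original query
  initial.sort();
  int n = Math.min(fbDocs, initial.size());           // 2. top N documents act as the training set
  Map<String, Double> termScores =
      scoreExpansionTerms(initial, n);                // 3. select terms and term weights
  String learnedQuery = buildWandQuery(termScores, fbTerms);
  return runQuery(learnedQuery);                      // 4. retrieve with the new query
}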

Okapi BM25

Procedure:

  1. Retrieve documents with the original query
  2. Take the terms of the top N documents as potential expansion terms
  3. Compute a score for each potential expansion term (a standard scoring function is sketched after this list)
  4. Build a new $query_{learned}$ from the top m terms
  5. Retrieve documents with the new query
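
The notes leave the scoring function of step 3 unspecified. A classic choice in the Okapi line of work (stated here as background, not necessarily the exact formula used in the course) is the Robertson Selection Value, built on the Robertson/Sparck Jones weight:

$$RSV(t) = r_t \cdot \log \frac{(r_t + 0.5)(N - n_t - R + r_t + 0.5)}{(n_t - r_t + 0.5)(R - r_t + 0.5)}$$

where $R$ is the number of feedback documents, $r_t$ the number of feedback documents that contain $t$, $N$ the number of documents in the collection, and $n_t$ the document frequency of $t$.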

Inference networks (Indri)

Procedure:

  1. Retrieve documents with the original query
  2. Take the terms of the top N documents as potential expansion terms
  3. Compute a score for each potential expansion term
  4. Build a new $Q_{learned}$ from the top m terms
  5. Combine $Q_{original}$ and $Q_{learned}$ into $Q_{expanded}$
  6. Retrieve documents with the new query

For each candidate expansion term $t$, compute

$$p(t|I) = \sum_{d \in D_{top}} p(t|d)\, p(I|d), \qquad p(t|d) = \frac{tf_{t,d} + \mu\, p(t|C)}{|d| + \mu}$$

where $p(I|d)$ is the document's Indri score for the original query and $p(t|C)$ is the term's MLE probability in the collection. This alone does not penalize terms that are common across the whole collection, so an idf-like factor is multiplied in:

$$score(t) = p(t|I) \cdot \log\frac{1}{p(t|C)}$$

The final expanded query is
$$Q_{expanded} = \#wand(w\,Q_{original}\;\;(1-w)\,Q_{learned})$$
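
For example, with $w = 0.5$ the final query string might look like this (terms and weights are purely illustrative):

#wand ( 0.5 #and( obama family tree ) 0.5 #wand ( 0.9012 obama 0.7543 genealog 0.6011 ancestri ) )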

Required parameters (a sample parameter file follows the list):

  • fbDocs: number of judged documents
  • fbTerms: number of terms to add to the query; Indri's default is 10
  • $\mu$: smoothing weight to use for new terms; Indri's default is 0
  • $w$: weight of the original query; Indri's default is 0.5
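
A parameter-file fragment for these options, using the keys that the code below reads (values are illustrative; fbInitialRankingFile is optional, and if it is absent the system retrieves the initial ranking itself):

fb=true
fbDocs=10
fbTerms=10
fbMu=0
fbOrigWeight=0.5
fbInitialRankingFile=initial.teIn
fbExpansionQueryFile=expanded.qry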

How many terms is enough
The standard answer: it depends! It varies from query to query.

Corpus
The initial query and the final expanded query can actually be run against different corpora: for example, run the initial query against Wikipedia to produce high-quality expansion terms, then run the expanded query against the web. This can significantly improve MAP and P@10.

On to the code:

/**
 * Learn an expanded #wand query from the top-ranked documents.
 *
 * @param score_list the initial ranking, sorted by score
 * @return the learned query string
 * @throws IOException if the index cannot be read
 */
private static String expandQuery(ScoreList score_list) throws IOException {
  double fbMu = Double.parseDouble(parameters.get("fbMu"));
  int fbDocs = Integer.parseInt(parameters.get("fbDocs"));
  int fbTerms = Integer.parseInt(parameters.get("fbTerms"));
  int docNum = Math.min(fbDocs, score_list.size());
  // term -> ids of the feedback documents that contain it
  Map<String, ArrayList<Integer>> invertedList = new HashMap<>();
  // term -> accumulated score
  Map<String, Double> termScore = new HashMap<>();
  // gather and score candidate expansion terms
  for (int i = 0; i < docNum; i++) {
    int doc_id = score_list.getDocid(i);
    TermVector vec = new TermVector(doc_id, "body");
    double docScore = score_list.getDocidScore(i);
    double docLen = Idx.getFieldLength("body", doc_id);
    // stem 0 is a stopword placeholder, so start from 1
    for (int j = 1; j < vec.stemsLength(); j++) {
      String term = vec.stemString(j);
      // ignore any candidate expansion term that contains a period
      // ('.') or a comma (',')
      if (term.contains(".") || term.contains(",")) {
        continue;
      }
      // update the inverted list for the current term
      invertedList.computeIfAbsent(term, k -> new ArrayList<>()).add(doc_id);
      // score the potential expansion term for the current doc:
      // p(t|d) is the mu-smoothed MLE; log(1/p(t|C)) penalizes terms
      // that are common in the whole collection
      long tf = vec.stemFreq(j);
      long ctf = vec.totalStemFreq(j);
      double mle = ctf / (double) Idx.getSumOfFieldLengths("body");
      double Ptd = (tf + fbMu * mle) / (docLen + fbMu);
      double idf = Math.log(1 / mle);
      double cur_doc_score = Ptd * docScore * idf;
      termScore.merge(term, cur_doc_score, Double::sum);
    }
  }
  // max-heap over term scores, so poll() yields the best remaining term
  PriorityQueue<Map.Entry<String, Double>> termScorePq = new PriorityQueue<>(
      Math.max(1, termScore.size()),
      (m1, m2) -> m2.getValue().compareTo(m1.getValue()));
  termScorePq.addAll(termScore.entrySet());
  // build the learned query from the top fbTerms terms
  StringBuilder learnedQuery = new StringBuilder("#wand (");
  for (int i = 0; i < fbTerms && !termScorePq.isEmpty(); i++) {
    Map.Entry<String, Double> entry = termScorePq.poll();
    learnedQuery.append(String.format(" %.4f %s", entry.getValue(), entry.getKey()));
  }
  learnedQuery.append(" )");
  System.out.println("learnedQuery " + learnedQuery);
  return learnedQuery.toString();
}

Processing the query file:

/**
 * Process the query file.
 *
 * @param queryFilePath path to the query file, one "qid:query" per line
 * @param trecEvalOutputPath path of the trec_eval output file
 * @param model the retrieval model
 * @throws Exception if a required parameter is missing or a query id
 *         is absent from the initial ranking file
 */
static void processQueryFile(String queryFilePath, String trecEvalOutputPath, RetrievalModel model)
    throws Exception {
  BufferedReader input = null;
  BufferedWriter output = null;
  BufferedWriter bw = null;
  try {
    String qLine = null;
    input = new BufferedReader(new FileReader(queryFilePath));
    output = new BufferedWriter(new FileWriter(trecEvalOutputPath));
    bw = new BufferedWriter(new FileWriter(parameters.get("fbExpansionQueryFile")));
    // Each pass of the loop processes one query.
    while ((qLine = input.readLine()) != null) {
      int d = qLine.indexOf(':');
      if (d < 0) {
        throw new IllegalArgumentException("Syntax error: Missing ':' in query line.");
      }
      printMemoryUsage(false);
      String qid = qLine.substring(0, d);
      String query = qLine.substring(d + 1);
      System.out.println("Query " + qLine);
      ScoreList r = null;
      String defaultOp = model.defaultQrySopName();
      query = defaultOp + "(" + query + ")";
      if (!(parameters.containsKey("fb") && parameters.get("fb").equals("true"))) {
        // no query expansion
        r = processQuery(query, model);
      } else {
        // query expansion: check the required parameters first
        if (!(parameters.containsKey("fbDocs") && parameters.containsKey("fbTerms")
            && parameters.containsKey("fbMu") && parameters.containsKey("fbOrigWeight")
            && parameters.containsKey("fbExpansionQueryFile"))) {
          throw new IllegalArgumentException("Required parameters were missing from the parameter file.");
        }
        // use the initial ranking file if one was provided, otherwise
        // retrieve an initial ranking with the original query
        if (!parameters.containsKey("fbInitialRankingFile")) {
          r = processQuery(query, model);
          r.sort();
        } else {
          Map<Integer, ScoreList> score_list_map = readRankingFile(
              parameters.get("fbInitialRankingFile"));
          if (!score_list_map.containsKey(Integer.parseInt(qid))) {
            throw new Exception("No query " + qid + " in ranking file!");
          }
          r = score_list_map.get(Integer.parseInt(qid));
        }
        String expandedQuery = expandQuery(r);
        printExpandedQuery(bw, qid, expandedQuery);
        // Q_expanded = #wand(w Q_original (1-w) Q_learned)
        double fbOrigWeight = Double.parseDouble(parameters.get("fbOrigWeight"));
        String newQuery = "#wand (" + fbOrigWeight + " " + query + " "
            + (1 - fbOrigWeight) + " " + expandedQuery + " )";
        r = processQuery(newQuery, model);
      }
      if (r != null) {
        printResults(qid, r, output);
      }
    }
  } catch (IOException ex) {
    ex.printStackTrace();
  } finally {
    // the streams may be null if opening a file failed
    if (input != null) input.close();
    if (output != null) output.close();
    if (bw != null) bw.close();
  }
}

If there is an initial ranking file:
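
The file is in trec_eval input format; readRankingFile below uses fields 0 (query id), 2 (external document id), and 4 (score). An illustrative line:

151 Q0 clueweb09-en0000-48-02997 1 -4.2765 run-1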

/**
 * Read an initial ranking file in trec_eval input format.
 *
 * @param fbInitialRankingFile path to the ranking file
 * @return a map from query id to its ScoreList
 */
private static Map<Integer, ScoreList> readRankingFile(String fbInitialRankingFile) {
  Map<Integer, ScoreList> scoreList_map = new HashMap<>();
  try (BufferedReader br = new BufferedReader(new FileReader(fbInitialRankingFile))) {
    String str;
    int last_qry = -1;
    ScoreList score_list = new ScoreList();
    while ((str = br.readLine()) != null) {
      // fields: qid Q0 externalDocid rank score runId
      String[] data = str.trim().split("\\s+");
      int cur_qry = Integer.parseInt(data[0]);
      if (last_qry == -1) {
        last_qry = cur_qry;
      }
      if (cur_qry != last_qry) {
        // a new query starts; save the previous one
        scoreList_map.put(last_qry, score_list);
        last_qry = cur_qry;
        score_list = new ScoreList();
      }
      score_list.add(Idx.getInternalDocid(data[2]), Double.parseDouble(data[4]));
    }
    // add the last query and its score list
    scoreList_map.put(last_qry, score_list);
  } catch (Exception e) {
    e.printStackTrace();
  }
  return scoreList_map;
}

Effectiveness

  • Query expansion improves MAP by about 20% on average
  • But it can also annoy about one third of users

So query expansion is typically used where recall matters, or where average performance matters, e.g. legal retrieval, TREC, and research-paper search.
