CMU 11642 的课程笔记。这一章开始的很多内容都是从论文的研究报告而来。重点是第四部分，怎样对 search log 划分 session，有哪些 feature 可以作为划分 session 的依据。

User information collection

大多数搜索引擎会保存每一次搜索的信息，主要有下面的内容：

query
timestamp
IP address of the search client
possibly an id recorded in a cookie or obtained another way
information about the operating system and browser
clickthrough information

目前公开的 web search logs

The Excite log(1997)
18,113 users, 51,473 queries
The AOL Log(2006)
$>$ 650,000 users, $>$ 20 million queries
Alta Vista(1999)
285 million users, 1 billion queries
Alta Vista(2001)
$>$ 7 million queries

Query characteristics

Query Length: Average terms per query: 3.5 terms

Who queries

Query Frequency: Query Frequency Follows a Power Law

Power Law
$$Frequency(q) = K*Rank(q)^{- \alpha}$$

K: positive constant
Rank(q): popularity rank(r=1 is the most popular)
$\alpha$: constant, about 2.4 for the Excite query log

和 Zipf’s law 一样的 shape，但不一样的斜率。

一小部分的 query 非常常见，而大部分的 query 非常少见。 => 通过缓存来提高搜索效率。

Vary over time

最常见的 query 总是在不断变化

来自 google 的统计数据

每天的 query 中有 20% 在之前从未出现过。(占每天所有 unique query 的 50%)
8% 的 query 是名字(names)

Model query

Query topics: What
比如说在雅虎目录下的分类。
User demographics: Who
如用户的年龄、性别（用户提供）、收入、教育程度（从邮编等信息推断）等。
Session characteristics: How
eg. session length, number of queries/session, % of queries with low/high click entropy

Represent users

需要用到的工具: Pseudo documents, clustering, control vocabulary
How to represent users？

在 “what” 维度上对用户进行聚类
在 (“who”,”how”) 维度上进一步 investigate groups
根据 group 的显著特征对其打上标签

根据用户的 query type 来对用户进行聚类：

从日志中得到 (user,query) pairs
为用户创建 pseudo documents
标题: user id
内容: 每条 query 的前 10 篇文档对应的 Yahoo! 分类目录
计算相似度
e.g. JS divergence, cosine

Types of users

这一部分主要是介绍不同类型的用户特征，这些聚类结果有什么用？=> 为下一步的个性化搜索引擎做准备。

Informational Users

How:
• More likely to issue non-navigational queries
• Less likely to have single-click sessions
• More likely to use query suggestions

What:
• Wide range of topics
• Little interest in adult content

Who:
• More likely to be well-educated
• More likely to have above-average income

Navigational Users

How:
• More likely to issue navigational queries
• More likely to have single-click sessions
• Less likely to use query suggestions

What:
• Dominated by popular websites (Facebook, YouTube, Craigslist) Who
• Mostly representative of the entire population

Transactional Users

How:
• Somewhat similar to navigational users. But, multiple sites can perform the transaction
• Diverse clicks
• Short interaction with search engine What
• Shopping, adult content, gaming

Who:
• Depends heavily on the type of transaction
• Topic “recreation/games” associated with low income & education

聚类结果 - Selected groups

Baby boomers:
• 50 years old
• Interested in finance
• Simple navigational queries related to online banking

Challenged youth:
• Average age of 34
• Low-income neighborhoods with low-level of education
• Interested in music
• Navigational sessions

Liberal females:
• Mostly female from areas that voted Democratic
• Shopping queries
• Long sessions (browsing and comparison)

White conservatives:
• Mostly male from areas that voted Republican
• Interested in automotive, business, home & garden

Older users: Health / diseases & conditions, gambling, travel
People in their late 20s: Health / fitness, reproductive health
Younger people: Games, education
Low income: Music, comics & animation, military
Asian descent: Computers & internet, programming & development

Some topics typically receive few clicks:
• News & media, society & culture, computers & internet People are more likely to click on suggestions for some topics
• Health, science, arts

People with higher educational levels:
• Tend to have shorter sessions
• Click on query suggestions less often
• Are more likely to submit tail queries

Segmenting search logs into sessions

通常，我们会把 search log 划分为一个个基于 information need 的 dialogue
dialogue 形式，实际就是用户发出初始查询，搜索引擎给出结果，用户不满意，重新修改查询语句再次搜索，然后得到新的结果，不断循环的过程。

Person: query
Engine: search results
Person: reformulated query
Engine: new search results ...

那么问题来了，怎么来区分这些不同的 dialogues 呢？
我们定义 information need = a search session(dialogue)，最简单粗暴划分 session 的方法就是通过时间，比如说三十分钟内一系列的用户行为就算作一个 session。

划分 session 可以考虑的因素：

Time: Same session iff |timestamp (q2) – timestamp| (q1) < ∆
often ∆ = 30 minutes
Common term: Same session iff q1 ∩ q2 ≠ Ø
Probably high Precision, low Recall
Rewrite classes: Common reformulation patterns
E.g., term added, deleted, or replaced
Probably high Precision, low Recall
Edit distance
Co-occurrence
e.g., PMI, Chi-square of queries in a query log
Same clicks
queries have co-occurring clicks in a query log
Document categories
ODP or Yahoo page category overlap of top 10 results
Same/overlap retrieved documents
JSD similarity of top 10 results
cosine distance among results
Classifiers
…

训练一个分类器是最好的方法，对任意一对 queries 大概有 95-97% 的准确率。然而 heuristics 也不差，尤其是 edit distance 和 cosine distance among results 都非常有效，时间尺度通常会和其他 heuristic 合用，效果更佳。

各种 heuristic 都可以作为分类器的 Features

Temporal
– ≤ {5, 30, 60, 120} minutes, ∆ time, are_sequential
Edit distance
– Several character and token-based metrics
Query log
– Various types of (q1, q2) co-occurrence in a larger query log
Web search
– Cosine distance of top 50 search results for each query
(“prisma”)

我们训练分类器一般需要完成两个任务

Boundary task: Given a pair of sequential queries
Same task: Given a pair of queries

Challenges

用户的信息需求可能是跨越了几天甚至是几周的，比如用户写论文、找学校信息等，可能出现的情况是：

用户有交叉的任务
如查 paper 的时候顺便查下中午吃什么
用户把一个任务分成若干个子任务分阶段完成
比如今天查为什么，明天查怎么做
子任务可能看起来是独立的实际确实相关的