Parse Tree Operations: Tregex and Stanford CoreNLP

There are very few tutorials online, so I had to work this out myself. This post covers how to call the Stanford Parser from Python. (Work in progress.)

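Tregex is used to recognize and manipulate sentence-level structures; put simply, it is regex for trees. For some of the syntax, see The Wonderful World of Tregex. Calling the API from Java is a bit simpler, but my project needs Python, so this post covers how to call it from Python.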

Stanford CoreNLP

The Stanford NLP tools also come with a server, which is a real boon for Python users.
Download and install the CoreNLP Server.

First, run a quick test:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt

OK, that runs. Next, start the server:

# Run the server using all jars in the current directory (e.g., the CoreNLP home directory)
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

After that, the API can be called by sending HTTP requests from Python. Python code:

import requests
url = "http://localhost:9000/tregex"
request_params = {"pattern": "(NP[$VP]>S)|(NP[$VP]>S\\n)|(NP\\n[$VP]>S)|(NP\\n[$VP]>S\\n)"}
text = "Pusheen and Smitha walked along the beach."
r = requests.post(url, data=text, params=request_params)
print(r.json())

Result:

{u'sentences': [{u'0': {u'namedNodes': [], u'match': u'(NP (NNP Pusheen)\n (CC and)\n (NNP Smitha))\n'}}]}
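
The response nests each match under its sentence, keyed by a string index. A minimal sketch for flattening it is below. The helper name `collect_matches` is hypothetical, and it assumes that further matches in a sentence would appear under the keys "1", "2", and so on (only "0" is visible in the result above). It is written for Python 3 and needs no server, since it works on a copy of that result.

```python
def collect_matches(response):
    """Return (sentence_index, match_index, bracketed_tree) for every match."""
    results = []
    for sent_idx, sentence in enumerate(response.get("sentences", [])):
        # Assumption: each sentence is a dict keyed by the match index as a string.
        for match_idx in sorted(sentence, key=int):
            tree = sentence[match_idx]["match"].strip()
            results.append((sent_idx, int(match_idx), tree))
    return results


# Sample copied from the result shown above.
sample = {
    "sentences": [
        {"0": {"namedNodes": [],
               "match": "(NP (NNP Pusheen)\n (CC and)\n (NNP Smitha))\n"}}
    ]
}

for item in collect_matches(sample):
    print(item)
```

With a live server, `collect_matches(r.json())` would take the place of the sample dictionary.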

Basic Tregex syntax

To be filled in later.

Examples

This assumes that NLTK and the various Stanford NLP packages are already installed and that the paths are set up.

Parse Tree from NLTK

from __future__ import division, unicode_literals
import nltk
from nltk.parse.stanford import StanfordParser

parser = StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")

def getParserTree(line):
    '''
    Return the parse trees of the string.
    :param line: string
    :return: list of nltk.Tree objects
    '''
    return list(parser.raw_parse(line))

# get parse tree
text = 'Harry Potter, a young boy, is very famous in US'
testTree = getParserTree(text)
print(testTree)

Output:

[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('NNP', ['Harry']), Tree('NNP', ['Potter'])]), Tree(',', [',']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['young']), Tree('NN', ['boy'])]), Tree(',', [','])]), Tree('VP', [Tree('VBZ', ['is']), Tree('ADJP', [Tree('RB', ['very']), Tree('JJ', ['famous']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('NNP', ['US'])])])])])])])]

Parse Tree from CoreNLP Server

import requests

# More annotators can be requested, e.g. "tokenize, ssplit, pos, lemma, ner, parse, dcoref"
url = 'http://localhost:9000/?properties={"annotators": "parse", "outputFormat": "text"}'
text = 'Harry Potter, a young boy, is very famous in US'
r = requests.post(url, data=text)
print(r.text)

Output:

Sentence #1 (12 tokens):
Harry Potter, a young boy, is very famous in US
[Text=Harry CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NNP]
[Text=Potter CharacterOffsetBegin=6 CharacterOffsetEnd=12 PartOfSpeech=NNP]
[Text=, CharacterOffsetBegin=12 CharacterOffsetEnd=13 PartOfSpeech=,]
[Text=a CharacterOffsetBegin=14 CharacterOffsetEnd=15 PartOfSpeech=DT]
[Text=young CharacterOffsetBegin=16 CharacterOffsetEnd=21 PartOfSpeech=JJ]
[Text=boy CharacterOffsetBegin=22 CharacterOffsetEnd=25 PartOfSpeech=NN]
[Text=, CharacterOffsetBegin=25 CharacterOffsetEnd=26 PartOfSpeech=,]
[Text=is CharacterOffsetBegin=27 CharacterOffsetEnd=29 PartOfSpeech=VBZ]
[Text=very CharacterOffsetBegin=30 CharacterOffsetEnd=34 PartOfSpeech=RB]
[Text=famous CharacterOffsetBegin=35 CharacterOffsetEnd=41 PartOfSpeech=JJ]
[Text=in CharacterOffsetBegin=42 CharacterOffsetEnd=44 PartOfSpeech=IN]
[Text=US CharacterOffsetBegin=45 CharacterOffsetEnd=47 PartOfSpeech=NNP]
(ROOT
  (S
    (NP
      (NP (NNP Harry) (NNP Potter))
      (, ,)
      (NP (DT a) (JJ young) (NN boy))
      (, ,))
    (VP (VBZ is)
      (ADJP (RB very) (JJ famous)
        (PP (IN in)
          (NP (NNP US)))))))
root(ROOT-0, famous-10)
compound(Potter-2, Harry-1)
nsubj(famous-10, Potter-2)
punct(Potter-2, ,-3)
det(boy-6, a-4)
amod(boy-6, young-5)
appos(Potter-2, boy-6)
punct(Potter-2, ,-7)
cop(famous-10, is-8)
advmod(famous-10, very-9)
case(US-12, in-11)
nmod:in(famous-10, US-12)

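More annotators can be added to the annotators property to get extra information such as NER tags and lemmas, and the output format can be set to other formats such as JSON or XML.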
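
As a rough sketch of how the properties could be built programmatically instead of being typed into the URL by hand: the helper name `corenlp_url` is hypothetical, the annotator list is taken from the command-line test earlier in this post, and `json` is assumed as the output format. It uses only the Python 3 standard library and just builds the URL, so it runs without a server.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs


def corenlp_url(annotators, output_format="json", base="http://localhost:9000/"):
    """Build a CoreNLP server URL with the properties JSON safely URL-encoded."""
    props = {"annotators": ",".join(annotators), "outputFormat": output_format}
    return base + "?" + urlencode({"properties": json.dumps(props)})


url = corenlp_url(["tokenize", "ssplit", "pos", "lemma", "ner", "parse"])
print(url)

# Decode the query string again to check what the server would receive.
print(parse_qs(urlsplit(url).query)["properties"][0])
```

The resulting `url` can be passed to `requests.post(url, data=text)` in the same way as the hand-written URL; with the JSON output format, `r.json()` should then give a dictionary instead of plain text.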

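Appositions

To extract the appositive part of a parse tree, use the rule below. The first NP is the parent; the two NPs that follow are sisters under it, separated by a comma. This is the basic form of an apposition.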

NP=n1 < (NP=n2 $.. (/,/ $.. NP=n3))
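
Reading the rule operator by operator: `A < B` means A immediately dominates B, `A $.. B` means A is a sister of B and precedes it, `/,/` matches any node whose label contains a comma, and `=n1` gives the matched node a name. To make that concrete, here is a hand-written approximation of the same structural check in plain Python 3. It is not Tregex and not how the server evaluates patterns; the names `parse_bracketed`, `leaves`, and `find_appositions` are hypothetical, and the bracketed tree is the server parse shown earlier.

```python
import re

PARSE = ("(ROOT (S (NP (NP (NNP Harry) (NNP Potter)) (, ,) "
         "(NP (DT a) (JJ young) (NN boy)) (, ,)) "
         "(VP (VBZ is) (ADJP (RB very) (JJ famous) (PP (IN in) (NP (NNP US)))))))")


def parse_bracketed(text):
    """Parse a bracketed tree into nested (label, children) tuples; leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                          # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                          # skip ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Collect the words under a node, left to right."""
    words = []
    for child in node[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def find_appositions(node):
    """Yield (n1, n2, n3): an NP whose children include NP ... comma ... NP in that order."""
    label, children = node
    subtrees = [c for c in children if isinstance(c, tuple)]
    if label == "NP":
        for i, n2 in enumerate(subtrees):
            if n2[0] != "NP":
                continue
            for j in range(i + 1, len(subtrees)):
                if "," not in subtrees[j][0]:     # mirrors /,/ on the node label
                    continue
                for n3 in subtrees[j + 1:]:
                    if n3[0] == "NP":
                        yield node, n2, n3
    for child in subtrees:
        yield from find_appositions(child)


tree = parse_bracketed(PARSE)
for n1, n2, n3 in find_appositions(tree):
    print(" ".join(leaves(n2)), "|", " ".join(leaves(n3)))
# prints: Harry Potter | a young boy
```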

Code:

from __future__ import division, unicode_literals
import nltk
from nltk.parse.stanford import StanfordParser
import requests

APPOSITION = "NP=n1 < (NP=n2 $.. (/,/ $.. NP=n3))"

def getAppositions(text):
    # The request body is the raw sentence; the server parses it before matching.
    url = "http://localhost:9000/tregex"
    request_params = {"pattern": APPOSITION}
    r = requests.post(url, data=text, params=request_params)
    return r.json()

text = 'Harry Potter, a young boy, is very famous in US'
print(getAppositions(text))

Output:

{u'sentences': [{u'0': {u'namedNodes': [{u'n1': u'(NP\n (NP (NNP Harry) (NNP Potter))\n (, ,)\n (NP (DT a) (JJ young) (NN boy))\n (, ,))\n'}, {u'n2': u'(NP (NNP Harry) (NNP Potter))\n'}, {u'n3': u'(NP (DT a) (JJ young) (NN boy))\n'}], u'match': u'(NP\n (NP (NNP Harry) (NNP Potter))\n (, ,)\n (NP (DT a) (JJ young) (NN boy))\n (, ,))\n'}}]}

Processing the result a step further:

def getAppositions(text):
    url = "http://localhost:9000/tregex"
    request_params = {"pattern": APPOSITION}
    r = requests.post(url, data=text, params=request_params)
    js = r.json()
    # Only the first match ('0') of the first sentence is inspected.
    sentences = js.get('sentences') or []
    if sentences and '0' in sentences[0] and 'namedNodes' in sentences[0]['0']:
        return sentences[0]['0']['namedNodes']
    return None

text = 'Harry Potter, a young boy, is very famous in US'
res = getAppositions(text)
if res:
    for c in res:
        print(c)

Output:

{u'n1': u'(NP\n (NP (NNP Harry) (NNP Potter))\n (, ,)\n (NP (DT a) (JJ young) (NN boy))\n (, ,))\n'}
{u'n2': u'(NP (NNP Harry) (NNP Potter))\n'}
{u'n3': u'(NP (DT a) (JJ young) (NN boy))\n'}
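
The named nodes come back as bracketed strings, so one more step is usually needed to get the plain phrases. The sketch below is one possible way to do that with a regular expression, assuming every leaf has the form `(TAG word)` as in the output above. The helper name `named_nodes_to_text` is hypothetical; the sample data is copied from that output, so the code runs under Python 3 without a server.

```python
import re


def named_nodes_to_text(named_nodes):
    """Map each node name (n1, n2, ...) to the words under it, joined by spaces."""
    phrases = {}
    for entry in named_nodes:             # each entry is a one-item dict
        for name, bracketed in entry.items():
            # A leaf looks like "(TAG word)"; capture the word before the ")".
            words = re.findall(r"\([^\s()]+ ([^\s()]+)\)", bracketed)
            phrases[name] = " ".join(words)
    return phrases


# Sample copied from the output shown above.
sample_nodes = [
    {"n1": "(NP\n (NP (NNP Harry) (NNP Potter))\n (, ,)\n"
           " (NP (DT a) (JJ young) (NN boy))\n (, ,))\n"},
    {"n2": "(NP (NNP Harry) (NNP Potter))\n"},
    {"n3": "(NP (DT a) (JJ young) (NN boy))\n"},
]

phrases = named_nodes_to_text(sample_nodes)
print(phrases["n2"], "=", phrases["n3"])
# prints: Harry Potter = a young boy
```

Here `n2` is the head noun phrase and `n3` is its appositive, so the pair can be read as "Harry Potter is a young boy".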
