Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
313 views
in Technique[技术] by (71.8m points)

tree - Python and NLTK: How to analyze sentence grammar?

I have this code which should show the syntactic structure of the sentence according to defined grammar. However it is returning an empty []. What am I missing or doing wrong?

import nltk

grammar = nltk.parse_cfg("""
S -> NP VP 
PP -> P NP
NP -> Det N | Det N PP 
VP -> V NP | VP PP
N -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' |'cheered'
P -> 'or' | 'and'
""")

def main():
    sent = "Kim arrived or Dana left and everyone cheered".split()
    parser = nltk.ChartParser(grammar)
    trees = parser.nbest_parse(sent)
    for tree in trees:
        print tree

if __name__ == '__main__':
    main()
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Let's do some reverse engineering:

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... NP -> Det N | Det N PP
... N -> 'Kim' | 'Dana' | 'everyone'
... """)
>>> sent = "Kim".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]

Seems like the rules can't recognize even the first work as NP. So let's try injecting NP -> N

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... NP -> Det N | Det N PP | N
... N -> 'Kim' | 'Dana' | 'everyone'
... """)
>>> sent = "Kim".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[Tree('NP', [Tree('N', ['Kim'])])]

So now it's working, let's continue Kim arrived or Dana and:

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | VP PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> 
>>> sent = "Kim arrived or".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]

Seem like there is no way to get the VP with or without the P, since V requires either an NP after, or it has to go up the tree to be a VP before taking a P, so it's relax the rules and say VP -> V PP instead of VP -> VP PP:

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | V PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived or Dana".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[Tree('S', [Tree('NP', [Tree('N', ['Kim'])]), Tree('VP', [Tree('V', ['arrived']), Tree('PP', [Tree('P', ['or']), Tree('NP', [Tree('N', ['Dana'])])])])])]

Okay, we are getting closer, but seems like the next word broke the cfg rules again:

>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | V PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived or Dana left".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> sent = "Kim arrived or Dana left and".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> 
>>> sent = "Kim arrived or Dana left and everyone".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> 
>>> sent = "Kim arrived or Dana left and everyone cheered".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]

So I hope the above example shows you that trying to change the rules to incorporate language phenomenon from left to right is hard.

Instead of doing it from left to right, and achieve

[[[[[[[[Kim] arrived] or] Dana] left] and] everyone] cheered]

why don't you try to make more linguistically sound rules to achieve:

  1. [[[Kim arrived] or [Dana left]] and [everyone cheered]]
  2. [[Kim arrived] or [[Dana left] and [everyone cheered]]]

Try this instead:

import nltk
grammar = nltk.parse_cfg("""
S -> CP | VP 
CP -> VP C VP | CP C VP | VP C CP
VP -> NP V 
NP -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' |'cheered'
C -> 'or' | 'and'
""")

print "======= Kim arrived ========="
sent = "Kim arrived".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
    print t

print "
======= Kim arrived or Dana left ========="
sent = "Kim arrived or Dana left".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
    print t 

print "
=== Kim arrived or Dana left and everyone cheered ===="
sent = "Kim arrived or Dana left and everyone cheered".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
    print t

[out]:

======= Kim arrived =========
(S (VP (NP Kim) (V arrived)))

======= Kim arrived or Dana left =========
(S (CP (VP (NP Kim) (V arrived)) (C or) (VP (NP Dana) (V left))))

=== Kim arrived or Dana left and everyone cheered ====
(S
  (CP
    (CP (VP (NP Kim) (V arrived)) (C or) (VP (NP Dana) (V left)))
    (C and)
    (VP (NP everyone) (V cheered))))
(S
  (CP
    (VP (NP Kim) (V arrived))
    (C or)
    (CP
      (VP (NP Dana) (V left))
      (C and)
      (VP (NP everyone) (V cheered)))))

The above solution show how your CFG rules needs to be robust enough to not only capture the full sentence but also part of the sentence too.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...