Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python-2.7 - Get phonemes from any word in Python NLTK or other modules?

Python NLTK has cmudict, which returns the phonemes of words it recognizes.

For example, 'see' -> [u'S', u'IY1'], but for words it does not recognize it raises a KeyError.

For example, 'seasee' -> error.

import nltk

arpabet = nltk.corpus.cmudict.dict()

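for word in ('s', 'see', 'sea', 'compute', 'comput', 'seesea'):
    try:
        # Print the first listed pronunciation of the word
        print arpabet[word][0]
    except KeyError as e:
        # Unknown words raise KeyError; printing it shows the quoted word
        print e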

#Output
[u'EH1', u'S']
[u'S', u'IY1']
[u'S', u'IY1']
[u'K', u'AH0', u'M', u'P', u'Y', u'UW1', u'T']
'comput'
'seesea'
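
The quoted words at the end of the output are the messages of the KeyError that a plain dict lookup raises for a missing key. As a minimal sketch, a lookup that returns None instead of raising could look like the following. The small `toy_arpabet` dictionary and the `phonemes_or_none` helper are made-up names for illustration, so the snippet runs without downloading cmudict; the result of `nltk.corpus.cmudict.dict()` could be passed in its place.

```python
# Hypothetical stand-in for nltk.corpus.cmudict.dict():
# each word maps to a list of pronunciations (lists of phonemes).
toy_arpabet = {
    "see": [["S", "IY1"]],
    "sea": [["S", "IY1"]],
}


def phonemes_or_none(word, pronunciations):
    """Return the first listed pronunciation of `word`, or None if it is unknown."""
    entries = pronunciations.get(word.lower())
    return entries[0] if entries else None


if __name__ == "__main__":
    for word in ("see", "seesea"):
        print(word, phonemes_or_none(word, toy_arpabet))
```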

Is there any module that doesn't have that limitation and can find or guess the phonemes of any real or made-up word?

If there is none, is there any way I can program it myself?

I am thinking about looping over increasingly long portions of the word.

For example, for 'seasee', the first iteration takes 's', the next takes 'se', the third takes 'sea', and so on, looking each one up in cmudict.

The problem is that I don't know how to tell which match is the right one to use.

For example, both 's' and 'sea' in 'seasee' yield valid phonemes.
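
One possible way to settle which prefix to use, offered only as a sketch, is to always take the longest prefix the dictionary knows and then continue with the rest of the word. The `toy_arpabet` dictionary and the `greedy_phonemes` function below are hypothetical names standing in for cmudict and for whatever helper you end up writing. A greedy choice never backtracks, so it can return None for a word that a different split would have covered.

```python
# Hypothetical stand-in for nltk.corpus.cmudict.dict().
toy_arpabet = {
    "s": [["EH1", "S"]],
    "see": [["S", "IY1"]],
    "sea": [["S", "IY1"]],
}


def greedy_phonemes(word, pronunciations):
    """Split `word` into the longest known prefixes, left to right.

    Returns a flat list of phonemes (first pronunciation of each piece),
    or None when some remainder has no known prefix at all.
    """
    word = word.lower()
    phones = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it one letter at a time.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in pronunciations:
                phones.extend(pronunciations[piece][0])
                start = end
                break
        else:
            # No prefix of the remainder is in the dictionary.
            return None
    return phones


if __name__ == "__main__":
    print(greedy_phonemes("seasee", toy_arpabet))
```

With this toy dictionary, 'seasee' is split into 'sea' + 'see' rather than starting from the shorter match 's'.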

Work in progress:

import nltk

arpabet = nltk.corpus.cmudict.dict()

for word in ('s', 'see', 'sea', 'compute', 'comput', 'seesea', 'darfasasawwa'):
    try:
        phone = arpabet[word][0]
    except KeyError:
        # Unknown word: look up each prefix, from shortest to longest
        for end in range(1, len(word) + 1):
            substring = word[:end]
            try:
                print substring, arpabet[substring][0]
            except KeyError as e:
                print e

#Output
c [u'S', u'IY1']
co [u'K', u'OW1']
com [u'K', u'AA1', u'M']
comp [u'K', u'AA1', u'M', u'P']
compu [u'K', u'AA1', u'M', u'P', u'Y', u'UW0']
comput 'comput'
s [u'EH1', u'S']
se [u'S', u'AW2', u'TH', u'IY1', u'S', u'T']
see [u'S', u'IY1']
sees [u'S', u'IY1', u'Z']
seese [u'S', u'IY1', u'Z']
seesea 'seesea'
d [u'D', u'IY1']
da [u'D', u'AA1']
dar [u'D', u'AA1', u'R']
darf 'darf'
darfa 'darfa'
darfas 'darfas'
darfasa 'darfasa'
darfasas 'darfasas'
darfasasa 'darfasasa'
darfasasaw 'darfasasaw'
darfasasaww 'darfasasaww'
darfasasawwa 'darfasasawwa'
Asked by KubiK888 (translated from Stack Overflow)


1 Reply


I encountered the same issue and solved it by recursively partitioning unknown words (see wordbreak):

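import nltk
from functools import lru_cache
from itertools import product as iterprod

# Load the CMU Pronouncing Dictionary, downloading it on first use.
try:
    arpabet = nltk.corpus.cmudict.dict()
except LookupError:
    nltk.download('cmudict')
    arpabet = nltk.corpus.cmudict.dict()

@lru_cache()
def wordbreak(s):
    """Return a list of pronunciations for s, or None if no split into known words exists."""
    s = s.lower()
    if s in arpabet:
        return arpabet[s]
    middle = len(s) / 2
    # Try split points near the middle first; the "- x" term biases the order toward longer prefixes.
    partition = sorted(range(len(s)), key=lambda x: (x - middle) ** 2 - x)
    for i in partition:
        pre, suf = s[:i], s[i:]
        if pre in arpabet:
            rest = wordbreak(suf)
            if rest is not None:
                # Combine every pronunciation of the prefix with every one of the suffix.
                return [x + y for x, y in iterprod(arpabet[pre], rest)]
    return None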
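
To see the shape of the result without downloading cmudict, the sketch below runs the same recursion against a tiny hand-written dictionary. `toy_arpabet` and `wordbreak_with` are hypothetical names introduced for illustration; the only intended difference from the function above is that the dictionary is passed as an argument and the cache is left out.

```python
from itertools import product

# Hypothetical stand-in for cmudict; "read" has two pronunciations on purpose.
toy_arpabet = {
    "sea": [["S", "IY1"]],
    "see": [["S", "IY1"]],
    "read": [["R", "EH1", "D"], ["R", "IY1", "D"]],
}


def wordbreak_with(pronunciations, s):
    """Return every pronunciation of `s`, splitting unknown words recursively.

    Returns None when `s` cannot be covered by known words.
    """
    s = s.lower()
    if s in pronunciations:
        return pronunciations[s]
    middle = len(s) / 2
    # Try split points nearest the middle first, using the same sort key as above.
    for i in sorted(range(len(s)), key=lambda x: (x - middle) ** 2 - x):
        pre, suf = s[:i], s[i:]
        if pre in pronunciations:
            rest = wordbreak_with(pronunciations, suf)
            if rest is not None:
                # One combined entry per (prefix pronunciation, suffix pronunciation) pair.
                return [x + y for x, y in product(pronunciations[pre], rest)]
    return None


if __name__ == "__main__":
    print(wordbreak_with(toy_arpabet, "seasee"))
    print(wordbreak_with(toy_arpabet, "seeread"))
    print(wordbreak_with(toy_arpabet, "xyz"))
```

The return value is a list of pronunciations, so a piece with several dictionary entries multiplies the number of results; take element 0 if only one pronunciation is needed, and check for None before indexing.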
