Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
728 views
in Technique[技术] by (71.8m points)

nlp - Stanford coreNLP - split words ignoring apostrophe

I'm trying to split a sentence into words using Stanford coreNLP . I'm having problem with words that contains apostrophe.

For example, the sentence: I'm 24 years old.

Splits like this: [I] ['m] [24] [years] [old]

Is it possible to split it like this using Stanford coreNLP?: [I'm] [24] [years] [old]

I've tried using tokenize.whitespace, but it doesn't split on other punctuation marks like: '?' and ','

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

How about if you just re-concatenate tokens that are split by an apostrophe?

Here's an implementation in Java:

public static List<String> tokenize(String s) {
    PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<CoreLabel>(
            new StringReader(s), new CoreLabelTokenFactory(), "");
    List<String> sentence = new ArrayList<String>();
    StringBuilder sb = new StringBuilder();
    for (CoreLabel label; ptbt.hasNext();) {
        label = ptbt.next();
        String word = label.word();
        if (word.startsWith("'")) {
            sb.append(word);
        } else {
            if (sb.length() > 0)
                sentence.add(sb.toString());
            sb = new StringBuilder();
            sb.append(word);
        }
    }
    if (sb.length() > 0)
        sentence.add(sb.toString());
    return sentence;
}

public static void main(String[] args) {
    System.out.println(tokenize("I'm 24 years old."));  // [I'm, 24, years, old, .]
}

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...