Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


nlp - Custom sentence segmentation using Spacy

I am new to spaCy and NLP, and I have run into the issue below while doing sentence segmentation with spaCy.

The text I am trying to split into sentences contains numbered lists (with a space between the numbering and the actual text), as shown below.

import spacy
nlp = spacy.load('en_core_web_sm')
text = "This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!"
text_sentences = nlp(text)
for sentence in text_sentences.sents:
    print(sentence.text)

Output (1., 2. and 3. are treated as separate sentences):

This is first sentence.
  
Next is numbered list.
    
1.
Hello World!
 
2.
Hello World2!
  
3.
Hello World!

But if there is no space between the numbering and the actual text, sentence segmentation works fine, as below:

import spacy
nlp = spacy.load('en_core_web_sm')
text = "This is first sentence.\nNext is numbered list.\n1.Hello World!\n2.Hello World2!\n3.Hello World!"
text_sentences = nlp(text)
for sentence in text_sentences.sents:
    print(sentence.text)

Output (desired) is:

This is first sentence.
    
Next is numbered list.
   
1.Hello World!
    
2.Hello World2!
    
3.Hello World!

Please suggest whether we can customise the sentence detector to do this.


1 Reply


When you use a pretrained model with spaCy, sentences are split by the dependency parser, so the boundaries reflect the data the model was trained on.

Of course, there are cases like yours where you may want custom sentence segmentation logic. This is possible by adding a custom component to the spaCy pipeline.

For your case, you can add a rule that prevents a sentence split right after a {number}. pattern.
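
To make the rule concrete, here is a minimal sketch in plain Python that works on a list of token strings rather than a spaCy Doc. The function name tokens_to_keep_joined is hypothetical, and the sample token list assumes the tokenizer splits "1." into "1" and ".".

```python
import re

# Matches a token made up only of digits, e.g. "1" or "12".
NUMBER = re.compile(r"^[0-9]+$")

def tokens_to_keep_joined(tokens):
    """Return the indexes of tokens that must NOT start a new sentence.

    A token is flagged when it directly follows a "<number>" "." pair.
    """
    flagged = []
    # Start at 1 so there is a previous token, stop early so there is a next one.
    for i in range(1, len(tokens) - 1):
        if tokens[i] == "." and NUMBER.match(tokens[i - 1]):
            flagged.append(i + 1)
    return flagged

tokens = ["1", ".", "Hello", "World", "!", "2", ".", "Hello", "World2", "!"]
print(tokens_to_keep_joined(tokens))  # [2, 7]
```

Both flagged indexes point at "Hello", the first word of each list item. Those are the tokens a pipeline component would mark as not starting a sentence.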

A workaround for your problem:

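import re

import spacy
from spacy.language import Language

# Assumes spaCy v3. In v2, drop the decorator and call
# nlp.add_pipe(custom_seg, before='parser') instead.
nlp = spacy.load('en_core_web_sm')
boundary = re.compile('^[0-9]+$')  # a token made up only of digits

@Language.component('custom_seg')
def custom_seg(doc):
    # Look for a "<number>" token followed by a "." token.
    for index in range(1, len(doc) - 1):
        if doc[index].text == '.' and boundary.match(doc[index - 1].text):
            # The token after "1." must not start a new sentence.
            doc[index + 1].is_sent_start = False
    return doc

# Run the rule before the parser so the parser respects these boundaries.
nlp.add_pipe('custom_seg', before='parser')
text = 'This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!'
doc = nlp(text)
for sentence in doc.sents:
    print(sentence.text)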
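
If changing the pipeline is not an option, another approach is to normalise the text first. The question notes that segmentation is fine when there is no space after the numbering, so this sketch removes that space with a regular expression before the text reaches spaCy. The helper name tighten_list_numbers is hypothetical, and it assumes each list item starts on its own line and that altering the text is acceptable for your use case.

```python
import re

# At the start of a line: a number, a dot, then one or more spaces or tabs.
LIST_MARKER = re.compile(r"^(\d+\.)[ \t]+", flags=re.MULTILINE)

def tighten_list_numbers(text):
    """Remove the whitespace between an "N." list marker and the item text."""
    return LIST_MARKER.sub(r"\1", text)

text = "Next is numbered list.\n1. Hello World!\n2. Hello World2!"
print(tighten_list_numbers(text))
```

Because the pattern is anchored to the start of a line, a number that ends an ordinary sentence in the middle of a line is left alone. Keep in mind that character offsets in the cleaned text no longer match the original.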

Hope it helps!

