Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

named entity recognition - How do I use IOB tags with Stanford NER?

There seem to be a few different settings:

iobtags
iobTags
entitySubclassification (IOB1 or IOB2?)
evaluateIOB

Which setting do I use, and how do I use it correctly?

I tried labelling like this:

1997    B-DATE
volvo   B-BRAND
wia64t  B-MODEL
highway B-TYPE
tractor I-TYPE

But in the training output, it seemed to treat B-TYPE and I-TYPE as different classes.

I am using the 2013-11-12 release.


1 Reply


How to do this is currently (2013 releases) a bit of a mess, since there are two different sets of flags for two different DocumentReaderAndWriter implementations. Sorry.

The most flexible support for different IOB styles is found in CoNLLDocumentReaderAndWriter. While it is reading files, it can map any IOB/IOE/... annotation that uses hyphenated prefixes, as in your examples (B-BRAND), to any other such scheme. You select the target scheme with the flag:

-entitySubclassification IOB2

The resulting label set is then used for training and classification. The options are documented in the entitySubclassify() method of CoNLLDocumentReaderAndWriter: IOB1, IOB2, IOE1, IOE2, SBIEO, IO. You can find a discussion of IOB1 vs. IOB2 in Tjong Kim Sang and Veenstra 1999. By default the representation is mapped back to IOB1 on output, since that is the default used in the CoNLL conlleval program, but you can keep it as what you mapped it to with the flag:

-retainEntitySubclassification
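To make the difference between the two encodings concrete, here is a minimal Python sketch of the IOB2-to-IOB1 direction, i.e. the kind of mapping described above for output. It is not Stanford NER code: the function name iob2_to_iob1 is made up for illustration, and it assumes every label is either "O" or carries a "B-" or "I-" prefix.

```python
def iob2_to_iob1(labels):
    """Re-encode IOB2 labels as IOB1 (illustrative helper, not Stanford NER code).

    In IOB1, "B-" is only kept where an entity directly follows another
    entity of the same type; every other entity token gets "I-".
    """
    converted = []
    previous_entity = None  # entity type of the previous token, None after "O"
    for label in labels:
        if label == "O":
            converted.append("O")
            previous_entity = None
            continue
        prefix, _, entity = label.partition("-")
        if prefix == "B" and entity == previous_entity:
            # Boundary between two adjacent entities of the same type.
            converted.append("B-" + entity)
        else:
            converted.append("I-" + entity)
        previous_entity = entity
    return converted


if __name__ == "__main__":
    # The labels from the question, which are already in IOB2 form.
    question_labels = ["B-DATE", "B-BRAND", "B-MODEL", "B-TYPE", "I-TYPE"]
    print(iob2_to_iob1(question_labels))
```

For the labels in the question, no entity directly follows another of the same type, so every "B-" becomes "I-" in the IOB1 form.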

To use this DocumentReaderAndWriter, you can give a training command like:

java -mx6g edu.stanford.nlp.ie.crf.CRFClassifier -prop conll.crf.chris2009.prop -readerAndWriter edu.stanford.nlp.sequences.CoNLLDocumentReaderAndWriter -entitySubclassification iob2

Alternatively, ColumnDocumentReaderAndWriter is the default DocumentReaderAndWriter which we use in the distributed models. The options you get with it are different and slightly more limited. You have these two flags:

  • -mergeTags takes either plain ("BRAND") or CoNLL-like ("I-BRAND") labels, maps them down to a prefix-less IO label ("BRAND"), and uses that for training and classifying.
  • -iobTags takes either plain ("BRAND") or CoNLL-like ("I-BRAND") labels and maps them to IOB2.
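The effect of these two mappings on a label sequence can be sketched roughly as follows. This is an illustration in Python, not the Stanford implementation: the function names merge_tags and iob_tags merely echo the flag names, the background label is assumed to be "O", and the real reader may treat edge cases differently.

```python
def strip_prefix(label):
    """Return the entity type without a leading "B-" or "I-"."""
    return label[2:] if label[:2] in ("B-", "I-") else label


def merge_tags(labels):
    """Map plain or prefixed labels down to prefix-less IO labels."""
    return [strip_prefix(label) for label in labels]


def iob_tags(labels, background="O"):
    """Re-encode plain or prefixed labels as IOB2."""
    encoded = []
    previous = background  # entity type of the previous token
    for label in labels:
        entity = strip_prefix(label)
        if entity == background:
            encoded.append(background)
        elif label.startswith("B-") or entity != previous:
            # An explicit B- or a change of type starts a new entity.
            encoded.append("B-" + entity)
        else:
            encoded.append("I-" + entity)
        previous = entity
    return encoded


if __name__ == "__main__":
    plain = ["DATE", "BRAND", "MODEL", "TYPE", "TYPE"]
    print(merge_tags(["B-TYPE", "I-TYPE", "O"]))
    print(iob_tags(plain))
```

One consequence of starting from plain labels: two adjacent entities of the same type cannot be told apart, so this sketch treats a run of identical plain labels as a single entity.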

In a sequence model, under any labeling scheme such as IOB2, the labels are different classes, so B-TYPE and I-TYPE are two separate labels as far as the classifier is concerned. That is how these labeling schemes work. The special interpretation of "I-", "B-", etc. is left to the human observer and entity-level evaluation software. The included evaluation software will work with IOB1, IOB2, or prefix-less IO encoding only.
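As a rough illustration of that last point, the following Python sketch groups IOB2 labels back into entities, which is the kind of interpretation an entity-level evaluator applies after classification. The function entity_spans is hypothetical and not part of Stanford NER; it assumes IOB2 input and treats a stray "I-" label as the start of a new entity.

```python
def entity_spans(tokens, labels):
    """Collect (entity_type, text) pairs from IOB2-labelled tokens."""
    spans = []
    current_type = None
    current_tokens = []
    for token, label in zip(tokens, labels):
        prefix, _, entity = label.partition("-")
        continues = prefix == "I" and entity == current_type
        if not continues:
            # Close the entity in progress, if any, before starting afresh.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            if prefix in ("B", "I"):
                current_type, current_tokens = entity, []
            else:
                current_type, current_tokens = None, []
        if current_type is not None:
            current_tokens.append(token)
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


if __name__ == "__main__":
    tokens = ["1997", "volvo", "wia64t", "highway", "tractor"]
    labels = ["B-DATE", "B-BRAND", "B-MODEL", "B-TYPE", "I-TYPE"]
    print(sorted(set(labels)))           # the classes the model sees
    print(entity_spans(tokens, labels))  # the entities an evaluator sees
```

With the data from the question, the model is trained on five distinct labels, while the grouped view contains four entities, with "highway tractor" as a single TYPE entity.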

