Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
811 views
in Technique[技术] by (71.8m points)

nlp - Weka ignoring unlabeled data

I am working on an NLP classification project using Naive Bayes classifier in Weka. I intend to use semi-supervised machine learning, hence working with unlabeled data. When I test the model obtained from my labeled training data on an independent set of unlabeled test data, Weka ignores all the unlabeled instances. Can anybody please guide me how to solve this? Someone has already asked this question here before but there wasn't any appropriate solution provided. Here is a sample test file:

@relation referents
@attribute feature1      NUMERIC
@attribute feature2      NUMERIC
@attribute feature3      NUMERIC
@attribute feature4      NUMERIC
@attribute class{1 -1}
@data
1, 7, 1, 0, ?
1, 5, 1, 0, ?
-1, 1, 1, 0, ?
1, 1, 1, 1, ?
-1, 1, 1, 1, ?
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The problem is that when you specify a training set -t train.arff and a test set test.arff, the mode of operation is to calculate the performance of the model based on the test set. But you can't calculate a performance of any kind without knowing the actual class. Without the actual class, how will you know if your prediction if right or wrong?

I used the data you gave as train.arff and as test.arff with arbitrary class labels assigned by me. The relevant output lines are:

=== Error on training data ===

Correctly Classified Instances           4               80      %
Incorrectly Classified Instances         1               20      %
Kappa statistic                          0.6154
Mean absolute error                      0.2429
Root mean squared error                  0.4016
Relative absolute error                 50.0043 %
Root relative squared error             81.8358 %
Total Number of Instances                5     


=== Confusion Matrix ===

 a b   <-- classified as
 2 1 | a = 1
 0 2 | b = -1

and

=== Error on test data ===

Total Number of Instances                0     
Ignored Class Unknown Instances                  5     


=== Confusion Matrix ===

 a b   <-- classified as
 0 0 | a = 1
 0 0 | b = -1

Weka can give you those statistics for the training set, because it knows the actual class labels and the predicted ones (applying the model on the training set). For the test set, it can't get any information about the performance, because it doesn't know about the true class labels.

What you might want to do is:

java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t train.arff -T test.arff -p 1-4

which in my case would give you:

=== Predictions on test data ===

 inst#     actual  predicted error prediction (feature1,feature2,feature3,feature4)
     1        1:?        1:1       1 (1,7,1,0)
     2        1:?        1:1       1 (1,5,1,0)
     3        1:?       2:-1       0.786 (-1,1,1,0)
     4        1:?       2:-1       0.861 (1,1,1,1)
     5        1:?       2:-1       0.861 (-1,1,1,1)

So, you can get the predictions, but you can't get a performance, because you have unlabeled test data.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...