Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
543 views
in Technique[技术] by (71.8m points)

mapreduce - Remove Duplicates from MongoDB

hi I have a ~5 million documents in mongodb (replication) each document 43 fields. how to remove duplicate document. I tryed

db.testkdd.ensureIndex({
        duration  : 1 , protocol_type  : 1 , service  : 1 ,
        flag  : 1 , src_bytes  : 1 , dst_bytes  : 1 ,
        land  : 1 , wrong_fragment  : 1 , urgent  : 1 ,
        hot  : 1 , num_failed_logins  : 1 , logged_in  : 1 ,
        num_compromised  : 1 , root_shell  : 1 , su_attempted  : 1 ,
        num_root  : 1 , num_file_creations  : 1 , num_shells  : 1 ,
        num_access_files  : 1 , num_outbound_cmds  : 1 , is_host_login  : 1 ,
        is_guest_login  : 1 , count  : 1 ,  srv_count  : 1 ,
        serror_rate  : 1 , srv_serror_rate  : 1 , rerror_rate  : 1 ,
        srv_rerror_rate  : 1 , same_srv_rate  : 1 , diff_srv_rate  : 1 ,
        srv_diff_host_rate  : 1 , dst_host_count  : 1 , dst_host_srv_count  : 1 ,
        dst_host_same_srv_rate  : 1 , dst_host_diff_srv_rate  : 1 ,
        dst_host_same_src_port_rate  : 1 ,  dst_host_srv_diff_host_rate  : 1 ,
        dst_host_serror_rate  : 1 , dst_host_srv_serror_rate  : 1 ,
        dst_host_rerror_rate  : 1 , dst_host_srv_rerror_rate  : 1 , lable  : 1 
    },
    {unique: true, dropDups: true}
)

run this code i get a error "errmsg" : "namespace name generated from index ..

{
    "ok" : 0,
    "errmsg" : "namespace name generated from index name "project.testkdd.$duration_1_protocol_type_1_service_1_flag_1_src_bytes_1_dst_bytes_1_land_1_wrong_fragment_1_urgent_1_hot_1_num_failed_logins_1_logged_in_1_num_compromised_1_root_shell_1_su_attempted_1_num_root_1_num_file_creations_1_num_shells_1_num_access_files_1_num_outbound_cmds_1_is_host_login_1_is_guest_login_1_count_1_srv_count_1_serror_rate_1_srv_serror_rate_1_rerror_rate_1_srv_rerror_rate_1_same_srv_rate_1_diff_srv_rate_1_srv_diff_host_rate_1_dst_host_count_1_dst_host_srv_count_1_dst_host_same_srv_rate_1_dst_host_diff_srv_rate_1_dst_host_same_src_port_rate_1_dst_host_srv_diff_host_rate_1_dst_host_serror_rate_1_dst_host_srv_serror_rate_1_dst_host_rerror_rate_1_dst_host_srv_rerror_rate_1_lable_1" is too long (127 byte max)",
    "code" : 67
}

how can solve the problem ?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The "dropDups" syntax for index creation has been "deprecated" as of MongoDB 2.6 and removed in MongoDB 3.0. It is not a very good idea in most cases to use this as the "removal" is arbitrary and any "duplicate" could be removed. Which means what gets "removed" may not be what you really want removed.

Anyhow, you are running into an "index length" error since the value of the index key here would be longer that is allowed. Generally speaking, you are not "meant" to index 43 fields in any normal application.

If you want to remove the "duplicates" from a collection then your best bet is to run an aggregation query to determine which documents contain "duplicate" data and then cycle through that list removing "all but one" of the already "unique" _id values from the target collection. This can be done with "Bulk" operations for maximum efficiency.

NOTE: I do find it hard to believe that your documents actually contain 43 "unique" fields. It is likely that "all you need" is to simply identify only those fields that make the document "unique" and then follow the process as outlined below:

var bulk = db.testkdd.initializeOrderedBulkOp(),
    count = 0;

// List "all" fields that make a document "unique" in the `_id`
// I am only listing some for example purposes to follow
db.testkdd.aggregate([
    { "$group": {
        "_id": {
           "duration" : "$duration",
          "protocol_type": "$protocol_type", 
          "service": "$service",
          "flag": "$flag"
        },
        "ids": { "$push": "$_id" },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } } }
],{ "allowDiskUse": true}).forEach(function(doc) {
    doc.ids.shift();     // remove first match
    bulk.find({ "_id": { "$in": doc.ids } }).remove();  // removes all $in list
    count++;

    // Execute 1 in 1000 and re-init
    if ( count % 1000 == 0 ) {
       bulk.execute();
       bulk = db.testkdd.initializeOrderedBulkOp();
    }
});

if ( count % 1000 != 0 ) 
    bulk.execute();

If you have a MongoDB version "lower" than 2.6 and don't have bulk operations then you can try with standard .remove() inside the loop as well. Also noting that .aggregate() will not return a cursor here and the looping must change to:

db.testkdd.aggregate([
   // pipeline as above
]).result.forEach(function(doc) {
    doc.ids.shift();  
    db.testkdd.remove({ "_id": { "$in": doc.ids } });
});

But do make sure to look at your documents closely and only include "just" the "unique" fields you expect to be part of the grouping _id. Otherwise you end up removing nothing at all, since there are no duplicates there.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...