Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

solr - SolrCloud Deduplication Overwrite isn't working

I've been struggling to get Deduplication to work in SolrCloud (version 8.6). My solrconfig.xml contains:

<updateRequestProcessorChain name="dedupeOn">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">dedupeId</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">journal_doi,internal_pmid</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

and

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">dedupeOn</str>
  </lst>
</requestHandler>

My managed-schema contains:

<field name="dedupeId" type="string" indexed="true" stored="true" multiValued="false" />
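
One way to rule out the handler default not being applied is to name the chain on the request itself. The sketch below only builds such an update URL; the base URL and the collection name `articles` are hypothetical placeholders, while `update.chain` and `commit` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_update_url(base_url, collection, chain="dedupeOn", commit=True):
    """Build an /update URL that names the update chain explicitly."""
    params = {
        "update.chain": chain,           # chain name from solrconfig.xml
        "commit": str(commit).lower(),   # ask Solr to commit after the update
    }
    return f"{base_url.rstrip('/')}/{collection}/update?{urlencode(params)}"


# Hypothetical host and collection name, for illustration only.
url = build_update_url("http://localhost:8983/solr", "articles")
print(url)
```

Posting the duplicate documents to this URL and comparing the result with the default `/update` behaviour would show whether the chain selection makes a difference.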

In my test, I add 1000 documents and commit manually. I can see that the "dedupeId" field is populated with the hash.
I then add 10 more documents that I know are duplicates and commit manually again. These 10 documents are added as new entries, and the original documents with the matching dedupeId are not overwritten. For example:

  "response":{"numFound":2,"start":0,"maxScore":2.1554677,"numFoundExact":true,"docs":[
      {
        "internal_pmid":"13367837",
        "dedupeId":"7f0306ecd909a68e",
        "journal_doi":"10.1097/00005053-195603000-00006"},
      {
        "internal_pmid":"13367837",
        "dedupeId":"7f0306ecd909a68e",
        "journal_doi":"10.1097/00005053-195603000-00006"}]
  }}
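
To check a larger result set for the same symptom, the returned documents can be grouped by signature. This is a minimal sketch that assumes the `docs` list has already been parsed from a Solr JSON response; the helper name `find_duplicate_signatures` is made up for illustration.

```python
from collections import defaultdict


def find_duplicate_signatures(docs, signature_field="dedupeId"):
    """Return {signature: [docs]} for signatures shared by more than one doc."""
    groups = defaultdict(list)
    for doc in docs:
        signature = doc.get(signature_field)
        if signature is not None:
            groups[signature].append(doc)
    return {sig: group for sig, group in groups.items() if len(group) > 1}


# The two documents from the response shown above.
docs = [
    {"internal_pmid": "13367837", "dedupeId": "7f0306ecd909a68e",
     "journal_doi": "10.1097/00005053-195603000-00006"},
    {"internal_pmid": "13367837", "dedupeId": "7f0306ecd909a68e",
     "journal_doi": "10.1097/00005053-195603000-00006"},
]

duplicates = find_duplicate_signatures(docs)
for signature, group in duplicates.items():
    print(signature, len(group))
```

If overwriting worked as expected, the returned dictionary would be empty.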

I'm not sure if it's significant, but in the Solr logs I see some "add" entries that contain, in part:

webapp=/solr path=/update params={update.distrib=TOLEADER&update.chain=dedupeOn&distrib.from=*(shard path)*/&wt=javabin&version=2}{add=[00001hLxMb (1690871781072568320)]} 0 2

Other "add" entries do not contain the update.chain parameter, e.g.:

webapp=/solr path=/update params={wt=javabin&version=2}{add=[00000sta0n (1690871780667817984)]} 0 2
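
To see how many "add" entries carry the chain, the `params={...}` part of each log line can be parsed. This is a rough sketch that assumes the log format matches the two lines quoted above; the function names are hypothetical, and the elided shard path is kept exactly as it appears.

```python
import re
from urllib.parse import parse_qs

# Captures the text between "params={" and the first closing brace.
PARAMS_RE = re.compile(r"params=\{(.*?)\}")


def request_params(log_line):
    """Return the request parameters of a Solr log line as a flat dict."""
    match = PARAMS_RE.search(log_line)
    if match is None:
        return {}
    parsed = parse_qs(match.group(1), keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}


def update_chain(log_line):
    """Return the update.chain value of a log line, or None if absent."""
    return request_params(log_line).get("update.chain")


with_chain = ("webapp=/solr path=/update params={update.distrib=TOLEADER"
              "&update.chain=dedupeOn&distrib.from=*(shard path)*/"
              "&wt=javabin&version=2}{add=[00001hLxMb (1690871781072568320)]} 0 2")
without_chain = ("webapp=/solr path=/update params={wt=javabin&version=2}"
                 "{add=[00000sta0n (1690871780667817984)]} 0 2")

print(update_chain(with_chain))
print(update_chain(without_chain))
```

Counting the lines for which `update_chain` returns `None` gives a quick view of how many requests were logged without the chain parameter.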

Any help would be greatly appreciated.

Question from: https://stackoverflow.com/questions/66067082/solrcloud-deduplication-overwrite-isnt-working


