Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


lucene - Solr language detection does not work with PDF

I have configured Solr 7.7.3 to detect English and Japanese documents. It works normally with text-based files such as docx and xlsx, but when I convert them to PDF, Solr either fails to detect the language or sometimes outputs the wrong one. (I used Microsoft Office 2019 to convert docx to PDF.)

I also tried the Tika, LangDetect, and OpenNLP methods from this page: https://lucene.apache.org/solr/guide/7_7/detecting-languages-during-indexing.html

Please help me. Thank you so much!!!

solrconfig.xml

  <updateRequestProcessorChain name="langid">
    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <str name="langid.fl">_text_</str>
      <str name="langid.langField">language</str>
      <str name="langid.langsField">languages</str>
      <str name="langid.fallback">fr</str>
      <str name="langid.threshold">0.7</str>
      <str name="langid.model">langdetect-183.bin</str>
      <str name="langid.whitelist">en-US,en-GB,en,ja</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="invariants">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">_text_</str>
      <str name="update.chain">langid</str>
    </lst>
  </requestHandler>

managed-schema

  <field name="language" type="string" indexed="true" stored="true"/>
  <field name="languages" type="string" multiValued="true" indexed="true" stored="true"/>

Log when I tried with file Test.docx

2021-01-27 07:25:57.177 DEBUG (qtp1571967156-58) [ x:doc_analyzer] o.a.s.u.p.LanguageIdentifierUpdateProcessor Language fallback to value fr
2021-01-27 07:25:57.178 DEBUG (qtp1571967156-58) [ x:doc_analyzer] o.a.s.u.p.LangDetectLanguageIdentifierUpdateProcessor Appending field text
2021-01-27 07:25:57.184 DEBUG (qtp1571967156-58) [ x:doc_analyzer] o.a.s.u.p.LanguageIdentifierUpdateProcessor Language detected ja with certainty 0.9999999780558492
2021-01-27 07:25:57.185 DEBUG (qtp1571967156-58) [ x:doc_analyzer] o.a.s.u.p.LanguageIdentifierUpdateProcessor Detected main document language from fields [text]: ja
2021-01-27 07:25:57.185 DEBUG (qtp1571967156-58) [ x:doc_analyzer] o.a.s.u.p.LogUpdateProcessorFactory PRE_UPDATE add{,id=Test.docx,commitWithin=1000} {langid.whitelist=en-US,en-GB,en,ja&update.chain=langid&df=text&commitWithin=1000&langid.langField=language&literal.id=Test.docx&fmap.meta=ignored_&lowernames=true&langid.model=langdetect-183.bin&langid.fallback=fr&langid.threshold=0.7&fmap.content=text&langid.langsField=languages&langid.fl=text&overwrite=true&wt=json}

Log when I tried with file Test.pdf

2021-01-27 07:30:56.643 DEBUG (qtp1571967156-19) [ x:doc_analyzer] o.a.s.u.p.LanguageIdentifierUpdateProcessor Language fallback to value fr
2021-01-27 07:30:56.643 DEBUG (qtp1571967156-19) [ x:doc_analyzer] o.a.s.u.p.LanguageIdentifierUpdateProcessor Language detected en-US with certainty 1.0
2021-01-27 07:30:56.643 DEBUG (qtp1571967156-19) [ x:doc_analyzer] o.a.s.u.p.LanguageIdentifierUpdateProcessor Field language already contained value en-US, not overwriting.
2021-01-27 07:30:56.644 DEBUG (qtp1571967156-19) [ x:doc_analyzer] o.a.s.u.p.LogUpdateProcessorFactory PRE_UPDATE add{,id=Test.pdf,commitWithin=1000} {langid.whitelist=en-US,en-GB,en,ja&update.chain=langid&df=text&commitWithin=1000&langid.langField=language&literal.id=Test.pdf&fmap.meta=ignored_&lowernames=true&langid.model=langdetect-183.bin&langid.fallback=fr&langid.threshold=0.7&fmap.content=text&langid.langsField=languages&langid.fl=text&overwrite=true&wt=json}
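
One possible reading of the Test.pdf log, offered as a guess rather than a confirmed diagnosis: the message "Field language already contained value en-US, not overwriting." suggests that the document already carried a `language` value before detection ran (perhaps from metadata extracted from the PDF), so the detector never examined the text. Solr's language identification parameters include `langid.overwrite` (default `false`), which controls whether an existing value in `langid.langField` is replaced. The sketch below is the same chain with only that parameter added; whether it resolves the problem for these particular files is untested.

```xml
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">_text_</str>
    <str name="langid.langField">language</str>
    <str name="langid.langsField">languages</str>
    <!-- Assumption: a "language" value is already present before detection
         (e.g. from PDF metadata); overwrite it with the detected language. -->
    <str name="langid.overwrite">true</str>
    <str name="langid.fallback">fr</str>
    <str name="langid.threshold">0.7</str>
    <str name="langid.model">langdetect-183.bin</str>
    <str name="langid.whitelist">en-US,en-GB,en,ja</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

Under the same assumption, an alternative would be to point `langid.langField` and `langid.langsField` at field names that extracted metadata does not use, so that a pre-existing `language` value cannot short-circuit detection.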

Test files: https://drive.google.com/drive/folders/1igD_XCEGsIm08shLShXJ7IMV4qFscdGh?usp=sharing

Question from: https://stackoverflow.com/questions/65914813/solr-language-detection-not-work-with-pdf


