There is probably more than one way to do this, but I suggest using the FastVectorHighlighter, as it gives you access to position and offset data.
Indexing Requirements
To use this approach, the field you index must store term vectors, including positions and offsets, when the index is created:
final String fieldName = "body";
// a shorter version of the input data in the question, for testing:
final String content = "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY";
FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setStoreTermVectorOffsets(true);
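// assumes an IndexWriter named writer has already been opened on the index directory
Document doc = new Document();
doc.add(new Field(fieldName, content, fieldType));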
writer.addDocument(doc);
(This may significantly increase the size of your indexed data, if you are not already capturing term vectors.)
Library Requirements
The fast vector highlighter is part of the lucene-highlighter library:
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-highlighter</artifactId>
<version>8.9.0</version>
</dependency>
Search Example
Assume the following query:
final String searchTerm = "\"War Force\"~1";
We expect this to find War WORD1 Force from our test data, because the ~1 slop lets the two terms sit one position further apart than an exact phrase match.
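To see why a slop of 1 is enough here, this minimal sketch computes the slop needed for a two-term phrase whose terms appear in query order, using zero-based word positions. It is plain Java, not Lucene code. The class and method names are hypothetical, and it covers only the simple in-order case (Lucene's sloppy phrase matching also handles reordered terms, which need more slop).

```java
// Simplified illustration only: this is not how Lucene matches phrases internally.
class SlopSketch {

    // Slop needed for two query terms that appear in query order in the document.
    // Adjacent terms (positions p and p + 1) need a slop of 0.
    static int requiredSlop(int firstTermPosition, int secondTermPosition) {
        if (secondTermPosition <= firstTermPosition) {
            throw new IllegalArgumentException("sketch only covers in-order terms");
        }
        return secondTermPosition - firstTermPosition - 1;
    }

    public static void main(String[] args) {
        // zero-based positions in the test data: "War" = 3, "Force" = 5
        int needed = requiredSlop(3, 5);
        System.out.println("slop needed: " + needed);       // slop needed: 1
        System.out.println("matches ~1: " + (needed <= 1)); // matches ~1: true
    }
}
```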
The first part of the process performs a standard query execution, using the classic query parser:
Directory dir = FSDirectory.open(Paths.get(indexPath));
try (DirectoryReader dirReader = DirectoryReader.open(dir)) {
    IndexSearcher indexSearcher = new IndexSearcher(dirReader);
    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser(fieldName, analyzer);
    Query query = parser.parse(searchTerm);
    TopDocs topDocs = indexSearcher.search(query, 100);
    ScoreDoc[] hits = topDocs.scoreDocs;
    for (ScoreDoc hit : hits) {
        handleHit(hit, query, dirReader, indexSearcher);
    }
}
The handleHit() method (shown below) is where we use the FastVectorHighlighter.
If you only want to perform highlighting (and do not need position/offset data), you can use:
FastVectorHighlighter fvh = new FastVectorHighlighter();
String fragment = fvh.getBestFragment(fieldQuery, dirReader, docId, fieldName, fragCharSize);
But to access the extra data we need, you can do the following:
boolean phraseHighlight = true;
boolean fieldMatch = true;
FieldQuery fieldQuery = new FieldQuery(query, dirReader, phraseHighlight, fieldMatch);
FieldTermStack fieldTermStack = new FieldTermStack(dirReader, hit.doc, fieldName, fieldQuery);
FieldPhraseList fieldPhraseList = new FieldPhraseList(fieldTermStack, fieldQuery);
FragListBuilder fragListBuilder = new SimpleFragListBuilder();
FragmentsBuilder fragmentsBuilder = new SimpleFragmentsBuilder();
FastVectorHighlighter fvh = new FastVectorHighlighter(phraseHighlight, fieldMatch,
        fragListBuilder, fragmentsBuilder);
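This builds a FieldPhraseList from the document's term vectors and the query, together with a FastVectorHighlighter that uses the same fragment builders. The FieldPhraseList is what holds the position and offset data.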
The highlighting call now uses getBestFragments:
// use whatever you want for these settings:
int fragCharSize = 100;
int maxNumFragments = 100;
String[] preTags = new String[]{"-->"};
String[] postTags = new String[]{"<--"};
Encoder encoder = new DefaultEncoder();
// the fragments string array contains the highlighted results:
String[] fragments = fvh.getBestFragments(fieldQuery, dirReader, hit.doc,
        fieldName, fragCharSize, maxNumFragments, fragListBuilder,
        fragmentsBuilder, preTags, postTags, encoder);
And finally we can use the fieldPhraseList to access the data we need:
// the following gives you access to positions and offsets:
fieldPhraseList.getPhraseList().forEach(weightedPhraseInfo -> {
    int phraseStartOffset = weightedPhraseInfo.getStartOffset(); // 19
    int phraseEndOffset = weightedPhraseInfo.getEndOffset(); // 34
    weightedPhraseInfo.getTermsInfos().forEach(termInfo -> {
        String term = termInfo.getText(); // "war" "force"
        int termPosition = termInfo.getPosition() + 1; // 4 6
        int termStartOffset = termInfo.getStartOffset(); // 19 29
        int termEndOffset = termInfo.getEndOffset(); // 22 34
    });
});
The phraseStartOffset and phraseEndOffset are character offsets telling us where the whole phrase can be found in the source document:
State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY
So, in our case, this is the string from offsets 19 through 34 (offset 0 is the position on the left hand side of the first "S").
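As a quick sanity check, the offsets can be applied directly to the original string. This is a minimal plain-Java sketch (the class and method names are hypothetical); it relies on the start offset being inclusive and the end offset exclusive, which is the same convention String.substring uses.

```java
class OffsetCheck {

    static final String CONTENT =
            "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY";

    // Start offset is inclusive, end offset is exclusive.
    static String slice(int startOffset, int endOffset) {
        return CONTENT.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        System.out.println(slice(19, 34)); // War WORD1 Force
        System.out.println(slice(19, 22)); // War
        System.out.println(slice(29, 34)); // Force
    }
}
```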
Then, for each specific term ("war" and "force") in the search query, we can access its offsets and also its word position (termPosition). Position 0 is the first word, so I add 1 to this index to give "war" at position 4 and "force" at position 6 in the original document:
1     2        3   4   5     6     7   8     9    10
State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY
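The numbers above can be reproduced without Lucene. The sketch below is only a rough stand-in for the analyzer: it splits on whitespace and lowercases each token, which happens to match what StandardAnalyzer produces for this particular test string but is not equivalent in general. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class PositionSketch {

    // One token with its 1-based word position and its character offsets.
    static final class Token {
        final String text;
        final int position;
        final int start; // inclusive
        final int end;   // exclusive

        Token(String text, int position, int start, int end) {
            this.text = text;
            this.position = position;
            this.start = start;
            this.end = end;
        }
    }

    // Whitespace tokenizer that records positions and offsets (an approximation only).
    static List<Token> tokenize(String content) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        int i = 0;
        while (i < content.length()) {
            while (i < content.length() && Character.isWhitespace(content.charAt(i))) {
                i++; // skip separators
            }
            if (i >= content.length()) {
                break;
            }
            int start = i;
            while (i < content.length() && !Character.isWhitespace(content.charAt(i))) {
                i++; // consume the token
            }
            position++;
            String text = content.substring(start, i).toLowerCase(Locale.ROOT);
            tokens.add(new Token(text, position, start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String content = "State WEAPONRY 123 War WORD1 Force And Book4 Book WEAPONRY";
        for (Token t : tokenize(content)) {
            System.out.println(t.position + " " + t.text + " [" + t.start + ", " + t.end + ")");
        }
    }
}
```

For the test string, the tokens "war" and "force" come out at positions 4 and 6 with offsets 19 to 22 and 29 to 34, matching the values reported by the FieldPhraseList.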
Here is the complete code for reference:
import java.io.IOException;
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.DefaultEncoder;
import org.apache.lucene.search.highlight.Encoder;
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.FieldPhraseList;
import org.apache.lucene.search.vectorhighlight.FieldQuery;
import org.apache.lucene.search.vectorhighlight.FieldTermStack;
import org.apache.lucene.search.vectorhighlight.FragListBuilder;
import org.apache.lucene.search.vectorhighlight.FragmentsBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragListBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragmentsBuilder;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
public class VectorIndexHighlighterDemo {

    final String indexPath = "./index";
    final String fieldName = "body";
    final String searchTerm = "\"War Force\"~1";

    public void doDemo() throws IOException, ParseException {
        Directory dir = FSDirectory.open(Paths.get(indexPath));
        try (DirectoryReader dirReader = DirectoryReader.open(dir)) {
            IndexSearcher indexSearcher = new IndexSearcher(dirReader);
            Analyzer analyzer = new StandardAnalyzer();
            QueryParser parser = new QueryParser(fieldName, analyzer);
            Query query = parser.parse(searchTerm);
            System.out.println();
            System.out.println("Search term: [" + searchTerm + "]");
            System.out.println("Parsed query: [" + query.toString() + "]");
            TopDocs topDocs = indexSearcher.search(query, 100);
            ScoreDoc[] hits = topDocs.scoreDocs;
            for (ScoreDoc hit : hits) {
                handleHit(hit, query, dirReader, indexSearcher);
            }
        }
    }

    private void handleHit(ScoreDoc hit, Query query, DirectoryReader dirReader,
            IndexSearcher indexSearcher) throws IOException {
        boolean phraseHighlight = true;
        boolean fieldMatch = true;
        FieldQuery fieldQuery = new FieldQuery(query, dirReader, phraseHighlight, fieldMatch);
        FieldTermStack fieldTermStack = new FieldTermStack(dirReader, hit.doc, fieldName, fieldQuery);
        FieldPhraseList fieldPhraseList = new FieldPhraseList(fieldTermStack, fieldQuery);
        FragListBuilder fragListBuilder = new SimpleFragListBuilder();
        FragmentsBuilder fragmentsBuilder = new SimpleFragmentsBuilder();
        FastVectorHighlighter fvh = new FastVectorHighlighter(phraseHighlight, fieldMatch,
                fragListBuilder, fragmentsBuilder);

        // use whatever you want for these settings:
        int fragCharSize = 100;
        int maxNumFragments = 100;
        String[] preTags = new String[]{"-->"};
        String[] postTags = new String[]{"<--"};
        Encoder encoder = new DefaultEncoder();

        // the fragments string array contains the highlighted results:
        String[] fragments = fvh.getBestFragments(fieldQuery, dirReader, hit.doc,
                fieldName, fragCharSize, maxNumFragments, fragListBuilder,
                fragmentsBuilder, preTags, postTags, encoder);

        // the following gives you access to positions and offsets:
        fieldPhraseList.getPhraseList().forEach(weightedPhraseInfo -> {
            int phraseStartOffset = weightedPhraseInfo.getStartOffset(); // 19
            int phraseEndOffset = weightedPhraseInfo.getEndOffset(); // 34
            weightedPhraseInfo.getTermsInfos().forEach(termInfo -> {
                String term = termInfo.getText(); // "war" "force"
                int termPosition = termInfo.getPosition() + 1; // 4 6
                int termStartOffset = termInfo.getStartOffset(); // 19 29
                int termEndOffset = termInfo.getEndOffset(); // 22 34
            });
        });

        // get the scores, also, if needed:
        BigDecimal score = new BigDecimal(String.valueOf(hit.score))
                .setScale(3, RoundingMode.HALF_EVEN);
        Document hitDoc = indexSearcher.doc(hit.doc);
    }
}