I have already asked a similar question at Calculating Word Proximity in an inverted Index.
However, I felt that question was too general and not refined enough, so here is a more specific version.
I have a list that holds the locations of a token in a document. For each token it is declared as
public List<int> hitLocation;
Let's say the document is
Java programming language has a name similar to java island in Indonesia however
local language in java bears no resemblance to the programming language called java.
and the query is
java island language
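To make the setup concrete, here is a minimal sketch of how the hit lists for this example could be built. It assumes tokens are split on whitespace and simple punctuation, lower-cased, and numbered by 0-based word position; the names HitListDemo and BuildHitLists are made up for illustration and are not part of my actual index.

```csharp
using System;
using System.Collections.Generic;

public static class HitListDemo
{
    // Maps each lower-cased token to the list of its 0-based word positions.
    // Assumption: splitting on whitespace and '.'/',' is enough for this example.
    public static Dictionary<string, List<int>> BuildHitLists(string text)
    {
        var hits = new Dictionary<string, List<int>>();
        char[] separators = { ' ', '\n', '\r', '\t', '.', ',' };
        string[] tokens = text.ToLowerInvariant()
                              .Split(separators, StringSplitOptions.RemoveEmptyEntries);

        for (int position = 0; position < tokens.Length; position++)
        {
            if (!hits.TryGetValue(tokens[position], out List<int> hitLocation))
            {
                hitLocation = new List<int>();
                hits[tokens[position]] = hitLocation;
            }
            hitLocation.Add(position);
        }
        return hits;
    }

    public static void Main()
    {
        string document =
            "Java programming language has a name similar to java island in Indonesia however " +
            "local language in java bears no resemblance to the programming language called java.";

        Dictionary<string, List<int>> hits = BuildHitLists(document);
        foreach (string term in new[] { "java", "island", "language" })
        {
            Console.WriteLine(term + ": " + string.Join(", ", hits[term]));
        }
        // Prints:
        // java: 0, 8, 16, 25
        // island: 9
        // language: 2, 14, 23
    }
}
```

With this numbering, "java" has four hits, "island" has one, and "language" has three, which is the situation described below.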
Say I lock on to the "java" hit list and attempt to directly calculate the distance between the "java", "island" and "language" hit lists.
The first problem is that there are 4 occurrences of the "java" token in the document. Which one do I select? Assume I select the first one.
I move on to the "island" hit list and, after comparing, find that it is adjacent to the second occurrence of "java". So I change my selection and lock on to the second occurrence of "java".
Proceeding to the third token, "language", I find that it is situated at quite a distance from my selection, but it is quite near the first occurrence of "java".
So you see the dilemma: if I revert to the original selection, i.e. the first occurrence of "java", the distance to the second token "island" increases, and if I stay with my current selection, the sheer distance to "language" will ruin the relevance score.
Previously there was a suggestion to use a dot product, but I am at a loss as to how to proceed with that option.
Any other solution would also be welcome.
I understand that this question is quite detailed. However, I have searched long and hard and haven't found any question like this on the topic.
I feel that an answer to this question would be a great addition to the community and would help anybody who is designing anything related to relevance ranking.
Thank you.
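For example, one alternative I can imagine (a rough sketch, not something suggested in the earlier question) is to stop locking on to a single occurrence and instead look for the smallest window of positions that contains at least one hit from every query term. The sketch below merges the hit lists and slides a window over them; the names ProximityDemo and SmallestWindow are hypothetical, and the positions assume 0-based word numbering of the example document. I don't know whether window width alone is a good relevance signal.

```csharp
using System;
using System.Collections.Generic;

public static class ProximityDemo
{
    // Returns the smallest window [Start, End] of positions that contains at least
    // one hit from every list, or null if there are no lists or any list is empty.
    public static (int Start, int End)? SmallestWindow(List<List<int>> hitLists)
    {
        // Merge all hits into one list of (position, term index), sorted by position.
        var merged = new List<(int Position, int Term)>();
        for (int term = 0; term < hitLists.Count; term++)
        {
            if (hitLists[term].Count == 0) return null;
            foreach (int position in hitLists[term])
            {
                merged.Add((position, term));
            }
        }
        merged.Sort((a, b) => a.Position.CompareTo(b.Position));

        var countInWindow = new int[hitLists.Count];
        int termsCovered = 0;
        int left = 0;
        (int Start, int End)? best = null;

        for (int right = 0; right < merged.Count; right++)
        {
            if (countInWindow[merged[right].Term]++ == 0) termsCovered++;

            // While every term is covered, record the window and shrink it from the left.
            while (termsCovered == hitLists.Count)
            {
                int start = merged[left].Position;
                int end = merged[right].Position;
                if (best == null || end - start < best.Value.End - best.Value.Start)
                {
                    best = (start, end);
                }
                if (--countInWindow[merged[left].Term] == 0) termsCovered--;
                left++;
            }
        }
        return best;
    }

    public static void Main()
    {
        var hitLists = new List<List<int>>
        {
            new List<int> { 0, 8, 16, 25 }, // java
            new List<int> { 9 },            // island
            new List<int> { 2, 14, 23 }     // language
        };

        var window = SmallestWindow(hitLists);
        if (window.HasValue)
        {
            Console.WriteLine("Smallest window: " + window.Value.Start + ".." + window.Value.End);
        }
        // Prints: Smallest window: 8..14
    }
}
```

On the example this picks the second "java" (8), "island" (9) and the second "language" (14), so no single occurrence has to be chosen up front.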