Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
411 views
in Technique[技术] by (71.8m points)

sql server - Keyword to SQL search

Use Case

When a user goes to my website, they will be confronted with a search box much like SO. They can search for results using plan text. ".net questions", "closed questions", ".net and java", etc.. The search will function a bit different that SO, in that it will try to as much as possible of the schema of the database rather than a straight fulltext search. So ".net questions" will only search for .net questions as opposed to .net answers (probably not applicable to SO case, just an example here), "closed questions" will return questions that are closed, ".net and java" questions will return questions that relate to .net and java and nothing else.

Problem

I'm not too familiar with the words but I basically want to do a keyword to SQL driven search. I know the schema of the database and I also can datamine the database. I want to know any current approaches there that existing out already before I try to implement this. I guess this question is for what is a good design for the stated problem.

Proposed

My proposed solution so far looks something like this

  1. Clean the input. Just remove any special characters
  2. Parse the input into chunks of data. Break an input of "c# java" into c# and java Also handle the special cases like "'c# java' questions" into 'c# java' and "questions".
  3. Build a tree out of the input
  4. Bind the data into metadata. So convert stuff like closed questions and relate it to the isclosed column of a table.
  5. Convert the tree into a sql query.

Thoughts/suggestions/links?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I run a digital music store with a "single search" that weights keywords based on their occurrences and the schema in which Products appear, eg. with different columns like "Artist", "Title" or "Publisher".

Products are also related to albums and playlists, but for simpler explanation, I will only elaborate on the indexing and querying of Products' Keywords.

Database Schema

Keywords table - a weighted table for every word that could possibly be searched for (hence, it is referenced somewhere) with the following data for each record:

  • Keyword ID (not the word),
  • The Word itself,
  • A Soundex Alpha value for the Word
  • Weight

ProductKeywords table - a weighted table for every keyword referenced by any of a product's fields (or columns) with the following data for each record:

  • Product ID,
  • Keyword ID,
  • Weight

Keyword Weighting

The weighting value is an indication of how often the words occurs. Matching keywords with a lower weight are "more unique" and are more likely to be what is being searched for. In this way, words occurring often are automatically "down-weighted", eg. "the", "a" or "I". However, it is best to strip out atomic occurrences of those common words before indexing.

I used integers for weighting, but using a decimal value will offer more versatility, possibly with slightly slower sorting.

Indexing

Whenever any product field is updated, eg. Artist or Title (which does not happen that often), a database trigger re-indexes the product's keywords like so inside a transaction:

  1. All product keywords are disassociated and deleted if no longer referenced.
  2. Each indexed field (eg. Artist) value is stored/retrieved as a keyword in its entirety and related to the product in the ProductKeywords table for a direct match.
  3. The keyword weight is then incremented by a value that depends on the importance of the field. You can add, subtract weight based on the importance of the field. If Artist is more important than Title, Subtract 1 or 2 from its ProductKeyword weight adjustment.
  4. Each indexed field value is stripped of any non-alphanumeric characters and split into separate word groups, eg. "Billy Joel" becomes "Billy" and "Joel".
  5. Each separate word group for each field value is soundexed and stored/retrieved as a keyword and associated with the product in the same way as in step 2. If a keyword has already been associated with a product, its weight is simply adjusted.

Querying

  1. Take the input query search string in its entirety and look for a direct matching keyword. Retrieve all ProductKeywords for the keyword in an in-memory table along with Keyword weight (different from ProductKeyword weight).
  2. Strip out all non-alphanumeric characters and split query into keywords. Retrieve all existing keywords (only a few will match). Join ProductKeywords to matching keywords to in-memory table along with Keyword weight, which is different from the ProductKeyword weight.
  3. Repeat Step 2 but use soundex values instead, adjusting weights to be less relevant.
  4. Join retrieved ProductKeywords to their related Products and retrieve each product's sales, which is a measure of popularity.
  5. Sort results by Keyword weight, ProductKeyword weight and Sales. The final summing/sorting and/or weighting depends on your implementation.
  6. Limit results and return product search results to client.

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...