Saturday, June 25, 2016

Full text search, languages and SQL

Regular SQL search engine has a limited support for searching text data. It supports the regular comparison operators (equals, greater, less than etc.) and prefix searches. There's also a limited support for substring search / pattern matching using LIKE queries, however it's next to impossible to create indexes that would allow running such queries efficiently.
Full text search is an attempt to address that limitation. A full text search engine breaks document text into words, then creates an index on each word it found.
A separate query language is used to search full text index. It enables searching for documents containing specified words or phrases.

Text language plays an important role, both in indexing and in searching. During indexing, language is used to determine the rules used for breaking the text into words. Also, each language defines certain words (known as stopwords) that should not be indexed, because they are too popular to provide value in searches.
During searching, language is used for analyzing the query provided by user. Again the query needs to be broken into words, and the rules of breaking should match the ones used for indexing. Otherwise the search engine would end up searching for words that were not indexed, and would happily return an empty list of documents.

Microsoft SQL enables storing documents written in different languages in XML column. The documents are indexed using language specified in XML attribute.
When querying XML column, language should be passed to full text engine in order to aid in parsing the query, and possibly limit the results. If the language specified during indexing is not reliable, it may be necessary to run the same query specifying different languages in order to get a complete list of matching documents.

Running multiple CONTAINSTABLE queries and merging results with UNION is not a good option; in a test setting with 42 languages a query took 15 minutes to execute. The same query with a single language took 3 seconds. Closer inspection revealed that all languages returned the exact same documents, so another approach was necessary.

MS SQL engine provides a function that translates the given query into a list of words to search. Using that function it is possible to check if two languages interpret the given query in the same way, and based on that information, query only for the languages that have a chance to return different results.
The function accepts 4 parameters; query and language were discussed already. The remaining ones are stoplist_id and accent_sensitivity. Stoplist ID is a property of full text index, so this parameter should match the value set for the queried index. Accent sensitivity is a property of full text catalog, and again, this parameter should match the queried catalog.

There's one caveat associated with the stoplists, waiting to ambush the unaware. By default, when a full text query contains stopwords, the query returns no results. Given that stopwords are generally expected to be present in most documents, not returning them is at least surprising. Fortunately this setting can be changed to the more intuitive option.

No comments:

Post a Comment