Saturday, June 25, 2016

Full text search, languages and SQL

Regular SQL search engine has a limited support for searching text data. It supports the regular comparison operators (equals, greater, less than etc.) and prefix searches. There's also a limited support for substring search / pattern matching using LIKE queries, however it's next to impossible to create indexes that would allow running such queries efficiently.
Full text search is an attempt to address that limitation. A full text search engine breaks document text into words, then creates an index on each word it found.
A separate query language is used to search full text index. It enables searching for documents containing specified words or phrases.

Text language plays an important role, both in indexing and in searching. During indexing, language is used to determine the rules used for breaking the text into words. Also, each language defines certain words (known as stopwords) that should not be indexed, because they are too popular to provide value in searches.
During searching, language is used for analyzing the query provided by user. Again the query needs to be broken into words, and the rules of breaking should match the ones used for indexing. Otherwise the search engine would end up searching for words that were not indexed, and would happily return an empty list of documents.

Microsoft SQL enables storing documents written in different languages in XML column. The documents are indexed using language specified in XML attribute.
When querying XML column, language should be passed to full text engine in order to aid in parsing the query, and possibly limit the results. If the language specified during indexing is not reliable, it may be necessary to run the same query specifying different languages in order to get a complete list of matching documents.

Running multiple CONTAINSTABLE queries and merging results with UNION is not a good option; in a test setting with 42 languages a query took 15 minutes to execute. The same query with a single language took 3 seconds. Closer inspection revealed that all languages returned the exact same documents, so another approach was necessary.

MS SQL engine provides a function that translates the given query into a list of words to search. Using that function it is possible to check if two languages interpret the given query in the same way, and based on that information, query only for the languages that have a chance to return different results.
The function accepts 4 parameters; query and language were discussed already. The remaining ones are stoplist_id and accent_sensitivity. Stoplist ID is a property of full text index, so this parameter should match the value set for the queried index. Accent sensitivity is a property of full text catalog, and again, this parameter should match the queried catalog.

There's one caveat associated with the stoplists, waiting to ambush the unaware. By default, when a full text query contains stopwords, the query returns no results. Given that stopwords are generally expected to be present in most documents, not returning them is at least surprising. Fortunately this setting can be changed to the more intuitive option.

Saturday, June 4, 2016

Story of a bug

Recently I've run into a problem with one of our processes getting stuck. This would only happen on a heavily loaded Windows 2008 machine - the same process is running well on Windows 2003, and under smaller load it is running fine on 2008 and 2012.
The process is picking up files from user directories on a network share, then notifying downstream processes of any files it found. Under the hood there are multiple threads constantly scanning the directories and moving found files out of the way.

The process would usually get stuck after around 12 hours of operation. Attempting to restart it resulted in another process being created, but the old process remained in memory. The new process would get stuck immediately, and processing would only resume after a reboot.

The hunt

Googling unkillable processes was not an easy task, but after some research I found Raymond Chen's page listing driver fault and open handles as the only reasons for processes that won't go away. Raymond knows Windows inside out, so my search ended here.

Now. Installing a kernel debugger (or any debugger for that matter) on the machine in question wasn't going well, so I tried using process explorer to find out what's happening in the hung processes. Initially I wanted to look for open handles to the process, but it turned out that the tool can do much more - in particular, it can display stack trace of all threads of a running application. I found that most threads got stuck in FindFirstFileA. This function is used to list contents of a directory, and is the central piece of all processing done by the application.

This was quite unexpected. FindFirstFile is a system function that is expected to return immediately. Since the scan was running on a network share, I wanted to check if other applications can access it. Sure enough, running dir \\machine\share from command prompt resulted in an unkillable command prompt.

It was apparent that the machines don't work well together. Out of curiosity I wanted to check if they would resume operation if I kill the connection between them. Using TCPView I closed the connection to port 445 on the remote machine. All dead processes disappeared immediately, and the surviving ones resumed operation. Dir returned an empty list, even though the share was not empty, so the result was not entirely correct.

Conclusions

Earlier I mentioned how SMB handling in 2008 is much better than in 2003. Apparently it's not all roses.
Using TCPView to kill connection whenever it gets stuck could be a short-term solution to the problem.
Similar problem was reported on MSDN forums, no solution found.
Since Windows 2008 is in extended support phase, I'm going to try Windows 2012 before reporting this to MS.

Edit
5 days have passed since updating to Windows 2012 R2. There were no incidents since.
The new system is still using SMB 2.02 for connecting to the share, as indicated by PowerShell command get-SmbConnection. Either the problem was in 2008 implementation of SMB2, or Win 2012 just does a better job of handling server-side problems.