Saturday, June 4, 2016

Story of a bug

Recently I've run into a problem with one of our processes getting stuck. This would only happen on a heavily loaded Windows 2008 machine - the same process is running well on Windows 2003, and under smaller load it is running fine on 2008 and 2012.
The process is picking up files from user directories on a network share, then notifying downstream processes of any files it found. Under the hood there are multiple threads constantly scanning the directories and moving found files out of the way.

The process would usually get stuck after around 12 hours of operation. Attempting to restart it resulted in another process being created, but the old process remained in memory. The new process would get stuck immediately, and processing would only resume after a reboot.

The hunt

Googling unkillable processes was not an easy task, but after some research I found Raymond Chen's page listing driver fault and open handles as the only reasons for processes that won't go away. Raymond knows Windows inside out, so my search ended here.

Now. Installing a kernel debugger (or any debugger for that matter) on the machine in question wasn't going well, so I tried using process explorer to find out what's happening in the hung processes. Initially I wanted to look for open handles to the process, but it turned out that the tool can do much more - in particular, it can display stack trace of all threads of a running application. I found that most threads got stuck in FindFirstFileA. This function is used to list contents of a directory, and is the central piece of all processing done by the application.

This was quite unexpected. FindFirstFile is a system function that is expected to return immediately. Since the scan was running on a network share, I wanted to check if other applications can access it. Sure enough, running dir \\machine\share from command prompt resulted in an unkillable command prompt.

It was apparent that the machines don't work well together. Out of curiosity I wanted to check if they would resume operation if I kill the connection between them. Using TCPView I closed the connection to port 445 on the remote machine. All dead processes disappeared immediately, and the surviving ones resumed operation. Dir returned an empty list, even though the share was not empty, so the result was not entirely correct.

Conclusions

Earlier I mentioned how SMB handling in 2008 is much better than in 2003. Apparently it's not all roses.
Using TCPView to kill connection whenever it gets stuck could be a short-term solution to the problem.
Similar problem was reported on MSDN forums, no solution found.
Since Windows 2008 is in extended support phase, I'm going to try Windows 2012 before reporting this to MS.

Edit
5 days have passed since updating to Windows 2012 R2. There were no incidents since.
The new system is still using SMB 2.02 for connecting to the share, as indicated by PowerShell command get-SmbConnection. Either the problem was in 2008 implementation of SMB2, or Win 2012 just does a better job of handling server-side problems.

No comments:

Post a Comment