Thursday, December 22, 2016

"Your password must be at least 18770 characters and cannot repeat any of your previous 30689 passwords"

I am bad at remembering words.

Seriously. Back in school I could never remember poems, my lines in a play, or theorems. They always took me days to memorize, and I forgot them anyway. Well, at least today I don't have to remember things word for word. But there are still words I have to remember - passwords.

We all know the rules for good passwords. Use a different password for every site, make sure your password is hard to guess, never write it down or share it. These are the basics. How does that look in practice?

At my job at TR I need daily access to systems protected by 9 distinct authentication databases. Their requirements differ, but at least 4 of them enforce the following:
- Password needs to be changed every 90 days
- Password must have at least 15 characters
- Password must have characters from at least 3 of character categories (uppercase, lowercase, digits, punctuation)
- Password must not be similar to any of the previous passwords
Okay now. After 3 years in the company I've run out of ideas for memorable passwords. Seriously. I've spent an hour looking at the screen, trying different combinations, and all I got was "too similar to one of the previous passwords".

My way out was to type a bunch of random characters, make sure the constraints are met, use the result as a password, and write it down in a file. This is wrong for various reasons: I can't access the services from any machine that doesn't have the file, if I lose the file I lose access to the systems, and if someone copies the file I won't even notice.
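For what it's worth, the "bunch of random characters" part doesn't have to be typed by hand. A minimal sketch in Python - the length and character-category rules are the ones listed above; storing the result safely is a separate problem:
    import secrets
    import string

    def generate_password(length=15):
        # Random password meeting the rules above: 15+ characters,
        # at least 3 of the 4 character categories.
        alphabet = string.ascii_letters + string.digits + string.punctuation
        categories = [string.ascii_uppercase, string.ascii_lowercase,
                      string.digits, string.punctuation]
        while True:
            candidate = "".join(secrets.choice(alphabet) for _ in range(length))
            if sum(any(ch in cat for ch in candidate) for cat in categories) >= 3:
                return candidate

    print(generate_password())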

Am I the only one with such problems? Well, probably not. NIST (a US government agency) is formulating a set of password rules to help design authentication systems where passwords actually have a purpose. And yay, there it is - "No more expiration without reason". When that goes in, I may actually invent one more password worth remembering.

Also, if you were wondering about this post's title: it's an actual Microsoft error message that I found while reading The Best Interface is No Interface. It's a great book about user experience engineering, so if you are interested in that subject, go get it.

Tuesday, August 9, 2016

Web response times

Have you ever spent your time just sitting and waiting for your computer to finish something? I know I have. Heck, my machine takes so long to boot that I keep it on as much as possible, just to avoid the wait.
As a developer, I spend a lot of time making sure that users of my applications won't complain about response times. As it turns out, response time limits are already well researched:
  • 0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result.
  • 1.0 second is about the limit for the user's flow of thought to stay uninterrupted, even though the user will notice the delay. Normally, no special feedback is necessary during delays of more than 0.1 but less than 1.0 second, but the user does lose the feeling of operating directly on the data.
  • 10 seconds is about the limit for keeping the user's attention focused on the dialogue. For longer delays, users will want to perform other tasks while waiting for the computer to finish, so they should be given feedback indicating when the computer expects to be done. Feedback during the delay is especially important if the response time is likely to be highly variable, since users will then not know what to expect.
Now, as you may notice, these refer to wall clock time.
Loading speed for a web page depends on the time required by the web server to prepare the response, the time needed to transmit it over the network, and the time needed to present it to the user. As a rule of thumb,

Make it as fast as possible...

If preparing the response is slow, no delivery mechanism can make the application feel instant. Here the usual rules apply:
  • Profile your code to see which part is taking up the most time, then improve it until you're satisfied (a quick sketch follows after this list).
  • Alternatively, get more powerful hardware.
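For example, in Python the built-in profiler is enough to get started; slow_endpoint below is just a stand-in for whatever builds your response:
    import cProfile
    import pstats

    def slow_endpoint():
        # stand-in for whatever builds your response
        return sum(i * i for i in range(1_000_000))

    profiler = cProfile.Profile()
    profiler.enable()
    slow_endpoint()
    profiler.disable()

    # show the 10 functions with the highest cumulative time
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)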
If transmission is slow, or the user's connection is the problem, there are still a few things worth knowing:
If users are physically distant from your server, response times won't improve past a certain point; the information travels over the internet at a finite speed. As a rule of thumb, every 100 kilometres of distance adds at least 1 millisecond to the round-trip (ping) time. Lookup charts are available - for example this page lists distances and ping times between most areas of the world.
Browser rendering speed can also be improved to some extent. YSlow can scan any page for areas that are known to cause slow rendering.

And then even faster

In the end, users care about how much time it takes them to complete their task. So even if all individual interactions are lightning-fast, they might still be unsatisfied with the end result. And conversely, even if their interaction takes hours to complete, they may be satisfied. Some examples:
  1. Windows installation used to follow these steps:
    • Initial configuration (hard disk choice, file system etc.)
    • File copying (slow)
    • Additional configuration (language, keyboard, time zone etc.)
    • More copying (slow)
    For the user this meant 2 slow interactions with the system. Today all configuration happens before the lengthy copying. From the operator's perspective the process is finished when the configuration is done; the rest happens automatically.
  2.  Sending mail with file attachments using web interfaces. The easiest way to implement this is to put a file input on the form, then submit the file together with the rest of the message. Unfortunately this approach has a few drawbacks:
    • Uploading file attachments is often a lengthy operation, especially on slow connections
    • In most implementations the server can validate the submitted form only after it is fully uploaded. Therefore, the user learns that the message could not be sent only after spending a lot of time on the upload.
    • Worse yet, after correcting the validation errors, the user has to select and upload the file again.
    Today most (if not all) web mail interfaces implement attachment upload as a separate (preferably background) operation that runs in parallel with message editing. Attachments can be validated before the message is even written, and submitting the now much smaller form happens almost instantly. (A sketch of this approach follows at the end of this post.)
  3. Obnoxious advertisements. If you're looking for certain content on a page, but every link you click directs you to yet another advertisement, loading times on these ads won't matter much - you won't like the experience.
Reducing the number of user interactions needed can be just as important as reducing the time needed to finish any individual interaction.
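As a footnote to example 2 above, the "upload attachments separately" idea is easy to sketch: the page first sends the file to its own endpoint, and the message form only carries the returned identifier. A minimal sketch using Flask - the routes, the size limit and the in-memory storage are illustrative assumptions, not a description of any real web mail system:
    import uuid
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    attachments = {}  # attachment id -> file bytes; a real system would use persistent storage

    @app.route("/attachments", methods=["POST"])
    def upload_attachment():
        # Runs in the background while the user keeps writing the message,
        # so size/type validation happens long before the message is submitted.
        uploaded = request.files["file"]
        data = uploaded.read()
        if len(data) > 25 * 1024 * 1024:
            return jsonify(error="attachment too large"), 413
        attachment_id = str(uuid.uuid4())
        attachments[attachment_id] = data
        return jsonify(id=attachment_id)

    @app.route("/messages", methods=["POST"])
    def send_message():
        # The message form only carries attachment ids, so this request is small and fast.
        body = request.get_json()
        files = [attachments[i] for i in body.get("attachment_ids", [])]
        # ... build and send the mail with `files` here ...
        return jsonify(status="sent", attachments=len(files))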

Tuesday, July 19, 2016

3 ways to break your real-time system

I recently came across three similar problems in three different systems.

Case 1: excessive buffering

The first system was a message publisher copying data from a SQL Server database to a message queue. Each row in a table was transformed into a message. The messages were sequentially numbered, and the system kept track of already processed rows by storing the number of the message it had last sent.
The system polled for new rows every minute, sent the corresponding messages, then updated the stored sequence number.
Trouble started when the publisher was stopped for a week. After that, every attempt to start it ended in a crash with no messages sent. It turned out that the application tried to load all pending messages into a dataset before sending anything. The number of unsent messages was enough to exhaust the available memory, causing the crash.
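A sketch of the obvious fix, assuming the rows can be fetched in order of their sequence number (the SQL and the helper names are made up for illustration): memory use stays bounded no matter how large the backlog grows.
    BATCH_SIZE = 1000  # upper bound on the number of rows held in memory at any time

    def publish_pending(connection, queue, last_sent):
        while True:
            cursor = connection.cursor()
            cursor.execute(
                "SELECT TOP (?) seq_no, payload FROM Messages "
                "WHERE seq_no > ? ORDER BY seq_no",
                (BATCH_SIZE, last_sent))
            rows = cursor.fetchall()
            if not rows:
                return last_sent
            for seq_no, payload in rows:
                queue.send(payload)
                last_sent = seq_no
            save_checkpoint(last_sent)  # placeholder for persisting the last sent number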

Case 2: thread-unsafe initialization

The second system was a message processor; it consumed messages from a number of message queues. Every queue was processed in a separate thread, and the threads were largely independent - or at least that's what we thought, until the system was stopped long enough for messages to accumulate in many queues. After that, the system could not be started at all, as the processing threads deadlocked during initialization.
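The post-mortem details aren't that interesting; the general pattern is. Whatever the worker threads share should be initialized exactly once, in one well-defined place, rather than lazily by whichever thread happens to get there first. A sketch - all the names below are illustrative, not the actual system:
    import threading

    queue_names = ["orders", "trades", "alerts"]  # illustrative

    def create_shared_resources():
        # placeholder for the shared setup (connections, caches, ...)
        return {"db": "connection", "cache": {}}

    def process_queue(name, resources):
        pass  # placeholder for the real per-queue processing loop

    _init_lock = threading.Lock()
    _shared = None

    def get_shared_resources():
        # Double-checked initialization: the shared setup runs exactly once,
        # under a single lock, no matter how many worker threads start at the same time.
        global _shared
        if _shared is None:
            with _init_lock:
                if _shared is None:
                    _shared = create_shared_resources()
        return _shared

    def worker(name):
        resources = get_shared_resources()  # only one lock, taken in one place
        process_queue(name, resources)

    threads = [threading.Thread(target=worker, args=(q,)) for q in queue_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()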

Case 3: Excessive deadlocking

The third system was also a message processor; it processed messages from a single queue, potentially using multiple threads under higher load. The threads were supposed to speed up message processing, but ended up doing the opposite: when multiple threads tried to run the same SQL query, many of them failed because of deadlocks, effectively nullifying any possible gains and rejecting legitimate messages.
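The post doesn't show the offending query, so treat this as a generic sketch: retry the statement when SQL Server picks it as the deadlock victim (error 1205) and back off a little, instead of rejecting the message outright.
    import random
    import time

    MAX_ATTEMPTS = 5

    def run_with_deadlock_retry(connection, sql, params):
        for attempt in range(MAX_ATTEMPTS):
            try:
                cursor = connection.cursor()
                cursor.execute(sql, params)
                connection.commit()
                return
            except Exception as error:
                connection.rollback()
                # 1205 is SQL Server's "chosen as deadlock victim" error
                if "1205" in str(error) and attempt < MAX_ATTEMPTS - 1:
                    time.sleep(random.uniform(0.05, 0.2) * (attempt + 1))
                    continue
                raise  # not a deadlock, or out of attempts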

So, how should they behave?

All of these systems performed well under regular load, and all of them required developer intervention once a large backlog had built up.
Ideally, the amount of time and resources required to process a message should be independent of the backlog size.
Problems of this sort can be detected by measuring the time needed to process a fixed number of messages under different load patterns. When all messages are available up front, the processing time should be the same or better than when new messages are added while the system processes the old ones.
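A sketch of such a measurement, with a dummy processor standing in for the real system: feed the same number of messages once as a ready-made backlog and once gradually, and compare the totals. A healthy system finishes the backlog run at least as fast; the systems above did much worse (or crashed).
    import queue
    import threading
    import time

    MESSAGE_COUNT = 10_000

    def process(message):
        pass  # stand-in for the real per-message work

    def consume(q):
        for _ in range(MESSAGE_COUNT):
            process(q.get())

    def measure(backlog_up_front):
        q = queue.Queue()
        consumer = threading.Thread(target=consume, args=(q,))
        start = time.perf_counter()
        if backlog_up_front:
            for i in range(MESSAGE_COUNT):  # the whole backlog exists before processing starts
                q.put(i)
            consumer.start()
        else:
            consumer.start()
            for i in range(MESSAGE_COUNT):  # messages trickle in while the consumer works
                q.put(i)
                time.sleep(0.0001)
        consumer.join()
        return time.perf_counter() - start

    print("backlog up front:", measure(True))
    print("gradual arrival :", measure(False))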

Not tending to these problems usually results in a midnight call from the sysadmins. While this might be a fun experience once in a while, it won't be appreciated by customers, and can easily get out of hand if left unattended.

Saturday, June 25, 2016

Full text search, languages and SQL

A regular SQL engine has limited support for searching text data. It supports the usual comparison operators (equals, greater than, less than, etc.) and prefix searches. There is also limited support for substring search and pattern matching using LIKE queries; however, it's next to impossible to create indexes that would allow such queries to run efficiently.
Full text search is an attempt to address that limitation. A full text search engine breaks document text into words, then builds an index of every word it found.
A separate query language is used to search the full text index; it enables searching for documents containing specified words or phrases.

The language of the text plays an important role, both in indexing and in searching. During indexing, the language determines the rules for breaking the text into words. Each language also defines certain words (known as stopwords) that should not be indexed, because they are too common to provide value in searches.
During searching, the language is used to analyze the query provided by the user. Again the query needs to be broken into words, and the word-breaking rules should match the ones used for indexing; otherwise the search engine would end up looking for words that were never indexed, and would happily return an empty list of documents.

Microsoft SQL Server allows documents written in different languages to be stored in an XML column. The documents are indexed using the language specified in an XML attribute.
When querying the XML column, the language should be passed to the full text engine to aid in parsing the query, and possibly to limit the results. If the language recorded during indexing is not reliable, it may be necessary to run the same query with different languages in order to get a complete list of matching documents.
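For illustration, this is roughly what one per-language query looks like when issued from Python with pyodbc. The table, column and connection string are made up; LANGUAGE takes an LCID such as 1033 for English, and here it is spliced in as an integer literal (as far as I remember it cannot be a parameter):
    import pyodbc

    connection = pyodbc.connect("DSN=research;Trusted_Connection=yes")  # illustrative DSN
    cursor = connection.cursor()

    query = "financial AND results"
    lcid = 1033  # English (US); 1045 would be Polish, and so on

    sql = (
        "SELECT d.DocumentId, k.RANK "
        "FROM CONTAINSTABLE(Documents, DocumentXml, ?, LANGUAGE {lcid}) AS k "
        "JOIN Documents AS d ON d.DocumentId = k.[KEY] "
        "ORDER BY k.RANK DESC"
    ).format(lcid=int(lcid))
    cursor.execute(sql, (query,))
    for document_id, rank in cursor.fetchall():
        print(document_id, rank)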

Running multiple CONTAINSTABLE queries and merging results with UNION is not a good option; in a test setting with 42 languages a query took 15 minutes to execute. The same query with a single language took 3 seconds. Closer inspection revealed that all languages returned the exact same documents, so another approach was necessary.

The MS SQL engine provides a function that translates a given query into the list of words it will search for. Using that function it is possible to check whether two languages interpret a given query in the same way and, based on that information, query only the languages that have a chance of returning different results.
The function accepts 4 parameters; the query and the language were discussed already. The remaining ones are stoplist_id and accent_sensitivity. The stoplist ID is a property of the full text index, so this parameter should match the value set for the queried index. Accent sensitivity is a property of the full text catalog, and again, this parameter should match the queried catalog.
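The function in question is, if I'm not mistaken, sys.dm_fts_parser, and the deduplication idea looks roughly like this: parse the query once per candidate language, group the languages by the word list they produce, and run the expensive full text query only once per distinct word list. The connection details, LCID list and parameter values below are illustrative:
    import pyodbc

    connection = pyodbc.connect("DSN=research;Trusted_Connection=yes")
    cursor = connection.cursor()

    query = '"financial results"'
    candidate_lcids = [1033, 1031, 1036, 1045]  # English, German, French, Polish
    stoplist_id = 0          # should match the stoplist of the queried index
    accent_sensitivity = 1   # should match the accent sensitivity of the catalog

    def parsed_words(lcid):
        # one row per token the engine would actually search for
        # (if your server insists on literals here, format them into the string instead)
        cursor.execute(
            "SELECT display_term FROM sys.dm_fts_parser(?, ?, ?, ?) ORDER BY display_term",
            (query, lcid, stoplist_id, accent_sensitivity))
        return tuple(row.display_term for row in cursor.fetchall())

    # keep one representative language per distinct interpretation of the query
    representatives = {}
    for lcid in candidate_lcids:
        representatives.setdefault(parsed_words(lcid), lcid)
    print("languages worth querying:", sorted(representatives.values()))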

There's one caveat associated with stoplists, waiting to ambush the unwary. By default, when a full text query contains stopwords, the query returns no results. Given that stopwords are expected to appear in most documents, returning nothing is surprising, to say the least. Fortunately this behaviour can be changed to the more intuitive option.
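If memory serves, the setting in question is the 'transform noise words' server option; with it enabled, stopwords in a query are ignored instead of making the whole query match nothing. A sketch - treat the option name as my recollection rather than gospel:
    import pyodbc

    # sp_configure/RECONFIGURE cannot run inside a transaction, hence autocommit
    connection = pyodbc.connect("DSN=research;Trusted_Connection=yes", autocommit=True)
    cursor = connection.cursor()
    cursor.execute("EXEC sp_configure 'show advanced options', 1")
    cursor.execute("RECONFIGURE")
    cursor.execute("EXEC sp_configure 'transform noise words', 1")
    cursor.execute("RECONFIGURE")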

Saturday, June 4, 2016

Story of a bug

Recently I ran into a problem with one of our processes getting stuck. This would only happen on a heavily loaded Windows 2008 machine - the same process runs well on Windows 2003, and under lighter load it runs fine on 2008 and 2012.
The process picks up files from user directories on a network share, then notifies downstream processes of any files it finds. Under the hood, multiple threads constantly scan the directories and move the files they find out of the way.

The process would usually get stuck after around 12 hours of operation. Attempting to restart it resulted in another process being created, but the old process remained in memory. The new process would get stuck immediately, and processing would only resume after a reboot.

The hunt

Googling for unkillable processes was not an easy task, but after some research I found Raymond Chen's page listing a faulty driver and open handles as the only reasons for processes that won't go away. Raymond knows Windows inside out, so my search ended there.

Now. Installing a kernel debugger (or any debugger for that matter) on the machine in question wasn't going well, so I tried using Process Explorer to find out what was happening in the hung processes. Initially I wanted to look for open handles to the process, but it turned out that the tool can do much more - in particular, it can display the stack trace of every thread in a running application. I found that most threads were stuck in FindFirstFileA. This function is used to list the contents of a directory, and it is the central piece of all processing done by the application.

This was quite unexpected. FindFirstFile is a system function that is expected to return immediately. Since the scan was running on a network share, I wanted to check whether other applications could access it. Sure enough, running dir \\machine\share from a command prompt resulted in an unkillable command prompt.

It was apparent that the machines didn't work well together. Out of curiosity I wanted to check whether they would resume operation if I killed the connection between them. Using TCPView I closed the connection to port 445 on the remote machine. All dead processes disappeared immediately, and the surviving ones resumed operation. Dir returned an empty list even though the share was not empty, so the result was not entirely correct.

Conclusions

Earlier I mentioned how SMB handling in 2008 is much better than in 2003. Apparently it's not all roses.
Using TCPView to kill the connection whenever it gets stuck could be a short-term workaround for the problem.
A similar problem was reported on the MSDN forums, with no solution found.
Since Windows 2008 is in its extended support phase, I'm going to try Windows 2012 before reporting this to MS.

Edit
5 days have passed since updating to Windows 2012 R2, and there have been no incidents since.
The new system still uses SMB 2.02 to connect to the share, as indicated by the PowerShell command Get-SmbConnection. Either the problem was in the 2008 implementation of SMB2, or Windows 2012 just does a better job of handling server-side problems.

Saturday, May 21, 2016

Eikon Research: ratings

Some research documents in Eikon are displayed with star ratings - one set of stars for the Estimate rating, another set for the Recommendation rating. These ratings are not an assessment of the content of the document - we are not in a position to tell whether following a recommendation to buy is going to make anyone rich or not. Instead, the document rating is in fact the rating of the document's author, based on how well their past predictions matched reality.
There are a number of systems involved in calculating the document ratings. First, we collect analyst estimates and recommendations. Some analysts deliver that information to us in a computer-readable format. Others deliver research documents only, and we try to extract the estimates from those.
Next, the Starmine system compares the past estimates and recommendations to the actual market data, and assigns ratings based on how well the estimates matched the market. The ratings are competitive - i.e. we do not assign ratings based on an absolute or relative error value; rather, the analysts with the best estimates get the best ratings. There are 2 types of ratings - a single stock rating for estimating the performance of an individual company, and an overall rating based on all the companies rated by the analyst.
Finally, the analyst ratings are assigned to research documents. For that to happen we need to match the correct analyst to the document (which is sometimes nontrivial) and determine the subject of the document. If the document is about a single company, we use the single stock rating for that company; otherwise we use the overall rating.

Analyst ratings change over time, as more estimates are added into the calculation. Document ratings are updated along with analyst ratings, so all documents by the same author and on the same subject will have the same rating.

If the document has multiple authors, we pick the rating of the first author.

The ratings are based on past performance, so earning a rating takes time. Recommendation ratings are based on performance over 24 months.

Thursday, April 7, 2016

Android file recovery with Linux

I just finished recovering deleted photos from an Android phone's internal memory; time to make some notes while I still remember how it went.
  • Take the phone offline. Every write to the phone's internal memory carries a risk of overwriting a portion of the deleted data, making it unrecoverable. An automatic application update at this moment is not desirable.
  • Install the Android SDK on your computer; you will need the adb tool to connect to the phone.
  • Enable USB debugging on the phone
  • Connect the phone to the computer and run adb shell
  • Review the disk contents.
(Image: one of the recovered photos - apparently my kid was the main user)
At this point I found a hidden folder, .thumbnails, next to the usual photo location, DCIM. The folder still contained thumbnails for many of the deleted photos - some directly as JPG files, some inside a .thumbdata3 file.
Thumbnails can easily be extracted from that file using a Python script (a minimal sketch follows below). Unfortunately no undelete tools are available in the default Android installation, so at that point you have 2 options - look for an Android app that can undelete files, or root the phone. Both options will use up some storage, potentially overwriting precious files. I decided to go with rooting.
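For reference, a minimal version of such a script - as far as I can tell the .thumbdata3 file is just a bundle of concatenated JPEGs, so scanning for the JPEG start and end markers is enough (a rough sketch, not a proper parser; the file name is only an example):
    import re

    with open(".thumbdata3--1967290299", "rb") as f:  # example name
        data = f.read()

    count = 0
    for match in re.finditer(b"\xff\xd8\xff", data):  # JPEG images start with FF D8 FF
        start = match.start()
        end = data.find(b"\xff\xd9", start)           # ... and end with FF D9
        if end == -1:
            break
        with open("thumb_%04d.jpg" % count, "wb") as out:
            out.write(data[start:end + 2])
        count += 1
    print("extracted", count, "thumbnails")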
There are a few options available for rooting; following a recommendation I used Kingo Root (it requires Windows, which was probably the hardest part). It took a few minutes to finish and used up ~20 MB of storage. Next steps:
  • Find the device hosting the /data partition: adb shell mount returns a list of mounted partitions, with device names starting with /dev/block.
  • Copy the contents of the partition to the computer. I found different suggestions:
    • Use cat /dev/block/mmcblk0p24
    • Use dd if=/dev/block/mmcblk0p9
These solutions are equivalent, with cat reported to be slightly faster.
Then the next choice:
    • Redirect shell output
    • Use busybox with netcat
Busybox requires a separate installation, so I decided to go with output redirection. After following the instructions I was not able to use the extracted data: apparently redirecting adb shell output converts LF to CRLF. Fortunately the conversion is easily reversible (a sketch follows below). One answer suggested using adb exec-out, but that one always returned Error:closed for me.
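For completeness, the dump and fix-up looked roughly like this - pull the partition through adb shell output redirection, then undo the LF-to-CRLF translation (the block device name is whatever adb shell mount reported for /data):
    # On the computer:
    #   adb shell su -c "dd if=/dev/block/mmcblk0p24" > mmcblk0p24.raw
    # The shell transport turns every LF into CRLF; undo that before using the image.
    # (Reading everything into memory is fine for a sketch; for a huge partition,
    # process the file in chunks instead.)
    with open("mmcblk0p24.raw", "rb") as f:
        data = f.read()
    with open("mmcblk0p24.dd", "wb") as f:
        f.write(data.replace(b"\r\n", b"\n"))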
After reversing the CRLF conversion I had the partition dump. The following tools can be used to recover data:
  • testdisk allows opening the dump file and recovering deleted files from the partition. I was able to recover 448 usable photos with it. There were more deleted entries, but their data was already unavailable.
  • photorec analyzes the entire partition, looking for files and validating their content. For me it recovered 1700 images, however many of these were from browser cache. It also recovered some photos that testdisk did not find.
  • mount the file as a regular partition and use other tools. Exact command was mount -o loop,ro,noexec,noload mmcblk0p24.dd mnt, with the extracted file and the mount point as parameters.
That's it.

Interestingly, many of the pages devoted to Android photo recovery listed kids as the cause of photo loss. Sometimes making a function too easy to use might not be the best thing to do.

Monday, March 7, 2016

Multitasking is bad and causes delays

I stumbled upon this interesting piece of information during a project management training session. There was a practical exercise to illustrate the problem:
People were split into 2 groups. Group 1 was told to write "Multitasking is bad and causes delays", then number every letter of that text - 1 under M, 2 under U, and so on. Group 2 had the same task, but they were told to write the number under every letter after writing it, so they would write one letter, one number, one letter, and so on.

The result of the exercise was that almost everyone in group 1 was done before the fastest performer from group 2 announced that they had finished the task.
The exercise was part of a larger session devoted to Eliyahu Goldratt's critical chain project management. His book The Goal introduces the theory of constraints, which is a very useful tool for resource planning and has applications in both project management and the analysis of real-time system performance. The book is a great read, too.

Saturday, February 13, 2016

Efficient file transmission, part 3

While I was struggling with SCP/SFTP and FTP transmissions, I found that SMB, the protocol behind Windows file sharing, is also adversely affected by network latency.
SMB v1 is a block-level protocol, similar to what we saw in SFTP; SMB v1, as implemented in Windows versions up to XP/2003, does not support pipelining. As a result its performance deteriorates on long fat networks, and it was virtually impossible to make full use of the available hardware with that protocol.
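A back-of-envelope illustration of why the lack of pipelining hurts - the 64 KB figure is my assumption for a typical SMB v1 read size, but the shape of the result holds regardless: with one request in flight at a time, the round-trip time caps the throughput no matter how fat the pipe is.
    block_size_kb = 64  # assumed SMB v1 read size per request
    for rtt_ms in (1, 10, 50, 100):
        # one block per round trip
        throughput_mb_s = (block_size_kb / 1024.0) / (rtt_ms / 1000.0)
        print("RTT %3d ms -> at most %6.1f MB/s per connection" % (rtt_ms, throughput_mb_s))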
The situation was vastly improved in SMB v2, implemented in Vista and 2008. That protocol supports larger block sizes and multiple parallel requests, making better use of the available bandwidth. In order to benefit from the new features, both client and server need to support SMB v2.
Recently I had a chance to compare the performance of both protocol versions; I used a tool that compressed 150k files downloaded over SMB. Running the tool on Windows 2003 took 6-8 hours. On 2008 it took 2 hours. Round-trip time was under 1 millisecond for both machines.