Tuesday, July 19, 2016

3 ways to break your real-time system

I recently came across three similar problems in three different systems.

Case 1: excessive buffering

The first system was a message publisher that copied data from SQL Server to a message queue. Each row in a table was transformed into a message. The messages were numbered sequentially, and the system kept track of already-processed rows by storing the number of the last message it sent.
The system polled for new rows every minute, sent any new messages, and then updated the stored sequence number.
Trouble started when the publisher was stopped for a week. After that, every attempt to start it ended in a crash with no messages sent. It turned out that the application tried to load all pending messages into a dataset before sending anything; the backlog of unsent messages was large enough to exhaust available memory, causing the crash.
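The usual fix is to cap how much of the backlog is held in memory at any time. Here is a minimal Python sketch of batched publishing; the messages(seq, payload) table, the send callable, and the SQL dialect are illustrative assumptions, not the original code:

```python
BATCH_SIZE = 1000  # upper bound on rows held in memory at any time

def publish_pending(conn, send, last_sent_seq):
    """Drain the backlog in fixed-size batches instead of loading it whole."""
    while True:
        rows = conn.execute(
            "SELECT seq, payload FROM messages"
            " WHERE seq > ? ORDER BY seq LIMIT ?",
            (last_sent_seq, BATCH_SIZE),
        ).fetchall()
        if not rows:
            return last_sent_seq          # backlog drained
        for seq, payload in rows:
            send(payload)                 # publish to the queue
            last_sent_seq = seq           # checkpoint after each message
```

Memory use is now bounded by BATCH_SIZE regardless of how long the publisher was down.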

Case 2: thread-unsafe initialization

The second system was a message processor that handled a number of message queues. Every queue was processed in a separate thread, and the threads were largely independent. Or at least that's what we thought, until the system was stopped long enough for messages to accumulate in many queues. After that, the system could not be started at all: the processing threads deadlocked during initialization.
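The post doesn't show the initialization code, but a classic way this kind of deadlock appears only under load is two threads taking the same pair of locks in opposite order: with empty queues the threads rarely overlap, with full queues they collide immediately. A toy Python illustration (the lock names are made up):

```python
import threading, time

lock_a = threading.Lock()   # e.g. guards a shared schema cache
lock_b = threading.Lock()   # e.g. guards a connection pool

def init_one():
    with lock_a:
        time.sleep(0.1)        # window in which the other thread runs
        with lock_b:
            print("init_one done")

def init_two():
    with lock_b:               # same locks, opposite order
        time.sleep(0.1)
        with lock_a:
            print("init_two done")

threads = [threading.Thread(target=f, daemon=True) for f in (init_one, init_two)]
for t in threads: t.start()
for t in threads: t.join(timeout=2)
print("deadlocked:", any(t.is_alive() for t in threads))
```

Taking locks in a single, globally agreed order removes the cycle and the deadlock with it.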

Case 3: excessive deadlocking

The third system was also a message processor; it processed messages from a single queue, potentially using multiple threads under higher load. The threads were supposed to speed up message processing, but ended up doing the opposite: when multiple threads tried to run a SQL query, many of them failed because of database deadlocks, which nullified any possible gains and caused legitimate messages to be rejected.
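One common mitigation (my assumption here, not what this particular system did) is to treat a database deadlock as a transient failure and retry the work with backoff instead of rejecting the message. A sketch, where execute and is_deadlock are placeholders for the real transaction and the driver-specific error check (SQL Server reports deadlock victims as error 1205):

```python
import random, time

MAX_RETRIES = 5

def run_with_deadlock_retry(execute, is_deadlock):
    """Retry a unit of work when the database picks it as a deadlock victim."""
    for attempt in range(MAX_RETRIES):
        try:
            return execute()
        except Exception as exc:
            if not is_deadlock(exc) or attempt == MAX_RETRIES - 1:
                raise
            # back off with jitter so colliding threads don't retry in lockstep
            time.sleep((2 ** attempt) * 0.05 + random.uniform(0, 0.05))
```

Reducing the scope of the contended query, or capping the number of concurrent database writers, attacks the same problem from the other side.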

So, how should they behave?

All of these systems performed well under regular load, and all of them required developer intervention when running under full load.
Ideally, the amount of time and resources required to process a message should be independent of the backlog size.
This sort of problem can be detected by measuring the time needed to process a fixed number of messages under different load patterns. When all messages are available up front, processing should take the same amount of time or less than when new messages are added only as the system works through the old ones.
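A minimal Python harness for that comparison might look like this; process is a stand-in for the system under test (an assumption, not code from any of the three systems):

```python
import time

def measure(process, messages, arrival_delay=0.0):
    """Time the processing of a fixed batch of messages.

    arrival_delay=0 means the whole backlog is available up front;
    a positive delay makes messages trickle in one by one.
    """
    start = time.monotonic()
    for msg in messages:
        if arrival_delay:
            time.sleep(arrival_delay)   # simulate gradual arrival
        process(msg)
    elapsed = time.monotonic() - start
    return elapsed - arrival_delay * len(messages)  # processing time only

msgs = list(range(1000))
burst = measure(lambda m: None, msgs)              # full backlog up front
trickle = measure(lambda m: None, msgs, 0.001)     # messages arrive gradually
print(f"burst={burst:.3f}s trickle={trickle:.3f}s")
# A healthy system processes the pre-loaded backlog at least as fast
# as the trickled one.
```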

Not tending to these problems usually results in a midnight call with the sysadmins. While this might be a fun experience once in a while, it won't be appreciated by customers, and can easily get out of hand if left unattended.