Sunday, December 20, 2015

Windows TCP buffering

After changing the send buffer size for FTP upload, I thought I would check if SFTP upload mentioned in part 1 also suffers from the same limitation. To my surprise, SFTP uploads were able to push data continuously, even though the send buffer size was not set anywhere in my code.
An experiment showed that the SendBufferSize property after an SFTP upload usually contained a different value from the original 8KB; for FTP uploads, the value always remained at 8KB. The only apparent difference in code was the application buffer size: the SFTP application was sending ~32KB batches, while the FTP application was always sending 8KB batches.
I increased the batch size on the FTP application side. Immediately it became apparent that the entire batch was transmitted over the wire before waiting for an ACK, which improved the send speed. Furthermore, with a batch size of 32KB the SendBufferSize was also automagically raised behind the scenes, improving the send speed further.
I couldn't find any documentation for the observed SendBufferSize changes. The effect of the application send buffer on transmission speed is described in this SO question and this MS Knowledge Base article. To summarize: if the application send buffer is the same size as or larger than the socket send buffer, only 2 pending send operations are allowed at a time. However, the socket can buffer whatever the application passes, even when the application buffer is larger than SendBufferSize, so using a larger application buffer can speed up transfers the same way a larger socket buffer can.
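The buffer sizes involved are easy to poke at with the plain sockets API. A minimal Python sketch (an illustration of the SO_SNDBUF socket option, which the .NET SendBufferSize property wraps - not the original application code):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Default socket send buffer; .NET's SendBufferSize reads the same option
default_sndbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)

# Raising it lets more data stay in flight before the sender blocks
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1 << 20)
raised_sndbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)

print(default_sndbuf, raised_sndbuf)
s.close()
```

Note that the value read back may differ from the value set (Linux, for example, doubles it to leave room for bookkeeping), which is one more reason not to assume the buffer size stays at whatever you wrote.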

Saturday, December 19, 2015

Efficient file transmission, part 2

Right after I fixed the SFTP performance problem mentioned in part 1, I found that another feed, this time using FTP, was gathering a backlog. Again, bandwidth was not an issue and latency was on the order of 40 ms, but somehow pushing even the smallest files took no less than 5 seconds. Equipped with Wireshark and a lot of enthusiasm, I started looking into the problem.
FTP is a much less sophisticated protocol than SFTP. It creates a dedicated TCP connection for every file transfer. The entire file is transferred over that connection, and closing the connection marks the end of file. All commands are exchanged over the so-called control connection. The normal sequence of events for an FTP upload (after establishing the control connection and logging in) is:
  • Client sends PASV command
  • Server responds with an address, to which the client should upload
  • Client connects to that address and issues a STOR command
  • Client starts pushing the file through the data connection
The first thing I noticed when examining a Wireshark trace was a two-second gap between receiving the PASV response and sending the SYN packet; during that gap, two NBNS packets were sent to the server, with no response. NBNS stands for NetBIOS Name Service, and it was completely out of place at that point, because PASV returns an IP address, which does not require any lookups.
I launched the application in the Visual Studio debugger, and during the 2 second pause I hit "Break all" to see what was happening. I found that the application was using DNS.Resolve to convert a string to an IPAddress object. The method is obsolete as of .NET 2.0, and the documentation suggests using DNS.GetHostEntry instead. The GetHostEntry (and Resolve) method has an interesting behavior that was not documented until .NET 3.5: when you pass an IP string to it, it first performs a reverse name lookup to get a host name, then performs a name lookup to get the IP addresses associated with that name. Since we only wanted an IPAddress object, I changed the code to use DNS.GetHostAddresses, which does not query name servers when passed an IP literal. This saved 2 seconds on every transfer.
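The same trap exists outside .NET: any API that treats an IP string as a host name may do DNS round trips behind your back. A sketch of the safe path in Python - parsing the literal locally instead of resolving it (the address is just a documentation example):

```python
import ipaddress

# Parsing an IP literal is a purely local operation - no name server is
# contacted, unlike the reverse-then-forward resolution described above
addr = ipaddress.ip_address("192.0.2.1")
print(addr.version, addr)  # → 4 192.0.2.1
```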

The next interesting thing I noticed was that the client was sending data in 8KB batches; it sent a batch, then waited a long time for the server to acknowledge the previous batch, and only then sent another 8KB batch. This meant that most of the time the client was just waiting for an acknowledgement from the server.
The receive window size captured in the traces showed plenty of free space; for a brief moment I suspected the congestion window, until I read that the congestion window doubles every round-trip time as long as there are no packet losses, and there were none in the traces. Then I found that the default send buffer size in Windows is 8KB.
I modified the send buffer size to 1MB. After that the client was no longer waiting for acknowledgements, but instead was sending data until the receive window was full; then the server advertised a larger receive window, enabling the client to send even more data, until the network started dropping packets.
Final result: files with average size of 500KB that previously took 5+ seconds to upload, now were sent in less than a second over the same link.

Friday, December 18, 2015

Efficient file transmission, part 1

Recently I faced a problem where our tool used to push files over SFTP started gathering a backlog, even though there was plenty of bandwidth available. The link had a pretty high latency, so the first thing I checked was TCP window scaling.
TCP itself can limit the bandwidth achievable on long fat networks (LFNs). The receive window limits the amount of data that can be sent over the link but not yet acknowledged, and in the original specification the receive window was limited to 64 KB. Bandwidth on any TCP link that does not employ TCP window scaling is therefore limited to 64KB divided by the network latency - so if the latency is 0.1 seconds, the maximum transfer rate with a 64KB window is 640KB/s. With window scaling, the receive window can be resized up to 1 GB.
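The window-divided-by-latency bound is just arithmetic, but writing it out makes it easy to plug in your own numbers; the figures below are the ones from the text:

```python
def max_tcp_throughput(window_bytes, rtt_seconds):
    """Upper bound on TCP throughput with a fixed receive window:
    at most one full window can be in flight per round trip."""
    return window_bytes / rtt_seconds

# 64KB window, 100 ms round trip -> 640KB/s, matching the figure above
bound = max_tcp_throughput(64 * 1024, 0.1)
print(bound / 1024, "KB/s")  # → 640.0 KB/s
```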
I found that TCP window scaling has been implemented since Windows 2000, and Linux kernels since 2004 implement dynamic receive buffers, which build on TCP window scaling. Since we use more recent OSes, I was pretty certain that window scaling was not the problem here. Using Wireshark I was able to confirm that indeed, the SYN packets advertised a non-zero scale factor.
Quick look at the code revealed that the protocol implementation we had required an immediate acknowledgement of every write message; the SFTP specification says that it is allowable to send multiple messages, and receive the acknowledgements later. Instead of waiting for an acknowledgement of every message, I started sending the entire file, then receiving all acknowledgements. This resulted in a massive speed up.
Driven by the desire to measure the improvement, I created a really large file, uploaded it using the old method, started uploading it using the new method, and... the transfer deadlocked. A TCP socket can only buffer a limited amount of data, and what happened was: since I never retrieved the acknowledgements, at some point the server blocked on sending one. When the server blocked, it stopped processing my messages, which in turn blocked the sender.
Unfortunately, checking whether an acknowledgement was available for reading in the buffer was not an easy task with the library we had, so I ended up limiting the number of unacknowledged messages to a known safe value, which still yielded a decent performance improvement.
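The fix can be sketched as a generic pattern: keep a bounded window of unacknowledged writes, and drain the oldest acknowledgement once the window is full. A Python sketch - the function names are made up for illustration, the real SFTP library calls differ:

```python
from collections import deque

MAX_OUTSTANDING = 16  # a known safe value, below the peer's buffering limit

def pipelined_upload(chunks, send_write, recv_ack):
    """Send chunks without waiting for each ack, but never let more than
    MAX_OUTSTANDING writes go unacknowledged (avoids the deadlock above)."""
    pending = deque()
    for chunk in chunks:
        if len(pending) >= MAX_OUTSTANDING:
            recv_ack(pending.popleft())   # drain the oldest ack before sending more
        pending.append(send_write(chunk))
    while pending:                        # collect the remaining acks at the end
        recv_ack(pending.popleft())
```

With MAX_OUTSTANDING = 1 this degenerates to the original write-then-wait behavior; raising it hides one round trip per in-flight message.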

Thursday, December 3, 2015

Playtime: labyrinth

A simple labyrinth generator.
The above labyrinth is generated using regular depth-first search. The result does not present any real challenge, as it contains one long path with very short branches. It is perfectly suitable to teach your 4-year-old to use the keyboard arrows.
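For reference, a depth-first (recursive backtracker) maze generator can be sketched in a few lines of Python; this is the general technique, not the original generator's code:

```python
import random

def generate_maze(width, height, seed=0):
    """Carve passages with iterative depth-first search.
    Returns the set of removed walls as unordered pairs of adjacent cells."""
    rng = random.Random(seed)
    visited = {(0, 0)}
    passages = set()
    stack = [(0, 0)]
    while stack:
        x, y = stack[-1]
        unvisited = [(nx, ny)
                     for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
                     if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in visited]
        if unvisited:
            nxt = rng.choice(unvisited)        # go deeper through a random neighbour
            passages.add(frozenset([(x, y), nxt]))
            visited.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()                        # dead end: backtrack
    return passages
```

Because DFS visits every cell exactly once, the passages form a spanning tree - one long corridor with short side branches, which is exactly why the result poses so little challenge.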

Thursday, November 19, 2015

Color blindness

Ishihara plate 9 - a color perception test for red-green deficiencies
Recently I was tasked with redesigning a report in a way that would make it friendlier for people who can't tell apart red and green. The report provides information in a tabular format, using colors to provide visual cues about the data.
For the rest of us, all colors are made of 3 components - red, green and blue. Microsoft's page on color blindness shows how other colors are composed from the three. It also shows how different colors are perceived by people with color deficiencies.
In order to adapt the existing web page, while preserving the user experience of existing users, I'm going to tweak the colors used.

There are some resources available on the web to aid in the task.
And finally, there's always the option of using other cues (instead of or together with color) to denote some properties of the displayed information.

Tuesday, November 17, 2015

ElasticSearch result merging

I was curious how ElasticSearch handles merging multiple result sets (one from each shard) into a single sorted response. In my last post I went over a few methods of merging multiple sorted streams; then I went to see how ES handles it.
I assumed that ES uses a priority queue. Searching for PriorityQueue in ElasticSearch turned up a bug report containing an interesting stack trace.
It appears that SearchPhaseController.sortDocs is the method responsible for merging results.
In the current version the sortDocs method uses Lucene's TopDocs.merge to do the sorting. The merge method uses a priority queue containing one entry per input stream, removing and re-adding an entry for every output record.
A quick look at the PriorityQueue implementation reveals an updateTop method; the method's JavaDocs claim it to be "at least twice as fast" as removing and re-adding.
Modifying the method to use updateTop instead of pop/add was a rather trivial exercise. The new version is not twice as fast as the old one; the speed gains were on the order of 20-80%, depending on the input provided.
The patch was accepted, so the next ElasticSearch release is likely to contain the improved version.

Wednesday, October 7, 2015

Merge-sorting multiple sorted inputs

Reading ElasticSearch documentation, I found that every ES index shard is a Lucene index instance; user queries are first executed on each individual shard, and then results are merged before sending to the requester.

Given s lists of n elements each, sorted in descending order, we want to produce a sorted list that will contain top n items from the input lists, combined. The trivial algorithm is:
  1. Take the first item from every list
  2. Pick the largest item from these. Remove it from its input list, and output it
  3. Repeat the above steps until we output n elements
Using a basic (linear) algorithm to find the largest element, repeated n times, we would end up with O(ns) computational complexity.

However, given that the list of items in point 2 is almost identical every time (only 1 item changes between iterations), we can achieve much better results using a heap-based priority queue. Updated algorithm:
  1. Take the first item from every list, build a priority queue
  2. Pick and remove the largest item from the queue. Output the item.
  3. Add next item from the same list to the queue.
  4. Repeat steps 2 and 3 until we output n elements
Removing and adding are both logarithmic in s, so we end up with O(n lg s) computational complexity.

Now, knowing how priority queues are implemented, we can save one operation by replacing the top element of the queue instead of removing and re-adding. This way we sift through the heap only once per output element.
Note: in rare cases this could result in doing more operations instead of fewer.
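The algorithm above, including the replace-top trick, can be sketched in Python using heapq.heapreplace (heapq is a min-heap, so keys are negated to output the largest element first):

```python
import heapq

def merge_descending(lists, n):
    """Merge s descending-sorted lists, returning the top n elements."""
    # One heap entry per non-empty input: (negated head value, list index, position)
    heap = [(-lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap and len(out) < n:
        neg, i, j = heap[0]
        out.append(-neg)
        j += 1
        if j < len(lists[i]):
            # Replace the top in place: a single sift instead of a pop plus a push
            heapq.heapreplace(heap, (-lists[i][j], i, j))
        else:
            heapq.heappop(heap)  # this input stream is exhausted
    return out

print(merge_descending([[9, 5, 1], [8, 7, 2], [6, 3]], 5))  # → [9, 8, 7, 6, 5]
```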

Sample implementation in C++

Monday, September 28, 2015

Migrating to current TRBC industry codes

As I mentioned in an earlier post, TRBC codes changed over time. This article lists the older codes and how to replace them with the current ones, known as TRBC 2012.

The earliest version of Thomson Reuters Business Classification was called Reuters Business Sector Scheme (RBSS). All RBSS codes were 5 digits long. In 2008 the classification was renamed to TRBC and codes were changed to their current form, where top-level codes have 2 digits, and each consecutive level adds 2 more digits.
In order to change a RBSS code to TRBC, apply these changes:
  • Add 0 after every digit, starting with the 3rd one
  • Remove all pairs of consecutive zeroes
Examples:
  • code 50000 becomes 50000000 after first step, then 50 after second
  • code 50131 becomes 50103010 after first step, no change in second
T-SQL code snippet to change the codes:
REPLACE(STUFF(STUFF(STUFF(RBSS_Code,6,0,'0'),5,0,'0'),4,0,'0'),'00','')
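The same transformation written out in Python, as a cross-check of the T-SQL above:

```python
def rbss_to_trbc(rbss_code):
    """Convert a 5-digit RBSS code to TRBC: insert '0' after each digit
    from the 3rd onward, then strip pairs of consecutive zeroes."""
    assert len(rbss_code) == 5
    expanded = rbss_code[:2] + "".join(d + "0" for d in rbss_code[2:])
    return expanded.replace("00", "")

print(rbss_to_trbc("50000"))  # → 50
print(rbss_to_trbc("50131"))  # → 50103010
```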


Also, some codes were changed during the migration to TRBC. After that there was another revision of TRBC codes in 2012, when a 5th level of classification was added and more codes were changed. Summary of all changes (after converting codes from RBSS to TRBC using the method outlined above):
  • RBSS codes starting with 9 (90000 / Governmental services, 90010 / Non-profit organizations, 90099 / Subject to further review) were removed from the classification
  • 50104010 / Renewable Energy Equipment & Services was changed to 50201010 / Renewable Energy Equipment & Services
  • 50104020 Renewable Fuels was changed to 50201020 / Renewable Fuels
  • 52201010 / Construction - Supplies and Fixtures changed to 53203020 / Construction Supplies & Fixtures
  • 52203050 / Commercial Services and Supplies was discontinued
  • 52401010 / Air Freight & Courier Services (4) changed to 52405010 / Air Freight & Courier Services
  • 52402010 / Airlines changed to 52406010 / Airlines
  • 52402020 / Airport Services changed to 52407010 / Airport Services
  • 52403010 / Marine Transportation changed to 52405020 / Marine Freight & Logistics
  • 52403020 / Marine Port Services changed to 52407020 / Marine Port Services
  • 52404010 / Rails & Roads - Passengers changed to 52406020 / Passenger Transportation, Ground & Sea
  • 52404020 / Rails & Roads - Freights changed to 52405030 / Ground Freight & Logistics
  • 52404030 / Highways & Railtracks changed to 52407030 / Highways & Rail Tracks
  • 53201010 / Homebuilding changed to 53203010 / Homebuilding
  • 53201020 / Consumer Electronics was discontinued
  • 53201030 / Appliances, Tools & Housewares changed to 53204030 / Appliances, Tools & Housewares
  • 53201040 / Home Furnishing changed to 53204040 / Home Furnishings
  • 53201050 / Leisure Products was split to 53205010 / Toys & Juvenile Products and 53205020 / Recreational Products
  • 53204020 / Consumer Electronics was discontinued
  • 53302090 / Media Diversified was discontinued
  • 53401010 / Retail - Department Stores changed to 53402010 / Department Stores
  • 53401020 / Retail - Discount Stores changed to 53402020 / Discount Stores
  • 53401030 / Retail - Catalog & Internet Order was discontinued
  • 53401040 / Retail - Apparel & Accessories changed to 53403040 / Apparel & Accessories Retailers
  • 53401050 / Retail - Computers & Electronics changed to 53403050 / Computer & Electronics Retailers
  • 53401060 / Retail - Specialty changed to 53403090 / Miscellaneous Specialty Retailers
  • 55101040 / Investment Services was split under 551020 / Investment Banking & Investment Services
  • 55102040 / Specialty Investment Services was discontinued
  • 55103010 / Diversified Financial Services was discontinued
  • 55201020 / Financial Services - Diversified (4) was discontinued
  • 55301060 / Insurance Brokers was discontinued
  • 55401010 / Real Estate Operations changed to 554020 / Real Estate Operations
  • 55401020 / REIT - Residential & Commercial was split under 554030 / Residential & Commercial REITs
  • 56201010 / Pharmaceuticals - Diversified was merged to 56201040 / Pharmaceuticals (4)
  • 56201020 / Biotechnology changed to 56202010 / Biotechnology & Medical Research (4)
  • 56201030 / Pharmaceuticals - Generic & Specialty was merged to 56201040 / Pharmaceuticals (4)
  • 57103010 / Computer Hardware changed to 57106010 / Computer Hardware
  • 57103020 / Office Equipment changed to 57105010 / Office Equipment (4)

Wednesday, September 9, 2015

Research: analyst data processing

Research documents are rated based on the StarMine rating of the document's author. If the author information is not processed correctly, the document will have an incorrect rating, or no rating at all.
Historically, the research description file (HDM) specified only an analyst identifier. The identifier had to be mapped to the analyst name by other means - like another file, or an appropriate service call.
HDM files can contain either contributor-provided identifiers or Thomson Reuters identifiers. If the provided identifier matches one in our database, the document is linked to an appropriate analyst, and all products are able to display analyst name and rating.
On the other hand, if the identifier does not match our DB, the analyst code is sent for manual review. If the author name is available in the research document, the analyst information is added to our database and the document is mapped to that analyst.
RIXML files contain both analyst ID and name. We figured that we can save some effort by actually using that information. Now if we find a new analyst ID with a name, we will just add the analyst to the database.
With this approach there was a risk of duplicating analyst information if an analyst has multiple identifiers. As a precaution we check whether we have another analyst by the same name, and if we do, we first verify that the new analyst really is a different person before creating a new entry.
Now, with the extra information we decided we could also check if the analyst name in RIXML matches our records for the analyst ID specified. If the name does not match, we send it to a manual review.
The idea was decent, but it lost a lot in implementation. We only store one name per analyst, and that name is stored in the Latin character set. The check helped catch some cases where users sent the same ID for different analysts, but it also created serious trouble for users who used the IDs correctly yet, for some reason, used names different from the one we had in our database.
Sample problematic cases:
  • Analyst name was entirely stored in "FamilyName" field
  • Analyst name contained diacritical characters that we don't store in the DB
  • Analyst name had multiple spellings, for example was spelled in English on English documents, and in Japanese on Japanese ones
These cases still go to manual review every time we get them.

In order to deal with these cases, and also cover the users who reuse the analyst IDs, we could identify the analysts using the entire set - ID, first name, last name. We're going to try that out when time permits.

Saturday, September 5, 2015

Thomson Reuters Business Classification

Thomson Reuters Business Classification (TRBC) is a hierarchical industry classification scheme. Since 2012 it has had five levels of hierarchy: economic sector, business sector, industry group, industry and activity. Activities were added in 2012; before then the classification had four levels.

Each sector has an assigned code; the code describes a place in the hierarchy. For example, code 50101010 (Coal) is an industry located under industry group 501010 (Coal), business sector 5010 (Energy - Fossil Fuels) and economic sector 50 (Energy). Each level adds 2 digits to the code, so all industries have 8-digit codes.
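Since each level simply appends two digits, the full ancestry of a code can be recovered by truncation; a quick Python illustration:

```python
def trbc_ancestry(code):
    """All ancestor codes of a TRBC code, from economic sector down,
    two digits per level."""
    return [code[:i] for i in range(2, len(code) + 1, 2)]

print(trbc_ancestry("50101010"))  # → ['50', '5010', '501010', '50101010']
```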

TRBC codes can be used to search for research documents related to a particular industry. A search for a code will return all documents covering that code, plus codes lower in the hierarchy. For example, searching for the Energy economic sector will return all documents about Energy, but also documents about the Coal industry, the Oil & Gas industry group, the Uranium business sector and others. On the other hand, a search for Uranium will not return documents that cover Energy in general.

TRBC evolved from Reuters Business Classification Scheme (RBSS). Many of the codes in current TRBC specification are the same as in RBSS. However, some codes were discontinued, and some industries were moved to a different place in the hierarchy, so their codes changed.

The decision to offer the most recent TRBC codes in our search engine was a rather straightforward one; the change was almost unnoticeable to our GUI users, and the users of our APIs were forced to make a one-time change, as the old codes stopped working. However, on the collection side we have to deal with all versions now. Many of our contributors are still sending us the discontinued codes - some because they were not informed of the change, others because they are not well equipped to make the change on their end, and others because our online documentation of the new codes is wrong in a few places.

As of now, we discard all outdated codes. However, we're losing valuable information this way, so we're considering mapping the old codes to their newer counterparts.

Tuesday, September 1, 2015

RIXML validation

Tools designed to process valid RIXML can misbehave when provided with an invalid file; they can either ignore the non-compliant parts, or refuse to process the document entirely. Therefore, to ensure correct processing, you should always use valid RIXML files.

RIXML.org provides a description of their format as XSD files; links can be found on the RIXML specification page (look for "RIXML schema"). There are quite a few tools that can be used to verify whether a document is correct according to an XSD.

I found XMLLint quite useful in troubleshooting RIXML problems; it is a command-line tool that can point you to problems encountered when validating the XML document. In order to use it you need to download all 3 schema files, and then run the following command:
$ xmllint my.rixml --schema RIXML-2_4.xsd --noout
my.rixml:47: element Abstract: Schemas validity error : Element '{http://www.rixml.org/2013/2/RIXML}Abstract': This element is not expected. Expected is ( {http://www.rixml.org/2013/2/RIXML}TitleFormatted ).
my.rixml fails to validate

Well, I'm surprised; according to the documentation, TitleFormatted is not a required element, but the XSD disagrees.
Anyway. The number 47 in the message is the line where the problem was found. After adding TitleFormatted in line 47, the tool produced only one line:
my.rixml validates

XMLLint is freely available for download; it works under Linux, Windows, and a number of other platforms.

Publishing research in Thomson Reuters, part 3: search engine optimization

Now that you know how to publish documents and how to entitle customers to view them, you probably want to know how to get people to read your publications. The users will probably have a long list of documents to choose from, and even getting on that list requires putting some effort into tagging your document properly.

Ideally, release date of your document should be very close to the time when you send it to Thomson Reuters. This has a few benefits:
  • Many users filter out documents older than a certain age; if you publish old documents, some audiences will be unreachable for you
  • Some users do not use search engine, but instead opt to receive an alert when a document matching their criteria arrives. The alerts are usually limited to documents released in the last 24 hours.
Next, you can tag up to 2 primary companies the report is about. Use these wisely - virtually all searches for research on a particular financial instrument start with the issuer company.
There is a level of indirection here: you can only specify a symbol denoting a financial instrument. The document will be tagged with the company that issued the instrument, assuming the symbol can be resolved. RICs, ISINs, CUSIPs and SEDOLs should usually work.
You can tag any number of non-primary symbols; these are less frequently used in searches.


When the users actually find your report, they will initially be presented with its headline, author and author's StarMine rating. These things will let them decide whether to read your report or not, so make sure they count.

Disclaimer: the above is not a complete list of fields that can be used to describe your document. Check your documentation to see what else you can tag.

Monday, August 31, 2015

Publishing research in Thomson Reuters, part 2: availability

Every time you publish a document, you need to decide who it is going to be available to. Listing all entitled users on every document does not scale well, so instead you specify an entitlement group.

All documents in the same entitlement group share some properties:
  • List of entitled readers
  • Embargo settings 
  • Price algorithm
The common setup involves having one subscription group, where the users you select can access documents free of charge, and one pay-per-view group, where entitled users need to purchase a document (in whole or in part) before viewing.

If you want to release your documents to different user groups at different points in time, you can set up an embargo. This way the group you publish to will receive the document immediately, and embargoed group will receive the document automatically after it reaches a certain age.

The price algorithm lets you decide how much to charge for documents; it is possible to tweak the price per page (PDF documents only) and the price per document. The price can be constant, or it can be calculated from factors like document age or page count. Note: some distribution channels require predefined price algorithms, so using custom prices will limit your ability to reach certain audiences.

In HDM you specify the document group using the following syntax:
Distribution.GroupID[0] = 1
In this example 0 is the sequential number of user group, 1 is the group ID.

In RIXML 2.4 you would use the following:

<Research>
  <Product>
    <Context>
      <ProductDetails>
        <EntitlementGroup>
          <Entitlement>
            <AudienceTypeEntitlement audienceType="PublisherDefined" entitlementContext="ThomsonReuters">1</AudienceTypeEntitlement>
          </Entitlement>
        </EntitlementGroup>
      </ProductDetails>
    </Context>
  </Product>
</Research>
In this example 1 is the group ID. Multiple entitlement groups can be specified by adding multiple AudienceTypeEntitlement tags.

Sunday, August 30, 2015

Publishing research in Thomson Reuters

So, you decided to publish your financial research documents on Thomson Reuters systems. One of the first choices you have to make is - how are you going to publish your documents?

There are a number of options here; each has its advantages and disadvantages.
  • Publish document + description file
This option works best if you already manage your documents using a computer system. Once you upload a RIXML or HDM description file to Thomson Reuters servers, the document becomes available almost immediately.
  • Publish document + description using web tool (TRRC)
If you don't store the document description anywhere, you can create one manually in the web tool. This option is fairly time-consuming, but gives you full control over the information you submit, and shields you from potential problems with the description file.
  • Publish document only
This option is the easiest to use. Thomson Reuters will extract the document description from your document. The document will not be immediately available, and you have no control over the document description. Also, this option is subject to different terms and conditions.
  • Publish using third party tools
Some companies offer to publish your documents on your behalf. The quality of their submissions varies on a case-by-case basis. Your mileage may vary.