Monday, September 28, 2015

Migrating to current TRBC industry codes

As I mentioned in an earlier post, TRBC codes changed over time. This article lists the older codes and how to replace them with the current ones, known as TRBC 2012.

The earliest version of Thomson Reuters Business Classification was called Reuters Business Sector Scheme (RBSS). All RBSS codes were 5 digits long. In 2008 the classification was renamed TRBC and the codes were changed to their current form, where top-level codes have 2 digits and each subsequent level adds 2 more digits.
To change an RBSS code to TRBC, apply these changes:
  • Add 0 after every digit, starting with the 3rd one
  • Remove all pairs of consecutive zeroes
Examples:
  • code 50000 becomes 50000000 after first step, then 50 after second
  • code 50131 becomes 50103010 after first step, no change in second
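T-SQL code snippet to change the codes (RBSS_Code is assumed to be a 5-character string; the last 0 is concatenated because STUFF returns NULL when the start position is past the end of the string):
REPLACE(STUFF(STUFF(RBSS_Code,5,0,'0'),4,0,'0')+'0','00','')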
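
If you process codes outside the database, the same two steps can be sketched in Python. This is only an illustration: the function name rbss_to_trbc is made up, and the input is assumed to be a 5-character string of digits.

```python
def rbss_to_trbc(rbss_code: str) -> str:
    """Convert a 5-digit RBSS code to the TRBC form (hypothetical helper)."""
    if len(rbss_code) != 5 or not rbss_code.isdigit():
        raise ValueError("expected a 5-digit RBSS code, got %r" % rbss_code)
    # Step 1: add 0 after every digit, starting with the 3rd one.
    padded = rbss_code[:2] + "".join(digit + "0" for digit in rbss_code[2:])
    # Step 2: remove all pairs of consecutive zeroes.
    return padded.replace("00", "")


print(rbss_to_trbc("50000"))  # 50
print(rbss_to_trbc("50131"))  # 50103010
```

Like the T-SQL version, this removes every pair of zeroes it finds, so it assumes that zero digits appear only as trailing fill in the RBSS code.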


Also, some codes were changed during the migration to TRBC. After that there was another revision of TRBC codes in 2012, when a 5th level of classification was added and more codes were changed. Summary of all changes (after changing codes from RBSS to TRBC using the method outlined above):
  • RBSS codes starting with 9 (90000 / Governmental services, 90010 / Non-profit organizations, 90099 / Subject to further review) were removed from the classification
  • 50104010 / Renewable Energy Equipment & Services changed to 50201010 / Renewable Energy Equipment & Services
  • 50104020 / Renewable Fuels changed to 50201020 / Renewable Fuels
  • 52201010 / Construction - Supplies and Fixtures changed to 53203020 / Construction Supplies & Fixtures
  • 52203050 / Commercial Services and Supplies was discontinued
  • 52401010 / Air Freight & Courier Services (4) changed to 52405010 / Air Freight & Courier Services
  • 52402010 / Airlines changed to 52406010 / Airlines
  • 52402020 / Airport Services changed to 52407010 / Airport Services
  • 52403010 / Marine Transportation changed to 52405020 / Marine Freight & Logistics
  • 52403020 / Marine Port Services changed to 52407020 / Marine Port Services
  • 52404010 / Rails & Roads - Passengers changed to 52406020 / Passenger Transportation, Ground & Sea
  • 52404020 / Rails & Roads - Freights changed to 52405030 / Ground Freight & Logistics
  • 52404030 / Highways & Railtracks changed to 52407030 / Highways & Rail Tracks
  • 53201010 / Homebuilding changed to 53203010 / Homebuilding
  • 53201020 / Consumer Electronics was discontinued
  • 53201030 / Appliances, Tools & Housewares changed to 53204030 / Appliances, Tools & Housewares
  • 53201040 / Home Furnishing changed to 53204040 / Home Furnishings
  • 53201050 / Leisure Products was split into 53205010 / Toys & Juvenile Products and 53205020 / Recreational Products
  • 53204020 / Consumer Electronics was discontinued
  • 53302090 / Media Diversified was discontinued
  • 53401010 / Retail - Department Stores changed to 53402010 / Department Stores
  • 53401020 / Retail - Discount Stores changed to 53402020 / Discount Stores
  • 53401030 / Retail - Catalog & Internet Order was discontinued
  • 53401040 / Retail - Apparel & Accessories changed to 53403040 / Apparel & Accessories Retailers
  • 53401050 / Retail - Computers & Electronics changed to 53403050 / Computer & Electronics Retailers
  • 53401060 / Retail - Specialty changed to 53403090 / Miscellaneous Specialty Retailers
  • 55101040 / Investment Services was split under 551020 / Investment Banking & Investment Services
  • 55102040 / Specialty Investment Services was discontinued
  • 55103010 / Diversified Financial Services was discontinued
  • 55201020 / Financial Services - Diversified (4) was discontinued
  • 55301060 / Insurance Brokers was discontinued
  • 55401010 / Real Estate Operations changed to 554020 / Real Estate Operations
  • 55401020 / REIT - Residential & Commercial was split under 554030 / Residential & Commercial REITs
  • 56201010 / Pharmaceuticals - Diversified was merged into 56201040 / Pharmaceuticals (4)
  • 56201020 / Biotechnology changed to 56202010 / Biotechnology & Medical Research (4)
  • 56201030 / Pharmaceuticals - Generic & Specialty was merged into 56201040 / Pharmaceuticals (4)
  • 57103010 / Computer Hardware changed to 57106010 / Computer Hardware
  • 57103020 / Office Equipment changed to 57105010 / Office Equipment (4)
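
A list like the one above lends itself to a lookup table. The sketch below is an illustration only: it copies a handful of rows from the list, and the names CODE_CHANGES and migrate_code are made up. A discontinued code maps to an empty list, a split maps to several codes, and anything not in the table is assumed to be unchanged.

```python
from typing import Dict, List

# A few rows copied from the list above; a real table would hold all of them.
CODE_CHANGES: Dict[str, List[str]] = {
    "50104010": ["50201010"],              # Renewable Energy Equipment & Services
    "50104020": ["50201020"],              # Renewable Fuels
    "52203050": [],                        # Commercial Services and Supplies (discontinued)
    "52402010": ["52406010"],              # Airlines
    "53201050": ["53205010", "53205020"],  # Leisure Products (split)
}


def migrate_code(code: str) -> List[str]:
    """Return the current code(s) for an older TRBC code (hypothetical helper)."""
    if code.startswith("9"):
        # Codes starting with 9 were removed from the classification.
        return []
    # Codes missing from the table are assumed to be unchanged.
    return CODE_CHANGES.get(code, [code])


print(migrate_code("52402010"))  # ['52406010']
print(migrate_code("53201050"))  # ['53205010', '53205020']
print(migrate_code("52203050"))  # []
```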

Wednesday, September 9, 2015

Research: analyst data processing

Research documents are rated based on the StarMine rating of the document's author. If the author information is not processed correctly, the document will have an incorrect rating, or no rating at all.
Historically, the research description file (HDM) specified only the analyst identifier. The identifier had to be mapped to the analyst name by other means, such as another file or an appropriate service call.
HDM files can contain either contributor-provided identifiers or Thomson Reuters identifiers. If the provided identifier matches one in our database, the document is linked to an appropriate analyst, and all products are able to display analyst name and rating.
On the other hand, if the identifier does not match our DB, the analyst code is sent for manual review. If the author name is available in the research document, the analyst information is added to our database and the document is mapped to that analyst.
RIXML files contain both analyst ID and name. We figured that we can save some effort by actually using that information. Now if we find a new analyst ID with a name, we will just add the analyst to the database.
With this approach there was a risk of duplicating analyst information if an analyst has multiple identifiers. As a precaution we check if we have another analyst by the same name, and if we do, we first check if the new analyst is the same as the old one before creating a new entry.
Now, with the extra information we decided we could also check if the analyst name in RIXML matches our records for the analyst ID specified. If the name does not match, we send it to a manual review.
The idea was decent, but it lost a lot in implementation. We only store one name per analyst, and that name is stored in the Latin character set. This check helped catch some cases where users sent the same ID for different analysts. But it also created some serious trouble for users who used the IDs correctly, but for some reason used different names than the one we had in our database.
Sample problematic cases:
  • Analyst name was entirely stored in "FamilyName" field
  • Analyst name contained diacritical characters that we don't store in the DB
  • Analyst name had multiple spellings, for example was spelled in English on English documents, and in Japanese on Japanese ones
These cases still go to manual review every time we get them.
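
The rules described above can be summarized as a small decision function. This is a simplified sketch, not our actual implementation: the names Analyst and process_author are made up, names are compared as plain strings, and the "is this the same analyst" check for duplicate names is modelled as a manual review.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Analyst:
    analyst_id: str
    name: str


def process_author(known: Dict[str, Analyst], analyst_id: str,
                   name: Optional[str]) -> str:
    """Return 'link', 'create' or 'manual_review' (hypothetical outcomes)."""
    existing = known.get(analyst_id)
    if existing is not None:
        # Known ID: link unless the supplied name disagrees with our record.
        if name is None or name == existing.name:
            return "link"
        return "manual_review"
    if name is None:
        # Unknown ID without a name (the HDM case).
        return "manual_review"
    if any(analyst.name == name for analyst in known.values()):
        # Same name under another ID: possible duplicate, assumed to be checked by hand.
        return "manual_review"
    known[analyst_id] = Analyst(analyst_id, name)
    return "create"


known = {"A1": Analyst("A1", "Jose Garcia")}
print(process_author(known, "A1", "Jose Garcia"))   # link
print(process_author(known, "A1", "José García"))   # manual_review
print(process_author(known, "B2", "Jane Smith"))    # create
```

The second call shows the diacritics problem: the ID is used correctly, but the exact string comparison still sends the document to manual review.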

In order to deal with these cases, and also cover the users who reuse the analyst IDs, we could identify the analysts using the entire set - ID, first name, last name. We're going to try that out when time permits.

Saturday, September 5, 2015

Thomson Reuters Business Classification

Thomson Reuters Business Classification (TRBC) is a hierarchical industry classification scheme. Since 2012 it has had five levels of hierarchy: economic sector, business sector, industry group, industry and activity. Activities were added in 2012; before then the classification had four levels.

Each item in the classification has an assigned code that describes its place in the hierarchy. For example, code 50101010 (Coal) is an industry located under industry group 501010 (Coal), business sector 5010 (Energy-Fossil Fuels) and economic sector 50 (Energy). Each level adds 2 digits to the code, so all industries have 8-digit codes.
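
Because every level adds 2 digits, the ancestors of a code are simply its prefixes. A minimal sketch, assuming codes are strings of digits; the function name trbc_levels is made up, and the 10-digit length for activities is inferred from the 2-digits-per-level rule rather than taken from the specification.

```python
from typing import List, Tuple

LEVEL_NAMES = ["economic sector", "business sector", "industry group",
               "industry", "activity"]


def trbc_levels(code: str) -> List[Tuple[str, str]]:
    """Split a TRBC code into (level name, code) pairs, top level first."""
    if not code.isdigit() or len(code) % 2 != 0 or not 2 <= len(code) <= 10:
        raise ValueError("not a TRBC-style code: %r" % code)
    depth = len(code) // 2
    return [(LEVEL_NAMES[i], code[:2 * (i + 1)]) for i in range(depth)]


for level, prefix in trbc_levels("50101010"):
    print(level, prefix)
# economic sector 50
# business sector 5010
# industry group 501010
# industry 50101010
```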

TRBC codes can be used to search for research documents related to a particular industry. A search for a code returns all documents tagged with that code, plus documents tagged with codes lower in the hierarchy. For example, searching for the Energy economic sector will return all documents about Energy, but also documents about the Coal industry, the Oil & Gas industry group, the Uranium business sector and others. On the other hand, a search for Uranium will not return documents that cover Energy in general.
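
In terms of codes, this search behaviour amounts to a prefix match. The sketch below only illustrates the rule and is not how the search engine is implemented; the function name search_matches and the sample documents are made up. The codes are 50 (Energy), 50101010 (Coal) and 5030 (Uranium).

```python
def search_matches(search_code: str, document_code: str) -> bool:
    """True if a document tagged with document_code is found by search_code."""
    # A document matches when its code equals the searched code
    # or lies below it in the hierarchy, i.e. starts with it.
    return document_code.startswith(search_code)


# Hypothetical documents, each tagged with one TRBC code.
documents = {
    "Energy outlook": "50",
    "Coal producers review": "50101010",
    "Uranium market update": "5030",
}

print([title for title, code in documents.items() if search_matches("50", code)])
# ['Energy outlook', 'Coal producers review', 'Uranium market update']
print([title for title, code in documents.items() if search_matches("5030", code)])
# ['Uranium market update']
```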

TRBC evolved from Reuters Business Sector Scheme (RBSS). Many of the codes in the current TRBC specification are the same as in RBSS. However, some codes were discontinued, and some industries were moved to a different place in the hierarchy, so their codes changed.

The decision to offer the most recent TRBC codes in our search engine was a rather straightforward one; the change was almost unnoticeable to our GUI users, and the users of our APIs were forced to make a one-time change, as the old codes stopped working. However, on the collection side we have to deal with all versions now. Many of our contributors are still sending us the discontinued codes: some because they were not informed of the change, others because they are not well equipped to make the change on their end, and others because our online documentation of the new codes is wrong in a few places.

As of now, we discard all outdated codes. However, we're losing valuable information this way, so we're considering mapping the old codes to their newer counterparts.

Tuesday, September 1, 2015

RIXML validation

Tools designed to process valid RIXML files can misbehave when given an invalid file: they can either ignore the non-compliant parts or refuse to process the document entirely. Therefore, to ensure correct processing, you should always use valid RIXML files.

RIXML.org describes the format in XSD files; links can be found on the RIXML specification page (look for RIXML schema). Quite a few tools can verify whether a document is valid according to an XSD.

I found XMLLint quite useful in troubleshooting RIXML problems; it is a command-line tool that can point you to problems encountered when validating the XML document. In order to use it you need to download all 3 schema files, and then run the following command:
$ xmllint my.rixml --schema RIXML-2_4.xsd --noout
my.rixml:47: element Abstract: Schemas validity error : Element '{http://www.rixml.org/2013/2/RIXML}Abstract': This element is not expected. Expected is ( {http://www.rixml.org/2013/2/RIXML}TitleFormatted ).
my.rixml fails to validate

Well, I'm surprised; according to documentation, TitleFormatted is not a required element, but XSD disagrees.
Anyway. Number 47 in the message is the number of the line where the problem was found. After adding TitleFormatted in line 47, the tool produced only one line:
my.rixml validates

XMLLint is freely available for download; it works under Linux, Windows, and a number of other platforms.
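
If you validate many files, the check can be scripted. The following Python sketch runs the same xmllint command and pulls the file name, line number and message out of error lines shaped like the one shown above. The function names are made up, xmllint is assumed to be on the PATH, and the parsing is based only on the sample message format, so treat it as a starting point.

```python
import re
import subprocess
from typing import List, Tuple

# Matches lines such as "my.rixml:47: element Abstract: Schemas validity error : ..."
_ERROR_LINE = re.compile(r"^(?P<file>.+?):(?P<line>\d+): (?P<message>.+)$")


def parse_xmllint_errors(output: str) -> List[Tuple[str, int, str]]:
    """Extract (file, line number, message) from xmllint output."""
    errors = []
    for raw in output.splitlines():
        match = _ERROR_LINE.match(raw)
        if match:
            errors.append((match["file"], int(match["line"]), match["message"]))
    return errors


def validate_rixml(document: str, schema: str) -> Tuple[bool, List[Tuple[str, int, str]]]:
    """Run xmllint on one file; returns (is_valid, errors)."""
    result = subprocess.run(
        ["xmllint", "--noout", "--schema", schema, document],
        capture_output=True, text=True,
    )
    # Both streams are scanned so the sketch does not depend on
    # which one carries the messages.
    return result.returncode == 0, parse_xmllint_errors(result.stderr + result.stdout)
```

For example, validate_rixml("my.rixml", "RIXML-2_4.xsd") would report line 47 for the file used above.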

Publishing research in Thomson Reuters, part 3: search engine optimization

Now that you know how to publish documents and how to entitle customers to view them, you probably want to know how to get people to read your publications. The users will probably have a long list of documents to choose from, and even getting on that list requires putting some effort into tagging your document properly.

Ideally, the release date of your document should be very close to the time when you send it to Thomson Reuters. This has a few benefits:
  • Many users filter out documents older than a certain age; if you publish old documents, you will not reach some audiences
  • Some users do not use the search engine, but instead opt to receive an alert when a document matching their criteria arrives. The alerts are usually limited to documents released in the last 24 hours.
Next, you can tag up to 2 primary companies the report is about. Use these wisely - virtually all searches for research on a particular financial instrument start with the issuer company.
There is a level of indirection here. You can only specify a symbol denoting a financial instrument. The document will be tagged with the company that issued the instrument, assuming that the symbol can be resolved. RICs, ISINs, CUSIPs and SEDOLs should usually work.
You can tag any number of non-primary symbols; these are less frequently used in searches.


When the users actually find your report, they will initially be presented with its headline, author and author's StarMine rating. These things will let them decide whether to read your report or not, so make sure they count.

Disclaimer: the above is not a complete list of fields that can be used to describe your document. Check your documentation to see what else you can tag.