Wednesday, September 9, 2015

Research: analyst data processing

Research documents are rated based on the StarMine rating of the document's author. If the author information is not processed correctly, the document will have an incorrect rating or no rating at all.
Historically, the research description file (HDM) specified only the analyst identifier. The identifier had to be mapped to the analyst's name by other means, such as another file or an appropriate service call.
HDM files can contain either contributor-provided identifiers or Thomson Reuters identifiers. If the provided identifier matches one in our database, the document is linked to the appropriate analyst, and all products can display the analyst's name and rating.
On the other hand, if the identifier does not match our DB, the analyst code is sent to manual review. If the author name is available in the research document, the analyst information is added to our database and the document is mapped to that analyst.
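The HDM flow described above can be sketched roughly as follows. This is only an illustration: the `Analyst` record, the `known_analysts` dictionary and the `review_queue` list are hypothetical stand-ins for the real database and review tooling, which this post does not describe.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Analyst:
    analyst_id: str  # contributor-provided or Thomson Reuters identifier
    name: str


def resolve_hdm_author(
    analyst_id: str,
    known_analysts: Dict[str, Analyst],
    review_queue: List[str],
) -> Optional[Analyst]:
    """Link a document to an analyst by ID, or queue the ID for manual review."""
    analyst = known_analysts.get(analyst_id)
    if analyst is None:
        # HDM carries no name, so an unknown ID cannot be resolved automatically.
        review_queue.append(analyst_id)
    return analyst


# Hypothetical sample data for illustration only.
known = {"A1": Analyst("A1", "Jane Doe")}
queue: List[str] = []
linked = resolve_hdm_author("A1", known, queue)
unlinked = resolve_hdm_author("ZZ9", known, queue)
```

Here the first document is linked to an analyst, while the second stays unrated until somebody reviews the unknown code.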
RIXML files contain both the analyst ID and name. We figured that we could save some effort by actually using that information. Now, if we find a new analyst ID with a name, we just add the analyst to the database.
With this approach there was a risk of duplicating analyst information if an analyst had multiple identifiers. As a precaution, we check whether we already have an analyst with the same name, and if we do, we check whether the new analyst is the same as the existing one before creating a new entry.
With the extra information, we decided we could also check whether the analyst name in RIXML matches our records for the specified analyst ID. If the name does not match, we send it to manual review.
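Put together, the RIXML handling looks roughly like the sketch below. The class and method names are hypothetical, and two details are assumptions: the name comparison is modelled as exact string equality, and the "is it the same analyst" check for a duplicate name is modelled as a trip to the review queue, since the post does not say how that check is actually performed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AnalystDb:
    by_id: Dict[str, str] = field(default_factory=dict)  # analyst ID -> stored name
    review_queue: List[Tuple[str, str, str]] = field(default_factory=list)

    def process_rixml_author(self, analyst_id: str, name: str) -> str:
        stored_name = self.by_id.get(analyst_id)
        if stored_name is None:
            # New ID: make sure we are not duplicating an existing analyst.
            if name in self.by_id.values():
                # Assumption: a possible duplicate is confirmed by a person.
                self.review_queue.append((analyst_id, name, "possible duplicate"))
                return "review"
            self.by_id[analyst_id] = name
            return "added"
        if stored_name != name:
            # Known ID, different name: possibly a reused ID.
            self.review_queue.append((analyst_id, name, "name mismatch"))
            return "review"
        return "matched"


# Hypothetical sample data for illustration only.
db = AnalystDb()
first = db.process_rixml_author("A1", "Jane Doe")   # new ID, new name
second = db.process_rixml_author("A1", "Jane Doe")  # known ID, same name
third = db.process_rixml_author("A1", "John Roe")   # known ID, other name
fourth = db.process_rixml_author("A2", "Jane Doe")  # new ID, known name
```
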
The idea was decent, but it lost a lot in implementation. We store only one name per analyst, and that name is stored in the Latin character set. This check helped catch some cases where users sent the same ID for different analysts. But it also created serious trouble for users who used the IDs correctly but, for some reason, used a different name than the one we had in our database.
Sample problematic cases:
  • The entire analyst name was stored in the "FamilyName" field
  • Analyst name contained diacritical characters that we don't store in the DB
  • Analyst name had multiple spellings, for example spelled in English on English documents and in Japanese on Japanese ones
These cases still go to manual review every time we get them.
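To make the three cases concrete, here is a small sketch of why a strict comparison flags all of them. Everything in it is hypothetical: the given/family split, the sample names, and the `fold` helper, which is one possible normalization and not something our system does. Folding would reconcile the first two cases, but not the third.

```python
import unicodedata


def full_name(given: str, family: str) -> str:
    """Join the name parts, skipping any that are empty."""
    return " ".join(part for part in (given, family) if part)


def fold(name: str) -> str:
    """Strip diacritics, collapse whitespace and ignore case."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.split()).casefold()


# (stored given, stored family) vs. (incoming given, incoming family)
cases = {
    "family_name_only": (("Jane", "Doe"), ("", "Jane Doe")),
    "diacritics": (("Jose", "Munoz"), ("José", "Muñoz")),
    "japanese_spelling": (("Taro", "Yamada"), ("太郎", "山田")),
}

# Strict check: the name fields must be identical.
strict = {label: stored == incoming for label, (stored, incoming) in cases.items()}

# Folded check: compare normalized full names instead.
folded = {
    label: fold(full_name(*stored)) == fold(full_name(*incoming))
    for label, (stored, incoming) in cases.items()
}
```

The Japanese spelling still fails after folding, because no character-level normalization turns it into the Latin one; that case needs more than one stored name per analyst.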

In order to deal with these cases, and also cover the users who reuse analyst IDs, we could identify analysts using the entire set: ID, first name, last name. We're going to try that out when time permits.
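A minimal sketch of what identifying analysts by the whole set could look like is shown below. We have not built this; `AnalystKey`, `AnalystRegistry` and the integer internal IDs are all hypothetical names chosen for the example.

```python
from typing import Dict, NamedTuple, Optional


class AnalystKey(NamedTuple):
    analyst_id: str  # identifier as sent by the contributor
    first_name: str
    last_name: str


class AnalystRegistry:
    """Maps each (ID, first name, last name) combination to an internal analyst."""

    def __init__(self) -> None:
        self._by_key: Dict[AnalystKey, int] = {}
        self._next_internal_id = 1

    def lookup(self, key: AnalystKey) -> Optional[int]:
        return self._by_key.get(key)

    def register(self, key: AnalystKey, same_as: Optional[int] = None) -> int:
        """Register a key as a new analyst, or as another spelling of an existing one."""
        if same_as is None:
            internal_id = self._next_internal_id
            self._next_internal_id += 1
        else:
            internal_id = same_as
        self._by_key[key] = internal_id
        return internal_id


# Hypothetical sample data for illustration only.
registry = AnalystRegistry()
taro = registry.register(AnalystKey("A1", "Taro", "Yamada"))
registry.register(AnalystKey("A1", "太郎", "山田"), same_as=taro)  # second spelling
jane = registry.register(AnalystKey("A1", "Jane", "Doe"))          # reused ID
```

With a key like this, a combination only needs manual review the first time it is seen. A second spelling resolves to the same analyst afterwards, and a reused ID resolves to a different one.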
