Way back in Legato Developers Corner #6, we used the WordParse function to pull apart simple strings, and parse through some simple HTML data. While WordParse is a great tool for doing simple things like that, what happens if you want to parse through an entire file, and extract information from it? To do that, you’ll need to use a more powerful SGML parser. This blog post is intended to be an introduction to our SGML parsing and support object using a real-world example.
Friday, February 24, 2017
LDC #23: Using Advanced SGML Parsing On the 13H Broker-Dealer List
For this example script, I’ve picked a script written for a client. The client wanted to pull a broker-dealer list out of a 13H XML file. Well, that’s not a function native to GoFiler, but Legato’s SGML parser is more than capable of picking the file apart and building a simple data file from extracted information. Our example script is below, followed by a walkthrough of what each piece does:
    // Form13HBrokerDealerExtractor.ms
    //
    // extracts information from broker dealer list.
    //
    // 2-24-2017 Author: Steven Horowitz
    //

    int run(int f_id, string mode);

    // Called by GoFiler on startup: adds the menu item and hooks run() to it.
    int setup() {
        string fnScript;
        string menu[10];
        int rc;

        menu["Code"] = "EXTENSION_EXTRACT_BROKERDEALER";
        menu["MenuText"] = "&Extract Broker Dealer List";
        menu["Description"] = "<B>13-H Tools</B>\r\rWrite all broker dealers to a .CSV file";

        // If the function already exists, there is nothing to add.
        rc = MenuFindFunctionID(menu["Code"]);
        if (IsNotError(rc)) {
            return ERROR_NONE;
        }

        rc = MenuAddFunction(menu);
        if (IsError(rc)) {
            return ERROR_NONE;
        }

        fnScript = GetScriptFilename();
        MenuSetHook(menu["Code"], fnScript, "run");
        return ERROR_NONE;
    }

    int run(int f_id, string mode) {
        handle log;
        handle sgml;
        string element;
        string name;
        string table[][];
        int ix, ox;
        string output;
        string input;
        boolean item6;

        // The hook fires twice; only act on the preprocess call.
        if (mode != "preprocess") {
            return ERROR_NONE;
        }

        log = LogCreate("Form 13H Broker Dealer Extractor");
        input = BrowseOpenFile("Select Input:", "13H XML Files (.xml)|*.xml");
        if (GetLastError() == ERROR_CANCEL) {
            return ERROR_CANCEL;
        }
        if (GetExtension(input) != ".xml") {
            MessageBox('x', "Invalid file type, select a 13H XML file");
            return ERROR_EXIT;
        }

        sgml = SGMLCreate(input);
        if (IsError(GetLastError())) {
            MessageBox('x', "Cannot open file. Error: 0x%08X", GetLastError());
            return ERROR_EXIT;
        }

        output = BrowseSaveFile("Save File:", "CSV File(.csv)|*.csv");
        if (GetLastError() == ERROR_CANCEL) {
            return ERROR_CANCEL;
        }

        // Row 0 of the table holds the column headings.
        ix = 0;
        table[ix][0] = "Name";
        table[ix][1] = "Prime Broker";
        table[ix][2] = "Executing Broker";
        table[ix][3] = "Clearing Broker";
        AddMessage(log, "Input File: %s", input);
        AddMessage(log, "Output File: %s", output);

        element = SGMLNextElement(sgml);
        while (element != "") {
            // Track whether we are inside Item 6; stop at its closing tag.
            if (element == "<itemSix>") {
                item6 = true;
            }
            if (element == "</itemSix>") {
                break;
            }
            if (item6 != true) {
                element = SGMLNextElement(sgml);
                continue;
            }
            // Skip closing tags.
            if (IsInString(element, "/")) {
                element = SGMLNextElement(sgml);
                continue;
            }

            // Each <name> starts a new broker-dealer row.
            if (element == "<name>") {
                ix++;
                name = SGMLNextItem(sgml);
                ox = 0;
                while (SGMLGetItemType(sgml) != SPI_TYPE_TAG && ox < 100) {
                    name += SGMLNextItem(sgml);
                    ox++;
                }
                name = ReplaceInString(name, "</name>", "");
                AddMessage(log, name);
                table[ix][0] = name;
            }

            // The remaining tags flag columns on the current row.
            if (IsInString(element, "prime")) {
                AddMessage(log, " prime");
                table[ix][1] = "Yes";
            }
            if (IsInString(element, "executing")) {
                AddMessage(log, " executing");
                table[ix][2] = "Yes";
            }
            if (IsInString(element, "clearing")) {
                AddMessage(log, " clearing");
                table[ix][3] = "Yes";
            }
            element = SGMLNextElement(sgml);
        }

        CSVWriteTable(table, output);
        LogDisplay(log);
        CloseHandle(log);
        CloseHandle(sgml);
        return ERROR_NONE;
    }

    int main() {
        setup();
        return ERROR_NONE;
    }
We start out by prototyping the run function, and adding our setup function. Recall from previous blog posts that the setup function is called by GoFiler on application startup and allows us to hook our function into other parts of the application. In this case, it’s adding a menu item called “Extract Broker Dealer List” to the File toolbar, and then hooking the run function of our script to it.
    int run(int f_id, string mode);

    int setup() {
        string fnScript;
        string menu[10];
        int rc;

        menu["Code"] = "EXTENSION_EXTRACT_BROKERDEALER";
        menu["MenuText"] = "&Extract Broker Dealer List";
        menu["Description"] = "<B>13-H Tools</B>\r\rWrite all broker dealers to a .CSV file";

        rc = MenuFindFunctionID(menu["Code"]);
        if (IsNotError(rc)) {
            return ERROR_NONE;
        }

        rc = MenuAddFunction(menu);
        if (IsError(rc)) {
            return ERROR_NONE;
        }

        fnScript = GetScriptFilename();
        MenuSetHook(menu["Code"], fnScript, "run");
        return ERROR_NONE;
    }
The next main part of the script is the run function, which is where all the fun happens. After declaring all our variables, we make sure the script only does its work once. Each script hook is called for both a pre-process and a post-process action, so the function returns immediately for any mode other than “preprocess” instead of letting the extraction run twice. Next, we create a log using the LogCreate script function. This lets us dump information about what happened during the script’s run to the Information View window at the bottom of GoFiler. Then, we ask the user which file to open using the BrowseOpenFile script function. This pops up an open file selection window, so the user can pick the 13H file to process. Following that line are two if statements. The first tests whether the user cancelled the operation, in which case we return the cancel error code and are done. The second tests whether the selected file actually has the .xml extension.
    log = LogCreate("Form 13H Broker Dealer Extractor");
    input = BrowseOpenFile("Select Input:", "13H XML Files (.xml)|*.xml");
    if (GetLastError() == ERROR_CANCEL) {
        return ERROR_CANCEL;
    }
    if (GetExtension(input) != ".xml") {
        MessageBox('x', "Invalid file type, select a 13H XML file");
        return ERROR_EXIT;
    }
Now that we know what file we’re going to be processing, we need to actually load it into our SGML parser object. This is done using the SGMLCreate SDK function, which can take the file path returned by the BrowseOpenFile SDK function and return a handle to a parser object. After we use this function, we need to make sure it actually worked. The next line tests whether the last function call set an error, and if so it shows the error code to the user in a message box and quits. In my experience, the most common cause is that the user has the file open in GoFiler or some other program. The XML file cannot be open anywhere else, or this function will not work.
    sgml = SGMLCreate(input);
    if (IsError(GetLastError())) {
        MessageBox('x', "Cannot open file. Error: 0x%08X", GetLastError());
        return ERROR_EXIT;
    }
Once we’re sure the user picked a good file, and the SGML object has been created, we can use the BrowseSaveFile script function to pick a location to which our exported data will be saved. We need to test if the user cancelled the operation afterwards, so we know to exit the script with the appropriate error code. If we don’t do this, the rest of the script will still run, but it won’t have a defined output file, and the behavior will either be an error or it will just do nothing.
    output = BrowseSaveFile("Save File:", "CSV File(.csv)|*.csv");
    if (GetLastError() == ERROR_CANCEL) {
        return ERROR_CANCEL;
    }
So now we have our input file loaded into our SGML parser, and we have a known output file. The next thing we need to do is set up our output by creating column headings in our output table. I used a two-dimensional array variable named “table” to represent our output CSV table, and gave it some simple column headings in the first row. The script also shows its progress in the Information View, so we add the input and output file names to the log. The ix counter is the index of the current row in our table. Row 0 holds the headings, so ix also equals the number of broker-dealers found so far.
    ix = 0;
    table[ix][0] = "Name";
    table[ix][1] = "Prime Broker";
    table[ix][2] = "Executing Broker";
    table[ix][3] = "Clearing Broker";
    AddMessage(log, "Input File: %s", input);
    AddMessage(log, "Output File: %s", output);
Then we are ready to get the first SGML element. The SGML parser object can get the next (or previous) item or element. An element is an SGML element, like an XML tag. An item is any single piece of the file as the parser reads it: a tag, a piece of text, or a space. We need to go through our file looking for specific SGML elements, so we want to get the next element. Once the parser is out of elements, it returns an empty string, so we can process the entire file using a while loop.
    element = SGMLNextElement(sgml);
    while (element != "") {
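To make the item versus element distinction concrete outside of GoFiler, here is a small Python sketch. It is an illustration only, not how the Legato parser is implemented: the tokenizer, the function names all_items and only_elements, and the sample broker name are all made up for this example.

```python
import re

# Hypothetical tokenizer for illustration only; not Legato's implementation.
# A tag is "<...>", a space item is a run of whitespace, and a text item is
# any other run of characters.
TOKEN = re.compile(r"<[^>]+>|\s+|[^<\s]+")


def all_items(source):
    """Return every item in order: tags, text runs, and space runs."""
    return TOKEN.findall(source)


def only_elements(source):
    """Return just the tags, skipping the text and spaces between them."""
    return [item for item in all_items(source) if item.startswith("<")]


sample = "<name>Acme Securities</name>"
print(all_items(sample))      # ['<name>', 'Acme', ' ', 'Securities', '</name>']
print(only_elements(sample))  # ['<name>', '</name>']
```

Walking by element is the quick way to hop from tag to tag, while walking by item is what you want when the text between two tags matters.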
We’re now parsing our way through the XML file. Great! We’re not interested in most of it, so we set a boolean flag when we hit the SGML element for Item 6, which is the broker-dealer list. Its element name from the Form 13H XML technical specification for EDGAR is <itemSix>. If we encounter that element, we set our item6 flag to true. If we see the closing element, </itemSix>, we break out of the loop, because there is nothing left that we want to read. Since all we care about is Item 6, if the flag has not been set yet we get the next tag and check again. We also don’t care about closing tags, so if a tag contains a slash we can ignore it. This isn’t the best way to check for a closing tag, because it also skips any other tag that happens to contain a slash, but for our purposes it works fine.
        if (element == "<itemSix>") {
            item6 = true;
        }
        if (element == "</itemSix>") {
            break;
        }
        if (item6 != true) {
            element = SGMLNextElement(sgml);
            continue;
        }
        if (IsInString(element, "/")) {
            element = SGMLNextElement(sgml);
            continue;
        }
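The same filtering logic can be sketched in Python, this time with a stricter closing-tag test that looks only at the start of the tag. This is an illustration under assumptions, not part of the client script: the function names and the sample tag list (including <primeBroker>) are hypothetical, and unlike the Legato loop this version does not pass the section’s own opening tag along.

```python
def is_closing_tag(element):
    """True only for tags of the form </something>."""
    return element.startswith("</")


def elements_in_section(elements, section):
    """Yield the opening tags found between <section> and </section>."""
    opening = f"<{section}>"
    closing = f"</{section}>"
    inside = False
    for element in elements:
        if element == closing:
            break               # nothing after the section is of interest
        if element == opening:
            inside = True       # start paying attention from here on
            continue
        if not inside or is_closing_tag(element):
            continue
        yield element


# Hypothetical tag stream, made up for this example.
tags = [
    "<itemFive>", "<name>", "</name>", "</itemFive>",
    "<itemSix>", "<name>", "</name>", "<primeBroker>", "</primeBroker>",
    "</itemSix>", "<name>",
]
print(list(elements_in_section(tags, "itemSix")))  # ['<name>', '<primeBroker>']
```

Testing for a leading "</" leaves alone any opening tag that merely contains a slash somewhere, such as one inside an attribute value.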
So from this point on in the loop we are only inside the Item 6 section of the XML. It’s worth noting that since Item 6 is the last section of the XML, we could skip the check for the ending tag. By keeping it, we make our script more resistant to changes in the schema.

We want to extract the name and information for each entry, so the next few if statements check for the appropriate element names. The <name> element holds the broker-dealer name. Since the information for each broker-dealer always starts with a <name> element, hitting one means we can increment the count of broker-dealers and start reading the name. The ox counter counts the items appended to the name, and its limit of 100 acts as a safety stop so a malformed file cannot keep the loop running. After resetting the ox counter, we start by setting name equal to the next item, since it will be the first thing inside the <name> tag. We use SGMLNextItem instead of SGMLNextElement because we actually care about the text and spaces, instead of skipping to the next tag.

Next, we check the item’s type by using the SGMLGetItemType function. If it’s a tag, we want to stop parsing, because we’ve reached the end of the name. If it’s not a tag, we keep looping and appending the next space or text item to our name variable. The last item appended is the closing tag itself, so once we exit the loop we use the ReplaceInString script function to strip out the </name> tag at the end. Then we log that we found a name and add the value to our table in the first column of the current row.
        if (element == "<name>") {
            ix++;
            name = SGMLNextItem(sgml);
            ox = 0;
            while (SGMLGetItemType(sgml) != SPI_TYPE_TAG && ox < 100) {
                name += SGMLNextItem(sgml);
                ox++;
            }
            name = ReplaceInString(name, "</name>", "");
            AddMessage(log, name);
            table[ix][0] = name;
        }
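The name-gathering step boils down to “join items until you reach a tag, with an upper bound on how many you take”. Here is a Python sketch of that idea. It is not the Legato code: the function name, the list-of-strings item model, and the sample name are assumptions for illustration. It also differs in one detail, since it stops before the closing tag rather than appending it and stripping it afterwards.

```python
def read_text_until_tag(items, start, limit=100):
    """Join text and space items from `start` until a tag or the limit.

    Returns the joined text and the position of the item that stopped the
    scan (the tag, or the end of the list).
    """
    parts = []
    pos = start
    while pos < len(items) and not items[pos].startswith("<") and len(parts) < limit:
        parts.append(items[pos])
        pos += 1
    return "".join(parts), pos


# Hypothetical item stream, made up for this example.
items = ["<name>", "Acme", " ", "Securities", " ", "LLC", "</name>"]
name, stopped_at = read_text_until_tag(items, 1)
print(name)        # Acme Securities LLC
print(stopped_at)  # 6
```

The limit plays the same role as the ox < 100 test in the script: it bounds the work done if a closing tag never shows up.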
That was the hard part. Now that we have the broker-dealer name, it’s smooth sailing because of the way the SEC structured this part of the XML schema. The names of the tags indicate the data values, so we don’t need to read any more text or spaces. Furthermore, any tags we read apply to the previous broker-dealer name. So all we have to do after using the SGMLNextElement script function is test whether the element contains the keywords “prime”, “executing”, or “clearing”, and we know which columns need to have a “Yes”. If we don’t encounter a tag, the column will be blank. We could alter the script to put “No” in unfilled columns, but for our purposes, that isn’t necessary. After we’ve set the value into the appropriate column of the table and logged what happened, we get the next element. The loop continues until it reaches </itemSix> or runs out of elements, picking up each and every row along the way.
        if (IsInString(element, "prime")) {
            AddMessage(log, " prime");
            table[ix][1] = "Yes";
        }
        if (IsInString(element, "executing")) {
            AddMessage(log, " executing");
            table[ix][2] = "Yes";
        }
        if (IsInString(element, "clearing")) {
            AddMessage(log, " clearing");
            table[ix][3] = "Yes";
        }
        element = SGMLNextElement(sgml);
    }
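If you did want “No” in the unfilled columns, the usual approach is a pass over the finished table before writing it out. The Python sketch below shows the idea on a plain list of rows. The function name fill_blanks and the sample data are hypothetical, and this is not something the client script does.

```python
def fill_blanks(table, width, filler="No"):
    """Return a copy of the table with empty flag cells replaced by filler.

    The first row (headings) and the first column (names) are left alone.
    Short rows are padded out to `width` columns first.
    """
    header, *rows = table
    filled = [list(header)]
    for row in rows:
        padded = list(row) + [""] * (width - len(row))
        filled.append([padded[0]] + [cell or filler for cell in padded[1:]])
    return filled


# Sample data made up for this example.
table = [
    ["Name", "Prime Broker", "Executing Broker", "Clearing Broker"],
    ["Acme Securities", "Yes"],
    ["Beta Clearing", "", "Yes", "Yes"],
]
for row in fill_blanks(table, 4):
    print(row)
# ['Name', 'Prime Broker', 'Executing Broker', 'Clearing Broker']
# ['Acme Securities', 'Yes', 'No', 'No']
# ['Beta Clearing', 'No', 'Yes', 'Yes']
```

Doing this as a separate pass keeps the parsing loop unchanged, which is handy if some reports should keep the blanks.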
After the loop completes, we use the CSVWriteTable script function to write the table to the chosen output location, and use the LogDisplay function to display the log inside GoFiler. Then we use the CloseHandle function to close the handles to the log and the SGML object.
    CSVWriteTable(table, output);
    LogDisplay(log);
    CloseHandle(log);
    CloseHandle(sgml);
    return ERROR_NONE;
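For readers curious what writing a table as CSV involves in general, here is a Python sketch using the standard csv module. It says nothing about how CSVWriteTable works internally. The helper name table_to_csv and the sample row are made up, and the sample name contains a comma on purpose to show why fields sometimes need quoting.

```python
import csv
import io


def table_to_csv(table):
    """Render a list of rows as CSV text, quoting fields only where needed."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(table)
    return buffer.getvalue()


# Sample data made up for this example.
table = [
    ["Name", "Prime Broker"],
    ["Acme Securities, LLC", "Yes"],
]
print(table_to_csv(table))
# Name,Prime Broker
# "Acme Securities, LLC",Yes
```

Company names with commas are common, so it is worth opening the output file in a spreadsheet once to confirm such names stay in a single column.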
We will skip over the main function as it is not required to use the script as a hook but allows for the hook to be added when the script is manually run.
This is a fairly simple example, showing how you can break down a document using the SGML parser, stop on tags of interest, test to see what you’re looking at, and dump relevant information to a log and a CSV file. The parser is actually significantly more powerful than this, and this example just scratches the surface of what it can do. We will have follow-up posts later that discuss it in more depth, but for now this provides a nice real-world example of the SGML parsing capabilities of Legato.
Steven Horowitz has been working for Novaworks for over five years as a technical expert with a focus on EDGAR HTML and XBRL. Since the creation of the Legato language in 2015, Steven has been developing scripts to improve the GoFiler user experience. He is currently working toward a Bachelor of Sciences in Software Engineering at RIT and MCC.