This week’s script comes from another customer request. It is a good example of how to search for patterns in HTML and replace the content matching that pattern. A common task when converting an HTML document is getting indented paragraphs to indent at the same level. This is especially problematic when the document contains numbered paragraphs that are not hanging (such as a legal outline). Dealing with this case isn’t trivial in HTML, because the document often imports with a set number of non-breaking spaces creating the indent. Unless you’re using a monospace font, the text will not align perfectly, and the outline will not look the same for every number. The best solution is to remove the non-breaking spaces and wrap the lead-in text (the “(a)”, “1.”, “Section 1.”, etc.) in an inline tag with the display type “inline-block”. Combined with a fixed width and the “nowrap” white-space property, this lets us add a fixed-width block to a paragraph, so the lead-in text takes up exactly the right amount of space.
Friday, June 30. 2017
LDC #41: Align Outline Text
I’ve created a sample file to demonstrate what the before and after would look like in a file. In the before, notice how the “(1)” does not line up exactly with the start of the first paragraph’s text after the lead-in of “A.”. The “(a)” that leads into the third paragraph also doesn’t line up with the body of the second paragraph. It’s close, but not exact. In the after, you can see that the alignment has resolved, because the lead-in texts are now enclosed in font blocks that fix their width to a half inch.
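To make the before and after concrete, here is a minimal sketch in Python (used only for illustration; the script itself is Legato). The helper name `wrap_lead_in` and the sample paragraph are hypothetical; the style string mirrors the font tags defined in the script.

```python
def wrap_lead_in(lead_in: str, body: str, width_in: float = 0.5) -> str:
    """Return a paragraph whose lead-in sits in a fixed-width inline block."""
    style = (
        f"display: inline-block; width: {width_in:g}in; "
        "float: left; white-space:nowrap"
    )
    return f'<P><FONT STYLE="{style}">{lead_in}</FONT>{body}</P>'


# Hypothetical "before" markup: the indent is faked with non-breaking spaces.
before = (
    '<P>(1)<FONT STYLE="font-size: 7pt">'
    "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</FONT>First item.</P>"
)

# "After": the spaces are gone and the lead-in occupies exactly half an inch.
after = wrap_lead_in("(1)", "First item.")
print(before)
print(after)
```

Because the block has a fixed width, the body text starts at the same horizontal position no matter how wide the glyphs in the lead-in are.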
The script below makes a couple of assumptions about our file, which means it may not work on all files. It assumes:
1) The file’s non-breaking space tags are wrapped in a font tag. In my experience this is the most common way Word codes this style of indent, resulting in the spaces being wrapped by a font tag with font size 7.
2) The file being converted indents things a half inch at a time. The defined tags at the top of the script can be edited if this isn’t the case, but I wrote and tested it for things indented in half inch increments.
3) There aren’t any HTML tags between the opening paragraph tag and the start of the paragraph’s lead-in text, other than bookmark tags.
4) There aren’t any HTML tags after the lead-in text and before the start of the font block containing the non-breaking spaces.
5) The HTML being parsed is well-formed, without tag nesting issues or missing close tags.
Not all of these assumptions are going to be true in all files, so this script will not work on 100% of files. However, each of them can be compensated for, so this is a script that could grow in complexity as new exceptions are found that require modifications. With HTML, there are always hundreds of different ways to write anything, so making a script that works for everything is probably not possible (or is prohibitively time consuming). Instead, we can aim for common methods and then iteratively fix the script and our file through trial and error. We’re going to go ahead and assume these conditions are true for the purposes of this blog.
This script was actually written by using a different one from a previous blog post as a base. The previous post LDC #5: Converting Wingding Checkboxes to Unicode worked great as a base, because it already has all the code required to add a menu item, add a hook to the menu, and parse through an HTML file. I simply changed a bunch of the wording in the comments / setup, and re-did the run function so the parse action does something different. Taking scripts that do something similar to what you want (in this case parse through and do things to an HTML file) and changing them to do something else is probably the fastest way to make and deploy new scripts. As the setup and main functions are the same, just with a couple of changes (for the script hook), they will not be discussed in this post.
This week’s script:
    //
    //
    //      GoFiler Legato Script - Align Outlined Text
    //      ------------------------------------------
    //
    //      Rev     06/29/2017
    //
    //      (c) 2016 Novaworks, LLC -- All rights reserved.
    //
    //      Examines any HTML file for 0-30 characters in a paragraph followed by a font tag containing spaces only,
    //      and then replaces the font tag with spaces with a better, more uniform spacing method.
    //

    /********************************************************/
    /* Global Items                                         */
    /* ------------                                         */
    /********************************************************/

    #define NBSP_ONLY       "^((&nbsp;)|(&NBSP;)|(&160;)|(&#160;)){5,}$"

    /* font tags for SM, MD, and LG values defined below */
    #define FONT_TAG_SM     "<FONT STYLE=\"display: inline-block; width: 0.5in; float: left; white-space:nowrap\">"
    #define FONT_TAG_MD     "<FONT STYLE=\"display: inline-block; width: 1in; float: left; white-space:nowrap\">"
    #define FONT_TAG_LG     "<FONT STYLE=\"display: inline-block; width: 1.5in; float: left; white-space:nowrap\">"

    /* max values in 10's place of length of lead-in */
    #define CHARS_SM        0       /* below 10 chars */
    #define CHARS_MD        1       /* 10 chars to 19 chars */
    #define CHARS_LG        2       /* 20 chars to 29 chars */
    #define CLOSE_FONT      "</FONT>"

    int run (int f_id, string mode);                        /* Call from Hook Processor */

    /****************************************/
    int setup() {                                           /* Called from Application Startup */
    /****************************************/
        string fnScript;                                    /* Us */
        string item[10];                                    /* Menu Item */
        int rc;                                             /* Return Code */

                                                            /* ** Add Menu Item */
                                                            /* * Define Function */
        item["Code"] = "EXTENSION_ALIGN_OUTLINE";           /* Function Code */
        item["MenuText"] = "&Align Outline Text";           /* Menu Text */
        item["Description"] = "<B>Align Outline Text</B> "; /* Description (long) */
        item["Description"]+= "\r\rBreaks outline out into aligned blocks."; /* * description */
                                                            /* * Check for Existing */
        rc = MenuFindFunctionID(item["Code"]);              /* Look for existing */
        if (IsNotError(rc)) {                               /* Was already added */
            return ERROR_NONE;                              /* Exit */
            }                                               /* end error */
                                                            /* * Registration */
        rc = MenuAddFunction(item);
        if (IsError(rc)) {                                  /* Could not be added */
            return ERROR_NONE;                              /* Exit */
            }                                               /* end error */
        fnScript = GetScriptFilename();                     /* Get the script filename */
        MenuSetHook(item["Code"], fnScript, "run");         /* Set the Test Hook */
        return ERROR_NONE;                                  /* Return value (does not matter) */
        }                                                   /* end setup */

    /****************************************/
    int main() {                                            /* Initialize from Hook Processor */
    /****************************************/
        setup();                                            /* Add to the menu */
        return ERROR_NONE;                                  /* Exit Done */
        }                                                   /* end main */

    /****************************************/
    int run(int f_id, string mode) {                        /* Call from Hook Processor */
    /****************************************/
        int counter;                                        /* increment counter */
        int px,py;                                          /* start pos of paragraph text */
        int text_width;                                     /* the width of the lead in text */
        int ex,ey,sx,sy;                                    /* positional variables */
        dword type;                                         /* type of window */
        string font_tag;                                    /* font tag to write */
        string content;                                     /* content of an SGML tag */
        string closetag;                                    /* closing tag to write out */
        string element;                                     /* sgml element */
        handle sgml;                                        /* sgml object */
        handle edit_object;                                 /* edit object */
        handle edit_window;                                 /* edit window handle */
        string text;                                        /* closing element of sgml object */

        if (mode!="preprocess"){                            /* if mode is not preprocess */
            return ERROR_NONE;                              /* return no error */
            }
        edit_window = GetActiveEditWindow();                /* get handle to edit window */
        if(IsError(edit_window)){                           /* check the edit window handle */
            MessageBox('x',"Cannot get edit window.");      /* display error */
            return ERROR_EXIT;                              /* return */
            }
        type = GetEditWindowType(edit_window) & EDX_TYPE_ID_MASK;           /* get the type of the window */
        if (type!=EDX_TYPE_PSG_PAGE_VIEW && type!=EDX_TYPE_PSG_TEXT_VIEW){  /* and make sure type is HTML or Code */
            MessageBox('x',"This is not an HTML edit window.");             /* display error */
            return ERROR_EXIT;                              /* return error */
            }
        edit_object = GetEditObject(edit_window);           /* create edit object */
        sgml = SGMLCreate(edit_object);                     /* create sgml object */
        element = SGMLNextElement(sgml);                    /* get the first sgml element */
        while(element != ""){                               /* while element isn't empty */
            if (FindInString(element, "<p", 0, false)>(-1)){            /* if the element is a paragraph */
                px = SGMLGetItemPosEX(sgml);                /* store end pos */
                py = SGMLGetItemPosEY(sgml);                /* store end pos */
                element = SGMLNextElement(sgml);            /* get the next element */
                while (FindInString(element, "<font", 0, false)<0 &&    /* while not a font tag */
                       MakeLowerCase(element)!="</p>" && element!=""){  /* and not at the end of P */
                    if (FindInString(element, "<a", 0, false)>(-1)){    /* if next element is an anchor */
                        SGMLNextElement(sgml);              /* advance 2 times */
                        px = SGMLGetItemPosEX(sgml);        /* get px */
                        py = SGMLGetItemPosEY(sgml);        /* get py */
                        element = SGMLNextElement(sgml);    /* advance 2 times */
                        }
                    else{
                        break;                              /* if not an A tag, break */
                        }
                    }
                if (FindInString(element, "<font", 0, false)>(-1)){     /* if the next element is a font tag */
                    sx = SGMLGetItemPosSX(sgml);            /* start of font tag */
                    sy = SGMLGetItemPosSY(sgml);            /* start of font tag */
                    content = ReadSegment(edit_object,px,py,sx,sy);     /* get content of lead-in */
                    text_width = GetStringLength(content);  /* get width of text */
                    content = SGMLFindClosingElement(sgml,SP_FCE_CODE_AS_IS); /* get the content of the font tag */
                    content = TrimPadding(content);         /* remove leading / trailing space */
                    if (IsRegexMatch(content, NBSP_ONLY)){  /* check if font tag is only NBSP's */
                        ex = SGMLGetItemPosEX(sgml);        /* end of font tag */
                        ey = SGMLGetItemPosEY(sgml);        /* end of font tag */
                        switch (text_width/10){             /* switch on width of lead-in text */
                            case (CHARS_SM):                /* if less than 10 chars */
                                font_tag = FONT_TAG_SM;     /* use small tag */
                                break;                      /* break switch */
                            case (CHARS_MD):                /* if less than 20 chars */
                                font_tag = FONT_TAG_MD;     /* use medium font tag */
                                break;                      /* break switch */
                            case (CHARS_LG):                /* if less than 30 chars */
                                font_tag = FONT_TAG_LG;     /* use large font tag */
                                break;                      /* break */
                            default:                        /* if none of the above */
                                font_tag = "";              /* do not set a font tag */
                                break;                      /* break */
                            }
                        if (font_tag!=""){                  /* if we have a font tag to use */
                            WriteSegment(edit_object,"",sx,sy,ex,ey);       /* remove font tag */
                            WriteSegment(edit_object,CLOSE_FONT,sx,sy);     /* write close font tag */
                            WriteSegment(edit_object,font_tag,px,py,px,py); /* write begin font tag */
                            SGMLSetPosition(sgml,px,py);    /* set SGML position */
                            counter++;                      /* increment count */
                            }
                        }
                    }
                }
            element = SGMLNextElement(sgml);                /* get the next sgml element */
            }
        CloseHandle(edit_object);                           /* close edit object */
        CloseHandle(sgml);                                  /* close sgml object */
        MessageBox('i',"Found and modified %d paragraphs.",counter);    /* messagebox */
        return ERROR_NONE;                                  /* Exit Done */
        }                                                   /* end run */
First, let’s take a look at our defined values.
    #define NBSP_ONLY       "^((&nbsp;)|(&NBSP;)|(&160;)|(&#160;)){5,}$"

    /* font tags for SM, MD, and LG values defined below */
    #define FONT_TAG_SM     "<FONT STYLE=\"display: inline-block; width: 0.5in; float: left; white-space:nowrap\">"
    #define FONT_TAG_MD     "<FONT STYLE=\"display: inline-block; width: 1in; float: left; white-space:nowrap\">"
    #define FONT_TAG_LG     "<FONT STYLE=\"display: inline-block; width: 1.5in; float: left; white-space:nowrap\">"

    /* max values in 10's place of length of lead-in */
    #define CHARS_SM        0       /* below 10 chars */
    #define CHARS_MD        1       /* 10 chars to 19 chars */
    #define CHARS_LG        2       /* 20 chars to 29 chars */
    #define CLOSE_FONT      "</FONT>"
NBSP_ONLY is a regular expression string constant. It is used to test whether a string contains only non-breaking spaces written as the HTML entities “&nbsp;”, “&160;”, or “&#160;”. If the string consists only of these entities and there are five or more of them, the regex will match.
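As a rough illustration of what this pattern accepts, here is a small Python sketch (Python is used only for illustration; the script itself relies on Legato’s IsRegexMatch). The entity spellings in the pattern are an assumed reconstruction, and `is_nbsp_indent` is a hypothetical helper name.

```python
import re

# Assumed reconstruction of the pattern: five or more non-breaking space
# entities and nothing else between the start and end of the string.
NBSP_ONLY = re.compile(r"^(?:&nbsp;|&NBSP;|&160;|&#160;){5,}$")


def is_nbsp_indent(font_content: str) -> bool:
    """True if the font tag's content is purely an entity-based indent."""
    # Strip ordinary whitespace first, as the script does with TrimPadding.
    return NBSP_ONLY.fullmatch(font_content.strip()) is not None


print(is_nbsp_indent("&nbsp;" * 5))           # five entities: an indent
print(is_nbsp_indent("&nbsp;" * 4))           # too few entities
print(is_nbsp_indent("&nbsp;" * 5 + "text"))  # real text mixed in
```

The minimum of five keeps the check from firing on a font tag that merely holds one or two spacing entities between words.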
FONT_TAG_SM, FONT_TAG_MD, and FONT_TAG_LG are used with small, medium, or large lead-in text. They go up in half inch increments, from .5 inches to 1.5 inches.
CHARS_SM, CHARS_MD, and CHARS_LG classify a lead-in segment of text as small, medium, or large based on its character count. Each define holds the result of dividing the length by 10 using integer division, which is the same as the digit in the 10’s place of the number of characters in the lead-in. So below 10 is small, 10-19 is medium, and 20-29 is large. Anything bigger doesn’t get processed.
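The bucketing can be sketched in a few lines of Python (for illustration only; the names `WIDTH_BY_TENS` and `lead_in_width` are hypothetical and not part of the script):

```python
# Tens digit of the lead-in length mapped to the block width, mirroring
# CHARS_SM / CHARS_MD / CHARS_LG and the three font tag defines.
WIDTH_BY_TENS = {0: "0.5in", 1: "1in", 2: "1.5in"}


def lead_in_width(lead_in: str):
    """Return the block width for a lead-in, or None if it is too long."""
    return WIDTH_BY_TENS.get(len(lead_in) // 10)


print(lead_in_width("(a)"))         # 3 characters, small
print(lead_in_width("Section 1."))  # 10 characters, medium
print(lead_in_width("x" * 30))      # 30 characters, not processed
```

Integer division does the range check for free: every length from 10 through 19 divides down to 1, so a single table lookup (or `switch`, in the script) replaces a chain of comparisons.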
CLOSE_FONT is the closing font tag that will match any of the open font tag defines. This is a define in case we want to change to a different inline HTML tag in the future.
        if (mode!="preprocess"){                            /* if mode is not preprocess */
            return ERROR_NONE;                              /* return no error */
            }
        edit_window = GetActiveEditWindow();                /* get handle to edit window */
        if(IsError(edit_window)){                           /* check the edit window handle */
            MessageBox('x',"Cannot get edit window.");      /* display error */
            return ERROR_EXIT;                              /* return */
            }
        type = GetEditWindowType(edit_window) & EDX_TYPE_ID_MASK;           /* get the type of the window */
        if (type!=EDX_TYPE_PSG_PAGE_VIEW && type!=EDX_TYPE_PSG_TEXT_VIEW){  /* and make sure type is HTML or Code */
            MessageBox('x',"This is not an HTML edit window.");             /* display error */
            return ERROR_EXIT;                              /* return error */
            }
        edit_object = GetEditObject(edit_window);           /* create edit object */
        sgml = SGMLCreate(edit_object);                     /* create sgml object */
        element = SGMLNextElement(sgml);                    /* get the first sgml element */
So starting with our run function, the first thing we need to do is ensure mode is preprocess, otherwise return, to ensure we only run this function one time. Then we need to use GetActiveEditWindow to get the edit window, and test to make sure it’s either Page View or Text View. Then we can use the GetEditObject function to get the active edit object for the view, and create our SGML parser with the SGMLCreate function. Using the SGMLNextElement function, we can grab the first element in our file, and begin parsing it.
        while(element != ""){                               /* while element isn't empty */
            if (FindInString(element, "<p", 0, false)>(-1)){            /* if the element is a paragraph */
                px = SGMLGetItemPosEX(sgml);                /* store end pos */
                py = SGMLGetItemPosEY(sgml);                /* store end pos */
                element = SGMLNextElement(sgml);            /* get the next element */
                while (FindInString(element, "<font", 0, false)<0 &&    /* while not a font tag */
                       MakeLowerCase(element)!="</p>" && element!=""){  /* and not at the end of P */
                    if (FindInString(element, "<a", 0, false)>(-1)){    /* if next element is an anchor */
                        SGMLNextElement(sgml);              /* advance 2 times */
                        px = SGMLGetItemPosEX(sgml);        /* get px */
                        py = SGMLGetItemPosEY(sgml);        /* get py */
                        element = SGMLNextElement(sgml);    /* advance 2 times */
                        }
                    else{
                        break;                              /* if not an A tag, break */
                        }
                    }
While we have a next element, we can test if it’s a paragraph tag or not by using FindInString. If the position of “<p” is greater than -1, it means the element is a paragraph. Then we can store the end of the paragraph tag as px and py using the SGMLGetItemPosEX and SGMLGetItemPosEY functions, because this is often the start of our lead-in text. Then we grab the next element. While the next element is NOT a font tag, and it’s not an end paragraph tag, and it’s actually an element, we need to keep iterating. This will iterate over all HTML tags up until the end of the paragraph (or a font tag) unless we break the loop. The test for an empty element is only in there to avoid an infinite loop, in case there are no more font tags and the paragraph never closes properly.
This part of the code is really only there to check for bookmark tags. If we hit an “<a” tag, it runs the SGMLNextElement function again to get to the close of the tag (it assumes the HTML is well formed, so there shouldn’t be anything else inside the bookmark). Then we store a new px and py value, so the stored start of the lead-in text segment is at the end of the closing bookmark tag, and advance to the next tag. If the tag isn’t a bookmark tag, we break the loop.
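The skip-the-bookmarks logic can be modeled in Python over a simple list of tag strings (an illustration only; the real script walks the document with the SGML functions and also tracks positions). The name `find_indent_font` is hypothetical, and the bookmark test here is slightly stricter than the script’s plain “<a” search.

```python
import re

# Matches an opening anchor tag such as <a name="x"> but not <abbr>.
BOOKMARK_OPEN = re.compile(r"<a[\s>]", re.IGNORECASE)


def find_indent_font(tags):
    """Given the tags that follow a <p>, return the index of the font tag
    that directly follows any leading bookmarks, or None if some other tag
    (or the end of the paragraph) comes first."""
    i = 0
    while i < len(tags):
        tag = tags[i].lower()
        if tag.startswith("<font"):
            return i            # candidate indent tag found
        if tag == "</p>":
            return None         # paragraph ended without a font tag
        if BOOKMARK_OPEN.match(tag):
            i += 2              # skip the <a ...> and its matching </a>
            continue
        return None             # any other tag: give up on this paragraph
    return None


tags = ['<A NAME="sec1">', "</A>", '<FONT STYLE="font-size: 7pt">', "</FONT>", "</P>"]
print(find_indent_font(tags))
```

Skipping two elements per bookmark only works because the HTML is assumed to be well formed, with nothing nested inside the anchor.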
                if (FindInString(element, "<font", 0, false)>(-1)){     /* if the next element is a font tag */
                    sx = SGMLGetItemPosSX(sgml);            /* start of font tag */
                    sy = SGMLGetItemPosSY(sgml);            /* start of font tag */
                    content = ReadSegment(edit_object,px,py,sx,sy);     /* get content of lead-in */
                    text_width = GetStringLength(content);  /* get width of text */
                    content = SGMLFindClosingElement(sgml,SP_FCE_CODE_AS_IS); /* get the content of the font tag */
                    content = TrimPadding(content);         /* remove leading / trailing space */
                    if (IsRegexMatch(content, NBSP_ONLY)){  /* check if font tag is only NBSP's */
                        ex = SGMLGetItemPosEX(sgml);        /* end of font tag */
                        ey = SGMLGetItemPosEY(sgml);        /* end of font tag */
                        switch (text_width/10){             /* switch on width of lead-in text */
                            case (CHARS_SM):                /* if less than 10 chars */
                                font_tag = FONT_TAG_SM;     /* use small tag */
                                break;                      /* break switch */
                            case (CHARS_MD):                /* if less than 20 chars */
                                font_tag = FONT_TAG_MD;     /* use medium font tag */
                                break;                      /* break switch */
                            case (CHARS_LG):                /* if less than 30 chars */
                                font_tag = FONT_TAG_LG;     /* use large font tag */
                                break;                      /* break */
                            default:                        /* if none of the above */
                                font_tag = "";              /* do not set a font tag */
                                break;                      /* break */
                            }
                        if (font_tag!=""){                  /* if we have a font tag to use */
                            WriteSegment(edit_object,"",sx,sy,ex,ey);       /* remove font tag */
                            WriteSegment(edit_object,CLOSE_FONT,sx,sy);     /* write close font tag */
                            WriteSegment(edit_object,font_tag,px,py,px,py); /* write begin font tag */
                            SGMLSetPosition(sgml,px,py);    /* set SGML position */
                            counter++;                      /* increment count */
                            }
                        }
                    }
                }
            element = SGMLNextElement(sgml);                /* get the next sgml element */
            }
        CloseHandle(edit_object);                           /* close edit object */
        CloseHandle(sgml);                                  /* close sgml object */
        MessageBox('i',"Found and modified %d paragraphs.",counter);    /* messagebox */
        return ERROR_NONE;                                  /* Exit Done */
        }                                                   /* end run */
If our next element was a font tag, we can actually begin processing it. First, we get the start of our font tag with the SGMLGetItemPosSX and SGMLGetItemPosSY functions. We can then read the lead-in text by using ReadSegment, starting at px, py and ending at sx, sy. Using the GetStringLength function we can get the width of the lead-in, which we’ll need later. Then using the SGMLFindClosingElement function, we can get the contents of the font tag. The SP_FCE_CODE_AS_IS flag is used so it returns the character entities as text, instead of just ignoring them and returning spaces. We need to use the TrimPadding function on it as well to remove any breaking whitespace such as normal spaces and returns. Using IsRegexMatch, we can then test to see if the contents of the font tag match our style of indent.
If so, we get the end positions as ex and ey, then switch on the width of our lead-in divided by ten to decide which font tag to use. If we don’t have an appropriate font tag, we just don’t set one. If we have a set font tag, we use three WriteSegment functions to fix our code. First, we remove the entire font tag, since it’s not needed anymore. Then, we insert the close font tag at the point where the old font tag used to be. Lastly, we insert our new font tag at the beginning of our lead-in text. When using WriteSegment, it’s important to work from right to left and bottom to top, so that each edit only changes text after the positions we still need and our own editing doesn’t throw off our position values. After the new text is written out, we use the SGMLSetPosition function to reset our file mapping, and increment the counter value for how many paragraphs were modified.
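The reason the edit order matters can be shown with plain string offsets in Python (an illustration only; the script works with x/y positions through WriteSegment, and `apply_edits` is a hypothetical helper). Applying the edit with the largest start offset first means the earlier offsets are still valid when their turn comes.

```python
OPEN_TAG = (
    '<FONT STYLE="display: inline-block; width: 0.5in; '
    'float: left; white-space:nowrap">'
)
CLOSE_TAG = "</FONT>"


def apply_edits(text, edits):
    """Apply (start, end, replacement) edits whose offsets all refer to the
    ORIGINAL text, working from the end of the text back to the start."""
    for start, end, new in sorted(edits, key=lambda e: e[0], reverse=True):
        text = text[:start] + new + text[end:]
    return text


line = "<P>(1)<FONT>" + "&nbsp;" * 5 + "</FONT>Text</P>"
px = len("<P>")                                # start of the lead-in text
sx = line.index("<FONT>")                      # start of the old font tag
ex = line.index("</FONT>") + len("</FONT>")    # end of the old font tag

fixed = apply_edits(line, [
    (px, px, OPEN_TAG),     # insert the new opening tag before the lead-in
    (sx, ex, CLOSE_TAG),    # replace the nbsp font block with a close tag
])
print(fixed)
```

If the opening tag were inserted first, every later offset would be off by the length of that tag, and the second edit would land in the wrong place.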
After doing all the replacing logic, we need to use SGMLNextElement at the end of it to get the next element for the loop to begin again. Once the loop is completed, we can close the handle to our sgml and edit objects, write out a message box with MessageBox to display status of the script to our users, and then return without error.
This script works well on the sample files I had, but it could definitely be improved even further. For example, it could have a user interface to configure the default small, medium, and large values. It could consider different sized lead-ins besides the three I put in there. It could also have better handling of how to detect lead-in text. For example, if the lead-in text contains an italicized word, the script as-is would ignore that paragraph. This is more of a jumping-off point that can be improved upon for other situations, rather than a comprehensive solution for all HTML files involving outlines.
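As one possible direction for the “different sized lead-ins” idea, the three fixed buckets could be replaced by a computed width. The Python sketch below is only an illustration of that idea, not something the script does; `width_for` and its parameters are hypothetical, and the defaults are chosen to reproduce the existing small, medium, and large widths.

```python
def width_for(lead_in: str, step_in: float = 0.5, chars_per_step: int = 10) -> str:
    """Grow the block by one step for every chars_per_step characters."""
    steps = len(lead_in) // chars_per_step + 1
    return f"{steps * step_in:g}in"


print(width_for("(a)"))         # same as the small tag
print(width_for("Section 1."))  # same as the medium tag
print(width_for("x" * 35))      # beyond the three original buckets
```

With the defaults, lengths below 30 give the same 0.5in, 1in, and 1.5in widths as the defines, while longer lead-ins keep growing in half inch steps instead of being skipped.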
Steven Horowitz has been working for Novaworks for over five years as a technical expert with a focus on EDGAR HTML and XBRL. Since the creation of the Legato language in 2015, Steven has been developing scripts to improve the GoFiler user experience. He is currently working toward a Bachelor of Sciences in Software Engineering at RIT and MCC.
Additional Resources
Legato Script Developers LinkedIn Group
Primer: An Introduction to Legato