LDC #161: Quickly Adding CSS Classes to an Existing HTML Document

Monday, November 25, 2019

In the EDGAR world, inline CSS can be used to style documents. However, CSS style sheets are not allowed. I recently ran into a case where a document needed to be published on a system that was just the opposite: CSS classes are OK, but inline styles are verboten. A quick Legato script saved many hours of work by translating the inline styles to a style sheet.

Introduction

There are two cases where CSS sheets are used: first, in providing an easy-to-edit, consistent set of styles for a document, and, second, for browser control when the browsing platform varies and must be responsive to the media type or media format. Many websites or subsections of websites use Content Management Systems (CMS), which restrict the formatting of the content. In fact, the article you are reading now had to be preprocessed before it was posted on the blog CMS (we have a Legato script for that, too). Website-level CSS can be very complex, well beyond the scope of what we are doing here.

The script described here has a simple job: isolate inline styles, combine duplicate styles, create a style sheet, and update the HTML to use the classes rather than inline styles. The script works really well on clean HTML converted with Novaworks applications since the HTML is very consistent. If you play with the script, you will find that it may “over class” the document, since styling inconsistencies will result in additional classes being defined.

A Brief Introduction to CSS

Can one describe CSS in a few paragraphs? Let’s try.

If you have spent any time with HTML, you know there are HTML attributes to adjust style and there are CSS properties that significantly expand styling options. For almost any HTML element, the display behavior of the tag can be changed:

<p style="padding: 3pt; color: white; background-color: black">Some Knocked Out Text</p>

This code will cause the text “Some Knocked Out Text” to appear as white in a black box. This is an inline style using CSS property-value pairs. If I use this style repeatedly, or just want to have centralized control, I can decide to make a class:

... (header) ...

<style>

p.knockout { padding: 3pt; color: white; background-color: black }

</style>

... (body) ...

<p class="knockout">Some Knocked Out Text</p>

In the header of the HTML document, there can be one or more <style> groups or references to external CSS style sheets. For many websites, there will be many style sheet files with combinations of styles. The text:

p.knockout { padding: 3pt; color: white; background-color: black }

is what is known as a CSS rule. The rule begins with a selector, p.knockout, which names a class for the p element; the period separates the element from the class name. Following that, inside the braces, is the declaration, which contains the style information. If the element is omitted from the selector, as in .knockout, the class can be referenced by all elements. Selectors can become very complex by adding IDs, pseudo classes, parent-child relationships, inheritance, or just simple elements. Multiple sets of rules can also be defined for media types, size of the media area, and more.
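To make the anatomy of a rule concrete, here is a minimal sketch that splits a simple rule into its selector and its property-value pairs. It is written in Python rather than Legato purely for illustration, and parse_rule is a hypothetical helper, not part of any library. It assumes a single rule with no semicolons or braces inside quoted values.

```python
def parse_rule(rule):
    """Split a simple CSS rule into its selector and a property/value mapping."""
    selector, _, rest = rule.partition("{")
    body = rest.rsplit("}", 1)[0]
    declarations = {}
    for item in body.split(";"):
        if ":" not in item:
            continue  # skip empty pieces left by trailing semicolons
        prop, _, value = item.partition(":")
        declarations[prop.strip().lower()] = value.strip()
    return selector.strip(), declarations


selector, declarations = parse_rule(
    "p.knockout { padding: 3pt; color: white; background-color: black }"
)
print(selector)      # p.knockout
print(declarations)  # {'padding': '3pt', 'color': 'white', 'background-color': 'black'}
```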

The Script

Put simply, this is a “quick and dirty” script to build a CSS sheet that can either be edited and combined later or simply used to deal with the situation described in the introduction. The code is broken into six parts:

                                                                   // Declarations and Start
handle          hWP, hPool;
string          code;
string          styles[];
string          aa[];
string          fnSource, fnDestination;
string          prefix;
string          s1, s2, s3;
boolean         lc_flag;
int             rc,
                dx, ix, size;

// Two sample filings from the SEC's EDGAR archive; the second assignment overrides the first.
fnSource = R"https://www.sec.gov/Archives/edgar/data/1333986/000119312519297702/d788758dex11.htm";
fnSource = R"https://www.sec.gov/Archives/edgar/data/1603756/000160375619000080/axnx-111919x424b5finaldocu.htm";
fnDestination = GetScriptFolder() + "Result.htm";
prefix = "sacc_";

                                                                   // A: Capture File
code = FileToString(fnSource);
if (code == "") {
  ReportFileError(fnSource, GetLastError());
  exit;
  }
if (GetStringLength(code) > 0x003FFFFF) {
  MessageBox('x', "Source is too big!");
  exit;
  }

                                                                   // B: Build Style List
hWP = WordParseCreate(WP_SGML_TAG, code);

s1 = WordParseGetWord(hWP);
while (s1 != "") {
  if (IsSGMLTag(s1)) {
    s2 = GetTagElement(s1);     
    if (IsLower(s2[0])) { lc_flag = TRUE; }
    aa = GetTagAttributes(s1, TRUE);
    if (ArrayFindKeyName(aa, "style") >= 0) {
      styles[dx] = aa["style"];
      dx++;
      }
    }
  s1 = WordParseGetWord(hWP);
  }

                                                                   // C: Delete Common Items
SortList(styles);
size = ArrayGetAxisDepth(styles);
ix = 0;
while (ix < size - 1) {
  if ((styles[ix] == styles[ix+1])) {
    DeleteListItem(styles, ix);
    size--;
    continue;
    }
  ix++; 
  }

                                                                   // D: Replace Code
size = ArrayGetAxisDepth(styles);
ix = 0;
while (ix < size) {
  s2 = "style=\"" + styles[ix] + "\"";
  if (lc_flag) {
    s3 = FormatString("class=\"%s%03d\"", prefix, ix + 1);
    }
  else {
    s3 = FormatString("CLASS=\"%s%03d\"", prefix, ix + 1);
    }
  code = ReplaceInString(code, s2, s3, TRUE);
  ix++;
  }

                                                                   // E: Add CSS Table
if (lc_flag) {
  s1 = "<style>\r\n";   
  }
else {
  s1 = "<STYLE>\r\n";   
  }
ix = 0;
while (ix < size) {
  s1 += FormatString(".%s%03d   {%s}\r\n", prefix, ix + 1, styles[ix]);
  ix++;
  }
if (lc_flag) {
  s1 += "</style>\r\n"; 
  }
else {
  s1 += "</STYLE>\r\n"; 
  }

                                                                   // F: Insert and Write
ix = InString(code, "</head>", FALSE);
if (ix < 0) {
  MessageBox('x', "Unable to locate </head> tag.");
  exit;
  }
code = InsertStringSegment(code, s1, ix);

rc = StringToFile(code, fnDestination);
if (IsError(rc)) {
  MessageBox('x', "Unable to write file (%08X)\r\r%s", rc, fnDestination);
  exit;
  }

The start of the file contains the declarations, followed by the capture of data. The script does not have a user interface (UI), so the source, destination, and prefix are hard coded. The prefix is prepended to each generated class name. This is needed since we want to create something that does not interfere with existing classes in whatever environment our final document will be hosted. For our example, I picked sacc_, “Style as CSS Class”.

A: Capture File

As you can see in the declaration section, I was just picking example source files from the SEC’s EDGAR archive. For this little script, we’re doing everything with strings. As such, the source file cannot be larger than 4 MB, as limited by the ReplaceInString function (4 MB for GoFiler 5.1c and 1 MB for earlier versions). Without too much work, the script’s capability could be expanded by using a Mapped Text object and the SGML parser.

B: Build Style List

Our first major step is to collect all the styles. This is done by using the Word Parse Object in SGML tag mode. We just go item by item, looking first for a tag and then pulling the attributes apart. If a style attribute is present, it is added to our styles array. Note that TRUE on the GetTagAttributes function forces all the attributes to be lower case.

hWP = WordParseCreate(WP_SGML_TAG, code);

s1 = WordParseGetWord(hWP);
while (s1 != "") {
  if (IsSGMLTag(s1)) {
    s2 = GetTagElement(s1);     

    if (IsLower(s2[0])) { lc_flag = TRUE; }

    aa = GetTagAttributes(s1, TRUE);

    if (ArrayFindKeyName(aa, "style") >= 0) {
      styles[dx] = aa["style"];
      dx++;
      }
    }
  s1 = WordParseGetWord(hWP);
  }

During the parse, we check tags for their case. Any lower case tag will set the lc_flag, which is used later to match the document’s tagging style. Note that this rather clumsy method is used because the referenced SEC archive files always have SGML prefix tags in upper case. Again, quick and dirty. You would need to implement more intelligent processing for files with varying formats.
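For readers who want to try the same collection step outside of Legato, the following is a rough Python equivalent built on the standard library's html.parser module. StyleCollector and its attribute names are hypothetical names chosen for this sketch. It mirrors the idea above (gather every style attribute and note whether any tag is written in lower case) but is not a translation of the Word Parse Object.

```python
from html.parser import HTMLParser


class StyleCollector(HTMLParser):
    """Collect the value of every style attribute and note lower-case tags."""

    def __init__(self):
        super().__init__()
        self.styles = []
        self.saw_lowercase_tag = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, so the original
        # case has to be read from the raw text of the start tag.
        raw = self.get_starttag_text() or ""
        name = raw[1:1 + len(tag)]
        if name[:1].islower():
            self.saw_lowercase_tag = True
        for attr_name, value in attrs:
            if attr_name == "style" and value is not None:
                self.styles.append(value)


collector = StyleCollector()
collector.feed('<P STYLE="color: red">A</P><p style="color: red">B</p><div>C</div>')
print(collector.styles)             # ['color: red', 'color: red']
print(collector.saw_lowercase_tag)  # True
```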

C: Delete Common Items

Once the styles have been accumulated, we want to sort and remove duplicates.

SortList(styles);
size = ArrayGetAxisDepth(styles);
ix = 0;
while (ix < size - 1) {
  if ((styles[ix] == styles[ix+1])) {
    DeleteListItem(styles, ix);
    size--;
    continue;
    }
  ix++; 
  }

As I said above, this method works well with our HTML since it is very consistent; with other vendors there may be extra styles. In the second example file, I noticed that the same style combinations appeared with extra semicolons, differing shorthand notation, and other inconsistencies. In that file, 16,487 style references are condensed into 235. This could be improved by normalizing the styles before removing duplicates, although the original style code must be kept for the replacement step. That, too, can be optimized; the script was quick to write, even if it is not so quick to execute.
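One way to approach the normalization idea is sketched below in Python (not Legato), with hypothetical helper names normalize_style and group_styles. It only strips stray semicolons, collapses whitespace, and lower-cases property names; it deliberately keeps declaration order and value case, since changing either can alter the meaning of a style. Differing shorthand notation is not handled, and semicolons inside quoted values would confuse the split. Each normalized style keeps the list of original spellings so the replacement step can still find them.

```python
def normalize_style(style):
    """Return a canonical form of an inline style for comparison purposes."""
    declarations = []
    for item in style.split(";"):
        prop, sep, value = item.partition(":")
        if not sep:
            continue  # empty piece, e.g. from a trailing or doubled semicolon
        declarations.append(f"{prop.strip().lower()}: {' '.join(value.split())}")
    return "; ".join(declarations)


def group_styles(styles):
    """Map each normalized style to the original spellings seen for it."""
    groups = {}
    for original in styles:
        variants = groups.setdefault(normalize_style(original), [])
        if original not in variants:
            variants.append(original)
    return groups


styles = ["color: red;", "COLOR:red", "color: red", "margin: 0  auto;;"]
groups = group_styles(styles)
print(len(groups))           # 2
print(groups["color: red"])  # ['color: red;', 'COLOR:red', 'color: red']
```

With this grouping, one class would be generated per key, and every original spelling in the group would be replaced with that class.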

D: Replace Code

The next loop goes back through the code and replaces the style attributes with class references. The lc_flag determines whether the class attribute is written in lower or upper case, to match the document. Back to the quick and dirty approach: if the source has extra spaces (for example, around the equals sign) or uses single rather than double quotes, etc., this will not work.

size = ArrayGetAxisDepth(styles);
ix = 0;
while (ix < size) {
  s2 = "style=\"" + styles[ix] + "\"";
  if (lc_flag) {
    s3 = FormatString("class=\"%s%03d\"", prefix, ix + 1);
    }
  else {
    s3 = FormatString("CLASS=\"%s%03d\"", prefix, ix + 1);
    }
  code = ReplaceInString(code, s2, s3, TRUE);
  ix++;
  }

Can you make an improvement? Note that the ReplaceInString function is called with the TRUE flag, which tells the function to ignore case. This is also where the size limitation comes into play, as the ReplaceInString function has an internal limit. Moving to the SGML parser would remove that limit, fix the quote issue, and remove the possibility of erroneously replacing data outside of tags with class info.
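As an illustration of what a more careful replacement might look like, here is a Python sketch (again, not Legato, and not the SGML parser approach itself). It uses regular expressions to touch only text inside start tags and accepts either quote style. The names TAG_RE, STYLE_ATTR_RE, and replace_styles are hypothetical. It assumes no ">" appears inside attribute values, always writes the class attribute in lower case, and, like the script, does not merge with a class attribute the element may already have.

```python
import re

TAG_RE = re.compile(r"<[A-Za-z][^<>]*>")
STYLE_ATTR_RE = re.compile(
    r"""(\s)style\s*=\s*(?:"([^"]*)"|'([^']*)')""", re.IGNORECASE
)


def replace_styles(html, class_for_style):
    """Swap style attributes for class attributes, only inside start tags."""

    def fix_attr(match):
        value = match.group(2) if match.group(2) is not None else match.group(3)
        name = class_for_style.get(value)
        if name is None:
            return match.group(0)  # unknown style: leave it alone
        return f'{match.group(1)}class="{name}"'

    def fix_tag(match):
        return STYLE_ATTR_RE.sub(fix_attr, match.group(0))

    return TAG_RE.sub(fix_tag, html)


source = '<p style="color: red">style="color: red"</p><P STYLE=\'color: red\'>x</P>'
print(replace_styles(source, {"color: red": "sacc_001"}))
# <p class="sacc_001">style="color: red"</p><P class="sacc_001">x</P>
```

Note that the text between the tags, which happens to look like a style attribute, is left untouched.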

E: Add CSS Table

Now that we have replaced all the style attributes with class attributes, we need to create a style sheet. This is a fairly mechanical operation.

if (lc_flag) {
  s1 = "<style>\r\n";   
  }
else {
  s1 = "<STYLE>\r\n";   
  }
ix = 0;
while (ix < size) {
  s1 += FormatString(".%s%03d   {%s}\r\n", prefix, ix + 1, styles[ix]);
  ix++;
  }
if (lc_flag) {
  s1 += "</style>\r\n"; 
  }
else {
  s1 += "</STYLE>\r\n"; 
  }

We start by using s1 to open a style table, then loop through the styles to build the rules, and finally add the closing tag. Note the addition of \r\n in various places to keep the resulting HTML tidy. Also note that we are not attaching element names to the classes, so these become universal class names. Some authors use the *.name method.
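The same mechanical build can be expressed compactly in Python, shown here only as an illustrative sketch. build_style_sheet is a hypothetical function name; the prefix and the three-digit numbering follow the script above.

```python
def build_style_sheet(styles, prefix="sacc_", lowercase=True):
    """Build a <style> block with one universal class rule per style."""
    tag = "style" if lowercase else "STYLE"
    lines = [f"<{tag}>"]
    for index, style in enumerate(styles, start=1):
        # Doubled braces produce literal { and } around the declarations.
        lines.append(f".{prefix}{index:03d}   {{{style}}}")
    lines.append(f"</{tag}>")
    return "\r\n".join(lines) + "\r\n"


sheet = build_style_sheet(["color: red", "margin: 0"])
print(sheet)
```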

F: Insert and Write

ix = InString(code, "</head>", FALSE);
if (ix < 0) {
  MessageBox('x', "Unable to locate </head> tag.");
  exit;
  }
code = InsertStringSegment(code, s1, ix);

rc = StringToFile(code, fnDestination);
if (IsError(rc)) {
  MessageBox('x', "Unable to write file (%08X)\r\r%s", rc, fnDestination);
  exit;
  }

Finally, we insert our style sheet into the code just before the closing head tag and write the result to the destination file, which by default is in the same folder as the script.
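For comparison, the insertion step might look like this in Python. This is a sketch only; insert_before_head_close is a hypothetical name, and the search ignores case so that both </head> and </HEAD> are found.

```python
import re


def insert_before_head_close(html, style_sheet):
    """Insert the style sheet immediately before the closing head tag."""
    match = re.search(r"</head\s*>", html, re.IGNORECASE)
    if match is None:
        raise ValueError("Unable to locate </head> tag.")
    position = match.start()
    return html[:position] + style_sheet + html[position:]


page = "<HTML><HEAD></HEAD><BODY></BODY></HTML>"
print(insert_before_head_close(page, "<STYLE></STYLE>"))
# <HTML><HEAD><STYLE></STYLE></HEAD><BODY></BODY></HTML>
```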

Conclusion

Even if the result has a lot of superfluous classes, a search and replace in the code can remedy that. As stated from the start, this is quick and dirty. Some ideas (some of which have been touched upon in the article):

– add a front-end UI to select data;

– move to Mapped Text and the SGML Object for more robust parsing without limits;

– make the styles variable a table to expand processing;

– add a UI to combine and edit the rules with tools such as the CSSEditDeclaration function;

– add functionality for the cascading aspect of CSS to the style table build.

As is frequently said, “necessity is the mother of invention.” This script took about an hour to pound out to address a simple problem. After writing this blog, it becomes obvious that the door is open to do so much more.

Scott Theis is the President of Novaworks and the principal developer of the Legato scripting language. He has extensive expertise with EDGAR, HTML, XBRL, and other programming languages.