LDC #148: Unicode and UTF-8

Friday, August 16. 2019


When it comes to text, we as programmers live in a world that is a combination of 8-bit and 16-bit. Living in 8-bit is okay until you need to access characters outside the scope of standard ASCII and ISO Latin 8859. Then you need additional support, and that support comes from Legato’s functions to handle Unicode and UTF-8.

Introduction

By default, all Novaworks applications perform all file operations in 8-bit strings using an encoding called UTF-8. UTF, or Unicode Transformation Format, is a method of representing Unicode in an 8-bit environment. For western language documents, this is pretty efficient since the majority of characters are ASCII or ISO Latin. Straight Unicode requires 16-bit strings, though, and with ASCII text this wastes 50% of the space in every string since the top byte of each character is zero.
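To put a number on that 50%, here is a minimal sketch in Python (used only for illustration; it is not Legato code) that stores the same ASCII text both ways and compares the sizes. The sample text is arbitrary.

```python
# Compare the storage cost of the same ASCII text in UTF-8 and in 16-bit Unicode.
text = "Legato"

utf8_bytes = text.encode("utf-8")       # one byte per ASCII character
utf16_bytes = text.encode("utf-16-le")  # two bytes per character, no byte-order mark

print(len(utf8_bytes), len(utf16_bytes))  # 6 12
print(utf16_bytes.hex(" "))               # every second byte is 00
```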

More and more internal and SDK functions in Legato are being adapted to handle both 8-bit and 16-bit strings. In this blog I will give a little background about character sets and encoding, covering ASCII, Unicode, and UTF-8, and then talk about Legato support.

ASCII and Its 8-Bit Friends – A Seriously Abridged History

For those of us who have worked in the US or with early desktop computers, the limitations of 8-bit strings are quite familiar. Put simply, ASCII (American Standard Code for Information Interchange) became the standard for text interchange, representing the 95 printable characters you see on a standard US keyboard. The standard was adopted during the 1960s, with improvements made over the succeeding decades.

You can look up more information, but ASCII is a 7-bit format with 128 available positions:

– 32 control characters (like line return, page break, tab)

– Basic symbols @#$%^&*{} \ | / <>

– Letters A-Z, a-z

– Numbers 0-9

– Punctuation such as . , ( ) ? ! : ; " ' and `

– and, of course, word space

Over time, other competing formats, such as EBCDIC and Baudot, all slowly died out. In my career, I drifted into word processing and digital typesetting in the 1980s. It became clear very quickly just how limiting a set of 95 printable characters can be. Every word processor, typesetting package, and font foundry had different code pages and sets for the dozens or even hundreds of additional character glyphs.

Since most computers settled on the 8-bit byte as the basic unit of information, that left another 128 character positions of the 256 possible positions. For those additional 128 positions, various vendors made a number of additions that were not compatible with each other. For example, Apple had its own set, old MS-DOS had its set, and then there was the extension developed by ANSI (American National Standards Institute), which was combined with ISO-8859 and widely used by Windows (as code page 1252). Then, as desktop computers spread to other countries, the character sets obviously had to expand dramatically. Windows adopted many code pages, each assigning a different glyph to the same position. To represent more than 200 characters, formats such as Shift-JIS (Shift Japanese Industrial Standards) were developed to use multiple bytes to represent characters in a compatible 8-bit ASCII frame, but these formats were not compatible with each other or standardized. Clearly a better way of representing character glyphs from around the world was needed.

A Brief Description of Unicode

It was recognized early on that ASCII simply was not going to cut it and having a ton of vendor specific code pages was adding to the problem rather than solving it. The Unicode Consortium was created in January 1991. This was four years after the concept of a new character encoding, to be called "Unicode", was broached in discussions started by engineers from Xerox (Joe Becker) and Apple (Lee Collins and Mark Davis).

Basic Unicode began as a 16-bit character set. With 16 bits, 65,536 character positions are available. That seems like a lot, but in later years expansion was required, resulting in character positions being defined beyond the 16-bit range (up to 0x10FFFF, which takes 21 bits). Unicode incorporates ASCII and ISO-Latin 8859-1 as a subset. It is important to note that ANSI (Windows 1252) has characters in the 0x80 to 0x9F range that are NOT mapped to the same position. For example, the Euro currency symbol € is 0x80 in ANSI and 0x20AC in Unicode.
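The position mismatch is easy to see for yourself. The following is a small Python sketch (for illustration only, not Legato code) that relies on Python's built-in cp1252 and latin-1 codecs to show where an ANSI byte and an ISO 8859-1 byte land in Unicode.

```python
# Windows-1252 ("ANSI") byte 0x80 is the Euro sign, but Unicode places it at 0x20AC.
euro = bytes([0x80]).decode("cp1252")
print(euro, hex(ord(euro)))            # € 0x20ac

# ISO 8859-1 bytes map straight onto the same Unicode positions.
e_acute = bytes([0xE9]).decode("latin-1")
print(e_acute, hex(ord(e_acute)))      # é 0xe9
```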

Most fonts in Windows cover a portion of Unicode, while some specialize in Chinese and other languages. Part of the Unicode character set, 0xE000 to 0xF8FF, is reserved for private use, which is where vendor proprietary characters live, such as those used in PowerPoint for bullets.

Unicode Text Formatting — UTF

The problem with Unicode is that it is 16-bit data living in a world driven by 8-bit data channels and storage devices. In addition, there is a lot of existing material in the western 8-bit world. To deal with these issues, UTF, often called Unicode Text Formatting (but which actually stands for Unicode Transformation Format), was developed.

Since 2009, UTF-8 has been the dominant encoding for the Web, and WHATWG has declared it mandatory "for all things". As of July 2019, it accounts for better than 90% of all web pages. The default character set for both XML and HTML 5 is UTF-8. It is important to note that there are variations of UTF for differing channel widths, such as UTF-16, which uses pairs of 16-bit words to express Unicode positions beyond the 16-bit range.

UTF-8 takes a Unicode character outside the ASCII range and encodes it into multiple 8-bit characters, each with the high bit set. For example:

The glyph for “Euro” is the € symbol.

Would be encoded into:

The glyph for E2809CEuroE2809D is the E282AC symbol.

Note that the hex digits are actually a representation of the 8-bit codes, with each pair of digits standing for one byte. These are broken down as follows (chart from wikipedia.org):

Number of bytes	Bits for code point	First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
1	7	U+0000	U+007F	0xxxxxxx
2	11	U+0080	U+07FF	110xxxxx	10xxxxxx
3	16	U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
4	21	U+10000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

As a programmer, you don’t really have to worry about the actual mechanics of encoding, but it is good to understand how it works. Basically, any Unicode character above the value 0x7F (127) must be encoded into multiple bytes. The character’s bits are broken into sections of 3 to 6 bits and spread among 2 to 4 bytes, each of which can be transported and stored on any 8-bit system without being treated like binary data. When it is time to display the data, the rendering system reassembles the character bits and displays the correct glyph.
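For the curious, the chart can be turned into code fairly directly. This is a minimal sketch in Python (for illustration only; it is not how Legato or any particular library implements it, and the function name is my own) that encodes a single Unicode position by hand. It does not reject the surrogate range, which a production encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, following the byte layout in the chart."""
    if code_point < 0x80:                      # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                 # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")

print(utf8_encode(0x20AC).hex().upper())   # E282AC
```

The Euro symbol at 0x20AC falls in the three-byte row of the chart, which is why it comes out as E2 82 AC.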

If the data is not properly decoded in the browser or editor, you get:

The glyph for â€œEuroâ€ is the â‚¬ symbol.

Ever see that before? I have many times. That is either the result of the encoding not being set correctly or someone copying and pasting and not taking into account differences between the source and destination applications. Ironically, quite often the erroneous text is further encoded into UTF-8 or HTML character entities.
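The garbling is easy to reproduce. Here is a short Python sketch (for illustration only, not Legato code) that reads UTF-8 bytes as if they were Windows-1252, then encodes the damaged text a second time, and finally undoes one level of damage by reversing the wrong step.

```python
utf8_bytes = "€".encode("utf-8")             # b'\xe2\x82\xac'

# Wrong: treat the UTF-8 bytes as Windows-1252 text.
garbled = utf8_bytes.decode("cp1252")
print(garbled)                                # â‚¬

# Worse: encode the garbled text as UTF-8 again.
double_encoded = garbled.encode("utf-8")
print(double_encoded.hex(" "))                # c3 a2 e2 80 9a c2 ac

# One level of damage can often be undone by reversing the wrong step.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)                               # €
```

The repair only works when every garbled character still has a Windows-1252 position, so it should be treated as a rescue attempt rather than a guarantee.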

For the Web (and specifically for EDGAR), character entities can be used to reference characters either by name or by position (for example, &euro; or &#x20AC;). Note that each of those representations is in ASCII. This obviates the need for UTF, but the character entity &#x20AC; is considerably larger than the three UTF-8 bytes that appear as â‚¬ when left undecoded, an 8:3 difference in size. Since bandwidth is king on the internet, this difference adds up in large documents.
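The size difference can be checked with a few lines of Python (for illustration only, not Legato code). The variable names are my own.

```python
utf8_size = len("€".encode("utf-8"))     # bytes needed for the raw UTF-8 character
hex_entity_size = len("&#x20AC;")        # bytes needed for the hexadecimal entity
dec_entity_size = len("&#8364;")         # decimal form of the same entity
named_entity_size = len("&euro;")        # named form

print(utf8_size, hex_entity_size, dec_entity_size, named_entity_size)   # 3 8 7 6
```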

Novaworks Applications

Starting in 2018, Novaworks has been transitioning to Unicode via UTF-8. For example, all filename functions now expect UTF-8 filenames. For conventional ASCII, this changes nothing, and as a user and programmer using ASCII, there’s no apparent difference. For Windows-1252/ANSI, the names are all now converted to Unicode, stored, and processed as UTF-8.

For Legato, all file related functions employ UTF-8. A set of dialog functions support ANSI, UTF-8, and Unicode. There are some that support ANSI and Unicode, and others that only support ANSI.

The string datatype is used for 8-bit strings, including UTF-8, and wstring is used for Unicode. Certain functions will adjust for input parameters of each type. For convenience, some API functions will change their return value datatype, such that a function may return either string or wstring depending on the context. In other cases, there are two or three separate functions defined. Since UTF and ANSI look the same to the API (both use the string datatype), Legato provides separate UTF variants of functions where the two encodings must be handled differently.

It is important to note that the API can fudge the return datatype on the fly. Functions defined in your code cannot.

Encoding Conversion Functions

To move to and from various encoding formats, Legato provides a number of functions. It is important to note that dropping down to ANSI may result in lost character data. Let’s look at the functions.

Taking ANSI to Unicode and Back

There are four functions for moving between ANSI and Unicode/UTF. The first two convert the ANSI character positions to the equivalent Unicode positions:

wstring = AnsiToUnicode ( string data );

Note that the result is a 16-bit wstring datatype. To go to UTF:

string = AnsiToUTF ( string data );

This function results in an 8-bit string datatype. Going the opposite direction, we have the UnicodeToAnsi function:

string = UnicodeToAnsi ( wstring data );

The function moves through the source data and converts Unicode characters to ANSI character positions while converting each 16-bit word to an 8-bit byte. As mentioned above, Unicode shares the ASCII and ISO Latin portion of ANSI characters. Characters that can be converted to ANSI are converted. Those that cannot be converted are set as Ctrl+? (0x1F). When characters cannot be converted, an error condition is set and the number of failed character conversions is placed in the lower word of the error code. Use the GetLastError function to retrieve error information. The IsError function can also be used to determine if there were any conversion errors.
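To make the lossy behavior concrete, here is a rough Python analogue (for illustration only; this is not the Legato implementation, and the function name and return convention are my own assumptions). It substitutes 0x1F for anything without an ANSI position and counts the failures, which is the idea described above.

```python
def to_ansi_lossy(text: str, placeholder: int = 0x1F) -> tuple[bytes, int]:
    """Convert text to Windows-1252, substituting a placeholder byte for
    characters that have no ANSI position and counting the failures."""
    out = bytearray()
    failures = 0
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            out.append(placeholder)   # hypothetical stand-in for the 0x1F marker
            failures += 1
    return bytes(out), failures

data, failed = to_ansi_lossy("Price: 10 € or 1200 円")
print(data, failed)   # b'Price: 10 \x80 or 1200 \x1f' 1
```

The Euro symbol survives because ANSI has a position for it (0x80), while the yen kanji does not and is counted as a failed conversion.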

In a similar fashion, UTF can be converted to ANSI:

string = UTFToAnsi ( string data );

This function too will report character conversion errors.

Taking Unicode to UTF and Back

Unicode to UTF-8 conversion, and the opposite direction, is transparent. There are two routines to perform the conversion:

string = UnicodeToUTF ( wstring data );

and

wstring = UTFToUnicode ( string data );

UTF, HTML, XML and XHTML

There are plenty of cases where data needs to be converted between UTF and character entities. The first is taking an 8-bit UTF string and making it an SGML compliant ASCII string:

string = UTFToEntities ( string data, [boolean pcdata] );

The pcdata parameter is an optional boolean value specifying to treat the data as PCDATA (that is, not to encode protected characters). The default value for this parameter is FALSE. Characters are converted to SGML character entity values in the form of &#nnnn;, where nnnn is the translated decimal code. Values over 0x7F are converted. If a sequence is not in valid UTF syntax, each character is converted. Note the protected characters &, < and > are also processed unless the pcdata parameter is TRUE.
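As a rough picture of what this kind of conversion involves, here is a Python sketch (for illustration only; it is not the Legato implementation). The function name is my own, the use of &amp;, &lt; and &gt; for protected characters is an assumption, and the sketch assumes valid UTF-8 input rather than handling malformed sequences.

```python
def utf_to_entities(data: bytes, pcdata: bool = False) -> str:
    """Rewrite UTF-8 bytes as ASCII, using decimal character entities."""
    protected = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}   # assumed entity forms
    out = []
    for ch in data.decode("utf-8"):
        if ord(ch) > 0x7F:
            out.append(f"&#{ord(ch)};")        # e.g. 0x20AC becomes &#8364;
        elif ch in protected and not pcdata:
            out.append(protected[ch])
        else:
            out.append(ch)
    return "".join(out)

sample = "A & B cost 5 €".encode("utf-8")
print(utf_to_entities(sample))                 # A &amp; B cost 5 &#8364;
print(utf_to_entities(sample, pcdata=True))    # A & B cost 5 &#8364;
```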

Sometimes it is desirable to check a string to see if the content is valid UTF.

int = CheckUTFCompliance ( string data );

The function moves through the source data and decodes any UTF-8 sequences and counts the valid Unicode characters (characters above 0xC0). Characters between 0x80 and 0xC0 are not valid Unicode positions and if detected will cause the function to return an error.
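A comparable check can be sketched in Python (for illustration only; this is not the Legato implementation, and the function name and the -1 error return are my own assumptions). It leans on Python's strict UTF-8 decoder, which rejects stray high bytes and truncated sequences, and then counts the characters that needed more than one byte.

```python
def count_multibyte_characters(data: bytes) -> int:
    """Return the number of multi-byte UTF-8 characters, or -1 if data is not valid UTF-8."""
    try:
        text = data.decode("utf-8")      # strict decoding rejects stray or truncated bytes
    except UnicodeDecodeError:
        return -1
    return sum(1 for ch in text if ord(ch) > 0x7F)

print(count_multibyte_characters("10 € or £8".encode("utf-8")))   # 2
print(count_multibyte_characters(b"10 \x80"))                      # -1
```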

A string with character entities can be converted to UTF:

string = EntitiesToUTF ( string data );

Any syntax errors in character entities cause the character error count to be incremented and the existing character text to be passed through without alteration. Character entities can be SGML/HTML/XML numeric syntax as decimal or hex or can be HTML 5 character entity names.
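The same idea can be tried out in Python (for illustration only, not Legato code) with the standard library's html.unescape, which accepts decimal, hex, and HTML 5 named entities and passes malformed ones through untouched. Unlike the Legato function, it does not report an error count.

```python
import html

source = "&euro;10, &#8364;10, &#x20AC;10 and a bad &#xZZ; reference"

converted = html.unescape(source)        # entities become Unicode characters
utf8_bytes = converted.encode("utf-8")   # then store the result as UTF-8

print(converted)   # €10, €10, €10 and a bad &#xZZ; reference
```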

String Functions

Many string processing functions have been adapted to work with both ANSI/UTF and Unicode. Here are some examples showing different operating modes:

int = FindInString ( string source, string match, [int position, [boolean case]] );

int = FindInString ( wstring source, wstring match, [int position, [boolean case]] );

In these cases, the function will have two prototypes. The string type will be required to match for all the parameters. The next example changes the return value:

string = InsertStringSegment ( string target, string insert, int position );

wstring = InsertStringSegment ( wstring target, wstring insert, int position );

If the strings are UTF, you must exercise caution when inserting strings to avoid placing data into the middle of a coded segment. The same is true when deleting segments.
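Here is what that caution can look like in practice, as a Python sketch (for illustration only, not Legato code; both helper names are hypothetical). Continuation bytes in UTF-8 always have the bit pattern 10xxxxxx, so a position that points at one is in the middle of a coded segment and needs to be moved back to the start of the character.

```python
def is_char_boundary(data: bytes, position: int) -> bool:
    """True if position does not fall inside a multi-byte UTF-8 sequence."""
    if position <= 0 or position >= len(data):
        return True
    return (data[position] & 0xC0) != 0x80   # continuation bytes look like 10xxxxxx

def insert_utf8(target: bytes, insert: bytes, position: int) -> bytes:
    """Insert at a byte offset, backing up to the start of the character if needed."""
    while not is_char_boundary(target, position):
        position -= 1
    return target[:position] + insert + target[position:]

price = "10 € each".encode("utf-8")          # the € occupies byte offsets 3 to 5
broken = price[:4] + b"!" + price[4:]        # naive insert lands mid-character
fixed = insert_utf8(price, b"!", 4)

print(fixed.decode("utf-8"))                 # 10 !€ each
```

The naive version leaves the Euro symbol split in two, so the result is no longer valid UTF-8. Backing up to the character boundary keeps the segment intact.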

Another important function is the FormatString function:

string = FormatString ( string format, [mixed parameters ... ] );

wstring = FormatString ( wstring format, [mixed parameters ... ] );

The Legato API contains too many string functions to name here. If a function supports Unicode, it will have multiple prototype definitions in the documentation.

Dialog Controls

Windows dialog controls unfortunately do not natively support UTF-8. A number of Novaworks custom controls, such as the Data Control, do natively support UTF-8.

Certain set functions support both ANSI and Unicode, for example:

int = EditSetText ( int id, mixed data, [parameters ... ]);

The data parameter can be Unicode or ANSI (it can also be a numeric value).

For some controls, there are two or three functions defined to retrieve the different encodings. For example:

string = EditGetText ( int id, [string name], [int flags], [int size] );

wstring = EditGetUnicode ( int id, [string name], [int flags], [int size] );

string = EditGetUTF ( int id, [string name], [int flags], [int size] );

If Unicode is entered into a control and ANSI is retrieved, characters that are not supported are replaced with ‘?’ characters.

In Summary

Over time, I have been adding more and more UTF and native Unicode support to Legato. The key takeaways are:

– All file and path information within Legato and its host application is stored in UTF-8 and such data should be handled accordingly;

– When working with dialog controls, programmers should realize not all controls support all three encoding modes; and,

– Some functions, like OLE interfacing, require Unicode.

If you’re in need of specific functions to be updated to support Unicode or UTF, don’t hesitate to contact support and we will make them a priority.

Scott Theis is the President of Novaworks and the principal developer of the Legato scripting language. He has extensive expertise with EDGAR, HTML, XBRL, and other programming languages.