LDC #116: Getting Text into Shape

Thursday, December 27, 2018

When a program presents textual data to a user, it is usually (and hopefully) formatted for easy consumption. For the programmer, understanding how to format strings makes the process of making text presentable significantly faster and smoother. Legato offers an arsenal of tools to make formatting text simple and easy.

Formatting Strings

Most languages provide a method to format data into presentable string arrangements for human or computer consumption, and Legato is no exception. Formatted strings can be employed as part of a number of functions, including the FormatString, AddMessage, and MessageBox functions. For example, the following code:

MessageBox('i', "%d items processed, %d errors", items, errors);

will yield something like this (depending on the values of items and errors):

1022 items processed, 0 errors

Functions employing formatted strings will automatically convert both the variable type and formatting of the string content. For example, if a string is requested to be inserted and a floating-point number is provided as a parameter, the floating-point number will be converted to a string with formatting as specified and then inserted.

Formatted strings contain effectively two types of data: inert characters that are passed through and control segments that are replaced with formatted content. The source string is directly copied to the result until a ‘%’ character is encountered (which begins the control segment), at which point the conversion process replaces the content. Inert or ordinary characters are like any other string literal within the language.

Each control segment begins with ‘%’ and ends with a specific format character (detailed in the next section). Optional formatting information can be added between the two. C/C++ programmers will recognize this coding style since the underlying formatting function uses the C++ libraries. For example:

AddMessage("%3d - %-32s : %s", index, name, description);

might be used to dump formatted messages to a log with an index right aligned in a field of at least three characters, a name left aligned and padded to at least 32 characters, and a description.
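Because the article notes that the format follows the C/C++ conventions, the same template can be tried with standard C `snprintf`. The following is only a sketch in C, not Legato code; the helper name `format_log_line` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: builds one log line with the same template as the
   AddMessage example. Returns the length of the full formatted line. */
int format_log_line(char *out, size_t size, int index,
                    const char *name, const char *description)
{
    assert(out != NULL && size > 0);
    /* %3d   : index right aligned in at least 3 columns
       %-32s : name left aligned and padded to at least 32 columns */
    return snprintf(out, size, "%3d - %-32s : %s", index, name, description);
}
```

Calling `format_log_line(buf, sizeof buf, 7, "alpha", "first entry")` produces a line in which the description always starts at the same column, which is what keeps a log readable when names vary in length.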

Syntax of Formatted Strings

The crux of the formatted string is the ‘%’ character, which is used as an escape character to indicate that the content of a variable will be substituted. For example:

%d

converts a numeric parameter (or converted string) to a string character representation of that number.

%3d

converts the integer but will add padding of up to two spaces leading the number.

%03d

converts the integer and adds padding with leading zeros.

To display a percent symbol itself, the code ‘%%’ is added to the format string.
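The three integer forms and the escaped percent sign behave the same way in standard C, so they can be compared side by side with `snprintf`. This is a sketch in C rather than Legato, and `format_padding_demo` is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: formats one value with %d, %3d, and %03d, then
   appends a literal percent sign written as %%. */
int format_padding_demo(char *out, size_t size, int value)
{
    assert(out != NULL && size > 0);
    return snprintf(out, size, "[%d] [%3d] [%03d] 100%%", value, value, value);
}
```

With a value of 7 the buffer holds `[7] [  7] [007] 100%`.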

The basic format for each escaped item is as follows:

% flags size type

Where:

Flags — One or more characters, in any order, which modify the control segment:

‘-’  Specifies a left adjustment to the converted segment.

‘+’  Specifies that the number will always be converted with a +/- sign (does not apply to strings or hex).

‘0’  Specifies to fill with zeros rather than spaces.

‘#’  Specifies an alternative output form based on the type. For ‘o’ (octal), the first digit will always be zero. For ‘x’ or ‘X’ (hex), a leading prefix is added as 0x. For floating-point (‘e’, ‘E’, ‘f’, ‘g’ and ‘G’), the output will always have a decimal point. For ‘g’ and ‘G’, trailing zeros will not be deleted.
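The flags described here match the standard C `printf` flags (-, +, 0, and #), so their effect can be seen with a short C sketch. This illustrates the C behavior only, on the assumption that Legato follows it; `format_flags_demo` is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: shows each flag once.
   %-5d  left adjustment          %+d    forced sign
   %05d  zero fill                %#x    hex with 0x prefix
   %#o   octal with leading zero  %#.0f  decimal point always shown */
int format_flags_demo(char *out, size_t size)
{
    assert(out != NULL && size > 0);
    return snprintf(out, size, "[%-5d] [%+d] [%05d] [%#x] [%#o] [%#.0f]",
                    42, 42, 42, 255u, 8u, 3.0);
}
```

The result is `[42   ] [+42] [00042] [0xff] [010] [3.]`.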

Size — A number specifying a minimum field width. The converted segment will be created to be at least this wide or wider if required. If the resulting field contains fewer characters, it will be padded left or right as necessary to fill the size requirement. The padding character is normally a space but can optionally be a zero if the size is specified with a leading 0 for certain numeric types.

A period character within size separates the minimum field width from the precision for floating-point formats. For example, %2.1f always shows exactly one digit after the decimal point.
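Width and precision interact in a way that is easier to see than to describe. The sketch below uses standard C `snprintf`, with the hypothetical helper name `format_precision_demo`, and assumes Legato treats the two numbers the same way.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: formats one value with three width.precision pairs.
   %2.1f  : one decimal place, width grows as needed
   %8.3f  : three decimal places, padded with spaces to 8 columns
   %08.3f : the same, padded with zeros */
int format_precision_demo(char *out, size_t size, double value)
{
    assert(out != NULL && size > 0);
    return snprintf(out, size, "[%2.1f] [%8.3f] [%08.3f]", value, value, value);
}
```

For 3.14159 the output is `[3.1] [   3.142] [0003.142]`. The width is only a minimum, so the first field is three characters wide even though a width of 2 was requested.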

Type — A character dictating the type of formatting to be applied. The following types are permitted:

‘d’ or ‘i’  Signed decimal integer (“0, 5, -1”). Two type characters are provided for compatibility with C.

‘o’  Octal (“1”).

‘x’  Hexadecimal lower (“1a0e”).

‘X’  Hexadecimal upper (“1A0E”).

‘u’  Unsigned decimal integer (“1”).

‘c’  Character (“a”).

‘s’  String (“hello”). String fields may have one component for formatting:

%ns or %-ns

Where n is the overall field size in characters. By default, the field is right aligned in the character padding area. Adding a leading dash will left align the data.

‘f’  Float (“1.2”). Note that the default precision for floating-point values is 6 digits. To show more digits, specify the precision explicitly, for example “%1.10f”.

‘e’  Scientific lower (“1.2e+3”).

‘E’  Scientific upper (“1.2E+3”).

‘g’  Use the shortest representation: %e or %f (“123.45”).

‘G’  Use the shortest representation: %E or %f (“123.45”).

‘a’  Accounting (“12,345”). If the number is negative, the value is placed inside of parentheses. For example, -12345 becomes “(12,345)”. For padded fields, such as “%15a”, the last character is reserved as either a space or close parenthesis for alignment purposes. This provides for easy alignment of text fields.

Accounting fields may have one component for formatting of integer values:

%na or %-na

Where n is the overall field size in characters. By default, the field is right aligned in the character padding area. Adding a leading dash will left align the data.

For float values, the accounting field has an additional component:

%n.pa or %-n.pa

Where n is as described above and p is the number of decimal places. If p is omitted, the default value is 2.

‘A’  Accounting (“12,345”). If the number is negative, a dash is placed in front of the value, for example, -12345 becomes “-12,345”. For padded fields, such as “%15A”, the last character is reserved as either a space or close parenthesis for alignment purposes. This provides for easy alignment of text fields.

Accounting fields may have one component for formatting of integer values:

%nA or %-nA

Where n is the overall field size in characters. By default, the field is right aligned in the character padding area. Adding a leading dash will left align the data.

For float values, the accounting field has an additional component:

%n.pA or %-n.pA

Where n is as described above and p is the number of decimal places. If p is omitted, the default value is 2.
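The standard type characters in this list can be tried directly in C, since they come from the C library. The sketch below uses `snprintf` with two hypothetical helpers. It shows C99 output, which may differ in small ways from Legato (for instance, C always prints at least two exponent digits, while the article's sample shows “1.2e+3”).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: one unsigned value as octal, hex lower, hex upper,
   and unsigned decimal. */
int format_integer_types(char *out, size_t size, unsigned int value)
{
    assert(out != NULL && size > 0);
    return snprintf(out, size, "%o %x %X %u", value, value, value, value);
}

/* Hypothetical helper: one double as fixed, scientific lower, scientific
   upper, and shortest form. %f and %e default to 6 decimal places. */
int format_float_types(char *out, size_t size, double value)
{
    assert(out != NULL && size > 0);
    return snprintf(out, size, "%f %e %E %g", value, value, value, value);
}
```

With 6670 the first helper yields `15016 1a0e 1A0E 6670`, and with 1200.0 the second yields `1200.000000 1.200000e+03 1.200000E+03 1200` on a C99-conforming library.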

The above items are based on ANSI C++ except for ‘a’ and ‘A’, which are an addition to simplify the display of large numbers. The accounting mode does not take localization into account, meaning the currency formats of any particular region are not automatically reflected.
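Standard C has no accounting type, which makes it a useful way to see what the ‘a’ type saves the programmer from writing. The following C sketch imitates only the unpadded integer case described above (“12,345” and “(12,345)”). It is not Legato's implementation, and `format_accounting` is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes value with comma thousands separators and
   wraps negative values in parentheses.
   Returns 0 on success, -1 if the buffer is too small. */
int format_accounting(char *out, size_t size, long value)
{
    char digits[32];
    char grouped[48];
    unsigned long magnitude;
    int len, i, j = 0;

    assert(out != NULL && size > 0);

    /* Work on the magnitude so the most negative long does not overflow. */
    magnitude = value < 0 ? 0UL - (unsigned long)value : (unsigned long)value;
    len = snprintf(digits, sizeof digits, "%lu", magnitude);

    for (i = 0; i < len; i++) {
        /* Insert a comma whenever a multiple of three digits remains. */
        if (i > 0 && (len - i) % 3 == 0)
            grouped[j++] = ',';
        grouped[j++] = digits[i];
    }
    grouped[j] = '\0';

    if (value < 0)
        len = snprintf(out, size, "(%s)", grouped);
    else
        len = snprintf(out, size, "%s", grouped);

    return (len >= 0 && (size_t)len < size) ? 0 : -1;
}
```

Like the Legato type, this sketch ignores localization: the separator is always a comma.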

The number of control segments must match the number of provided parameters. If there are too many parameters, a warning will be placed into Legato’s last error value. If there are too few parameters, a run time error will occur.

Programmers should keep the resulting size of the output in mind. Any time the %s format type is used, large amounts of data can be added. For most functions using formatted strings, oversized content will be truncated and an overflow error is set as the last error code. The size limitations vary by application; for example, the AddMessage function is limited to 4,096 characters, largely because the log itself is limited to 4,096 characters. See the specific function documentation for details on limits.
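The same truncation concern exists in C, where `snprintf` cuts the output to fit the buffer and reports the length it would have needed. The sketch below shows that C pattern only. Legato reports overflow through its last error code as described above, and `format_checked` is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: formats "name: value" into a fixed buffer.
   Returns 1 if the result was truncated (or formatting failed), else 0. */
int format_checked(char *out, size_t size, const char *name, const char *value)
{
    int needed;

    assert(out != NULL && size > 0);
    needed = snprintf(out, size, "%s: %s", name, value);

    /* snprintf returns the length the complete result would have had, so a
       value at or above the buffer size means characters were dropped. */
    return needed < 0 || (size_t)needed >= size;
}
```

Checking the result each time a %s field carries user-supplied text is a cheap way to notice lost data before it reaches a log or message box.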

Functions That Support Formatted Strings

As functions have been added to the Legato SDK, the question is usually raised: is it likely there will ever be a need to add formatted data to this function? If the answer is yes, then that part of the internal library is connected to support formatted strings. For all functions, it operates in the same manner with the exception that there may be parameters leading the formatted string, for example:

int = AddMessage ( [handle hObject], string format, [parameters ...] );

In the above case, if the function detects a handle value, it is assumed it references an object to which the function can write, such as a log or console window. The next parameter is the formatted string.

The following is a list of some of the functions that support formatted strings:

AddMessage — Adds a formatted message to the default log, a specified log, or other object.

ConsolePrint — Prints a message to the default or specified console.

EditSetText — Sets data into an edit or static control with optional formatting parameters.

FormatString — Formats a string with embedded parameters for accounting, decimal, hex, etc.

MessageBox — Displays an ‘OK’ style message box.

ODBCQuery — Performs a structured query based on the connection handle and query string.

OkCancelBox — Displays an ‘OK/Cancel’ style message box.

PoolAppendFormattedString — Formats a string, appends it to a string pool, and returns the pool offset.

ProgressSetStatus — Sets a status message to a specified field.

StaticControlSetText — Sets text into a static control with optional formatting parameters.

StatusBarMessage — Sets text into the message area of the status bar with optional color.

WriteLine — Writes a formatted line of text to a Basic File Object.

YesNoBox — Displays a ‘Yes/No’ style message box.

YesNoCancelBox — Displays a ‘Yes/No/Cancel’ style message box.

YesNoCancelRememberBox — Displays a ‘Yes/No/Cancel’ style message box as a question with an option to remember the selection.

YesNoRememberBox — Displays a ‘Yes/No’ style message box as a question with an option to remember the selection.

printf — Provided for simple compatibility with C/C++ for console applications.

Most functions that support formatted strings will simply pass the template string through to the destination if additional parameters are not provided. In that case, all the ‘%’ codes are ignored. For example:

MessageBox("Here is a string that is not formatted %d HI %d");

will not result in a warning or error. The message box will display as follows:

Here is a string that is not formatted %d HI %d

Combining Data

The FormatString function is the go-to tool for creating complex strings. Strings can also be added and processed using simple string math and other functions. For certain data, such as tab delimited information, formatted strings can be used by embedding tab characters directly:

result = FormatString("Row %d\t%s\t%s\t%d", item, name, date, count);

This works unless the name or date parameter contains one or more tab (0x09) characters. For message boxes and the log, tab characters are not really processed and tend to be treated as spaces. For list boxes and data controls, they can be used to delimit columns.

In other cases, one might be tempted to structure comma delimited (CSV) data with the FormatString function. This is generally not a good idea unless the programmer can ensure that commas will not be within the injected strings. There are other SDK functions better suited to supporting CSV.
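To make the risk concrete, the C sketch below shows the kind of quoting a field needs before it can be dropped into a comma delimited template. It follows the common RFC 4180 convention and is not one of the Legato SDK functions; `csv_quote_field` is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies field into out. A field containing a comma,
   double quote, or line break is wrapped in double quotes, with embedded
   quotes doubled. Returns 0 on success, -1 if out is too small. */
int csv_quote_field(char *out, size_t size, const char *field)
{
    size_t j = 0;
    const char *p;

    assert(out != NULL && field != NULL);

    if (strpbrk(field, ",\"\r\n") == NULL) {
        /* Nothing special in the field: copy it unchanged. */
        if (strlen(field) + 1 > size)
            return -1;
        strcpy(out, field);
        return 0;
    }

    if (size < 3)
        return -1;
    out[j++] = '"';
    for (p = field; *p != '\0'; p++) {
        /* Conservative check: keep room for a doubled quote, the closing
           quote, and the terminator. */
        if (j + 4 > size)
            return -1;
        if (*p == '"')
            out[j++] = '"';
        out[j++] = *p;
    }
    out[j++] = '"';
    out[j] = '\0';
    return 0;
}
```

A name such as `Smith, John` becomes `"Smith, John"`, so a later `%s` in a template like `"%d,%s,%d"` no longer splits it into two columns.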

Data Types

As each formatted item is processed during the creation of the string, the data type of the source information is converted to the format necessary for the creation of the formatted section. It is best to use the appropriate data type to avoid problems and ambiguous conversion. For example, a format section requiring an integer will convert a numeric string. However, it will not convert the decimal component of a floating-point value.
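C does not convert parameters automatically, so the two conversions mentioned here have to be written out, which makes their effect visible. The sketch below is a C analogy, not a statement of how Legato converts internally. It assumes that "will not convert the decimal component" means the fractional part is dropped, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: a numeric string fed to an integer format.
   strtol parses the leading digits, so "42" becomes 42. */
int format_from_text(char *out, size_t size, const char *text)
{
    assert(out != NULL && text != NULL);
    return snprintf(out, size, "%ld", strtol(text, NULL, 10));
}

/* Hypothetical helper: a floating-point value fed to an integer format.
   The cast drops the fractional part without rounding. */
int format_from_double(char *out, size_t size, double value)
{
    assert(out != NULL);
    return snprintf(out, size, "%ld", (long)value);
}
```

Here 3.99 prints as `3`, not `4`, which is why a floating-point type such as %f is the better choice when the fraction matters.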

Mixing string and wstring as %s formatted output will result in a gross conversion from 8 to 16-bit or 16 to 8-bit depending on the context. Obviously, converting from 16-bit to 8-bit will potentially result in the loss of data.

Encoding

The most common string type used with formatted string functions is the 8-bit string data type. The content can be plain ASCII, ANSI, or encoded as Unicode Transformation Format (UTF). UTF can be used so long as the code processing the string handles the specific character encoding.

Over time, Legato SDK functions are being updated to support Unicode or the 16-bit wstring data type. For Legato 1.1r (GoFiler 4.25b), the FormatString and WriteLine functions have been updated. The string type of the data to be formatted determines the output string type.

Conclusion

Most programming languages support some method of formatting data so that users can experience nice and appropriate text presentation. Understanding the basic operation of formatted strings is a good weapon to have in the programming arsenal.

Scott Theis is the President of Novaworks and the principal developer of the Legato scripting language. He has extensive expertise with EDGAR, HTML, XBRL, and other programming languages.