LDC #157: Understanding Used versus Allocated Space in Arrays

Friday, October 18, 2019

Arrays, such as lists, tables or cubes, are a useful programming tool. However, to get the most from using an array, it helps to have a more thorough understanding of exactly how arrays are stored and allocated.

Lists, Tables and Cubes

Array support in Legato allows for up to three dimensions, or four if you count accessing the characters of string variables. Generally, an array with a single dimension is a list, one with two dimensions is a table, and one with three dimensions is a cube.

Any datatype can be used for an array, although it is typical to use strings when reading or writing information. Many functions read or write array information. For example, the CSVReadTable function reads an entire file of comma-separated data into a two-dimensional array, and the CSVWriteTable function writes a two-dimensional array out to such a file.

Array dimensions are typically referred to as axes. Conventionally, tables are referenced in rows and columns, where the row is the first array subscript and the column is the second. The order is not especially important, but certain functions will expect the data to be oriented in a particular manner. In addition, there are performance advantages related to how data in an array are ordered, which will be covered later. Defines are provided in the SDK for referencing array axes:

                                              /****************************************/
                                              /* ** Axis Definitions                  */
                                              /* Specified  as [1][2][3]  for up to a */
                                              /* maximum of three.  Depending the use */
                                              /* the names may value.  The dimensions */
                                              /* are specified as zero-based.         */
                                              /*  * Matrix/Cube Style (math)          */
#define AXIS_X                        0       /* X                                    */
#define AXIS_Y                        1       /* Y                                    */
#define AXIS_Z                        2       /* Z                                    */
                                              /*  * Data Style/Table                  */
#define AXIS_ROW                      0       /* Table Row                            */
#define AXIS_COL                      1       /* Table Column                         */
                                              /*  * Other                             */
#define AXIS_ALL                      -1      /* All (limited use)                    */

Many functions that require an axis declaration, such as the ArrayGetAxisDepth function, will default to the x dimension or row axis if none is provided.

Fixed or Auto-Allocate

Arrays can be fixed or auto-allocated. Fixed arrays are defined by placing the size of each dimension in the declaration:

string list[25];

string table[25][4];

Per this example, the list variable will have 25 elements, indexed from 0 to 24. The table variable will have 100 elements, indexed from 0 to 24 on the x axis and 0 to 3 on the y axis. What happens when we go beyond that fixed size?

Error 1180 line 5: [Variable 'list', axis 0 (X), requested index 25, max size 25]
Dimensional specification outside of defined space.

This is a run-time error and will stop the script. Note that I tried to access element index 25. Remember: array indices are zero-based, so index 25 references the 26th element, which does not exist.

Fixed array variables have their elements stored directly in the pool. This makes creating the variable very fast. If any declaration size is left blank, the axis is defined as auto-allocated.

string list[];

string table[][4];

As soon as an axis is auto-allocated, the array will be said to “stand alone”, meaning its storage is independent of other variables in the global or local pool. There are minor disadvantages to being stand alone, the most noticeable being performance on creation. Since the array must be created as a separate entity, each time it is created it takes a little more CPU effort. For example, if a function is called thousands of times and the array is constantly being created and released, it will increase the processing load.

Auto-allocated dimensions start at a size of 200 for the x axis and 10 for each other axis. This may seem like a lot, but an auto-allocated array is usually growing at some rate. Each time the script engine must go to the well for more space, the system’s memory manager has to do its work, and that all takes time. Take a look at this:

string          list[];
string          table[][];
string          cube[][][];

AddMessage("List:   Elements %5d", 
        ArrayGetAxisSize(list));
AddMessage("Table:  Rows     %5d   Cols  %5d", 
        ArrayGetAxisSize(table), ArrayGetAxisSize(table, AXIS_COL));
AddMessage("Cube:   X        %5d   Y     %5d      Z %5d", 
        ArrayGetAxisSize(cube), ArrayGetAxisSize(cube, AXIS_Y), ArrayGetAxisSize(cube, AXIS_Z));

The result when run:

List:   Elements   200
Table:  Rows      2000   Cols     10
Cube:   X        20000   Y        10      Z    10

The initial allocated size is set to be rather large (note that in writing this blog, I realized that the auto-size algorithm has an error, which will be corrected in a future release). That is okay and is actually the point. Bigger chunks of memory reduce the number of times reallocation is required and reduce the time the low-level heap manager spends shuffling data to manage all the program’s space.
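To put rough numbers on that trade-off, here is a small model written in Python rather than Legato. It assumes the store grows by a fixed chunk each time it runs out of room; the function name and the chunk sizes are hypothetical, and Legato's actual growth algorithm may well differ.

```python
def count_reallocations(items, chunk):
    """Count how often a store that grows in fixed 'chunk' steps must be enlarged."""
    allocated = 0
    reallocations = 0
    for used in range(1, items + 1):
        if used > allocated:
            allocated += chunk      # go back to the memory manager for more space
            reallocations += 1
    return reallocations

# Compare a few hypothetical chunk sizes while adding 10,000 elements.
for chunk in (10, 200, 2000):
    print(f"chunk {chunk:5d}: {count_reallocations(10_000, chunk)} reallocations")
```

Under these assumptions, the larger the chunk, the fewer trips to the memory manager, at the cost of holding space that may never be used.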

Also, as mentioned above, there are advantages to having the row of a table be the first subscript. In an auto-allocated environment, as rows are added they are appended to the end of the data store without reorganizing data. Columns, or the second subscript, are different. To add another column, the grid must be expanded and each row shifted to make room for one or more new elements in the column. This can become very expensive when appending to a large table.
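The difference can be sketched with a toy model, again in Python rather than Legato. It assumes the table is kept as one flat, row-major list, which is what the description above implies but is not confirmed for Legato's internals; both helper functions are hypothetical.

```python
def append_row(flat, rows, cols):
    """Append an empty row to a row-major store; existing cells stay in place."""
    flat.extend([""] * cols)
    return rows + 1, 0                      # (new row count, existing cells moved)

def append_column(flat, rows, cols):
    """Append an empty column; every row after the first must shift to make room."""
    widened = []
    for r in range(rows):
        widened.extend(flat[r * cols:(r + 1) * cols])
        widened.append("")                  # new cell at the end of each row
    flat[:] = widened
    moved = max(rows - 1, 0) * cols         # cells whose flat position changed
    return cols + 1, moved

# A 1000 x 4 table: add one row, then one column, and compare the cells moved.
flat = [f"r{r}c{c}" for r in range(1000) for c in range(4)]
rows, moved_by_row = append_row(flat, 1000, 4)
cols, moved_by_col = append_column(flat, rows, 4)
print(moved_by_row, moved_by_col)
```

In this model, appending a row moves nothing, while appending a column relocates nearly every cell already in the table.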

Key Names

Subscripts in Legato can be numeric zero-based integers or key names. Key names are convenient at a number of levels and are also supported by a number of programming languages. Names are limited in size and character range, so some care needs to be exercised in defining them. First, the maximum size is 63 characters. Also, a key name cannot contain control characters (codes < 0x20) and must have at least one non-numeric character. Key names are case sensitive, and they do not care about encoding.
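When key names come from outside data, it can help to screen them first. The following is a minimal Python sketch of the three rules above, not a Legato API; the function name is hypothetical, and it assumes the 63-character limit can be checked as a simple character count.

```python
MAX_KEY_LENGTH = 63     # maximum key name size given above

def is_valid_key_name(name):
    """Check a candidate key name against the three rules described above."""
    if len(name) > MAX_KEY_LENGTH:
        return False
    if any(ord(ch) < 0x20 for ch in name):
        return False                                    # control characters are not allowed
    return any(ch not in "0123456789" for ch in name)   # needs one non-numeric character

for candidate in ("phone_work", "12345", "tab\there"):
    print(repr(candidate), is_valid_key_name(candidate))
```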

Many API functions accept or return arrays and tables with key names. It is common for the columns in tables to have key names but not the rows. For example:

string        table[][6];
        
table[0]["name"] = "Johnson, James";
table[0]["email"] = "jj@weburl.com";
table[0]["phone_work"] = "333-555-1212";
table[0]["phone_home"] = "";
table[0]["phone_mobile"] = "333-555-2222";

The row employs numeric indices while the columns are named. Numeric subscripts can be used where key names have also been used, but the ordering is dependent on when the key names were first used. To avoid weird issues, it is generally best not to mix modes.

Key names are unique to the axis and variable. They are also not sorted or optimized, so use caution with respect to the usage, placement, and number of keys. For example, column key names are convenient, but creating 10,000 strings of “a0000” to “a9999” for row key names is bad practice.

The key name at a specific index can be determined by the function:

string = ArrayGetKeyName ( variable name, int index, [int axis] );

which returns the name at the specified zero-based position for the specified axis. All the keys can be returned in an array using the ArrayGetKeys function for a specified axis:

string[] = ArrayGetKeys ( variable name, [int axis] );

A key can also be located using:

int = ArrayFindKeyName ( variable name, string key, [int axis] );

which returns the index of the key. Alternatively, if you just want to check whether the key name exists, the ArrayHasKeyName function can be used:

boolean = ArrayHasKeyName ( variable name, string key, [int axis] );

For now, key names are created the first time they are used, even in a reference (for example as part of an if statement). Sometimes this is a bit annoying. On one project I found myself checking the key name to determine if an array had an XML attribute value, of which incoming tags could have hundreds. This started to be a major performance hit, so I ended up using the code:

if (ArrayHasKeyName(aa, "spaceBefore")) {
  if (aa["spaceBefore"] != "") { ...

rather than the code:

if (aa["spaceBefore"] != "") { ...

The first approach may seem overly complex, but the performance gain was dramatic because the script checks whether the key name exists in the array before doing anything with it, and that check does not add the key to the array. Many thousands of tags were being processed, and whenever the key name I was searching for didn’t exist, it was added by reference on each iteration of the search loop. Yuck! So take note: I am planning to change Legato so that referencing a name will no longer create a key name. Therefore, don’t rely on that quirk.
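The effect is easy to reproduce in a toy model. The Python class below is a hypothetical stand-in for a keyed array that creates a key the first time it is referenced; it only illustrates the behavior described above and is not how Legato stores keys internally.

```python
class KeyedList:
    """Toy keyed array that creates a key the first time it is referenced."""

    def __init__(self):
        self.keys = []          # unsorted, searched linearly
        self.values = []

    def has_key(self, key):
        return key in self.keys

    def __getitem__(self, key):
        if key not in self.keys:
            self.keys.append(key)       # the side effect described above
            self.values.append("")
        return self.values[self.keys.index(key)]

    def __setitem__(self, key, value):
        if key not in self.keys:
            self.keys.append(key)
            self.values.append(value)
        else:
            self.values[self.keys.index(key)] = value

probes = ["spaceBefore", "spaceAfter", "indent"]

direct = KeyedList()
direct["align"] = "left"
for name in probes:
    if direct[name] != "":              # reading creates the missing key
        pass

guarded = KeyedList()
guarded["align"] = "left"
for name in probes:
    if guarded.has_key(name) and guarded[name] != "":   # missing keys are never touched
        pass

print(len(direct.keys), len(guarded.keys))
```

The unguarded loop leaves three extra keys behind, and every later lookup has to search past them.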

There is one final point about API functions and key names, including functions such as the ParametersToArray function:

string[] = ParametersToArray ( string data, [string delimiter] );

Let’s look at the remarks in the documentation:

The ParametersToArray function is useful for transporting arrays of information as string. The property name is used as the array entry key name. The parameters must be in the form of:

property-name-1: value; property-name-2: value;

The property name must be less than the key size. If not, an error count will be set and the key name truncated. Duplicate property names will be removed with the last duplicate item used for the property value. Blank lines or lines with no property name are also ignored.

. . .

So read carefully and process carefully when you cannot control the source data.
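As an illustration of what those remarks imply for untrusted input, here is a rough Python model of the documented behavior. It is not the ParametersToArray implementation: the function name is hypothetical, the semicolon separator is taken from the format shown above, and the 63-character cutoff is an assumption based on the key name limit.

```python
MAX_KEY_LENGTH = 63     # key name limit from the Key Names section (assumed cutoff)

def parse_parameters(data):
    """Parse 'name: value; name: value;' into a dict, returning (result, error_count)."""
    result = {}
    errors = 0
    for entry in data.split(";"):
        name, sep, value = entry.partition(":")
        name = name.strip()
        if not sep or not name:
            continue                        # blank entry or no property name
        if len(name) > MAX_KEY_LENGTH:
            name = name[:MAX_KEY_LENGTH]    # truncate and count the problem
            errors += 1
        result[name] = value.strip()        # a later duplicate replaces the earlier value
    return result, errors

fields, error_count = parse_parameters("name: Johnson, James; email: jj@weburl.com; name: Smith;")
print(fields, error_count)
```

Note how the second name silently replaces the first, which is exactly the kind of surprise to watch for.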

Used versus Allocated Space

Let’s get back to the main topic. When an array is first initialized, it has zero defined elements. When an element is added, the depth of the array changes. The depth of the array refers to the number of items in the array at any given time. The size of the array is the space allocated for it. The size can change only if the array is declared as auto-allocated, and then only when the position of the element being added requires more room.

There are two functions, as mentioned earlier, that allow the programmer to examine the array dimensions:

int = ArrayGetAxisDepth ( variable name, [int axis] );

int = ArrayGetAxisSize ( variable name, [int axis] );

Given what we learned above:

int          a[20];
        
AddMessage("Depth:");
AddMessage("   Empty         %d", ArrayGetAxisDepth(a));
a[5] = 0;
AddMessage("   a[5] = 0      %d", ArrayGetAxisDepth(a));
a[5] = a[10];
AddMessage("   a[5] = a[10]  %d", ArrayGetAxisDepth(a));

AddMessage("Size:");
AddMessage("   a             %d", ArrayGetAxisSize(a));

Yields:

Depth:
   Empty         0
   a[5] = 0      6
   a[5] = a[10]  11
Size:
   a             20

The size will be 20 elements throughout, but the depth changes as elements are touched, even when an element is only referenced, as a[10] is here.
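One way to picture this is as a high-water mark. The Python class below is a hypothetical model inferred from the output above, not Legato's implementation: the size is fixed at creation, and the depth rises to one past the highest index that has been read or written.

```python
class FixedArray:
    """Toy fixed array: size is the allocated space, depth is the high-water mark."""

    def __init__(self, size):
        self.size = size
        self.depth = 0
        self._cells = [0] * size

    def _touch(self, index):
        if not 0 <= index < self.size:
            raise IndexError(f"requested index {index}, max size {self.size}")
        self.depth = max(self.depth, index + 1)

    def __getitem__(self, index):
        self._touch(index)                  # a read also raises the depth
        return self._cells[index]

    def __setitem__(self, index, value):
        self._touch(index)
        self._cells[index] = value

a = FixedArray(20)
print("Empty        ", a.depth)
a[5] = 0
print("a[5] = 0     ", a.depth)
a[5] = a[10]
print("a[5] = a[10] ", a.depth)
print("Size         ", a.size)
```

The model follows the same sequence of statements as the Legato script, and an index of 20 or more is rejected, much like the run-time error shown earlier for fixed arrays.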

Conclusion

Writing this got me thinking about testing the performance of certain algorithms with various types of optimization. Think about the idea of pivoting a table on load from a CSV source and exactly how much extra processing time would be required if row and column positions were swapped. My guess is a lot.

Scott Theis is the President of Novaworks and the principal developer of the Legato scripting language. He has extensive expertise with EDGAR, HTML, XBRL, and other programming languages.