The technical term for a web address is a Uniform Resource Identifier or URI. The contents of a URI can be very simple, such as ‘www.gofiler.online’, or very complex. A URL, the common name for a web address, is actually a Uniform Resource Locator, which is one form of URI. From within a Legato script, one may find the need to either build a query or perhaps take one apart. In this post we will explore how URIs work and how to work with them within Legato.
Friday, February 16. 2018
LDC #72: Get Crack'n - Working with URIs
A URI is actually made up of a number of components. Many are optional depending on the context. Consider the following figure (from wikipedia.org):
A simple address such as ‘http://www.cat.com/types.htm#fuzzy’ consists of:
http - indicates we want to use hypertext transfer protocol (web).
www.cat.com - a web site address.
types.htm - a file or resource on the server (within the web site).
fuzzy - a location within types.htm to jump to.
Using the figure above, we can clarify:
scheme - The scheme specifies what protocol will be used, such as ‘http’, ‘https’, ‘ftp’, or perhaps ‘mailto’. Depending on the application, the scheme may automatically be added. For example, most browsers will normally add the ‘http’ to an address you type in.
authority - The authority actually comprises three components: user information, the host, and the port. For common activities, the user information is omitted and access is anonymous. The host is what we commonly refer to as the web site or server name, which is generally a required component. Finally, the optional port number allows us to override which logical port number to use. By default, each scheme has a port associated with it. For example, http is port 80 and https is port 443.
path - The path specifies a resource for the host. It is normally a traditional path with or without a file name. Depending on how the host server is set up, this may specify a path within a ‘webroot’ or ‘ftp’ area. It may also be interpreted by the host and reference a database or other tools that look like a hierarchical file system. When omitted, such as for a website, the host may automatically use a default name such as ‘default.htm’ or ‘index.htm’.
query - The query is a series of key/value pairs that the host can pass to programs on the server side. For example, search parameters to look up data within a search engine make use of queries. As a side note, if the scheme is not secure, such as http versus https, programmers should never place sensitive data in the query.
fragment - A fragment specifies a named position, usually in an HTML file, as identified by the ID or NAME attribute. When present, the browser will scroll that position into view.
In the above figure, the hierarchical part refers to the larger tree of users, hosts, and paths that make up a unique resource.
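To make the breakdown above concrete, here is a small sketch that splits the sample address into its parts. It is written in Python using the standard urllib.parse module purely as an illustration; it is not Legato code, and the split_uri helper and DEFAULT_PORTS table are names made up for this example.

```python
from urllib.parse import urlsplit

# Default ports mentioned in the text; other schemes are left out of this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def split_uri(uri):
    """Return the main URI components as a dictionary (hypothetical helper)."""
    parts = urlsplit(uri)
    port = parts.port
    if port is None:
        # No explicit port, so fall back to the scheme's default if we know it.
        port = DEFAULT_PORTS.get(parts.scheme)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": port,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


components = split_uri("http://www.cat.com/types.htm#fuzzy")
for key, value in components.items():
    print(key, "=", value)
```

Note that the path comes back with its leading slash (‘/types.htm’) and that the user and query are empty for this address.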
As one can see, there is a lot of information that requires correct formatting. There are also special characters that should only appear in certain contexts. For example, the question mark (‘?’) should only appear between the path and the query data, and the hash (‘#’) character should only delineate the fragment or name. Along with the slash (‘/’), colon (‘:’), and at-sign (‘@’), these characters act as delimiters and should not appear as plain data within a scheme, user, password, host name, query or fragment.
Since query information can contain a variety of data, a method of encoding and decoding protected characters must be used. In addition to the above characters, the equal sign (‘=’), ampersand (‘&’), slashes, and control characters must be encoded when they appear within the data.
One final note about the host name: it can be a domain name or an IP address. The domain name returned is a combination of the subdomain, such as ‘www’, a second level domain such as ‘hotmail’, and a top level domain such as ‘.com’ or ‘.org’.
URL Encoding
As we said above, since special characters can be in the stream of the query or name, they must be encoded. This is performed by using the escape character % followed by a two-digit hex code. Conventionally, many characters are encoded:
! | # | $ | & | ' | ( | ) | * | + | , | / | : | ; | = | ? | @ | [ | ]
%21 | %23 | %24 | %26 | %27 | %28 | %29 | %2A | %2B | %2C | %2F | %3A | %3B | %3D | %3F | %40 | %5B | %5D
as specified in RFC 3986. Other characters can be encoded and there are variations between applications. However, the ones above must be encoded to avoid creating issues with parsing the entire URL.
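The table can be reproduced with a few lines of code. The sketch below uses Python's standard urllib.parse.quote function as a stand-in (it is not a Legato function); the percent_encode helper and the RESERVED string are names invented for this illustration.

```python
from urllib.parse import quote, unquote

# The characters listed in the table above.
RESERVED = "!#$&'()*+,/:;=?@[]"


def percent_encode(text):
    """Escape everything except letters, digits and a few unreserved marks."""
    # safe="" tells quote() not to leave "/" (its default safe character) alone.
    return quote(text, safe="")


# Build the character-to-escape table and show it.
table = {ch: percent_encode(ch) for ch in RESERVED}
for ch, code in table.items():
    print(ch, code)

# Encoding a piece of data and decoding it again.
encoded = percent_encode("fuzzy cats & dogs?")
decoded = unquote(encoded)
print(encoded)
print(decoded)
```

Each escape is simply the character's hexadecimal code, which is why a space (hex 20) shows up as %20 in the encoded sample.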
URL vs URI: What’s in a Name?
Just as a point of interest, you have probably noticed the terms URI and URL used interchangeably. A URI is a general method of addressing information over a network, typically the internet. A URL is actually a form of URI, and the term is frequently used to refer to a web address.
‘Cracking’ a URI
There are a number of methods of cracking a URI. Individual components can be extracted or the entire URI can be cracked as a whole. Let’s look at the function that cracks the entire URI, the GetURIComponents function:
string[] = GetURIComponents ( string uri );
The return value is a keyed string array containing all the discovered components:
Key Name | Description
scheme | Text of the specified scheme.
authority | Combined user, password, domain (host), and port.
host | Domain name or host name.
path | Path component which may contain a filename.
port | Port number.
scheme_type | A string as a decimal value specifying a code for a common scheme. See the URI_SCHEME_ codes in the SDK, for example URI_SCHEME_HTTP. If the scheme is not known, the field will be URI_SCHEME_UNKNOWN (0).
name | Name or namespace (also referred to as a fragment).
query | Query portion of URI. Note that the size of this component is limited to 1,024 characters. If the size is larger, use the GetURIQuery function, or the query can be manually parsed using the q_x value below.
user | Username if specified. For mailto:, this will be the recipient.
password | Password if specified.
a_x | Zero based index of the start of the authority. -1 if no index is available.
p_x | Zero based index of the start of the path. -1 if no index is available.
n_x | Zero based index of the start of the name. -1 if no index is available.
q_x | Zero based index of the start of the query. -1 if no index is available.
d_x | Zero based index of the start of the data. -1 if no index is available. Data only applies to non-URI schemes such as ‘data:’.
The array is returned, even on error. On an error condition, items that can be filled prior to the error will be returned. Overflow conditions are the most likely error condition. Queries, which can be rather long, are the most common source of overflow (the buffer is limited to 1,024 characters) and can be resolved by using the GetURIQuery function directly.
Zero-based string index values are provided so that a script can quickly locate the raw data for a component within the uri parameter.
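To show how index values of this kind can be used, here is a rough Python sketch that finds the start of the query and the name in a raw string and slices them out. This is only an illustration of the idea, not Legato's implementation: the component_offsets helper is hypothetical, and it assumes the index points at the first character after the ‘?’ or ‘#’ delimiter, which may not match what GetURIComponents reports.

```python
def component_offsets(uri):
    """Return zero-based offsets of the query and name, or -1 if absent."""
    hash_pos = uri.find("#")
    # A "?" only starts a query when it appears before any fragment.
    head = uri if hash_pos < 0 else uri[:hash_pos]
    question_pos = head.find("?")
    return {
        # Assumption: offsets point just past the delimiter character.
        "q_x": question_pos + 1 if question_pos >= 0 else -1,
        "n_x": hash_pos + 1 if hash_pos >= 0 else -1,
    }


uri = "http://www.cat.com/types.htm?color=grey#fuzzy"
offsets = component_offsets(uri)

# The query runs up to the "#" (or the end of the string if there is no name).
query_end = offsets["n_x"] - 1 if offsets["n_x"] >= 0 else len(uri)
raw_query = uri[offsets["q_x"]:query_end] if offsets["q_x"] >= 0 else ""
raw_name = uri[offsets["n_x"]:] if offsets["n_x"] >= 0 else ""

print(offsets)
print(raw_query)
print(raw_name)
```

Working from offsets like this avoids any fixed-size buffer, which is the reason a very long query can still be reached through its index.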
If you need just a piece of the URI, there are functions that will extract just that component:
string = GetURIAuthority ( string uri );
string = GetURIHost ( string uri );
string = GetURIName ( string uri );
string = GetURIPath ( string uri );
string = GetURIQuery ( string uri );
string = GetURISchemeString ( string uri );
int = GetURISchemeType ( string uri );
Operation of these functions is pretty much self-explanatory with the exception of GetURIQuery which will retrieve any size query up to 65,525 bytes. The GetURISchemeType function returns a scheme code for common schemes. Note that Legato only contains a handful of the hundreds of schemes defined by the internet community.
As mentioned above, the host name is a part of the authority. If you are parsing for a specific domain name, start with the host name rather than the authority. While the two may be the same 99% of the time, as soon as a port or user/password is added, a program that checks the authority may have a problem.
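The difference between authority and host is easy to see in a quick test. The following Python sketch (standard urllib.parse, not Legato) compares the two for a URI that carries a user, password, and port; the host_matches helper and the example.com address are made up for the illustration.

```python
from urllib.parse import urlsplit


def host_matches(uri, domain):
    """True if the URI's host is the domain itself or one of its subdomains."""
    host = urlsplit(uri).hostname or ""
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


uri = "https://scott:secret@www.example.com:8443/index.htm"
parts = urlsplit(uri)

authority = parts.netloc   # user, password, host and port combined
host = parts.hostname      # host only

print(authority)
print(host)
print(authority == "www.example.com")      # comparing the authority fails
print(host_matches(uri, "example.com"))    # comparing the host works
```

Matching on the host with a leading dot also keeps a look-alike such as ‘notexample.com’ from passing as ‘example.com’.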
Finally, you may be wondering about the ‘filename’. If available, it is part of the path data. Remember that internet resources do not need to point to a conventional file and might be served by a server-side script or a redirection. Other Legato tools such as the GetFilename function can be used to extract a potential filename.
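As a rough illustration of pulling a potential filename out of the path, here is a Python sketch using the standard posixpath and urllib.parse modules. It is not the Legato GetFilename function; uri_filename is a hypothetical helper that simply returns the last path segment.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def uri_filename(uri):
    """Return the last path segment, or "" when the path ends in a slash."""
    path = urlsplit(uri).path
    # basename() returns everything after the final "/" in the path.
    return unquote(posixpath.basename(path))


print(uri_filename("http://www.cat.com/types.htm#fuzzy"))
print(uri_filename("http://www.cat.com/breeds/"))
print(uri_filename("http://www.cat.com/search?q=fuzzy"))
```

The last case returns ‘search’, which looks like a name but is more likely a server-side script than a file, so treat the result as a candidate only.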
Cracking the Query
When a query string is retrieved using one of the above URI functions, it may contain a series of key-value pairs. These can be manually taken apart and decoded or decoded in one swoop by using:
string[] = URIQueryToArray ( string data );
which will return an array with named elements containing each value. The data parameter can optionally contain the leading ‘?’ character. The opposite is to create a query with:
string = URIArrayToQuery ( string data[] );
which takes an array with key names and creates a query string. Note that the key names are not encoded and therefore must be legal to use within the query as is.
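For readers who want to see the round trip spelled out, here is a sketch of the same idea in Python using the standard parse_qsl and urlencode functions. These are stand-ins and not the Legato functions; query_to_dict and dict_to_query are names invented here, and unlike the behavior described above, urlencode also encodes the key names.

```python
from urllib.parse import parse_qsl, urlencode


def query_to_dict(query):
    """Split a query string into decoded key/value pairs."""
    # Accept an optional leading "?", as the text describes for the data parameter.
    # Note that a dictionary keeps only the last value for a repeated key.
    return dict(parse_qsl(query.lstrip("?"), keep_blank_values=True))


def dict_to_query(pairs):
    """Build a query string, encoding keys and values as needed."""
    return urlencode(pairs)


pairs = query_to_dict("?q=fuzzy+cats&lang=en&note=a%26b")
rebuilt = dict_to_query(pairs)

for key, value in pairs.items():
    print(key + "\t" + value)
print(rebuilt)
```

The value of ‘note’ decodes to ‘a&b’, showing why the ampersand has to be escaped when it is data rather than a separator.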
If you need to encode or decode a portion, such as just a value, the functions are:
string = DecodeURI ( string source );
string = EncodeURIComponent ( string source );
These functions can decode and encode an entire string, provided that the data does not contain any query delimiter characters (such as ‘=’ and ‘&’). Obviously, an entire query cannot be encoded from a single source string since the delimiters themselves would be encoded.
The encode and decode functions can also be used for other purposes, such as encoding spaces and other information for a stream that may count space characters as delimiters.
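The point about delimiters is easier to see with an example. This Python sketch uses the standard quote and unquote functions as stand-ins for the encode and decode routines (it does not call the Legato functions); encode_value is a made-up helper name.

```python
from urllib.parse import quote, unquote


def encode_value(value):
    """Escape a single value so it can sit safely inside a query."""
    return quote(value, safe="")


value = "Tom & Jerry = friends"

# Correct: encode only the value, then attach the key and "=" by hand.
query = "pair=" + encode_value(value)

# Incorrect: encoding the whole thing also escapes the "=" after the key.
broken = encode_value("pair=" + value)

# Another use: protect spaces in a stream that treats them as delimiters.
token = encode_value("two words")

print(query)
print(broken)
print(token)
print(unquote(token))
```

In the broken version the key and value can no longer be told apart, because the only remaining ‘=’ has become %3D.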
Creating a URI
Generally, a URI can be created by simply concatenating string sections. Depending on the operation, most programs will access servers using no more than a couple of addresses. The more time-consuming part of assembling a URI is encoding the query. Fortunately, there is a function that will create a query from an array:
string = URIArrayToQuery ( string data[] );
The data array is scanned and each entry’s key name and value become a ‘key=value’ pair in the query.
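Putting the pieces together, the sketch below assembles a URI by concatenation with the query built from key/value pairs. It is written in Python with the standard urllib.parse module as an illustration only; build_uri is a hypothetical helper and www.example.com is a placeholder host.

```python
from urllib.parse import quote, urlencode


def build_uri(scheme, host, path, params=None, fragment=""):
    """Concatenate URI sections, encoding the parts that carry data."""
    # quote() leaves "/" alone by default, so the path keeps its structure.
    uri = scheme + "://" + host + quote(path)
    if params:
        # urlencode() turns each entry into an encoded key=value pair.
        uri += "?" + urlencode(params)
    if fragment:
        uri += "#" + quote(fragment, safe="")
    return uri


uri = build_uri(
    "https",
    "www.example.com",
    "/search/results.htm",
    {"q": "fuzzy cats", "page": "2"},
    "top",
)
print(uri)
print(build_uri("http", "www.cat.com", "/types.htm", fragment="fuzzy"))
```

Only the query and fragment values need real encoding work here; the scheme and host are fixed strings, which matches the observation that most programs talk to just a couple of addresses.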
Let’s Get Crack’n
Below is a little script that will crack a URI entered into a dialog:
//
//      URI Demo Program
//      ----------------
//
//      (c) 2018 Novaworks, LLC -- Free to be used without attribution.
//

#beginresource

#define UC_URI                  201
#define UC_DETAILS              202
#define UC_CRACK                203
#define UC_SCHEME               204
#define UC_HOST                 205
#define UC_PATH                 206
#define UC_NAME                 207
#define UC_QUERY_LIST           208

URICrackDlg DIALOGEX 0, 0, 272, 182
EXSTYLE WS_EX_DLGMODALFRAME
STYLE DS_MODALFRAME | DS_3DLOOK | WS_POPUP | WS_VISIBLE | WS_CAPTION | WS_SYSMENU
CAPTION "Get URI Components"
FONT 8, "MS Shell Dlg"
{
 CONTROL "URI", -1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 6, 4, 20, 8, 0
 CONTROL "", -1, "static", SS_ETCHEDFRAME | WS_CHILD | WS_VISIBLE, 20, 9, 245, 1, 0
 CONTROL "", UC_URI, "edit", ES_LEFT | ES_AUTOHSCROLL | WS_CHILD | WS_VISIBLE | WS_BORDER | WS_TABSTOP, 12, 17, 200, 12, 0
 CONTROL "Crack", UC_CRACK, "button", BS_DEFPUSHBUTTON | BS_CENTER | WS_CHILD | WS_VISIBLE | WS_TABSTOP, 219, 17, 40, 12
 CONTROL "", UC_DETAILS, "static", SS_LEFTNOWORDWRAP | WS_CHILD | WS_VISIBLE, 12, 33, 244, 8, 0
 CONTROL "Result", -1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 6, 44, 26, 8, 0
 CONTROL "", -1, "static", SS_ETCHEDFRAME | WS_CHILD | WS_VISIBLE, 28, 49, 237, 1, 0
 CONTROL "Scheme:", -1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 12, 58, 34, 8, 0
 CONTROL "", UC_SCHEME, "edit", ES_LEFT | ES_AUTOHSCROLL | ES_READONLY | WS_CHILD | WS_VISIBLE | WS_BORDER | WS_TABSTOP, 46, 56, 40, 12, 0
 CONTROL "Host:", -1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 93, 58, 28, 8, 0
 CONTROL "", UC_HOST, "edit", ES_LEFT | ES_AUTOHSCROLL | ES_READONLY | WS_CHILD | WS_VISIBLE | WS_BORDER | WS_TABSTOP, 124, 56, 136, 12, 0
 CONTROL "Path/File:", -1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 12, 74, 34, 8, 0
 CONTROL "", UC_PATH, "edit", ES_LEFT | ES_AUTOHSCROLL | ES_READONLY | WS_CHILD | WS_VISIBLE | WS_BORDER | WS_TABSTOP, 46, 72, 214, 12, 0
 CONTROL "Name:", -1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 12, 90, 34, 8, 0
 CONTROL "", UC_NAME, "edit", ES_LEFT | ES_AUTOHSCROLL | ES_READONLY | WS_CHILD | WS_VISIBLE | WS_BORDER | WS_TABSTOP, 46, 88, 214, 12, 0
 CONTROL "Query:", -1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 12, 105, 34, 8, 0
 CONTROL "", UC_QUERY_LIST, "listbox", LBS_NOTIFY | LBS_USETABSTOPS | WS_CHILD | WS_VISIBLE | WS_BORDER | WS_VSCROLL | WS_TABSTOP, 46, 104, 214, 74, 0
}

#endresource

                                                // Global: decoded query values,
                                                // shared with the list box action
    string      qd[];

                                                // Program Entry
int main() {
    DialogBox("URICrackDlg", "uc_");
    return ERROR_NONE;
    }

                                                // Crack and Display
void crack_uri() {

    string      s1;
    string      uc[20];
    int         ix, size;

                                                // Clear the previous results
    EditSetText(UC_SCHEME, "");
    EditSetText(UC_HOST, "");
    EditSetText(UC_PATH, "");
    EditSetText(UC_NAME, "");
    EditSetText(UC_DETAILS, "");
    ListBoxReset(UC_QUERY_LIST);

                                                // Get the URI, nothing to do if empty
    s1 = EditGetText(UC_URI);
    if (s1 == "") {
      MessageBeep();
      EditSetText(UC_DETAILS, "* * * * enter a URI to crack * * * *");
      return;
      }

                                                // Crack the URI; on error, report it
                                                // but continue since partial results
                                                // are still returned
    uc = GetURIComponents(s1);
    if (IsError()) {
      EditSetText(UC_DETAILS, "Error: %08X %s", GetLastError(), GetLastErrorMessage());
      MessageBeep();
      }

                                                // Display the components
    EditSetText(UC_SCHEME, uc["scheme"]);
    EditSetText(UC_HOST, uc["host"]);
    EditSetText(UC_PATH, uc["path"]);
    EditSetText(UC_NAME, uc["name"]);

                                                // Break up the query and list each
                                                // key and value separated by a tab
    qd = URIQueryToArray(uc["query"]);
    size = ArrayGetAxisDepth(qd);
    ix = 0;
    while (ix < size) {
      s1 = ArrayGetKeyName(qd, ix);
      s1 += "\t" + qd[ix];
      ListBoxAddItem(UC_QUERY_LIST, s1);
      ix++;
      }
    }

                                                // Dialog Control Actions
void uc_action(int c_id, int c_ac) {

    int         sx;

                                                // Crack button pressed
    if (c_id == UC_CRACK) {
      crack_uri();
      return ;
      }

                                                // Double click on a query item:
                                                // crack that item's value as a URI
    if ((c_id == UC_QUERY_LIST) && (c_ac == LBN_DBLCLK)) {
      sx = ListBoxGetSelectIndex(UC_QUERY_LIST);
      if (sx < 0) { return ; }
      EditSetText(UC_URI, qd[sx]);
      crack_uri();
      return ;
      }
    }
The script allows a URI to be entered and then disassembled. As an aside, after doing a number of presentations on computer security, I thought it would be nice to crack some of those probing URIs sent in junk mail to see what is in them. Running the script:
You can see the query drives through Microsoft’s safe links processor. One of the query items is the original link within the email. When you double click on a query entry, it reprocesses that data:
What does all the displayed query stuff mean? Well, that should be left for another discussion, and I am not even sure this domain is legitimate. It was, after all, from junk email. But you can see the power and utility in URI cracking from this example.
The example script breaks into two major components: a dialog frame and a URI cracking routine. The dialog frame consists of the dialog resource and the program main() function, which opens the dialog box. The dialog box has a text control for entering the URI and a ‘Crack’ button (which has the BS_DEFPUSHBUTTON style, allowing the Enter key to invoke the crack function), along with read-only fields and a list box for the results. There is no load() function because it’s not really needed. One could set tab stops on the query list box, but we are just using the default tab stop positions.
The main control runs through the dialog action() routine. The first control monitored is the UC_CRACK button. When the button is clicked, the action routine is entered with the button’s control ID. The only action code is click, so we can just run the crack_uri() function on the button press. The second action is for the list box. It is checked for UC_QUERY_LIST and the notification action LBN_DBLCLK. Note that the list box can send many types of notifications, so we have to filter out the ones in which we’re not interested. Once we receive a double click, we get the selection index, check it for validity, set the value into the URI control, and then run the crack_uri() function.
The crack_uri() function clears the dialog controls containing the display results, grabs the URI data, and attempts to get the components. Note that if the GetURIComponents function has an error, the error detail is displayed but the function continues. This is because the function can still return an array of components after encountering an error, as mentioned before.
The query data is taken apart by the URIQueryToArray function and stored in a global static allowing for the list box double click action to access a specific query item. As an alternative, we could also retrieve the string from the list box and pull out the query value, which means we would not need the global static qd variable.
Conclusion
Our demo script has a lot of code wrapped around two important functions. These functions, the GetURIComponents and URIQueryToArray functions, significantly reduce workload by performing the heavy lifting for breaking up URI data. The opposite function, the URIArrayToQuery function, covers creating a query string. The basic encode and decode routines can be used in areas other than URIs.
Next time you have a nasty URI and you wonder what is in it, fire up this demo script and take a look.
Scott Theis is the President of Novaworks and the principal developer of the Legato scripting language. He has extensive expertise with EDGAR, HTML, XBRL, and other programming languages.
Additional Resources