Wednesday, November 11, 2015

The Subtle Art of Substitution - Part 1

A simple text interpreter that allows code to be invoked by natural language words.     Part 1.

Over the weekend and in various bits of spare time I have been developing a tiny text interpreter in C, as part of the larger project of creating some low-overhead tools to run on various microcontroller targets.  The toolset will eventually include assemblers and compilers for some custom soft-core processors - but first I need the means to interpret typed text words and execute blocks of code if the word is recognised.

Why this is useful

This text interpreter is intended to provide a more human friendly interface layer to my SIMPL interactive programming language.  Writing in high level, more meaningful natural language will greatly enhance the speed at which SIMPL code can be generated.

A natural language interface makes programming tasks much easier.  Real words are more memorable than individual ascii characters, and it all makes for more readable code listings. Whilst SIMPL might use lower case "a" to initiate the analog read function, typing "analog" is a lot more reader friendly. An interpreter that follows a few simple parsing rules can offer a much increased speed of programming, yet be modest in the amount of on-chip resources utilised to do this.  The code to implement the interpreter is about a 2K to 3K overhead on top of SIMPL - but that will include listing, editing and file handling utilities too.

Substitution and Assemblers

A text interpreter and its ability to execute blocks of code based on parsing the text commands or file it receives is a fundamental part of utility programmes such as assemblers and compilers. Here a set of keyword mnemonics representing instructions can be interpreted and used to assemble machine code instructions by direct substitution.

With a simple text interpreter we can move out from the realms of numerical machine language, and implement the likes of assemblers, dissassemblers and even compilers.

In the case of an assembler, the wordset will comprise of the mnemonics used by the target processor - and the interpreter will merely substitute the human readable mnemonic for the machine instruction numerical opcode.

For example, a certain processor may have an instruction set including mnemonics such as ADD, AND, SUB, XOR etc. The role of the text interpreter is to find these words within the text buffer or input file and create a table consisting of direct machine code instructions, subroutine call addresses and other variables to be passed via the stack to those subroutines.

At a level above the assembler is the compiler.  This also takes text based input and generates machine code to run on a specific processor.  However, compilers are very complex pieces of software, and it is more likely that I will find an alternative solution  - given long enough.

Why do this?

The purpose of the text interpreter is to provide a natural language text interface for a small, resource limited microcontroller - in a similar style to what was provided with the various BASICs of the late 1970's. It's remarkable to think that some fully functioning basics fitted into 4K ROM and 1K of RAM - solely by some very clever programming tricks - in raw assembly language.

Fortunately most embedded programming these days does not have to resort to raw - assembler, and C has become the preferred interchange language layer for portability.  C code written for an Arduino, may be fairly easily ported to an ARM - provided that low level I/O routines - such a getchar and putchar are available for the target processor.

Coding up a text interpreter is a good exercise in C coding - and as I am not a natural born coder, any meaningful coding exercise is good practice.  I also enjoy the challenge of making something work from scratch block by block - rather than being over reliant on other peoples' code, that I don't even pretend to understand.


As a bare minimum, we assume that the microcontroller target can provide a serial UART interface for communicating with a terminal program. I have recoded Serial.print and it's associated functions to use a much smaller routine - which saves memory space.

Ideally the microcontroller should have at least a couple of kilo-bytes of RAM for holding the dictionary and headers making it possible to implement it on anything from an Arduino upwards.

The text interpreter is an extension of the SIMPL interpreter, and can be used for programming tools such as text editors, assemblers, compilers and disassemblers. It provides the means to input text, analyse it for recognised keywords and build up a dictionary and jump table.

Borrowing from Forth experience, the text interpreter (or scanner) will look for a match on the first 3 characters of the input and the length of the word.  As a word is typed in, it will initiate a search of the dictionary (of already known words). If a match is found, the word will be substituted for a 4 digit  (16 bit) jump address. If the word is not matched, it will be added in full to the dictionary table.

This sounds all very Forth-like, and indeed it is, because it is a proven means to input new text data into a processor's memory using minimum of overheads. The dictionary structure is simple enough that it can easily be parsed for a word-match, and also processed for editing and printing.

As each Forth definition is effectively just a line of text it can easily be  handled with a line-editor - again a simple task for a resource limited processor.

Numbers are handled as literals. A quick scan of the text with "is_a_num" will reveal whether it is numerical text - if so it should be converted to a signed integer and put onto the stack.

The output of the text interpreter should be a series of call addresses relating to the functional blocks of code that perform the routines associated with the keyword.  In the case of the assembler example, the mnemonics can be translated directly using a look-up table which converts them directly into the machine instruction of the target processor - this is especially relevant if the target is a stack machine - such as the J1 forth processor.

Charles Moore struck on the idea of a language that was designed for solving problems.  He envisioned having separate vocabularies for each problem he wanted to solve.  For example his assembler application would use a vocabulary tailored to that application - namely the mnemonics as keywords, similarly the SIMPL language would utilise a vocabulary that supported the SIMPL command set. Thus by pointing to a different vocabulary in flash, the processor can readily swap between contexts.

Hop, Skip and Jump, - the Mechanics of Text Interpretation

Short of providing a flow chart - the description below describes the operation of the text interpreter.

The text interpreter will parse through lines of text, taking each "word" as defined by a group of characters terminated by white-space, and check through a list of dictionary words for a match. If there is a match, then the newly scanned word is either a system keyword or a new one that the user has previously added to the dictionary.

If the word does not generate a match with any existing keywords then it is added to the end of the dictionary - thus allowing a match the next time it is used.

In addition to the dictionary, there is a separate array of records that will be known as the "headers". The headers consists of a shortform record of all of the words in the dictionary.  The purpose of the headers is to allow an efficient search to be performed on the dictionary entries - as words are listed in the headers by their first three characters and their length.  A match on the first 3 characters and the length was proven many years ago to be an effective and efficient means of word recognition- see section 3.2.3 here

Once the header of scanned word has been deemed to match the header of one already in the header table, a jump address pointer can easily be calculated - its actually generated as part of the matching routine.  This jump address pointer is decoded by a look up table to generate an actual 16-bit jump address.

For compactness and efficiency,the word matching routine is limited to a maximum vocabulary of 255 words - which is more than enough for most applications.

The text interpreter deals with lines of code.  At some point their will be an accompanying package that implements a line editor, as the first step towards a full, screen editor.

The input buffer of a terminal program may be some 250 to 300 characters long. This is more than adequate space to define most sequences of command words.  Indeed - it may be beneficial to restrict the input buffer to say 128 characters - as this is what can be displayed sensibly on an 800 x 600 VGA screen.

Word storage format

The shortform entries stored in the dictionary headers can be saved as a group of 4 bytes, consisting of the first 3 characters and the length byte. The routine that searches the headers for the match automatically generates the jump address pointer allowing a lookup to the actual jump address from a table.

        Char1, Char2, Char3, Len
Byte     0         1           2        3

So a word can be expanded by knowing its length and the dictionary pointer to its 1st character
The jump address is shortened to a single byte fast look-up from a table.


It's taken a bit longer than expected, but after an intensive day thinking, re-thinking, then coding, the tiny (2K) text interpreter is now starting to take shape.

I have put an interim version (#45) on this github gist

The interpreter is written in fairly standard C so it can be ported to a number of devices.  If implemented using Arduino using the Serial.print library it uses about 4142 bytes of flash and 1897 of RAM.  By using much more efficient custom UART routines for serial input and output, this can be massively reduced to just 2002 bytes of Flash and 1710 bytes of RAM.

Part 2 of this posting will look further at features of the text interpreter and the SIMPL toolset.

No comments: