Sunday, November 29, 2015

Colour Coding - Part 1

The 800 x 600 VGA display is partitioned into 16 "Keyholes" or code windows

Colour Coding was a quick Sunday afternoon hack to see if there might be a better way of presenting a coding environment to the human user.

The code was written in "Arduino" C and ran on a ZPUino soft core processor hosted on a Papilio Duo Spartan 6 FPGA board.

This combination was chosen because the ZPUino supports VGA generation hardware, and can make use of the Adafruit GFX common graphics library.

This is just a trial mockup of how code might look if presented in a radically different format on a VGA screen - and a means of trying out some ideas, to see what might work and what won't.

These big, bold, bright displays can easily be generated with modest hardware and limited processing power.  They have obvious application in FPGA projects that need more than just a UART output - projects that create an interactive graphical user interface on a big LCD monitor.

Historical Note 

The traditional screen editor has been around for about 45 years - ever since serial terminals could rapidly update a whole page of text and show screen cursor and editor operations.

Expressing source code as a linear text file has always been accepted as one of the easiest means of getting source code into a compiler, so that the various functions may be compiled in the order that they appear in the text file. It's a throwback to paper tape and punched cards that were fed in and read in sequence until the end of the tape or card deck was reached.

This might not be the best way to organise source code for display - especially in the case of Forth, and certain object oriented languages, where the source consists of a lot of very short functions.

Forth traditionally organised its source into 1024 byte blocks - as these were seen to be a convenient size to edit, update and store on the disc.  A 1024 character block of source code is just 32 lines of 32 characters, and on a modern 1024 x 768 display - at a comfortable text size - this occupies about 1/15 of the screen viewing area.

The text in the white keyhole is 32 x 16 characters

Colourful Times

Text has traditionally been in monochrome, but more modern editors have started to add colour to the text in order to signify meaning.

Colour is something that is so simple to overlay on monochrome data - even in chunky VGA text - that it is surprising that more use has not been made of it. Indeed, when I edit pcb layouts in EagleCAD, the only way I can distinguish the top layer tracks from those of the bottom layer is by the use of bold colours.

The same process could be applied to text - making it more readable and conveying meaning through the colour attributes.

The text is almost the same as a screenful of ZX81 BASIC

Micro-Windows or "Keyholes"

Perhaps there might be a better way of presenting source code on the screen, rather than having to constantly scroll up and down the text file looking for the function that you wish to edit.

Using standard 1024 character Forth blocks, it would be practical to display up to 15 blocks arranged as a 5 x 3 array. The background colour of each block would be set to be different from that of its neighbours, for ease of visibility, and using the mouse or touch pad, each of these "micro-windows" could be brought into focus, to allow editing, compilation or execution of that block.
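As a sketch of the geometry - assuming an 800 x 600 screen and the 5 x 3 array described above (the structure and function names are mine, purely for illustration) - the keyhole rectangles and a neighbour-safe background colour could be computed like this:

```c
#include <stdint.h>

#define SCREEN_W  800
#define SCREEN_H  600
#define GRID_COLS 5
#define GRID_ROWS 3

typedef struct { int x, y, w, h; } Rect;

/* Return the pixel rectangle of keyhole n (0..14), numbered
   left-to-right, top-to-bottom across the 5 x 3 array. */
Rect keyhole_rect(int n)
{
    Rect r;
    r.w = SCREEN_W / GRID_COLS;   /* 160 px wide */
    r.h = SCREEN_H / GRID_ROWS;   /* 200 px tall */
    r.x = (n % GRID_COLS) * r.w;
    r.y = (n / GRID_COLS) * r.h;
    return r;
}

/* Pick a background colour index that always differs from the
   left and upper neighbours, by cycling a small palette along
   the grid diagonals. */
int keyhole_colour(int n, int palette_size)
{
    return (n % GRID_COLS + n / GRID_COLS) % palette_size;
}
```

With a palette of three or more colours, no two horizontally or vertically adjacent keyholes share a background.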

Each micro-window would have access to its own toolset: edit, compile, execute, save etc accessed from buttons along the base of the window.

Where a high level code definition was dependent on certain lower definitions, then those lower definitions could be colour-tagged to show the dependency tree.

"Micro-Windows" is such a dreadful constructed name that perhaps "porthole", "spyhole" or even "keyhole" may be more appropriate.

Small Objects of Desire.

Much of modern application development is done in object oriented languages, where objects are created from data structures, and those structures are manipulated by methods. An object plotted to a screen might take the form of a Red Ball, located at a certain x,y position on the screen and with a certain diameter and colour.  Further attributes could be used to describe its shading, transparency or what graphics layer it resides on.  The use of keyholes might be a good way to present the attributes to the developer, such that they could be edited.  Likewise the method scripts that manipulate the objects could be viewable through the porthole system.
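The Red Ball example might be sketched as a plain C structure - a hypothetical layout of my own, where each field is one attribute that a keyhole editor could expose:

```c
#include <stdint.h>

/* A hypothetical drawable object, as it might be presented for
   editing in one keyhole - each field is one editable attribute. */
typedef struct {
    int16_t  x, y;          /* screen position              */
    int16_t  diameter;      /* size in pixels               */
    uint8_t  r, g, b;       /* colour                       */
    uint8_t  shade;         /* 0 = flat .. 255 = full shade */
    uint8_t  transparency;  /* 0 = opaque .. 255 = clear    */
    uint8_t  layer;         /* graphics layer it resides on */
} Ball;

/* Construct a red ball at a given position - the sort of method
   a keyhole's toolset buttons might invoke. */
Ball red_ball(int16_t x, int16_t y, int16_t diameter)
{
    Ball b = { x, y, diameter, 255, 0, 0, 0, 0, 0 };
    return b;
}
```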


Another questionable made-up word - but one that describes the selection and subsequent viewing of a particular block of code based on its colour.  A great deal should be possible by partitioning the overall screen view into zones, and sub-zones, defined by colour.  When the mouse hovers over or lands in a zone or subzone, the options are immediately highlighted.

Multi Processor Arrays

The array of keyholes might be a good way to view tasks that are distributed across arrays of multi processors.


The Frustrations of a Casual C Coder

mbed to the Rescue!

Last weekend I had all the frustration of trying to develop code for the STM32F746, and was forced to use Keil's code-size-limited IDE, and an unwelcome foray into STM's hardware abstraction layer (HAL).

Previously my benchmark tests had been developed on a mix of the Arduino, STM32-Arduino, CooCox and Keil MDK IDEs. Trying to juggle so many IDEs and HAL as well was becoming counter productive and not at all a rewarding experience. There has to be a better way.

During the week, I started to look again at using the STM32F7 Discovery board for a small project, and was pleased to see that this target platform is now fully supported by mbed.

I had dabbled with mbed last year when building an experimental motor control board and found it a useful means to put together small applications quickly - so I thought I'd give it another go using the STM32F7 Discovery board.

I can now say that I was pleased with the progress made using mbed on this board - so I have decided that from now on I will try to standardise my ARM project developments on the mbed compiler.

The mbed online IDE presents a clean modern look, is easy to use and removes a lot of the complex clutter that a conventional multi-platform IDE presents. The only downside is that the online compiler does not directly support debugging - but a separate project GCC4MBED makes use of the CMSIS-DAP debugging and programming output via a USB port.

It appears that if you want to work across several targets, there is still no one solution that fits all, and it will always be necessary to do some juggling between different toolchains.

The rest of this post is an autobiographical account of my various dealings with microcontrollers - over the last 35 years.

Early Beginnings.

My first introduction to personal computers was over 35 years ago, when my secondary school bought an early Z80 machine, the Research Machines RM-380Z, housed in a 19" rack. These machines were bought by the Isle of Man Board of Education, under a government initiative to introduce computers into the island's schools - a full 2 years ahead of the BBC Micro initiative.

So my early experience of personal computers was sitting at a black and white monitor, writing in fairly standard, interactive BASIC, and this is more or less how it remained for the next 5 or 6 years or so - through a progression of machines including ZX81, Spectrum, BBC B, Apple II, Tatung Einstein and several others.

Other experiences included writing a fair amount of assembly language - mostly in Z80, and mostly without the benefit of a machine that actually ran a Z80 assembler!   Assembly by hand, using paper, pencil and a table of mnemonics and opcodes is brutally tough, especially when you then have to use an Apple II to create an eprom from the hand-assembled code you had just written.  With a code, program, erase cycle of some 20 minutes - coding mistakes were frustratingly painful and time consuming.

However - as a naive 18 year old in the technology backwaters of the Isle of Man, I was unaware that tools like hex editors actually existed, and the possibilities of getting a Z80 assembler on a Z80 machine - in a small company that had invested a lot in an Apple II - seemed somewhat remote.

Sometime later, I wrote a Z80 disassembler in ZX BASIC, and I also adapted a Z80 monitor program to allow hex dump and hex editing - and that was the extent of my toolchain up until the late 1980s, when the Tatung Einstein - a CP/M machine I bought half price, end of line - became, I believe, the first machine I owned to offer these tools as part of the CP/M package.


Forth is a fascinating language that has captivated me since I was a teenager at school.  I probably first heard a rumour about it around 1982 - and that it had been used to control radio telescopes 10 years earlier. Byte Magazine ran a special edition on Forth in 1980 - which I later found in a technical library.

My first copy was ZX81 ROM-Forth - and it was a copy.  As students in an electronic engineering department we had access to an eprom programmer, so a good friend, Hadyn, and I bought a legitimate ROM between us from an advert in the back of a computer magazine, and then ran off a copy at the soonest practical opportunity. ZX81 ROM Forth was a sophisticated yet quirky product, somewhat ahead of its time, but sufficient to teach me the basics of the language. It was also much faster than interpreted ZX81 BASIC.  This was probably my first introduction to the notion of swapping ROMs to get a computer to do a completely different task.

In the Summer of '84 Hadyn and I took a day trip to London (from North Wales) to trawl the electronic music shops (electrosynth was all the rage) and electronic surplus stores off the Tottenham Court Road. There in "Henry's Radio" I spotted a Jupiter Ace - again end of line - so I snapped that up cheaply.

Forth was somewhat frustrating on those early UK 8-bit machines. The only means of saving a program was on an audio cassette, which was not entirely successful every time.

By the late 1980's I had more or less given up programming personal computers, became a user of other people's applications, and dropped high level coding entirely for about the next decade.  By the time I got back to it the world had moved on significantly....

I wrote some 68000 assembly language for a company that made scientific instruments, but this was all so alien to me compared to Z80 that I was "in at the deep end" and struggling to keep above water.  From hand-assembling a few lines of code to working on a multi-module project was a big leap - and not one that I managed to achieve successfully.  I left that company shortly afterwards....

PIC Practice

I had known about PICs since about 1995, but had no practical experience of them. I joined a telecoms company in late spring of 1998 and I worked on some telephone dialler products.  During this time I worked out some PIC machine code routines to perform many of the basic telephone signalling tones, including DTMF send and receive, V23 modem send and receive, American payphone "Nickel Tones" and several others.  This was a productive time for me, and with nothing but a PIC and a R-2R resistor ladder network as a primitive audio DAC - I developed code that in the wrong hands could create network chaos.  However wireline telecom development was very much on the way out, so there was another career change on the horizon.

A Baptism of C

I took a management role for a company in the Midlands that was doing early asset tracking and wireless telematics - using a GPS receiver coupled to a GSM modem module.  Their code was also running on a PIC, but this time in C.  In my year as a manager there, I watched other developers struggle with C, without myself having the experience to assist.

In 2005 I dabbled with a budget C compiler for PIC.  The language was still so alien to anything I had experienced, I put C firmly on the back burner for another few years.

A colleague urged me to have another look - telling me that "it was not too bad once you get into it". This would happen a few years further on - when I was first introduced to Arduino.

I took a job with a company in central London that was developing a smart energy monitor.  The source code for this was in PIC C for the receiver and 8051 C for the transmitter unit.  I was involved in product testing, and we built a test system and other gadgets, based around the Arduino.

Arduino was the first product that sufficiently de-mystified the C language for me to begin to make some progress with it.

My first exposure and early experiences with Arduino gave me sufficient confidence to stick at C - which conveniently brings us up to 2010.  The last 5 years of this rambling account, will be the subject of a later post.

Saturday, November 28, 2015

Experiments with STM32F7 Discovery

In this post I look at some possible uses for the STM32F7 Discovery board.  I look at the possibilities of getting it to generate a video display for an external VGA monitor.  I also look at an innovative means of creating a custom GUI - based on a low cost IC.

In late September I traveled up to Cambridge, to attend an ARM course hosted by ST Microelectronics as a vehicle to showcase their new Cortex M7 microcontroller.

The course described in detail some of the key features of the M7 architecture, and how its performance has been enhanced over the existing M4.  In the afternoon there were practical introductions to some of its DSP capabilities.  And of course, the main reason for going was to get a free STM32F7 Discovery board - worth about £40.

The F7 Discovery board comes with a load of hardware bundled in, notably a 4.3" capacitive touchscreen, ethernet, stereo audio, microSD card, camera interface, dual MEMS microphones - for directional sound analysis - and external 8MByte SDRAM and Flash (to demonstrate code execution from various external memory devices).

I had put the board to one side for a couple of months, whilst waiting for mbed to support this feature rich board.  Now at last, mbed has stable support - and I can begin experimentation.

As with all mbed platforms, there are a few code examples supplied to get you started  - in this case - writing text and graphics to the LCD, and reading the capacitive touchscreen.  The examples were easy to follow, and easy to modify to suit my own purposes.

I decided that it would be interesting to run my J1 Simulation on the Discovery board, by way of benchmarking it.  If the J1 Sim ran reasonably quickly on this extremely portable platform then it would be useful to try out some of my J1 assembler development on it - written in mbed C++.

The benchmark was reasonable - about 4.65 million J1 instructions per second.  This should be quite fast enough for now to develop the next stage of the project.

LCD and VGA Displays

Both  the STM32F42x M4 and the STM32F7xx M7 processors contain an LCD controller.  This generates the RGB parallel 8 bit video data and the horizontal and vertical sync signals for most modern LCD displays.   The  sync generators are entirely programmable, and will handle display resolutions of up to 1024 x 768.
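By way of illustration - using the commonly published 640 x 480 @ 60Hz figures rather than anything STM32-specific - the line and frame rates fall out of the sync timing parameters like this:

```c
/* Standard VGA 640 x 480 @ 60 Hz timing - the sort of numbers the
   programmable sync generators would be loaded with. */
typedef struct {
    double pixel_clock_hz;
    int h_active, h_front, h_sync, h_back;  /* pixels */
    int v_active, v_front, v_sync, v_back;  /* lines  */
} VgaTiming;

static const VgaTiming vga640x480 = {
    25175000.0, 640, 16, 96, 48, 480, 10, 2, 33
};

/* Horizontal line rate: pixel clock / total pixels per line. */
double line_rate_hz(const VgaTiming *t)
{
    int h_total = t->h_active + t->h_front + t->h_sync + t->h_back;
    return t->pixel_clock_hz / h_total;          /* ~31.47 kHz */
}

/* Frame rate: line rate / total lines per frame. */
double frame_rate_hz(const VgaTiming *t)
{
    int v_total = t->v_active + t->v_front + t->v_sync + t->v_back;
    return line_rate_hz(t) / v_total;            /* ~59.94 Hz */
}
```

Other resolutions are just different parameter sets loaded into the same generators.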

It is a fairly trivial matter to combine the digital data lines using a weighted value resistor network - which acts as a simple D to A converter and allows an analogue video signal to be recreated.   This is exactly what VGA monitors require - so it should be possible to get the M7 to generate reasonably good graphics on a flat screen monitor.  As much as I like the touchscreen LCD - text on it is a little small.
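Treating the weighted network as an ideal binary-weighted DAC driving the monitor's 0.7V full-scale input, the output for a given pixel code is easy to sketch (the function name is mine):

```c
/* Idealised output of an n-bit binary-weighted resistor DAC:
   each data line contributes in proportion to its bit weight,
   so the output is code / (2^n - 1) of full scale - 0.7 V for
   a VGA colour channel into its 75 ohm termination. */
double dac_volts(unsigned code, int bits, double full_scale)
{
    unsigned max = (1u << bits) - 1u;
    if (code > max) code = max;
    return full_scale * (double)code / (double)max;
}
```

In practice the resistor values must also account for the monitor's 75 ohm input impedance, which forms the bottom leg of the divider.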

This post to a forum confirmed that the technique is valid - by modifying a '429 Discovery board.

To put this plan into action, I will need to create another breakout board, in order to get at the necessary signals - unfortunately the F7 Discovery uses a BGA packaged processor, and virtually none of the required signals are accessible.

The board can use either a STM32F439 or a STM32F746 - as they are pin identical.  There should be at least 2MB of fully static SRAM and a block of SDRAM.

The new breakout board will have a modified Arduino MEGA footprint - to allow it to be compatible with my Papilio Duo and Computing shield hardware. These provide PS2 and VGA break-out connectors.


Getting my Papilio Duo FPGA boards earlier this year has opened up experimentation with computer hardware and softcore processors.

One of my goals is to put an FPGA and a reasonably powerful ARM Cortex M7 processor together on the same pcb. This will allow exploration of the FPGA hardware - but with a fairly standard ARM to provide system hosting.  It is likely that they will share some dual ported RAM.

For debugging the ARM will provide a user interface - allowing keyboard, mouse, USB and ethernet interfaces.  Code can be simulated at about 1/30th of real speed on the ARM, before being ported across to the FPGA.  This is going to be a challenging brain stretching project.

The intention is to create a workstation-like environment, with 1024 x 768 graphics on a large screen monitor.


It should be possible to prototype a lot of this before having to commit to a lot of new hardware.

The 100 pin STM32F746 BOB board allows most of the RGB lines to be accessed - enough to allow a full 8 bits of green, and 6 bits of each Red and Blue to be connected.  This will at least allow the system timing to be tested on a VGA monitor.  The Computing shield has a break off section which allows VGA combination of 4 bit RGB signals.

The STM32F746 has 320K of SRAM - and if we are using 4 bits of each RGB - then we cannot display much of a picture.  Instead - for experimentation - pack up the RGB as 3:3:2 into a single byte - that would then allow a 640 x 480 display.
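A sketch of the 3:3:2 packing (the function names are my own): at one byte per pixel, 640 x 480 needs 307,200 bytes - which just squeezes into the 320K of SRAM, though it leaves little room for anything else:

```c
#include <stdint.h>

/* Pack 8-bit R, G, B down to a single 3:3:2 byte - the compromise
   suggested above for fitting a 640 x 480 frame buffer into the
   STM32F746's 320K of internal SRAM. */
uint8_t pack_rgb332(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)((r & 0xE0) | ((g & 0xE0) >> 3) | (b >> 6));
}

/* Frame buffer size in bytes at one byte per pixel. */
uint32_t framebuffer_bytes(uint32_t w, uint32_t h)
{
    return w * h;   /* 640 x 480 -> 307,200 bytes */
}
```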

A lot of work has already been done in getting microcontrollers to output video.  Cliffle's Blog is a good place to start.

A Graphics Co-Processor? 

The idea of offloading the burden of driving a video display to a custom graphics processor has been around since the earliest days of the PC.  Traditionally the graphics system would involve a frame buffer RAM, into which the cpu would paint the pixels required for the video image. The cpu would usually use the vertical blanking period (about 50 blanked lines, or around 3.2ms) to update the data in the frame buffer.

Loading the frame buffer with the data for text characters and shapes is computationally intensive - and if it can be offloaded to a special graphics engine, huge resources are freed up in the cpu.  The other way of looking at this is that a much slower, simpler cpu could be used, if it no longer has the graphics overhead.

This is the approach adopted by FTDI, who have launched a series of embedded video engine (EVE) ICs - which are finding their way into embedded systems.  EVE does not just handle the video for the LCD; it also has a touchscreen controller interface and an audio generator - with a range of pre-recorded sounds.

EVE treats all the items that go to make up the video display as objects.  There are numerous built-in objects - such as buttons and sliders - that go to make up a modern GUI.  The host processor need only build up a display list of the objects that it wants to display, in terms of size, position, colour and other attributes - and this list is passed to the EVE device whenever the mcu needs to update the display.

As the display list is a fraction of the size of the objects that it represents, it takes very few resources to generate and update.  This means that a simple 8 bit microcontroller can easily generate a display list for a very slick looking GUI.  FTDI have emphasised this capability - by basing their development boards around the ubiquitous ATmega328P - commonly known as Arduino.
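To illustrate why the display list is so compact - this is not FTDI's actual command format, just a made-up entry structure of roughly the right shape - compare the bytes needed to describe objects against the bytes needed to repaint the pixels:

```c
#include <stdint.h>

/* A hypothetical display-list entry - an illustration only, not
   the real EVE command encoding. */
typedef struct {
    uint8_t  object;        /* e.g. button, slider, text */
    int16_t  x, y, w, h;    /* position and size         */
    uint32_t colour;        /* 24-bit RGB                */
} DisplayEntry;

/* Bytes needed to describe n objects as a display list. */
uint32_t list_bytes(uint32_t n)
{
    return n * (uint32_t)sizeof(DisplayEntry);
}

/* Bytes needed to repaint a full 16 bit per pixel frame buffer. */
uint32_t frame_bytes(uint32_t w, uint32_t h)
{
    return w * h * 2u;
}
```

Even a hundred objects amount to a few kilobytes of list, against a quarter-megabyte frame buffer for a 480 x 272 LCD - which is why an 8 bit micro can keep up.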

There are several key advantages to using this approach:

  • Future-proofs the GUI - it will work with any display up to 800 x 600 pixels.
  • Changes can easily be made - e.g. change the position, font, colour etc. of display objects.
  • Speeds up the development time of slick-looking GUIs.
  • The GUI is decoupled from the microcontroller resources - freeing up time and memory.
  • GUI object lists can easily be ported to different microcontrollers.
  • Capacitive or resistive touchscreen interface - if required.
  • JPEGs can easily be rendered from microSD.
  • GUIs can be knocked up easily - "try before you buy" - to meet sales & marketing aspirations.
  • The EVE chips may be purchased in one-off quantities from Farnell - for about £5.

    Revelation Time!

    The STM32F7 Discovery board is a remarkable board for the money, providing a rich mix of hardware, communication interfaces and memory. With its 216MHz STM32F746 ARM Cortex M7 microcontroller and 4.3" capacitive touchscreen display, it makes a very capable platform on which to develop embedded applications.  It represents probably the fastest system available that allows true bare metal programming - unhindered by an operating system.

    For anyone developing embedded systems requiring a touchscreen display, ethernet, audio and a fast 32 bit processor, the F7 Discovery is worth considering. All of the principal ICs on the F7 Discovery are available in LQFP or TSSOP packages - which means that the core hardware design could be recreated on a simple double sided pcb - suitable for hobbyist construction.

    When I first looked at the STM32F7 Discovery board, and saw the Arduino headers on the underside of the pcb - I thought that such a powerful board, restricted by design to just 20 feeble I/O lines, was a bit lame - kind of added at the last moment, as an Arduino afterthought.

    However, when I took a closer look at what signals had actually been routed out to the Arduino headers - I discovered that it is possible to get all but 1 of the signals for an RGB 2:3:2 display!  It almost looks as if the signals were planted on the Arduino headers on purpose - as a backdoor to a VGA display of up to 128 colours.  Now that did get my attention!

    A VGA shield could be easily built on a proto shield or even stripboard.

    Not only that - but two SPI ports and an additional UART appear on the headers - allowing for a PS2 mouse and keyboard, and an auxiliary serial debug port!

    Whoever laid out the F7 Discovery board was either a genius or a bluffer!

    A Pocket Workstation

    With the possibility of quite an easy hack to get 7 bit VGA from the Arduino headers, and add a PS2 keyboard and mouse - it occurred to me that the Discovery board had just redeemed itself as a prime contender for the proposed retro-workstation project.   It would certainly work well as a target board - to allow ideas to be tried before committing to further hardware.

    A simple shield carrying the VGA resistor networks, a VGA connector and 2 PS2 sockets could be made up very easily - on the new preferred 50x50mm board format.

    The Discovery board provides 8MByte of SDRAM, 16MByte of NOR Flash and a microSD card. This would be sufficient for much of the experimentation I want to do - after all, I am looking for minimal systems with modest resources. In terms of computing power, the Discovery board represents a respectable mid-1990s desktop.

    Other useful hardware is the ethernet connectivity, the OTG high speed USB and the audio interface hardware.

    mbed on the STM32F7 Discovery

    With mbed available for code development - this seems to be a step up from the humble Arduino - and definitely a very exposed platform for bare metal code development.   The hardware is totally accessible - not hiding behind Linux - and there is no part of this hardware design that I could not recreate using LQFP packages on a self-designed pcb, adding more SRAM or SDRAM as required.

    The hardware is quick too - it looks like it will be about 30 times the speed of the ZPUino - plus more resources - and a more flexible VGA or LCD system.  There is scope for LiPo battery operation - and for exploring some of the low power modes of the STM32F7 micro.

    Discovery F7 VGA Output Hardware Details

    Here's the hardware connection details for the proposed "Arduino"  VGA/ PS2 shield.

    Colour                      Port                       Arduino Pin

    R7                             PG6                       Dig 2
    R6                             PA8                       Dig 10

    G7                             PI2                        Dig 8
    G6                             PI1                        Dig 13
    G5                             PI0                        Dig 5

    B7                             PB9                       Dig 14
    B6                             PB8                       Dig 15

    H_SYNC                  PC6                       Dig 1  
    V_SYNC                  TBD                      Available on camera FPC

    The missing V_SYNC is probably not too much of an issue - it could be generated by a timer clocked by H_SYNC.

    LCD_CLK                PG7                       Dig 4

    TIM3 CH1                PB4                       Dig 3
                                     PH6                       Dig 6
    TIM2 CH1                PA15                     Dig 9
                                     PB15                     Dig 11
                                     PB14                     Dig 12

    For serial communication - we have access to UART7 on PF6 and PF7 - which appear as AN4 and AN5. This will allow an FTDI cable to be plugged in as an auxiliary/debug port.

    So the full line up - rearranged:

    H_SYNC                  PC6                        Dig 1
    R7                             PG6                       Dig 2
    TIM3 CH1                PB4                       Dig 3
    LCD_CLK                PG7                       Dig 4
    G5                             PI0                        Dig 5
    TIM12 CH2              PH6                       Dig 6
    Spare                        PI3                         Dig 7
    G7                             PI2                        Dig 8
    TIM2 CH1                PA15                     Dig 9
    R6                             PA8                       Dig 10
    TIM12 CH2              PB15                     Dig 11
    TIM1/8 CH2N          PB14                     Dig 12
    G6                             PI1                        Dig 13
    B7                             PB9                       Dig 14
    B6                             PB8                       Dig 15

                                   PA0                       AN0
                                   PF10                      AN1
    SPI5_MOSI            PF9                        AN2
    SPI5_MISO            PF8                        AN3
    SPI5_SCK              PF7                        AN4
    UART7 TX             PF6                        AN5

    Sunday, November 22, 2015

    A New Compact Microcontroller Board

    The new board shown fitted with a 40 pin DIL Package  - eg ATmega1284

    A Low Cost Generic Microcontroller and FPGA Board

    The Arduino board format is now looking dated with its bulky footprint and only 20 useful I/O lines.

    I realised that there was an opportunity to redesign the board and make it more useful for prototyping or developing with larger pin-count microcontrollers - yet retain nominal compatibility with the Arduino connector format, so that it will also accept most original Arduino shields.

    The proposed new board footprint is just 70% of the original area yet provides up to 58 I/O pins, direct USB programming and on board wireless communications.

    The pcb makes use of a standard 50mm x 50mm board footprint - boards which are now manufactured very cheaply (as little as $14 for 10) by various low cost board houses.

    The board format may also be used as a basis of a 50mm x 50 mm expansion shield.

    Pin Naming.

    Arduino started life  with 6 Analogue inputs and 14 Digital I/O pins. Over the years these have often been labelled A for analogue and D for digital.

    The naming convention I have settled upon keeps the A and D headers for backwards compatibility, but adds extra headers - labelled B, C, E and F.  Alphabetical port names make sense.

    These additional 0.1" pitch headers are placed in-board of the existing headers - giving an inner row of headers on a 1.70" width, which makes them entirely compatible with most breadboards and 50mm x 70mm 0.1" prototyping boards.

    Header A is 6 pins - Arduino standard - providing analogue inputs.
    Header B is 6 pins - and provides additional lines with higher resolution analogue capability.
    Header C is 8 pins - providing a mix of analogue, digital, communication and timer functions.
    Header D is digital - and has been extended to include the extra two I2C pins.
    Header E is 16 pins - E for Expansion - and is exclusively digital GPIO.
    Header F is for Future use - and may provide up to 5 GPIO lines.

    The layout of the headers has been chosen so as not to be entirely symmetrical - this hopefully prevents any shield from being plugged in back to front.

    Making it Compatible with Arduino Shields.

    A brief word about Arduino.  Arduino originally offered 6 analogue input pins and 14 digital pins. Unfortunately due to a CAD error, the digital pins are not on a standard consecutive 0.1" spacing - as there is a gap of 0.060" between D7 and D8.

    The first task was to come up with a shield footprint that could be compatible with this layout - yet  fit into the narrower width of a 50mm square pcb.  This was done by careful customisation of the size and shape of the header pads - so that they will just fit into a 50mm width.

    The second task was to provide some additional 16 pin header strips, inboard of the original Arduino headers, which would give access to an additional 32 GPIO lines.

    This was done in a way that would also allow two M3 fixing holes in opposite corners.  Finally, 4 additional signals - not present on the original Arduino headers - were added, to give the I2C pins and the 2 extra pins on the R3 power header.

    The 50x50 pcb fitted with 100 pin LQFP and mini-USB connector 

    Choice of Processor.

    The 50 x 50 board layout could be used for any microcontroller that offers around 50 to 60  I/O lines and can be readily adapted to suit various packages - up to 100 pin LQFP (Like the STM32F746).  For most projects it is a good match with 48 pin or 64 pin LQFP packages.

    It may also be used with DIL footprint ICs - it is just possible to shoehorn a 40 pin DIL onto the pcb, such as the ATmega1284.

    However because my recent experience lies with the STM32Fxxx range of ARM Cortex M3 and M4 microcontrollers, these were the obvious first choice.

    Conveniently, a board designed for one particular variant can also be populated with another close family member - so I chose the STM32F103 workhorse, and the STM32F373 - which has a faster M4 core, a floating point unit and significantly more analogue ADC capability - in terms of ADC resolution and signal lines.

    Each of these processors has a maximum of 51 or 52 GPIO lines, but once you remove two for the crystal, two for the USB, two for the ST-link and two for the RTC  - you are down to a more manageable 44 lines.

    The designation "PA" refers to the physical pins of GPIO Port PA on the STM32 mcu package - and not the A pins on the header. I hope this does not cause undue confusion.

    Port PA   12 signals
    Port PB   12 Signals
    Port PC   14 Signals
    Port PD     2 Signals
    Port PE     2 Signals
    Port PF     2 Signals

    Total        44

    This is 24 more than the original Arduino, so at least an additional 3 x 8 pin headers will be needed to accommodate these.
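The pin budget can be sanity-checked in a few lines of C - a throwaway sketch using the port figures from the table above:

```c
/* GPIO signals available per STM32 port, as listed above:
   PA, PB, PC, PD, PE, PF. */
static const int port_signals[6] = { 12, 12, 14, 2, 2, 2 };

/* Total usable GPIO lines - should come to 44. */
int total_gpio(void)
{
    int sum = 0;
    for (unsigned i = 0; i < sizeof port_signals / sizeof port_signals[0]; i++)
        sum += port_signals[i];
    return sum;
}

/* Extra 8 pin headers needed beyond the Arduino's 20 lines,
   rounded up. */
int extra_headers_needed(void)
{
    return (total_gpio() - 20 + 7) / 8;
}
```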

    The problem is how best to map the various GPIO ports on the ARM to the physical pins of the connectors - in a way that makes sense and clusters them by function.  Separating them into nominally analogue and digital is a good starting point.

    Layout of the Ports.

    In addition to the Arduino's A0-A5, the proposed board offers a further 10 analogue inputs - allowing A0 - A15 - and 6 additional analogue or digital lines C0 to C5.

    These are provided on a 16 pin header on the same side as the existing analogue and power headers.

    On the "digital side" of the board there is also an additional 16 pin connector.  This is the Expansion, or Extra, port - designated E0 to E15. If you want a 16 bit bus connected - say for an FPGA project - then this would be a good use of port E.

    Furthermore, later Arduino UNO R3 models offer two pins for I2C devices  - these are added as D14 and D15.


    The proposed 50 x 50 board size is convenient, compact and versatile.  It has sufficient pins for the more demanding applications, and sufficient board area to allow plug in modules to be added.

    The board can sensibly accept microcontrollers or FPGAs up to about 144 pin LQFP - which makes it viable for projects incorporating the STM32F7xx Cortex M7 or the Xilinx Spartan 6 range of FPGAs - both of which are available in LQFP, and thus solderable by the hobbyist/enthusiast.

    Results of J1 Simulation

    First Up  - Some Results

    In the last post I looked at running the J1 simulator on various platforms, from the humble 16MHz Arduino to the red-hot STM32F746 - running at 216MHz.  Here are the results of those tests in J1 instructions per second - or JIPS as I call them.

    Arduino          16MHz    ATmega328            67,000 JIPS
    ZPUino           96MHz    Soft CPU            152,000 JIPS
    STM32F103        72MHz    ARM Cortex M3       404,000 JIPS
    STM32F407       168MHz    ARM Cortex M4     3,000,000+ JIPS *
    STM32F746       216MHz    ARM Cortex M7     9,000,000+ JIPS *

    *  The last 2 results are based on level 00 Compiler Optimisation.  With more aggressive optimisation, the '746 was returning 27 million JIPS.

    So now that we have a means of simulating the processor at about 1/20th of full speed, the time has come to decide exactly how we are going to port a useful high level language onto this processor model.

    James Bowman has done excellent work porting Forth onto his J1 soft core, but I am not quite ready to plunge into Forth - for me it's about the journey of exploration in reaching a high level language implementation - under my own steam.

    A small revelation

    At this point it is interesting to note that if the 27 million JIPS figure is indeed correct, then the 216MHz Cortex M7 core is executing about 8 instructions for every emulated J1 instruction - in this particular (non demanding) test program.  So it would probably be fair to say that most modern ARM processors (M7 and above) would achieve a similar ratio whilst simulating the J1.

    If this is the case, then a 1GHz ARM could simulate a 100MHz J1 - or put the other way, then a 100MHz J1 would have a similar overall performance to a 1GHz ARM - that was executing some sort of stack based Virtual Machine bytecode language - i.e. Java.

    As a lot of applications are written in Java (e.g. the Arduino IDE), the overhead of running a virtual stack machine on a register based cpu slows them down by a factor of 10.  If however the Java bytecode were translated into an intermediate form (possibly J1 Forth) it would likely run appreciably faster.

    The point I am making is that a custom soft core stack cpu, tailored to Java bytecode and running on an FPGA, could make Java run a lot faster on less powerful hardware. Some ARM ICs already have the ability to run Java bytecode directly - known as Jazelle. This is how some games are written, in order to run faster on small platforms - such as mobile phones.

    Running the J1 Simulator on ZPUino.

    The ZPUino has shown itself to be a convenient and useful 32 bit processor, implemented on FPGA hardware. It is Arduino code compatible, runs my simulations at about twice the speed of an Arduino, and allows easy use of the Adafruit GFX graphics library - which permits 800 x 600 VGA text and graphics to be displayed on a flat screen monitor.

    Whilst not a particularly fast processor, the ZPUino does allow easy and unrestricted access to the graphics library - such that it is easy to create a series of animated display screens for displaying high level output, using what is effectively an Arduino sketch. This technique is particularly flexible, and allows you to creatively interact with the particular problem - rather than get bogged down in someone else's system calls and drivers.

    I took the very short J1 test program as used in the simulations - a simple loop consisting of 7 instructions - and used the ZPUino to run this test program as an animated simulator, which graphically showed the contents of memory as a hex dump, plus the main J1 registers, the stack and the instructions as they were stepped through. Repeated re-drawing of the hex dump memory display slowed the execution right down to about 1 instruction per second - about a hundred-millionth of real J1 execution speed.
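    The hex dump itself is simple to generate. This is a sketch of the idea, not the actual display code: on the ZPUino the rows were drawn with Adafruit GFX text calls, but here each row is built into a caller-supplied string so the same routine could feed the graphics library, a UART, or a test. The 8-words-per-row layout is my assumption - the post does not give the exact format.

```c
#include <stdio.h>
#include <stdint.h>

/* Build one row of a memory hex dump into 'out': the word address,
   followed by 8 data words, all in 4-digit hex. Returns the number of
   characters written. */
int hex_dump_row(char *out, size_t outsz, const uint16_t *mem, uint16_t addr)
{
    int n = snprintf(out, outsz, "%04X:", addr);            /* row address  */
    for (int i = 0; i < 8; i++)                             /* 8 data words */
        n += snprintf(out + n, outsz - (size_t)n, " %04X", mem[addr + i]);
    return n;
}
```

    Redrawing a screenful of such rows after every simulated instruction is exactly what throttled the animated simulator down to walking pace.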

    The Missing Assembler

    What was missing from this exercise was the ability to easily write J1 test programs in J1 machine code - and this rather hampered progress. So it is for this reason that the first application of the SIMPL text interpreter will be at the core of the J1 cross assembler.

    Whilst the J1 is intended to run Forth, and has the tools to support it, my Forth skills are not great, and anyway I'm trying to challenge myself to learn C to a reasonable standard.  So a coding project written in C, that taxes my language and thinking skills, is a good way to learn and achieve something useful.

    The interpreter can take a set of mnemonics, tailored for the J1 processor and by the process of direct substitution, create the series of 16 bit instructions that can then be run on the J1 virtual machine. I really want this to be an interactive process working in a Forth-like manner - so that small snippets or blocks of J1 assembly language can be assembled and tested individually as an iterative process.
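    The direct-substitution core is little more than a table walk. This is a minimal sketch of that idea - the mnemonics and 16 bit opcode values below are placeholders for illustration, not the real J1 encodings:

```c
#include <stdint.h>
#include <string.h>

/* A mnemonic-to-opcode table: assembly by direct substitution. */
typedef struct {
    const char *mnemonic;
    uint16_t    opcode;
} asm_entry_t;

static const asm_entry_t optable[] = {
    { "DUP",  0x6081 },   /* placeholder encodings, not real J1 opcodes */
    { "DROP", 0x6103 },
    { "ADD",  0x6203 },
    { "XOR",  0x6503 },
};

#define OPTABLE_SIZE (sizeof optable / sizeof optable[0])

/* Return the 16 bit instruction for a mnemonic, or -1 if no match found */
int assemble_word(const char *word)
{
    for (unsigned i = 0; i < OPTABLE_SIZE; i++)
        if (strcmp(word, optable[i].mnemonic) == 0)
            return optable[i].opcode;
    return -1;
}
```

    Feeding each word of a source line through such a lookup, and appending the result to a code buffer, gives exactly the interactive assemble-and-test workflow described - small blocks assembled and tried one at a time.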

    It's many years since I wrote any code in assembler - and that was Z80 which had a reasonable mix of registers to play with.

    Writing in a minimal instruction set language, is going to be interesting.

    In order to gen up on the processes involved within a typical assembler, I returned to "NAND to Tetris" Chapter 6 - there is a good description of what is needed there.  I then went on to refresh myself on the contents of Chapter 7 - "Virtual Machine I - Stack Arithmetic" and Chapter 8 - "Virtual Machine II - Program Flow".  Having re-read these chapters in the fresh light of a new day, I believe that my musings about the J1 cpu are not only very relevant, but completely on track with the content and approach outlined in "NAND to Tetris".

    More on this in a later post.

    NAND to Tetris - A Personal Journey

    From NAND to Tetris (N2T) is the popular name for an open study Computing Science course devised by Shimon Schocken and Noam Nisan.  It is accompanied by a book, by the same authors "The Elements Of Computing Systems - Building a Modern Computer from First Principles" and a series of online and downloadable study materials.

    For anyone who wishes to get a more in depth understanding of the interaction between the hardware, operating system and software application layers of a modern computer, or to consolidate existing knowledge, I would highly recommend purchasing this book, following the course materials and supporting this project as a whole.  We need a whole new generation of Computer Scientists and Electronic Engineers who understand this stuff from first principles.

    After first hearing about the course from contacts at the London Hackspace, I bought the book last year and I am slowly working my way through it.  By this, I mean that I am making my own personal tour of the country that it describes  - and not necessarily by the direct linear route outlined in the book.  I dip into it occasionally, rather like a travel guide, as if I were planning a trip to the next major city.  I believe that I will reach the final destination, but it will be the wealth of experience gained from meandering on the journey, rather than the final destination, that currently is my driving factor.

    I embarked on the book having spent most of my career in digital hardware design, but with very little real experience of writing software tools. Whilst I found the chapters on hardware fairly easy to follow, I hoped that the book would lead me gently into picking up new software skills.

    The first 5 chapters of the book illustrate and reinforce the principles of combinatorial and sequential digital logic, by having the student design the logic functions of the various "chips" that go to make up a simple cpu.  From basic gates you combine ever more complex designs to build the arithmetic logic unit (ALU), the program counter and the various registers that make up the cpu.

    A hardware design language package allows the design, simulation and testing of the various logical components, and gives the student confidence that their design meets the test spec of the required item.  It soon becomes apparent that there is no one way to implement the logic of the ALU - but some ways are quicker, more flexible or make more efficient use of silicon.

    I completed the hardware design chapter exercises of the book during an intensive week of evenings in spring last year.  Then I got more than a little bogged down in the software section, as I realised at the time that I did not have the programming skills in any language to do justice to the demands of the software exercises - beginning at Chapter 6, "Assembler".

    Rather than be defeated by a complete road-block, I have spent the last year surveying the surrounding territory for an alternative route to complete the mission. In this time, I have invested in FPGA hardware, designed pcbs for ARM processors and written simulator code for simple stack based processors.  I have now got to the point where the next logical step is to write an Assembler.

    I have picked up enough C skills to put together a simple text interpreter and use it to parse through tables of mnemonics, looking for a match and associating a machine instruction with each scanned mnemonic.  It is the basis of a "poor man's" assembler, but it has the flexibility to be applied to whatever novel processor's instruction set I wish to explore.  I can now go back to Chapter 6 - with my new knowledge and software tools - and make new progress.

    In the intervening year - and at this stage in life we view projects in terms of years of involvement - I have also learned a bit of Verilog and done a bit of FPGA logic design. These are skills I will need to develop if I am to keep up with the modern world. And whilst I may no longer be able to see (without glasses) some of the hardware I am working with, I can still type, and I have the option of increasing the font size. That should keep me viable in the workplace for the next decade or so - although I do increasingly have my "dinosaur days".

    This move was partly inspired by the N2T book, and also by my desire to get involved in the new wave of low cost FPGAs that have now become available to the hobbyist.  I might be so bold as to say that in 2015 they are to the enthusiast what the 6502 was in 1975, and the Arduino was in 2005.  User friendly FPGA hardware is definitely going to be a growth area for the next few years.

    FPGAs allow you to design your own custom hardware, or recreate vintage or retro-hardware computers from years ago.  Soft core processors, featuring custom instruction sets are one area of involvement - and these will require software tools to simulate operation and allow code to be written.

    In addition, I have moved on from being constrained by just 1 or 2 microcontrollers. I am now experiencing the portability of software written in C, and discovering how easy it can be to switch between processors - even though I have some concerns about the complexity of modern IDEs.

    One of the tasks I set this year was to benchmark several microcontrollers with dhrystone and whetstone benchmarks - in an attempt to get a better understanding of how they perform under different applications.

    By characterising the relative performance and resources of a few common cpus - I am now able to make informed decisions about which might be more suitable for a particular job. Currently I am impressed with  the ARM Cortex M7,  and I am eagerly awaiting 400MHz versions of this M7 core - expected in late 2016-2017.

    Whilst 400MHz might appear puny to those who regularly use twin-core 1GHz parts in their mobile phone or Raspberry Pi, to them I offer the challenge of writing an Assembler from scratch!

    Saturday, November 21, 2015

    Beating the Bloat


    This post is by way of a minor rant about the current state of the tools and methods we use to produce embedded firmware.

    In order to perform the benchmark tests on the series of processors yesterday, I had to use 4 individual IDEs and spend 12 hours of my life fighting the flab of blobby bloatware that is the embodiment of the modern IDE.

    My grief really started when I wanted to port the J1 simulator to the Cortex M7. For this I needed a "professional"  tool chain.

    The Long and Winding Road.......

    In order to blink an LED on my STM32F746 breakout board, I had to install the 32K codesize limited version of Keil's uVision 5 and their ARM MDK. This takes about an hour to install and set up.

    Then I had to find an example project of something that was close to what I wanted to do - i.e. blink a LED. I found their generic Blinky example - and then found that it had been tailored for a couple of commercial dev boards - and the files that set up the port allocation were locked from editing within the IDE.

    So I opened the files in Notepad++, edited the dozen or so lines of code that controlled the GPIO port allocation, and then wrote my edited version in place of the original - so far, so good.

    Had I known that at 6pm I was still about 2 hours away from blinking a LED, I would have probably thrown in the towel and gone to the pub.  I eventually tracked down the problem to my particular port pin being re-assigned as an input in the example code, immediately after I had set it up as an output. There was also a minor problem with the clock generation set up for the wrong PLL ratio - that prevented the code from running.

    Now I have learnt that ARM processors are fairly complex beasts - and the peripherals take a fair amount of time to set up, with their myriad of different options - but when I looked at the project files to blink a LED, I saw that it was taking about 100 code modules to set up the peripherals - and some of those modules were each 1000+ lines of code.

    However - as a fairly recent newcomer to the Keil compiler and the ST Microelectronics hardware abstraction layer - who was I to know which of the 100 files I needed and which I didn't.

    This leads me nicely on to  Shotgun, Voodoo and Cargo Cult coding practices. I'll let the interested follow up the definitions, but the point that I am making is that the modern IDE and methods of using a hardware abstraction layer do absolutely nothing to help simplify the problem or reduce the amount of bloat that has to be compiled - regardless of whether it is being used or not.

    In order to flash a LED on and off, a single bit in a register needs to change state - why then do I need to compile 10,000 lines of somebody else's code into a 9.5k byte binary, in order to make this happen?
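    To underline the point, the whole job amounts to a couple of single-bit register writes. This sketch models a minimal STM32-style GPIO register block (MODER for pin direction, ODR for output state) as a plain struct so it will run anywhere; on real hardware these would be volatile registers at the peripheral's base address, and any port and pin numbers are hypothetical.

```c
#include <stdint.h>

/* A simulated STM32-style GPIO port: 2 mode bits per pin in MODER
   (01 = general purpose output), 1 output bit per pin in ODR. */
typedef struct {
    uint32_t MODER;
    uint32_t ODR;
} gpio_t;

/* Configure one pin as a push-pull output */
void gpio_set_output(gpio_t *g, int pin)
{
    g->MODER &= ~(3u << (pin * 2));   /* clear the pin's 2-bit mode field */
    g->MODER |=  (1u << (pin * 2));   /* 01 = output                      */
}

/* Flash the LED: one bit in one register changes state */
void gpio_toggle(gpio_t *g, int pin)
{
    g->ODR ^= (1u << pin);
}
```

    That is the entire job the 9.5k byte binary was doing, once the clock tree and HAL scaffolding are stripped away.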

    Compilation times of over a minute really do nothing to boost one's productivity. Yet we persist with this madness, making our compilation tools ever more sophisticated - with the excuse that the processors we are compiling for are getting more complex, and that the commercial suppliers of compilation tools need to be seen to be keeping ahead of the competition.

    It has been ever thus for the last 50 years or more - with the computing industry peddling us over-bloated, over-expensive tools that we neither want nor need.

    HAL: Just what do you think you're doing, Dave? 

    Well perhaps Dave should be asking HAL just WTF he thinks he is doing.  

    And in this case, HAL is the new hardware abstraction layer - cooked up by the teams of clever code monkeys at ST Microelectronics.  I understand that as code gets more complex then it needs to be better managed, and that somewhere out there, someone writing code for a Cortex M0 may have an epiphany moment and realise that he should port his code to a Cortex M7......  

    However, it appears that ST Microelectronics has employed a million monkeys with typewriters to undertake the mammoth task of writing the HAL modules - put them in separate rooms (or countries) and made it difficult for them to talk to one another.

    Not surprisingly, the HAL reference manual runs to 963 pages - and it took another team of our simian chums to cook that one up. This link is actually for the STM32F4xx Cortex M4 processors - because it appears that the M7 version has not been published yet.

    So in reverence to the computer Holly, from Red Dwarf, I will call this code behemoth HOL - the hardware obfuscation layer - as that is exactly what it does.  It makes it difficult to know what your hardware is doing, or what you need to do in order to make it work for you.

    There has to be a better way - and if Carlsberg wrote compilation tool chains - they would probably be the best in the World.

    OK  - time for the pub...........

    Friday, November 20, 2015

    A J1 Virtual Machine - Gimme some Jips!

    BOB is no slouch when it comes to simulating a virtual stack cpu!
    Historical Note.

    Way back in 1991, when I was half the age I am now, I did my pcb design work using OrCAD on a 25MHz 486 desktop. The picture above is of my latest experimental pcb - a breakout board for the 216MHz STM32F746 ARM Cortex M7 microcontroller.  BOB (above) can emulate a 16 bit minimal instruction set processor faster than the 25MHz '486 box - and for about $20!  Now that's progress.

    Implementing a Stack Processor as a Virtual Machine

    This post examines the role of a virtual machine, created to run on a given processor for the purpose of simulating another processor - often to perform operations that the host processor might not do easily. One example was Steve Wozniak's "Sweet 16" - a 16 bit bytecode interpreter he wrote to run on the 6502, to allow the Apple II to readily perform 16 bit maths and 16 bit memory transfers.

    In his closing remarks, Woz wrote:

    "And as a final thought, the ultimate modification for those who do not use the 6502 processor would be to implement a version of SWEET16 for some other microprocessor design. The idea of a low level interpretive processor can be fruitfully implemented for a number of purposes, and achieves a limited sort of machine independence for the interpretive execution strings. I found this technique most useful for the implementation of much of the software of the Apple II computer. I leave it to readers to explore further possibilities for SWEET16."

    The main limitation of the VM approach is that the execution speed is often one or two orders of magnitude slower than the host running native machine code, but with processors now available with clock speeds of 200MHz - this is not so much of a problem.

    It is more than offset by the ability to design a processor with an instruction set that is hand-crafted for a particular application, or the means to explore different architectures and instruction sets, and to simulate these in software, before committing to FPGA hardware.

    Stack Machines

    Whilst Woz's Sweet 16 was a 16 bit register based machine, I had ideas more along the lines of a stack machine, because of its simpler architecture and low hardware resource requirement.

    I had become interested in an interpreted bytecode language that I believed would be a good fit for a stack machine, and so in order to get the ball rolling, I needed a virtual stack machine to try out the language.

    Earlier this year, I invested in a Papilio Duo FPGA board, and with this came access to the ZPUino soft-core stack processor - devised, and much enhanced from an existing design, by Alvie Lopez. The advantage of the ZPUino was that it was one of the few soft core processors with GCC available, so the task of porting the Arduino flavour of C++ to it was not overly arduous (for those accustomed to that sort of task - not me!).

    However, porting C to a stack machine is never a very successful fit - as C prefers an architecture with lots of registers - such as ARM.

    As a result, the ZPUino, whilst clocked at 6 times the speed of the standard Arduino, only achieved about twice the performance when running a Dhrystone benchmark test written in C.  The other factor limiting the ZPUino is that it executes code from external RAM - and there is a time overhead in fetching instructions.

    Despite these limitations, the ZPUino has been a useful tool to run simulators, as it supports VGA hardware and the Adafruit Graphics library - allowing text and video output from an Arduino-like environment.

    The other stack processor that caught my attention is James Bowman's J1 Forth processor.  This became available as an implementation on the Papilio Duo  in early September to run on readily available FPGA hardware at speeds of up to 180MHz. So I have been working towards trying it out - first using a software simulator.

    A J1 Simulator - written in C - and tried on a number of processors.

    Back in the spring, I found a bit of C code that allowed a J1 processor to be run as a virtual machine on almost any processor.

    Initially, I implemented it on Arduino, but I quickly moved to the faster ZPUino - which, as stated above, is a stack based processor implemented on a FPGA.  This was a stop-gap, whilst I was waiting for James to release his J1 in a form that I could use.

    The simulator is about 100 lines of standard C code, and implements a 16-bit processor with integer maths and a 64K word addressing space.

    I then wrote a test routine, in J1 assembler, consisting of just  a simple loop - executing 7 instructions and incrementing (by 1) a 16-bit memory location, every time around the loop.
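    The code I used is James Bowman's; what follows is not that simulator but a much cut-down sketch of the same idea - a 16-bit, word-addressed stack machine stepped one instruction at a time. The instruction set here is a toy encoding of my own, chosen purely so the benchmark loop above can be expressed in it.

```c
#include <stdint.h>

/* A toy 16-bit stack machine, in the spirit of the J1 simulator.
   Memory is word-addressed; the opcodes below are illustrative only. */
#define MEM_WORDS 4096

enum { OP_LIT, OP_FETCH, OP_STORE, OP_ADD, OP_JMP, OP_HALT };

typedef struct {
    uint16_t mem[MEM_WORDS];   /* word-addressed memory */
    uint16_t stack[32];        /* data stack            */
    int      sp;               /* stack pointer (-1 = empty) */
    uint16_t pc;               /* program counter       */
} vm_t;

/* Execute one instruction; returns 0 on OP_HALT, 1 otherwise */
int vm_step(vm_t *vm)
{
    uint16_t op  = vm->mem[vm->pc++];
    uint16_t arg;
    switch (op) {
    case OP_LIT:   vm->stack[++vm->sp] = vm->mem[vm->pc++]; break;
    case OP_FETCH: vm->stack[vm->sp] = vm->mem[vm->stack[vm->sp]]; break;
    case OP_STORE: arg = vm->stack[vm->sp--];                 /* address */
                   vm->mem[arg] = vm->stack[vm->sp--]; break; /* value   */
    case OP_ADD:   arg = vm->stack[vm->sp--];
                   vm->stack[vm->sp] = (uint16_t)(vm->stack[vm->sp] + arg);
                   break;
    case OP_JMP:   vm->pc = vm->mem[vm->pc]; break;
    case OP_HALT:  return 0;
    }
    return 1;
}
```

    In this toy encoding the benchmark loop is LIT 1, LIT addr, FETCH, ADD, LIT addr, STORE, JMP - seven instructions per pass, each pass adding one to the memory cell, just as in the tests.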

    Running this test code, the standard 16MHz Arduino managed 67,000 J1 instructions per second (Jips).

    I then transferred the sketch to the ZPUino, running on the Papilio Duo board.  This provided a useful boost in performance, to about 152,000 Jips.

    A 72MHz STM32F103, running the same code under STM32-Duino, managed 404,000 Jips - about 6 times the speed of the Arduino - a healthy performance boost.

    The difference in performance between the 8-bit Arduino and the 32 bit STM32F103 could be explained as partly down to the 4.5 times increase in clock speed, and partly that a 32 bit microcontroller can implement a 16 bit virtual machine somewhat more efficiently than an 8-bit device - giving an additional 30% boost over clock speed scaling alone.

    In addition, the test code only added one to the memory cell. If this were say adding a 16 bit value into that location - the 16 bit transfer would slow the 8-bit AVR down considerably.

    I then proceeded to port the simulator to a 168MHz STM32F407 Discovery board, which returned a slightly puzzling 764,000 Jips.

    Based on the increase in clock speed it should have been about  940,000 Jips. This appeared to be a bit slow.  In theory it should be running at 2.33 times the speed of the 72MHz part.  This needs further checking to ensure that it is not a compiler optimisation issue that is holding it back.

    I tried again with the various optimisation levels of the  GCC compiler:

    Optimisation -O0      733,333 Jips
    Optimisation -O1    3,083,333 Jips
    Optimisation -O2    3,333,333 Jips
    Optimisation -O3    3,583,333 Jips

    With only modest optimisation the '407 is returning around 3 million Jips!

    Meet BOB - the fastest, newest kid on the block.

    Back in the summer I made up a break out board BOB for the 216MHz STM32F746  Cortex M7 microcontroller.  Whilst ST Microelectronics had released their $50 F7 Discovery board - complete with LCD, I wanted a very simple board, with the same pin-out as the previous F4 Discovery to try out relative performance checks.

    So, it's now time to port the J1 simulator onto the STM32F746 - and see how it performs.

    The '746 is an M7 ARM and has a six-stage dual issue pipeline - which means that it can effectively load two instructions from memory at once.  This feature and the higher clock frequency gives it a 2.2 times speed advantage over the '407.

    With all this working, the 746 BOB board - should be able to simulate the J1 at around 7.8 million J1 instructions per second  - welcome back to the 1980's in terms of performance!

    Whilst we can emulate the J1 in C at around  8 Million Jips, the real J1 should manage nearly 200 Million Jips, so when I get real J1 hardware up and running - it should really fly!


    After a long day and half a night of battling with compilers, I just got the figures for the STM32F746 running the J1 interpreter at 216 MHz. Initial measurements suggest that it's running at close to 15 million Jips with minor optimisation, and about 27 million Jips with the most aggressive optimisation!

    Optimisation -O0     9,000,000 JIPS
    Optimisation -O1    15,000,000 JIPS
    Optimisation -O3    27,000,000 JIPS

    Thursday, November 12, 2015

    Minimal Text Interpreter - Part 3

    The operation and main routines of a minimal text interpreter  - Part 3

    This post is merely a description of the first implementation of the text interpreter, looking at the principal routines. It's so I can remember what I did, in 6 months' time.


    The minimal text interpreter is the first stage of enabling plain text to be converted into computer machine language.  Charles H. Moore, the inventor of Forth, found an efficient means of doing this in the 1960s whilst working with mainframe computers, so I have chosen to adhere reasonably closely to his methods.

    A word is a group of consecutive characters ending in whitespace. The first task for the interpreter is to step through the line of characters and identify the individual words.  At the same time it can keep track of a word-length counter so that it can reformat the text into a more compact format.

    Take the sentence "A word is a group of consecutive characters ending in whitespace"

    It is a string of 64 characters, containing 11 words and a minimum of (11-1) spaces.  We first pick out the individual words and store them in a temporary processing buffer; we can also count the number of characters in each word and put the count alongside:

    A              1
    word           4
    is             2
    a              1
    group          5
    of             2
    consecutive   11
    characters    10
    ending         6
    in             2
    whitespace    10

    So we have 11 entries in our table, with their lengths.  If we assume that we are going to restrict the word length to 16 characters, we could express the length as a single hex character.
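    The scanning pass that produces this table can be sketched in a few lines of C. Only the lengths are collected here; a fuller version would also copy each word into the processing buffer as described above.

```c
/* Walk an input line, isolate each space-separated word and record its
   length. Returns the number of words found. */
int scan_words(const char *line, int lengths[], int max_words)
{
    int count = 0;
    while (*line && count < max_words) {
        while (*line == ' ')             /* skip the whitespace */
            line++;
        if (!*line)
            break;
        int len = 0;
        while (*line && *line != ' ') {  /* measure the word    */
            line++;
            len++;
        }
        lengths[count++] = len;
    }
    return count;
}
```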

    We now crop the words down to their first 3 characters.  For those of fewer than 3 characters, we fill in with spaces.  Where the length exceeds 9, it is shown as a single hex character:

    A   1
    wor 4
    is  2
    a   1
    gro 5
    of  2
    conB        (11)
    chaA        (10)
    end 6
    in  2
    whiA        (10)

    So we have reduced our original 64 input characters to 11 x 4 = 44 - which is about a 30% reduction.  Tests carried out with Forth showed that the first 3 characters plus the length were an optimum way of differentiating between, and storing, words in the dictionary.  For assembler applications, where mnemonics tend to be only 3 or 4 characters, it is pretty much an optimal first step in decoding the source code, prior to allocating the machine code op-code tokens.

    Once the text scanner has reduced each input word to three characters and a length byte, it then becomes practical to use this shortened representation to perform a word match.  All of the system keywords and user words can be stored compactly in this format, along with their jump addresses.  If 8 bytes are allocated to each word entry in the look-up table, this allows a 16 bit jump address, a pointer to the table where the original unencoded word is stored, and a 1 byte attribute - which can be used to tell the compiler something about the word to help at compilation time.
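    The 3-characters-plus-length header and its matching test can be sketched like this. The struct layout is an assumption for illustration - 3 name characters padded with spaces, plus a length byte, so two words compare equal with a single 4-byte comparison:

```c
#include <string.h>

/* Shortened dictionary header: first 3 characters (space padded, not
   null terminated) plus the full word length. Exactly 4 bytes. */
typedef struct {
    char          name[3];
    unsigned char len;
} header_t;

header_t make_header(const char *word)
{
    header_t h = { { ' ', ' ', ' ' }, 0 };
    size_t len = strlen(word);
    h.len = (unsigned char)len;
    memcpy(h.name, word, len < 3 ? len : 3);
    return h;
}

/* All 4 bytes must match for the words to be considered the same */
int header_match(header_t a, header_t b)
{
    return memcmp(&a, &b, 4) == 0;
}
```

    Note that two different words sharing the same first 3 characters still differ in the length byte - which is what made the scheme workable in Forth dictionaries.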

    Consider a system that has a maximum of 256 user words and 256 keywords, each internally coded at 8 bytes per word.  This would need 2K of Flash for the keywords - not a problem for most small micros - but the 2K user table would rapidly start to eat up the limited RAM resources.  Fortunately most small user applications are unlikely to have anything like 256 user words, so halving this to a 1K user dictionary space would be quite acceptable.

    Immediate and Compilation Modes

    The above text scanning function can be used in two distinct cases:  immediate and compilation modes.

    In immediate mode, a word typed into the input buffer will be executed immediately - provided of course that it already exists in the dictionary. If not, it will lead to an error message. Once found in the dictionary the jump address associated with that word is put onto the pc and the processor executes the code it finds at the jump address.

    However in compilation mode, the user wants to create a new word definition, and uses a Forth method called the colon definition.  As its name suggests, a colon definition begins a new line with a colon : - this tells the text interpreter that the word that follows is going to be new, and that it should enter compilation mode.  The colon is followed by the name of the new_word, then the definition - i.e. the code words associated with that function - and finally a semi-colon ; to end.

    The Forth word colon definition performs the same operation as a function in C - compiling the function and putting it into memory as a series of threaded calls to the various routines.

    : new_word    (put the definition here)  ;

    In C the equivalent would be:

    void new_word() { /* put the definition here */ }

    Once a new_word has been defined as above, it can be executed immediately - just by typing its name. This is what gives Forth its almost unique characteristic of being an interactive and extensible language.   The functions are written and compiled, and can be tested in isolation from one another - so that a large project may be built interactively in small blocks, testing each block as you go.

    Currently only the basics have been implemented - by way of a proof of concept, and running on a 2K RAM Arduino. Later this will be ported to various ARM Cortex parts, the FPGA - softcore ZPUino and ultimately the J1 Forth processor.

    There are probably many ways in which this could be implemented - some giving even more codespace and memory efficiency.  As a rookie C programmer, I have stuck to really basic coding methods - that I understand. A more experienced programmer would probably find a neater solution using arrays, pointers and the strings library - but for the moment I have kept it simple.

    The interpreter resides in a continuous while(1) loop and consists of the following routines:


    Reads the text from the UART into a 128 character buffer using u_getchar.
    Checks that the character is printable - i.e. resides between space (32) and tilde ~ (126) in the ascii table - and stores it in the buffer.
    Keeps accepting text until it hits the buffer limit of 128 characters or breaks out of this if it sees a return or newline  \r or \n character.
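    The routine described above might be sketched as follows. This is a minimal illustration, not the author's exact code: the real version pulls characters from the UART with u_getchar(), so next_char() below is a hypothetical stand-in that reads from a test string instead.

```c
#define BUF_LEN 128

/* Hypothetical stand-in for u_getchar(): pulls characters from a string */
static const char *test_input;

static char next_char(void) {
    return *test_input++;
}

/* Read one line into buf, keeping only printable characters.
   Stops at the 128-character limit or on \r, \n (or end of input). */
int read_line(char *buf) {
    int n = 0;
    while (n < BUF_LEN - 1) {
        char c = next_char();
        if (c == '\r' || c == '\n' || c == '\0')
            break;                        /* end of line - stop reading */
        if (c >= ' ' && c <= '~')         /* printable: space (32)..tilde (126) */
            buf[n++] = c;
    }
    buf[n] = '\0';
    return n;                             /* number of characters stored */
}
```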


    This checks if the text starts with a colon, and so is going to be a new colon definition.
    sets flag colon=1
    calls the build_buffer function


    If the leading character is not a colon, this function determines that the word is either within the body of the definition, or it is for immediate execution.  It calls build_buffer, but only builds the header to allow a word match. It should not add the word to the dictionary if it gets a match and the word is already there.


    This checks the first 3 characters of the word and puts them into a new header slot in the headers table.
    It also calculates the word length by counting the characters as it stores them into the dictionary table, which it continues until it sees a terminating space character.
    It increments the dictionary pointer ready for the next word
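    The header-building step above might look something like this sketch. The names, table sizes and exact layout are illustrative assumptions, not the author's implementation; each header is 4 bytes - the first three characters plus the word length.

```c
#define MAX_WORDS 256

char headers[MAX_WORDS][4];   /* 4-byte shortform records           */
char dictionary[1024];        /* full word text, stored end to end  */
int  dict_ptr   = 0;          /* next free position in dictionary   */
int  word_count = 0;          /* next free slot in headers table    */

/* Store the first 3 characters and the length of the word into a new
   header slot, copying the full word into the dictionary as we count. */
void build_header(const char *word) {
    int len = 0;
    while (word[len] != ' ' && word[len] != '\0') {
        dictionary[dict_ptr + len] = word[len];
        len++;
    }
    headers[word_count][0] = word[0];
    headers[word_count][1] = (len > 1) ? word[1] : ' ';
    headers[word_count][2] = (len > 2) ? word[2] : ' ';
    headers[word_count][3] = (char)len;
    dict_ptr += len;          /* increment pointer ready for next word */
    word_count++;
}
```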


    This compares the 4 characters of the header of the newly input word with all the headers in the header table.
    If all 4 characters match then it drops out with a match_address (for the jump address look-up table) and sets a match flag  match= 1.
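    A sketch of that comparison is shown below, under the assumption (names hypothetical) that headers are stored as 4-byte records. All four bytes - three characters plus length - must match, and the returned index doubles as the pointer into the jump address look-up table.

```c
/* Compare a newly built 4-byte header against every header in the table.
   Returns the match index (the jump-table pointer), or -1 if no match. */
int word_match(const char h[4], const char table[][4], int count) {
    for (int i = 0; i < count; i++) {
        if (h[0] == table[i][0] && h[1] == table[i][1] &&
            h[2] == table[i][2] && h[3] == table[i][3])
            return i;        /* match_address for the jump look-up table */
    }
    return -1;               /* no match - word is new to the dictionary */
}
```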


    This is a utility routine which prints out a list of all the headers in the order they are stored in the headers table.


    This is a utility routine which prints out a list of all the words in the dictionary in the order they were stored in the dictionary table.


    This is the main character interpretation function which implements the SIMPL language core.


    Not yet implemented.  Returns true if it finds a word and invokes build_buffer and word_match


    Not yet implemented.  Converts the ascii text to a signed integer and stores it in a parameter table.
    Might possibly use ascii 0x80 (DEL) to signify to the header builder that the following bytes are a number.  Will need a conversion routine to go between printable and internal storage formats.

    UART Routines

    These provide getchar and putchar support directly on the ATmega328 UART, saving a huge amount of codespace compared to Serial.print and friends.


    Initialises the ATmega328 UART to the correct baudrate and format.


    Waits until the Tx register is empty and then transmits the next character


    Waits until a character is present in the UART receive register and returns with it

    Printing Routines

    Having banished Serial.print, I had to implement some really basic printing functions.


    Sends a 16 bit integer to the UART for serial output


    Sends a 32 bit integer to the UART for serial output
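    The digit conversion behind both routines might be sketched as below. This is an illustration only: the embedded version sends each character straight to the UART, whereas here the digits go into a caller-supplied buffer so the logic can be followed (and tested) on its own. Note that negating the most negative value would need special care on a true 16-bit int.

```c
/* Convert a signed integer to ascii decimal in out[].
   A wide type is used so the same logic serves 16-bit and 32-bit values. */
void print_num(char *out, long n) {
    char digits[12];
    int  i = 0, j = 0;
    if (n < 0) {
        out[j++] = '-';
        n = -n;
    }
    do {                           /* peel off digits, least significant first */
        digits[i++] = (char)('0' + n % 10);
        n /= 10;
    } while (n > 0);
    while (i > 0)                  /* emit most significant digit first */
        out[j++] = digits[--i];
    out[j] = '\0';
}
```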

    A Minimal Text Interpreter - Part 2

    A Text Interpreter to run on a Resource Limited Microcontroller  - Part 2

    In the previous post, I described the basics of a tiny text interpreter, written in C, intended for use on resource limited microcontrollers. The text interpreter would offer a natural language user interface, allowing programming and command line control of various microcontroller projects.

    It will also form the basis of a wider range of self-written computing tools, including assembler and compiler, editor and file handler - all of which could be hosted, if necessary on a resource limited target board.

    However, for the moment, and for ease of experimentation, the intention was to get the interpreter to run with only 2K of RAM (as per the Arduino Uno).

    I envisioned the text interpreter as being a universal resident utility programme (almost akin to a bootloader) that would be initially flashed onto the microcontroller thus allowing a serial command interface and the means to exercise the hardware or develop small interactive programmes.

    At work and at home, there are many instances of when I want some degree of interactive control over a small microcontroller project - even if it is just manipulating some port lines or sending and receiving a few serial responses on a terminal programme.

    Some Practical Limitations

    In order to keep the demands on the interpreter program reasonable it is necessary to put some limits on its capabilities.  In particular, the number of words it can recognise and create jump addresses for. For convenience I used a look up table to hold the jump addresses.  If the look up table is to remain reasonably compact - then a limit of 256 entries seems reasonable.  Restricting the word capacity will also help keep the dictionary and its headers to a manageable size in RAM. This is important when you only have 2K to play with!

    As the 4 byte header is in fact a shortform - a compact coding convenience that represents the dictionary - it could be said that in very RAM limited systems it is not actually a requirement to keep the dictionary in RAM on chip.  The only role the dictionary performs is to allow the header entries to be expanded to the full word at listing time.

    As small micros generally have plenty of Flash available, the dictionary for all the standard words could be programmed into flash - as indeed could their headers.  If necessary, a shell hosted by a PC application could be used to hold the various dictionaries and source code files needed for particular applications. However, the original aim is that this interpreter vastly increases the user-friendliness of the microcontroller target - even with just a serial terminal as the user interface.

    Additionally, I have imposed a word length limit of 16 characters.  Imposing this limit means that the word length can be coded as a single hexadecimal digit - which makes it displayable in ascii and human readable. If you can't name something uniquely in 16 characters then you are probably of German extraction.


    Different tasks need different tools, and as the interpreter will be used for a variety of tasks, then it seems reasonable that it can be augmented or tailored towards a particular task. This can be done very conveniently with the use of vocabularies - with a particular vocab being used for a particular task.  A vocab that contains the mnemonics of a particular processor's instruction set would be one sensible use when using the interpreter within an assembler, compiler or disassembler.  


    Those of you who are familiar with Forth will say that I am just creating a portable Forth-like environment - but rather than being coded in the native machine language of the target processor, it has been written in C for convenience and portability.

    This is indeed partly true, as the utility I am creating has been inspired by the Forth language - especially in its compactness and low resource requirements.  Even in the 1960s Charles Moore was concerned how the tools provided for computing at that time hampered progress, and so set about redefining the whole man-machine interface. He compressed the previously separate editor, compiler, and interpreter programmes (none of which could be co-resident in memory at the same time) into a single compact, convenient package that did the job of all three.

    When Forth was first introduced in the late 1960s, mini computers had sub-MHz clock frequencies, and very little RAM, and so benefited greatly from a moderately fast and compact language like Forth. Nowadays, typical microcontrollers have clock frequencies in the 20MHz to 200MHz range and so are not so hindered by what is essentially a virtual machine implementation of a stack processor written in C.

    Virtual and Real Stack Machines

    I have embarked on this journey because of my wider interest in soft-core and open core processors implemented on FPGAs. Several of these cores are based on stack machines, partly because they may be readily implemented in surprisingly few lines of VHDL or Verilog. Indeed James Bowman's J1 Forth processor is fully described in fewer than 200 lines of Verilog.

    Whilst a virtual stack machine might not be the easiest fit for a register based processor without performance penalties, it is a wonderful fit for a real stack machine.  A number of open-core processors, including the ZPUino and James Bowman's J1, are true stack machines.  Here the instructions of the virtual machine have a near one-to-one mapping to the machine instructions of the stack processor.  In this case the text interpreter can be rewritten in the native assembly language of these CPUs, to benefit from the vast increase in speed of running without an additional layer of virtual machine.

    In order to do this, an assembler will be required that is tailored to the instruction set of the Forth processor - and this is one of the first tasks that the text interpreter will be used for: the assembly of machine code for a custom processor.

    One of the reasons why I am concerning myself with such low level primitive tools, is the need to understand them from the ground up so that they can be implemented on a variety of non-conventional processors.

    Whilst the ZPUino will execute Arduino code directly (albeit very inefficiently, because of the C to stack machine programming conflicts), the J1 will need the tools to build its own language from the ground up - and if you already have the mechanisms of a language in place, plus an easily customisable assembler, then it makes the job a lot easier.

    In a later post, I will give an update on the text interpreter and its application to custom code assemblers.

    Wednesday, November 11, 2015

    The Subtle Art of Substitution - Part 1

    A simple text interpreter that allows code to be invoked by natural language words.     Part 1.

    Over the weekend and in various bits of spare time I have been developing a tiny text interpreter in C, as part of the larger project of creating some low-overhead tools to run on various microcontroller targets.  The toolset will eventually include assemblers and compilers for some custom soft-core processors - but first I need the means to interpret typed text words and execute blocks of code if the word is recognised.

    Why this is useful

    This text interpreter is intended to provide a more human friendly interface layer to my SIMPL interactive programming language.  Writing in high level, more meaningful natural language will greatly enhance the speed at which SIMPL code can be generated.

    A natural language interface makes programming tasks much easier.  Real words are more memorable than individual ascii characters, and it all makes for more readable code listings. Whilst SIMPL might use lower case "a" to initiate the analog read function, typing "analog" is a lot more reader friendly. An interpreter that follows a few simple parsing rules can offer a much increased speed of programming, yet be modest in the amount of on-chip resources utilised to do this.  The code to implement the interpreter is about a 2K to 3K overhead on top of SIMPL - but that will include listing, editing and file handling utilities too.

    Substitution and Assemblers

    A text interpreter and its ability to execute blocks of code based on parsing the text commands or file it receives is a fundamental part of utility programmes such as assemblers and compilers. Here a set of keyword mnemonics representing instructions can be interpreted and used to assemble machine code instructions by direct substitution.

    With a simple text interpreter we can move out from the realms of numerical machine language, and implement the likes of assemblers, dissassemblers and even compilers.

    In the case of an assembler, the wordset will comprise of the mnemonics used by the target processor - and the interpreter will merely substitute the human readable mnemonic for the machine instruction numerical opcode.

    For example, a certain processor may have an instruction set including mnemonics such as ADD, AND, SUB, XOR etc. The role of the text interpreter is to find these words within the text buffer or input file and create a table consisting of direct machine code instructions, subroutine call addresses and other variables to be passed via the stack to those subroutines.

    At a level above the assembler is the compiler.  This also takes text based input and generates machine code to run on a specific processor.  However, compilers are very complex pieces of software, and it is more likely that I will find an alternative solution  - given long enough.

    Why do this?

    The purpose of the text interpreter is to provide a natural language text interface for a small, resource limited microcontroller - in a similar style to that provided by the various BASICs of the late 1970s. It's remarkable to think that some fully functioning BASICs fitted into 4K of ROM and 1K of RAM - solely by some very clever programming tricks in raw assembly language.

    Fortunately most embedded programming these days does not have to resort to raw assembler, and C has become the preferred interchange language for portability.  C code written for an Arduino may be fairly easily ported to an ARM - provided that low level I/O routines such as getchar and putchar are available for the target processor.

    Coding up a text interpreter is a good exercise in C coding - and as I am not a natural born coder, any meaningful coding exercise is good practice.  I also enjoy the challenge of making something work from scratch block by block - rather than being over reliant on other peoples' code, that I don't even pretend to understand.


    As a bare minimum, we assume that the microcontroller target can provide a serial UART interface for communicating with a terminal program. I have recoded Serial.print and its associated functions to use a much smaller routine - which saves memory space.

    Ideally the microcontroller should have at least a couple of kilo-bytes of RAM for holding the dictionary and headers making it possible to implement it on anything from an Arduino upwards.

    The text interpreter is an extension of the SIMPL interpreter, and can be used for programming tools such as text editors, assemblers, compilers and disassemblers. It provides the means to input text, analyse it for recognised keywords and build up a dictionary and jump table.

    Borrowing from Forth experience, the text interpreter (or scanner) will look for a match on the first 3 characters of the input and the length of the word.  As a word is typed in, it will initiate a search of the dictionary (of already known words). If a match is found, the word will be substituted for a 4 digit  (16 bit) jump address. If the word is not matched, it will be added in full to the dictionary table.

    This sounds all very Forth-like, and indeed it is, because it is a proven means to input new text data into a processor's memory using minimum of overheads. The dictionary structure is simple enough that it can easily be parsed for a word-match, and also processed for editing and printing.

    As each Forth definition is effectively just a line of text it can easily be  handled with a line-editor - again a simple task for a resource limited processor.

    Numbers are handled as literals. A quick scan of the text with "is_a_num" will reveal whether it is numerical text - if so it should be converted to a signed integer and put onto the stack.
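    The literal handling described above might be sketched as follows; is_a_num and to_int are hypothetical names, and the real conversion may differ in detail.

```c
#include <ctype.h>

/* Return 1 if the token is numerical text (optionally signed), else 0 */
int is_a_num(const char *s) {
    if (*s == '-')
        s++;                          /* allow a leading minus sign */
    if (*s == '\0')
        return 0;                     /* empty or bare "-" is not a number */
    while (*s) {
        if (!isdigit((unsigned char)*s))
            return 0;
        s++;
    }
    return 1;
}

/* Convert the ascii text to a signed integer, ready for the stack */
int to_int(const char *s) {
    int sign = 1, n = 0;
    if (*s == '-') { sign = -1; s++; }
    while (*s)
        n = n * 10 + (*s++ - '0');
    return sign * n;
}
```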

    The output of the text interpreter should be a series of call addresses relating to the functional blocks of code that perform the routines associated with the keyword.  In the case of the assembler example, the mnemonics can be translated directly using a look-up table which converts them directly into the machine instruction of the target processor - this is especially relevant if the target is a stack machine - such as the J1 forth processor.

    Charles Moore struck on the idea of a language that was designed for solving problems.  He envisioned having separate vocabularies for each problem he wanted to solve.  For example his assembler application would use a vocabulary tailored to that application - namely the mnemonics as keywords, similarly the SIMPL language would utilise a vocabulary that supported the SIMPL command set. Thus by pointing to a different vocabulary in flash, the processor can readily swap between contexts.

    Hop, Skip and Jump, - the Mechanics of Text Interpretation

    Short of providing a flow chart - the description below describes the operation of the text interpreter.

    The text interpreter will parse through lines of text, taking each "word" as defined by a group of characters terminated by white-space, and check through a list of dictionary words for a match. If there is a match, then the newly scanned word is either a system keyword or a new one that the user has previously added to the dictionary.

    If the word does not generate a match with any existing keywords then it is added to the end of the dictionary - thus allowing a match the next time it is used.

    In addition to the dictionary, there is a separate array of records known as the "headers". The headers consist of a shortform record of all of the words in the dictionary.  The purpose of the headers is to allow an efficient search to be performed on the dictionary entries - as words are listed in the headers by their first three characters and their length.  A match on the first 3 characters and the length was proven many years ago to be an effective and efficient means of word recognition - see section 3.2.3 here

    Once the header of a scanned word has been deemed to match one already in the header table, a jump address pointer can easily be calculated - it's actually generated as part of the matching routine.  This jump address pointer is decoded by a look-up table to generate an actual 16-bit jump address.
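    That decode step might be sketched as below. On a real target the table would hold 16-bit code addresses; here, purely for illustration, they are modelled as C function pointers, and all names are hypothetical.

```c
/* The match routine yields a small index; the look-up table expands
   it to the word's actual code address. */
typedef void (*word_fn)(void);

static int last_executed;
static void do_add(void)    { last_executed = 'a'; }
static void do_analog(void) { last_executed = 'n'; }

static word_fn jump_table[255] = { do_add, do_analog };

void execute(int match_address) {
    if (match_address >= 0 && jump_table[match_address])
        jump_table[match_address]();   /* "jump" to the word's code */
}
```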

    For compactness and efficiency, the word matching routine is limited to a maximum vocabulary of 255 words - which is more than enough for most applications.

    The text interpreter deals with lines of code.  At some point there will be an accompanying package that implements a line editor, as the first step towards a full screen editor.

    The input buffer of a terminal program may be some 250 to 300 characters long. This is more than adequate space to define most sequences of command words.  Indeed - it may be beneficial to restrict the input buffer to say 128 characters - as this is what can be displayed sensibly on an 800 x 600 VGA screen.

    Word storage format

    The shortform entries stored in the dictionary headers can be saved as a group of 4 bytes, consisting of the first 3 characters and the length byte. The routine that searches the headers for the match automatically generates the jump address pointer allowing a lookup to the actual jump address from a table.

            Byte:    0        1        2        3
            Field:   Char1    Char2    Char3    Len

    So a word can be expanded by knowing its length and the dictionary pointer to its first character.
    The jump address is shortened to a single-byte fast look-up from a table.
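    The expansion described above is a simple copy, sketched here with illustrative names: given the dictionary text, a word's start offset and the length byte from its header, the full word can be recovered for listing.

```c
#include <string.h>

/* Recover the full word from the dictionary for listing,
   using its start offset and the length byte from its header. */
void expand_word(const char *dict, int start, int len, char *out) {
    memcpy(out, dict + start, (size_t)len);
    out[len] = '\0';
}
```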


    It's taken a bit longer than expected, but after an intensive day thinking, re-thinking, then coding, the tiny (2K) text interpreter is now starting to take shape.

    I have put an interim version (#45) on this github gist

    The interpreter is written in fairly standard C so it can be ported to a number of devices.  If implemented on an Arduino using the Serial.print library, it uses about 4142 bytes of flash and 1897 bytes of RAM.  By using much more efficient custom UART routines for serial input and output, this can be massively reduced to just 2002 bytes of flash and 1710 bytes of RAM.

    Part 2 of this posting will look further at features of the text interpreter and the SIMPL toolset.