Sunday, November 22, 2015

Results of J1 Simulation

First Up - Some Results

In the last post I looked at running the J1 simulator on various platforms, from the humble 16MHz Arduino to the red-hot STM32F746 running at 216MHz. Here are the results of those tests in J1 instructions per second - or JIPS as I call them.

Platform     Clock     Core             Speed
Arduino      16MHz     ATmega328            67,000 JIPS
ZPUino       96MHz     Soft CPU            152,000 JIPS
STM32F103    72MHz     ARM Cortex M3       404,000 JIPS
STM32F407    168MHz    ARM Cortex M4     3,000,000+ JIPS *
STM32F746    216MHz    ARM Cortex M7     9,000,000+ JIPS *

* The last two results were obtained with compiler optimisation at level 0. With more aggressive optimisation, the '746 returned 27 million JIPS.

So now that we have a means of simulating the processor at about 1/20th of full speed, the time has come to decide exactly how we are going to port a useful high level language onto this processor model.

James Bowman has done excellent work porting Forth onto his J1 soft core, but I am not quite ready to plunge into Forth - for me it's about the journey of exploration in reaching a high level language implementation - under my own steam.

A small revelation

At this point it is interesting to note that, if the 27 million JIPS figure is indeed correct, then the 216MHz Cortex M7 core is spending about 8 clock cycles (216 / 27) on every emulated J1 instruction in this particular (non-demanding) test program. So it is probably fair to say that most modern ARM processors (M7 and above) would achieve a similar ratio whilst simulating the J1.

If this is the case, then a 1GHz ARM could simulate a 100MHz J1 - or, put the other way, a 100MHz J1 would have a similar overall performance to a 1GHz ARM that was executing some sort of stack-based virtual machine bytecode language, e.g. Java.
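As a rough check on that arithmetic, here is a minimal sketch in C that divides each host clock by its measured JIPS figure from the table of results. The function and structure names are made up for illustration, and the projection assumes that the J1 executes one instruction per clock and that the same ratio carries over to a faster ARM core, neither of which has been measured here.

```c
#include <stdio.h>

/* Host clock cycles spent per emulated J1 instruction:
   host clock (Hz) divided by the measured JIPS figure. */
double cycles_per_j1(double host_hz, double jips)
{
    return host_hz / jips;
}

/* J1 clock rate a host could emulate in real time, ASSUMING the
   cycles-per-instruction ratio stays the same on that host and the
   J1 executes one instruction per clock. */
double emulated_j1_hz(double host_hz, double cycles_per_instr)
{
    return host_hz / cycles_per_instr;
}

/* One row of the results table (figures copied from the post). */
struct result {
    const char *platform;
    double host_hz;
    double jips;
};

/* Print the cycles-per-instruction ratio for every measured platform. */
void print_ratios(void)
{
    static const struct result results[] = {
        { "Arduino",               16e6,     67000.0 },
        { "ZPUino",                96e6,    152000.0 },
        { "STM32F103",             72e6,    404000.0 },
        { "STM32F407",            168e6,   3000000.0 },
        { "STM32F746",            216e6,   9000000.0 },
        { "STM32F746 (optimised)", 216e6, 27000000.0 },
    };
    size_t count = sizeof results / sizeof results[0];

    for (size_t i = 0; i < count; i++) {
        printf("%-22s %8.1f cycles per J1 instruction\n",
               results[i].platform,
               cycles_per_j1(results[i].host_hz, results[i].jips));
    }
}
```

Calling cycles_per_j1(216e6, 27e6) gives 8, and emulated_j1_hz(1e9, 8.0) gives 125MHz, which is where the "1GHz ARM for a 100MHz J1" estimate comes from, with a little margin.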

A lot of applications are written in Java (e.g. the Arduino IDE), and by this estimate the overhead of running a virtual stack machine on a register-based CPU slows it down by a factor of about 10. If however the Java bytecode were translated into an intermediate form (possibly J1 Forth) it would likely run appreciably faster.

The point I am making is that a customised soft-core stack CPU, tailored to Java bytecode and running on an FPGA, could make Java run a lot faster on less powerful hardware. Some ARM ICs already have the ability to run Java bytecode directly - known as Jazelle. This is how some games are written, in order to run faster on small platforms such as mobile phones.

Running the J1 Simulator on ZPUino.

The ZPUino has shown itself to be a convenient and useful 32 bit processor, implemented on FPGA hardware. It is Arduino code compatible, runs my simulations at about twice the speed of an Arduino, and allows easy use of the Adafruit GFX graphics library, which permits 800 x 600 VGA text and graphics to be displayed on a flat screen monitor.

Whilst not a particularly fast processor, ZPUino does allow easy and unrestricted access to the graphics library - such that it is easy to create a series of animated display screens for displaying high level output, using what is effectively an Arduino sketch. This technique is particularly flexible, and allows you to creatively interact with the particular problem - rather than get bogged down in someone else's system calls and drivers.

I took the very short J1 test program as used in the simulations - a simple loop consisting of 7 instructions - and used the ZPUino to run it as an animated simulator. This graphically showed the contents of memory as a hex dump, plus the main J1 registers, the stack and the instructions as they were stepped through. Repeated re-drawing of the hex dump memory display slowed the execution right down to about 1 instruction per second - about one hundred-millionth of real J1 execution speed.
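To give a flavour of what one line of such a hex dump involves, here is a minimal sketch in plain C that formats a single row of 16 bit memory words into a text buffer. It is not the code from the ZPUino sketch: the function name, the row width of 8 words and the "address: words" layout are all assumptions made for illustration.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define WORDS_PER_ROW 8   /* assumed layout, not taken from the original sketch */

/* Format one row of the dump: a 4-digit hex address followed by up to
   WORDS_PER_ROW 16-bit words, stopping early at the end of memory.
   Returns the number of characters written (excluding the terminator),
   or -1 if the buffer is too small. */
int format_dump_row(char *buf, size_t len, const uint16_t *mem,
                    uint16_t addr, size_t mem_words)
{
    size_t used;
    int n = snprintf(buf, len, "%04X:", (unsigned)addr);

    if (n < 0 || (size_t)n >= len)
        return -1;
    used = (size_t)n;

    for (size_t i = 0; i < WORDS_PER_ROW && (size_t)addr + i < mem_words; i++) {
        n = snprintf(buf + used, len - used, " %04X", (unsigned)mem[addr + i]);
        if (n < 0 || (size_t)n >= len - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Each row would then be handed to whatever text-drawing call the display library provides. Since it is the repeated redrawing that costs the time, redrawing only the rows whose contents have changed is one possible way to claw some of the speed back.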

The Missing Assembler

What was missing from this exercise was the ability to easily write J1 test programs in J1 machine code - and this rather hampered progress. So it is for this reason that the first application of the SIMPL text interpreter will be at the core of the J1 cross assembler.

Whilst the J1 is intended to run Forth, and has the tools to support it, my Forth skills are not great, and anyway I'm trying to challenge myself to learn C to a reasonable standard. So a coding project written in C that taxes my language and thinking skills is a good way to learn, and achieve something useful.

The interpreter can take a set of mnemonics tailored for the J1 processor and, by a process of direct substitution, create the series of 16 bit instructions that can then be run on the J1 virtual machine. I really want this to be an interactive process working in a Forth-like manner - so that small snippets or blocks of J1 assembly language can be assembled and tested individually as an iterative process.
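The direct substitution idea can be sketched in a few lines of C. This is only an illustration, not the SIMPL-based assembler itself: the mnemonic names and function names are hypothetical, and the opcode values assume that the simulated J1 uses the instruction encoding published by James Bowman for the original J1 (ALU instructions in the 0x6000 range, literals with the top bit set).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mnemonic table. Opcodes ASSUME the original J1 encoding;
   the mnemonic names are illustrative only. */
struct mnemonic {
    const char *name;
    uint16_t opcode;
};

static const struct mnemonic table[] = {
    { "dup",  0x6081 },  /* T' = T, T -> N, data stack +1 */
    { "over", 0x6181 },  /* T' = N, T -> N, data stack +1 */
    { "swap", 0x6180 },  /* T' = N, T -> N                */
    { "drop", 0x6103 },  /* T' = N, data stack -1         */
    { "add",  0x6203 },  /* T' = T + N, data stack -1     */
    { "exit", 0x700C },  /* R -> PC, return stack -1      */
};

/* Translate one token into a 16-bit instruction.
   Returns 1 on success, 0 if the token is not recognised. */
int assemble_token(const char *tok, uint16_t *out)
{
    char *end;
    unsigned long v;

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(tok, table[i].name) == 0) {
            *out = table[i].opcode;
            return 1;
        }
    }

    /* Not a mnemonic: try a numeric literal (decimal, or hex with 0x).
       A literal instruction has bit 15 set and carries a 15-bit value. */
    v = strtoul(tok, &end, 0);
    if (end != tok && *end == '\0' && v <= 0x7FFF) {
        *out = (uint16_t)(0x8000u | v);
        return 1;
    }
    return 0;
}

/* Assemble a whitespace-separated line of tokens into code[].
   Returns the number of words emitted, or -1 on an unknown token,
   an over-long line, or too many words for the output array. */
int assemble_line(const char *line, uint16_t *code, int max_words)
{
    char buf[128];
    int n = 0;

    if (strlen(line) >= sizeof buf)
        return -1;
    strcpy(buf, line);

    for (char *tok = strtok(buf, " \t\n"); tok != NULL;
         tok = strtok(NULL, " \t\n")) {
        if (n >= max_words || !assemble_token(tok, &code[n]))
            return -1;
        n++;
    }
    return n;
}
```

With this table, the line "2 3 add dup exit" assembles to the five words 0x8002, 0x8003, 0x6203, 0x6081 and 0x700C. Because each line is assembled on its own, the same routine could sit behind an interactive prompt, which fits the snippet-at-a-time way of working described above. Jumps, calls and labels are left out of this sketch.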

It's many years since I wrote any code in assembler - and that was Z80 which had a reasonable mix of registers to play with.

Writing in a minimal instruction set language is going to be interesting.

In order to gen up on the processes involved within a typical assembler, I returned to "NAND to Tetris" Chapter 6. There is a good description of what is needed there. I then went on to refresh myself on the contents of Chapter 7 - "Virtual Machine I - Stack Arithmetic" - and Chapter 8 - "Virtual Machine II - Program Flow". Having re-read these chapters, in the fresh light of a new day, I believe that my musings about the J1 cpu are not only very relevant, but completely on track with the content and approach outlined in "NAND to Tetris".

More on this in a later post.