Realising the DCPU16

I decided to do a blitz and get a simple working version of the DCPU16 CPU in hardware.

This was my journey.

General Architecture
The DCPU16 architecture is not exactly standard, nor was it optimised for hardware implementation. However, since it is a simple CPU, the architecture itself is not too difficult to deal with.

A regular instruction passes through the following stages: fetch, decode, load A, load B, execute, and save A. My design uses an 8-stage pipeline, with one clock cycle per stage, overlapped across two instructions. Therefore, each instruction effectively takes 4 stages, or 4 clocks, to complete.
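
The overlap described above can be sketched as a small schedule model. The stage order comes from the text; filling the remaining two slots with "spare" at the end and the names `STAGES`, `ISSUE_INTERVAL`, `stage_of`, `in_flight` and `total_clocks` are assumptions made for illustration, not the actual design.

```python
# Six named stages from the text, padded to 8 slots.
# ASSUMPTION: the two unnamed slots are idle and sit at the end.
STAGES = ["fetch", "decode", "load A", "load B",
          "execute", "save A", "spare", "spare"]

# Two instructions share the 8-stage pipeline, so a new one enters every 4 clocks.
ISSUE_INTERVAL = 4


def stage_of(instr, clock):
    """Return the stage that instruction `instr` occupies at `clock`, or None."""
    slot = clock - instr * ISSUE_INTERVAL
    return STAGES[slot] if 0 <= slot < len(STAGES) else None


def in_flight(clock, n_instr):
    """Map each active instruction index to its stage at `clock`."""
    active = {}
    for i in range(n_instr):
        stage = stage_of(i, clock)
        if stage is not None:
            active[i] = stage
    return active


def total_clocks(n_instr):
    """Clocks needed to drain `n_instr` instructions through the pipeline."""
    return (n_instr - 1) * ISSUE_INTERVAL + len(STAGES)


if __name__ == "__main__":
    for clock in range(total_clocks(3)):
        print(clock, in_flight(clock, 3))
```

Under these assumptions, at most two instructions are ever in flight, and a long run of instructions approaches 4 clocks per instruction.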

To feed this pipeline, two independent memory buses are needed. I’d hazard to call it a Harvard architecture because one memory bus is used purely for loading data while the other bus is used for fetching instructions and storing data. This will work fine for internal memory access only. However, if it is necessary to access memory-mapped I/O, this will need to be modified slightly.

There are two ways to modify it. One bus can be used for data load/store operations (operand A) while the other is used for instruction fetch and data load (operand B) operations. The other way to modify it is to use the two spare cycles to do external memory access instead.

The decoder’s main job is to decode the effective address calculation. Decoding this can be a bit of a pain as the processor supports a whole range of addressing modes: various direct, indirect and immediate modes. So, this is the trickiest part of the core. In fact, this is also the file with the most code in it.
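
To give a feel for the number of addressing modes, here is a minimal software sketch of operand classification. It assumes the operand encoding of version 1.1 of the published DCPU-16 specification (instruction word laid out as bbbbbbaaaaaaoooo), which may not be the exact revision the core targets. The function names and mode labels are hypothetical.

```python
REGISTERS = ["A", "B", "C", "X", "Y", "Z", "I", "J"]

# Single-value operand codes (ASSUMPTION: DCPU-16 spec version 1.1).
SPECIAL = {
    0x18: ("pop", "[SP++]", False),
    0x19: ("peek", "[SP]", False),
    0x1A: ("push", "[--SP]", False),
    0x1B: ("register", "SP", False),
    0x1C: ("register", "PC", False),
    0x1D: ("register", "O", False),
    0x1E: ("next_word_indirect", None, True),
    0x1F: ("next_word_literal", None, True),
}


def split_word(word):
    """Split a 16-bit instruction word into (opcode, a, b) fields."""
    return (word & 0xF, (word >> 4) & 0x3F, (word >> 10) & 0x3F)


def decode_operand(value):
    """Classify a 6-bit operand field as (mode, detail, needs_next_word)."""
    if not 0 <= value <= 0x3F:
        raise ValueError("operand field is 6 bits wide")
    if value <= 0x07:
        return ("register", REGISTERS[value], False)
    if value <= 0x0F:
        return ("register_indirect", REGISTERS[value - 0x08], False)
    if value <= 0x17:
        return ("indexed", REGISTERS[value - 0x10], True)
    if value in SPECIAL:
        return SPECIAL[value]
    # 0x20-0x3f encode the small literals 0x00-0x1f directly.
    return ("literal", value - 0x20, False)
```

For example, the word 0x7C01 splits into opcode 0x1, operand a = 0x00 (register A) and operand b = 0x1F (a literal held in the next word). The `needs_next_word` flag is what makes the decode awkward in hardware, since it changes how many words the instruction occupies.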

There is not much to say about the ALU except that it uses unsigned numbers, which is fine for the adder but not so fine for the multiplier. My design uses a 17×17 multiplier instead of a 16×16 one. The conditional code testing is also part of the ALU decoding. This can be changed in later revisions.
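
The post does not spell out why the extra bit helps. One plausible reading, and it is only an assumption, is that the available hardware multiplier is signed, so 16-bit unsigned operands are zero-extended to 17 bits to keep the sign bit clear. The sketch below models that in software; all function names are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def signed_multiply(a, b, bits):
    """Model a signed bits x bits multiplier; return the 2*bits-wide bit pattern."""
    product = to_signed(a, bits) * to_signed(b, bits)
    return product & ((1 << (2 * bits)) - 1)


def unsigned_mul16_via_17(a, b):
    """Multiply two unsigned 16-bit values on a signed 17x17 multiplier.

    Bit 16 of each zero-extended operand is 0, so the signed multiplier
    sees both as non-negative and the product is the true unsigned result.
    """
    return signed_multiply(a & 0xFFFF, b & 0xFFFF, 17) & 0xFFFFFFFF


if __name__ == "__main__":
    # A signed 16x16 multiplier reads 0xFFFF as -1, so the upper half is wrong.
    print(hex(signed_multiply(0xFFFF, 0xFFFF, 16)))
    print(hex(unsigned_mul16_via_17(0xFFFF, 0xFFFF)))
```

Note that the low 16 bits of the product are the same either way; it is the upper 16 bits that differ between the signed and unsigned interpretations.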

Due to pipelining, there will be some data and control hazards. My design does not take data hazards into account at the moment. My assumption is that compilers will take care of things or that code can be manually re-ordered slightly.

All in all, it took me almost a week plus two iterations to get it done.

Now, it’s released on GitHub.

Visual Diff

I like the idea of doing a visual diff – particularly for circuit schematics and PCB layouts. This is quite interesting. Honestly, I have never actually thought of this before but it’s great that existing tools can be used to clearly record the diff between hardware revisions in a visual manner.

The general idea behind it is pretty simple and straightforward:

  1. Using the circuit tools, export each revision in a standardised graphical format, e.g. PDF, SVG or PNG.
  2. Use ImageMagick to convert those graphics to a standard black and white format.
  3. Use ImageMagick to do some fancy processing on both graphics.
  4. This will immediately highlight the differences with false colouring.
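
The post does not give the exact ImageMagick commands, so the following sketch stands in for steps 2 to 4 using plain Python lists of greyscale pixel values (0 is black ink, 255 is white paper). The function names and the particular colour assignment are assumptions chosen for illustration.

```python
def to_black_and_white(image, threshold=128):
    """Reduce a greyscale image (rows of 0-255 values) to pure black and white."""
    return [[255 if px >= threshold else 0 for px in row] for row in image]


def false_colour_diff(old, new):
    """Combine two same-sized black-and-white images into one RGB image.

    The old revision drives the red channel and the new revision drives
    green and blue, so unchanged pixels stay black or white, ink that
    exists only in the old revision shows up cyan, and ink that exists
    only in the new revision shows up red.
    """
    if len(old) != len(new) or any(len(a) != len(b) for a, b in zip(old, new)):
        raise ValueError("images must have the same dimensions")
    return [
        [(o, n, n) for o, n in zip(row_old, row_new)]
        for row_old, row_new in zip(old, new)
    ]


if __name__ == "__main__":
    old = to_black_and_white([[10, 250, 20]])
    new = to_black_and_white([[10, 250, 240]])
    print(false_colour_diff(old, new))
```

Any pixel that is neither black nor white in the result marks a change between the two revisions.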

Now, it’d be great if someone wrote a git post-hook to auto-magically do this.

Stuxnet Worm

This video comprehensively explains how a hypothetical attack could be carried out by an attacker using the Stuxnet worm. This has very serious implications because it means that low-level industrial embedded systems are also now targets for attack. These SCADA systems are used everywhere and lack the necessary resources to defend themselves from attack.

The technique used is a fairly straightforward one. The attacker can download and modify a programming library and use that to intercept the actual program being downloaded onto the SCADA system.

This technique has been used for ages, in a non-malicious way. For example, my Xilinx board does not properly support my OS for programming. Thankfully, someone out there has written an open source driver to intercept all the Xilinx calls. Install this driver and the Xilinx ISE will think that it is talking to its own driver while all its calls are actually being intercepted by the wrapper.
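
The benign wrapper idea can be sketched as a simple forwarding shim. This is not the actual Xilinx driver interface: the class names, method names and return strings below are all hypothetical, and a real driver wrapper works at the library or system-call level rather than in Python.

```python
class OpenSourceCable:
    """Hypothetical replacement backend that actually talks to the board."""

    def open(self):
        return "open cable: opened"

    def write(self, data):
        return f"open cable: wrote {len(data)} bytes"


class InterceptingDriver:
    """Exposes whatever method names the caller expects and forwards them.

    The calling tool believes it is talking to its own driver, while every
    call is recorded and redirected to the backend.
    """

    def __init__(self, backend):
        self._backend = backend
        self.calls = []

    def __getattr__(self, name):
        # Only reached for names not found normally, i.e. the driver calls.
        def forward(*args):
            self.calls.append((name, args))
            return getattr(self._backend, name)(*args)

        return forward


if __name__ == "__main__":
    shim = InterceptingDriver(OpenSourceCable())
    print(shim.open())
    print(shim.write(b"bitstream"))
    print(shim.calls)
```

The same structure explains why the technique cuts both ways: whoever controls the shim sees, and can alter, everything that passes through it.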

Interesting, and cool.

This makes me think that I will probably need to build in some protection to my future cores to at least enable a limited amount of security checking of downloaded code.

Solution, looking for Problem?

Sometimes, I wonder whether I am driven to solve problems or whether I get caught up in the beauty of the solution instead. There are some technologies that I have built that are definitely useful – for something. However, I may be staring too closely at the solution to actually see the problem space. Therefore, I have decided not to work on the solution for a while but to look outside for inspiration instead. I already know what I want to do. Now, I just need to know where to apply it.

For the next couple of months, I will embark on a serious journey of creation. I plan to document the entire process on this blog and hope that it may one day be useful to somebody. For now, I will go for a short run followed by dinner and some shopping.

AEMB Benchmarked!

The AEMB has actually been benchmarked! I have claimed that the AEMB is the world’s fastest and smallest 32-bit multi-threaded RISC processor. Chapter 2 of this thesis puts it in terms of real numbers!

Extracting the pertinent section of the results:


The AEMB achieves 0.3 MHz/LUT, which is way ahead of the rest. The thesis goes on to run software Dhrystone and Fibonacci benchmarks and finds the performance to be good too.

Unfortunately, there were many issues faced during the implementation of the AEMB that resulted in the author dropping the AEMB in favour of making a custom processor.

The main issue was that the author had difficulties targeting an ASIC platform because the AEMB was designed for the FPGA and optimised for an FPGA (which shows in the results). There were many design trade-offs that were made to make it very small and fast on an FPGA platform. Unfortunately, this is a show stopper as trying to port it to an ASIC technology would essentially require a redesign of the entire architecture.

There were other issues as well, including poor documentation and bad sample software. While I will agree with the part on poor documentation, I think that the author probably mistook the old AEMB sample software for the new AEMB sample software. This is something that can be avoided with better documentation to make it clear. So, ditto – it’s bad documentation. This needs to be fixed for the next generation.

Regardless, I’m happy that we can now put some numbers to my claims!

Little Big Computer


After taking the customary bow of respect to the person who designed this level on LBP, I’d just like to say that this was not that difficult to do – albeit very time consuming.

He did not actually design a complete computer but focused on the main computational component – the arithmetic unit. In this particular case, the arithmetic unit can perform two functions – addition and subtraction.

What this video actually shows is the exemplification of the Church-Turing thesis. In this case, the PS3 has successfully performed a simulation of an arithmetic unit, within the confines of a simulated virtual world. Nice recursion.

Also, it might seem weird that this computer was constructed from moving parts instead of, say, electrons. However, before we were inundated with electronic digital computers as we are today, we once used mechanical computers too, such as the Zuse Z1 currently on display in Berlin. Our history of computing is filled with all kinds of computers.

Now, if only it could be turned into a really interesting gaming level.

Multi-core Parallelism

As I mentioned in a previous post, I am running a six-core machine – it is actually a virtual machine. Regardless, I noticed one thing while doing it. As I went from single-core to six-core, performance improved accordingly. But when I went up to eight-core, things deteriorated. This is probably due to the arrangement of the processors. I am running a two-socket system with six-core AMD processors. So, for an eight-core machine to work, it would need to span both sockets, which is not a good idea because of the cost of communicating across sockets.

So, that’s why I ended up using a six-core VM.