More Physical Programming

There is growing support for something I have been talking about for years – that to write good software, one needs to know the hardware well. It just makes sense. Anyway, this blog has a good quote that I would like to include here.

Quicksort. Divide and conquer. Search trees. These and other algorithms form the basis for a classic undergraduate algorithms class, where the big ideas of algorithm design are laid bare for all to see, and the performance model is one instruction, one time unit. “One instruction, one time unit? How quaint!” proclaim the cache-oblivious algorithm researchers and real world engineers. They know that the traditional curriculum, while not wrong, is quite misleading. It’s simply not enough to look at some theoretical computing machine: the next-generation of high performance algorithms need to be in tune with the hardware they run on. They couldn’t be more right.

Physical Programming

I had the opportunity to chat with some colleagues at work recently about the physical factors that affect programming. Today, I came across a question on Serverfault asking why performance deteriorated when they went from a 4-core machine to a 24-core machine. To the layperson, that does not seem to make much sense. However, to someone like me, it makes perfect sense.

While most computer scientists and engineers may treat computers as an abstraction over various pieces of hardware, the reality of the matter is that all of that hardware is bound by the laws of physics.

Take the example of a 4-core machine running faster than a 24-core machine: it is easy to understand once you know how computer memory works. Main memory is extremely slow compared to the processor. Therefore, in order to keep computation fast, internal cache memory is used inside the microprocessor. There may be several levels of internal cache, usually denoted L1, L2, L3 and so on.
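To get a feel for how large the gaps between these levels are, here is a rough sketch (not a rigorous benchmark) that chases pointers through a randomly shuffled ring of increasing size. On a typical machine, the time per access jumps each time the working set falls out of L1, then L2, then L3 and into main memory; the exact sizes and timings are machine-dependent.

```c
/* Sketch: average memory access time as the working set grows.
 * Uses a random single-cycle permutation (Sattolo's algorithm) so
 * the hardware prefetcher cannot predict the next address. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    /* Working-set sizes from 4 KB up to 64 MB. */
    for (size_t bytes = 4096; bytes <= (size_t)64 << 20; bytes *= 4) {
        size_t n = bytes / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        if (!next) return 1;

        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {   /* Sattolo's shuffle */
            size_t j = (size_t)rand() % i;
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        size_t idx = 0;
        const size_t steps = 10 * 1000 * 1000;
        double t0 = now_sec();
        for (size_t s = 0; s < steps; s++) idx = next[idx];
        double dt = now_sec() - t0;

        /* idx is printed so the loop cannot be optimised away. */
        printf("%8zu KB: %6.1f ns per access (idx=%zu)\n",
               bytes / 1024, dt / steps * 1e9, idx);
        free(next);
    }
    return 0;
}
```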

The closer the memory is to the processor, the faster it runs, but the smaller it gets. Main memory on a PC can reach multi-gigabyte sizes, but it runs extremely slowly. So, an application that runs well on a 4-core machine may suffer when it moves to a 24-core machine because data needs to be exchanged across memory boundaries. In a single-socket 4-core machine, all inter-process communication can happen within the L2 cache, but on a multi-socket 24-core machine, some of that communication has to go through main memory, which is slow.
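As a small illustration of how expensive inter-core communication can be, the sketch below (using pthreads; the 64-byte cache-line size is an assumption) has two threads bump their own counters, first packed into the same cache line and then padded onto separate lines. The padded version is usually several times faster, and on a multi-socket machine the gap widens further because the shared line has to bounce through slower levels of the memory system.

```c
/* Sketch: the cost of two cores fighting over one cache line.
 * Compile with: cc falseshare.c -O2 -pthread */
#define _POSIX_C_SOURCE 200112L
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

struct shared { volatile unsigned long a, b; };   /* same cache line */
struct padded {
    volatile unsigned long a;
    char pad[64];                        /* assumed cache-line size */
    volatile unsigned long b;            /* now on a different line */
};

static struct shared s;
static struct padded p;

/* volatile keeps the compiler from optimising the increment loop away;
 * this is only an illustration, not production-quality synchronisation. */
static void *bump(void *arg) {
    volatile unsigned long *c = arg;
    for (unsigned long i = 0; i < ITERS; i++) (*c)++;
    return NULL;
}

static double run(volatile unsigned long *x, volatile unsigned long *y) {
    pthread_t t1, t2;
    struct timespec t0, t;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&t1, NULL, bump, (void *)x);
    pthread_create(&t2, NULL, bump, (void *)y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (t.tv_sec - t0.tv_sec) + (t.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("same cache line:      %.2f s\n", run(&s.a, &s.b));
    printf("separate cache lines: %.2f s\n", run(&p.a, &p.b));
    return 0;
}
```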

Taking another example further down the memory hierarchy – hard disk storage – capacities are even larger, reaching terabyte and even petabyte scales, but performance is even worse because it is constrained by mechanical parts. I had to explain to my colleagues why it is a bad idea to read data off a hard disk from the end of a file towards the front, because of the way the disk spins.
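As a rough illustration, this sketch reads a file in 4 KB chunks, once front to back and once back to front. The file name is supplied on the command line and the timings are only indicative; on a real hard disk with a cold cache, the backward pass defeats the drive's and the operating system's read-ahead and tends to be far slower (drop the page cache between runs for a fair comparison).

```c
/* Sketch: forward versus backward sequential reads of a file. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define CHUNK 4096

static double scan(FILE *f, long size, int backwards) {
    char buf[CHUNK];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long off = 0; off < size; off += CHUNK) {
        long pos = backwards ? size - CHUNK - off : off;
        if (pos < 0) pos = 0;
        fseek(f, pos, SEEK_SET);
        if (fread(buf, 1, CHUNK, f) == 0) break;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    printf("forward:  %.2f s\n", scan(f, size, 0));
    printf("backward: %.2f s\n", scan(f, size, 1));
    fclose(f);
    return 0;
}
```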

In fact, when choosing buffer sizes for processing data, it is often crucial to understand the underlying hardware. Hard disks are read in sectors, typically 512 bytes or 4,096 bytes, so reads and writes should be done in whole multiples of the sector size. Cache memory is likewise organised into cache lines of a particular length, with varying degrees of associativity. If the processor has, say, a 128-byte cache line, it makes sense to use arrays whose sizes are whole multiples of it, to avoid straddling line boundaries.
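Here is a minimal sketch of choosing a buffer that respects those boundaries. The 4,096-byte sector size is an assumption for illustration; the real value should be queried from the target system via the operating system's block-device interfaces.

```c
/* Sketch: allocate an I/O buffer that is a whole number of sectors
 * and starts on a sector boundary, so every read maps cleanly onto
 * whole sectors and whole cache lines. */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define SECTOR_SIZE 4096   /* assumed physical sector size */

int main(void) {
    size_t buf_size = 256 * SECTOR_SIZE;   /* 1 MiB: 256 whole sectors */
    void *buf = NULL;
    if (posix_memalign(&buf, SECTOR_SIZE, buf_size) != 0) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    printf("buffer at %p, %zu bytes, sector-aligned: %s\n",
           buf, buf_size,
           ((uintptr_t)buf % SECTOR_SIZE == 0) ? "yes" : "no");
    free(buf);
    return 0;
}
```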

This is just one aspect of programming that most software programmers do not know about. An application will still work whether or not these things are taken into account, but its performance will suffer if the physical factors are ignored. I would recommend that this paper be made required reading for anyone who wishes to write high-performance software.

Update 2010-07-11: There is an ACM article that expounds on the same point I made here.

Instruction vs Data

A software programme is made up of many things, but the main ones are instructions and data. Instructions tell the processor what to do and are usually stored in the .text section of the binary. Data is what the programme works on and, at run time, mainly consists of the heap and the stack, which are stored in different parts of memory.
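A minimal sketch of where those pieces typically end up in a C programme (exact section names vary by toolchain); build it and run `size` or `nm` on the binary to see the split between .text and the various data sections.

```c
/* Sketch: typical placement of code and data in a C binary. */
#include <stdio.h>
#include <stdlib.h>

static const int table[4] = {1, 2, 3, 4}; /* .rodata: read-only data   */
static int initialised = 42;              /* .data:   initialised data */
static int zeroed[1024];                  /* .bss:    zero-initialised */

int main(void) {                          /* .text:   instructions     */
    int on_stack = initialised + zeroed[0];      /* stack variable     */
    int *on_heap = malloc(sizeof *on_heap);      /* heap allocation    */
    if (!on_heap) return 1;
    *on_heap = on_stack + table[0];
    printf("%d\n", *on_heap);
    free(on_heap);
    return 0;
}
```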

So, all else being equal, what is the difference between these two parts of an application, and should a programmer care?

Knuth claims that it is important to select an appropriate data structure for a programme and that choosing the right data structure can make or break an application – or something to that effect. Therefore, it is obvious that the data portion of a programme is a very important consideration. However, the trick is to have a data structure that is both small and fast, for two reasons.

Firstly, small data structures can be held in cache memory, speeding up processing by reducing the number of reads and writes between the processor and main memory. Anyone who cares about writing good software should already know about this trade-off, so it should not surprise most people. Unless there is a good reason to use a large dynamic data structure, a small static one may well perform better.
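As a rough illustration of that trade-off, the sketch below sums the same numbers stored once in a compact static array and once in a heap-allocated linked list. The array uses far less memory and is read with sequential, cache-friendly accesses, so it is usually noticeably faster; the element count and timings are only illustrative.

```c
/* Sketch: summing a contiguous array versus a pointer-chasing list. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000L

struct node { long value; struct node *next; };

static long values[N];   /* compact, contiguous storage */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    /* Build the linked list with one allocation per node. */
    struct node *head = NULL;
    for (long i = 0; i < N; i++) {
        values[i] = i;
        struct node *n = malloc(sizeof *n);
        if (!n) return 1;
        n->value = i;
        n->next = head;
        head = n;
    }

    double t0 = now_sec();
    long sum_array = 0;
    for (long i = 0; i < N; i++) sum_array += values[i];
    double t_array = now_sec() - t0;

    t0 = now_sec();
    long sum_list = 0;
    for (struct node *n = head; n; n = n->next) sum_list += n->value;
    double t_list = now_sec() - t0;

    printf("array: %.3f s (sum %ld)\n", t_array, sum_array);
    printf("list:  %.3f s (sum %ld)\n", t_list, sum_list);
    return 0;
}
```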

Secondly, small data structures use less memory. Instructions rarely change and can actually be stored in fixed memory on a chip, such as flash memory or ROM. For programmes embedded into on-chip hardware, it is very important to minimise the use of RAM and maximise the use of ROM for one reason – cost. On-chip RAM costs roughly six times as much per bit as ROM, because an SRAM cell uses six times as many transistors per bit.

So, for code that is embedded on-chip, it is particularly important to write code that uses less RAM and more ROM: static look-up tables in ROM are good, while dynamic pointer-linked structures in RAM are not so good.
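A minimal sketch of the difference, assuming a typical embedded C toolchain: a `static const` look-up table ends up in a read-only section that can be placed in flash or ROM, whereas a pointer-linked structure built with malloc() has to live in RAM. The sine values in the table are only an example.

```c
/* Sketch: a ROM-friendly look-up table versus a RAM-hungry list. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Goes into .rodata, i.e. flash/ROM on most embedded targets. */
static const uint16_t sine_table[8] = {
    0, 12539, 23170, 30273, 32767, 30273, 23170, 12539
};

/* Equivalent dynamic structure: every node costs RAM, plus heap
 * overhead, plus the pointer itself. */
struct sample_node {
    uint16_t value;
    struct sample_node *next;
};

static uint16_t sine_lookup(unsigned i) {
    return sine_table[i & 7];                 /* table stays in ROM */
}

static struct sample_node *sine_list(void) {
    struct sample_node *head = NULL;
    for (int i = 7; i >= 0; i--) {
        struct sample_node *n = malloc(sizeof *n);   /* RAM */
        if (!n) return head;
        n->value = sine_table[i];
        n->next = head;
        head = n;
    }
    return head;
}

int main(void) {
    struct sample_node *list = sine_list();
    printf("%u %u\n", (unsigned)sine_lookup(2),
           (unsigned)(list ? list->value : 0));
    return 0;
}
```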