I recently had a chat with some colleagues at work about the physical factors affecting programming. Today, I came across a question on Serverfault asking why performance deteriorated when the poster moved from a 4-core machine to a 24-core machine. To the layperson, this does not seem to make much sense. However, to someone like me, it makes perfect sense.
While most computer scientists and engineers may treat computers as an abstraction over various pieces of hardware, the reality is that all of this hardware is bound by the laws of physics.
Taking the example of a 4-core machine running faster than a 24-core machine, the result is easy to explain once you understand how computer memory works. Main memory is extremely slow compared to the processor. Therefore, in order to perform fast computation, cache memory is built into the microprocessor. There may be several levels of cache, usually denoted L1, L2, L3 and so on.
The closer the memory is to the processor, the faster it runs but the smaller it gets. Main memory on a PC can reach multi-gigabyte sizes, but it runs extremely slowly by comparison. So, an application that runs well on a 4-core machine may suffer when it moves to a 24-core machine because data needs to be exchanged across memory boundaries. In a single-socket 4-core machine, all inter-process communication can happen within the on-chip L2 cache, but on a multi-socket 24-core machine, some of the inter-process communication needs to cross sockets and go through main memory, which is slow.
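To make the reasoning concrete, here is a rough sketch of the usual weighted-average estimate for memory access time. The function name and the latency figures in the comments are my own placeholders for illustration, not measurements from any real machine.

```python
def effective_latency(hit_rate, cache_latency, memory_latency):
    """Average cost of one access, given the fraction served by the cache.

    hit_rate is between 0.0 and 1.0; the two latencies can be in any
    unit (cycles, nanoseconds) as long as both use the same one.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0.0 and 1.0")
    return hit_rate * cache_latency + (1.0 - hit_rate) * memory_latency


# Hypothetical figures: say a cache hit costs 2 units and a trip to
# main memory costs 100. Even a modest drop in hit rate, as when
# communication starts crossing sockets, moves the average a long way.
mostly_cached = effective_latency(0.99, 2, 100)
often_missing = effective_latency(0.50, 2, 100)
```

The point is only that the slow term dominates quickly: once a meaningful share of accesses has to leave the cache, the extra cores stop paying for themselves.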
Take another example further down the memory hierarchy: hard disk storage. The sizes are even larger, reaching petabyte scales, but performance is even slower because hard disks are mechanical devices. I had to explain to my colleagues why it is a bad idea to read a file off a hard disk from the end towards the front. The platter spins in one direction only, so after reading one block the drive may have to wait for most of a rotation before the preceding block comes back under the head.
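The two access patterns can be sketched as follows. Both function names and the 1 MiB default chunk size are my own choices for illustration; the backward version is included only to show the pattern to avoid on a spinning disk.

```python
import os


def read_forward(path, chunk_size=1 << 20):
    """Yield the file front to back in fixed-size chunks."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def read_backward(path, chunk_size=1 << 20):
    """Yield chunks starting from the end of the file.

    Each step seeks to an earlier offset, which works against the
    direction the data passes under the head on a mechanical drive.
    """
    end = os.path.getsize(path)
    with open(path, "rb") as f:
        while end > 0:
            start = max(0, end - chunk_size)
            f.seek(start)
            yield f.read(end - start)
            end = start
```

If the data really is needed in reverse order, it is usually better to read it forward in large chunks and reverse it in memory.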
In fact, when choosing buffer sizes for processing data, it is often crucial to understand the underlying hardware. Hard disks are read in sectors, and sector sizes are typically either 512 bytes or 4096 bytes, so buffer sizes should be whole multiples of the sector size. Cache memory is also organised with different cache-line lengths and different amounts of associativity. If a processor has a 128-byte cache line, it would make sense to use arrays whose sizes are whole multiples of it, and that start on a line boundary, in order to avoid straddling cache lines.
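As a small sketch of the buffer-size arithmetic, the helper below rounds a requested size up to the next whole multiple of a block size. The names are mine, and the 4096-byte sector is an assumption; the real value should be checked for the drive in question.

```python
ASSUMED_SECTOR_SIZE = 4096  # assumption: check the actual drive


def round_up(size, multiple):
    """Round size up to the nearest whole multiple of `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((size + multiple - 1) // multiple) * multiple


def buffer_size_for(requested, sector_size=ASSUMED_SECTOR_SIZE):
    """Pick a buffer size of at least one sector, in whole sectors."""
    return round_up(max(requested, 1), sector_size)
```

The same rounding applies to cache lines: pass the line length instead of the sector size to pad an array out to a whole number of lines.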
This is just one aspect of programming that most programmers do not know about. An application will still work whether or not these things are taken into account, but its performance will suffer if the physical factors are ignored. I would recommend that this paper be made required reading for anyone who wishes to write high-performance software.
Update 2010-07-11: There is an ACM article that expounds the same thing I talked about.