I had someone tell me that they do not need any specialised hardware for their application and all they needed was just simple hardware that was stable and easy to use. The main reason is because their value-add would be in the software algorithms and interface. I had a quick check on the algorithms and they are largely formulae with real numbers.

I almost fell off my chair.

Regardless of how great an uber algorithm is, it will finally get executed on real-world hardware. If the hardware was not designed to support the uber algorithm, it is going to choke. Take the example of an floating-point unit (FPU). In most desktops, we take floating point calculations for granted but this was not always the case in the past and is not always the case for embedded products today.

If a processor does not have an FPU and the uber algorithm works on single-precision floats, performance is going to get killed because of software emulation for the floating-point operations. You can basically forget about using double-precision. It would produce the same results as one that ran on an FPU but it would suffer from severe performance issues.

Another alternative is to use a fixed-point algorithm instead. This allows the algorithm to be implemented entirely without exploiting any floating-point numbers. However, this would require a redesign of the algorithm from the ground up to support fixed-point operations. This is not the kind of decision that one wants to make towards the end of a development cycle.

If on the other hand, a system comes with a whole host of FPU units, these can be exploited to perform multiple floating-point operations at a time. If the software was not written to cater to this (such as exploiting specific extensions that are available like SSE) then it would be a total waste of processing power and we can save cost by buying a less featured processor instead.

Therefore, the design of an algorithm has to always follow what the hardware allows. It cannot be done independently of the hardware, unless performance is not a concern – as highlighted in an earlier quote:

Quicksort. Divide and conquer. Search trees. These and other algorithms form the basis for a classic undergraduate algorithms class, where the big ideas of algorithm design are laid bare for all to see, and the performance model is one instruction, one time unit. “One instruction, one time unit? How quaint!” proclaim the cache-oblivious algorithm researchers and real world engineers. They know that the traditional curriculum, while not wrong, is quite misleading. It’s simply not enough to look at some theoretical computing machine: the next-generation of high performance algorithms need to be in tune with the hardware they run on. They couldn’t be more right.

[email protected]: Anyone who works with floating-points should read this paper.