Physical Programming #105: Duffy Objects

I ended up giving a short one-hour lecture at work today about object-oriented programming. OOP is a confusing enough paradigm as it is without having to indoctrinate my kids with more behind-the-scenes stuff. However, over the course of actually explaining OOP to them, I came to realise that the reason I understand OOP better than they do is that I understand physical programming.

So, I thought to myself, it would be useful to include a chapter on OOP in my book – a behind-the-scenes picture.

Then, while reading stuff on the Internet today, I came across something called Duff’s Device. It is something I had learned about a while ago but had forgotten. And again, I realised that it is only a problem for people who do not understand the physical aspects of programming. It, too, will make a good example chapter for my book.
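For reference, Duff's Device is the classic C trick of interleaving a switch statement with a do-while loop to unroll a copy loop. The sketch below follows the shape of Tom Duff's original, which copied to a memory-mapped output register (so `to` is never incremented); it looks illegal at first glance, but it is valid C, and it stops being confusing once you picture how the compiled loop actually executes.

    /* Duff's Device: loop unrolling by interleaving a switch with a do-while.
       Follows the shape of Tom Duff's original, which copies `count` shorts
       to a memory-mapped output register (note that `to` is never
       incremented) and assumes count > 0. */
    void send(volatile short *to, short *from, int count)
    {
        int n = (count + 7) / 8;     /* number of passes through the loop        */
        switch (count % 8) {         /* jump into the middle of the unrolled body */
        case 0: do { *to = *from++;
        case 7:      *to = *from++;
        case 6:      *to = *from++;
        case 5:      *to = *from++;
        case 4:      *to = *from++;
        case 3:      *to = *from++;
        case 2:      *to = *from++;
        case 1:      *to = *from++;
                } while (--n > 0);
        }
    }

With a count of 10, for example, the switch jumps straight to case 2, executes the last two assignments, and the do-while then performs one more full pass of eight.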

Seems like I will have to expand the physical programming aspects to include lots of behind-the-scenes examples of common operations.


Physical Programming #104: Parallel Programming

I was attending a training session on parallel programming given by Dr Yan Solihin (from NCSU) today when it occurred to me that one very important reason to learn physical programming is to do parallel programming. Parallel programming could be considered a subset of physical programming.

While physical programming encourages one to consider all the physical aspects of programming, parallel programming forces one to consider the physical constraints on executing instructions across multiple cores and machines. This means taking the memory hierarchy into account, as well as the communication limits.

Therefore, this might make a good sub-chapter or two on physical programming. It is very important to show how and why parallelism matters – and that we should stop being lazy programmers.
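As one concrete (and entirely hypothetical) illustration of such a constraint, consider false sharing: two threads updating two independent counters can still slow each other down if the counters happen to sit on the same cache line, even though the source code shows no sharing at all. The sketch below assumes POSIX threads and a 64-byte cache line; removing the padding puts both counters on one line and typically makes the run measurably slower.

    /* Sketch: two threads incrementing two independent counters.
       With the padding, each counter sits on its own (assumed 64-byte)
       cache line; without it, the counters share a line and the cores
       keep invalidating each other's caches (false sharing). */
    #include <pthread.h>
    #include <stdio.h>

    #define ITERATIONS 100000000UL

    struct padded_counter {
        unsigned long value;
        char pad[64 - sizeof(unsigned long)];   /* keep counters on separate lines */
    };

    static struct padded_counter counters[2];

    static void *worker(void *arg)
    {
        struct padded_counter *c = arg;
        for (unsigned long i = 0; i < ITERATIONS; i++)
            c->value++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &counters[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%lu %lu\n", counters[0].value, counters[1].value);
        return 0;
    }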

Physical Programming 103: Floating Point Unit

I had someone tell me that they did not need any specialised hardware for their application; all they needed was simple hardware that was stable and easy to use. The main reason was that their value-add would be in the software algorithms and interface. I had a quick look at the algorithms, and they are largely formulae over real numbers.

I almost fell off my chair.

Regardless of how great an uber algorithm is, it will ultimately get executed on real-world hardware. If the hardware was not designed to support the uber algorithm, it is going to choke. Take the example of a floating-point unit (FPU). On most desktops, we take floating-point calculations for granted, but this was not always the case in the past and is still not the case for many embedded products today.

If a processor does not have an FPU and the uber algorithm works on single-precision floats, performance is going to be killed by the software emulation of the floating-point operations; you can basically forget about using double precision. The emulated version would produce the same results as one that ran on an FPU, but it would suffer from severe performance problems.

An alternative is to use a fixed-point algorithm instead. This allows the algorithm to be implemented without using any floating-point numbers at all. However, it requires the algorithm to be redesigned from the ground up around fixed-point operations. That is not the kind of decision one wants to make towards the end of a development cycle.
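As a rough illustration of what that redesign involves, here is a minimal sketch of Q16.16 fixed-point arithmetic in C (the format and helper names are mine, not from any particular project): values are stored as scaled 32-bit integers, so the arithmetic needs nothing but the integer unit.

    /* Minimal Q16.16 fixed-point sketch: 16 integer bits, 16 fractional bits,
       stored in a plain 32-bit integer so no FPU is required.  The conversion
       helpers use doubles only for this demonstration. */
    #include <stdint.h>
    #include <stdio.h>

    typedef int32_t q16_16;

    #define Q_ONE (1 << 16)

    static q16_16 q_from_double(double x) { return (q16_16)(x * Q_ONE); }
    static double q_to_double(q16_16 x)   { return (double)x / Q_ONE; }

    static q16_16 q_mul(q16_16 a, q16_16 b)
    {
        /* Widen to 64 bits so the intermediate product does not overflow. */
        return (q16_16)(((int64_t)a * b) >> 16);
    }

    int main(void)
    {
        q16_16 a = q_from_double(3.25);
        q16_16 b = q_from_double(1.5);
        printf("3.25 * 1.5 = %f\n", q_to_double(q_mul(a, b)));   /* 4.875 */
        return 0;
    }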

If, on the other hand, a system comes with a whole host of FPUs, these can be exploited to perform multiple floating-point operations at a time. If the software was not written to take advantage of this (for example, by using vector extensions such as SSE where they are available), then it is a total waste of processing power, and we could have saved cost by buying a less capable processor instead.
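For example (a hedged sketch, assuming an x86 processor with SSE), the intrinsics below add four single-precision floats in a single instruction; if the code is written as a plain scalar loop and never vectorised, that width of the hardware simply goes unused.

    /* Adding four floats at a time with SSE intrinsics. */
    #include <xmmintrin.h>
    #include <stdio.h>

    int main(void)
    {
        float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
        float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
        float c[4];

        __m128 va = _mm_loadu_ps(a);      /* load four unaligned floats      */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* four additions, one instruction */
        _mm_storeu_ps(c, vc);

        printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }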

Therefore, the design of an algorithm has to always follow what the hardware allows. It cannot be done independently of the hardware, unless performance is not a concern – as highlighted in an earlier quote:

Quicksort. Divide and conquer. Search trees. These and other algorithms form the basis for a classic undergraduate algorithms class, where the big ideas of algorithm design are laid bare for all to see, and the performance model is one instruction, one time unit. “One instruction, one time unit? How quaint!” proclaim the cache-oblivious algorithm researchers and real world engineers. They know that the traditional curriculum, while not wrong, is quite misleading. It’s simply not enough to look at some theoretical computing machine: the next-generation of high performance algorithms need to be in tune with the hardware they run on. They couldn’t be more right.

Anyone who works with floating point should read this paper.

Physical Programming 102: Random Access Files

If we wanted to read the contents near the end of a file, what are the considerations?

While it is true that computers are able to access files on a disk in any manner, there are some physical limitations that we need to consider when dealing with random access files. If we are trying to read a file from beginning to end, things are pretty straightforward. However, if we want to read the contents near the end of a file, care needs to be taken.

Most files are stored on some kind of physical media, either an optical disc or magnetic storage on a hard disk. In these cases, the mechanical limitations of the drive factor into every file read/write operation. The storage media rotates in only one direction, and that is the primary physical limitation. The drive first needs to seek to the correct location before performing the operation. This seek time shows up as latency, which is usually significant compared to the actual read/write performance.

Therefore, it is in the programmer’s interest to avoid unnecessary seeks, especially when working with a random-access file.

Furthermore, the data is stored in the form of sectors or blocks. When data is read from the disk, it is read in whole blocks. Therefore, whether the algorithm reads a single byte or a full sector, it results in the same amount of disk activity. For efficiency, it is better to read the data in sectors and process it in blocks.

Let’s take a hypothetical example: a large file with 512-byte sectors, where we need to find a key somewhere in the last 12KB of data.

In this scenario, the best way to do this would be to:

  1. Get the size of the file.
  2. Subtract 12KB from the size of the file.
  3. Seek to this location.
  4. Process the file from this position.

There are various algorithms that can be used to search for the key within this last 12KB. Regardless of which algorithm is chosen, we need to be aware that the disk will be read in whole blocks from the seek position towards the end of the file. It would not be wise to scan in the opposite direction, as doing so incorrectly will force the disk to re-seek for every block.
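A minimal sketch of those steps in C is shown below; the file name and key are hypothetical, and the search deliberately ignores a key that straddles a sector boundary, which a real implementation would have to handle.

    #include <stdio.h>
    #include <string.h>

    #define TAIL_SIZE (12 * 1024)   /* the last 12KB of the file   */
    #define SECTOR    512           /* read in sector-sized chunks */

    int main(void)
    {
        FILE *fp = fopen("large.dat", "rb");   /* hypothetical file name */
        if (!fp)
            return 1;

        /* 1. Get the size of the file. */
        fseek(fp, 0, SEEK_END);
        long size = ftell(fp);

        /* 2 & 3. Subtract 12KB and seek to that position. */
        long start = size > TAIL_SIZE ? size - TAIL_SIZE : 0;
        fseek(fp, start, SEEK_SET);

        /* 4. Process forward from here, one sector at a time. */
        const char key[] = "KEY";   /* hypothetical key (assumes text data) */
        char buf[SECTOR + 1];
        size_t n;
        while ((n = fread(buf, 1, SECTOR, fp)) > 0) {
            buf[n] = '\0';
            if (strstr(buf, key) != NULL) {
                puts("key found");
                break;
            }
        }

        fclose(fp);
        return 0;
    }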

Physical Programming 101: Base64 Encoding/Decoding Files

What are the various considerations when a person needs to perform base64 encoding/decoding of a file? In this case, the operation is pretty simple and the application would most likely be disk-bound.

First and foremost is the file size. If the files that need to be processed are small, then it is a simple matter to load the entire file into memory, process it, and save it back to disk. However, if the file is large, it would not be possible to do so without running out of memory.

In such a situation, the file would need to be cut up and processed in smaller blocks. We are lucky in the sense that base64 conversions are amenable to block processing. The basic operation involves mapping three 8-bit values into four 6-bit values. This gives us a hint on the block size that we should adopt.

The ratio of the input and output memory blocks should be 3:4 for encoding and 4:3 for decoding. That is a simplistic but valid requirement. For all intents and purposes, this will work, except for the last block of data, which will need to be padded according to the base64 rules. Unfortunately, while this gives us the ratio, it does not help us determine the block size.
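To make the 3-to-4 mapping concrete, here is a small sketch of encoding one full 3-byte group; padding of a trailing partial group is left out.

    #include <stdio.h>
    #include <stdint.h>

    static const char b64_alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    /* Encode exactly one full 3-byte group into four base64 characters. */
    static void b64_encode_group(const uint8_t in[3], char out[4])
    {
        uint32_t triple = ((uint32_t)in[0] << 16) | ((uint32_t)in[1] << 8) | in[2];
        out[0] = b64_alphabet[(triple >> 18) & 0x3F];
        out[1] = b64_alphabet[(triple >> 12) & 0x3F];
        out[2] = b64_alphabet[(triple >> 6) & 0x3F];
        out[3] = b64_alphabet[triple & 0x3F];
    }

    int main(void)
    {
        const uint8_t in[3] = { 'M', 'a', 'n' };
        char out[5] = { 0 };
        b64_encode_group(in, out);
        printf("%s\n", out);   /* prints "TWFu", the textbook example */
        return 0;
    }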

Using the smallest possible block sizes of 3 and 4 bytes would not make any sense, because files are read from disk in fixed-size blocks. Even if we were to read only 3 bytes, an entire block would be read by the disk; the same happens for writes. If our block size is too small, there is too much unnecessary looping just to access data that is already sitting in the file buffer, which has the potential to slow things down. If the block size is too large, it wastes memory.

Therefore, it makes sense to use blocks that approximate the size of the disk blocks.

That said, in this particular case, the input and output blocks need to be of different sizes. So, only one of them can match the disk block size, while the other will be either about a third larger or a quarter smaller, depending on the direction of the conversion. Since a write operation is typically more expensive than a read, it makes sense to align the write block to the disk block size. This allows full blocks to be written straight to disk instead of waiting to buffer more data. So, it is helpful to use a write buffer that is a multiple of the disk block size.

In this example, the data is only processed once, so caches will have little effect on things. In fact, there is unlikely to be much of a performance boost from all these considerations, other than avoiding unnecessary performance hits. That said, at least the selection of the processing buffer sizes will have some logical and rational basis, instead of being a number picked out of the blue.

We also have a unit block size that should be used – multiples of 4KB, a typical disk block size.
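Putting the pieces together, the sketch below shows what a block-processing encode loop might look like under those assumptions: the write buffer is a multiple of an assumed 4KB disk block, and the read buffer is three-quarters of it so the 3:4 ratio holds exactly. The file names are hypothetical, and the padding of a trailing partial group is again omitted.

    #include <stdio.h>
    #include <stdint.h>

    #define DISK_BLOCK   4096                    /* assumed disk/filesystem block */
    #define OUT_BUF_SIZE (4 * DISK_BLOCK)        /* 16KB, aligned for writes      */
    #define IN_BUF_SIZE  (OUT_BUF_SIZE / 4 * 3)  /* 12KB, keeps the 3:4 ratio     */

    static const char b64[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    int main(void)
    {
        FILE *in  = fopen("input.bin", "rb");    /* hypothetical file names */
        FILE *out = fopen("output.b64", "wb");
        if (!in || !out)
            return 1;

        uint8_t inbuf[IN_BUF_SIZE];
        char    outbuf[OUT_BUF_SIZE];
        size_t  n;

        while ((n = fread(inbuf, 1, IN_BUF_SIZE, in)) > 0) {
            size_t o = 0;
            for (size_t i = 0; i + 3 <= n; i += 3) {
                uint32_t t = ((uint32_t)inbuf[i] << 16) |
                             ((uint32_t)inbuf[i + 1] << 8) | inbuf[i + 2];
                outbuf[o++] = b64[(t >> 18) & 0x3F];
                outbuf[o++] = b64[(t >> 12) & 0x3F];
                outbuf[o++] = b64[(t >> 6) & 0x3F];
                outbuf[o++] = b64[t & 0x3F];
            }
            /* A final partial group (n % 3 != 0) would be '='-padded per the
               base64 rules; that step is omitted here for brevity. */
            fwrite(outbuf, 1, o, out);
        }

        fclose(in);
        fclose(out);
        return 0;
    }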

Physical Programming Series

I thought that I should put some of my knowledge to good use by contributing something of value. Seeing that I have recently been arguing that physical knowledge of a machine is important in learning how to write good code, I thought that it might be a good idea for me to write a short series on the physical aspects of computers and how they relate to code.

Premature optimization is the root of all evil (or at least most of it) in programming.
(Knuth 1974).

While I agree that a programmer should focus on the functional and security aspects of the application instead of the gritty details of the hardware, what I am talking about is not premature optimisation. I am talking about learning good programming habits, so that a programmer ends up writing code that makes it easier for the compiler to optimise the application.

Also, I am fully aware that, with the trend of increasing complexity in our computing devices as we try to squeeze more out of them, physical limitations play an increasing role in determining performance. The difference between a good app and a bad one can sometimes come down to the size of the data structures chosen to hold its data.

A programme could be I/O bound, memory bound, synchronisation bound, or even algorithmically bound. In all cases, knowing the hardware limitations and how to work around them can move the bottleneck elsewhere, or even spread the load out so that the application is not bound by any one particular aspect.

Also, note that most of these bounding limitations are imposed by hardware – either communication or computational bandwidth. So, there we go: software has physical limitations, and we should learn about them, just so that we become better programmers.

More Physical Programming

There is more support for what I have been talking about for years – that to write good software, one needs to know the hardware well. It just makes sense. Anyway, this blog has a good quote that I would like to include here.

Quicksort. Divide and conquer. Search trees. These and other algorithms form the basis for a classic undergraduate algorithms class, where the big ideas of algorithm design are laid bare for all to see, and the performance model is one instruction, one time unit. “One instruction, one time unit? How quaint!” proclaim the cache-oblivious algorithm researchers and real world engineers. They know that the traditional curriculum, while not wrong, is quite misleading. It’s simply not enough to look at some theoretical computing machine: the next-generation of high performance algorithms need to be in tune with the hardware they run on. They couldn’t be more right.