More Physical Programming

There is growing support for what I have been saying for years – that to write good software, one needs to know the hardware well. It just makes sense. Anyway, the blog in question has a good quote that I would like to include here.

Quicksort. Divide and conquer. Search trees. These and other algorithms form the basis for a classic undergraduate algorithms class, where the big ideas of algorithm design are laid bare for all to see, and the performance model is one instruction, one time unit. “One instruction, one time unit? How quaint!” proclaim the cache-oblivious algorithm researchers and real world engineers. They know that the traditional curriculum, while not wrong, is quite misleading. It’s simply not enough to look at some theoretical computing machine: the next-generation of high performance algorithms need to be in tune with the hardware they run on. They couldn’t be more right.

KVM Entropy

I increased the SSL performance further by increasing the entropy pool. I tried following instructions found elsewhere but ran into a problem: while they increased the entropy pool on the host machine, they did nothing for the VM.

So, in order to increase the entropy pool for the VM, I had to enable an extra setting:

RNGDOPTIONS="--fill-watermark=90% --feed-interval=1"

That quickly drove up the entropy pool. The net result is better SSL performance, with an improvement of about 30% in concurrent connections. I don’t know about the quality of the random numbers, but for the purposes of my application, it should suffice.
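For anyone trying the same thing, a quick sanity check (assuming a Linux guest) is to watch the size of the kernel’s pool before and after the change:

# number of bits of entropy currently available to the kernel
cat /proc/sys/kernel/random/entropy_avail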

The only consumer of the random numbers is SSL, and it is only used for ephemeral data transfers. It does not matter if someone deciphers the information, because by the time they do, it will no longer be of much use.

SSL CipherSuite

Just read a nice article on speeding up SSL operations on Apache. There were a number of different techniques proposed but the one that caught my eye – being immediately usable – was to replace the ciphers with faster ones.

Security has always been a game of trade-offs. In this case, it means trading ciphers with higher security and larger keys for smaller and faster ones. As long as the cipher provides a sufficient level of security, not everyone needs AES256.

So, after configuring Apache to prefer RC4 over AES256, I got about a 20% performance boost: the server could process SSL connections faster, which allowed for higher concurrency.
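For reference, the Apache side of it is a change along these lines – the exact cipher string below is my own guess rather than the article’s recipe, and what works depends on the mod_ssl and OpenSSL builds in use:

# Prefer the faster RC4-SHA cipher over AES256 (example list only)
SSLCipherSuite RC4-SHA:AES128-SHA:HIGH:!aNULL:!MD5
SSLHonorCipherOrder on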

This got me thinking about embedded security. I think that for embedded servers, like those found on routers and such, something like RC4-SHA would be more than sufficient for protecting password logins from casual snooping. These small embedded devices are just not made for serving highly secure services – unless they come with hardware accelerators.

Something to think about further.

Physical Programming

I had the opportunity to have a chat with some colleagues at work recently about the physical factors affecting programming. Today, I came across a question on Serverfault that asked why performance deteriorated when an application was moved from a 4-core machine to a 24-core machine. To the lay person, it does not seem to make much sense. However, to someone like me, it makes perfect sense.

While most computer scientists and engineers may treat the computer as an abstraction over various pieces of hardware, the reality of the matter is that the hardware is bound by the laws of physics.

Taking the example of a 4-core machine running faster than a 24-core machine, it is easy to understand this once you understand how computer memory works. Main memory is extremely slow compared to the processor. Therefore, in order to compute quickly, a microprocessor keeps frequently used data in internal cache memory. There may be several levels of internal cache, usually denoted L1, L2, L3 and so on.

The closer the memory is to the processor, the faster it runs but the smaller it gets. Main memory on a PC can reach multi-gigabyte sizes, but it runs far more slowly. So, if an application runs well on a 4-core machine, it may suffer when it moves to a 24-core machine because data needs to be exchanged across memory boundaries. In a single-socket 4-core machine, all inter-process communication can happen within the L2 cache, but on a multi-socket 24-core machine, some of the inter-process communication needs to go through main memory, which is slow.
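To make that concrete, here is a small C sketch of the same principle at a smaller scale – my own illustration, assuming GCC and 64-byte cache lines. Giving each thread its own cache-line-sized counter keeps its data in its own core’s cache instead of having the same line bounce between cores or, worse, between sockets through main memory.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Pad each per-thread counter out to its own cache line so two cores
 * never fight over the same line ("false sharing"). The 64-byte line
 * size is an assumption; check the target CPU. */
#define CACHE_LINE 64
#define NTHREADS 4

struct padded_counter {
    volatile uint64_t value;
    char pad[CACHE_LINE - sizeof(uint64_t)];
} __attribute__((aligned(CACHE_LINE)));

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg)
{
    struct padded_counter *c = arg;
    for (uint64_t i = 0; i < 100000000ULL; i++)
        c->value++;                     /* stays within this core's cache */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, &counters[i]);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        printf("counter %d = %llu\n", i,
               (unsigned long long)counters[i].value);
    return 0;
}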

Taking another example further down the memory hierarchy – hard disk storage – the sizes are even larger, reaching up to petabyte scales, but the performance is even worse because it is constrained by mechanical parts. I had to explain to my colleagues why it is a bad idea to read data off a hard disk from the end of the file towards the front, because of the way the disk spins.

In fact, when choosing buffer sizes for processing data, it is often crucial to understand the underlying hardware. Hard disks are read in sectors of a fixed size, typically 512 or 4096 bytes. Cache memory is also organised into cache lines of various lengths and with different amounts of associativity. If a processor has a 128-byte cache line, it would make sense to use arrays that are whole multiples of it in order to avoid crossing boundaries.
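As a rough illustration, this is how I would allocate a disk I/O buffer in C so that it is both aligned to and a whole multiple of the sector size – the 4096-byte figure is an assumption, so check the actual sector size and cache-line length of the target machine:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SECTOR_SIZE 4096                /* assumed sector/page size */
#define SECTORS_PER_BUFFER 256          /* gives a 1 MiB buffer */

int main(void)
{
    void *buf = NULL;
    size_t len = (size_t)SECTOR_SIZE * SECTORS_PER_BUFFER;

    /* posix_memalign guarantees the alignment, so block-sized reads and
     * writes through this buffer line up with sector boundaries. */
    if (posix_memalign(&buf, SECTOR_SIZE, len) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    memset(buf, 0, len);                /* ready for block-sized I/O */
    printf("allocated %zu bytes aligned to %d\n", len, SECTOR_SIZE);
    free(buf);
    return 0;
}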

This is just one aspect of programming that most software programmers do not know about. While an application will still work whether or not these things are taken into account, its performance will degrade unless the physical factors are considered. I would recommend that this paper be made required reading for anyone who wishes to write high-performance software.

Update 2010-07-11: There is an ACM article that makes the same points I talked about.

Surround Microphone

I was watching Glee and a thought occurred to me. It was the final episode of Season 1 at the regional singing competition. Rachel and Finn walked down the aisles through the audience singing their parts. Then, I thought to myself what it would have been like to be seated in the audience and listening to them sing. That was when it occurred to me.

It would be cool if they had wireless microphones that were location-aware, with that information integrated into the sound system of the room. So, as Rachel walked down the left aisle onto the stage, the surround sound system in the auditorium would amplify her voice, but it would sound to everyone in the audience as if the voice was coming from exactly where Rachel was standing.

This should not be something too difficult to do. Let’s explore this idea a bit.

Most of the hardware complexity would be in the speakers, and the software complexity would lie in the sound system. The trick would be to transmit a hidden signal within the actual audio transmission. This signal would be picked up by the microphone worn by the singer. Using some algorithm magic, we could estimate the position of the microphone in the room relative to the positions of the speakers. This could then be used to calibrate the voice stream received by the system to do its surround sound magic.
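To give a flavour of the algorithm magic, here is a hedged sketch in C of the simplest version: if the mic can estimate its distance to three speakers – say, from the arrival delay of each speaker’s hidden signal – then plain 2-D trilateration recovers its position. The speaker layout and distances below are made-up numbers for illustration.

#include <stdio.h>

struct point { double x, y; };

/* Estimate the mic position from three known speaker positions and
 * three estimated mic-to-speaker distances. Subtracting the circle
 * equation of speaker 0 from speakers 1 and 2 leaves a 2x2 linear
 * system, solved here by Cramer's rule. */
static int trilaterate(const struct point s[3], const double d[3],
                       struct point *mic)
{
    double a11 = 2.0 * (s[0].x - s[1].x), a12 = 2.0 * (s[0].y - s[1].y);
    double a21 = 2.0 * (s[0].x - s[2].x), a22 = 2.0 * (s[0].y - s[2].y);
    double b1 = d[1]*d[1] - d[0]*d[0]
              + s[0].x*s[0].x - s[1].x*s[1].x
              + s[0].y*s[0].y - s[1].y*s[1].y;
    double b2 = d[2]*d[2] - d[0]*d[0]
              + s[0].x*s[0].x - s[2].x*s[2].x
              + s[0].y*s[0].y - s[2].y*s[2].y;
    double det = a11 * a22 - a12 * a21;

    if (det == 0.0)
        return -1;                      /* speakers are collinear */
    mic->x = (b1 * a22 - b2 * a12) / det;
    mic->y = (a11 * b2 - a21 * b1) / det;
    return 0;
}

int main(void)
{
    /* Three speakers around a 20 m x 10 m hall (assumed layout),
     * with the mic roughly in the middle of the floor. */
    struct point speakers[3] = { {0, 0}, {20, 0}, {0, 10} };
    double distances[3]      = { 11.18, 11.18, 11.18 };
    struct point mic;

    if (trilaterate(speakers, distances, &mic) == 0)
        printf("estimated mic position: (%.1f, %.1f)\n", mic.x, mic.y);
    return 0;
}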

This has the advantage of being a cheap option. We can do this with present day equipment. This is the passive surround mic.

The hidden signal can be embedded in either the very low or the very high frequency range. Knowing how amplifiers work, it would be better to stick to the lower frequency range. Then, to prevent picking up noise, we can borrow some ideas from infra-red transmitters – not only transmit a signal, but transmit a digital signal over this low frequency. This allows checksumming and error correction to be done.

The resolution of such a system may not be too great, but it should be scores better than the overall ambient amplification we get today. The response time of the system may also not be that great. However, it will definitely be good enough to deal with most concert situations where people do not move faster than a speeding bullet.

To improve on all this, we move on to the active surround mic. We just need to equip the microphone with a wireless transceiver and mount a bunch of base stations on the speakers. The mic can then triangulate its own position relative to each base station using the strength of the wireless signal, and also transmit its voice data to the nearest base stations over the same frequency.
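The signal-strength part could look something like the following C sketch, using the usual log-distance path-loss model. The calibration numbers are made up; a real system would calibrate them per venue and then feed the resulting distances into the same trilateration step as the passive design above.

#include <math.h>
#include <stdio.h>

#define RSSI_AT_1M   -40.0  /* dBm measured 1 m from a base station (assumed) */
#define PATH_LOSS_N    2.5  /* path-loss exponent for an indoor hall (assumed) */

/* Estimated distance in metres to a base station, given the measured
 * signal strength in dBm. */
static double rssi_to_distance(double rssi_dbm)
{
    return pow(10.0, (RSSI_AT_1M - rssi_dbm) / (10.0 * PATH_LOSS_N));
}

int main(void)
{
    /* Example: the mic hears three base stations at these levels. */
    double rssi[3] = { -58.0, -63.0, -70.0 };

    for (int i = 0; i < 3; i++)
        printf("base %d: roughly %.1f m away\n", i, rssi_to_distance(rssi[i]));
    return 0;
}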

It’s a good idea, isn’t it?

PS: I’m not sure if such things are already on the market – I am not an audiophile – but they certainly should be!

Instruction vs Data

A software programme is made up of a bunch of things, but the main ones are instructions and data. Instructions tell the programme what to do and are usually stored in the .text portion of the binary. Data is what the programme works on and mainly consists of the heap and the stack, stored in different parts of memory.
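As a rough illustration – the section names are the usual Linux/ELF ones and other platforms differ – here is where the various pieces of a small C programme typically end up:

#include <stdlib.h>

static const int table[4] = {1, 2, 3, 4};   /* read-only data: .rodata */
static int counter = 42;                    /* initialised data: .data */
static int scratch[256];                    /* zero-initialised data: .bss */

int main(void)                              /* the instructions live in .text */
{
    int local = 0;                          /* stack */
    int *dynamic = malloc(1024);            /* heap */

    (void)local;
    free(dynamic);
    return table[0] + counter + scratch[0];
}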

So, with all else being equal, what is the difference between these two parts of an application and should a programmer care?

Knuth claims that it is important to select an appropriate data structure for a programme and that the selection of the right data structure can make or break an application – or something to that effect. Therefore, it is obvious that the data portion of a programme is a very important consideration. However, the trick is to choose data structures that are small and fast, for two reasons.

Firstly, small data structures can be held in cache memory, which speeds up processing by reducing the number of reads and writes between the processor and memory. Anyone who cares about writing good software should already know about this trade-off, so it should not surprise most people. Unless there is a good reason to use a large dynamic data structure, a small static one may produce better performance.

Secondly, small data structures use less memory. Instructions rarely change and can actually be stored in fixed memory on a chip, such as flash memory or ROM. In the case of programmes embedded into on-chip hardware, it is very important to minimise the use of RAM and maximise the use of ROM for one reason – cost. On-chip RAM costs roughly six times as much per bit as ROM, because an SRAM cell uses six transistors per bit.

So, for code that is embedded on-chip, it is particularly important to write code that uses less RAM and more ROM. Static look-up tables in ROM are good, while dynamic pointer-linked structures in RAM are not so good.
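A minimal C sketch of what that means in practice, assuming a typical embedded toolchain that places const data in flash or ROM – the table values are illustrative, not a real waveform:

#include <stdint.h>

/* Good: a const look-up table is placed in ROM and costs no RAM. */
static const uint8_t sine_table[16] = {
    128, 177, 218, 245, 255, 245, 218, 177,
    128,  79,  38,  11,   1,  11,  38,  79
};

uint8_t sine_sample(uint8_t phase)
{
    return sine_table[phase & 0x0F];
}

/* Not so good on a small chip: the same data built at run time as a
 * pointer-linked list lives entirely in scarce RAM, and pays for the
 * pointers and allocator overhead on top of the values themselves. */
struct node {
    uint8_t value;
    struct node *next;
};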