March 2010 – Technical Intercourse

Software Layers

I had a talk with my apprentice today, on software application architecture. In our case, it had to do with an application that had to interface with a back-end database. I was reviewing her code when I realised that she was replicating a lot of code everywhere – a definite case for code factoring. Everyone likes to talk about factoring but I had to find an easy way to explain it to my apprentice.

I explained to her that it is a good practice to design software in three layers – primitives, middleware and application – and proceeded to describe the layers.

Primitives are low-level functions. This did not seem to make sense to her and so I elaborated that primitives should only do one thing and one thing only. It did not elicit any extra clarity from her and so I decided to put it in context. For our application, the database operations could be made into primitives – insert, update, delete and select. Once I said that, things became obvious to her. This layer would be tied very closely to the low-level architecture, the back-end database in our case.

Middleware serves as glue between application and primitives. This makes sense in context. Our application only supports one database back-end today but we may need to support more database back-ends in the future. This is where middleware comes in. The middleware provides a standard application programming interface (API) that can be used by the application regardless of database back-end. We may have multiple primitives used to access different database back-ends but we can have a single middleware layer that abstracts it all away from the application.
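
To make the three layers concrete, here is a minimal sketch in Python. It assumes SQLite as the back-end, and every name in it (sqlite_insert, sqlite_select, Store, register_user, make_demo_store) is made up for illustration; it is not the code from our actual application.

```python
import sqlite3

# --- Primitives: one function per database operation, tied to SQLite ---
# Table and column names are interpolated, so they must be trusted identifiers.
def sqlite_insert(conn, table, row):
    columns = ", ".join(row)
    marks = ", ".join("?" for _ in row)
    conn.execute(
        f"INSERT INTO {table} ({columns}) VALUES ({marks})",
        tuple(row.values()),
    )

def sqlite_select(conn, table, key, value):
    cur = conn.execute(f"SELECT * FROM {table} WHERE {key} = ?", (value,))
    return cur.fetchall()

# --- Middleware: a back-end-neutral API that the application calls ---
class Store:
    def __init__(self, conn, insert, select):
        self._conn = conn
        self._insert = insert   # primitive for this back-end
        self._select = select   # primitive for this back-end

    def add(self, table, row):
        self._insert(self._conn, table, row)

    def find(self, table, key, value):
        return self._select(self._conn, table, key, value)

# --- Application: business logic only, no SQL and no driver calls ---
def register_user(store, name):
    if store.find("users", "name", name):
        return False            # name already taken
    store.add("users", {"name": name})
    return True

def make_demo_store():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    return Store(conn, sqlite_insert, sqlite_select)
```

Supporting a second back-end would then mean writing a new pair of primitives and passing them to Store, while register_user stays untouched.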

In the case of embedded software, the primitives would be any architecture-specific code that needs to access the hardware registers and functionality directly. As with databases, these would usually include code that sets and gets values. A typical example would be code that sets, clears and toggles bits in a register. The middleware would be any code that defines processes and functionality by wrapping around the bare-metal primitive code. This could include code that resets, initialises and activates certain functionality.
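
The embedded case can be sketched the same way. The snippet below models a register as a plain Python integer rather than real hardware, and the bit positions (ENABLE_BIT, RESET_BIT) belong to an imaginary control register; all of it is an assumption for illustration only.

```python
# Primitives: each function does exactly one thing to a register value.
def set_bit(reg, bit):
    return reg | (1 << bit)

def clear_bit(reg, bit):
    return reg & ~(1 << bit)

def toggle_bit(reg, bit):
    return reg ^ (1 << bit)

# Hypothetical bit positions in an imaginary control register.
ENABLE_BIT = 0
RESET_BIT = 7

# Middleware: a process built by wrapping the primitives.
def reset_and_enable(reg):
    reg = set_bit(reg, RESET_BIT)     # assert reset
    reg = clear_bit(reg, RESET_BIT)   # release reset
    return set_bit(reg, ENABLE_BIT)   # activate the function
```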

Applications should be pretty obvious. The application layer contains all the business processes and logic that are specific to that one application. Ideally, if everything was done properly, we would be able to use the same set of middleware and primitives across multiple different applications. This should be the ideal that we strive for when architecting software applications. I hope that she will always remember this simple rule-of-thumb in her future code.

IPv6 Stress

I was reading this article on seven IPv6 myths today and a random thought entered my head. The article seemed to stress that the only real difference that IPv6 makes in this world is extending addresses from 32 bits to 128 bits. This is a 4x increase in the length of every address (the address space itself grows by a factor of 2^96).

This instantly raised a yellow flag in my head. It also means that there will be a marked increase in the memory needed on networking devices, coupled with an increase in memory bandwidth consumption on those devices.

What many people may not know is that a networking device like a router needs to maintain its routes somewhere. This is usually done as some sort of table in memory. When a network packet comes into a router, its header is examined for the destination address. This is then used to look up the route to take in the routing table. Then, the network packet is forwarded onto that route on its way to the destination.

So, an IPv6 routing table would need roughly four times the memory for its address fields. This would not be so bad if we could ignore the real-world performance of memory. It also means that the number of memory transfers would roughly quadruple as well, assuming similar table and lookup patterns.

While I have not researched this, I do think that there is room for innovation here. We could possibly change the routing table structure by dropping some of the bits, by partitioning it into directories indexed by the significant bits of the address, or by hashing the IP addresses in such a way as to reduce memory requirements.
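
As a rough sketch of the partitioning idea, the Python below groups routes into buckets keyed by the top bits of the address, so that a lookup only scans one small bucket. The class name PartitionedRoutes, the 16-bit index width and the next-hop labels are all assumptions of mine; real routers use far more refined structures, and this is only meant to show the shape of the idea.

```python
import ipaddress

class PartitionedRoutes:
    """Routes grouped into buckets keyed by the top bits of the address."""

    def __init__(self, index_bits=16):
        self.index_bits = index_bits
        self.buckets = {}

    def _index(self, addr):
        # Key on IP version plus the most significant index_bits bits.
        shift = addr.max_prefixlen - self.index_bits
        return (addr.version, int(addr) >> shift)

    def add(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        if net.prefixlen < self.index_bits:
            raise ValueError("prefix is shorter than the index width")
        key = self._index(net.network_address)
        self.buckets.setdefault(key, []).append((net, next_hop))

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        candidates = self.buckets.get(self._index(addr), [])
        # Longest matching prefix wins among the routes in this bucket.
        matches = [(net.prefixlen, hop) for net, hop in candidates if addr in net]
        return max(matches)[1] if matches else None

table = PartitionedRoutes(index_bits=16)
table.add("2001:db8::/32", "hop-a")
table.add("2001:db8:1::/48", "hop-b")
print(table.lookup("2001:db8:1::5"))
```

Prefixes shorter than the index width are simply rejected here; a fuller version would have to replicate them across buckets or keep them in a fallback list.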

This may not mean much to a lot of people but if the route look-up process slows down by a factor of four, it will significantly affect the latency of the network even if its effect on throughput may be minimal. Latency is very important for certain applications, such as anything that requires near real-time feedback like first-person shooters.

Interesting…

Seeded Hash

For the sake of security, I had to implement a seeded-hash system (more commonly called a salted hash) to secure passwords in a database. While straightforward hashes are good for ensuring that passwords are not stored in clear text, they are still vulnerable to rainbow-table attacks. A seeded hash helps to reduce this risk.

However, the question arose of how to actually do a seeded hash. I got my apprentice to look around and we finally found a useful scheme. The seed and password are usually concatenated before being hashed.
hash = Hash(seed + password);
However, if the seed were fixed, it would not really help much because the hashes would still be susceptible to rainbow-table attacks if the secret seed ever got out. So, we had to use a random seed. However, a random seed would make the hash impossible to reproduce at verification time unless we could somehow store the seed alongside the hash.

Since this was part of a password storage scheme, it would be perfectly alright to embed the seed with the hash because the size of the hash result is fixed. So, any extra data stored with the hash result would be the seed. We could convert everything to Base64 to store it in clear-text on the server. This was the scheme that we used in the end:
seededhash = Base64(Hash(seed + password) + seed);
This way, to do a password match, the application would need to decode the Base64, separate the seed from the hash and then perform the hash operation on the supplied password with the seed to see if it matched the hash.
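
The whole round trip can be sketched in Python as below. The original scheme does not name a hash function or a seed length, so SHA-256 and a 16-byte seed are assumptions here, as are the function names make_seeded_hash and check_password. For real deployments a deliberately slow key-derivation function such as hashlib.pbkdf2_hmac is generally preferred over a single hash pass.

```python
import base64
import hashlib
import hmac
import os

SEED_BYTES = 16                              # assumed seed length
HASH_BYTES = hashlib.sha256().digest_size    # fixed size of a SHA-256 digest

def make_seeded_hash(password, seed=None):
    """Return Base64(Hash(seed + password) + seed)."""
    if seed is None:
        seed = os.urandom(SEED_BYTES)        # fresh random seed per password
    digest = hashlib.sha256(seed + password.encode("utf-8")).digest()
    return base64.b64encode(digest + seed).decode("ascii")

def check_password(password, stored):
    """Split the stored value into hash and seed, then re-hash and compare."""
    raw = base64.b64decode(stored)
    digest, seed = raw[:HASH_BYTES], raw[HASH_BYTES:]
    candidate = hashlib.sha256(seed + password.encode("utf-8")).digest()
    return hmac.compare_digest(candidate, digest)   # constant-time compare
```

Because the digest size is fixed, everything after the first HASH_BYTES bytes of the decoded value is the seed, which is exactly what makes the embedding work.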

Caveat: this solution only works in the situation of password matching – where we already know exactly which record to match against and merely need to verify that the information provided is accurate. This would not be useful for indexing purposes, like that used in Git. In such a scenario, straight-forward hashing would still need to be done.

Trust and Verify

I had a chat with my boss the other day, about how good software should be developed. He mentioned that we should use unit-testing and I told him that it was rubbish. Then, he mentioned TDD where tests are written before the actual application and I told him that he risked lying to himself about the quality of the code developed.

Okay, while this is not an entry bashing software testing, I will highlight why testing fails. As one prominent computer security researcher points out:

One researcher with three computers shouldn’t be able to beat the efforts of entire teams, Miller argued. “It doesn’t mean that they don’t do [fuzzing], but that they don’t do it very well.”

This is because we end up relying on tools to do the testing. Not that there is anything wrong with tools, but the tools are only as good as their operators and the problem lies with the tool operator. When tools are used, the operators generally are either less astute or become lazy. They start to rely on the tool to catch problems and if the tool fails to catch any, they rely on the tool to sign off on the quality of the code. Tools are inanimate objects and cannot be held liable for the stuff that they test. The tool operator should always dig in manually to verify the results. Through my 20+ years of active programming, I have been caught out by tools so many times that I have developed a distinct distrust towards any tool. I even verify my compiler outputs by disassembling the binaries and checking them by hand once, before running them through a simulator to observe the operations.
objdump -dSC
I argued with my boss that there is only one way to develop good code, and that is to just frakin’ write good code, which is not as hard as some people imagine it to be. Writing good code is like writing good prose: there are certain styles that can be followed that will result in better code. Unfortunately, good writing habits are hard to develop and need to be hammered into our apprentices from the very get-go. Our universities are failing in this by not driving this point home and by teaching our programmers to be ever more lazy and rely on automated tools to generate code instead. Okay, code generation is a whole different can of worms that I won’t bother to go into.

I have seen several senior engineers do what I call haphazard coding, which results in code that sort of works but without anyone knowing why. A slight change in any part of the code, even in the compiler optimisation settings, can bork the code. This is usually the result of cut-and-paste coding styles and ineffectual debugging skills. I had to teach my CODE8 apprentice that in debugging, you always follow the flow of the data. There was one instance where I asked her to show me the result obtained from the LDAP query and she modified her code to output the data onto the web-application. Unfortunately, this involved obtaining the result from the LDAP query, extracting the relevant fields and reformatting the text before sending it out to the web-app. I then explained to her that this was not the output of the LDAP query. This was the output of the query after several processing steps. Garbage-in-garbage-out. If we do not know that the LDAP query output is accurate, how can we be sure that the reformatting and extraction routines are not frakin’ things up?

In the realm of hardware design, good coding practices are generally easier to enforce because if your code sucks, the hardware tools are not smart enough to do what you want and will usually just bork out. However, if the code is badly written but just good enough to be understood by the synthesisers, the tools will end up producing a very bad hardware design, which will suck up power and slow down performance. As a result, hardware coders are told to only write code in one specific way, in order to produce the hardware that we desire. We get these coding practices drummed into us in a class usually called design for synthesis.

As for me, I tend to step through code line-by-line in order to see if it is doing what it is supposed to do. Unfortunately, with a language like Java, that can be very difficult to do because of the added complexity of the virtual machine layer. You can compile Java code into bytecode and inspect the bytecode but you won’t know if the JVM will actually run your bytecode the way it is meant to be run because you will just have to trust it to do so.

I subscribe to the principle – “trust and verify”.

Social Reader

I came up with this idea a while ago, while thinking about the whole e-reader craze. Since it will not be coming to fruition, I thought that I would just write a blog entry about it. Maybe someone might find more use for it than me. After all, I lack the wherewithal to work on this project on my own anyway.

The idea that I came up with was a social reader. Yes, most of you will argue that reading is a very personal activity and wonder why anyone would want to have a social reader. However, there is at least one scenario where reading becomes a more social activity – in a classroom setting. So, this applies more to reading text-books rather than say Tom Clancy or Patricia Cornwell.

So, I had three ideas on how to use e-book readers in a more social way and I will talk about them here. Since this is a technical blog, I will mention some of the more technical aspects. Most of the modes described depend on the modes supported by the 802.11a/b/g/n wireless chipset.

In the first case, the reader works in broadcast mode. This mode is suitable for use in a classroom where we have a lecturer broadcasting information to a bunch of students. In such a scenario, the reader used by the lecturer could be set to master mode and the students’ readers set to connect to it in infrastructure mode. In such a situation, wireless bandwidth is effectively shared between all the readers but since only one device is doing most of the talking, it should be fine. The reader application can then be programmed to transmit notes and synchronise meta-information to the students. This could be easily accomplished using rsync, or some other system, in the background.

In the second case, the readers work in peer-to-peer mode. This mode is suited to discussion groups and small group teaching. In such a scenario, all readers are set to ad-hoc mode. This will allow each device to talk to every other device in the group. The reader can then be programmed to push or pull annotations between devices. In the background, a distributed version control system such as git, or any other system, could be used to easily share data in a structured and managed way. The ability to diff and patch your notes against those of your friends could prove invaluable in changing how group study works in the future.
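
To give a feel for what diffing notes could look like, here is a small Python sketch using the standard difflib module. The function name diff_notes and the labels my-notes and friend-notes are made up for illustration, and it only covers the diff half; applying the result as a patch is left out.

```python
import difflib

def diff_notes(mine, theirs):
    """Return a unified diff (as a list of lines) between two sets of notes."""
    return list(difflib.unified_diff(
        mine.splitlines(),
        theirs.splitlines(),
        fromfile="my-notes",
        tofile="friend-notes",
        lineterm="",
    ))

mine = "Chapter 1\nTCP is reliable\n"
theirs = "Chapter 1\nTCP is reliable\nUDP is not\n"
for line in diff_notes(mine, theirs):
    print(line)
```

Lines starting with + are annotations that a friend has and I do not, which is the part a reader would offer to pull in.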

The third mode is a local-reader mode. This mode is suited to reading in the local common room. In such a scenario, readers can connect to a local book store that holds books only accessible from that geographical location – the boundaries of which can be controlled via modulating the transmit power of the book store device. Readers can download books held at the store and even upload books to the store, allowing people to share books and to leave books behind for others to read.

Now for the bad news – battery power. All these modes require the use of Wi-Fi, which is pretty power hungry. However, this is where there is opportunity for innovation. The operating system software could be designed to handle power efficiently and to only activate the wireless when needed – such as during the beacon intervals. Additionally, the physical layer could be replaced with something low-power such as Bluetooth or ZigBee, or possibly even UWB, when it makes sense to do so.

In order for such a social reader device to succeed, it would need to answer the problem of power. Readers are supposed to be able to last days if not weeks. However, all the wireless communication will kill it quickly, even if low-power wireless technologies are employed.

Polipo DNS Issues

I have been using polipo as my proxy server and recently, it has been developing some weird problems. For one thing, it sometimes refuses to connect to websites without any errors on the client side except for time-outs. I know that the sites are up because I can connect to them if I bypass polipo.

The reason that I use polipo is its resource requirements, which are much lower than those of, say, squid. As an upstream web proxy, it works really well. I have gotten better results out of it than squid but I simply put that down to my lack of squid config-fu. Polipo is much easier to configure as there are fewer options to play with. However, it is also a plain vanilla proxy and does not try to become the proxy for everything as squid does.

After some investigation, I got a clue from the polipo logs – that it was timing out as well with the following message:
Host ftp.osuosl.org lookup failed: Timeout (131072).

This confused me because it was obvious that the website was up and working. So, I dug around and found out that polipo uses its own DNS resolver by default and not the system resolver. The reason that it does this is in order to obtain the TTL information on the domain directly. However, the information still comes from the same DNS server either way.

Turns out that it was a networking problem. By default, polipo would use the AAAA record instead of the A one if both are present. It is designed to prefer IPv6 over IPv4. My home network does not have IPv6 connectivity to the outside yet, so those connections simply timed out. This can be controlled with either of the following options:
dnsQueryIPv6 = no
dnsQueryIPv6 = reluctantly

Since I have not yet enabled IPv6 support through my gateway, I decided to just disable IPv6 for polipo entirely. I might turn it back on once I get IPv6 up at home. I really need to get down to activating my Hurricane gateway tunnel and writing a guide for it.