Physical Programming 101: Base64 Encoding/Decoding Files

What are the considerations when someone needs to perform base64 encoding or decoding of a file? The operation itself is pretty simple, and the application would most likely be disk-bound.

First and foremost is the file size. If the files to be processed are small, it is a simple matter to load the entire file into memory, process it, and save it back to disk. However, if the file is large, doing so risks an out-of-memory (OOM) error.
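
As a minimal sketch of the small-file case, using Python's standard base64 module (the function names and file paths here are placeholders):

```python
import base64

def encode_small_file(src, dst):
    # Small-file case: read everything at once, convert, write back.
    with open(src, "rb") as f:
        data = f.read()  # the whole file lives in memory
    with open(dst, "wb") as f:
        f.write(base64.b64encode(data))

def decode_small_file(src, dst):
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(base64.b64decode(data))
```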

In such a situation, the file needs to be cut up and processed in smaller blocks. We are lucky in the sense that base64 conversion is amenable to block processing: the basic operation maps three 8-bit bytes onto four 6-bit values. This gives us a hint about the block size we should adopt.
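
To make that mapping concrete, here is a toy illustration of how one 3-byte group becomes four base64 characters; the helper is purely for exposition, and real code should use a library routine:

```python
# The standard base64 alphabet: 64 symbols, one per 6-bit value.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_triple(b0, b1, b2):
    # Repack 3 bytes (24 bits) into four 6-bit indices into the alphabet.
    return (ALPHABET[b0 >> 2] +
            ALPHABET[((b0 & 0x03) << 4) | (b1 >> 4)] +
            ALPHABET[((b1 & 0x0F) << 2) | (b2 >> 6)] +
            ALPHABET[b2 & 0x3F])

print(encode_triple(*b"Man"))  # prints "TWFu"
```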

The ratio of input to output memory blocks should be 3:4 for encoding and 4:3 for decoding. That is a simplistic but valid requirement, and for all intents and purposes it works, except for the last block of data, which must be padded according to base64 rules. Unfortunately, while this gives us the ratio, it does not help us determine the block size.
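
As a sketch of the block-wise approach, again using the standard base64 module: reading multiples of 3 bytes when encoding means no intermediate chunk emits padding, and reading multiples of 4 bytes when decoding means chunks split cleanly on group boundaries, with the library handling the final padded block. The block sizes here are placeholders; their selection is discussed below.

```python
import base64

def encode_stream(src, dst, in_block=3 * 1024):
    # in_block must be a multiple of 3 so every chunk except the last
    # encodes without padding; b64encode pads the final short chunk.
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(in_block)
            if not chunk:
                break
            fout.write(base64.b64encode(chunk))  # output is 4/3 of the input

def decode_stream(src, dst, in_block=4 * 1024):
    # in_block must be a multiple of 4 so chunks split on group boundaries.
    # Assumes the encoded data has no line breaks (as the encoder above emits).
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(in_block)
            if not chunk:
                break
            fout.write(base64.b64decode(chunk))  # output is 3/4 of the input
```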

Using the smallest possible block sizes of 3 and 4 bytes would not make sense, because files are stored on disk in fixed-size blocks. Even if we read only 3 bytes, the disk still fetches an entire block, and the same applies to writes. If the processing block is too small, we spend too much time looping over data that is already sitting in the file buffer, which can slow things down. If it is too large, we waste memory.

Therefore, it makes sense to use blocks that approximate the size of the disk blocks.

That said, in this particular case the input and output blocks need to differ in size, so only one of them can match the disk block size; the other will be either a third larger or a quarter smaller. Since a write operation is typically more expensive than a read, it makes sense to align the write block to the disk block size. Output blocks can then be written straight to disk instead of waiting to accumulate data, so it helps to use a buffer that is a multiple of the disk block size. A sketch of the arithmetic follows.
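
Here is that sizing arithmetic, assuming a 4 KB disk block (a common value, and the unit suggested below):

```python
from math import lcm  # requires Python 3.9+

DISK_BLOCK = 4096  # assumed disk block size

# Encoding: the output is the write side. 4096 is already a multiple of 4,
# so any multiple of the disk block works; the input block is 3/4 of it.
ENC_OUT = DISK_BLOCK          # bytes written per block: 4096
ENC_IN = ENC_OUT * 3 // 4     # bytes read per block:    3072 (multiple of 3)

# Decoding: the write side must be a multiple of both the disk block and
# 3 (the base64 ratio), so round up to lcm(3, 4096).
DEC_OUT = lcm(3, DISK_BLOCK)  # bytes written per block: 12288
DEC_IN = DEC_OUT * 4 // 3     # bytes read per block:    16384 (multiple of 4)

assert ENC_IN % 3 == 0 and DEC_IN % 4 == 0
```

These sizes drop straight into the streaming sketch above, e.g. encode_stream(src, dst, in_block=ENC_IN).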

In this example, the data is only processed once, so caches will have little effect. In fact, there is unlikely to be much of a performance boost from all these considerations beyond avoiding unnecessary performance hits. That said, at least the processing buffer sizes now have a logical, rational basis, instead of being numbers picked out of the blue.

This also gives us a unit block size to work with: multiples of 4 KB.
