The most obvious challenge is not always the hardest to solve, and reliable in-place updates are a case in point. How does RAVENS manage to rewrite “itself”, even if power is cut?
First, there is an important detail to note. Because the devices that need this level of efficiency are so limited, the running program isn’t copied to RAM. This imposes some ground rules: you can’t rewrite (i.e. erase the flash page of) running code. This puts several restrictions on the update routine: the update code can’t update itself, the update code must not move, and as little code as possible should be involved in the update process.
Moreover, because the power supply of those devices can be unreliable, and because even low failure percentages can result in thousands of malfunctioning devices, the update routine must be able to restore its state after a failure. That means the routine should be called as early as possible, as any preceding code can’t be safely updated.
Munin is the component of RAVENS installed on the device. It is a fully integrated updating solution, meaning it downloads, verifies and installs the update. Due to the very different requirements of those two roles (the installation needs to run as early as possible, while the download requires most of the firmware to be running in order to access the network stack), Munin is split into two sub-components.
The first component is involved neither in the installation process nor in the chain of trust, and can safely be updated. It runs as a normal part of the firmware and is tasked with checking for updates and downloading them. Although fairly interesting, I won’t expand on how it does that for now. All we need to know is that after detecting an update to install, Munin-Userland will download it to the flash, perform some initial verification in order to avoid unnecessary downtime, and initiate a reboot (this responsibility can be transferred to the general firmware without any consequence).
The second component is called as early as possible by the bootloader. The first thing it’ll do is check whether an update is pending. If not, it can return control to the bootloader. If it detects one, it will perform security validation, ensuring that the update is designed for the configuration of the device, that it was intended for this specific device and that it was not tampered with in any way. Once the update is confirmed as valid, Munin will enter the actual installation phase.
At this point, Munin will check a dedicated page of flash memory and look for the installation flag. By default, this flag is set to 0xFFFFFFFF, and it is updated to 0x0 when an update is initiated. If no update was ongoing, the 4 dedicated flash pages are flushed and reset to their default values. Specifically, a progress counter is reset to 0. This progress counter is crucial to making the update process resilient against power failures. Once this initialization is performed, the installation flag is set to 0x0 and the update process can proceed. At this point, the resiliency system may kick in, as described below.
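As a minimal sketch of this check (the struct name and layout are mine, and the metadata flash page is simulated as a plain struct in RAM):

```c
#include <stdint.h>

#define FLAG_IDLE        0xFFFFFFFFu  /* erased flash: all 1s */
#define FLAG_IN_PROGRESS 0x00000000u

/* Simulated metadata flash page: installation flag plus progress counter. */
typedef struct {
    uint32_t install_flag;
    uint32_t progress_counter;
} meta_page_t;

/* Returns 1 if an interrupted update must be resumed, 0 when starting fresh. */
int begin_update(meta_page_t *meta) {
    if (meta->install_flag == FLAG_IN_PROGRESS)
        return 1;                      /* power failed mid-update: resume */
    /* Fresh start: reset the bookkeeping, then arm the flag. */
    meta->progress_counter = 0;
    meta->install_flag = FLAG_IN_PROGRESS;
    return 0;
}
```

The key property is that the flag transition only clears bits (0xFFFFFFFF → 0x0), which NAND flash can do without a page erase.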
The installation of the update is then simulated as a dry run: we call the code that will parse and install the update while disabling its write access. This way, we can make sure the file is fully valid and that we won’t run into any parsing issue before changing anything. If anything is wrong, the update can thus be aborted without altering the firmware of the device. Although this approach can’t detect every issue (specifically, the instructions to execute may be subtly erroneous), Hugin puts a lot of effort into making sure any instruction stream it generates is valid.
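One way to get this behavior, sketched below under my own names, is to funnel every flash mutation through a single helper that a `dry_run` flag can neuter, so the same installer code serves as both validator and installer:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

static uint8_t flash[16];   /* simulated flash, zero-initialized */
static int dry_run;         /* when set, the installer only validates */

/* Every flash mutation funnels through this helper, so flipping
 * dry_run turns the whole installer into a pure validation pass. */
static int flash_write(size_t addr, const uint8_t *data, size_t len) {
    if (addr + len > sizeof flash)
        return -1;                  /* out-of-bounds write: invalid update */
    if (!dry_run)
        memcpy(&flash[addr], data, len);
    return 0;
}

/* Hypothetical installer step: returns 0 when the payload is acceptable. */
static int install(const uint8_t *payload, size_t len) {
    return flash_write(0, payload, len);
}
```

Running `install` once with `dry_run` set surfaces any parsing or bounds error while leaving the simulated flash untouched; only then is it run for real.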
Once satisfied, the update is actually installed. First, a stream of instructions rewrites the content of the flash with the help of a small memory buffer. The goal is to lay out the data in such a way that any data the delta update will need in order to write the final layout at a specific place is already there. In other words, if the delta update wants to recycle the data previously stored at address 0x1000 to generate the new version of the data at address 0x2000, those instructions will make sure that the content of 0x1000 is copied to 0x2000. Those instructions are generated with the update payload and thus don’t require Munin to read the compressed payload. The strategy those instructions implement will be described in a later post, but suffice to say that this is a new approach that is significantly more efficient than existing methods (in regard to the space necessary to install the update).
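A toy interpreter for such a relocation stream could look like the following; the opcodes and encoding are entirely hypothetical, and a real implementation would go through the page buffer rather than a flat RAM image:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

enum { OP_COPY, OP_ERASE, OP_END };

typedef struct {
    uint8_t  op;
    uint32_t src, dst, len;
} insn_t;

/* Run a relocation stream over a simulated flash image so the delta
 * stage later finds recycled data at the addresses it will read from. */
static void run_stream(uint8_t *image, const insn_t *stream) {
    for (; stream->op != OP_END; stream++) {
        switch (stream->op) {
        case OP_COPY:   /* e.g. move what lived at 0x1000 over to 0x2000 */
            memmove(&image[stream->dst], &image[stream->src], stream->len);
            break;
        case OP_ERASE:  /* back to the erased state: all 1s */
            memset(&image[stream->dst], 0xFF, stream->len);
            break;
        }
    }
}
```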
Once the data is reorganized, we start processing the main update payload, which is based on BSDiff. The whole blob is compressed using a custom compression algorithm based on lzfx, called lzfx-4k. Its primary purpose is to allow decompression using only 4 kilobytes of memory (which happens to exactly match the size of a flash page, or of the small in-memory rewrite buffer 🤔). The blob is thus streamed into our 4 kB buffer, and the format is designed to only require sequential, forward reads.
The installation is then fairly simple: multiple segments follow each other, each containing two sequences of bytes: one to be combined with existing data (delta), and the other to replace what was there (extra). The format specifies a starting address; then, by moving forward and consuming alternating delta and extra sequences, the new firmware can be built.
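In the spirit of BSDiff, applying one such segment can be sketched as follows (names are mine; the real format also encodes the sequence lengths and the starting address in the payload itself):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* One segment of the payload: a delta sequence to combine with the
 * existing data, followed by an extra sequence written verbatim. */
typedef struct {
    const uint8_t *delta; size_t delta_len;
    const uint8_t *extra; size_t extra_len;
} segment_t;

/* Apply a segment at `pos` and return where the next segment starts. */
static size_t apply_segment(uint8_t *image, size_t pos, const segment_t *s) {
    for (size_t i = 0; i < s->delta_len; i++)
        image[pos + i] += s->delta[i];           /* reuse existing bytes */
    pos += s->delta_len;
    memcpy(&image[pos], s->extra, s->extra_len); /* brand-new bytes */
    return pos + s->extra_len;
}
```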
To go a little deeper into the weeds, the format doesn’t necessarily align with the underlying flash pages. This isn’t a problem but makes the installation process a little less intuitive. Basically, every time we hit a new page, we copy the old content of the page (i.e. including the delta data to reuse) to a backup flash page. We then erase the page and write its new content. This process is entirely decoupled from the format’s segments.
Once the segments are fully consumed, the firmware should be fully updated. A final step makes sure the firmware is valid: the firmware ends with a list of memory ranges, each with an expected SHA-256 hash. Munin then hashes the various ranges and checks that each hash matches the expected value. If not, although it can’t do much, it’s at least able to know that the update ran into issues.
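The shape of that final pass is roughly the following; a toy additive digest stands in for SHA-256 here purely to keep the example self-contained, and the struct names are mine:

```c
#include <stdint.h>
#include <stddef.h>

/* One validation range from the end of the firmware image. */
typedef struct { size_t start, len; uint32_t expected; } range_t;

/* Toy stand-in for SHA-256; the real check compares 32-byte hashes. */
static uint32_t digest(const uint8_t *p, size_t len) {
    uint32_t h = 0;
    while (len--)
        h = h * 31 + *p++;
    return h;
}

/* Returns 0 when every range matches, -1 on the first mismatch. */
static int verify_ranges(const uint8_t *fw, const range_t *r, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (digest(fw + r[i].start, r[i].len) != r[i].expected)
            return -1;   /* can't fix much, but at least we know */
    return 0;
}
```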
Once the range validation is performed, the update is marked as complete and the device is rebooted.
Nitty gritty details
Throughout all this infrastructure, Munin must be able to restore the full context of the updater, including the precise progress of the update, in case of interruption. As mentioned earlier, Munin may erase parts of the old firmware, and those parts may be required by the delta update. This makes the need for precise bookkeeping obvious.
Munin restores the context through two means. The first is the general memory, which is backed up in a dedicated flash page every time a checkpoint (a progress counter increment) is passed. Munin can then trivially load the content of the backup page into memory. The second is figuring out which part of the update was being processed. This is done surprisingly easily, by simply performing the update from scratch without actually touching the flash. As we pass checkpoints, we see the virtual counter increase up to the point where it hits the old value. At this point, we know we have the state of the original updater when it last passed a checkpoint. Through careful positioning of the checkpoints, we can make sure that any actions performed between two checkpoints can be performed multiple times safely, no matter what those actions were (i.e. no data is lost).
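The replay-based half of this can be condensed into a few lines (all names are illustrative):

```c
#include <stdint.h>

static uint32_t saved_counter;    /* value read back from the counter page */
static uint32_t virtual_counter;  /* counts checkpoints during the replay */
static int      replaying;        /* while set, flash writes are skipped */

/* Called at every checkpoint. During a replay, the update runs from
 * scratch without touching the flash; once the virtual counter catches
 * up with the persisted one, we hold the exact state the original
 * updater had when it last passed a checkpoint. */
static void checkpoint(void) {
    virtual_counter++;
    if (replaying && virtual_counter == saved_counter)
        replaying = 0;            /* context recovered: resume real writes */
}
```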
By combining those two approaches, and some implementation details, we can guarantee context restoration and update resuming no matter when the update was interrupted.
Progress counter increment
Although this architecture may sound all well and good, it is at first glance vulnerable to a power failure at the wrong instant (for instance, while passing a checkpoint). Although the actual counter increment is as close to an atomic operation as possible, passing a checkpoint requires multiple complex steps, which may experience interruption. This problem is mitigated through a couple of context-specific approaches.
Progress counter increment during instruction execution
During the first phase of the update, we’re reorganizing the layout of the flash using a stream of instructions. The checkpoint is passed whenever the ERASE instruction is issued, as it implies that any content present on the page about to be erased has either been copied to another page, or is stored in our working memory.
The counter is then incremented not once, but twice. The first increment signals that the writes since the last checkpoint have been performed and are no longer necessary. This is helpful as it prevents having to rewrite the page in case of power failure during the checkpoint-passing procedure. Once the counter has been incremented once, we write the content of our working memory to a persistent buffer so that we may restore it if necessary. Then, we increment the counter again and proceed with the erasure of the page and the normal execution of the instruction stream.
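The ordering of those steps can be sketched as follows, with the flash driver primitives stubbed out (everything here is a placeholder for the real routines):

```c
#include <stdint.h>
#include <string.h>

/* Stubbed primitives -- placeholders for the real flash driver. */
static uint32_t counter;
static uint8_t  working[8], persistent_backup[8];

static void counter_increment(void) { counter++; }
static void save_working_memory(void) {
    memcpy(persistent_backup, working, sizeof working);
}

/* Passing a checkpoint while executing the instruction stream: two
 * increments bracket the working-memory backup, so an interruption at
 * any point leaves an unambiguous state to resume from. */
static void pass_checkpoint(void) {
    counter_increment();     /* 1st: writes so far are final, no redo needed */
    save_working_memory();   /* persist the working buffer */
    counter_increment();     /* 2nd: the backup is valid, erasing is safe */
    /* ... the ERASE instruction then proceeds ... */
}
```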
Progress counter increment during delta application
The second stage of the update is substantially simpler, and this is reflected in the resiliency approach. Whenever the update progresses into a new page, it increments the progress counter once, then copies the existing content of the new page into the backup area previously used to save the working memory. Once this is done, the counter is incremented again and the update proceeds: erasing the page, decompressing the update data into the working memory and writing the new layout using the update data and the old data stored in the backup area. Just like in the previous step, the update can safely fall back on the second checkpoint even if the page was halfway through being rewritten, as the erased data was copied elsewhere and is being used from that location.
Atomic counter increment
There are very few ways to update content on a NAND flash in an atomic fashion (as flash requires a full page (4–8 kilobytes) to be erased in order to rewrite bits previously set to 0). Our approach leverages the peculiarities of the NAND write limitations: only setting a 0 back to 1 requires erasing a page. Our approach is thus to make regular counter increments only turn 1s into 0s. This is a little trickier than it seems, as bits are often not stored independently, but it is usually possible to perform dword (32-bit) or qword (64-bit) writes. Therefore, our structure contains a large, qword-aligned field of data, and writes are performed by progressing through the buffer. This is however extremely space inefficient (32 to 64 bits, or 4 to 8 bytes, being required for each increment), and would easily end up trying to store values larger than the maximum space available. This is solved by a multiplier: a classical integer counter indicating the number of times the buffer was filled. However, incrementing this counter requires erasing the page, which definitely isn’t atomic!
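Under those constraints, a simulated version of the counter could look like this (sizes shrunk for readability; a real 4 kB page holds around 512 qword slots, and the multiplier bump is handled by the two-page scheme described in the next section, modeled here as a plain variable):

```c
#include <stdint.h>

#define SLOTS 4   /* tiny for the example; a 4 kB page holds 512 qwords */

/* The fast counter: a qword-aligned field starting all-1s (erased
 * NAND). Each increment clears the next slot to 0 -- turning 1s into
 * 0s is the one write NAND allows without a page erase. */
static uint64_t slots[SLOTS] = { ~0ull, ~0ull, ~0ull, ~0ull };
static uint32_t multiplier;   /* number of times the field was filled */

static uint32_t zeroed(void) {
    uint32_t z = 0;
    while (z < SLOTS && slots[z] == 0)
        z++;
    return z;
}

static uint32_t counter_read(void) {
    return multiplier * SLOTS + zeroed();
}

static void counter_increment(void) {
    uint32_t z = zeroed();
    if (z == SLOTS) {            /* field exhausted: bump the multiplier */
        multiplier++;            /* the non-atomic part of the scheme */
        for (z = 0; z < SLOTS; z++)
            slots[z] = ~0ull;    /* a page erase resets the fast field */
        z = 0;
    }
    slots[z] = 0;                /* plain qword write, no erase needed */
}
```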
Atomically incrementing the counter multiplier
We achieve this atomic write by using two pages, each with two flags (‘valid’ and ‘expired’, both aligned on the write size). By default, both flags are set to 0xFFFFFFFF (only 1s). Whenever a page is in use, the ‘valid’ flag is cleared and set to 0. This tells the code using the counter which page to use. Whenever the multiplier has to be increased, we erase the unused page (setting everything, including the two flags, to 0xFFFFFFFF). We then write the various metadata and the new multiplier. Once those writes are over, we clear the ‘valid’ flag, making both pages valid for a short while (which is not a problem, as either the proper one is used, or the previous checkpoint is reused), and then the ‘expired’ flag of the first page is cleared. Once this is done, the multiplier has been updated, the field used to perform the ‘fast’ writes has been reset, and the previous page is marked as expired.
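A RAM simulation of that rotation, with the flag names from the post but a struct layout of my own (a cleared flag reads 0, an erased one 0xFFFFFFFF):

```c
#include <stdint.h>

#define ERASED 0xFFFFFFFFu

/* One counter page: two write-aligned flags plus the multiplier. */
typedef struct { uint32_t valid, expired, multiplier; } page_t;

static page_t pages[2];

/* The page in use: 'valid' cleared, 'expired' still erased. */
static int active_page(void) {
    for (int i = 0; i < 2; i++)
        if (pages[i].valid == 0 && pages[i].expired == ERASED)
            return i;
    return 0;
}

static void bump_multiplier(void) {
    int cur = active_page(), next = 1 - cur;
    pages[next] = (page_t){ ERASED, ERASED, 0 };        /* erase spare page */
    pages[next].multiplier = pages[cur].multiplier + 1; /* write metadata */
    pages[next].valid = 0;   /* both pages valid for a short, harmless while */
    pages[cur].expired = 0;  /* retire the old page: rotation complete */
}
```

Whichever flag write the power failure lands on, the reader either still sees the old page, or already sees the new one with a fully written multiplier.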
The reason this system isn’t used for every increment is its cost in time, and the fact that this scheme requires erasing a flash page, straining its endurance. Using both schemes lets us get our atomic increment while performing fairly infrequent erase operations.
Saving the working memory
An additional vulnerability of our approach lies in the first phase, when we’re saving our working memory. Indeed, while writing into the backup buffer, a power cut would force us to restore the state of the previous checkpoint (so that we can repeat any change performed to it in between). This wouldn’t be possible with a single backup memory space. Our approach to correcting this weakness is to use two backup spaces, chosen depending on the counter value, so that we’re always alternating. This way, we can always restore the state of the previous checkpoint, even when passing the next.
During the second phase, a similar scheme is used not for functional reasons, but in order to level the wear on the flash, as those two pages are wiped every other checkpoint.
I have made a lot of claims about the efficiency of Munin at this point, so I’ll go a bit deeper into the weeds of exactly how efficient Munin is in regard to memory (RAM & NAND). I’m ignoring the stack memory used, as it’s tricky to measure and easy to optimize, but I would peg it at below 1–2 kB, as it’s not used beyond a couple of function variables.
Memory footprint during instruction execution
Any driver trying to write to a flash memory needs a buffer the size of a flash page, in order to save the untouched content when erasing a page. During the instruction execution, Munin takes over this RAM buffer, and it is the sole structured memory buffer explicitly used during this step of the process. No persistent memory is explicitly required (besides the resiliency infrastructure).
Memory footprint during update installation
During update installation, for resiliency reasons, the content of the flash page we’re evaluating is not saved in memory. Just like in the previous step, the only RAM buffer we’re using is the 4 kB rewrite buffer. This time, it’s used as a decompression buffer and contains all the decompressed data, but it is also used by the decompression subroutine. The consumption of data from this buffer is abstracted to make refilling it simpler. The nifty trick of making this a rotating buffer warranted a custom implementation of the decompression algorithm.
Memory footprint of the resiliency infrastructure
The resiliency infrastructure is a little different from the others, as it doesn’t use any RAM but requires a little persistent memory (flash). More specifically, we’re using 5 pages of flash memory with the following breakdown:
- 2 pages used to save the working memory, used in an alternating fashion;
- 2 pages used to store the progress counter, and some minor metadata;
- 1 page of critical metadata which isn’t “really” used by the resiliency infrastructure, but whose updates are handled in a resilient fashion, given how important it is (the page contains the device public encryption key, the version and some other security-critical metadata).
Although we hadn’t mentioned this last page up until now, it’s not super interesting, as it is fairly irrelevant to the actual update (it is mostly used when downloading the update and verifying it). It uses a resiliency scheme similar to the write of the multiplier counter, although with a working-memory backup page serving as the backup, written back once we’re done.
This was a fairly in-depth exploration of the resiliency infrastructure of Munin, and I hope you found it interesting. As I implied at the beginning of the post, although resiliency appears to be the main impediment to our system, its solution is actually fairly straightforward (counting steps and writing back some data to persistent storage), and we will see in future posts that some other “minor” problems were much harder to solve.
Multiple times throughout this post, I touched on other aspects of the update process but didn’t expand on them, as this post would have become unmanageable otherwise. I’ll do my best to post follow-ups on them as soon as I can, but feel free to hit me up on Twitter if you think I glossed over an interesting detail.