
Why Memory Allocation Resilience Matters in IoT


Let’s look at how developers can build resilience into their "malloc" approach and what it means for connected device performance.

Dec. 14, 2022 · Analysis

Memory allocation is one of those things developers don’t think too much about.

After all, modern computers, tablets, and servers have so much memory that it often seems like an infinite resource. And when something does go wrong, an allocation failure is considered so unlikely that the system normally defaults to program exit. 

This is very different, however, when it comes to the Internet of Things (IoT). In these embedded connected devices, memory is a limited resource that multiple programs fight over. The system is smaller, and so is the memory. It is therefore best viewed as a scarce resource and used conservatively. 

It’s in this context that memory allocation, better known by the C library function malloc, takes on great importance in our sector. A call to malloc reserves a block of memory for a program or process to use during execution. Getting it right, especially for devices connected to the internet, can make or break performance. 
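
In C, that reservation and the corresponding failure check look like this minimal example:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Reserve room for 64 integers; malloc returns NULL on failure. */
    int *values = malloc(64 * sizeof *values);
    if (values == NULL) {
        /* On a desktop this branch is rarely hit; on a constrained
         * embedded device it must be handled deliberately. */
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    free(values);
    return 0;
}
```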

So, let’s take a look at how developers can build resilience into their malloc approach and what it means for connected device performance going forward.

Malloc and Connected Devices: A Short History

Let’s start from the beginning. Traditionally, malloc has not been used often in embedded systems. This is because older devices didn’t typically connect to the internet and, therefore, had very different memory demands. 

These older devices did, however, create a pool of resources at system start from which to allocate. A resource could be a connection, and a system could be configured to allow n connections from a statically allocated pool.
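
A minimal sketch of such a statically allocated pool might look like this (the names and pool size are illustrative, not from any particular system):

```c
#include <stddef.h>

/* Illustrative static pool: n connections are reserved at start-up and
 * handed out at run time, so no dynamic allocation happens once the
 * system is running. */
#define MAX_CONNECTIONS 4   /* the configured "n" */

struct connection {
    int in_use;
    int socket_fd;
};

static struct connection pool[MAX_CONNECTIONS];

static struct connection *connection_acquire(void)
{
    for (int i = 0; i < MAX_CONNECTIONS; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            return &pool[i];
        }
    }
    return NULL;   /* all n connections are taken */
}

static void connection_release(struct connection *c)
{
    c->in_use = 0;
}
```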

In a non-internet-connected system, the state of a system is normally somewhat restricted and therefore the upper boundaries of memory allocation are easier to estimate. But this can change drastically once an embedded system connects to the internet. 

For example, a device can hold multiple connections, and each can have a different memory requirement based on what the connection is used for. Here, the buffer memory required for a data stream on a connection depends on the connection's latency: to achieve a certain throughput, the buffer must be sized using the round-trip time and some probability model for packet losses or other network-dependent behavior. 
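
As a rough illustration, such a buffer is often sized around the bandwidth-delay product. The sketch below assumes that heuristic; the names and the safety factor are my own, not values from the article:

```c
#include <stdint.h>
#include <stdio.h>

/* Rough sizing sketch: a stream buffer of at least the bandwidth-delay
 * product, padded with a safety factor for retransmissions. The 1.25
 * factor is an illustrative assumption. */
static size_t stream_buffer_size(uint32_t throughput_bps, uint32_t rtt_ms,
                                 double loss_factor)
{
    uint64_t bdp_bytes = ((uint64_t)throughput_bps / 8u) * rtt_ms / 1000u;
    return (size_t)((double)bdp_bytes * loss_factor);
}

int main(void)
{
    /* A 1 Mbit/s stream over a 200 ms round-trip link, with a 25%
     * retransmission margin, needs roughly 31 kB of buffer. */
    printf("buffer: %zu bytes\n", stream_buffer_size(1000000u, 200u, 1.25));
    return 0;
}
```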

This is normally not a problem on modern high-end systems. But, remember that developers face restricted memory resources in an embedded environment. So, you cannot simply assume there is enough memory. 

This is why it is very important in IoT embedded development to think about how to create software that is resilient to memory allocation errors (otherwise known as malloc fails).

Modern Embedded Connected Systems and Malloc

In modern connected embedded systems, malloc is used more frequently, and many embedded systems and platforms have decent malloc implementations. The reason for the shift is that modern connected embedded systems do more tasks, and it is often not feasible to statically allocate the maximum required resources for all possible executions of the program.

This shift to using malloc actively in modern connected embedded systems requires more thorough and systematic software testing to uncover errors.

Usually, allocation errors are not tested systematically, since they are thought to happen with such a small probability that testing is not worth the effort. And because allocation errors are so rare, any bugs in their handling can live for years before being found.

Mallocfail: How to Test for Errors

The good news is that developers can leverage software to test allocation errors. A novel approach is to run a program and inject allocation errors in all unique execution paths where allocation happens. This is made possible with the tool mallocfail.

Mallocfail, as the name suggests, helps test malloc failures in a deterministic manner. Rather than testing randomly, the tool automatically enumerates the different control paths that lead to malloc failure. It was inspired by this Stack Overflow answer.

In a nutshell, this tool overrides malloc, calloc, and realloc with custom versions. Each time a custom allocator runs, the function uses libbacktrace to generate a text representation of the current call stack, and then generates a sha256 hash of that text. 

The tool then checks to see if the new hash has already been seen. If it has never been seen, then the memory allocation fails. The hash is stored in memory and written to disk. If the hash — the particular call stack — has been seen before, then the normal libc version of the allocator is called as normal. Each time the program starts, the hashes that have already been seen are loaded in from disk.
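
The sketch below illustrates the interposition idea in miniature. It is not mallocfail's actual source: to keep it short, it keys on the caller's return address rather than hashing the full call stack, and it skips the on-disk persistence:

```c
/* Minimal sketch of the allocator-interposition technique used by
 * mallocfail. Real mallocfail hashes the whole call stack with
 * libbacktrace and sha256 and persists seen hashes to disk; here each
 * unique call site simply fails exactly once. Build as a shared
 * library and run the target program with LD_PRELOAD pointing at it. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

#define MAX_SITES 1024

static void *seen_sites[MAX_SITES];
static int   seen_count;

static int seen_before(void *site)
{
    for (int i = 0; i < seen_count; i++)
        if (seen_sites[i] == site)
            return 1;
    if (seen_count < MAX_SITES)
        seen_sites[seen_count++] = site;
    return 0;
}

void *malloc(size_t size)
{
    /* Look up the real allocator once. (A production interposer must
     * also guard against recursion if dlsym itself allocates.) */
    static void *(*real_malloc)(size_t);
    if (real_malloc == NULL)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

    /* The return address is a cheap stand-in for the call stack. */
    void *site = __builtin_return_address(0);

    if (!seen_before(site))
        return NULL;               /* inject one failure per new site */

    return real_malloc(size);
}
```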

This is something that I’ve used first-hand and found very useful. For example, at my company, we successfully ran mallocfail on our embedded edge software development kit. I’m pleased to report that the tool managed to identify a few problems in both the SDK and its third-party libraries. The problems in our SDK are now fixed, and the third-party libraries have received patches.

Handling Malloc Fails

Handling allocation errors can be a bit tricky in a complex system. For example, consider the need to allocate memory in order to handle an event. Different patterns exist for this problem. The most important is to structure allocations so that an error can be communicated back to the caller in case of an allocation failure, and so that no code path fails silently.
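
A common shape for this pattern in C is a status code that the allocating function returns to its caller. The sketch below uses hypothetical names, not our SDK's API:

```c
#include <stdlib.h>

/* Hypothetical event-handling sketch: allocation failure is reported
 * to the caller as an explicit status code, so no code path fails
 * silently and nothing forces a program exit. */
enum status { STATUS_OK = 0, STATUS_OUT_OF_MEMORY };

struct event {
    unsigned char *payload;
    size_t len;
};

static enum status event_create(struct event **out, size_t payload_len)
{
    struct event *ev = malloc(sizeof(*ev));
    if (ev == NULL)
        return STATUS_OUT_OF_MEMORY;   /* propagate instead of exiting */

    ev->payload = malloc(payload_len);
    if (ev->payload == NULL) {
        free(ev);                      /* don't leak the partial object */
        return STATUS_OUT_OF_MEMORY;
    }

    ev->len = payload_len;
    *out = ev;
    return STATUS_OK;
}
```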

The ability to handle malloc fails is something that my team thinks about often. Sure, it’s not much of a problem on other devices, but it can cause big issues on embedded devices connected to the internet. 

For this reason, our SDK includes functionality to limit certain resources, including connections, streams, stream buffers, and more. A system can thus be configured to cap the amount of memory used, so that malloc errors are less likely to happen (and a hit limit surfaces as a predictable resource allocation error instead).
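
Conceptually, such a limit can be enforced before the allocation ever happens. This sketch uses hypothetical names rather than our SDK's real API:

```c
#include <stdlib.h>

/* Illustrative sketch: a configured cap is checked before allocating,
 * so exhaustion shows up as a resource error at a well-defined point
 * rather than as a malloc failure deep inside the system. */
#define CONNECTION_LIMIT 8   /* hypothetical configured cap */

static int open_connections;

struct connection { int socket_fd; };

static struct connection *connection_open(void)
{
    if (open_connections >= CONNECTION_LIMIT)
        return NULL;                 /* resource limit reached */

    struct connection *c = malloc(sizeof(*c));
    if (c == NULL)
        return NULL;                 /* genuine allocation failure */

    open_connections++;
    return c;
}
```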

Often, a system that runs out of memory is a system that struggles to perform. So it really makes sense to lower the probability of allocation errors. This is often handled by limiting which functionality and tasks can occur simultaneously. 

As someone who’s been working in this field for two decades, I believe developers should embrace best malloc practices when it comes to modern embedded connected devices. 

My advice is to think deeply about how your embedded device resolves malloc issues and to investigate the most efficient way of using your memory. This means designing with dynamic memory allocation in mind and testing as much as possible. The performance and usability of your device depend on it.

