In this article I will be discussing ways to implement custom memory allocators in C++. The reason I decided to write this article is that I struggled to implement custom allocators for a C++ game engine (which is now written in C, Github). There are two problems with making a custom allocator in C++: RAII and the standard library (I will explain what RAII is later). First, however, I want to discuss why you would even want to use custom allocators, and why you should.
Why custom allocators?
The new and delete operators are very general: they need to be able to allocate both big and small chunks of memory, handle every allocation in your entire program, keep track of allocation sizes, etcetera. A common method for increasing the performance of a given piece of functionality is to exploit its constraints. However, new and delete don’t know your application’s requirements, so they don’t have many constraints to exploit, which keeps them from doing exactly what you need. So, are custom allocators actually faster than general allocators? Yes, a CppCon talk by John Lakos has a bunch of data showing they are faster, and I recommend you watch it [1]. One important thing with custom allocators is not to just write your own general allocator and use that everywhere, but to use multiple “local” allocators: for example, one allocator for each system in your program, or one allocator for big long-term allocations and another for small short-term allocations.
Here are a few reasons why custom allocators can be faster than general allocators:
Faster allocations and freeing
There are now multiple allocators in your program, so each allocator has less memory to manage. Let’s take allocation as an example: the allocator has to go over the available memory blocks and find one that is big enough. Different allocation algorithms have different time complexities, but regardless, the fewer free memory blocks there are to check, the faster it is. The same goes for freeing, because the allocator needs to check whether there are adjacent free memory blocks that should be combined with the recently freed one into one big free block.
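As a rough illustration of what that search looks like, here is a small sketch; FreeBlock and FindFirstFit are hypothetical names, not part of any allocator shown later in the article:
#include <cstddef>
// Hypothetical free-list node: a general allocator walks a list like this on
// every allocation, so a shorter list means faster allocations.
struct FreeBlock
{
    std::size_t size; // size of this free block in bytes
    FreeBlock* next;  // next free block in the list
};
// First-fit search: return the first block that is large enough, or nullptr.
FreeBlock* FindFirstFit(FreeBlock* head, std::size_t requestedSize)
{
    for (FreeBlock* block = head; block != nullptr; block = block->next)
    {
        if (block->size >= requestedSize)
            return block;
    }
    return nullptr;
}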
Better memory locality
Because the programmer has more control over memory, memory that belongs together can be kept together over the entire duration of a program. Take for example an application that has a few linked lists that get added to whenever a user presses a button; at some point, traversing those linked lists is going to be very slow, because the computer has to jump all over memory. If these linked lists had their own allocators, or at least didn’t all share the same allocator with the rest of the application, this problem would be much smaller. Another example is an array of pointers to polymorphic objects (which can’t be stored in an array directly because they have different sizes). Normally this is terrible for performance, because those pointers can point to a bunch of different places in memory; if there is a memory arena specifically for these objects, the problem mostly goes away. I say mostly, because going through memory from left to right can still be faster, since it is easier for the CPU to predict what needs to be loaded into cache next. If you don’t know why this makes programs faster, the John Lakos talk explains it [1]. In short, the reason is how the CPU cache and memory work. You should make sure you understand this, because it is key for performance: even something as simple as row-wise versus column-wise iteration over a large 2D array can make a difference.
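To illustrate that last point, here is a small, self-contained example (not from the article) of the same sum computed with cache-friendly and cache-unfriendly traversal orders over a row-major 2D array:
#include <vector>
// The grid is stored row by row in one contiguous buffer.
// Row-wise iteration touches consecutive addresses and is cache friendly.
long long SumRowWise(const std::vector<int>& grid, int width, int height)
{
    long long sum = 0;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            sum += grid[y * width + x]; // consecutive addresses
    return sum;
}
// Column-wise iteration jumps 'width' elements on every step, causing far more
// cache misses on large grids, even though it computes the exact same result.
long long SumColumnWise(const std::vector<int>& grid, int width, int height)
{
    long long sum = 0;
    for (int x = 0; x < width; x++)
        for (int y = 0; y < height; y++)
            sum += grid[y * width + x]; // strided addresses
    return sum;
}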
Faster allocation algorithms
There is a multitude of allocation algorithms; here are a few: free list, pool, stack, bump, buddy, etc. Obviously some are faster in some cases and others are faster in other cases, just like any family of algorithms. You know what kind you need, the general allocator does not. Different parts of a program can also use different allocators with different algorithms.
Exploiting allocation size
If the size of the allocations in a part of a program is known, a special allocator like a pool allocator can be used (a minimal pool allocator sketch follows this list). This has several benefits:
- No allocation size needs to be stored.
- No allocation size has to be checked.
- The first free block found is always good; the algorithm never has to loop through all free blocks.
- Fragmentation is nigh impossible.
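To make these points concrete, here is a minimal fixed-size pool allocator sketch; the names are illustrative (not taken from the article’s code), and it assumes every block is at least as big as a pointer:
#include <cstddef>
#include <cstdlib>
// The free list is threaded through the unused blocks themselves, so the pool
// needs no extra bookkeeping memory and never has to search for a fit.
struct Pool
{
    char* memory;   // the backing arena
    void* freeList; // first free block; each free block stores a pointer to the next
};
Pool CreatePool(size_t blockSize, size_t blockCount)
{
    Pool pool;
    pool.memory = (char*)malloc(blockSize * blockCount);
    pool.freeList = pool.memory;
    // Thread the free list through the blocks themselves.
    for (size_t i = 0; i < blockCount; i++)
    {
        void** block = (void**)(pool.memory + i * blockSize);
        *block = (i + 1 < blockCount) ? (void*)(pool.memory + (i + 1) * blockSize) : nullptr;
    }
    return pool;
}
void* PoolAllocate(Pool* pool)
{
    if (pool->freeList == nullptr)
        return nullptr;              // pool exhausted
    void* block = pool->freeList;
    pool->freeList = *(void**)block; // pop the head of the free list
    return block;                    // the head is always a perfect fit
}
void PoolFree(Pool* pool, void* block)
{
    *(void**)block = pool->freeList; // push the block back onto the free list
    pool->freeList = block;
}
void DestroyPool(Pool* pool)
{
    free(pool->memory);
    pool->memory = nullptr;
    pool->freeList = nullptr;
}
Note how PoolAllocate and PoolFree never search, store sizes, or coalesce blocks, which is exactly the list of benefits above.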
Exploiting allocation temporality
If it is known that an allocator only has to handle big, long-term allocations, the algorithm can favour finding the perfect spot to allocate and free over raw speed. This results in better memory locality and thus trades startup/shutdown speed for runtime speed.
Exploiting knowledge of multithreading behaviour
The general allocator has to be thread safe because it doesn’t know whether a program is multithreaded, and thread safety adds overhead. The allocators in your program, however, do not all need to be thread safe, because you know whether each allocator is going to be used on one or multiple threads. Even a multithreaded program can take advantage of this. Take a game engine with a main thread and a render thread: both could have their own allocator that isn’t thread safe. The main thread can still use memory from the render thread’s allocator, it just can’t allocate or free that memory, and vice versa.
Less fragmentation
Big and small allocations can be separated, and the programmer also sorts out the memory by hand to some extent, which leaves less room for the program to jumble the memory.
Mass freeing
Instead of freeing every struct or class individually, an entire allocator can sometimes be freed at once, which is much faster. In C++, doing this also circumvents the destructors, which can either be fast or a problem, depending on what you’re doing in the destructor. This can’t always be done, but if you have to free a lot of related objects in the middle of your program, it might be an option.
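As a tiny sketch of the idea (the ArenaAllocator struct and its Reset function are hypothetical names, just to illustrate the pattern):
#include <cstddef>
// Hypothetical arena that hands out memory from one big block it owns.
struct ArenaAllocator
{
    char* memory;       // backing block
    size_t capacity;    // total size in bytes
    size_t used;        // bytes handed out so far
};
// Mass freeing: instead of freeing (and destructing) every object one by one,
// the arena simply forgets everything it handed out. O(1), and no destructors run.
void Reset(ArenaAllocator* arena)
{
    arena->used = 0;
}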
No context switching
When using the global allocator, it communicates to the OS that it wants more memory. When an application first asks for memory, the global allocator probably grabs more than you asked for, so that the next time you ask for memory it doesn’t have to communicate with the OS. When the allocator does ask the OS for more memory, the OS’s even more global allocator (for the entire computer, with all of its applications) has to go find memory, and we also have to switch from executing our application to executing OS code, which is obviously slow. The global allocator doesn’t want to get too much memory from the OS either, otherwise we would be wasting it. When programming an application, it is usually known how much memory we need, and we can grab all of it up front, avoiding slow OS calls at runtime. If an application has large files that it needs to load depending on what the user does, it is probably fine to ask for more memory at runtime, like when the user opens a large image in an application like Photoshop; this is still much better than letting the global allocator guess how much memory we need and inevitably get it wrong.
Easier programming
On top of that, allocators can make programming easier. You can have dynamic allocation with almost no overhead, as if it were the stack, if you know you can use an allocator that only has to keep track of a handful of allocations. An example of this would be an allocator that is used only for really fast string manipulation. It’s also easier to add custom memory debugging tools to your own allocators. Lastly, you can make your own allocator APIs, which you may find better than whatever the language provides (though you can’t really do this with C++, since new and delete are kind of set in stone; unless you use defines to wrap new and delete into a more advanced API, but you probably shouldn’t do that except to add file and line macros for easy memory debugging).
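As a small, hedged sketch of that last idea (TrackedAlloc and TRACKED_ALLOC are hypothetical names, purely for illustration):
#include <cstddef>
#include <cstdio>
#include <cstdlib>
// Records the call site of every allocation; a custom allocator could log this
// into its own debugging structures instead of printing it.
void* TrackedAlloc(std::size_t size, const char* file, int line)
{
    std::printf("alloc of %zu bytes at %s:%d\n", size, file, line);
    return std::malloc(size);
}
// The macro fills in the call site automatically.
#define TRACKED_ALLOC(size) TrackedAlloc((size), __FILE__, __LINE__)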
Writing an allocator in C
Before looking at allocators in C++, let’s look at a simple allocator in C. A quick explanation of pointers for non-C/C++ programmers (pointers are hard to use, but conceptually easy to understand, so don’t worry about fully understanding the syntax): any type with a * behind it is a pointer; ignore the type, the pointer is the type. A pointer is simply an integer that the computer interprets as a memory address (i.e. it points to a memory address; once again, forget the “type” before the pointer, the pointer is the type). Implementing the TODOs with an actual allocator is left as an exercise for the reader (a simple bump/linear/scratch allocator is about 20 lines; use an online C compiler if you don’t have one, and one possible fill-in is shown after the example output below).
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
typedef struct Allocator
{
// TODO
} Allocator;
Allocator CreateAllocator(int arenaSize)
{
// TODO
}
void DestroyAllocator(Allocator* allocator)
{
// TODO
}
void* Allocate(Allocator* allocator, int allocationSize)
{
// TODO
}
void Free(Allocator* allocator, void* allocation)
{
// TODO
}
int main()
{
// Create allocator that has 1 kb of memory
Allocator localAllocator = CreateAllocator(1000);
// Allocating memory for an int, and getting a pointer to that memory
// (remember pointer is the type)
int* a = Allocate(&localAllocator, sizeof(*a));
// Going to the actual memory a points to and storing the number 10 there
*a = 10;
int* b = Allocate(&localAllocator, sizeof(*b));
*b = 365;
// Printing out the integers that are stored in the memory that a and b point to
printf("A: %i, B: %i", *a, *b);
// If this was a real application these three lines wouldn't be here because this just
// makes the shutdown slower, the OS will clean up our memory anyway.
// Same goes for these free's if this was somewhere in the middle of our application,
// we would just destroy the allocator, and a and b would automatically be freed.
Free(&localAllocator, a);
Free(&localAllocator, b);
DestroyAllocator(&localAllocator);
return 0;
}
Here is the output of this code:
A: 10, B: 365
This is what was expected. We can allocate anything we want with this: integers, structs, arrays; it’s just memory that can be interpreted in any way we want.
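For completeness, here is one possible fill-in for the TODOs above: a minimal bump (linear) allocator. It ignores alignment, and Free is a no-op, because a bump allocator can only release all of its memory at once; treat it as a sketch rather than a production allocator.
typedef struct Allocator
{
    char* arena;  // start of the memory block
    int capacity; // total size of the arena in bytes
    int used;     // bytes handed out so far
} Allocator;
Allocator CreateAllocator(int arenaSize)
{
    Allocator allocator;
    allocator.arena = (char*)malloc(arenaSize);
    allocator.capacity = arenaSize;
    allocator.used = 0;
    return allocator;
}
void DestroyAllocator(Allocator* allocator)
{
    free(allocator->arena);
    allocator->arena = NULL;
    allocator->capacity = 0;
    allocator->used = 0;
}
void* Allocate(Allocator* allocator, int allocationSize)
{
    if (allocator->used + allocationSize > allocator->capacity)
        return NULL; // out of arena memory
    void* result = allocator->arena + allocator->used;
    allocator->used += allocationSize;
    return result;
}
void Free(Allocator* allocator, void* allocation)
{
    // A bump allocator can't free individual allocations; everything is
    // reclaimed at once in DestroyAllocator (or by resetting 'used' to 0).
    (void)allocator;
    (void)allocation;
}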
Writing an allocator in C++
In C++ however, when you start using this allocator, you will very quickly notice that there is a problem. Here it is:
class Foo
{
public:
int data1;
float data2;
// Constructor
Foo(int startValue)
{
data1 = startValue;
data2 = startValue * 2.f;
printf("constructor!\n");
}
// Destructor
~Foo()
{
printf("Destructor!\n");
}
};
int main()
{
Allocator localAllocator = CreateAllocator(1000);
Foo* a = (Foo*)Allocate(&localAllocator, sizeof(*a));
// Do application stuff
Free(&localAllocator, a);
DestroyAllocator(&localAllocator);
}
The output of this program is: nothing. That is because allocating and freeing memory doesn’t automatically call the constructor or destructor. In object-oriented languages, memory allocation is mixed with object construction, and memory deallocation is mixed with object destruction. This concept is called RAII: Resource Acquisition Is Initialization. By creating a memory allocator we’ve separated allocation from construction again, so here’s how we can fix our code:
#include <new> // needed for placement new
int main()
{
Allocator localAllocator = CreateAllocator(1000);
Foo* a = (Foo*)Allocate(&localAllocator, sizeof(*a));
a = new(a) Foo(3); // placement new: construct a Foo in the memory we already allocated
a->~Foo(); // explicitly call the destructor before freeing the memory
Free(&localAllocator, a);
DestroyAllocator(&localAllocator);
}
Now the program properly outputs:
constructor!
Destructor!
I’ve used the “placement new” operator (the new(a) Foo(3) line), which constructs the object at the memory address you give it. To destruct the object, the destructor now gets called explicitly. Obviously, typing this out for everything we want to create is not the solution. Another (even bigger) problem is that the standard library can’t use our allocator. The easiest solution to this is to override the global new and delete operators, so let’s do that:
#include <vector>
#include <iostream> // for std::cout
static Allocator overrideAllocator;
void* operator new(size_t size)
{
std::cout << "New operator overloading\n";
void* p = Allocate(&overrideAllocator, size);
return p;
}
void operator delete(void* block)
{
std::cout << "Delete operator overloading\n";
Free(&overrideAllocator, block);
}
int main()
{
overrideAllocator = CreateAllocator(1000);
Foo* a = new Foo(3);
delete a;
{
std::vector<int> testVector(100);
}
DestroyAllocator(&overrideAllocator);
}
This program outputs:
New operator overloading
constructor!
Destructor!
Delete operator overloading
New operator overloading
New operator overloading
Delete operator overloading
Delete operator overloading
We can see that for the “a” variable the new operator override gets called, and then its constructor gets called. New automatically calls the constructor, so we don’t have to do that in the override, which is a bit strange. Another, even stranger, thing I accidentally found out is that the vector constructor uses two dynamic allocations; so much for C++’s “zero overhead abstractions”. I stepped through the code, and I did see two things get allocated, but it was impossible to tell what they were for, because of the convoluted variable names of the STL. I think the second allocation is used to hold a pointer to the array’s allocation: if the array gets resized and needs to be moved, pointers to it get invalidated, but a pointer to a pointer to the array can simply be updated to point to the new location without itself being invalidated. Besides that, I stepped through 140 lines of code (excluding closing braces) just to see the construction of a vector (for reference, my entire C dynamic array implementation is 114 lines, including empty lines). Regardless, we can now see that the standard library also uses our allocator. There’s still a problem though: you can only override the new operator once, which means that all we’ve done is replace the global allocator. There’s also the option to overload the new and delete operators, but those overloads can’t be used by standard library constructs.
std::allocator
The C++ standard library has something called std::allocator; an std::allocator instance can be given to a container for it to use. I have not found evidence that there are people who use this, and there are good reasons for that. First, std::allocator has an unchangeable API, and if that doesn’t fit your needs, there’s nothing you can do about it. Here are a few things that are wrong with the API: allocators are templated; there is no realloc or aligned alloc; there is no option for the user of the allocator to keep track of allocation sizes instead of the allocator. One more issue is that allocate takes a number of items to be allocated, not a number of bytes. It gets the size of the items from the template, which means allocator instances are bound to a type (I think rebinding the allocator to a different type is possible, but this is obviously not ergonomic for the programmer, and I couldn’t find a single example of it in practice). If you do want to go this route, your allocator needs to satisfy the C++ allocator requirements (allocator traits); the C++ reference has an example of this [2]. There is also an article that shows how to wrap an existing allocator in a template that implements the allocator traits [3].
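To give a rough idea of what that involves, here is my own minimal sketch (not the code from [2] or [3]) of an allocator type that meets the C++11 allocator requirements; it simply forwards to malloc and free, but those calls could forward to one of your local allocators instead:
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>
// Minimal C++11 allocator: value_type, allocate, deallocate, a converting
// constructor for rebinding, and equality comparison. allocator_traits fills
// in everything else.
template <typename T>
struct MallocAllocator
{
    using value_type = T;
    MallocAllocator() = default;
    template <typename U>
    MallocAllocator(const MallocAllocator<U>&) {} // lets containers rebind to node types
    T* allocate(std::size_t count)
    {
        return (T*)std::malloc(count * sizeof(T)); // count is a number of T's, not bytes
    }
    void deallocate(T* pointer, std::size_t) { std::free(pointer); }
};
// All instances are interchangeable, so they always compare equal.
template <typename T, typename U>
bool operator==(const MallocAllocator<T>&, const MallocAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const MallocAllocator<T>&, const MallocAllocator<U>&) { return false; }
int main()
{
    std::vector<int, MallocAllocator<int>> numbers;
    numbers.push_back(42);
    std::printf("%d\n", numbers[0]);
}
Note that allocate receives a count of objects, not bytes, and that the converting constructor exists so containers can rebind the allocator to their internal node types, which is exactly the awkwardness described above.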
The C++ standard library not having good accommodations for custom allocators is the main reason EA wrote their own standard library, EASTL [4]. I recommend you read the motivation part of what they say about it, and also take a look at the std::allocator and eastl::allocator section, to see just how bloated std::allocator is while still not doing everything it needs to do. To be fair, C++11 and C++17 did make it easier to extend std::allocator, but most of the problems persist. Andrei Alexandrescu even says in a talk that std::allocator was originally designed as an abstraction over near and far pointers on old computer architectures [5] (there would be too much memory to address with a 16-bit integer, so 16-bit architectures used near and far pointers to resolve this [6]).
Alternatives
Ideally, you rewrite the parts of the standard library you need, just like EA did, so you can design an allocator API that fits your needs. The problem is that this can take a lot of time, and most people using C++ don’t want to do that, because otherwise they would be using C. Still, this option deserves real consideration, because writing just a vector and a hashmap wouldn’t take that much time and would probably get you most of the way there. At first this article was actually going to be about how to do this, and I already wrote a custom smart pointer implementation, so if you want to go down that road, here it is: custom smart pointers (do note, of course, that smart pointers aren’t good for performance).
New and delete overloading
Finally, there’s one problem that we need to come back to: selecting different allocators while using the new and delete operators. The global new operator can’t just be overridden, it can also be overloaded (technically this is a version of placement new, but it does what we want). So simply create a new overload that takes in an allocator, and do the same for delete! Except delete can’t be overloaded in the same way: an overload of delete with extra parameters is only called automatically if the constructor throws an exception during the corresponding new expression. See this Stack Overflow thread for more information [7]. So the new operator will have to keep track of which allocator was used, and there are a few ways to do this. The new overload could ask for eight bytes more than it needs and store a pointer to the allocator in front of the memory block the user sees. Alternatively, you can check which allocator has control over the memory you’re trying to free, but this becomes computationally more expensive the more allocators you have and requires a lot of branching.
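Here is a hedged sketch of the “store the allocator in a header” approach, reusing the article’s C-style Allocator, Allocate and Free; FreeWithHeader is a hypothetical helper, and alignment is ignored to keep it short:
#include <cstddef>
// Placement-style overload of new that remembers which allocator was used by
// putting a pointer to it just in front of the memory the caller sees.
void* operator new(std::size_t size, Allocator& allocator)
{
    char* block = (char*)Allocate(&allocator, (int)(size + sizeof(Allocator*)));
    *(Allocator**)block = &allocator;       // header: which allocator owns this block
    return block + sizeof(Allocator*);      // hand out the memory after the header
}
// A matching operator delete overload would only run if construction threw,
// so normal freeing goes through a plain function that reads the header back.
void FreeWithHeader(void* userPointer)
{
    char* block = (char*)userPointer - sizeof(Allocator*);
    Free(*(Allocator**)block, block);
}
Usage would then look like Foo* foo = new (localAllocator) Foo(3); followed later by foo->~Foo(); and FreeWithHeader(foo);. The destructor still has to be called by hand (or through a small template helper), because a plain delete expression doesn’t know about the header.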
Solutions of other languages
Most modern systems programming languages have constructs built into the language that make custom allocation easier; languages like Zig, Odin and Jai. A notable exception is Rust, because it focuses more on memory safety than on speed, although they are working on something [8]. Zig has an allocator interface, but I can’t show it here, because Zig is still in development and its documentation for custom allocators is incomplete [9]. What is interesting about Zig, though, is that it doesn’t have any runtime, so by default it has no global allocator (unless you link against the C standard library or something similar), which means you are left to ask the OS for memory and manage it yourself. Odin also has custom allocators built in, plus a context system which is supposed to make using local allocators easier [10]. And finally C: it doesn’t have the problems of C++, because it does less than C++. Specifically, there’s no RAII because C is not object-oriented, and the C standard library doesn’t have any containers to pass allocators to in the first place.
Finally, here are two resources that weren’t mentioned above but are useful for actually implementing the allocator algorithms: Ginger Bill, Memory Allocation Strategies [11] and Mariano Trebino, memory-allocators [12]. John Lakos’ and Andrei Alexandrescu’s talks are also good resources for implementing allocators, [1] and [5] respectively.
References
[1] John Lakos, Local (‘Arena’) Memory Allocators
https://youtu.be/nZNd5FjSquk?si=BBsFuRbT9BhGZSML
[2] C++ reference, Allocator traits
https://cplusplus.com/reference/memory/allocator_traits/
[3] Deckhead, Custom C++20 Memory Allocators for STL Containers
https://indiegamedev.net/2022/03/27/custom-c20-memory-allocators-for-stl-containers/
[4] Paul Pedriana (EA), EASTL — Electronic Arts Standard Template Library
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2271.html
[5] Andrei Alexandrescu, std::allocator Is to Allocation what std::vector Is to Vexation
https://youtu.be/LIb3L4vKZ7U?si=kjmoE2Vv1Kx9gFHt
[6] GeeksForGeeks, Near, Far and Huge Pointers in C
https://www.geeksforgeeks.org/what-are-near-far-and-huge-pointers/
[7] Stack overflow, overriding delete with parameters
https://stackoverflow.com/questions/11431685/overriding-delete-with-parameters
[8] Rust allocator traits
https://github.com/rust-lang/rust/issues/32838
[9] Zig documentation on memory
https://ziglang.org/documentation/0.5.0/#Memory
[10] Odin documentation on memory and context system
https://odin-lang.org/docs/overview/#implicit-context-system
[11] Ginger Bill, Memory Allocation Strategies
https://www.gingerbill.org/series/memory-allocation-strategies/
[12] Mariano Trebino, memory-allocators
https://github.com/mtrebi/memory-allocators
[13] Ram Memory icons created by Freepik – Flaticon