
Memory Profiling

A descent into madness…

The problem

Something I work on is supporting high-resolution inverse modelling with Firedrake. This is a Python framework with a DSL for defining partial differential equations that get compiled to native code for solving with PETSc using finite element methods. In reality, there are a lot of moving parts! Importantly, just about all operations can be annotated with their symbolic inverse. This means that we can get tangent linear and adjoint models “for free” out of a forward model.

Our problems have a lot of degrees of freedom, which necessitates running them on HPC. Even there, we sometimes run up against limits, whether on time or simply the available memory. For this particular inverse problem, both are concerns. First, the forward problem itself is expensive, and we can’t afford to re-run it unless essential. Second, we can’t afford inefficient memory management, whether that comes from actual leaks or from relying heavily on garbage collection.

A crash course in adjoint modelling

We solve a time-dependent, coupled problem with external forcing. Every “operation” is recorded on a tape as a block. This includes solves (for the coupled problem, there would be two solves per timestep), projections (external forcing onto the function space we use for representing the problem), assembly (integrating some value over the domain), and more. The tape is more like a DAG (directed acyclic graph), as blocks have dependencies and produce outputs.

We’re usually running our inverse models as part of a nonlinear optimisation: we have an observation of the final state, and want to recover the initial condition. We define a reduced functional which takes as input the candidate initial condition and produces a scalar measure of the error by some criterion. Nonlinear optimisation takes this reduced functional, along with its derivative, and attempts to minimise the error. The derivative of the reduced functional with respect to the initial condition gives a measure of the sensitivity of the error to different regions of the initial condition. The high level procedure looks like this (a rough sketch in code follows the list):

  1. define the forward problem, encapsulating all the relevant physics and external forcing;
  2. run the forward problem, capturing the operations on the tape;
  3. define a reduced functional that assesses the model state by some metric;
  4. stop annotating any further operations to tape;
  5. run nonlinear optimisation, a series of forward and adjoint evaluations of the reduced functional.
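
In (heavily simplified) pyadjoint-style code, the procedure looks something like the following. The mesh, function space, error metric and run_forward are placeholders rather than the real model, and the exact imports may differ between versions:

from firedrake import *
from firedrake.adjoint import *

continue_annotation()                    # steps 1 & 2: record operations on the tape

mesh = UnitSquareMesh(32, 32)
V = FunctionSpace(mesh, "CG", 1)
u0 = Function(V)                         # candidate initial condition (the control)
u_final = run_forward(u0)                # placeholder for the timestepping loop

u_obs = Function(V)                      # observation of the final state
J = assemble((u_final - u_obs)**2 * dx)  # step 3: scalar error metric

pause_annotation()                       # step 4: stop annotating

rf = ReducedFunctional(J, Control(u0))   # maps the initial condition to the error
u_opt = minimize(rf)                     # step 5: forward/adjoint evaluations inside the optimiser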

Running the reduced functional forward is fairly straightforward. We update the control (i.e. the input field) and re-run all the blocks of the tape. This is generally faster than the initial forward run because solver objects are cached, and the underlying code has already been generated and loaded. Importantly, we must run the reduced functional forward before we can evaluate the derivative for that input!

Conceptually, to evaluate the derivative we “just” flip the direction of the edges in the DAG and run backwards. Each of the blocks has an adjoint operation, but it often requires the result of the forward evaluation of the block. For smaller or simpler problems, it’s possible to dynamically recalculate the blocks on the fly. For our problem, this is intractable. Instead, we cache the forward evaluations of the blocks to disk, and reload them as needed. Obviously this involves a trade-off between I/O speed and computation speed, and there are checkpointing schemes that can automatically hybridise between recomputing and caching on disk (or just keeping the results in memory, another impossibility for sufficiently large problems).
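
Inside the optimisation, each iteration effectively does something like this with the reduced functional (pyadjoint-style again; u_candidate stands for whatever control the optimiser proposes):

J_val = rf(u_candidate)   # forward: replay the tape’s blocks with the new control
dJ_du0 = rf.derivative()  # adjoint: traverse the tape backwards, reusing forward results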

Excessive memory usage

With all of these forward and adjoint tape evaluations, several function objects are created. In the finite element world, a function is essentially the vector of values at the nodes of our finite element mesh, i.e. it is as large as the number of degrees of freedom in our problem. With large problems, there’s a limit on the number of functions that can physically fit in memory, so we must be efficient in deallocating unused functions.

With that said, even for a simple test problem, we were seeing a sudden increase in memory usage when evaluating the reduced functional forward. After this, there would be a slow and steady increase in memory usage with successive forward/adjoint evaluations. For our large problem, this meant that we couldn’t even evaluate the reduced functional forward a single time without hitting the physical memory cap on the HPC nodes.

Diagnosing and tracking down some of the culprits for this involved several tools, some low-level understanding, and some good old detective work.

Memory management and allocation

As mentioned above, the high-level Firedrake framework is written in Python. For the most part, everything is done in Python, except for a few places where operations are delegated to native code. These include: using PETSc to actually solve the underlying systems; finite element assembly (this uses the on-the-fly compilation); and other external libraries such as HDF5, or when Python modules defer to native code. One of the strengths of Python is that this delineation between Python-land and native-land is fairly smooth: we can cart around and call into native objects, and pass Python things into native code. However, being a “managed runtime” of sorts, we don’t have complete explicit control over everything.

One of the areas of low-level management that is almost completely paved over in Python is memory. Usually everything just works, and we don’t have to think about it at all. Most of the time, the lowest level manipulation will be the reference counting when writing native-code Python modules. Reference counting, garbage collection, and memory allocation from the Python side can usually be completely ignored. As you might have guessed, we’re not ignoring it today, so here’s a little primer on the topics!

Reference counting

Reference counting is a fairly simple, yet robust technique for memory management. In fact, reference counting with cycle detection is all Python needs. There is a wealth of literature on memory management, and it comes down to optimising for things like programmer involvement (manually keeping track of reference counts), complexity (handling cycles, etc.) and time (try not to fully iterate all allocated objects frequently).

The concept of reference counting is easy: objects that have to be allocated on the heap (think explicit allocations, we’ll get there) have an aptly-named reference count associated with them. The count is incremented whenever a new reference to the object is created, and decremented whenever one is dropped. When it falls back to 0, the object is deallocated: nothing is referring to it, so this should be safe. We can demonstrate this in a rather crude way:

class C:
    def __init__(self):
        print("init")
    def __del__(self):
        print("del")

x = C() # prints "init", refcount is 1

import sys
print(sys.getrefcount(x))
# prints 2(!)

# remove the remaining reference
del x # prints "del", as refcount went down to 0

It may be confusing that sys.getrefcount returns 2, but it is just obeying the rules! Because the object x is also referenced from the argument list of the call itself, its reference count is temporarily increased. The rules don’t get much more complex. It’s important to remember that being inside containers, or being referred to by class instances, also increases the reference count.
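
A short illustration of the container rule (nothing here is specific to our problem):

import sys

class C:
    pass

x = C()
print(sys.getrefcount(x))  # 2: the name x, plus the temporary argument reference

lst = [x]                  # the list now holds its own reference to the object
print(sys.getrefcount(x))  # 3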

For the cases where reference counting is sufficient, it’s quite useful. When something goes out of scope and can be deallocated because its reference count drops to 0, it is. In the context of our problem, if all the function objects are correctly reference counted, we should have fairly manageable memory usage. However, sometimes things get more complicated.

Cycle detection

While reference counting is a fairly robust technique, there are some situations it can’t deal with.

class C:
    def __init__(self):
        self.reference = None

a = C()
b = C()

a.reference = b
b.reference = a

This example illustrates very clearly (if artificially) the creation of a reference cycle. Suppose both a and b were to go out of scope. Because each of them still has a reference to the other, neither is deallocated! This is where (cyclic) garbage collection comes into play. The technical details are quite interesting, but perhaps a bit out of scope for what is already a long article. The key takeaway is that the garbage collector runs only periodically, with thresholds set on the number of objects allocated before it runs. Further, it is generational, so “older” objects are scanned less frequently.
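
We can watch this happen, using a weak reference to observe the objects’ lifetime without adding another strong reference:

import gc
import weakref

class C:
    def __init__(self):
        self.reference = None

a = C()
b = C()
a.reference = b
b.reference = a

alive = weakref.ref(a)      # weak references don’t increase the reference count
del a, b                    # both counts drop to 1, not 0, because of the cycle
print(alive() is not None)  # True: reference counting alone cannot free them
gc.collect()                # the cyclic garbage collector detects the cycle
print(alive())              # None: both objects have now been deallocated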

Functions allocated for the evaluation of a block tend to live a reasonably long time, in garbage collector terms. This means that they are often adopted into the older generations. While they may eventually get deallocated, they can sit around simultaneously, using more memory than necessary.
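
For reference, the collector’s thresholds and the per-generation counters can be inspected directly (the exact numbers vary between Python versions):

import gc

print(gc.get_threshold())  # typically (700, 10, 10)
print(gc.get_count())      # current collection counters for the three generations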

Allocation

The whole topic of memory management in Python (and indeed in general) is quite deep and complex. Python has several allocators of its own, depending on the type and size of the objects being allocated. And further, PETSc has its own reference-counting memory management. The HDF5 library maps files into memory, and so does SQLite (which forms part of the caching layer).

With such a complex ecosystem of dependencies, it is hard (or perhaps impossible) to use a single unified tool to completely understand the memory demands of our models. Instead, I used a variety of different tools in combination with some manual investigation.

Profiling tools

As mentioned, there’s a wide variety of tools available for profiling in general, and for memory usage specifically. Some may be Python-specific, or lower level. It’s even possible to simultaneously profile Python and its native code footprint. The tools also have different focuses: finding leaks, analysing the contributors to the high water mark, or temporal analysis. In the end, some combination of all of these tools helped me to get a picture of the problems we were seeing.

Python Memory Profiler

Memory Profiler is probably the highest-level, easiest, and perhaps fastest tool to use. Unfortunately, it is no longer maintained, but it is still valuable. This tool gives a picture of the process RSS (resident set size) through time. It can run on an entire script, just within sections, or a combination of the two. I think this latter mode is the most useful: it will produce a plot of the RSS through time and highlight the specifically profiled regions. This let me narrow things down to a section of code, for further analysis.

The main downside of Memory Profiler is that there’s no granularity: we can only see the entire process RSS as a whole. This doesn’t give us insights into what is allocating (or not deallocating) any of the memory.
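
My usage looked roughly like this (evaluate_functional is a placeholder for the real driver code); running the script under mprof run and then mprof plot produces the RSS-through-time plot with the decorated region marked:

from memory_profiler import profile

@profile
def evaluate_functional(rf, m):
    # decorated functions are highlighted on the mprof timeline, which is
    # what let me narrow the RSS growth down to a particular section of code
    return rf(m)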

memray

Memray is a fantastic and powerful profiling tool. Rather than only recording the process RSS, it records allocations and deallocations throughout time. It can track Python allocations alone, or also look at those coming from native code too. The recorded profile can be analysed with various tools, such as providing statistics, searching for leaks, or generating a flamegraph. I think in a lot of cases, these analyses could get you a long way, and they did help with some of my diagnosis. However, the flamegraph only breaks down the allocations contributing to the memory high water mark.

With the flamegraph, I could see several allocations coming from numpy (which is used as the backing storage for the finite element functions). However, it was hard to determine if these were simply required for the block evaluation, or part of objects that hadn’t yet been cleaned up. Memray does in fact offer a temporal variant of the flamegraph, where you can highlight the region of interest. I found that even for a very simple reproducing example, it was sufficiently complex that the resulting webpage was simply far too big to render.
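
For capturing the profile in the first place, I mostly wrapped the suspect region in memray’s Tracker context manager, roughly like this (the output file name is arbitrary, and rf(u_candidate) stands in for the real evaluation); the capture can then be analysed with the memray CLI tools mentioned above:

import memray

with memray.Tracker("rf_forward.bin", native_traces=True):
    rf(u_candidate)  # record every allocation/deallocation made in this region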

Intel VTune

Intel has very powerful profiling included with its oneAPI in the form of VTune. VTune can perform a huge amount of analysis, and is usually for very specific performance tuning (things like CPU cache and instruction vectorisation). It does however include a module to perform a memory profile. As with many Intel products, the documentation is all over the place: it seems to imply that it can decode Python frames and do mixed-mode analysis, but in practice this didn’t work. Regardless, it can show allocations (and deallocations) through time, and attribute them to particular call stacks, functions, or modules.

Even though VTune refused to integrate nicely with Python, its call stack information showed that again, we were seeing allocations from numpy where memory usage was increasing. I probably wasted a bit too much time trying to get anything more specific out of this.

Valgrind massif

Valgrind is a very useful tool for determining memory correctness. It runs your entire application in a “simulator” of sorts, and you pay for that in performance. The benefit is its ability to detect leaks, memory errors, or non-performant accesses. Part of the suite is massif, a memory profiler. As with valgrind itself, your application takes a significant performance hit. What we get out is a series of snapshots through time of the memory consumption, attributed to specific allocating routines. For example, we can see the different Python allocators, PETSc, etc. I think this was most useful in showing that libraries like HDF5 and SQLite didn’t significantly over-allocate. However, it again doesn’t integrate natively with Python, so you lose a lot of granularity about all the allocations in that domain.

Tracemalloc

Finally, we come to Python’s humble tracemalloc module. When in tracing mode, you can take snapshots of all allocations. These snapshots can be analysed individually, showing the largest or most frequent allocators. What was useful for my analysis was the ability to take the difference of allocations, and show which objects (and how many) were allocated in a particular region. Although it doesn’t have knowledge of external allocators, this was still useful by being able to show what was happening in Python, through time.
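
The snapshot-diffing pattern looks something like this (again, rf(u_candidate) is a stand-in for the region under suspicion):

import tracemalloc

tracemalloc.start(10)  # keep up to 10 frames of traceback per allocation

before = tracemalloc.take_snapshot()
rf(u_candidate)
after = tracemalloc.take_snapshot()

# the ten call sites whose live allocations grew the most across the region
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)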

Tracking down reference cycles

At this point, I had a reasonably clear conclusion that most of the excessive memory use was coming from all the various finite element functions and their backing array data. I did some very primitive tracking of allocations, deallocations and garbage collection cycles:

alloc_counter = 0
class EmptyDataMixin(...):  # real base classes elided
    def __init__(self):
        global alloc_counter
        # tag each instance with a unique id so that allocations and
        # deallocations can be matched up in the output
        print(f"alloc {alloc_counter}")
        self._count = alloc_counter
        alloc_counter += 1

    def __del__(self):
        print(f"dealloc {self._count}")

def gc_callback(phase, info):
    # phase is "start" or "stop"; any dealloc printed between the two
    # happened during a garbage collection pass
    print(phase)

import gc
gc.callbacks.append(gc_callback)

The most telling thing here is that I would see several deallocations during garbage collection cycles. This is a sure sign that there are reference cycles on the objects! Of course, depending on how data is structured, reference cycles can be a necessary evil. I think that for these particular objects, we would like them to be eagerly deallocated by reference counting if at all possible. With that in mind, I used gc.set_debug(gc.DEBUG_SAVEALL) to append all unreachable items to gc.garbage rather than deallocating them. I then used the referrers package to print the referrer trees of these objects. What became immediately clear is that they were all holding references to themselves!
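
The gc side of that investigation looked roughly like this (the nicer referrer trees came from the referrers package, which I won’t reproduce here; run_tape_evaluations is a placeholder):

import gc

gc.set_debug(gc.DEBUG_SAVEALL)  # unreachable objects go to gc.garbage instead of being freed

run_tape_evaluations()          # the region being investigated
gc.collect()

for obj in gc.garbage:
    # only interested in the finite element function objects
    if type(obj).__name__ == "Function":
        print(type(obj), [type(r) for r in gc.get_referrers(obj)])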

As hinted above, the data-carrying class is built from mixins that get inherited along with some other machinery. Due to the MRO (method resolution order), a mixin can’t define overrides to methods in the base class. Instead, it was doing this by assigning lambdas as instance attributes. The following gives an idea of what was happening, although drastically simplified:

class VecAccessMixin:
    def __init__(self):
        self._counter = 0
        self._get_counter = lambda: self._counter

    @property
    def counter(self):
        return self._get_counter()

Because the lambda captures a reference to self in its closure, any class that inherits from VecAccessMixin automatically has a cyclic self-reference, and must necessarily be collected by the cyclic garbage collector. In my humble opinion, that’s not ideal. Fixing that one issue helps significantly on artificial benchmarks, but EmptyDataMixin has several other reference cycles through much more convoluted paths. I’m not hopeful that they can be dealt with so easily, and it might be that the tape evaluation should explicitly del them instead of just letting them fall out of scope.
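
One possible way to break this particular cycle, while keeping the set-a-function-on-the-instance pattern, is to capture a weak reference to self instead of self itself. This is a sketch, not the actual fix that landed:

import weakref

class VecAccessMixin:
    def __init__(self):
        self._counter = 0
        self_ref = weakref.ref(self)                    # weak: doesn’t keep the instance alive
        self._get_counter = lambda: self_ref()._counter

    @property
    def counter(self):
        return self._get_counter()

With this change, the instance no longer holds a strong reference to itself, so it can be reclaimed by reference counting alone as soon as it goes out of scope.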