
Caching Deadlock

A descent into madness…

The problem

As described in the article on memory profiling, we are running highly parallel models that involve automatic code generation. As it turns out, actually generating the code by traversing the AST created by the Firedrake DSL is fairly expensive. Moreover, we run the model in a time-stepping loop, so the same operation can be repeated several times. For these reasons, there is aggressive caching of this code generation phase. Recently, there was a regression, followed by an optimisation, in the code generation and caching mechanism.
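
To give a sense of the pattern, here is a minimal sketch of disk-backed memoisation in general (the cache directory and helper are made up for illustration; this is not the Firedrake implementation). The cache wraps the expensive generation step, returns a stored result when the same key comes up again, and uses the write-then-rename trick discussed further down to stay atomic:

    import hashlib
    import os
    import pickle
    import tempfile

    CACHE_DIR = "/tmp/codegen-cache"  # hypothetical location

    def cached_generate(key, generate):
        """Return the cached value for `key`, calling `generate()` on a miss."""
        os.makedirs(CACHE_DIR, exist_ok=True)
        path = os.path.join(CACHE_DIR, hashlib.sha256(key.encode()).hexdigest())
        try:
            with open(path, "rb") as f:        # cache hit
                return pickle.load(f)
        except FileNotFoundError:              # cache miss: do the expensive work
            value = generate()
            # Write to a temporary file and rename it into place, so other
            # processes never observe a half-written cache entry.
            fd, tmp = tempfile.mkstemp(dir=CACHE_DIR)
            with os.fdopen(fd, "wb") as f:
                pickle.dump(value, f)
            os.replace(tmp, path)
            return value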

I saw the regression fairly quickly, as we were suddenly getting significantly slower runtimes on our testing suite (something like a 50% increase in time). Luckily, this was a very obvious regression that several groups reported simultaneously, and it was fixed in short order. Following this, the code generation and caching framework was “optimised”, at least in a way that passed all of the unit tests. We, however, simply saw our testing suite completely hanging and hitting the 6-hour time limit on GitHub Actions. The testing suite we run involves several models, some of which have a reasonable degree of parallelisation. In that sense, we test some parts of the Firedrake code much more aggressively than through unit tests alone. The hang was completely reproducible, which meant that I should be able to dive in and figure out what was going on.

Diagnosis

The first step in the diagnosis felt a little bit like cheating, but it certainly made things easier. We run GitHub Actions through self-hosted runners, which means that I have direct access to them. I connected to the runner, fired up a shell in the Docker container corresponding to the run, and could figure out which of our test cases were deadlocked. This let me try to reproduce the issue in a more minimal environment.

Having our test environment run inside a standardised container makes it easy to investigate failures: just fire up a new instance and start probing. That’s exactly what I did, on a different machine. Of the two failing test cases, one runs on 8 cores and takes a while, whereas the other only uses 2 and is much quicker. I ran the faster case in a loop until the longer one terminated (or didn’t), and could fairly easily reproduce the deadlock.

To get more information about the deadlock, my first approach is usually to add a little snippet that dumps a traceback upon receiving a signal (obviously this works because I’m using Python):

# Dump a traceback for every thread when the process receives SIGUSR2
import faulthandler, signal
faulthandler.register(signal.SIGUSR2)

Once I observe that the process has stalled, I can send it a kill -USR2 and confirm that it is stuck in an MPI operation of some kind. With hindsight, maybe this would have been sufficient to nail down the cause, but it wasn’t at the time. There are a couple of issues. Firstly, we have mixed Python and native code, which calls back and forth between the two worlds; it can be hard to observe the full state from the Python traceback alone, but it can also be hard to interface native tools (like gdb) with Python. Secondly, deadlocks don’t always manifest where their root cause lies. In this case, as we’ll see, the issue caused the MPI calls made by the different ranks to diverge. Several MPI calls still lined up, even though they were likely producing junk, and the hang only occurred once the ranks eventually issued different operations.
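
In a parallel run it also helps to know which rank is stuck where. A small variation on the snippet above (the file name here is purely illustrative) writes each rank’s dump to its own file:

    import faulthandler
    import signal

    from mpi4py import MPI

    rank = MPI.COMM_WORLD.rank
    # Keep the file object alive for the lifetime of the process; on SIGUSR2,
    # faulthandler writes the traceback of every thread in this rank to it.
    traceback_file = open(f"traceback.{rank}.txt", "w")
    faulthandler.register(signal.SIGUSR2, file=traceback_file)

Sending kill -USR2 to each process then gives one traceback per rank, which makes it much easier to spot a rank that has wandered down a different code path.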

In order to get a view of things from the native side, I attached gdb to the Python processes too. The python-gdb.py extension, as described in the Python documentation, is a very neat way to mix native and Python call stacks. Unfortunately, I was running a release build of Python, so occasionally several frames would be merged together, but it was still enough to get an idea of some local variables and the call path.

The cause

As I mentioned in the introduction, this deadlock happened in the caching of runtime code generation. With insufficient caching, this mechanism can easily eat performance. An optimisation had been added whereby, on a cache miss, the computation would only happen on the root rank and the result would then be broadcast to the rest:

    if value is CACHE_MISS:
        if bcast:
            value = func(*args, **kwargs) if comm.rank == 0 else None
            value = comm.bcast(value, root=0)
        else:
            value = func(*args, **kwargs)

This would probably work if the caching were atomic across all ranks. Obviously, it wasn’t. The caching is backed by a file, where a value is written and then moved into place atomically. This solves half of the problem: a rank can’t see a half-cached value and completely break. However, if one rank writes the file before another rank looks, the ranks will be split on whether the lookup is a hit or a miss. Per the code above, not all ranks will participate in the broadcast, and so begins the deadlock! Curiously, a later collective operation was able to match the broadcast, so all of the ranks moved along to different places before getting stuck.
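
To make the divergence concrete, here is a toy reproduction of the same shape (made up for illustration, not the Firedrake code). Run under at least two ranks, the program posts mismatched collectives; exactly where it produces junk or hangs is implementation-dependent, but it cannot proceed cleanly:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    CACHE_MISS = object()

    # Pretend rank 1 found the freshly written cache file while the others missed.
    value = "cached code" if comm.rank == 1 else CACHE_MISS

    if value is CACHE_MISS:
        # Only the "miss" ranks reach this broadcast; the "hit" rank never posts
        # the matching call, so MPI pairs it with whatever collective that rank
        # issues next.
        value = comm.bcast("regenerated code" if comm.rank == 0 else None, root=0)

    # The "hit" rank arrives here first, and this barrier (or any later
    # collective) gets matched against the broadcast above.
    comm.barrier()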

As it turns out, my final diagnosis came from simply printing, from every rank, whether there was a cache hit or a miss! I reported this as a bug and it was promptly addressed, changing the structure to accommodate a split in hit/miss across the ranks:

    root = comm.rank if cache_hit else -1
    root = comm.allreduce(root, op=MPI.MAX)
    if root >= 0:
        # Found a rank with a cache hit, broadcast 'value' from it
        value = comm.bcast(value, root=root)
        cache_hit = True
    else:
        # Cache miss on all ranks, recompute below
        cache_hit = False

The allreduce means that every rank now agrees on whether any rank saw a hit and, if so, which rank acts as the broadcast root. This feels suboptimal, because it involves a (potentially expensive) MPI collective on every cache lookup. For the moment, the extra cost is the price to pay for parallel correctness. This kind of bug is always subtle, and perhaps obvious once you’ve seen it, but not before.