Network Namespacing for Jupyter Rendering

A descent into madness…

The problem

Let’s start with a motivation for why I’m even looking at network namespacing in the first place. The geoscientific framework I work on has a series of demos/tutorials. These serve several purposes simultaneously: they document how to actually use the framework, they check that this style of documentation is always runnable, and they form part of the regression test suite.

To achieve these goals, the demos are written as jupytext-formatted Python scripts. This gives them a one-to-one mapping to Jupyter notebooks, with a couple of nice features. Firstly, the syntax is very minimal:

# This is a markdown cell
# And without a linebreak, the cell continues

# This block goes into a single code cell
x = 1
y = 2

# +
# We can manually mark cells so we can use line
# breaks within them
foo = "bar"

# This is in the same cell as above
foo += "foo"
# -

The very small # + and # - markers are easy to manipulate manually, and don’t get in the way when editing the script as plain text. In this format, the script can be edited as a notebook but synchronised as plain text on disk.

The second useful feature is being able to tag cells to run only in the notebook context. This way we can add plotting/diagnostics that aren’t needed when running as a test case, particularly because PyVista can require a display server to work correctly. It’s possible to tag sections that run only in the script context, but I haven’t found any need for this yet. The tags go on the cell markers:

# + tags=["active-ipynb"]
# This plot only exists in the notebook
plt.plot(...)
# -

Rendering the demos

In the name of accessibility and documentation, part of the CI pipeline is to render these notebooks to .ipynb files including cell outputs and plots. There are also runtime artifacts generated by some of the demos, such as animations. With the mkdocs-jupyter plugin, these notebooks can be included in the static site generation workflow for https://gadopt.org. To decouple the generation of the website from the framework’s source, the notebooks are bundled up in a GitHub workflow artifact created by a render-specific workflow. This workflow simply sets up a virtual display that PyVista can use for rendering its plots, then runs all the notebooks in parallel (some of them take a little while) with jupytext --execute.

This brings us to the issue at hand: starting several jupytext execution jobs in parallel runs into a race condition leading to a deadlock. Let’s dive into the rather convoluted startup pipeline to see what’s really going on.

Executing notebooks from the command line

From the jupytext command line, when we request the --execute option, nbconvert is used to do the actual work. Within nbconvert, an ExecutePreprocessor is created, which takes on the tasks of creating a kernel instance and a client. This is because it is actually a subclass of NotebookClient, which lives in nbclient, “a client library for programmatic notebook execution”.
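
To make those layers a little more concrete, here is a minimal sketch of the programmatic path that jupytext and nbconvert ultimately drive (demo.ipynb is a stand-in name, and the jupytext conversion step is skipped):

import nbformat
from nbclient import NotebookClient

# Read a notebook, execute every cell in a fresh kernel, and write the
# outputs back; ExecutePreprocessor builds on exactly this machinery.
nb = nbformat.read("demo.ipynb", as_version=4)
client = NotebookClient(nb, kernel_name="python3", timeout=600)
client.execute()
nbformat.write(nb, "demo.ipynb")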

Now, because jupytext doesn’t pass one through, the NotebookClient first assigns itself a Kernel Manager before it starts its Kernel Client. The default KernelManager class comes from the Jupyter Client API, and we ask it to start a new kernel (i.e. the actual backend that will be doing the execution). On the face of it, this looks straightforward:

@in_pending_state
async def _async_start_kernel(self, **kw: t.Any) -> None:
    kernel_cmd, kw = await self._async_pre_start_kernel(**kw)
    await self._async_launch_kernel(kernel_cmd, **kw)
    await self._async_post_start_kernel(**kw)

start_kernel = run_sync(_async_start_kernel)

We perform some pre-start procedure, which gives us a kernel command that can actually be used to launch the kernel; once the launch completes, any post-start work is carried out. Each of these sub-steps is quite involved on its own. We ask the KernelProvisionerFactory for a provisioner instance for our given kernel spec, and then ask the provisioner to pre-launch.
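
Stripped of the provisioner machinery, the client API usage sitting underneath looks roughly like this sketch:

from jupyter_client import KernelManager

# Start a kernel (pre-start, launch, post-start) and attach a client.
km = KernelManager(kernel_name="python3")
km.start_kernel()

kc = km.client()
kc.start_channels()
kc.wait_for_ready(timeout=60)

# ... the notebook cells would be executed through kc here ...

kc.stop_channels()
km.shutdown_kernel()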

The default provisioner is the LocalProvisioner, which launches the kernel locally, as its name suggests. Don’t worry, we’re finally getting to the race condition! As part of its pre-launch routine, and if the kernel manager wants it to, the local provisioner will cache ports (shell, IOPub, stdin, heartbeat and control each get a separate port). Additionally, because the kernel manager inherits the ConnectionFileMixin, these ports get written to a connection file for when the kernel actually starts. So, for each of the desired ports, the provisioner binds a socket with the port argument set to 0, effectively asking the operating system for a random free port. It then closes the socket, meaning the port is not reserved in any way. Given that there are only 65,535 TCP ports available (a little fewer once you subtract the privileged ones), and that nothing holds a cached port between the provisioner releasing it and the kernel binding it, parallel launches have quite a high chance of colliding.
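
The port-caching step boils down to something like the following sketch (simplified from what the provisioner actually does):

import socket

def pick_port() -> int:
    # Bind to port 0 so the operating system hands back a free port,
    # then close the socket straight away. Nothing reserves the port
    # after close(), so a kernel launched in parallel can grab it.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    port = sock.getsockname()[1]
    sock.close()
    return port

# Five of these end up in the connection file: shell, IOPub, stdin,
# heartbeat and control.
cached_ports = [pick_port() for _ in range(5)]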

Now that we’ve done the setup, we can try to launch the kernel, which brings us to the main entry point for the IPython Kernel. Essentially, this is the point where the connection file is re-read and all the ports are bound for ZMQ to use. However, the logic in IPKernelApp._bind_socket dictates that if a port has a non-zero value, the bind simply fails when that port is no longer available. We get a ZMQError, a failed kernel start, and ultimately a failed render workflow. Can we configure our way around this problem?
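
For reference, the two code paths look roughly like this pyzmq sketch (the port value is made up): a cached, non-zero port gets a plain bind that raises if it is already taken, whereas a zero port lets ZMQ hunt for a free one.

import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.ROUTER)

port = 54321  # pretend this is a cached port from the connection file
if port <= 0:
    # The port-0 path: ZMQ picks a random free port for us.
    port = sock.bind_to_random_port("tcp://127.0.0.1")
else:
    # The cached-port path: a plain bind that raises zmq.ZMQError
    # (EADDRINUSE) if another kernel got there first.
    sock.bind(f"tcp://127.0.0.1:{port}")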

Jupyter configuration

Jupyter has an extensive configuration framework called Traitlets. When you look at the available options for a component, such as IPython itself, its power and configurability become evident. But is it as great as it seems? The local provisioner will only write specific port values to the connection file if the kernel manager wants to cache ports. There is indeed a KernelManager.cache_ports configuration option, which defaults to true if the underlying transport is TCP (and false otherwise). However, there doesn’t seem to be a way to pass configuration through the jupytext or nbconvert layers. Maybe we can add it as an argument to the kernel launcher in kernel.json? With some print debugging, it doesn’t seem like this has any effect.
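
For illustration, this is how the option would be set if we could get a Config object into the right hands; a sketch only, since nothing in the jupytext --execute path lets us pass it through:

from traitlets.config import Config
from jupyter_client import KernelManager

c = Config()
c.KernelManager.cache_ports = False

# A manager we construct ourselves honours the option...
km = KernelManager(kernel_name="python3", config=c)
# km.cache_ports is now False

# ...but jupytext/nbconvert build their own manager internally, with no
# command-line hook to hand this Config through.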

A second trap

Even if we completely bypass the configuration layer and hardcode port caching off, there’s another pitfall lurking. When the connection file is written, any port with the default value of 0 is given a random value (through the same bind-to-0 procedure)! It seems that if we want to use the logic within the IPython Kernel for attempting to bind a random port at startup, we cannot go through the client API layer. And thus, for the moment, this is a dead end.
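
As a footnote, this behaviour is easy to confirm with jupyter_client’s own connection-file helper:

from jupyter_client.connect import write_connection_file

# Every port is left at its default of 0, yet the file that lands on
# disk contains concrete values chosen via the same bind-to-0 trick,
# so a 0 never reaches the kernel's own binding logic.
fname, cfg = write_connection_file(transport="tcp", ip="127.0.0.1")
print(cfg["shell_port"], cfg["iopub_port"], cfg["control_port"])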

Sidestepping a race condition

Now that there’s some background on the issue, let’s look at some ways around it. This was the first thing I tried, and it’s good enough, but it’s certainly ugly and not robust. As the title suggests, it isn’t the final solution, and in fact it wasn’t even my first idea. If our race condition arises because all the Jupyter Clients are trying to bind ports for their connection files simultaneously, maybe we can just stagger their startup.

Ideally, one process will write its connection file, then load it and bind all the ports to ZMQ before the next one is allowed to start. There isn’t really a nice (simple) way of monitoring the startup of a job to make sure it’s definitely ready to go. I’m sure something elaborate could be rigged up with ss or some kind of instrumentation, but the effort of developing such an approach outweighs its benefit. Instead, we can just wait! When the render workflow takes on the order of 45 minutes, what’s an extra second per demo?

For some more background: the render workflow is driven through a Makefile. The rendering takes place in a pattern rule transforming %.py to %.ipynb, and we simply use make -j. By the way, Make’s jobserver looks like an interesting way of coordinating parallel tasks in recursive Makefiles. I digress; we’re looking for this magic:

%.ipynb: %.py
	(flock 9; \
	sleep 1 && \
	flock -u 9 & \
	python3 -m jupytext --to ipynb --execute $< \
	) 9>render.lock

Essentially we’re using a file-based lock to serialise the startup of the render tasks. Within a single task, a subshell is started with file descriptor 9 pointing at render.lock. To actually start the job, we need to take an exclusive lock on this file using flock from util-linux. From the manpage:

[flock wraps] the lock around the execution of a command … [it locks] a specified file or directory, which is created (assuming appropriate permissions) if it does not already exist. By default, if the lock cannot be immediately acquired, flock waits until the lock is available.

By opening the file descriptor with >, we ensure the file is created if it does not already exist. We could also use <, which only requires read permissions (however, the file itself must exist first). Now, we don’t want to wrap the entire jupytext command in the critical region, or we’d completely serialise the whole process. Instead, we wait 1 second and then release the lock, while simultaneously starting jupytext. This is hopefully enough time for the launch procedure to finish before the next task is allowed to proceed.

Doing it (more) properly

Hopefully it’s apparent that this serialisation approach does absolutely nothing to address the underlying race condition. I implemented it mostly to avoid having to retrigger failed render workflows. My initial thought (verbatim) was that this could be done with cgroups in order to isolate the Jupyter clients from one another. The technology I was actually looking for was network namespaces, which is similar but not quite the same.

Using the unshare program, we can execute a given program with new namespaces. From network_namespaces(7):

Network namespaces provide isolation of the system resources associated with networking: network devices, IPv4 and IPv6 protocol stacks, IP routing tables, firewall rules, …

We can try that out on its own. Because we’re not also asking for a user namespace, this needs to be executed as root. Once inside, we have a shell with its own network namespace and a loopback interface in the down state.

$ sudo unshare --net
# ip l
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Even though the render workflow runs in a Docker container, I’d prefer that things run with user privileges just to ensure the demos are sensible. To do this, we also request a user namespace, which gives a distinct set of UIDs, GIDs and, importantly, capabilities. In fact, we go one small step further and use the --map-root-user option. This maps the effective UID and GID within the namespace to that of root. The only real reason we need to do this is to bring up the loopback interface in the namespace, so Jupyter can actually bind its ports somewhere. Unfortunately, the blurred line between file access inside and outside the namespace means that we also look like root when we’re running our render task, and some paths need to be massaged.

The way --map-root-user is able to give us root-like permissions despite being an unprivileged command is through the user-namespace ID mapping mechanism: unshare writes the /proc/[pid]/uid_map (and corresponding gid_map) file to establish a mapping between IDs inside the namespace and IDs outside it. Mapping our own UID onto root needs no extra privilege; the /etc/subuid (and its corresponding subgid) file only comes into play when a whole range of subordinate IDs needs to be mapped into the namespace. With this in place, while we look like root within the namespace (and have many of the capabilities available to root), we’re only able to impact the system as our regular user.
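
A quick way to see this from inside an unshare --net --user --map-root-user shell (the 1000 below is just an assumed outer UID):

import os

# Inside the namespace we appear to be root...
print(os.getuid(), os.geteuid())   # 0 0

# ...but the single-entry map points back at the unprivileged user on
# the outside, e.g. "0 1000 1" (inside-UID, outside-UID, count).
with open("/proc/self/uid_map") as f:
    print(f.read())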

So we finally come to where things are today: the render container unfortunately needs to be given the SYS_ADMIN capability to make use of namespaces, and a wrapper script emulates the behaviour of jupytext while sandboxing it in a network namespace and smoothing over some of the path oddities. In the end, it all works pretty seamlessly and I’ve at least learned a few things on the way!
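
For a flavour of what that wrapper can look like, here is a heavily simplified sketch (the name render_wrapper.py and the exact flags are illustrative, and the path massaging mentioned above is left out entirely):

#!/usr/bin/env python3
"""render_wrapper.py: run jupytext inside fresh network/user namespaces."""
import subprocess
import sys

# Bring up loopback inside the new network namespace, then hand the
# original arguments straight through to jupytext.
inner = 'ip link set lo up && exec python3 -m jupytext "$@"'
cmd = [
    "unshare", "--net", "--user", "--map-root-user",
    "sh", "-c", inner, "--",
] + sys.argv[1:]

sys.exit(subprocess.call(cmd))

Invoked as python3 render_wrapper.py --to ipynb --execute demo.py, it looks just like jupytext from the Makefile’s point of view.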