Network Namespacing for Jupyter Rendering
A descent into madness…
The problem
Let’s start with the motivation for why I’m even looking at network namespacing in the first place. The geoscientific framework I work on has a series of demos/tutorials. These serve several purposes simultaneously: documenting how to actually use the framework; testing that this style of documentation is always runnable; and forming part of the regression test suite.
To achieve these goals, the demos are written as jupytext-formatted Python scripts. This gives them a one-to-one mapping to Jupyter notebooks, with a couple of nice features. Firstly, the syntax is very minimal:
# This is a markdown cell
# And without a linebreak, the cell continues

# This block goes into a single code cell
x = 1
y = 2

# +
# We can manually mark cells so we can use line
# breaks within them
foo = "bar"

# This is in the same cell as above
foo += "foo"
# -
The very small # + and # - markers are easy to manipulate
manually, and don’t get in the way when editing the script as plain
text. In this format, the script can be edited as a notebook but
synchronised as plain text on disk.
The second useful feature is being able to tag cells to run only in the notebook context. This way we can add plotting/diagnostics that aren’t needed when running as a test case, particularly because PyVista can require a display server to work correctly. It’s possible to tag sections that run only in the script context, but I haven’t found any need for this yet. The tags go on the cell markers:
# + tags=["active-ipynb"]
# This plot only exists in the notebook
plt.plot(...)
# -
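Since everything above hinges on the script-to-notebook round trip, here is a minimal sketch of it using jupytext’s Python API (the file name is illustrative; in practice the demos are converted from the command line, as described in the next section):

import jupytext

# A jupytext-formatted script loads as an ordinary notebook object...
nb = jupytext.read("demo.py")
# ...and can be written straight back out as a .ipynb file.
jupytext.write(nb, "demo.ipynb")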
Rendering the demos
In the name of accessibility and documentation, part of the CI
pipeline is to render these notebooks to .ipynb files including cell
outputs and plots. There are also runtime artifacts generated by some
of the demos, such as animations. With the
mkdocs-jupyter plugin,
these notebooks can be included in the static site generation workflow
for https://gadopt.org. To decouple the generation of the website
from the framework’s source, the notebooks are bundled up in a GitHub
workflow artifact created by a render-specific workflow. This
workflow simply sets up a virtual display that PyVista can use for
rendering its plots, then runs all the notebooks in parallel (some of
them take a little while) with jupytext --execute.
This brings us to the issue at hand: starting several jupytext execution jobs in parallel runs into a race condition leading to a deadlock. Let’s dive into the rather convoluted startup pipeline to see what’s really going on.
Executing notebooks from the command line
From the jupytext command line, when we request the --execute
option, nbconvert is used to do
the actual work. Within nbconvert, an ExecutePreprocessor is
created, which takes on the tasks of creating a kernel instance and a
client. It can do this because it is actually a subclass of
NotebookClient, which is in
nbclient, “a client library for
programmatic notebook execution”.
Now, because jupytext doesn’t pass one in, the NotebookClient
first creates its own Kernel Manager before it starts its Kernel Client.
The default KernelManager class comes from the Jupyter Client
API, and we ask it to start a
new kernel (i.e. the actual backend that will be doing the execution).
On the face of it, this looks straightforward:
@in_pending_state
async def _async_start_kernel(self, **kw: t.Any) -> None:
    kernel_cmd, kw = await self._async_pre_start_kernel(**kw)
    await self._async_launch_kernel(kernel_cmd, **kw)
    await self._async_post_start_kernel(**kw)
start_kernel = run_sync(_async_start_kernel)
We perform some pre-start procedure, which yields a kernel command
that can actually be used to launch the kernel; once the launch is
done, any post-start work follows. Each of these sub-steps is quite
involved on its own. We ask the KernelProvisionerFactory for a
provisioner instance for our given kernel spec, and then ask the
provisioner to pre-launch.
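Condensed into a runnable sketch, that flow looks roughly like this (paraphrasing jupyter_client rather than quoting it; with no kernel manager attached as the provisioner’s parent we simply get the spec’s raw argv back as the command):

import asyncio
import uuid

from jupyter_client.kernelspec import KernelSpecManager
from jupyter_client.provisioning import KernelProvisionerFactory


async def pre_start(kernel_name: str = "python3") -> list:
    # Ask the factory for a provisioner matching the kernel spec...
    spec = KernelSpecManager().get_kernel_spec(kernel_name)
    provisioner = KernelProvisionerFactory.instance().create_provisioner_instance(
        str(uuid.uuid4()), spec, parent=None
    )
    # ...then let it pre-launch, which hands back the kernel command.
    kw = await provisioner.pre_launch()
    return kw.pop("cmd")


print(asyncio.run(pre_start()))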
The default provisioner is the LocalProvisioner, which launches the
kernel locally, as its name suggests. Don’t worry, we’re finally
getting to the race condition! As part of its pre-launch routine, and
if the kernel manager wants to, the local provisioner will cache
ports (shell, IO pub, stdin, heartbeat and control all get a separate
port). Additionally, because the kernel manager inherits from
ConnectionFileMixin, these ports get written to a connection file
for when the kernel actually starts. So, for each of the desired
ports, the provisioner will bind a socket, setting the port argument
to 0, effectively asking the operating system for a random free port. It
then closes the socket, meaning the port is not reserved in any way.
Given that there are only 65,535 ports available for TCP (and even
fewer once you subtract off the privileged ports), there’s quite
a high chance of collisions.
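The essence of that bind-and-release dance is easy to reproduce; a minimal sketch (not the provisioner’s actual code) looks something like this:

import socket


def pick_free_port(ip: str = "127.0.0.1") -> int:
    # Bind to port 0 so the OS assigns a currently-free port, remember it,
    # then close the socket. Nothing holds onto the port afterwards, so
    # another process can grab it before our kernel gets around to binding it.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((ip, 0))
    port = sock.getsockname()[1]
    sock.close()
    return port


# Five "cached" ports: shell, IOPub, stdin, heartbeat and control
print([pick_free_port() for _ in range(5)])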
Now that we’ve done the setup, we can try to launch the kernel, which
hits the main entry point for the IPython
Kernel. Essentially, this is the
point where the connection file is re-read, and all the ports are
bound for ZMQ to use. However, the logic in
IPKernelApp._bind_socket dictates that when a port has a non-zero
value, the bind fails outright if that port is unavailable. We get a
ZMQError, a failed kernel start, and ultimately a failed render
workflow. Can we configure our way around this problem?
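To illustrate the distinction, here is a small sketch of the two binding paths (paraphrasing the behaviour described above, not IPKernelApp’s actual source):

import zmq


def bind_socket(sock: zmq.Socket, ip: str, port: int) -> int:
    if port <= 0:
        # no pre-assigned port: let ZMQ find a free one itself
        return sock.bind_to_random_port(f"tcp://{ip}")
    # a pre-assigned (cached) port must bind exactly; if something else has
    # grabbed it in the meantime, this raises zmq.error.ZMQError
    sock.bind(f"tcp://{ip}:{port}")
    return port


context = zmq.Context.instance()
shell = context.socket(zmq.ROUTER)
print(bind_socket(shell, "127.0.0.1", 0))  # succeeds, on some free port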
Jupyter configuration
Jupyter has an extensive configuration framework called
Traitlets. When you look at the
available options for a component, such as IPython itself, its
power and configurability become evident. But is it as great as it
seems? The local provisioner will write specific port values to the
connection file if the kernel manager wants to cache ports. There is
indeed a KernelManager.cache_ports configuration option, which
defaults to true if the underlying transport is TCP (and false
otherwise). However, there doesn’t seem to be a way to pass
configuration through the jupytext or nbconvert layers. Maybe we can
add it as an argument to the kernel launcher in kernel.json? With
some print debugging, it doesn’t seem like this has any effect.
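For reference, if the configuration layer were reachable, switching the option off programmatically with traitlets would look roughly like this (a sketch only; the whole problem is that jupytext’s CLI path never gives us a hook to supply such a Config):

import nbformat
from nbclient import NotebookClient
from traitlets.config import Config

config = Config()
config.KernelManager.cache_ports = False  # the option we would like to switch off

# NotebookClient is a traitlets Configurable, so it happily accepts this...
nb = nbformat.v4.new_notebook()
client = NotebookClient(nb, config=config)
# ...but nothing in the jupytext --execute pipeline lets us pass it in.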
A second trap
Even if we completely bypass the configuration layer and hardcode port caching off, there’s another pitfall lurking. When the connection file is written, any port with the default value of 0 is given a random value (through the same bind-to-0 procedure)! It seems that if we want to use the logic within the IPython Kernel for attempting to bind a random port at startup, we cannot go through the client API layer. And thus, for the moment, this is a dead end.
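The behaviour is easy to observe directly with jupyter_client’s connection-file writer (the file name here is just for illustration):

from jupyter_client.connect import write_connection_file

# Ask for every port to be 0, i.e. "let the kernel choose at startup"...
fname, info = write_connection_file(
    "kernel-demo.json", transport="tcp", ip="127.0.0.1",
    shell_port=0, iopub_port=0, stdin_port=0, hb_port=0, control_port=0,
)
# ...yet the file already contains concrete (and unreserved) port numbers.
print(info["shell_port"], info["iopub_port"], info["control_port"])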
Sidestepping a race condition
Now that there’s some background on the issue, let’s look at some ways around it. The approach in this section is the first thing I actually tried: it’s good enough, but certainly ugly and not robust. As the title suggests, this isn’t the final solution, and indeed it’s not even what I thought of first. If our race condition arises because all the Jupyter Clients are trying to bind ports for their connection files simultaneously, maybe we can just stagger their startup.
Ideally, one process will write its connection file, then load it and
bind all the ports to ZMQ before the next one is allowed to start.
There isn’t really a nice (simple) approach for monitoring the startup
of a job to make sure it’s definitely ready to go. I’m sure something
elaborate could be rigged up with ss or some kind of
instrumentation, but the effort of developing such an approach outweighs its
benefit. Instead, we can just wait! When the render workflow takes
on the order of 45 minutes, what’s an extra 1 second per demo?
For some more background, the work of the render workflow is
driven through a Makefile. The rendering takes place in a pattern
rule transforming %.py to %.ipynb, and we simply use make -j.
By the way, there’s a jobserver which looks like an interesting
way of coordinating parallel tasks in recursive Makefiles. I digress;
we’re looking for this magic:
%.ipynb: %.py
	(flock 9; \
	sleep 1 && \
	flock -u 9 & \
	python3 -m jupytext --to ipynb --execute $< \
	) 9>render.lock
Essentially we’re using a file-based lock to serialise the
startup of the render tasks. Within a single task, a subshell is
started with file descriptor 9 pointing at render.lock. To actually
start the job, we need to take an exclusive lock on this file using
flock from util-linux. From the manpage:
[flock wraps] the lock around the execution of a command … [it locks] a specified file or directory, which is created (assuming appropriate permissions) if it does not already exist. By default, if the lock cannot be immediately acquired, flock waits until the lock is available.
By opening the file descriptor using >, we ensure the file can be
created if it does not already exist. We could also use < which
only requires read permissions (however the file itself must exist
first). Now, we don’t want to wrap the entire jupytext command in the
critical region, or we’d completely serialise the whole process.
Instead, jupytext starts immediately while a background job waits 1
second and then releases the lock. Hopefully this is enough time for the
launch procedure to complete before the next task is allowed to proceed.
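The same choreography can be sketched in Python with fcntl, since both it and the util-linux flock wrap the flock(2) syscall (the notebook name below is illustrative):

import fcntl
import subprocess
import time

with open("render.lock", "w") as lockfile:
    fcntl.flock(lockfile, fcntl.LOCK_EX)  # block until the previous task lets go
    proc = subprocess.Popen(
        ["python3", "-m", "jupytext", "--to", "ipynb", "--execute", "demo.py"]
    )
    time.sleep(1)                         # give the kernel time to claim its ports
    fcntl.flock(lockfile, fcntl.LOCK_UN)  # now the next task may start
proc.wait()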
Doing it (more) properly
Hopefully it’s apparent that this serialisation approach does absolutely nothing to address the underlying race condition. I implemented it mostly to avoid having to retrigger failed render workflows. My initial thought (verbatim) was that this could be done with cgroups in order to isolate the Jupyter clients from one another. The technology I was actually looking for was network namespaces, which is similar but not quite the same.
Using the unshare program, we can execute a given program with new
namespaces. From network_namespaces(7):
Network namespaces provide isolation of the system resources associated with networking: network devices, IPv4 and IPv6 protocol stacks, IP routing tables, firewall rules, …
We can try that out on its own. Because we’re not also asking for a user namespace, this needs to be executed as root. We then end up in a shell with its own network namespace and a loopback interface in the down state.
$ sudo unshare --net
# ip l
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Even though the render workflow runs in a Docker container, I’d prefer
that things run with user privileges just to ensure the demos are
sensible. To do this, we also request a user namespace, which gives a
distinct set of UIDs, GIDs, and importantly, capabilities. In fact,
we go one small step further and use the --map-root-user option.
This maps the effective UID and GID within the namespace to that of
root. The only real reason we need to do this is to bring up the
loopback interface in the namespace so Jupyter can actually bind its
ports somewhere. Unfortunately, the blurred line between file access
inside and outside the namespace means that we also look like root
when we’re running our render task, and some paths need to be massaged.
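A quick check that this combination actually works, run as an ordinary user (a sketch, not the production wrapper):

import subprocess

# Inside fresh network + user namespaces, our mapped-root user is allowed to
# bring the loopback interface up, which is all Jupyter needs for its ports.
subprocess.run(
    ["unshare", "--net", "--user", "--map-root-user", "sh", "-c",
     "ip link set lo up && ip link show lo"],
    check=True,
)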
The way --map-root-user is able to give us root permissions despite
being an unprivileged command is through the subordinate UID
mechanism. The /etc/subuid (and its corresponding subgid) file
defines a range of UIDs that can be mapped into child namespaces.
Then unshare will set the /proc/[pid]/uid_map file to establish a
mapping between UIDs inside the namespace and outside it. With
this in place, while we look like root within the namespace (and have
many of the capabilities available to root), we’re only able to impact
the system as our regular user.
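You can see the resulting mapping from inside such a namespace (the outer UID shown will be whatever your real UID happens to be):

import subprocess

# The single line of /proc/self/uid_map reads: <inside UID> <outside UID> <count>.
# With --map-root-user we expect something like "0 1000 1": root inside the
# namespace is just our regular user outside it.
subprocess.run(
    ["unshare", "--user", "--map-root-user", "cat", "/proc/self/uid_map"],
    check=True,
)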
So we finally come to where things are today: the render container
unfortunately needs to be given the SYS_ADMIN capability to make use
of namespaces, and a wrapper script emulates the behaviour of jupytext
while sandboxing it in a network namespace and smoothing over some of
the path oddities. In the end, it all works pretty seamlessly and
I’ve at least learned a few things along the way!
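As a postscript, here is a heavily simplified sketch of what such a wrapper can look like (the real script does more path massaging, and the names here are illustrative):

#!/usr/bin/env python3
"""Run jupytext inside fresh network and user namespaces (a simplified sketch)."""
import subprocess
import sys

notebook = sys.argv[1]
# Bring the loopback interface up so the kernel has somewhere to bind its
# ports, then hand over to jupytext as usual.
inner = 'ip link set lo up && exec python3 -m jupytext --to ipynb --execute "$1"'
subprocess.run(
    ["unshare", "--net", "--user", "--map-root-user", "sh", "-c", inner, "sh", notebook],
    check=True,
)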