Sandboxing data crunches, Chapter 2: clone processes

This post, second in a series, offers a Python solution to a Python problem. The broad principles should apply to any single-threaded language on Linux.

In Chapter 1, we began sandboxing. Workbench’s “Renderer” runs a series of Pandas-powered “Steps”. We give each Step a new process, so Steps can’t control Renderer or other Steps.

Workbench Steps’ environment — Python, Pandas, Numpy, Pyarrow — takes 1–2 seconds to start. But Renderer needs to launch several Steps per second, per CPU. We must optimize.

It costs 1–2 seconds between when Renderer calls subprocess.Popen() and when Step sandboxes itself. Importing Pandas, Numpy and Pyarrow takes too long.

We can’t “recycle” Step processes after they run a Step’s untrusted code. A “bad” Step could, say, rewrite Python’s open() built-in function to email our users’ data to the Step author. Once untrusted code has run, its Step process must die.

And logically, if each Step results in a process dying, each Step must involve a process being created. How can we quickly create a process that costs 2 seconds to start?

Let’s see why subprocess.Popen() is slow. It implements the fork-exec UNIX pattern, which actually involves two system calls:

clone() duplicates a process; execve() wipes its memory and makes it run something else
  1. clone(): duplicate all the RAM of the current process. Now there are two processes: the Renderer process and its near-exact clone, the Step process. The Renderer process continues; in it, clone() returns the Step process ID (“pid”). At the same time, the Step process continues at the exact same place; but there, clone() returns 0. Aside from that single integer difference, the two processes are identical.
  2. execve(): wipe the Step process’s memory and make it run something else.

(Linux implements the UNIX fork() function with its own clone() system call. The pattern is known as “fork-exec”; and on Linux, it boils down to “clone-exec.”)

Fork-exec isn’t unique to Python: it’s how UNIX works.
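Here is a minimal sketch of the two steps, using Python's os.fork() and os.execv() rather than subprocess.Popen(). The helper name fork_exec is made up for illustration, and the child command is a throwaway one-liner.

```python
import os
import sys


def fork_exec(argv):
    """Run argv in a new process using the two-step fork-exec pattern."""
    pid = os.fork()  # step 1: duplicate this process
    if pid == 0:
        # Child: still a copy of the parent, until execv() replaces it.
        try:
            os.execv(argv[0], argv)  # step 2: wipe memory, run argv
        finally:
            os._exit(127)  # only reached if execv() failed
    # Parent: fork() returned the child's pid.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)  # Python 3.9+


exit_code = fork_exec([sys.executable, "-c", "import sys; sys.exit(3)"])
print(exit_code)
```

The child's Python interpreter starts from scratch after execv(), which is exactly the startup cost this chapter is trying to avoid.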

Surprisingly, clone() is fast: it costs less than 1 millisecond. You’d think copying all the RAM of a big process would be slow, but Linux takes a shortcut: it lets both processes reuse the same physical RAM; in the future, if Step or Renderer tries to write to that RAM, it’ll write to a copy instead. (This is called “copy-on-write.”)

The slow part (in Workbench) is execve() — because it starts Python and imports big modules.

Here’s a fun idea: don’t call execve(). Import the Python modules into Renderer itself and just clone():

Fast but unacceptable

This works (if Renderer is single-threaded). It’s fast. But we can’t use it: the Step process contains a copy of the Renderer process’s memory — including secrets and other users’ data.
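To make the problem concrete, here is a small sketch that uses os.fork() without any execve(). SECRET is a hypothetical stand-in for whatever Renderer holds in memory; the "Step" is just a few lines in the child branch.

```python
import os

SECRET = "renderer-database-password"  # hypothetical secret held by the parent


def run_step_without_exec():
    """Fork a 'Step' that never calls execve() and report what it can see."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(r)
            # "Untrusted" Step code: it can read anything the parent had in memory.
            os.write(w, SECRET.encode())
        finally:
            os._exit(0)
    os.close(w)
    leaked = os.read(r, 1024).decode()
    os.close(r)
    os.waitpid(pid, 0)
    return leaked


leaked = run_step_without_exec()
print(leaked)
```

The child starts instantly because every module is already imported, and it prints the parent's secret for the same reason: it is a copy of the parent.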

We must cram an execve() into this design. That’s the only way to ensure Renderer’s secrets aren’t copied into a Step process.

This is where Pyspawner fits in. Pyspawner calls execve() just once:

This is essentially Python’s “multiprocessing.forkserver” design

When Renderer launches, create a child process with subprocess.Popen(). (Remember: this calls execve(); so the subprocess has none of Renderer’s secrets.) Let’s call the subprocess “spawner”. Spawner is a Python script that imports Pandas, Numpy and Pyarrow (costing 1–2 seconds) and then … waits. It never exits.
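The overall shape can be sketched as follows. This is not Pyspawner's code: the spawner script is a toy, it imports stdlib modules as stand-ins for Pandas, Numpy and Pyarrow, it talks over stdin/stdout pipes instead of a socket, and it uses os.fork() and waits for each child itself.

```python
import subprocess
import sys

# Toy spawner script (hypothetical). It pays the import cost once, then
# forks a fresh child for every line it receives.
SPAWNER_SOURCE = """
import os, sys
import json, decimal  # stand-ins for pandas, numpy, pyarrow
sys.stdout.write("ready\\n")
sys.stdout.flush()
for line in sys.stdin:  # one command per line; exit on EOF
    pid = os.fork()  # a clone()-based spawner would call clone() here
    if pid == 0:
        # Child: modules are already imported; none of Renderer's memory is here.
        os.write(1, ("step ran: " + line).encode())
        os._exit(0)
    os.waitpid(pid, 0)
"""

# Popen() calls execve(), so the spawner starts with a clean memory image.
spawner = subprocess.Popen(
    [sys.executable, "-c", SPAWNER_SOURCE],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)
ready = spawner.stdout.readline().strip()

spawner.stdin.write("hello\n")
spawner.stdin.flush()
reply = spawner.stdout.readline().strip()

spawner.stdin.close()  # EOF ends the toy spawner's loop
spawner.stdout.close()
spawner_exit = spawner.wait()
print(ready, reply, spawner_exit)
```

Unlike the real Spawner, this toy exits when its input closes, which keeps the example from leaving a process behind.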

With Spawner in place, here’s how Renderer can launch a process:

  1. Renderer signals Spawner to invoke clone().

Python’s multiprocessing.forkserver module implements the same idea, but it uses os.fork(). We at Workbench want clone().

What’s so great about clone()?

The first great feature is CLONE_PARENT. It makes our code simpler. To understand how, you need to understand why os.fork() makes Forkserver complex.

It has to do with process hierarchy. In Linux, there’s a special relationship between a process and its direct parent. The parent can and must waitpid() for its child process. And the parent receives a SIGCHLD signal when its child exits.
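A quick way to see the parent-child rule in action: waitpid() succeeds for a direct child and fails with ECHILD for anything else. The helper can_wait_for is made up for this sketch; pid 1 is used only because it always exists and is never our child.

```python
import os


def can_wait_for(pid):
    """Return True if waitpid() accepts pid, i.e. pid is our direct child."""
    try:
        os.waitpid(pid, os.WNOHANG)
        return True
    except ChildProcessError:  # errno ECHILD: not a child of this process
        return False


# pid 1 (init) exists on every Linux system but is never our child.
init_is_waitable = can_wait_for(1)

pid = os.fork()
if pid == 0:
    os._exit(7)  # child: exit immediately with a recognizable code

# Only the direct parent may collect the child's exit status.
_, status = os.waitpid(pid, 0)
child_exit_code = os.waitstatus_to_exitcode(status)  # Python 3.9+

# Once reaped, the pid no longer refers to a child of ours.
already_reaped = not can_wait_for(pid)
print(init_is_waitable, child_exit_code, already_reaped)
```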

Python’s Forkserver uses os.fork(), so each Step is a child of Forkserver, not of Renderer. Renderer can’t waitpid() for the Step to exit, and it won’t receive SIGCHLD signals from Steps. Instead, Renderer must request that Forkserver act in its stead; and Forkserver needs to concurrently handle events from Renderer and signals from Steps. It’s complicated.

There’s an easier way. Pyspawner uses Linux’s CLONE_PARENT flag when calling clone(). The flag makes Renderer the direct parent of Step:

clone(CLONE_PARENT) makes Step a child of Renderer

Now, Spawner can ignore all the Steps it creates: they aren’t its children. Renderer can manage Steps directly: they are its children. Simpler Spawner; simpler Renderer. Win-win.

The second great feature: clone() can build Linux containers. Linux containers are potent sandboxing tools.

Containers? I’m getting ahead of myself. First, I’ll mention the tricky details we encountered while building our clone()-based solution.

The ugliest detail is the clone() call itself, which looks like an ASCII artist’s depiction of vomit:

Calling clone() from Python
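The original listing is not reproduced here. What follows is a rough sketch of how such a call can be made through ctypes, not Pyspawner's actual code. It assumes Linux with glibc on x86-64 (where the stack grows downward); the names clone_as_sibling and STACK_SIZE and the 8 MiB stack are assumptions. Only the function definition runs at import time; nothing is cloned until you call it.

```python
import ctypes
import os
import signal

CLONE_PARENT = 0x00008000  # value from <linux/sched.h>
STACK_SIZE = 8 * 1024 * 1024  # assumption: 8 MiB is enough for the child

# C signature of the child entry point: int (*fn)(void *)
CHILD_FUNC = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_void_p)

# Keep these alive for the life of the process: the child runs on this stack.
_child_stack = ctypes.create_string_buffer(STACK_SIZE)
_child_callback = None


def clone_as_sibling(child_main):
    """Call clone(CLONE_PARENT) and run child_main() in the new process.

    Returns the child's pid. The child's parent is *our* parent, so only
    that process can waitpid() for it.
    """
    global _child_callback
    libc = ctypes.CDLL(None, use_errno=True)  # symbols of the running process
    libc.clone.restype = ctypes.c_int
    libc.clone.argtypes = [CHILD_FUNC, ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p]

    def run(_arg):
        exit_code = 1
        try:
            child_main()
            exit_code = 0
        finally:
            os._exit(exit_code)  # never return into the copied interpreter

    _child_callback = CHILD_FUNC(run)
    # Pass the *top* of the buffer, aligned down to 16 bytes.
    stack_top = (ctypes.addressof(_child_stack) + STACK_SIZE) & ~0xF
    pid = libc.clone(_child_callback, stack_top, CLONE_PARENT | signal.SIGCHLD, None)
    if pid == -1:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))
    return pid
```

Because the flags omit CLONE_VM, the child gets a copy-on-write copy of the caller's memory, just as with fork(). The caller cannot waitpid() for the returned pid; its own parent has to.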

And then comes shared data. clone(), unlike fork(), jumps to a new function. So the best way to pass data from Pyspawner to each spawned Step is global variables. For instance, here’s how we give each child a fresh stdin, stdout and stderr:
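The original listing is missing here, so this is a stand-in sketch under stated assumptions: it uses os.fork() in place of clone() so it runs anywhere, and the global names (_child_stdin_r and friends) are invented. The idea it illustrates is the one described above: the parent stashes pipe ends in globals, and the child entry point, which takes no arguments, reads them and rewires fds 0, 1 and 2.

```python
import os

# Set by the parent just before spawning; read by the child entry point.
_child_stdin_r = _child_stdout_w = _child_stderr_w = -1


def child_main():
    """Child entry point: takes no arguments, reads its inputs from globals."""
    os.dup2(_child_stdin_r, 0)   # fresh stdin
    os.dup2(_child_stdout_w, 1)  # fresh stdout
    os.dup2(_child_stderr_w, 2)  # fresh stderr
    data = os.read(0, 1024)
    os.write(1, b"out:" + data)
    os.write(2, b"err:" + data)


def spawn_with_fresh_stdio():
    global _child_stdin_r, _child_stdout_w, _child_stderr_w
    stdin_r, stdin_w = os.pipe()
    stdout_r, stdout_w = os.pipe()
    stderr_r, stderr_w = os.pipe()
    _child_stdin_r, _child_stdout_w, _child_stderr_w = stdin_r, stdout_w, stderr_w
    pid = os.fork()  # stand-in for clone()
    if pid == 0:
        try:
            for fd in (stdin_w, stdout_r, stderr_r):
                os.close(fd)  # the child keeps only its own ends
            child_main()
        finally:
            os._exit(0)
    for fd in (stdin_r, stdout_w, stderr_w):
        os.close(fd)  # the parent keeps only the opposite ends
    return pid, stdin_w, stdout_r, stderr_r


pid, stdin_w, stdout_r, stderr_r = spawn_with_fresh_stdio()
os.write(stdin_w, b"hello")
os.close(stdin_w)
step_stdout = os.read(stdout_r, 1024)
step_stderr = os.read(stderr_r, 1024)
os.close(stdout_r)
os.close(stderr_r)
os.waitpid(pid, 0)
print(step_stdout, step_stderr)
```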

The second icky bit is … Python. Python’s os.fork() invokes some Python-internal C functions — functions we can’t invoke directly. The internal C functions close some open file descriptors; fiddle with thread-local variables; cancel signal handlers; and call custom pre-fork handlers registered by Python libraries.
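At the Python level, libraries register such handlers with os.register_at_fork(). The sketch below shows them firing around os.fork(). A raw clone() call made through ctypes bypasses os.fork() entirely, so none of these hooks would run. (C libraries can register similar hooks of their own; this example only covers the Python-level API.)

```python
import os

fork_events = []

# Handlers stay registered for the life of the process.
os.register_at_fork(
    before=lambda: fork_events.append("before"),
    after_in_parent=lambda: fork_events.append("after_in_parent"),
    after_in_child=lambda: fork_events.append("after_in_child"),
)

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    try:
        # Child: report which handlers ran on this side of the fork.
        os.write(w, ",".join(fork_events).encode())
    finally:
        os._exit(0)
os.close(w)
child_events = os.read(r, 1024).decode()
os.close(r)
os.waitpid(pid, 0)
parent_events = list(fork_events)
print(child_events, parent_events)
```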

Most of this doesn’t matter. Pyspawner doesn’t install signal handlers, and it’s careful with file descriptors. Pyspawner is simple.

But some imported code might depend on these Python-internal functions. In our case, we learned import numpy invokes some threading code! Numpy installs a custom pre-fork handler so Python’s os.fork() can circumvent its threading code; but we use clone(), so we can’t easily call Numpy’s pre-fork handler. The result is Undefined Behavior. In our case, everything seemed fine until we invoked from within a Step. The Step froze.

Our workaround: wrestle with Numpy. Workbench launches Spawner with the environment variable OPENBLAS_NUM_THREADS="1". This prevents Numpy from fiddling with threading; so Numpy doesn’t install a pre-fork handler, so Python-internal os.fork()’s C functions are moot; so we can ignore the C functions and call clone().
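Setting the variable is a one-line change where the subprocess is launched. A minimal sketch, with a throwaway child command that just echoes the variable back (the real Spawner would import Numpy after this point):

```python
import os
import subprocess
import sys

# Copy the current environment and pin OpenBLAS to a single thread.
env = dict(os.environ, OPENBLAS_NUM_THREADS="1")

result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['OPENBLAS_NUM_THREADS'])"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
child_setting = result.stdout.strip()
print(child_setting)
```

The variable has to be in place before the child imports Numpy, which is why it is set in the launcher rather than inside the spawner script.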

And finally, we needed to build a communication protocol between Renderer and Spawner.

Our communication channel uses AF_UNIX sockets, so we can pass stdin, stdout and stderr file descriptors.

Unlike Python’s multiprocessing.forkserver, we avoid a named socket. We pass a file descriptor from Renderer as both a command-line argument and an open file descriptor: subprocess.Popen([..., str(spawner_sock.fileno())], ..., pass_fds=(spawner_sock.fileno(),)). We open the file descriptor in Pyspawner using sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM, fileno=socket_fd).
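Here is a self-contained sketch of that handoff. The child script is a toy that sends one message and exits; the variable names follow the snippets above, and socket.socketpair() is an assumption about how the two connected sockets are created.

```python
import socket
import subprocess
import sys

# Toy child (hypothetical): rebuild the socket from the inherited fd number.
CHILD_SOURCE = """
import socket, sys
socket_fd = int(sys.argv[1])
sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM, fileno=socket_fd)
sock.sendall(b"hello from spawner")
sock.close()
"""

# An unnamed, connected pair of AF_UNIX sockets: nothing appears on the filesystem.
renderer_sock, spawner_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

child = subprocess.Popen(
    [sys.executable, "-c", CHILD_SOURCE, str(spawner_sock.fileno())],
    pass_fds=(spawner_sock.fileno(),),  # keep this fd open across execve()
)
spawner_sock.close()  # the child holds its own copy now

message = renderer_sock.recv(1024)
child.wait()
renderer_sock.close()
print(message)
```

pass_fds matters because Python marks new file descriptors non-inheritable by default; without it, the fd number on the command line would point at nothing in the child.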

Spawner can’t send stdin, stdout and stderr file descriptors as mere integers: those integers would be meaningless to Renderer. So Pyspawner sends them using SCM_RIGHTS. Luckily, Python’s standard library has helper functions. Spawner sends with multiprocessing.reduction.sendfds(sock, [stdin_w, stdout_r, stderr_r]) and Renderer receives with stdin_w, stdout_r, stderr_r = multiprocessing.reduction.recvfds(sock, 3).
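A small sketch of the same two calls, with a forked child playing the role of Spawner. It sends a single pipe end rather than three, and uses os.fork() only to get a second process quickly.

```python
import os
import socket
from multiprocessing import reduction

renderer_sock, spawner_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

pid = os.fork()  # the child plays Spawner
if pid == 0:
    try:
        renderer_sock.close()
        pipe_r, pipe_w = os.pipe()
        # Send the read end via SCM_RIGHTS: the kernel installs a new fd
        # in the receiver that refers to the same pipe.
        reduction.sendfds(spawner_sock, [pipe_r])
        os.write(pipe_w, b"step output")
        os.close(pipe_w)
    finally:
        os._exit(0)

spawner_sock.close()
(received_fd,) = reduction.recvfds(renderer_sock, 1)
received = os.read(received_fd, 1024)
os.close(received_fd)
renderer_sock.close()
os.waitpid(pid, 0)
print(received)
```

The integer in received_fd will usually differ from the sender's pipe_r. That is the point: the number is local to each process, while the underlying pipe is shared.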

Now, with a great sigh of relief, we deploy and congratulate ourselves. Our Steps start quickly, and each Step gets its own process.

Chapter 3 explains how we use Linux Containers to restrict Step code’s abilities.

Journalist, ex software engineer