Sandboxing data crunches, Chapter 2: clone processes
This post, second in a series, offers a Python solution to a Python problem. The broad principles should apply to any single-threaded language on Linux.
In Chapter 1, we began sandboxing. Workbench’s “Renderer” runs a series of Pandas-powered “Steps”. We give each Step a new process, so Steps can’t control Renderer or other Steps.
Workbench Steps’ environment — Python, Pandas, Numpy, Pyarrow — takes 1–2 seconds to start. But Renderer needs to launch several Steps per second, per CPU. We must optimize.
We can’t “recycle” Step processes after they run a Step’s untrusted code. A “bad” Step could, say, rewrite Python’s open() built-in function to email our users’ data to the Step author. Once untrusted code has run, its Step process must die.
And logically, if each Step results in a process dying, each Step must involve a process being created. How can we quickly create a process that costs 2 seconds to start?
Let’s see why subprocess.Popen() is slow. It implements the fork-exec UNIX pattern, which involves two system calls:
- clone() — duplicates all the RAM of the current process. Now there are two processes: the Renderer process and its near-exact clone, the Step process. The Renderer process continues; in it, clone() returns the Step process ID (“pid”). At the same time, the Step process continues at the exact same place; but there, clone() returns 0. Aside from that single integer difference, the two processes are identical.
- The process that judges itself to be a child (because clone() returned 0) calls execve("/path/to/program", …). This erases all the unwanted — and sensitive! — Renderer data from memory and starts the Step program.
(Linux implements the UNIX fork() function with its own clone() system call. The pattern is known as “fork-exec”; on Linux, it boils down to “clone-exec.”)
Fork-exec isn’t unique to Python: it’s how UNIX works.
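Here is the pattern in miniature — a stdlib-only sketch, not Workbench’s code, with /bin/sh standing in for the Step program:

```python
import os

pid = os.fork()  # the clone() step: now there are two identical processes
if pid == 0:
    # Child: clone()/fork() returned 0 here. Erase this process's memory
    # and start a new program (the execve() step).
    os.execv("/bin/sh", ["sh", "-c", "exit 7"])
# Parent: fork() returned the child's pid; wait for the child to finish.
_, status = os.waitpid(pid, 0)
exit_code = os.waitstatus_to_exitcode(status)  # 7, from the child program
```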
clone() is fast — it costs less than 1 millisecond. You’d think copying all the RAM of a big process would be slow, but Linux takes a shortcut: it lets both processes reuse the same physical RAM; in the future, if Step or Renderer tries to write to that RAM, it will write to a copy instead. (This is called “copy-on-write.”)
The slow part (in Workbench) is execve() — because it starts Python and imports big modules.
Here’s a fun idea: don’t call execve(). Import the Python modules into Renderer itself and just clone(). This works (if Renderer is single-threaded). It’s fast. But we can’t use it: the Step process contains a copy of the Renderer process’s memory — including secrets and other users’ data.
We must cram an execve() into this design. That’s the only way to ensure Renderer’s secrets aren’t copied into a Step process.
This is where Pyspawner fits in. Pyspawner calls execve() just once:
When Renderer launches, it creates a child process with subprocess.Popen(). (Remember: this calls execve(); so the subprocess has none of Renderer’s secrets.) Let’s call the subprocess “Spawner”. Spawner is a Python script that imports Pandas, Numpy and Pyarrow (costing 1–2 seconds) and then … waits. It never exits.
With Spawner in place, here’s how Renderer can launch a process:
- Renderer signals Spawner to invoke clone().
- Spawner clones itself to create a Step process, costing less than 1 millisecond. (Since Step is a clone of Spawner, its memory doesn’t contain secrets or user data.)
- The Step process stops behaving like the Spawner it cloned; it begins sandboxing instead. See — this Step process started instantly!
- Meanwhile, Spawner returns a subprocess handle to Renderer and then awaits Renderer’s next request. Spawner can do this all day….
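The whole dance can be sketched with stdlib pieces — using os.fork() as a stand-in for clone(), and a socketpair as the Renderer–Spawner channel. All the names here are ours, not Pyspawner’s:

```python
import os
import socket

renderer_sock, spawner_sock = socket.socketpair()

spawner_pid = os.fork()
if spawner_pid == 0:
    # Spawner process: the heavy imports would happen here, exactly once.
    renderer_sock.close()  # keep only our own end of the channel
    while True:
        request = spawner_sock.recv(1)
        if not request:  # Renderer hung up: shut down
            os._exit(0)
        step_pid = os.fork()  # stand-in for the <1 ms clone()
        if step_pid == 0:
            os._exit(0)  # a real Step would sandbox itself, then run user code
        spawner_sock.sendall(step_pid.to_bytes(4, "little"))

# Renderer process: ask Spawner for a Step; get the new pid back instantly.
spawner_sock.close()
renderer_sock.sendall(b"x")
step_pid = int.from_bytes(renderer_sock.recv(4), "little")
renderer_sock.close()  # Spawner sees EOF and exits
os.waitpid(spawner_pid, 0)
```

Note that in this plain-fork version the Step is Spawner’s child, not Renderer’s — exactly the hierarchy problem discussed below.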
Python’s multiprocessing.forkserver module implements the same idea, but it uses os.fork(). We at Workbench want clone().

What’s so great about clone()?

The first great feature is CLONE_PARENT. It makes our code simpler. To understand how, you need to understand why os.fork() makes Forkserver complex.
It has to do with process hierarchy. In Linux, there’s a special relationship between a process and its direct parent. The parent can and must waitpid() for its child process. And the parent receives a SIGCHLD signal when its child exits.
Python’s Forkserver uses os.fork(), so each Step’s direct parent is Forkserver — not Renderer. Renderer can’t waitpid() for the Step to exit, and it won’t receive SIGCHLD signals from Steps. Instead, Renderer must request that Forkserver act in its stead; and Forkserver needs to concurrently handle events from Renderer and signals from Steps. It’s complicated.
There’s an easier way. Pyspawner uses Linux’s CLONE_PARENT flag when calling clone(). The flag makes Renderer the direct parent of each Step.
Now, Spawner can ignore all the Steps it creates: they aren’t its children. Renderer can manage Steps directly: they are its children. Simpler Spawner; simpler Renderer. Win-win.
The second great feature: clone() can build Linux containers. Linux containers are potent sandboxing tools.
Containers? I’m getting ahead of myself. First, I’ll mention the tricky details we encountered while building our clone()-based Spawner.
The ugliest detail is the clone() call itself, which looks like an ASCII artist’s depiction of vomit:
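A sketch of what that call looks like through ctypes, assuming glibc on Linux — an illustration of the technique, not Pyspawner’s exact code:

```python
import ctypes
import os
import signal

libc = ctypes.CDLL(None, use_errno=True)
CLONE_PARENT = 0x00008000  # constant from <sched.h>

# glibc's clone() wants an int (*fn)(void *) to run in the child...
_CHILD_FUNC = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_void_p)

@_CHILD_FUNC
def _child_main(_arg):
    # The child starts life here, on the stack we allocate below.
    os._exit(0)  # a real Step would sandbox itself, then run user code

# ...and wants us to hand the child its own stack (the top, since
# stacks grow downward).
_STACK_SIZE = 2 * 1024 * 1024
_stack = ctypes.create_string_buffer(_STACK_SIZE)
_stack_top = ctypes.c_void_p(
    ctypes.cast(_stack, ctypes.c_void_p).value + _STACK_SIZE
)

# CLONE_PARENT: the new process's parent is *our* parent (Renderer, not
# Spawner). SIGCHLD: that parent gets a signal when the child exits.
child_pid = libc.clone(
    _child_main, _stack_top, CLONE_PARENT | signal.SIGCHLD, None
)
if child_pid < 0:
    raise OSError(ctypes.get_errno(), "clone() failed")
```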
And then comes shared data. Our child process, unlike one created with fork(), jumps to a new function. So the best way to pass data from Pyspawner to each spawned Step is global variables. For instance, here’s how we give each child a fresh stdin, stdout and stderr:
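A minimal sketch of that global-variable handoff, using os.fork() in place of clone(); all the variable names here are ours:

```python
import os

# Globals the parent sets just before cloning; the child-side function
# reads them. With clone(), the child jumps straight into _child_main().
_stdin_r = _stdout_w = _stderr_w = None

def _child_main() -> None:
    # Overwrite fds 0-2 so the child gets a fresh stdin/stdout/stderr.
    os.dup2(_stdin_r, 0)
    os.dup2(_stdout_w, 1)
    os.dup2(_stderr_w, 2)
    os.write(1, b"hello from the step\n")  # lands in the parent's pipe
    os._exit(0)

# Parent: build the pipes and stash the child's ends in the globals.
stdin_r, stdin_w = os.pipe()
stdout_r, stdout_w = os.pipe()
stderr_r, stderr_w = os.pipe()
_stdin_r, _stdout_w, _stderr_w = stdin_r, stdout_w, stderr_w

pid = os.fork()  # stand-in for the clone() call
if pid == 0:
    _child_main()  # never returns

# Parent: close the child's ends, then read the child's stdout.
for fd in (stdin_r, stdout_w, stderr_w):
    os.close(fd)
os.waitpid(pid, 0)
greeting = os.read(stdout_r, 1024)
```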
The second icky bit is … Python. Python’s os.fork() invokes some Python-internal C functions — functions we can’t invoke directly. The internal C functions close some open file descriptors; fiddle with thread-local variables; cancel signal handlers; and call custom pre-fork handlers registered by Python libraries.
Most of this doesn’t matter. Pyspawner doesn’t install signal handlers, and it’s careful with file descriptors. Pyspawner is simple.
But some imported code might depend on these Python-internal functions. In our case, we learned that import numpy invokes some threading code! Numpy installs a custom pre-fork handler so Python’s os.fork() can circumvent its threading code; but we use clone(), so we can’t easily call Numpy’s pre-fork handler. The result is Undefined Behavior. For us, everything seemed fine until we invoked subprocess.run() from within a Step. The Step froze.
Our workaround: wrestle with Numpy. Workbench launches Spawner with the environment variable OPENBLAS_NUM_THREADS="1". This prevents Numpy from fiddling with threading; so Numpy doesn’t install a pre-fork handler, so Python-internal os.fork()’s C functions are moot; so we can ignore the C functions and call clone() directly.
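Passing that variable down looks something like this (the child here is a trivial stand-in; Workbench applies the env to its Spawner process):

```python
import os
import subprocess
import sys

# Launch a child process with OpenBLAS pinned to a single thread.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import os; print(os.environ['OPENBLAS_NUM_THREADS'])"],
    env=dict(os.environ, OPENBLAS_NUM_THREADS="1"),
    stdout=subprocess.PIPE,
)
out, _ = child.communicate()
```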
And finally, we needed to build a communication protocol between Renderer and Spawner.
Our communication channel uses AF_UNIX sockets, so we can pass stdin, stdout and stderr file descriptors.
Unlike multiprocessing.forkserver, we avoid a named socket. We pass a file descriptor from Renderer as both a command-line argument and an open file descriptor: subprocess.Popen([..., str(spawner_sock.fileno())], ..., pass_fds=(spawner_sock.fileno(),)). We open the file descriptor in Pyspawner using sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM, fileno=socket_fd).
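A runnable sketch of that handoff — the inline child script is a stand-in for Pyspawner’s entry point:

```python
import socket
import subprocess
import sys

# Renderer side: an unnamed AF_UNIX socket pair -- no socket file on disk.
renderer_sock, spawner_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Stand-in spawner: rebuild the socket from the inherited fd (argv[1]).
spawner_code = """
import socket, sys
fd = int(sys.argv[1])
sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM, fileno=fd)
sock.sendall(b"ready")
"""

proc = subprocess.Popen(
    [sys.executable, "-c", spawner_code, str(spawner_sock.fileno())],
    pass_fds=(spawner_sock.fileno(),),  # keep the fd open across execve()
)
spawner_sock.close()  # the child owns its copy now
banner = renderer_sock.recv(5)
proc.wait()
```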
Spawner can’t send stdin, stdout and stderr file descriptors as mere integers: those integers would be meaningless to Renderer. So Pyspawner sends them using SCM_RIGHTS. Luckily, Python’s standard library has helper functions. Spawner sends with multiprocessing.reduction.sendfds(sock, [stdin_w, stdout_r, stderr_r]) and Renderer receives with stdin_w, stdout_r, stderr_r = multiprocessing.reduction.recvfds(sock, 3).
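Those helpers in action — a self-contained sketch that passes a pipe’s two fds instead of the three stdio fds:

```python
import multiprocessing.reduction
import os
import socket

spawner_sock, renderer_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
pipe_r, pipe_w = os.pipe()

# "Spawner" sends the raw descriptors with SCM_RIGHTS...
multiprocessing.reduction.sendfds(spawner_sock, [pipe_r, pipe_w])
# ..."Renderer" receives kernel-translated duplicates of them.
new_r, new_w = multiprocessing.reduction.recvfds(renderer_sock, 2)

# The duplicates really point at the same pipe:
os.write(new_w, b"hi")
data = os.read(new_r, 2)
```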
Now, with a great sigh of relief, we deploy and congratulate ourselves. Our Steps start quickly, and each Step gets its own process.
Chapter 3 explains how we use Linux Containers to restrict Step code’s abilities.