Sandboxing data crunches, Chapter 2: clone processes
This post, second in a series, offers a Python solution to a Python problem. The broad principles should apply to any single-threaded language on Linux.
In Chapter 1, we began sandboxing. Workbench’s “Renderer” runs a series of Pandas-powered “Steps”. We give each Step a new process, so Steps can’t control Renderer or other Steps.
Workbench Steps’ environment — Python, Pandas, Numpy, Pyarrow — takes 1–2 seconds to start. But Renderer needs to launch several Steps per second, per CPU. We must optimize.
We can’t “recycle” Step processes after they run a Step’s untrusted code. A “bad” Step could, say, rewrite Python’s open() built-in function to email our users’ data to the Step author. Once untrusted code has run, its Step process must die.
And logically, if each Step results in a process dying, each Step must involve a process being created. How can we quickly create a process that costs 2 seconds to start?
Let’s see why subprocess.Popen() is slow. It implements the fork-exec UNIX pattern, which actually involves two system calls:
- clone(): duplicate all the RAM of the current process. Now there are two processes: the Renderer process and its near-exact clone, the Step process. The Renderer process continues; in it, clone() returns the Step process ID (“pid”). At the same time, the Step process continues at the exact same place; but there, clone() returns 0. Aside from that single integer difference, the two processes are identical.
- The process that judges itself to be a child (because clone() returned 0) calls execve("/path/to/program", …). This erases all the unwanted (and sensitive!) Renderer data from memory and starts the Step program.
(Linux implements the UNIX fork() function with its own clone() system call. The pattern is known as “fork-exec”; and on Linux, it boils down to “clone-exec.”)
Fork-exec isn’t unique to Python: it’s how UNIX works.
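The two steps can be sketched in a few lines of Python. This is an illustration of the pattern, not the real subprocess.Popen() implementation: the helper name fork_exec is made up for this post, and os.fork() stands in for the underlying clone() call.

```python
import os
import sys


def fork_exec(argv):
    """Run argv in a new process using fork-exec; return its exit code."""
    pid = os.fork()  # on Linux, this boils down to clone()
    if pid == 0:
        # Child: fork() returned 0. Replace this process image with the
        # new program; the memory copied from the parent is discarded.
        try:
            os.execv(argv[0], argv)
        finally:
            os._exit(127)  # only reached if execv() failed
    # Parent: fork() returned the child's pid. Wait for the child to exit.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    # The exec'd program here is a fresh Python interpreter.
    print(fork_exec([sys.executable, "-c", "raise SystemExit(7)"]))
```

The child branch never returns to the caller: either execv() succeeds and the Python code is gone, or the child exits with an error status.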
Surprisingly, clone() is fast: it costs less than 1 millisecond. You’d think copying all the RAM of a big process would be slow, but Linux takes a shortcut: it lets both processes share the same physical RAM; if Step or Renderer later tries to write to that RAM, the write goes to a private copy instead. (This is called “copy-on-write.”)
The slow part (in Workbench) is execve(), because it starts Python and imports big modules.
Here’s a fun idea: don’t call execve(). Import the Python modules into Renderer itself and just clone().
This works (if Renderer is single-threaded). It’s fast. But we can’t use it: the Step process contains a copy of the Renderer process’s memory — including secrets and other users’ data.
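To see the leak concretely, here is a small sketch. SECRET and read_secret_in_child are hypothetical names, and os.fork() again stands in for clone().

```python
import os

SECRET = b"renderer-database-password"  # stands in for Renderer's secrets


def read_secret_in_child():
    """Fork without exec, and return what the child was able to see."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: no execve() happened, so every Python object the parent
        # held is still in memory here, secrets included.
        os.close(read_fd)
        os.write(write_fd, SECRET)
        os._exit(0)
    os.close(write_fd)
    leaked = os.read(read_fd, 1024)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return leaked
```

Untrusted Step code running in that child could do the same thing with any object Renderer had loaded.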
We must cram an execve() into this design. That’s the only way to ensure Renderer’s secrets aren’t copied into a Step process.
This is where Pyspawner fits in. Pyspawner calls execve() just once:
When Renderer launches, it creates a child process with subprocess.Popen(). (Remember: this calls execve(); so the subprocess has none of Renderer’s secrets.) Let’s call the subprocess “Spawner”. Spawner is a Python script that imports Pandas, Numpy and Pyarrow (costing 1–2 seconds) and then … waits. It never exits.
With Spawner in place, here’s how Renderer can launch a process:
- Renderer signals Spawner to invoke clone().
- Spawner clones itself to create a Step process, costing less than 1 millisecond. (Since Step is a clone of Spawner, its memory doesn’t contain secrets or user data.)
- The Step process stops behaving like the Spawner it cloned; it begins sandboxing instead. See — this Step process started instantly!
- Meanwhile, Spawner returns a subprocess handle to Renderer and then awaits Renderer’s next request. Spawner can do this all day….
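The request loop in the list above might look roughly like this. It is a simplified sketch, not Pyspawner’s code: every function name is hypothetical, the one-byte request and four-byte reply are an invented protocol, and os.fork() is used both to start Spawner (the real one is started with subprocess.Popen(), precisely so it holds no Renderer memory) and to create Steps.

```python
import os
import socket
import struct


def spawner_loop(sock):
    """Create one Step per request byte until Renderer closes the socket."""
    while sock.recv(1):
        pid = os.fork()  # Pyspawner uses clone() here instead
        if pid == 0:
            # Step process: stop behaving like Spawner.
            sock.close()
            os._exit(0)  # a real Step would sandbox itself, then run user code
        os.waitpid(pid, 0)  # with fork(), Spawner is the parent and must reap
        sock.sendall(struct.pack("!I", pid))


def start_spawner():
    """Start Spawner; return (spawner_pid, socket for talking to it)."""
    renderer_sock, spawner_sock = socket.socketpair(socket.AF_UNIX)
    pid = os.fork()  # forked only to keep this sketch self-contained
    if pid == 0:
        renderer_sock.close()
        try:
            spawner_loop(spawner_sock)
        finally:
            os._exit(0)
    spawner_sock.close()
    return pid, renderer_sock


def spawn_step(renderer_sock):
    """Ask Spawner for a new Step; return the Step's pid."""
    renderer_sock.sendall(b"S")
    (step_pid,) = struct.unpack("!I", renderer_sock.recv(4, socket.MSG_WAITALL))
    return step_pid
```

Each call to spawn_step() costs one fork in an already-warm process, with no interpreter startup and no imports.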
Python’s multiprocessing.forkserver module implements the same idea, but it uses os.fork(). We at Workbench want clone().
What’s so great about clone()?
The first great feature is CLONE_PARENT. It makes our code simpler. To understand how, you need to understand why os.fork() makes Forkserver complex.
It has to do with process hierarchy. In Linux, there’s a special relationship between a process and its direct parent. The parent can and must waitpid() for its child process. And the parent receives a SIGCHLD signal when its child exits.
Python’s Forkserver uses os.fork(), so every Step is a child of Forkserver, not of Renderer. Renderer can’t waitpid() for the Step to exit, and it won’t receive SIGCHLD signals from Steps. Instead, Renderer must request that Forkserver act in its stead; and Forkserver needs to concurrently handle events from Renderer and signals from Steps. It’s complicated.
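The hierarchy rule is easy to check. In the sketch below (all names hypothetical), the calling process plays Renderer, a “middle” process plays Forkserver, and the middle process’s child plays a Step. Renderer’s attempt to wait for the Step is refused.

```python
import os


def non_parent_cannot_wait():
    """Return True if waitpid() on a grandchild raises ChildProcessError."""
    pid_r, pid_w = os.pipe()          # middle process -> us: the Step's pid
    release_r, release_w = os.pipe()  # us -> Step: "you may exit now"

    middle = os.fork()
    if middle == 0:
        # Middle process: plays the role of Forkserver.
        try:
            os.close(pid_r)
            step = os.fork()
            if step == 0:
                # Grandchild: plays a Step. Block until the pipe hits EOF.
                os.close(pid_w)
                os.close(release_w)
                os.read(release_r, 1)
                os._exit(0)
            os.close(release_r)
            os.close(release_w)
            os.write(pid_w, str(step).encode())
            os.close(pid_w)
            os.waitpid(step, 0)  # only the direct parent may do this
        finally:
            os._exit(0)

    # Calling process: plays the role of Renderer.
    os.close(pid_w)
    os.close(release_r)
    step_pid = int(os.read(pid_r, 32))
    os.close(pid_r)
    try:
        os.waitpid(step_pid, os.WNOHANG)
        refused = False
    except ChildProcessError:
        refused = True  # ECHILD: the Step is not our child
    os.close(release_w)  # EOF lets the Step exit
    os.waitpid(middle, 0)
    return refused
```

Everything Renderer wants to know about the Step’s exit has to be relayed by the middle process, which is the extra machinery Forkserver carries.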
There’s an easier way. Pyspawner uses Linux’s CLONE_PARENT flag when calling clone(). The flag makes Renderer, rather than Spawner, the direct parent of Step.
Now, Spawner can ignore all the Steps it creates: they aren’t its children. Renderer can manage Steps directly: they are its children. Simpler Spawner; simpler Renderer. Win-win.
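Here is a minimal sketch of the effect, under several assumptions. It is not Pyspawner’s code: it issues the raw clone system call through ctypes with a null stack, which makes the call behave like fork() (the new process resumes at the same point in the code). It assumes Linux on x86-64 or aarch64 and a single-threaded process, and the function names are hypothetical.

```python
import ctypes
import os
import platform
import signal

CLONE_PARENT = 0x00008000  # from <linux/sched.h>
# Syscall numbers differ per CPU architecture; only these two are covered.
SYS_CLONE = {"x86_64": 56, "aarch64": 220}[platform.machine()]

libc = ctypes.CDLL(None, use_errno=True)
libc.syscall.restype = ctypes.c_long
libc.syscall.argtypes = [ctypes.c_long] * 6


def clone_as_sibling():
    """Like os.fork(), but the new process is a child of OUR parent."""
    flags = CLONE_PARENT | int(signal.SIGCHLD)
    # Stack and the remaining arguments are 0: the new process keeps
    # running on a copy of our stack, exactly as after fork().
    pid = libc.syscall(SYS_CLONE, flags, 0, 0, 0, 0)
    if pid < 0:
        raise OSError(ctypes.get_errno(), "clone failed")
    return pid


def renderer_reaps_step():
    """Return the Step's exit code as seen by Renderer's own waitpid()."""
    pid_r, pid_w = os.pipe()
    spawner_pid = os.fork()  # the calling process plays Renderer
    if spawner_pid == 0:
        # Spawner: create a Step, report its pid, and forget about it.
        try:
            step_pid = clone_as_sibling()
            if step_pid == 0:
                os._exit(42)  # Step: its parent is Renderer, not Spawner
            os.write(pid_w, str(step_pid).encode())
        finally:
            os._exit(0)
    os.close(pid_w)
    step_pid = int(os.read(pid_r, 32))
    os.close(pid_r)
    _, status = os.waitpid(step_pid, 0)  # allowed: Step is our direct child
    os.waitpid(spawner_pid, 0)
    return os.WEXITSTATUS(status)
```

Spawner never calls waitpid() on the Step, yet Renderer can collect the Step’s exit code directly.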
The second great feature: clone() can build Linux containers. Linux containers are potent sandboxing tools.
Containers? I’m getting ahead of myself. First, I’ll mention the tricky details we encountered while building our clone()-based solution.
The ugliest detail is the clone() call itself, which looks like an ASCII artist’s depiction of vomit.
And then comes shared data. clone(), unlike fork(), jumps to a new function. So the best way to pass data from Pyspawner to each spawned Step is global variables. For instance, that’s how we give each child a fresh stdin, stdout and stderr.
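As a rough illustration of the “fresh stdin, stdout and stderr” part only, here is a sketch built on pipes. It is not Pyspawner’s code: it uses os.fork() for brevity, so the child sees the pipe ends as ordinary local variables rather than globals, and the function name is hypothetical.

```python
import os


def spawn_with_fresh_stdio(child_main):
    """Fork a child whose fds 0, 1 and 2 are brand-new pipes.

    Returns (pid, stdin_w, stdout_r, stderr_r) for the parent to use.
    """
    stdin_r, stdin_w = os.pipe()
    stdout_r, stdout_w = os.pipe()
    stderr_r, stderr_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            # Point the standard descriptors at the child's pipe ends...
            os.dup2(stdin_r, 0)
            os.dup2(stdout_w, 1)
            os.dup2(stderr_w, 2)
            # ...then drop the original pipe descriptors, skipping any
            # that already sit on 0, 1 or 2.
            for fd in (stdin_r, stdin_w, stdout_r, stdout_w, stderr_r, stderr_w):
                if fd > 2:
                    os.close(fd)
            child_main()
        finally:
            os._exit(0)
    # Parent: keep only its own ends of the three pipes.
    for fd in (stdin_r, stdout_w, stderr_w):
        os.close(fd)
    return pid, stdin_w, stdout_r, stderr_r
```

The parent keeps the write end of the child’s stdin and the read ends of its stdout and stderr, which are the three descriptors a caller needs to talk to the child.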
The second icky bit is … Python. Python’s os.fork() invokes some Python-internal C functions that we can’t invoke directly. The internal C functions close some open file descriptors; fiddle with thread-local variables; cancel signal handlers; and call custom pre-fork handlers registered by Python libraries.
Most of this doesn’t matter. Pyspawner doesn’t install signal handlers, and it’s careful with file descriptors. Pyspawner is simple.
But some imported code might depend on these Python-internal functions. In our case, we learned import numpy invokes some threading code! Numpy installs a custom pre-fork handler so Python’s os.fork() can circumvent its threading code; but we use clone(), so we can’t easily call Numpy’s pre-fork handler. The result is Undefined Behavior. In our case, everything seemed fine until we invoked subprocess.run() from within a Step. The Step froze.
Our workaround: wrestle with Numpy. Workbench launches Spawner with the environment variable OPENBLAS_NUM_THREADS="1". This prevents Numpy from fiddling with threading; so Numpy doesn’t install a pre-fork handler, so the Python-internal C functions behind os.fork() are moot; so we can ignore the C functions and call clone().
And finally, we needed to build a communication protocol between Renderer and Spawner.
Our communication channel uses AF_UNIX sockets, so we can pass stdin, stdout and stderr file descriptors.
Unlike Python’s multiprocessing.forkserver, we avoid a named socket. We pass a file descriptor from Renderer as both a command-line argument and an open file descriptor: subprocess.Popen([..., str(spawner_sock.fileno())], ..., pass_fds=(spawner_sock.fileno(),)). We open the file descriptor in Pyspawner using sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM, fileno=socket_fd).
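Here is a self-contained sketch of that handoff. The child script is a hypothetical stand-in for Spawner: it only rebuilds the socket from the fd number in its argv and sends one message.

```python
import socket
import subprocess
import sys

# Hypothetical stand-in for the Spawner script.
CHILD_SCRIPT = """
import socket, sys
sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM, fileno=int(sys.argv[1]))
sock.sendall(b"ready")
sock.close()
"""


def launch_child_with_socket():
    """Start a child that inherits one end of an unnamed AF_UNIX socket pair."""
    renderer_sock, spawner_sock = socket.socketpair(
        socket.AF_UNIX, socket.SOCK_STREAM
    )
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SCRIPT, str(spawner_sock.fileno())],
        pass_fds=(spawner_sock.fileno(),),  # keep this fd open across execve()
    )
    spawner_sock.close()  # the child holds its own copy now
    return child, renderer_sock
```

Because the socket pair has no name on the filesystem, only processes that inherit one of its descriptors can use it.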
Spawner can’t send stdin, stdout and stderr file descriptors as mere integers: those integers would be meaningless to Renderer. So Pyspawner sends them using SCM_RIGHTS. Luckily, Python’s standard library has helper functions. Spawner sends with multiprocessing.reduction.sendfds(sock, [stdin_w, stdout_r, stderr_r]) and Renderer receives with stdin_w, stdout_r, stderr_r = multiprocessing.reduction.recvfds(sock, 3).
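A single-process sketch shows what SCM_RIGHTS does with those helpers. It assumes Linux, and the function name is hypothetical. Both ends of the socket live in one process here only to keep the example short; across two processes the calls are the same.

```python
import multiprocessing.reduction
import os
import socket


def send_pipe_over_socket():
    """Send a pipe's read end across an AF_UNIX socket.

    Returns (got_new_fd_number, data_read_through_received_fd).
    """
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pipe_r, pipe_w = os.pipe()
    try:
        multiprocessing.reduction.sendfds(sender, [pipe_r])
        (received_fd,) = multiprocessing.reduction.recvfds(receiver, 1)
        try:
            # received_fd is a different number that refers to the same pipe.
            os.write(pipe_w, b"hello from Step")
            data = os.read(received_fd, 100)
        finally:
            os.close(received_fd)
        return received_fd != pipe_r, data
    finally:
        os.close(pipe_r)
        os.close(pipe_w)
        sender.close()
        receiver.close()
```

The kernel installs a fresh descriptor in the receiver, which is why sending the bare integer would not work: the number is only meaningful inside the process that owns it.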
Now, with a great sigh of relief, we deploy and congratulate ourselves. Our Steps start quickly, and each Step gets its own process.
Chapter 3 explains how we use Linux Containers to restrict Step code’s abilities.