When a user submits a Step, our “Renderer” program executes all the Steps in the Workflow. Most Steps are pre-built, but users may code their own using Python and Pandas.
The dilemma: how do we safely run untrusted code in our Renderer?
Our Renderer accesses our database and file store to do its job. And it runs Steps … but Steps must never gain that power.
Other web services run users’ code — such as Travis, Jupyter and Heroku. They allocate a virtual machine (VM) per job. VMs would be phenomenally expensive for Workbench. Some Workbench users schedule Step code to run at 5-minute intervals. VM startup and shutdown take minutes. A frequently-scheduled Step would effectively cost us a dedicated VM — on the order of $10 per month per workflow.
We’re sharing the crux of our solution as a Python library: pyspawner. Pyspawner helps a Python process launch secure Python environments.
This series isn’t exclusively about Python or pyspawner. It explains how to secure exec() calls in any programming language.
How can our trusted Renderer process run untrusted Step code?
Chapter 1: use a subprocess
Often, a simple approach is best. Here’s simplicity itself:
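A minimal sketch of that naive approach, assuming a hypothetical run_step() helper and a plain list standing in for a data table (neither is Workbench’s real API): the Renderer calls exec() on the user’s source inside its own process.

```python
def run_step(user_code: str, table):
    """Run a user's Step inside the Renderer's own process (hypothetical helper)."""
    # The user's code sees `table` as a global and may rebind it.
    scope = {"table": table}
    exec(user_code, scope)  # untrusted code now has every power the Renderer has
    return scope["table"]


result = run_step("table = [row * 2 for row in table]", [1, 2, 3])
```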
Never do this. It can’t restrict a Linux process’s memory usage. (We can restrict a process’s address space, though not its physical RAM usage, using setrlimit(). That’s no good for Workbench: our Steps open huge files with mmap(), so we need a huge address space.)
When a Linux system runs out of memory, the kernel’s out-of-memory killer kills a process. If our user’s Step consumes too much memory, the Linux kernel will kill the process running it. In a one-process design, that process is the Renderer itself.
How can we make our Renderer do something (like notify the user) after its process is killed? We can’t. We must expect a process may die … so there has to be a second process.
Here’s a two-process approach. The Renderer process lives forever. For each Step, it spawns a subprocess. Each Step subprocess dies after its (untrusted) code finishes.
It’s simple: now, each Step process will die. If it happens to die from out-of-memory, so be it: Renderer’s wait() will return -9. If it dies after exec() completes successfully, Renderer’s wait() will return 0. (All Linux processes return a number when they die.)
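The two-process design can be sketched with Python’s standard subprocess module. This is a minimal sketch, not pyspawner’s implementation: run_step_in_subprocess() is a hypothetical name, and the second child kills itself with SIGKILL to imitate the out-of-memory killer.

```python
import subprocess
import sys


def run_step_in_subprocess(user_code: str) -> int:
    """Run `user_code` in a fresh Python process and return its exit status.

    Hypothetical helper: a real Renderer would also pass inputs and enforce
    a timeout.
    """
    child = subprocess.Popen([sys.executable, "-c", user_code])
    # Popen.wait() returns the exit code, or -N if signal N killed the child.
    return child.wait()


ok_status = run_step_in_subprocess("x = 1 + 1")
# Imitate the OOM killer, which sends SIGKILL (signal 9 on Linux).
killed_status = run_step_in_subprocess(
    "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
)
```

Either way the Renderer survives, so it can inspect the status and notify the user.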
This new design is as simple as we can imagine; but even so, it forces an explosion of new code. We can’t pass Python objects between Renderer and Step. We need a new way to pass inputs and outputs.
The solutions are as old as UNIX itself: temporary files for “big” data and UNIX pipes for “small” data.
Workbench’s “big” data is in data tables that can consume gigabytes. We transmit them as Arrow files because Arrow format has a staggeringly small overhead. (Empowered by mmap(), Renderer and Step securely share the same physical RAM when they both open the same file.) One process writes the file to disk; the other process reads it from disk.
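The temp-file handoff can be sketched with the standard library alone. This sketch substitutes raw bytes for an Arrow file (an assumption, made to avoid a pyarrow dependency) and uses hypothetical names. It shows only the mechanism: one process writes a file, and the other maps it read-only with mmap().

```python
import mmap
import os
import subprocess
import sys
import tempfile

# Child program: map the file named in argv[1] and report its size on stdout.
CHILD_CODE = """
import mmap, sys
with open(sys.argv[1], "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        # Pages are served from the kernel's page cache as they are touched.
        sys.stdout.write(str(len(view)))
"""


def share_via_tempfile(payload: bytes) -> int:
    """Write `payload` to a temp file, let a child mmap it, return the size it saw."""
    with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
        f.write(payload)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_CODE, path],
            stdout=subprocess.PIPE,
            check=True,
        )
        return int(result.stdout)
    finally:
        os.unlink(path)  # the parent owns the temp file's lifetime


seen = share_via_tempfile(b"x" * 1_000_000)
```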
We transmit “small” data (filenames, metadata and parameters) in Thrift messages over standard input and output. One process serializes a message and writes bytes to its side of a pipe; the other process reads bytes from the other side of the pipe and deserializes them into a message.
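A rough sketch of the pipe mechanics follows. Loud assumption: JSON stands in for Thrift here purely to keep the example dependency-free, and the 4-byte length prefix, the field names (input_path, output_path) and call_step() are all hypothetical. The point is the shape of the exchange: write bytes to one end of a pipe, read and validate them at the other.

```python
import json
import struct
import subprocess
import sys

# Child program: read one framed request from stdin, write one framed reply.
CHILD_CODE = """
import json, struct, sys
stdin, stdout = sys.stdin.buffer, sys.stdout.buffer
(length,) = struct.unpack(">I", stdin.read(4))
request = json.loads(stdin.read(length))
reply = json.dumps({"output_path": request["input_path"] + ".out"}).encode()
stdout.write(struct.pack(">I", len(reply)) + reply)
stdout.flush()
"""


def write_message(stream, message: dict) -> None:
    """Frame a message as a 4-byte big-endian length followed by the payload."""
    payload = json.dumps(message).encode("utf-8")  # stand-in for Thrift encoding
    stream.write(struct.pack(">I", len(payload)) + payload)
    stream.flush()


def read_message(stream) -> dict:
    """Read one framed message; raise ValueError if it is malformed."""
    header = stream.read(4)
    if len(header) != 4:
        raise ValueError("truncated header")
    (length,) = struct.unpack(">I", header)
    payload = stream.read(length)
    if len(payload) != length:
        raise ValueError("truncated payload")
    message = json.loads(payload.decode("utf-8"))
    # The reply is untrusted: check its shape before acting on it.
    if not isinstance(message, dict) or not isinstance(message.get("output_path"), str):
        raise ValueError("unexpected message shape")
    return message


def call_step(request: dict) -> dict:
    """Send `request` to a child over stdin and return its validated reply."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    try:
        write_message(child.stdin, request)
        child.stdin.close()
        return read_message(child.stdout)
    finally:
        child.stdout.close()
        child.wait()


reply = call_step({"input_path": "/tmp/in.arrow"})
```

The hand-written shape check in read_message() is exactly the chore a schema-driven format takes off your hands.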
Here’s why we chose Thrift over obvious alternatives:
- Protobuf is a compelling format for encoding messages. Workbench happens to use Arrow which happens to support the Parquet file format which happens to encode metadata in Thrift format; so Thrift was already a Workbench dependency. (If we’d had a blank slate, I’d have leaned towards Protobuf because it’s more popular.)
- Python’s pickle module is not okay. Using pickle format, Step can output untrusted code that Renderer will execute — which defeats the purpose.
- JSON may seem appropriate for simple messages; but it’s costly as you grow. When Renderer parses Step’s output message, it must validate that (untrusted) message. You can hand-code JSON validation functions … but that’s onerous. You can use a library like JSON Schema … but then you’d need to write schemas. Thrift and Protobuf are made for this; JSON isn’t.
- You can pass data from Renderer to Step through command-line arguments. This is a one-way communication channel and it only passes bytestrings.
One layer of security
The subprocess is our fledgling “sandbox”. Let’s define “sandbox” in terms of what Steps can do:
- A “good” subprocess reads inputs from standard input and disk; writes valid output to standard output and disk; and exits with status code 0. Renderer acts upon its output.
- A “buggy” subprocess reads from standard input and — whoops — prints a stack trace to standard error and exits with a non-zero status code. Renderer acts upon this event.
- Another “buggy” subprocess writes nonsense to standard output. Renderer recovers. (Renderer must not allow injection here. Don’t use Python’s pickle module.)
- Another “buggy” subprocess never exits. Renderer kills it and declares, “timed out”.
- A “bad” subprocess does Something Else. (cue ominous tones: dun-dun-dunnn!)
We need to chip away at “Something Else” until it’s, “Nothing Else.” That’s sandboxing!
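The “good” and “buggy” cases above can be sketched as a single function on the Renderer side. This is a sketch under assumptions: classify_step() is a hypothetical name, and “valid output” is reduced to the literal bytes “ok”, where a real Renderer would validate a structured message.

```python
import subprocess
import sys


def classify_step(user_code: str, timeout: float = 10.0) -> str:
    """Run a Step subprocess and name which case it fell into (hypothetical)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", user_code],
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,  # run() kills the child when this expires
        )
    except subprocess.TimeoutExpired:
        return "timed out"
    if result.returncode != 0:
        return f"crashed (status {result.returncode})"
    if result.stdout != b"ok":
        # Never interpret unexpected output; just reject it.
        return "invalid output"
    return "good"


good = classify_step("print('ok', end='')")
crashed = classify_step("raise RuntimeError('whoops')")
nonsense = classify_step("print('garbage')")
hung = classify_step("import time; time.sleep(60)", timeout=0.5)
```

None of these branches covers the “bad” subprocess. That case needs confinement, not classification.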
Before the Step subprocess runs untrusted code, it should confine itself:
Use seccomp to filter system calls. One benefit is to prevent a “bad” Step from invoking something shady like mount(). But also, seccomp can help mitigate Spectre-style attacks. (Docker applies seccomp rules by default; Google Kubernetes Engine doesn’t.)
Audit your compute environment — CPUs, internal services, and (in our case) Kubernetes platform:
- On Kubernetes, disable Hyper-Threading so untrusted code can’t glean data from other processes.
- Secure internal services (like your database) with SSL and passwords. Untrusted code won’t know the password (if you sandbox properly).
- Ensure secret files aren’t readable to the sandbox user. (Kubernetes mounts secrets as world-readable by default. Fix those permissions!)
- Ensure the sandbox user can’t write files Renderer reads internally. (Imagine if untrusted code wrote a configuration file and then Renderer acted upon it….)
- Restrict cloud metadata URLs. From a Google Kubernetes Engine pod, http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-env exposes credentials that let anybody join the cluster and access all users’ data. Don’t let untrusted code access that URL. We enabled Workload Identity to prevent exposing this URL. (A firewall would inspire more confidence; we’ll build one in Chapter 3.)
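Part of this audit can be automated. Here is a minimal sketch of the file-permission checks, assuming a hypothetical find_world_accessible() helper; the temporary directory and file names stand in for wherever your secrets and Renderer-internal files actually live.

```python
import os
import stat
import tempfile
from typing import List


def find_world_accessible(root: str) -> List[str]:
    """Return files under `root` that "other" users can read or write."""
    risky = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode  # follows symlinks to the real file
            if mode & (stat.S_IROTH | stat.S_IWOTH):
                risky.append(path)
    return sorted(risky)


# Demo on a throwaway directory: one exposed file, one locked-down file.
with tempfile.TemporaryDirectory() as secrets_dir:
    for name, mode in (("db-password", 0o644), ("api-key", 0o600)):
        path = os.path.join(secrets_dir, name)
        with open(path, "w") as f:
            f.write("secret")
        os.chmod(path, mode)  # set explicitly so the umask doesn't matter
    found = [os.path.basename(p) for p in find_world_accessible(secrets_dir)]
```

A check like this only helps if the sandbox runs as a different user from the Renderer. It says nothing about the network items in the list.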
Looking good! With these protections in place, bad step code can’t access other users’ data. This is our first security layer.
It’s unbreakable … unless a bug in the Linux kernel lets a non-root process become root. We’ll address that concern with a second security layer: Linux containers.
But first, a hurdle. Workbench hit a Pandas-specific problem: subprocess startup is too darned slow. For a Python-specific problem, we need a Python-specific solution.
Stay tuned! I’m going to flesh out this series: