Sandboxing data crunches, chapter 3: containerize
This is the third and final post in a series. In Chapter 1, we sandboxed using subprocesses. In Chapter 2, we leveraged Linux’s clone() system call to create subprocesses quickly.
Workbench’s Renderer process runs one Step process after another. We sandbox each (untrusted) Step by running it as a non-root user, to deny it access to Renderer’s wealth of passwords and users’ data.
Most of the time, this will protect our users’ data. But not always.
Every so often, someone finds a Linux-kernel “privilege escalation” bug. A window of vulnerability opens: after the bug is found and before we patch, untrusted Step code can escape its sandbox.
We care about privacy. Our users’ data can’t be safe most of the time: it must be safe all the time.
Let’s add a second security layer. Whenever our first, “non-root” layer is vulnerable, we’ll be protected by the second layer. And whenever our second layer is vulnerable, we’ll be protected by the first layer. (If both layers are vulnerable at the same time, we’ll be in trouble. But think of all the big-ticket data breaches you’ve heard of. The common theme is a single point of failure.)
We chose Linux containers to provide that second layer. Containers prevent even root from accessing users’ data.
For all the hype behind “containers,” the term is confusing. It kind of means, “almost a virtual machine, but using the same Linux kernel as the host.”
Containers’ main mechanism is Linux namespaces.
Namespaces are boundaries around a process. Applied correctly, a namespaced process (or “containerized process”) cannot interact with any part of the system outside its own namespaces. This is at the system-call level: when your process asks the Linux kernel to do something, the Linux kernel handles your system call differently depending on the process’s namespaces.
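One way to see these boundaries on a running Linux system is to list the symlinks under /proc/self/ns: each one points at an identifier for a namespace the process belongs to. Here is a minimal sketch; the function name current_namespaces is hypothetical, not part of Workbench.

```python
import os


def current_namespaces(ns_dir: str = "/proc/self/ns") -> dict:
    """Map each namespace entry (e.g. "pid", "net") to its kernel identifier."""
    namespaces = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return namespaces  # not Linux, or /proc is not mounted
    for name in names:
        try:
            # Each entry is a symlink whose target looks like "pid:[4026531836]".
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry vanished or is unreadable; skip it
    return namespaces


if __name__ == "__main__":
    for name, ident in current_namespaces().items():
        print(f"{name:20} {ident}")
```

Two processes that report the same identifier for an entry share that namespace; a containerized process reports different identifiers from its parent.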
Some helpful programs out there (like LXC, Firejail and Docker’s runc) spin up containers without much fuss. Workbench can’t use them. They all call execve(). In Chapter 2, we abandoned that system call.
The good news is: we can containerize in-process. We just need to pass some flags to the clone() system call.
There are seven categories of namespace: CLONE_NEWCGROUP, CLONE_NEWIPC, CLONE_NEWNS, CLONE_NEWPID, CLONE_NEWUSER, CLONE_NEWUTS and CLONE_NEWNET. We use them all. Make the walls close in on our Step process!
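Each namespace flag is a single bit, and clone() accepts them OR-ed together in its flags argument. A minimal sketch of that arithmetic, with values copied from Linux’s <linux/sched.h>; the helper names combine_flags and describe_flags are hypothetical, and nothing here actually calls clone().

```python
# Flag values as defined in Linux's <linux/sched.h>.
NAMESPACE_FLAGS = {
    "CLONE_NEWNS": 0x00020000,      # mount namespace
    "CLONE_NEWCGROUP": 0x02000000,  # cgroup namespace
    "CLONE_NEWUTS": 0x04000000,     # hostname and domain name
    "CLONE_NEWIPC": 0x08000000,     # System V IPC, POSIX message queues
    "CLONE_NEWUSER": 0x10000000,    # user and group IDs
    "CLONE_NEWPID": 0x20000000,     # process IDs
    "CLONE_NEWNET": 0x40000000,     # network devices, routes, firewall
}


def combine_flags(names) -> int:
    """OR together the named flags into one clone() flags value."""
    flags = 0
    for name in names:
        flags |= NAMESPACE_FLAGS[name]
    return flags


def describe_flags(flags: int) -> list:
    """List (alphabetically) the namespace flags set in a flags value."""
    return sorted(n for n, bit in NAMESPACE_FLAGS.items() if flags & bit)


if __name__ == "__main__":
    everything = combine_flags(NAMESPACE_FLAGS)
    print(hex(everything), describe_flags(everything))
```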
CLONE_NEWUSER. This makes the Step process run as root … but not the real root! It’s a fake root that only exists for the Step process. As far as the Renderer process is concerned, “root” for the Step process is really UID 100000 — it’s effectively “nobody”, so it can’t access files that aren’t world-readable.
CLONE_NEWUSER is finicky: we need to write a “uid_map” and “gid_map” to assign UID=100000. Renderer writes those maps after it calls clone() to create a Step process. The Step process must wait for Renderer to write before it can proceed. We synchronize with a UNIX pipe, as per the
It’s a pain. But CLONE_NEWUSER is the gatekeeper: it grants enough privileges to create all the other namespaces. Let’s plough ahead.
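The handshake can be sketched like this. It is an illustration under loud assumptions: it uses os.fork() in place of clone(), so no user namespace is actually created, and the names format_id_map, write_id_maps and spawn_and_sync are hypothetical. write_id_maps() shows the files Renderer would write; it only succeeds for a child created with CLONE_NEWUSER and a sufficiently privileged parent, so the demo does not call it.

```python
import os


def format_id_map(inside_id: int, outside_id: int, count: int = 1) -> str:
    """One uid_map/gid_map line: "<ID inside> <ID outside> <range length>"."""
    return f"{inside_id} {outside_id} {count}\n"


def write_id_maps(child_pid: int, outside_id: int = 100000) -> None:
    """Map root (0) inside the child's user namespace to outside_id."""
    for filename in ("uid_map", "gid_map"):
        with open(f"/proc/{child_pid}/{filename}", "w") as f:
            f.write(format_id_map(0, outside_id))


def spawn_and_sync(parent_setup) -> int:
    """Fork a child that blocks until the parent has finished parent_setup."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()  # assumption: stands in for clone(..., CLONE_NEWUSER | ...)
    if pid == 0:
        os.close(write_fd)
        ready = os.read(read_fd, 1)  # blocks until the parent writes or closes
        os._exit(0 if ready == b"1" else 1)  # real code would run the Step here
    os.close(read_fd)
    try:
        parent_setup(pid)  # e.g. write_id_maps(pid)
        os.write(write_fd, b"1")  # tell the child it may proceed
    finally:
        os.close(write_fd)  # if setup failed, the child sees EOF and exits 1
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(spawn_and_sync(lambda pid: print("setting up child", pid)))
```

Closing the write end in a finally block means a failed setup unblocks the child with end-of-file instead of leaving it waiting forever.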
Many namespaces are fire-and-forget. We enable CLONE_NEWUTS and presto: the Step process can’t change our hostname, even if it somehow manages to become root.
One particularly clever namespace is CLONE_NEWPID. It builds a new “PID Namespace.” From the point of view of Step and its child processes, Step’s PID is 1. It’s the “init” process. When Step dies, Linux destroys the namespace, killing and reaping all Step’s children. That’s great! Without CLONE_NEWPID, it would be a hassle to deal with Step’s children.
CLONE_NEWPID makes all other processes invisible to our Step process. It can’t send a signal to a process with PID=21, for instance, because it can’t see PID=21. It can’t even see its own parent.
CLONE_NEWCGROUP is another cool namespace: combined with Linux control groups (cgroups), it lets us limit the Step process’s resources. We can limit the number of allowed subprocesses (to prevent a fork bomb) or limit the amount of physical RAM a Step and its subprocesses may use.
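On Linux, such limits are set by writing to files in a cgroup directory. A minimal sketch, assuming the cgroup v2 file names pids.max and memory.max (cgroup v1 names its memory limit differently); the function names and the example numbers are hypothetical, and the post does not say which cgroup version Workbench uses.

```python
import os


def limit_settings(max_processes: int, max_memory_bytes: int) -> dict:
    """Build the cgroup v2 control-file contents for the two limits."""
    return {
        "pids.max": str(max_processes),       # caps processes: stops fork bombs
        "memory.max": str(max_memory_bytes),  # caps RAM for the whole group
    }


def apply_limits(cgroup_dir: str, settings: dict) -> None:
    """Write each limit into its control file inside cgroup_dir."""
    for filename, value in settings.items():
        with open(os.path.join(cgroup_dir, filename), "w") as f:
            f.write(value)


def add_process(cgroup_dir: str, pid: int) -> None:
    """Move a process (and its future children) into the cgroup."""
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))


if __name__ == "__main__":
    # Hypothetical limits: 32 processes, 1 GiB of RAM.
    print(limit_settings(32, 1024 ** 3))
```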
And so on.
Two namespaces are trickier:
How do we prevent a rogue Step process from reading
We already solved this once: file permissions. We set the file permissions such that UID=100000 may not read from the file. And now, with CLONE_NEWUSER in place, even “root” in a Step process can’t read the file.
But what if a developer adds a new secret and forgets to set file permissions?
… Well, we could always unmount the filesystem, right? If the filesystem is unmounted, the secret file doesn’t exist — so the Step process can’t read it.
This is the promise of CLONE_NEWNS: mount namespaces.
Unfortunately … it doesn’t seem to be worth the hassle, either in Docker (our development environment) or Kubernetes (our production environment).
The authors of Linux, Docker and Kubernetes are wary of letting unprivileged processes mess with mounts. For good reason: fiddling with overlayfs can wreak havoc.
If we run our Renderer as a privileged process we can umount() from within Steps … but that amounts to un-sandboxing our Renderer process (which Kubernetes itself sandboxes, using containers) in a bid to sandbox Steps. Common sense prevails: we won’t unmount.
(In the future, if the Linux community’s research into “rootless containers” pans out, we may switch Workbench to unmount instead.)
At Workbench, we compromise and use chroot(). chroot() gives our Step process a new directory tree that we create ahead of time. It doesn’t inspire as much confidence as unmounting. But it’s better than nothing.
We mount a singleton chroot in a Kubernetes init container. In development, we mount the singleton chroot filesystem at the start of each container run.
The new chroot gives us a cool side-benefit: we can constrain Step’s disk usage. We create an ext4-formatted flat file, edits.ext4. Then we overlay-mount edits.ext4 on top of our normal chroot structure. Now, when a Step writes files, those writes are directed to edits.ext4. The flat file can’t grow beyond the filesystem size, so Step’s disk usage is constrained.
Once Step exits, we iterate over all the files written to edits.ext4 and delete them from the overlay filesystem. (It would be more foolproof to remount with a fresh edits.ext4 file; sigh.)
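Here is a rough sketch of both halves: the commands that could build such an overlay, and the cleanup after a Step exits. Everything specific is an assumption for illustration: the paths, the size, and the function names disk_quota_commands and clear_directory are hypothetical. The first function only builds argument lists (running them needs root); the second deletes everything inside a directory.

```python
import os
import shutil


def disk_quota_commands(image, size_mb, image_mount, base_root, merged_root):
    """Commands to overlay a fixed-size ext4 image on top of base_root."""
    upper = os.path.join(image_mount, "upper")  # receives the Step's writes
    work = os.path.join(image_mount, "work")    # overlayfs scratch space
    options = f"lowerdir={base_root},upperdir={upper},workdir={work}"
    return [
        ["truncate", "-s", f"{size_mb}M", image],  # fixed-size flat file
        ["mkfs.ext4", "-q", "-F", image],          # format it as ext4
        ["mount", "-o", "loop", image, image_mount],
        ["mkdir", "-p", upper, work],              # both live on the image
        ["mount", "-t", "overlay", "overlay", "-o", options, merged_root],
    ]


def clear_directory(path: str) -> int:
    """Delete every entry in path; return how many top-level entries went."""
    removed = 0
    for entry in list(os.scandir(path)):
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.unlink(entry.path)
        removed += 1
    return removed


if __name__ == "__main__":
    for argv in disk_quota_commands(
        "/tmp/edits.ext4", 64, "/mnt/edits", "/chroot-base", "/chroot"
    ):
        print(" ".join(argv))
```

Because the upper and work directories both sit on the mounted image, writes through the merged directory cannot exceed the image’s size.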
This forces a compromise: our chroot is a singleton. One Renderer can’t run two Step processes in parallel: we’d need a new chroot per concurrent Step. Oh well. Instead of making one Renderer execute multiple Steps concurrently, we deploy multiple Renderers that each execute one Step at a time.
How can Renderer read and write our database, without giving Step the same privilege?
One layer of defence is a database password: if Step doesn’t know the password, it can’t connect.
CLONE_NEWNET (network namespaces) is way better. With it, the Step process has no network interfaces. Ha!
Of course, some of our Steps (like our Twitter connector) need the Internet. For those, Renderer creates two peered “veth” network devices and sends one to Step’s network namespace. In Renderer’s namespace, we use iptables to install a firewall and apply network address translation (NAT) to traffic from the Step process’s network device. When Step dies, the network namespace dies, and the network device disappears.
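The commands involved might look roughly like the following. This is a sketch, not Workbench’s actual configuration: the interface names, addresses, blocked ranges and function names are all hypothetical. The functions only build ip and iptables argument lists, since running them requires CAP_NET_ADMIN.

```python
def veth_setup_commands(step_pid, host_if="veth-host", step_if="veth-step",
                        host_addr="192.168.123.1/24"):
    """Commands to create a veth pair and hand one end to the Step."""
    return [
        # Create two peered devices in the current (Renderer) namespace.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", step_if],
        # Move one end into the network namespace of the Step process.
        ["ip", "link", "set", step_if, "netns", str(step_pid)],
        # Configure and enable the end that stays behind.
        ["ip", "addr", "add", host_addr, "dev", host_if],
        ["ip", "link", "set", host_if, "up"],
    ]


def firewall_commands(host_if="veth-host", subnet="192.168.123.0/24",
                      blocked=("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")):
    """Commands to firewall and NAT traffic arriving from the Step."""
    rules = [
        # Refuse to forward Step traffic to (hypothetical) internal ranges.
        ["iptables", "-A", "FORWARD", "-i", host_if, "-d", net, "-j", "REJECT"]
        for net in blocked
    ]
    # Rewrite the source address of whatever traffic is still allowed out.
    rules.append(["iptables", "-t", "nat", "-A", "POSTROUTING",
                  "-s", subnet, "-j", "MASQUERADE"])
    return rules


if __name__ == "__main__":
    for argv in veth_setup_commands(12345) + firewall_commands():
        print(" ".join(argv))
```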
Sadly, we must grant our Renderer process CAP_NET_ADMIN to create network interfaces. This is the only nonstandard capability we grant our services.
We made compromises earlier; and it’s just as well, because they come in handy with network namespacing:
- Good thing we used chroot (though unmount would be better): now Step can have a different /etc/resolv.conf. (Workbench’s Step process uses external DNS:
- Good thing we disallow concurrent Step processes per Renderer: now we can hard-code a network-interface name. (Network-interface names must be unique in the Renderer network namespace.)
- Good thing we made an init script for setting up our chroot. That init script is the perfect place to write our iptables rules. (The rules never change because we hard-coded our network-interface names.)
Thanks to Linux containers, a Step process won’t be able to access our users’ files or take down our website — even if it becomes root.
And spinning up these sandboxed Steps is blazing fast: it takes up to a few milliseconds, if networking is enabled.
Given the choice between speed and security, Workbench chose both.