Sandboxing data crunches, Chapter 1: use a subprocess


Subprocesses help the main process recover after out-of-memory errors
  • Protobuf is a compelling format for encoding messages. Workbench happens to use Arrow, which happens to support the Parquet file format, which happens to encode metadata in Thrift format; so Thrift was already a Workbench dependency. (If we’d had a blank slate, I’d have leaned towards Protobuf, because it’s more popular.)
  • Python’s pickle module is not okay. Using pickle format, Step can output untrusted code that Renderer will execute — which defeats the purpose.
  • JSON may seem appropriate for simple messages; but it’s costly as you grow. When Renderer parses Step’s output message, it must validate that (untrusted) message. You can hand-code JSON validation functions … but that’s onerous. You can use a library like JSON Schema … but then you’d need to write schemas. Thrift and Protobuf are made for this; JSON isn’t.
  • You can pass data from Renderer to Step through command-line arguments. This is a one-way communication channel and it only passes bytestrings.
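To see why pickle defeats the purpose, here is a minimal sketch of the attack. (`Exploit` is a hypothetical class name for illustration; it is not Workbench code.) Unpickling attacker-controlled bytes calls an arbitrary callable via `__reduce__`:

```python
import pickle

# A malicious Step could emit bytes like these as its "output message".
class Exploit:
    def __reduce__(self):
        # pickle.loads() will call eval("6 * 7") -- it could just as
        # easily call os.system("...") inside Renderer's process.
        return (eval, ("6 * 7",))

payload = pickle.dumps(Exploit())  # what the untrusted subprocess writes
result = pickle.loads(payload)     # Renderer "parses" ... and runs code
```

`pickle.loads()` is code execution in disguise; a schema-driven format like Thrift or Protobuf can only ever decode into plain data.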

One layer of security

  • A “good” subprocess reads inputs from standard input and disk; writes valid output to standard output and disk; and exits with status code 0. Renderer acts upon its output.
  • A “buggy” subprocess reads from standard input and — whoops — prints a stack trace to standard error and exits with a non-zero status code. Renderer acts upon this event.
  • Another “buggy” subprocess writes nonsense to standard output. Renderer recovers. (Renderer must not allow injection here. Don’t use Python’s pickle module.)
  • Another “buggy” subprocess never exits. Renderer kills it and declares, “timed out”.
  • A “bad” subprocess does Something Else. (cue ominous tones: dun-dun-dunnn!)
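The first four cases above can be sketched in a few lines. This is not Workbench's actual code: `render_step` and `parse_and_validate` are hypothetical names, and the `OK:` prefix stands in for a real Thrift/Protobuf decode-and-validate step.

```python
import subprocess
import sys

def parse_and_validate(raw: bytes) -> str:
    # Stand-in for decoding and validating an untrusted output message.
    text = raw.decode("utf-8")
    if not text.startswith("OK:"):
        raise ValueError("unrecognized output")
    return text[3:].strip()

def render_step(cmd, input_bytes=b"", timeout=30):
    """Run an untrusted Step and act on every failure mode, not just success."""
    try:
        proc = subprocess.run(cmd, input=input_bytes,
                              capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return ("timed out", None)           # subprocess never exited
    if proc.returncode != 0:
        return ("crashed", proc.stderr)      # stack trace on standard error
    try:
        return ("ok", parse_and_validate(proc.stdout))
    except ValueError:
        return ("invalid output", None)      # nonsense on standard output
```

For example, `render_step([sys.executable, "-c", "print('OK: hello')"])` yields the "good" path, while a script that sleeps forever hits the timeout branch. The fifth case, a "bad" subprocess, is exactly what this wrapper cannot handle; that's what the sandboxing below is for.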
Step’s process “sandboxes” itself in step 0: it reduces its privileges irreversibly before executing untrusted code
  • On Kubernetes, disable Hyper-Threading so untrusted code can’t glean data from other processes.
  • Secure internal services (like your database) with SSL and passwords. Untrusted code won’t know the password (if you sandbox properly).
  • Ensure secret files aren’t readable to the sandbox user. (Kubernetes mounts secrets as world-readable by default. Fix those permissions!)
  • Ensure the sandbox user can’t write files Renderer reads internally. (Imagine if untrusted code wrote a configuration file and then Renderer acted upon it….)
  • Restrict cloud metadata URLs. From a Google Kubernetes Engine pod, http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-env exposes credentials that let anybody join the cluster and access all users’ data. Don’t let untrusted code access that URL. We enabled Workload Identity to prevent exposing this URL. (A firewall would inspire more confidence; we’ll build one in Chapter 3.)
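The secret-file check above is easy to automate at startup. A minimal sketch, assuming a POSIX filesystem (`assert_secret_is_private` is a hypothetical helper, not Workbench code):

```python
import os
import stat
import tempfile

def assert_secret_is_private(path):
    """Refuse to start if a secret file is group- or world-accessible."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise RuntimeError(f"{path} is readable beyond its owner; chmod 0600 it")

# Kubernetes mounts secrets world-readable (0644) unless you set a
# stricter defaultMode; simulate that default with a temp file.
fd, secret_path = tempfile.mkstemp()
os.close(fd)
os.chmod(secret_path, 0o644)  # the risky Kubernetes default
```

Running the check on a 0644 file raises; after `chmod 0600` it passes. Failing fast at startup beats discovering the leak after untrusted code has already read the file.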

Adam Hooper
Journalist, ex software engineer
