Multi-Process Docker Containers

April 5, 2023

I think of Docker as a packaging construct, a portable way to describe the runtime dependencies and configuration of a service. That portability means that, in general, a Dockerfile created on my dev machine is very likely to work on your dev machine, and also in production.

I do not rely on Docker to isolate workloads, however, especially in multi-tenant environments or where different types of data need to be secured. Fly.io came to the same conclusion and has an article that explains the various ways to isolate untrusted processes and why they chose a virtual machine-based approach.

The problem with Docker as a packaging construct

I often need to run multiple “sidecar” services alongside my main service. Litestream is the most common example, but Tailscale is usually in there as well. Sometimes things get even more complex: Super Guppy needs NGINX, Tailscale, a FastCGI process, and the actual Ktra server.

On a dev box I could use Docker Compose, but that won’t work on Fly.io (or Flatcar Linux, since Flatcar does not support Docker Compose).

Shell scripts can work, assuming you get all of the error/signal handling correct. Process managers are the better solution to this problem. Fly.io has a web page discussing various process managers, and their Machines feature even includes its own process manager now!

What I wanted was something that worked everywhere, with a handful of features that I knew would make it easier to configure a set of services inside of a container/micro-VM: serialized startup and shutdown; pre-startup and post-run phases; environment variable isolation; and the option to run processes as non-root users.
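To make those requirements concrete, here is a minimal Python sketch of serialized startup, reverse-order shutdown, pre/post phases, and per-process environments. It is not Ground Control's implementation or its configuration format; the Service class, its fields, and the start_all and stop_all helpers are hypothetical names chosen for illustration.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Service:
    """Hypothetical description of one managed process (illustrative only)."""
    name: str
    pre: Optional[list] = None    # one-shot command run before `run` starts
    run: Optional[list] = None    # the long-running command
    post: Optional[list] = None   # one-shot command run after `run` has stopped
    env: dict = field(default_factory=dict)  # the only variables this service sees


def start_all(services):
    """Start services one at a time, in list order."""
    started = []
    for svc in services:
        if svc.pre:
            # check=True raises if the pre step fails, aborting startup.
            # A real manager would also stop the services already started.
            subprocess.run(svc.pre, env=svc.env, check=True)
        # Passing env replaces the environment instead of inheriting ours.
        proc = subprocess.Popen(svc.run, env=svc.env) if svc.run else None
        started.append((svc, proc))
    return started


def stop_all(started, timeout=10.0):
    """Stop services in reverse start order; return the names in stop order."""
    stopped = []
    for svc, proc in reversed(started):
        if proc is not None:
            proc.terminate()  # polite SIGTERM first
            try:
                proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                proc.kill()   # escalate if the process ignores SIGTERM
                proc.wait()
        if svc.post:
            subprocess.run(svc.post, env=svc.env, check=False)
        stopped.append(svc.name)
    return stopped
```

Stopping in reverse order means a sidecar that the main service depends on (a VPN or a database replicator, say) outlives the service it supports. Running as a non-root user is left out of the sketch; on POSIX systems with Python 3.9 or later, subprocess.Popen accepts a user argument that could be used for that.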

There were process managers that were close, but none of them met all of my requirements.

Enter Ground Control

Ground Control is my entry into the (apparently quite crowded!) process manager space: a lightweight, portable, Docker-first process manager that lets you run multiple processes inside of a Docker container or micro-VM. Ground Control understands the PID 1 problem, which is especially important if you are running multiple processes on the same machine.
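For anyone unfamiliar with the PID 1 problem: the first process in a container runs as PID 1, and the kernel hands it any orphaned child processes. If PID 1 never waits on them, they linger as zombies. PID 1 also has to pass signals such as SIGTERM on to the processes it manages, because nothing else will. The Python sketch below shows both duties in their simplest form. It is an illustration of the general technique, not Ground Control's code, and the function names are my own.

```python
import os
import signal


def reap_children():
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # there are no child processes at all
        if pid == 0:
            break  # children exist, but none have exited yet
        # Negative exit codes mean "killed by that signal number".
        reaped.append((pid, os.waitstatus_to_exitcode(status)))
    return reaped


def forward(signum, child_pids):
    """Send signum to each managed child, skipping any that are already gone."""
    for pid in child_pids:
        try:
            os.kill(pid, signum)
        except ProcessLookupError:
            pass


def install_forwarding(child_pids, signals=(signal.SIGTERM, signal.SIGINT)):
    """Relay the given signals to the children; call from the main thread."""
    for sig in signals:
        signal.signal(sig, lambda signum, _frame: forward(signum, child_pids))
```

A supervisor would typically call reap_children whenever it receives SIGCHLD, or on a short timer, and would install the forwarding handlers before starting its first process.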

I also included a “break glass” feature in Ground Control, which is quite helpful when a deployment fails. We had this happen frequently when first trying out LiteFS, due to a bug we were tripping over related to how our database migrations were being applied. Fixing the database would have been easy if we could log in to the machine, but Fly.io was tearing down the VM as soon as LiteFS exited.

My solution was to add a feature to Ground Control that would check for a BREAK_GLASS environment variable on startup and, if found, would freeze the startup process without running any processes. This allows you to flyctl ssh console into the machine, fix whatever needs to be fixed, then remove the BREAK_GLASS variable to restart the machine into its normal mode (all of which can be done with the flyctl secrets set/unset commands).
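The idea is simple enough to sketch in a few lines of Python. This is a rough illustration of the behavior described above, not Ground Control's actual code: the function names are hypothetical, and treating any value of BREAK_GLASS (even an empty one) as "set" is my assumption.

```python
import os
import signal


def break_glass_requested(environ=None):
    """Return True if the BREAK_GLASS variable is present at all."""
    env = os.environ if environ is None else environ
    return "BREAK_GLASS" in env


def main(start_services, environ=None):
    """Hold the machine open if break glass is requested, else start normally."""
    if break_glass_requested(environ):
        print("BREAK_GLASS is set; holding without starting any processes")
        while True:
            # Sleep until a signal arrives (Unix only). An unhandled SIGTERM
            # still terminates the process, so the platform can stop the
            # machine normally.
            signal.pause()
    start_services()
```

Because the manager stays alive without launching anything, the platform sees a running machine and leaves it up long enough to connect and repair it. With a sketch like this, something along the lines of flyctl secrets set BREAK_GLASS=1 would enter the held mode and flyctl secrets unset BREAK_GLASS would leave it; the value 1 is arbitrary here.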

Proving that there are no new ideas on the Internet, LiteFS itself got a similar feature around the same time: a way to keep LiteFS running even if an error occurred. I think that’s great, but I don’t think that every sidecar service should have its own break glass feature. That feels like something that the process manager and/or hosting platform should take care of for you.

Final thoughts

An unexpected benefit of Ground Control is that it gives me a new layer of abstraction that I can use when thinking about service architecture. Sometimes I want to think about what is running inside the machine: the database synchronizers, log/counter exporting tools, and of course the service itself. For that, I can look at the Dockerfile and resulting image.

Other times I want to treat machines as opaque units in a larger service diagram, with labels like “Front End Service” or the “Subscription Renewal Service.” In those cases, the details of what is running on each machine – the groundcontrol.toml and Dockerfiles – can be ignored.

This allowed the service architecture at my last company to stay simple, even as the overall complexity grew. What used to be a single VM with a number of cron jobs and background processes could instead be multiple VMs, horizontally “scaled” to reduce complexity rather than to increase performance. Conveniently, this helps with performance-related scaling as well, since the way you scale the “Subscription Renewal Service” is not the same as the way you might scale a partitioned “Front End Service.”

Notably, this does not immediately devolve into a traditional microservice architecture or require something like Kubernetes. Using multiple VMs does not necessarily mean that they have to be tightly coupled at runtime. More on this in a later article.