The Problem

Have you ever been working on a web application and had to switch to another one? Maybe you need to fix a bug in it, or maybe the first application needs to talk to the second; it doesn’t really matter, but you know that you need to run a second application now. You start to launch it and get a nice error message saying that the port is already taken, because your organization has standardized on launching every application on port 8080 and the first application is still running on that port.

Now if you’re just trying to fix a bug in one app you can safely close the first one you had running and fix the bug. What happens, though, if you need those two services to talk to each other? You could always assign one to port 8080 and the other to port 8081 (and hope you remember to change the configuration back later). But what are the odds that you’ll remember to change the application back to the original port? What happens if you need not two but three services to talk to each other? Well, you just use 8080, 8081, 8082, and so on. Does this smell bad to anyone else or is it just me?

The Solution?

So to fix this problem we can use Docker and put each service into a Docker image (your service is already containerized, isn’t it?). We can now run as many services as we want, all on port 8080, so long as we don’t bind those services to the host’s port 8080. Problem solved! Well, mostly. If we don’t bind our service to any host port, how do we access it? The short answer is that we can’t. That doesn’t sound terribly helpful, does it?
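
To make that concrete, here’s a minimal sketch of two containerized services that both listen on port 8080 internally without colliding, because neither one publishes a host port (the image names are just placeholders):

version: "3.5"

services:
  servicea:
    image: servicea:latest    # placeholder image
    expose:
      - "8080"                # internal only; no host port is published
  serviceb:
    image: serviceb:latest    # placeholder image
    expose:
      - "8080"                # same container port, but no conflict on the host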

We could just run a proxy on ports 80 and 443 (or whatever ports you want) and have it proxy requests to a specific service based on the URL. Say you have a service named ServiceA; you could have the following two URLs proxied to it: http://localhost/servicea/ and https://localhost/servicea/. In actuality, we shouldn’t have the proxy pass HTTP requests directly to the service and should instead return a redirect to the HTTPS variant, since we’re good citizens of the web. So now our services are accessible again and everyone is happy!
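
For the redirect half of that, a minimal Nginx server block is enough; this is just a sketch of the idea, since the rest of this post sticks to plain HTTP:

server {
  listen 80;
  # send every plain-HTTP request to its HTTPS equivalent
  return 301 https://$host$request_uri;
}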

This sounds great but how do we actually implement it?

The Options

I’m going to assume that the wrapping of a service in a Docker image is already well understood so instead I’m going to focus on the proxy bit.

So for this I came up with three main proxy candidates that seemed like reasonable fits for what I wanted to do: Nginx, Traefik, and Envoy. It’s worth noting that HAProxy is another option, but it wasn’t one that I wanted to dig into. Same with Apache. They’re great options, but HAProxy is pretty low level (I think of it mostly as a Layer 4 load balancer) and Apache is just annoying to set up and maintain.

For all three of these proxies I created a Docker Compose file that contained definitions for the proxy, the volume mounts for the configuration, and a Docker network to isolate the proxy’s access to only the services that I want to be made public (this bit wasn’t strictly necessary, but it gives me a warm fuzzy feeling inside). The minimum requirement this places on services integrating with the proxy is that they all have to be on the same network as the proxy. That isn’t too bad though.

I also want to quickly note that to start simple, I’m only dealing with port 80 and HTTP on the host. I wanted to identify a good proxy candidate first and add complexity to the deployment later with a proper way to generate certificates and all that.

Nginx

Nginx seemed like the most logical place to start since it’s a proxy I’m pretty familiar with and have used quite a bit. It’s also incredibly easy to spin up and use, which made it seem quite appealing.

I put together a simple Nginx configuration and Docker Compose file that looked something like the following:

upstream serviceA {
  # upstream servers are specified as host:port; the scheme belongs on proxy_pass
  server serviceA:8080 fail_timeout=0;
}

server {
  listen 80;
  server_name localhost;

  # redirect server error pages to the static page /50x.html
  error_page 500 502 503 504 /50x.html;
  location = /50x.html {
    root /usr/share/nginx/html;
  }

  location /servicea/ {
    # the trailing slash on proxy_pass strips the /servicea/ prefix before proxying
    proxy_pass http://serviceA/;

    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Forwarded-Prefix /servicea/;
  }
}

And here’s the accompanying Docker Compose file:

version: "3.5"

services:
  proxy:
    image: nginx:stable-alpine

    volumes:
      - type: bind
        source: ./nginx.conf
        target: /etc/nginx/conf.d/default.conf
        read_only: true

    ports:
      - target: 80
        published: 80

    networks:
      - public

networks:
  public:

Like I said, it’s incredibly simple and there’s room for improvement, but let’s talk about how this works and what the shortcomings are.

So a request comes in for http://localhost/servicea/some/path and gets proxied to serviceA as http://serviceA:8080/some/path, which matches what we wanted. The proper headers are also passed to the service so that it knows it was proxied and where it was proxied from. At a glance this looks like problem solved, but there are three major issues with this particular implementation.

The first is that to add a new service, you need to update the configuration for Nginx to include the new service. Conversely, if you are no longer working on a service (let’s say it was decommissioned), you need to remember to remove the configuration for the service. This isn’t awful but it is pretty annoying. One potential workaround for this is to not include any upstreams and to just take the first bit of the path and use that for the service name but this is pretty inflexible and likely to be more problematic than I want to deal with for anything but the simplest of setups.
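
For what it’s worth, that catch-all idea would look roughly like this in Nginx, assuming every service listens on 8080 and is reachable by its container name (Docker’s embedded DNS lives at 127.0.0.11); this is only a sketch:

# Docker's embedded DNS, so container names can be resolved at request time
resolver 127.0.0.11 valid=10s;

location ~ ^/(?<svc>[^/]+)/(?<rest>.*)$ {
  # route /<name>/... to the container called <name> on port 8080
  proxy_pass http://$svc:8080/$rest$is_args$args;
}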

The second problem is less obvious but is a big deal breaker for me. Nginx resolves the addresses of all the upstream services defined in the configuration when it starts up and caches them, which keeps proxying to the services fast. This is great except that Nginx will fail to start if it can’t resolve an upstream right away (because it is also starting up or is off for some reason). This is problematic when Docker containers start at different times or the service isn’t running because you aren’t doing any work on it. There are ways to prevent an upstream from blocking the proxy’s startup (see the sketch below), but they’re tedious to set up, and then you run into a new problem: if the service that Nginx is talking to restarts or is recreated with a new address, Nginx keeps using the stale address and starts returning errors on every request meant to hit the service. You would need to restart Nginx to pick up the new container. This isn’t an ideal situation to be in when you are actively working on a service and likely to break it (don’t pretend like you won’t!).
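
One common workaround is to reference the upstream through a variable, which makes Nginx skip resolving it at startup and re-resolve it per request via Docker’s DNS (the same resolver as in the sketch above); again, just a sketch:

resolver 127.0.0.11 valid=10s;

location /servicea/ {
  set $servicea_upstream http://serviceA:8080;
  # strip the prefix ourselves, since proxy_pass with a variable
  # forwards the (rewritten) request URI as-is
  rewrite ^/servicea/(.*)$ /$1 break;
  proxy_pass $servicea_upstream;
}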

The third problem is that there’s no way to scale this if you want to play around with horizontal scaling of services. You probably won’t need to do this often, but it definitely happens, and to do it you would not only need to use upstreams, you would also need to update them with every instance you expect to proxy to. That means no dynamic scaling and more cleanup to remember, and it completely nullifies the “fix” that I suggested for the first issue.
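
To illustrate, scaling ServiceA to three instances would mean hand-maintaining an upstream along these lines (the instance names are hypothetical; they’d be whatever your tooling produces):

upstream serviceA {
  # every replica has to be listed explicitly and kept in sync by hand
  server servicea_1:8080 fail_timeout=0;
  server servicea_2:8080 fail_timeout=0;
  server servicea_3:8080 fail_timeout=0;
}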

I’ve never used Nginx Plus but from what I know of it, many of these problems can be solved with that version of Nginx. Having said that, I don’t want to pay $2,500 per year for a local dev environment. It’s a tad out of my budget.

The one really nice thing about this setup is that it places no additional requirements on services to integrate with it (other than them living on the same network). We didn’t have to change the process of running our containers and we didn’t have to add any configuration to the services themselves either. This is a very nice property but not worth the challenges it creates.

Traefik

Traefik is a newer proxy on the block and offers some very compelling features. First and foremost, its configuration is very minimal for a working setup. It’s also Docker native: Traefik can run in a Docker container while listening to the Docker socket for containers being started and stopped. It can be used in plenty of other environments too, but the Docker-native bit was the part I was most interested in. If a container is started with certain metadata attached to it, Traefik will start routing traffic to that container (or load balance it across a cluster of containers). If a container that was being routed to is stopped, Traefik will stop routing traffic to it and load balance across any remaining instances. Sounds pretty great, right?

This is what my initial configuration looked like:

defaultEntryPoints = ["http"]
logLevel = "DEBUG"

[docker]
endpoint = "unix:///var/run/docker.sock"
domain = "localhost"
watch = true
exposedbydefault = false

[entryPoints]
  [entryPoints.http]
  address = ":80"
  compress = true

And the Docker Compose file for the proxy:

version: "3.5"

services:
  proxy:
    image: traefik:1.4.6-alpine
    restart: always

    volumes:
      - type: bind
        source: ./traefik.toml
        target: /traefik.toml
        read_only: true
      - type: bind
        source: /var/run/docker.sock
        target: /var/run/docker.sock

    ports:
      - target: 80
        published: 80

    networks:
      - public

networks:
  public:

This setup is a little more complex, but not by much. In the Docker Compose file a single container is created that is bound to port 80 on the host machine, and two files are mounted into the container. The first is the configuration file for Traefik and the second is the Docker socket. The socket is necessary so that Traefik can listen for changes in the Docker environment (such as a container starting or stopping), and this is how Traefik decides how and when to route traffic to services.

The Traefik config file is set up in a pretty simple way as well. We tell Traefik to use HTTP and to listen for changes via the Docker socket. We also tell Traefik that we do not want it to route traffic to any container unless that container explicitly asks to be served traffic. This helps isolate traffic to only where we want it to go. How do we tell Traefik which containers are routable, though?

Unlike Nginx, which placed no requirements on how containers are run, Traefik requires some additional metadata to be added to the containers. Thankfully this metadata is just some labels (and they can be left off in production). Here’s a sample Docker Compose file with these additional labels:

version: "3.5"

services:
  app:
    image: nginx:stable-alpine

    networks:
      - public

    labels:
      - "traefik.enable=true"
      - "traefik.port=80"
      - "traefik.docker.network=public"
      - "traefik.frontend.rule=PathPrefixStrip:/servicea"

networks:
  public:
    external:
      name: devhost_public

The bit about the networks is the same as with Nginx, but the labels are new. The first label says that we want to route traffic to this container. The second says what port to route traffic to. The third says which network we want to use (in this example it isn’t strictly necessary since there’s only a single network, but if your container lives on multiple networks it’s very important). The last label is the routing rule we want to use. This one says that we want to route any traffic with a path that has the prefix /servicea, and we want that prefix stripped during routing.

This very simple method of proxying worked pretty well. When a routable container starts it is instantly proxied to, and it is removed the moment it stops. Only having to worry about adding the service configuration at the place where the container is defined was also really nice, since it kept things in a sane place instead of mixed in with every other service I might be running, all without having to restart Traefik whenever a service changes.

Time for the con of this proxy. Unless you want to set up a DNS server to provide individual domains per service (i.e. something like servicea.localhost), or you want to add an /etc/hosts entry for each service, you have to use either the PathPrefix or PathPrefixStrip rules. On the surface that doesn’t seem like a big deal, but eventually you realize it means your service also has to be aware of the root it should be serving from. When I was testing this with a JupyterHub server that spawned notebook servers as Docker containers, this became very annoying to deal with, and for a Go service written with Gin it was basically impossible. The expectation is that you either configure the service to use a custom root or that it understands and uses the X-Forwarded-Prefix header, which is pretty non-standard and uncommon. It was this that really kept me from wanting to continue with this option; it was just too unwieldy after a while. I did talk with the developers of Traefik about this, and their recommendation was to just use host-based rules, which obviously isn’t a great solution to a fairly common problem.
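
For reference, the host-based rule they suggested would just swap out the last label, something like this (you’d still need DNS or an /etc/hosts entry so servicea.localhost resolves, which is the whole problem):

labels:
  - "traefik.enable=true"
  - "traefik.port=80"
  - "traefik.docker.network=public"
  - "traefik.frontend.rule=Host:servicea.localhost"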

Envoy

Envoy was the final proxy that I trialed and probably the most complex of all three options. The idea behind Envoy is that you create a mesh of your services and allow them to coordinate among themselves. There are many ways to accomplish this, each with their own levels of difficulty and pros and cons.

One possibility for setting this up looks a lot like previous examples where we have a single Envoy instance and it talks directly to each service. This works alright and is by far the simplest approach to the problem. This does however still require a configuration change for each service or the addition of a control plane that keeps track of services starting and stopping. Unlike Nginx, Envoy expects hosts to become unhealthy so it can handle stopping and starting much better.
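
To make that concrete, a static configuration for a single front Envoy might look roughly like the following. This uses Envoy’s v2 API field names and assumes a serviceA container listening on 8080; I didn’t run this exact file, so treat it as a sketch rather than a working config:

static_resources:
  listeners:
    - address:
        socket_address: { address: 0.0.0.0, port_value: 80 }
      filter_chains:
        - filters:
            - name: envoy.http_connection_manager
              config:
                stat_prefix: ingress_http
                route_config:
                  virtual_hosts:
                    - name: local
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/servicea/" }
                          route: { cluster: servicea, prefix_rewrite: "/" }
                http_filters:
                  - name: envoy.router
  clusters:
    - name: servicea
      connect_timeout: 1s
      type: STRICT_DNS        # re-resolve the container name so restarts are tolerated
      lb_policy: ROUND_ROBIN
      hosts:
        - socket_address: { address: serviceA, port_value: 8080 }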

Another approach that could be taken is to put an Envoy instance as the entry point to your local network (like with the previous method) and another one in front of each service group. This is more along the lines of what you would see in a production deployment of Envoy but it requires either a static configuration or a control plane like the single Envoy system. This approach gives you a lot of benefits such as service discovery and more control over how services talk to each other and is generally the preferred way to run Envoy.

I really wanted the dynamic approach that Traefik provides while having the much greater flexibility and control that Envoy allows for, so I started looking at using Istio as a control plane. Istio was created by Google, IBM, and Lyft specifically for setting up service meshes, with the initial implementation using Envoy as the proxy sidecar. They even provide a sample Docker Compose file that you can use to spin up a simple control plane for Envoy. I was able to get the control plane spun up (with a lot of digging into its documentation…) but ran out of time getting Envoy to use it properly for determining what to route where, which is why there’s no sample configuration provided here.

I really like how much control Envoy allows you to have, but it comes with the hefty price of a great deal of domain knowledge being required to run it well. It’s a lot like building an application in C versus Python. Sure, you can create great applications in either language, and you have far more control in C over how it performs, but it’s going to take you a lot longer to write the application well.

Wrap Up

So where does this leave things? It doesn’t seem like a conclusion was actually reached, does it? At this point I want to look into finding a simpler control plane than Istio for local development. Istio is really great and has a ton of features, but I think it’s a bit overkill for what I’m trying to do. I also really like how the Envoy-plus-control-plane option feels very Docker friendly: it expects services to come and go and knows how to adapt to that.

Having said that, the simplest approach was definitely Traefik. The only way I think I could get it to work better is to employ a DNS service that sits alongside it and provides custom domains for each service instead of a path prefix. My goal is to share the way I do development with others, though, and asking people to set up something like dnsmasq alongside a normal Docker service seems like a bit of an ask. It wouldn’t be the worst, but it certainly is more complicated. I want to see if there’s a relatively simple way to set up a dnsmasq service in a Docker container and direct queries to it when it’s up. That would provide a great deal of portability and would be safer than running it directly on the host.
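
Roughly what I have in mind is something like the following, though I haven’t tried it yet and the image and flags are assumptions, not a tested setup:

version: "3.5"

services:
  dns:
    image: andyshinn/dnsmasq       # one of several community dnsmasq images
    # answer any *.localhost query with the loopback address, so every
    # per-service domain resolves back to the proxy listening on the host
    command: --address=/localhost/127.0.0.1
    cap_add:
      - NET_ADMIN
    ports:
      - target: 53
        published: 53
        protocol: udp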

Todo for Next Time

  • Research a simpler control plane option or see if implementing one is feasible
  • Find out if dnsmasq could be containerized and run that way so that Traefik could be used instead