Early Experiences with Kubernetes: Debugging Unresponsive Nodes

Paul Martinez
Published in Affinity
12 min read · Jul 21, 2017


Here at Affinity we’ve chosen to use Kubernetes to manage our production cluster. Billed as a system for “automating deployment, scaling, and management of containerized applications,” it has been invaluable in helping us scale past just a few machines. It has eliminated an enormous amount of devops work and allowed us to largely focus on building new features and developing our core product into something that provides real value to our users. Kubernetes hides a lot of the complexity associated with running dozens of different services in production, but it’s a mistake to assume that it’s all you need to run a fully-fledged production service.

In our initial naïveté, we expected Kubernetes to somehow expertly manage our cluster straight out of the box, maximally optimizing our resource usage, and automatically detecting failures within the cluster and repairing itself with no human intervention required. But we soon discovered that Kubernetes is a complex system unto itself. It requires configuration, fine-tuning, and programmer guidance and input to operate safely and efficiently, just like any other large system. Kubernetes won’t know how to efficiently use memory and CPU unless you tell it how much memory and CPU a job needs. Kubernetes can recover from certain classes of failures, but you still need a robust monitoring system, beyond some basic options that Kubernetes provides, to ensure the health of your system and minimize downtime.

This is the tale of one of our first infrastructure incidents at Affinity. In it Kubernetes stars as both the villain and the hero, causing mischief and obstructing our debugging efforts before supplying a solution to our problem. In the end we realize that it is just one character in a bigger story, a powerful ally in our engineering adventure.

Adapted from https://xkcd.com/556/

Kubernetes at Affinity:

Our production jobs at Affinity can be clustered into four main groups:

  • Basic web server, serving all requests to affinity.vc
  • Syncers, which fetch emails and events for processing
  • Consumers, which process each email and event to detect things like introductions and emails that expect a response
  • Cron jobs, which handle things like sending daily round-up emails each morning or refreshing some interaction data that is too expensive to compute in real-time

As advertised, Kubernetes makes it super easy to deploy all of these jobs to our cluster.

What Kubernetes Does:

(You can skip ahead to Outage! if you’re already familiar with Kubernetes.)

Some of its explicit features we take advantage of include:

  • Automatic load balancing between machines (nodes in Kubernetes terminology): Given a set of jobs to run, Kubernetes will evenly distribute them amongst the nodes in your cluster. Each instance of a job running on a node is called a pod.
  • Easily adding new nodes to a cluster: We’re not yet at the scale where we need our cluster to dynamically scale according to load, but if we notice things are starting to get a bit cramped on our nodes we can bump up a number in a config file and Kubernetes takes care of the rest.
  • Automatic restarting of jobs: Kubernetes detects when a node goes down and will create new pods on other nodes to replace the ones on the failed node. If a pod randomly fails then Kubernetes will also restart it.
  • Rolling deployments: Kubernetes has built-in support for rolling out an update, bringing up new instances one at a time while decommissioning the old ones. This allows us to deploy new changes in the middle of the day with no downtime.

What Kubernetes Doesn’t Do:

Kubernetes doesn’t do everything for us. There are some features we’d like it to support, and other things that are out of scope and need to be fit into the Kubernetes system ourselves. These include:

  • Intelligently distributing jobs based on CPU or memory usage: By default Kubernetes will evenly distribute pods amongst your available nodes. If you have some jobs that eat up a lot of memory, ideally each would live on a different node, but you might get unlucky and have them all assigned to the same node. You can tell Kubernetes how much memory or CPU you expect a job to need, and it will make sure that pods are scheduled to make good use of resources. When you’re just starting out, however, it’s difficult to know what reasonable values are, so you might be flying blind for a while.
  • Redistributing jobs when new resources are added: When a new node is added to the cluster, Kubernetes doesn’t move any existing pods onto it until the next deploy. In larger clusters this isn’t a big problem, but in a small cluster a single node failure puts substantially more stress on the other nodes and can be an inefficient use of resources.
  • Monitoring: This is a bit disingenuous. Once a cluster is set up, there are a couple of options that can be installed with literally a single command, but they’re pretty bare-bones. The Kubernetes Dashboard provides a nice-looking UI, but all of the information there can also be fetched from the CLI. Heapster provides a better look into the internals of your cluster, surfacing information about CPU, memory, and network usage of individual pods, but it only collects quantitative data and has no awareness of higher-level Kubernetes concepts.
Kubernetes Dashboard & Heapster data in Grafana (Dashboard photo from https://github.com/kubernetes/dashboard/blob/master/docs/dashboard-ui.png)
  • Alerting: Kubernetes won’t alert you if one of your jobs keeps failing because you mistyped a command in a config file. Because alerting is highly dependent on the monitoring infrastructure, it lies pretty far outside of the scope of Kubernetes.
  • Fix the bugs in your code: Bummer.

Outage!

On February 13th our site went down, and we had no idea what was going wrong. At this point we hadn’t set up any monitoring, so our only way to assess the state of the cluster was through the Kubernetes command line interface, kubectl. We checked to make sure that all of our jobs were running and found that we had a lot of pods in an Unknown state, and multiple nodes were reporting as NotReady.

We use ReplicationControllers to make sure that each of our jobs has at least one instance running. When a node went down, Kubernetes would put all the pods running on that node into the Unknown state. To satisfy the contract of the ReplicationControllers, Kubernetes would then reschedule those jobs on other nodes. On this particular day multiple nodes went down and Kubernetes was unable to schedule all of our jobs. Notably, our web servers weren’t running, hence the chaos.
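
For context, a ReplicationController is just a small piece of YAML declaring how many copies of a pod should exist; a minimal sketch (the name and image here are illustrative, not our actual configuration) looks like:

apiVersion: v1
kind: ReplicationController
metadata:
  name: web-server                  # illustrative name
spec:
  replicas: 1                       # Kubernetes recreates the pod if it dies
  selector:
    app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      containers:
      - name: web-server
        image: example/web:latest   # placeholder image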

What was happening to our nodes? We had no idea, and Kubernetes wasn’t telling us anything useful. We looked at our EC2 dashboard to see if AWS could tell us anything more about the machines, but curiously AWS said the machines were totally fine. Without any leads concerning what was wrong, we opted for the nuclear option and totally redeployed a new cluster.

While this got our website back up and running, we hadn’t actually resolved anything, and later that day another node in our new cluster went down. Luckily we didn’t have multiple nodes go down again, so we didn’t experience the cascading failures that seemed to cause the outage earlier in the morning.

Unable to find a way to fix the NotReady nodes, we realized that we could simply terminate the EC2 instances through the AWS console. The AWS Auto Scaling Group backing our cluster would quickly bring up a new instance and Kubernetes would discover it and incorporate it into the cluster. Since Kubernetes wouldn’t redistribute jobs, we’d have to make sure to re-deploy after doing this to avoid ending up with two or three idling machines.

Aftermath

We had found a temporary solution for making sure our cluster didn’t fall over again, but it was obviously not ideal. Over the next few days, here are some of the steps we took to figure out what was going wrong. Some of them were dead ends that didn’t get us closer to a solution, but they did teach us more about the Kubernetes ecosystem.

Kubelet logs

We asked the kubernetes-users Slack channel for advice on how best to debug the issues we were facing. Someone suggested getting access to the kubelet logs on the nodes that were in the NotReady state. The kubelet is a process that runs on every node in a cluster and is in charge of managing the containers in the pods assigned to that node. Getting access to those logs could give us more detailed information about what was going wrong, but this turned out to be difficult.

We hadn’t set up any sort of centralized logging, so in order to access the logs we needed to ssh into a node and run journalctl -u kubelet. However, we had previously moved our cluster into a private subnet inside our AWS VPC so that all traffic from our website would appear to come from a single IP address. This meant that ssh traffic couldn’t reach the machines. We reverted this change so that our nodes were temporarily publicly accessible, but then we discovered that our deploy machine didn’t have the SSH private key needed to access the cluster. We had originally set up our cluster with kops locally from one of our developers’ laptops, and by default kops copies the public SSH key at ~/.ssh/id_rsa.pub onto new instances. Eventually we were able to navigate through our network and get access to the kubelet logs, but they were unhelpful.
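
For anyone else stuck in the same place, getting to the kubelet logs looks roughly like this once you can actually reach a node (the IP and key path are placeholders):

# From a machine that can reach the node (e.g. a bastion host)
affinity$ ssh -i ~/.ssh/id_rsa admin@172.XX.XXX.XXX

# The kubelet runs under systemd, so its logs live in the journal
admin@ip-172-XX-XXX-XXX:~$ sudo journalctl -u kubelet --no-pager | tail -n 200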

What was helpful was what we learned about kops, the preferred way of setting up a new Kubernetes cluster on AWS. All of the issues we ran into are solved by kops. When creating a cluster with kops you can provide a public key via the --ssh-public-key option. kops can also set up your cluster in a private AWS subnet via the --topology private option, which also requires you to specify a networking option. (We went with --networking weave when we recreated our cluster again a few weeks later.) And to solve the issue of not being able to access the nodes in the private subnet, you can use the --bastion option to create a bastion host that has access via AWS’s security groups.
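
Putting those flags together, a hedged sketch of the cluster creation command (the cluster name, state store, and zones are placeholders, and exact flags may differ slightly between kops versions):

kops create cluster \
  --name k8s.example.com \
  --state s3://example-kops-state-store \
  --zones us-east-1a \
  --ssh-public-key ~/.ssh/id_rsa.pub \
  --topology private \
  --networking weave \
  --bastion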

Poor man’s reporting: Slack messages

Wanting to avoid another disaster where our whole site went down, we set up a simple cron job to check for unhealthy nodes in our cluster. Every 5 minutes it processes the output of kubectl get nodes,pods -o json and sends a message to Slack telling us which node is down and which pods were running on that node.
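
The core of that check is just a little shell around kubectl and jq. A simplified sketch (the Slack webhook URL is a placeholder, and our real script also reports which pods were on the bad node):

#!/bin/bash
# Find nodes whose Ready condition is not "True"
bad_nodes=$(kubectl get nodes -o json | jq -r '
  .items[]
  | select(.status.conditions[] | select(.type == "Ready" and .status != "True"))
  | .metadata.name')

# Post a Slack message for each unhealthy node
for node in $bad_nodes; do
  curl -s -X POST -H 'Content-type: application/json' \
    --data "{\"text\": \"Node $node is NotReady\"}" \
    "$SLACK_WEBHOOK_URL"   # placeholder: a Slack incoming webhook URL
done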

Suspicion and Conviction

Each time a node went down over the next few days, we got a list of the pods that were running on that node, and this eventually led us to a single suspect: the job that powers our Unanswered Emails feature uses an NLP library that runs on the JVM. To get an idea of how much memory it was using, we ssh’d into a Kubernetes node and asked the Docker daemon:

affinity$ ssh admin@172.XX.XXX.XXX
admin@ip-172-XX-XXX-XXX:~$ docker ps
Cannot connect to the Docker daemon. Is the docker daemon running on this host?
admin@ip-172-XX-XXX-XXX:~$ sudo -s
root@ip-172-XX-XXX-XXX:/home/admin$ docker ps
CONTAINER ID    COMMAND                      CREATED        STATUS
1234deadbeef    find-unanswered-emails.rb    3 hours ago    Up 3 hours
...
root@ip-172-XX-XXX-XXX:/home/admin$ docker stats
CONTAINER       CPU %     MEM USAGE / LIMIT       MEM %
1234deadbeef    12.31%    3.87 GiB / 7.307 GiB    18.13%
feed5678ba75    0.02%     90.26 MiB / 150 MiB     60.17%

Getting this sort of peek at what Kubernetes was actually doing was pretty cool. The kubelet just starts the Docker containers, and we’re able to use Docker itself to tell us what we want to know. docker ps shows all of the currently running containers, and we used it to find the container ID of the unanswered emails job. Then running docker stats gives us an updating dashboard of memory usage by container. And sure enough, there was the unanswered emails container, using well more than its fair share, and constantly increasing! We ran the job locally and confirmed that it did indeed use more and more memory over time. Efforts to find the source of the memory leak were unsuccessful though…

Digging around in the EC2 dashboard gave us another piece of information that allowed us to put together a more complete story. In the Description section of the unhealthy node we clicked the link next to “Root Device” to get to the page describing the EBS volume backing the instance. Clicking on the Monitoring tab for the EBS volume showed that when k8s-health-check reported the node going down, all the monitoring graphs spiked. All of a sudden idle time dropped to zero and read throughput skyrocketed. The evidence suggested that the unanswered emails job would use up all the memory on the node and cause the instance to start thrashing.

To view this data in a slightly more manageable way, we deployed Heapster into our cluster (which took all of a single command: Kubernetes can be pretty impressive). Heapster collects quantitative data from the cluster and provides Grafana as a graphing front-end with some very simple pre-built dashboards. One of these lets you view the memory usage of a single pod over time, and there was the unanswered emails job, steadily climbing. And, as we only discovered while writing this post, Heapster also collects the rate of major page faults, which spikes at the exact time k8s-health-check reported an issue!
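
For reference, the “single command” boils down to creating the manifests bundled with the Heapster repository. A sketch, assuming the standard InfluxDB + Grafana configs that ship in that repo:

git clone https://github.com/kubernetes/heapster.git
kubectl create -f heapster/deploy/kube-config/influxdb/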

k8s-health-check telling us that a node in our cluster has gone down.
Memory usage for our Unanswered Emails job increases without bound. Moments after the k8s-health-check message, the rate of Major Page Faults spikes on the node.

Resolution

We had found the root cause of our problem. Great! But how could we fix it? It turns out there was a pretty easy solution. Docker allows you to set memory limits on a container, and it will kill the container if it tries to use more than that. Kubernetes exposes this through memory limits. This provided a way to prevent the unanswered emails job from going too crazy, but not from getting assigned to a node alongside another memory-hungry process. To handle that case you can also give Kubernetes memory (and CPU) requests, which it uses to make scheduling decisions.
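
Concretely, these are set with a resources block on each container in the pod spec. A hedged example (the job name and numbers are illustrative, not our production values):

containers:
- name: find-unanswered-emails      # illustrative name
  image: example/consumers:latest   # placeholder image
  resources:
    requests:
      memory: "2Gi"    # the scheduler uses this to place the pod
      cpu: "500m"
    limits:
      memory: "4Gi"    # the container is killed if it exceeds this
      cpu: "1"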

With our shiny new Heapster monitoring, we looked at the recent history of all of our jobs, noted roughly how much memory each needed to run and how high it would spike, and set these as memory requests and limits on all of our specifications. This actually ended up being a pretty tedious and inexact process. Some of our jobs had very spiky memory usage, so it felt wasteful assigning them so much memory. Still, adding an extra machine (with just a few commands) is worth the peace of mind gained from knowing we won’t suffer another outage like that one.

Pain Points & Things to Watch Out For

Our experience with Kubernetes has for the most part been fantastic, but debugging this issue over the course of weeks was fairly painful. A lot of it was caused by our own inexperience, so here are our last bits of advice and wisdom that we gained from this harrowing adventure.

  • Set memory requests and limits on all of your controllers! If you don’t, Kubernetes will blindly schedule things, possibly leaving you with multiple memory-gobbling jobs on the same node. Be especially careful with any third-party services you add from a single YML file: they might not set requests or limits, and they might be placed in a different namespace where you’ll forget about them. (A quick way to audit this is sketched after this list.)
  • Deploy Heapster! But watch out for filesystem usage. Heapster can provide a lot of useful data about the individual jobs in your cluster, and it’s definitely better than no monitoring at all. But by default it just writes its data to ephemeral storage on the EBS volume attached to the EC2 instance, so if the pod ever gets evicted you’ll lose all of the collected data.
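
As a quick way to audit the first point, something along these lines (using jq) lists every container in every namespace that has no memory limit set:

# List namespace/pod/container for containers missing a memory limit
kubectl get pods --all-namespaces -o json | jq -r '
  .items[] as $pod
  | $pod.spec.containers[]
  | select(.resources.limits.memory == null)
  | "\($pod.metadata.namespace)/\($pod.metadata.name)/\(.name)"'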

Thanks to Adam Perelman for reading through early drafts of this.
