Managing Google Cloud Platform Project-Wide SSH Keys


Imagine that you own a bed and breakfast, and each time you have a guest, you add a new universal door code for them. You do this without removing the codes for the previous guests. Sure, some guests will overlap and need their own code, but there are risks when those codes remain active. What if a guest forgot something in their room and returns when someone else is staying there? Or worse, what if someone had a bad stay and returned to vent their anger? Unless the code were recent, it would be difficult to find out who was responsible, and this problem would only worsen over time.

Graphic showing the complexity and problems that can arise from having persistent SSH keys active project-wide.

How does this relate to SSH keys? When a user SSHs into any VM in a Google Cloud Platform Project, their public key is added to the project’s metadata, meaning this key could be used to access any VMs running in that project. Google makes it easy to block project-wide SSH keys from a given VM, but there's currently no automatic management of these keys, so they persist indefinitely.

Some Topics That Might Be Helpful

This post touches on concepts around Google Cloud Platform, Google Kubernetes Engine, Security Practices at Shopify, and Secure Shell Protocol. I will briefly explain each of these before we dig into the details of building SSH-Pruner.

Google Cloud Platform (GCP)

GCP provides compute resources when we need them to power our workloads (over one million shops and counting). You can read more about how Shopify works with Google in this post about Shopify’s infrastructure collaboration with Google.

Google Kubernetes Engine (GKE)

Just as GCP gives us the computing power we need to run all of those shops, Kubernetes helps us deploy, manage, and scale our applications to use that power.

Security Practices at Shopify

One of the reasons I love working at Shopify is our policy to default to open internally, which means we trust our employees to make smart decisions. This culture of trust flows down into development practices and protocols. We ensure that our products are secure, but we also trust our developers and don't unnecessarily impede development flows. This requires a lot of thought on the part of the various Security teams at Shopify. We put up "guardrails" to warn developers when they're about to do something that may not be secure and "paved roads" to make it easy for them to work and be effective in a secure way.

Secure Shell Protocol

Secure Shell Protocol (SSH) is a way to verify identity and log in remotely from one host to another. This blog post gives a quick explanation and example. Julia Evans has some amazing zines covering interesting networking tools, and I really dig the one about SSH.

The Potential Problem with SSH Keys

Many developers need to use the SSH protocol to securely connect from one computer to another. After completing work on the remote machine, we close the connection and move on. Before working on a security team, I didn’t think about managing SSH keys, and I suspect this is common.

In approaching this problem, I wondered why developers need low-level administrative access to VMs and servers at all. The main goal of a security team is to integrate secure practices (through tooling and automation) so that developers don’t need to think about them while they work. Shopify is a high-trust environment, so locking developers out of all VMs completely wasn’t the best option. While it’s important to minimize the need to SSH into VMs, there are some situations that require it, and we still needed to address the problem for those cases. Instead of removing a tool that developers find useful, the ultimate solution is to create or adopt tools better suited to their needs. To assess this, we use logging metrics to monitor how often developers use the Google Compute Engine service (which manages those VMs and facilitates SSH access) and gain insight into the most common use cases. We use the data gathered to plan out effective alternative options for most developers.

Why do developers need SSH to manage VMs?

Most often, individuals create these keys to debug issues with Kubernetes or Docker, but there are various reasons one might end up needing to SSH into a virtual machine.

Why do we need to worry about them?

When someone uses SSH to connect to a VM in Google Compute Engine, their public key is saved in the project-wide metadata. This allows a user (with the matching private key) to connect to any of the Linux VMs in the project with root access, unless a VM has project-wide public SSH keys blocked. This isn’t ideal, and there’s some inherent risk to manage. If one of these keys ends up in the wrong hands, there’s the potential to cause damage before being detected.
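
To make that concrete, here’s a minimal sketch of how you could inspect which public keys currently have project-wide access, assuming the Go client for the Compute Engine API (google.golang.org/api/compute/v1) and a placeholder project ID. It’s purely illustrative and not part of SSH-Pruner:

```go
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()

	// Authenticates with Application Default Credentials.
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("creating compute service: %v", err)
	}

	// "my-project" is a placeholder project ID.
	project, err := svc.Projects.Get("my-project").Do()
	if err != nil {
		log.Fatalf("getting project: %v", err)
	}

	md := project.CommonInstanceMetadata
	if md == nil {
		return // no project-wide metadata at all
	}

	// Project-wide SSH keys live under the "ssh-keys" metadata key;
	// Google's own keys live under "sshKeys".
	for _, item := range md.Items {
		if item.Key == "ssh-keys" && item.Value != nil {
			fmt.Println(*item.Value)
		}
	}
}
```

Every line printed by a snippet like this is a public key that can open any Linux VM in the project that hasn’t blocked project-wide keys.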

Note: Google Compute Engine has a feature called OS Login that allows administrators to control SSH access using IAM roles bound to an individual, group, or service account's identity in GCP. OS Login doesn't currently work with Google Kubernetes Engine, so we had to find our own solution.

Shopify’s Solution to Effective Key Management

My first task as a member of the Infrastructure Security team was to champion a project to create an application to silently and effectively manage SSH keys that persist inside our Google Cloud Platform project-wide metadata. I worked on this project mostly on my own, with regular support and reviews from my mentor. This high level of trust and autonomy allowed me to learn about the problem, think through the best way to solve it, and build new skills.

There had been an earlier attempt at solving the problem that I was able to work from. The initial fix was to iterate over all projects and delete all SSH keys for any project with compute enabled. We could then run the application at a regular interval and ensure that no keys were persisting.

There were a couple of problems with this approach:

  • Google’s own keys were being deleted with the others.
  • Some scripts would use a key and assume that it persisted for some time. Our program could cause a failure before they completed.
  • Adding keys back to the project metadata is a slow operation that must be repeated once per node for each key. When many keys need to be re-added at once, the process takes a very long time.

So how could we maintain maximum security for our projects but avoid the problems above?

Note: We do a lot of security through Google groups, but SSH keys could allow access even after group membership changes. For many of our other systems, we use OAuth 2 to authenticate users. Ideally we wouldn't need to SSH in at all.

For the solution to be successful, it had to meet the following criteria:

  • Ensure that SSH keys do not persist longer than necessary.
  • Have minimal to no impact on developer workflows.
  • Distinguish between Shopify's keys and the keys that Google uses to manage the infrastructure. Make sure both exist only where they are supposed to, and don't delete Google's keys.
  • Create useful logging so we can better understand why/when/who is creating these keys.

Let’s go back to the bed and breakfast analogy. You could issue temporary codes for the duration of each guest’s visit that expire once they no longer need access. It would also be prudent to keep an easily searchable list of which code is assigned to which guest, so that if a door code were used in an unexpected way, you could easily trace the guest responsible. This controls how long a guest has access to their room, and it’s very similar to the concept of “just-in-time access” well known in computer security.

We looked into the option of adding an expiration date to the public keys that don’t already have one. This would allow developers to create a key that would last as long as they need and provide a way for us to ensure keys aren’t sticking around longer than a given threshold.

Unfortunately, when we started testing this out, we realized that it wouldn’t work as expected. SSH-Pruner would go through the keys in the project metadata and remove any that were invalid or expired. Then it would add an expireOn to any that didn’t have one. After checking all of the keys, the metadata would be overwritten with the changes. But the next time a developer accessed a VM using SSH, the gcloud command line tool used to access Google Compute Engine would simply add their key back to the metadata without an expireOn, and we'd be back where we started.
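
For reference, Google’s documented format for project metadata SSH keys looks roughly like this (USERNAME and KEY_VALUE are placeholders). Keys added by a plain gcloud compute ssh use the first form; keys with an expiry use the google-ssh form with an expireOn field:

```
# without an expiry (the default when gcloud adds a key):
USERNAME:ssh-rsa KEY_VALUE USERNAME@example.com

# with an expiry, using the google-ssh format:
USERNAME:ssh-rsa KEY_VALUE google-ssh {"userName":"USERNAME@example.com","expireOn":"2021-06-14T16:59:03+0000"}
```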

To test this, we did the following:

  1. Selected a non-production project.
  2. Connected to a VM using SSH twice, once with an expireOn and once without.
  3. Checked the project metadata.
  4. Ran SSH-Pruner on the project, noting that it removed the key we added without an expireOn and left the other.
  5. Checked the metadata.
  6. Connected to a VM in the project again without specifying expireOn.
    Note that a second copy of the SSH key was added to the metadata with no expireOn. This defeats the purpose of adding an expireOn to keys that lack one: the next time the owner of the key uses gcloud compute ssh to access an instance, the key is added back without an expireOn.

Next we needed to figure out how to reimplement the original solution without causing the same problems as before.

Here's what we needed to do to make this happen:

  • Create an application that would iterate through the projects with compute enabled and remove any SSH keys in the project-wide metadata without a valid expiration date.
  • Remove any of our keys stored in places where only Google's keys are expected.
  • Preserve the unexpired keys in the form in which we received them.
  • Select an iteration time that would leave us well-protected but also have minimal impact on developers.
  • Create robust and comprehensive tests for this application.

We looked to see what we could reuse from the original project (while taking advantage of the opportunity for me to work on my Go skills) and decided the best path would be to start fresh and create a brand new project. This allowed me to learn more about writing complex Go applications while ensuring that I was following best practices and using the most up-to-date API versions. It also forced me to be conscious of what each piece of the code was doing, avoiding the mistake of copying and pasting someone else’s code without a complete understanding of the functionality.

What does SSH-Pruner actually do when it's run?

  • Uses the Google Cloud APIs to grab all the projects for our organization.
  • Iterates through the projects and checks which projects have compute.googleapis.com enabled, indicating that they could contain VMs.
  • Gets the metadata from each project that had compute enabled.
  • Skips any project metadata objects that aren't called “ssh-keys” (where our keys go) or “sshKeys” (where Google's keys go).
  • Reads the metadata line by line.
  • Parses each line into a struct (see the sketch after this list).
  • Adds only keys that have not expired to a new metadata object.
  • Sets the commonInstanceMetadata to the new, pruned metadata.
  • Produces effective, readable logs throughout this process.
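
As a rough illustration of the parse-and-filter steps, here’s a hedged sketch in Go of how one might parse a metadata line into a struct and decide whether to keep it. The type and function names are made up for this post (SSH-Pruner’s real code isn’t shown), and it assumes Google’s documented key formats:

```go
package sshkeys

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// sshKeyLine is an illustrative struct for one line of the "ssh-keys"
// metadata value; SSH-Pruner's real types aren't published in this post.
type sshKeyLine struct {
	Raw      string     // the line exactly as it appeared in the metadata
	User     string     // the part before the first ":"
	ExpireOn *time.Time // nil when the key has no google-ssh expiry
}

// googleSSHSuffix mirrors the JSON blob at the end of an expiring key:
//   USERNAME:ssh-rsa KEY_VALUE google-ssh {"userName":"...","expireOn":"..."}
type googleSSHSuffix struct {
	UserName string `json:"userName"`
	ExpireOn string `json:"expireOn"`
}

// parseLine extracts the username and, when present, the expiry from one
// metadata line. It keeps the raw line so unexpired keys can be written
// back exactly as they were received.
func parseLine(line string) (sshKeyLine, error) {
	key := sshKeyLine{Raw: line}

	user, rest, found := strings.Cut(line, ":")
	if !found {
		return key, fmt.Errorf("malformed key line")
	}
	key.User = user

	// Expiring keys end with: ... google-ssh {"userName":...,"expireOn":...}
	if _, suffix, ok := strings.Cut(rest, "google-ssh "); ok {
		var meta googleSSHSuffix
		if err := json.Unmarshal([]byte(suffix), &meta); err != nil {
			return key, err
		}
		// The documented timestamps look like 2021-06-14T16:59:03+0000.
		t, err := time.Parse("2006-01-02T15:04:05-0700", meta.ExpireOn)
		if err != nil {
			return key, err
		}
		key.ExpireOn = &t
	}
	return key, nil
}

// keep reports whether a key should survive pruning: per the behaviour
// described above, only keys with a valid, still-future expireOn are kept.
func keep(key sshKeyLine, now time.Time) bool {
	return key.ExpireOn != nil && key.ExpireOn.After(now)
}
```

Keeping the raw line around matters because one of the requirements above is to preserve unexpired keys in exactly the form they were received.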

For a common problem, this isn’t a straightforward solution! There is a handy guide in the Google Cloud documentation for managing SSH keys in metadata, but right at the top, you will see the following warning:

Caution: Managing SSH keys in metadata is only for advanced users who are unable to use other tools such as OS Login to manually manage SSH keys. If you manage SSH keys in metadata yourself, you risk disrupting the ability of your project members to connect to instances. Additionally, you risk allowing your instance to be accessed by users who aren't part of your project. For more information, see risks of manual key management.

Advanced user? That's questionable. Unable to use other tools? Yep. Risks? Sure, but they already exist.

The Google Cloud guide gives an overview of doing what we did in our project but skims over some of the details, especially when using the Go Google APIs. This is where we were able to do some improvising and creative coding. We were able to quickly identify the methods we would need to use and then fill in the stuff in between.
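
To give a sense of that "fill in the stuff in between" work, here’s a hedged sketch of a pruning pass over a single project using google.golang.org/api/compute/v1. It reuses the illustrative parseLine and keep helpers from the earlier sketch, and it omits project discovery, logging, and the special handling of the "sshKeys" entry; treat it as an approximation of the approach rather than SSH-Pruner’s actual source:

```go
package sshkeys

import (
	"context"
	"fmt"
	"strings"
	"time"

	compute "google.golang.org/api/compute/v1"
)

// pruneProject rewrites one project's "ssh-keys" metadata entry so that only
// keys with a valid, unexpired expireOn survive. Handling of the "sshKeys"
// entry (where only Google's keys belong) is left out here for brevity.
func pruneProject(ctx context.Context, svc *compute.Service, projectID string, now time.Time) error {
	project, err := svc.Projects.Get(projectID).Context(ctx).Do()
	if err != nil {
		return fmt.Errorf("get project %s: %w", projectID, err)
	}

	md := project.CommonInstanceMetadata
	if md == nil {
		return nil // nothing to prune
	}

	changed := false
	for _, item := range md.Items {
		if item.Key != "ssh-keys" || item.Value == nil {
			continue // everything except the SSH key entry is left untouched
		}

		var kept []string
		for _, line := range strings.Split(*item.Value, "\n") {
			if line == "" {
				continue
			}
			key, err := parseLine(line) // from the earlier sketch
			if err == nil && keep(key, now) {
				kept = append(kept, key.Raw) // preserved exactly as received
			} else {
				changed = true
			}
		}
		pruned := strings.Join(kept, "\n")
		item.Value = &pruned
	}

	if !changed {
		return nil
	}

	// md still carries the fingerprint returned by the Get call, which
	// Compute Engine uses to detect concurrent metadata modifications
	// when the pruned metadata is written back.
	_, err = svc.Projects.SetCommonInstanceMetadata(projectID, md).Context(ctx).Do()
	return err
}
```

Writing back the same metadata object returned by the Get call is a deliberate choice in this sketch: it keeps the untouched entries and the fingerprint intact, so the update only replaces the pruned SSH key list.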

We’re already running SSH-Pruner as a cron job on some projects and will be rolling it out to the rest in the near future. This will require a slow rollout to ensure that we aren’t removing keys so frequently that it disrupts workflows. It will still require some creativity on the part of my team to ensure that we’re providing the highest level of security with the lowest impact on developers.

Key Takeaways

While working on this project, I was reminded that there’s value in testing early and often: it prevents investing too much time in an imperfect solution.

If you’re dealing with sensitive data or accessing important resources or systems, it’s always worth taking precautions to maintain secure practices. Sometimes a simple problem can have a more complex solution, but it’s worth being proactive and overprotective when it comes to security.

Often security vulnerabilities aren’t an issue until someone exploits them, and when you look at something like persistent SSH keys you might think it's not worth addressing because the risk seems so low. One of the things that I love about working in Trust at Shopify is that we aim to do more than just put out fires. We use our passion and interest in the field to seek out ways to address vulnerabilities before they become a problem. I think that this issue is a great example of the simple, yet impactful projects that our teams take on.

Cailyn is a Dev Degree intern at Shopify, currently on the Network Platform team after an eight-month placement working on multi-tenancy on the Infrastructure Security team and a 12-month placement as a backend Ruby on Rails developer on the Orders team. She started at Shopify in the Fall of 2018 and has been learning, growing, and having a blast ever since. Want to learn like Cailyn? Check out our Dev Degree program.