The traditional model of running large-scale computer systems divides work into Development and Operations as distinct and separate teams. This split works reasonably well for computer systems that are changed or updated very rarely, and organizations sometimes require this if they’re deploying and operating software built by a different company or organization. However, this rigid divide fails for large-scale web applications that are undergoing frequent or even continuous change. DevOps is the term for a movement that’s gathered steam in the past decade to bring together these disciplines.
Until about a year ago, Shopify followed the traditional model and felt the pain of having ownership separated across teams. Developers were responsible for deploying changes, while three separate teams owned scaling, monitoring, and maintaining the runtime infrastructure respectively. Having many distinct teams with sometimes divergent goals trying to run the same production system created short-term chaos and made it hard to align on long-term goals.
We thought carefully about how to solve this problem in the right way. Running a large-scale web platform requires very deep operational skills in key areas such as networking, data storage, server management, scaling infrastructure, and transaction processing, so Shopify still required people dedicated to expertise in these areas. On the other hand, the company was building out products and features at blistering speed, so we couldn't accept any kind of organizational or technical barriers that would slow the rate of innovation.
The team examined approaches used at other large-scale companies, such as the models established by Facebook’s Production Engineering and Google’s Site Reliability Engineering (SRE). Both of these models involve a specialized team with a roughly 50% split between manual work and software development, either embedded with or working alongside feature development teams. (A detailed overview of the SRE model can be found in the Google SRE book.)
We adapted these ideas to suit our scale, with three key goals in reorganizing the teams:
- Focusing on automation over manual toil;
- Reducing the number of disconnected teams with operational responsibility; and
- Producing ready-to-use tools and infrastructure for all Shopify developers.
The resulting model was a new organization called Production Engineering. This organization brought together the various teams that previously split operations responsibility, along with a roughly equal number of developers from other development teams.
How The Shift Changed Things
The shift changed areas of responsibility in several meaningful ways.
Feature development teams gained much more self-service access to build and monitor the runtime infrastructure for their applications. Along with complete control over deployment, they could now set up the instrumentation and monitoring they needed and were on call to handle any availability problems with their applications.
Production Engineering became responsible for building and maintaining common infrastructure required by all applications at Shopify. This includes the deep technical infrastructure components such as the networking and data persistence stacks, managing the physical assets (thousands of servers and related networking and storage gear) in our data centres, and running the low level software infrastructure such as load balancers and our large-scale Docker container fleet that runs many of Shopify’s key applications.
Both Production Engineering and all the product development teams shared responsibility for ongoing operation of our end user applications. This means all technical roles share monitoring and incident response, with escalation happening laterally to bring in any skill set required to restore service in case of problems.
Production Engineering is also responsible for building world-class developer infrastructure to make development, deployment, monitoring, and maintenance a frictionless and delightful experience for all Shopify developers. This includes providing tools to support local development and testing, continuous integration and test infrastructure, and automating all repetitive tasks across development, deployment, and ongoing maintenance.
I want to underline the point about our focus on tools, as it was key to the reorganization. Asking development teams to take over aspects of operations on top of their regular work would just move the work around without solving any problems. Creating a Production Engineering team dedicated to building tools that make development, deployment, and maintenance tasks either completely automated or as simple as possible helped reduce the overall manual toil for all developers at Shopify.
One Year Later
After about a year of operating under this new structure, we’ve seen clear signs of progress with threefold improvements in deploy speed and release frequency of core applications, without impacting reliability. Developers at Shopify now release software changes into production on average about 150 times a day and the core back end Shopify commerce platform deploys new releases 30-40 times every day. Both are over triple what we’d been doing before forming Production Engineering. Sharing ownership of the production system created a strong sense of pride of ownership across the development team, resulting in higher quality and stability.
Dedicated teams within Production Engineering are now free to focus on building out long-term projects such as next-generation networking infrastructure, massive-scale data storage sharding, and automating everything from simple deployments to full data centre failover scenarios. Staffing the team with a strong dose of development skills has shifted the focus to building out automation rather than manual service wherever possible, which has a larger upfront cost that has been paying off over the longer term. For example, we’ve developed a common infrastructure for high-scale load tests, with automated weekly load tests of production applications to ensure we’re always ready to scale. We also built an incident response bot this past year to ensure consistent and rapid response to service interruptions.
Although not as easy to quantify, many team members have reported quality of life improvements. No longer having a small group of people on the critical path for monitoring and incident response has reduced burnout and resulted in a more sustainable operation. A larger pool of people on call has allowed rotations in which most people are never on call more than one week in six, whereas the hardest-hit teams were previously on call every second or third week. Extra days off are provided for each week of on call to help maintain balance and well-being.
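To make the rotation arithmetic concrete, here is a minimal sketch of a round-robin on-call schedule. It is an illustration only, not the scheduling tool used at Shopify: the pool of six names, the 12-week window, and the helper functions `build_rotation` and `max_on_call_share` are all assumptions made for the example.

```python
from itertools import cycle, islice


def build_rotation(engineers, weeks):
    """Assign one engineer per week, cycling through the pool in order."""
    return list(islice(cycle(engineers), weeks))


def max_on_call_share(schedule):
    """Return the largest fraction of weeks that any one person is on call."""
    counts = {}
    for name in schedule:
        counts[name] = counts.get(name, 0) + 1
    return max(counts.values()) / len(schedule)


# Hypothetical pool of six people sharing one rotation.
pool = ["ana", "ben", "chi", "dev", "eli", "fay"]
schedule = build_rotation(pool, weeks=12)

print(schedule[:6])                 # first full pass through the pool
print(max_on_call_share(schedule))  # share of weeks for the busiest person
```

With six people in the pool, each person covers one week in six. Shrinking the pool to two or three people in the same sketch reproduces the every-second-or-third-week load described above.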
A year after the launch of Production Engineering at Shopify, progress has been steady but there is more work to do. In particular, as the Engineering team scales up, we’re focusing on training to make sure new team members benefit from the lessons of the past. With hundreds of different software applications under development, both tools and infrastructure are now needed for a wider variety of applications from small web applications up to big data pipelines.
We continue to reinvent how to organize and build software as we grow, so watch this space for more stories as the journey continues. Any big change comes with complications and can create new challenges. We've seen numerous benefits, but in my next post we’ll talk about some of the challenges we faced launching Production Engineering, the solutions we used, and the lessons we’ve learned for other companies attempting such a shift.