Shopify is an all-in-one commerce platform that serves over 1M+ merchants in approximately 175 countries across the world. Many of our merchants prepare months in advance for their biggest shopping season of the year, and they trust us to help them get through it successfully. As our merchants grow and their numbers increase, we must scale our platform without compromising on our stability, performance, and quality.
With Black Friday and Cyber Monday (BFCM) being the two biggest shopping events of the year and with other events on the horizon, there is a lot of preparation that Shopify needs to do on our platform. This effort needs a key driver to set expectations for many teams and hold them accountable to complete the work for their area in the platform.
I’m an Engineering Program Manager (EPM) with a focus on platform quality and was one of the main program managers (PgM) tapped midway in the year to drive these efforts. For this initiative I worked with three Production Engineering leads (BFCM leads) and three other program managers (with a respective focus in resiliency, scale, and capacity) to:
- understand opportunities for improvement
- build out a program that’s effective at scale
- create adjustments to the workflow specifically for BFCM
- execute the program
- start iterating on the program for next year.
Understanding our Opportunities
Each year, the BFCM leads start a large cross company push to get the platform ready for BFCM. They ask the teams responsible for critical areas of the platform to complete the following prep:
- focus on addressing the deprioritized maintenance from earlier in the year
- run game days
- load test
- scale-up capacity resources for any critical areas.
Looking at the past years, the BFCM leads chosen to champion this in spend a significant time on administrative, communication, and reporting activities when their time is better spent in the weeds of the problems. Our PgM group was assigned to take on these responsibilities so that these leads could focus on investigating the technical challenges and escalations.
Before jumping into solutions, our PgM group looked into the past to find lessons to inform the future. In looking at past retrospective documents we found some common themes over the years that we needed to keep in mind as we put together our plan:
- Shopify needs to prepare in advance for supporting large merchants with lots of popular inventory to sell.
- Scaling trends weren’t just on the two main days. Sales were spreading out through the week, and there were pre sale and post sale workflows where we needed to be well tested for how much load we could sustain without performance issues.
- There were some parts of the platform tied to disruptions in the past that would require additional load testing to give us more confidence in their stability.
With Shopify moving to Digital by Default and the increasing number of timezones to consider, there were more complexities to getting the company aligned to the same goals and schedule. Our PgM group wanted to create structure around coordinating a large scale effort, but we also wanted to start thinking about how maintenance and prep work can be done throughout the year, so we’re ready for any large shopping event regardless of the time of year.
Building the Program Plan
Our PgM group listed all the platform preparation tasks the BFCM leads asked developers to do in the past. Then we highlighted items that had to happen this year and took note of when they needed to happen. After this, we asked the BFCM leads to highlight the important things critical for their participation and then we assigned the rest of the work for our PgM group to manage.
Once we had those details documented, we created a communication plan calendar (a.k.a spreadsheet) to see what was next, week over week. We split the PgM work into workstreams then we each selected ones respective to our areas of focus. In my workstream I had two main responsibilities:
- Put together a plan to get people assigned to do the platform preparation work that the BFCM leads wanted them to.
- Determine what kind of PRs should or should not be shipped to production in the month before and after BFCM.
For platform preparation work listed earlier, I asked teams to identify which areas of the platform that need prepping for the large shopping event. Even with a reduced set of areas to focus on there were still quite a bit of people that I would need to get this prep work assigned to. Instead of working directly with every single person, I used a distributed ownership model. I asked each GM or VP with critical areas to assign a champion from their department to work with. Then I reached out to the champions to let them know of the prep work that needed to be done. They then either assigned the work themselves or they assigned people from their team. To keep track of this ownership I built a tracking spreadsheet and set up a schedule to report on progress week over week.
In the past, our Deploys team would lock the ability to automatically deploy to production for a week. Since BFCM is becoming more spread out year after year, we realized we needed to adjust our culture around shipping in the last two months of the year to make sure we could confidently provide merchants with a resilient platform. Merchants were also needing to train up staff further in advance of the year so we also had to consider slowing down new features that could require extra training for their staff. To start tackling these challenges I asked:
- Teams to take inventory of all of our platform areas and highlight which areas were considered critical for the merchant experience.
- That we set up a rule in a bot we call Caution Tape to comment a thorough risk to value assessment on any new PRs created between November to December in repos that had been flagged as critical to successful large shopping events.
If the PRs were proposing a merchant facing feature the Caution Tape bot message asked that they document the risks vs the value to shipping around BFCM and that they only ship if approved by a director or GM in their area. In many cases the people creating these PRs either investigated a safer approach, got more thorough reviews, or decided to wait until next year to launch the feature.
To artificially slow down the rate of items being shipped to production we planned to reduce the amount of PRs that could be shipped in a deploy and increase the amount of time they would spend in canaries (pre-production deploy test). On top of this we also planned to lock deploys for a week around the BFCM weekend.
Executing the Program
How do you rally 2500+ people around a mission who are responsible for 1000+ deploys across all services? You state your expectations and then repeat many times in different ways.
1. Our PGM and BFCM lead group had two main communication options where we started engagement with the rest of the engineering group working on platform prep:
- Shared Slack channels for discussions, questions, and updates.
- GitHub repos for assigning work.
2. Our PGM group put together and shared internal documentation on the program details to make it easier to onboard participants to the program.
3. Our PGM group shared high-level announcements, reminders and presentations throughout the year, increasing in frequency leading up to BFCM, to increase awareness and engagement. Some examples of this were:
- progress reports on each department posted in Slack.
- live-Streamed and recorded presentations on our internal broadcasting channel to inform the company about our mission and where to go for help.
- emails sent to targeted groups to remind people of their respective responsibilities and deadlines.
- GitHub issues created and assigned to teams with a checklist of the prep work we had asked them to do.
To make sure our BFCM leads had the support they needed our PgM group had regular check-in meetings with them to get a pulse on how they were feeling things were going. To make sure our PgM group was on top of the allocated tasks each week we had meetings at the start and end of each week. Then we hosted office hours along with the BFCM leads for any developers that wanted facetime to flag any potential concerns about their area.
Celebrations and Lessons Learned
Overall I’d say our program was a success. We had a very successful BFCM with sales of $5.1+ billion from the more than one million Shopify-powered brands around the world. We found that our predictions for which areas would take the most load were on target and that the load testing and spinning up of resources paid off.
From our internal developer view we had success in the sense that shipping to the platform didn’t need to come to a full stop. PR reviews were at an all time high which meant that developers focus on quality was at an all time high. For the areas where we did have to slow down on shipping code for features we found that our developers had more time to work on the other important aspects of work that needs to be done in engineering. Teams were able to
- focus more on clean up tasks
- write blog posts
- put together strategic roadmaps and architecture design docs
- plan team building exercises.
Overall we still did take a hit in developer productivity and we could have been a bit more relaxed on how long we enforced the extra risk to value assessment on our PRs and expectations on deploying to production. Our PGM team hopes to find a more balanced approach for this for next year's plan.
From a communication standpoint, some of the messaging to developers was inconsistent on whether or not they could ship to critical areas of the platform during November and December. Our PGM group also ended up putting together some of the announcement drafts last minute so in future years we want to have this included in our communication plan from the start with templates ready to go.
Our PGM group is hoping to have a retrospective meeting later this year with the BFCM leads to see how we can adjust the program plan for next year. We will be taking everything we learned and find opportunities where we can automate some of the work or distribute the work throughout the year so we can be always ready for any large shopping event in the year.
If you have a large initiative at your company, consider creating a role for people technical enough to be dangerous that can help drive engineering initiatives forward and work with your top developers to maximise their time and expertise to solve the big complex problems and get shit done.
Lisa Vanderschuit is an Engineering Program Manager who manages the engineering theme of Code Quality. She has been at Shopify for 6 years, working on areas from editing and reviewing Online Store themes to helping our engineering teams raise the bar of code quality at Shopify.
How does your team leverage program managers at your company? What advice do you have for coordinating cross company engineering initiatives? We want to hear from you on Twitter at @ShopifyEng.
We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.