Data is key to making great decisions at scale, so it’s no surprise that all new hires in RnD (Research & Development) at Shopify are invited to take a 90-minute hands-on workshop to learn about Shopify’s data practices, architecture, and some SQL. Since RnD is a multi-disciplinary organization, participants come from a variety of technical backgrounds ( engineering, data science, UX, project management) and sometimes non-technical disciplines that work closely with engineering teams. We split the workshop into two groups: a beginner-friendly group that assumes no prior knowledge of SQL or other data tools, and an intermediate group for those with familiarity.
Beginners have little-to-no experience with SQL, although they may have experience with other programming languages. Still, many participants aren’t programmers or data analysts. They’ve never encountered databases, tables, data models, and have never seen a SQL query. The goal is to empower them to responsibly use the data available to them to discover insights and make better decisions in their work.
That’s a lofty goal to achieve in 90 minutes, but creating an engaging, interactive workshop can help us get there. Here are some of the approaches we use to help participants walk away equipped and excited to start using data to power their work.
1. Answer the Basics: Where Does Data Come From?
Before learning to query and analyze data, you should first know some basics about where the data comes from and how to use it responsibly. Data Scientists at Shopify are responsible for acquiring and preprocessing data, and generating data models that can be leveraged by anyone looking to query the data. This means participants of the workshop, who aren’t data scientists, won’t be expected to know how to collect and clean big data, but should understand how the data available to them was collected. We take the first 30 minutes to walk participants through the basics of the data warehouse architecture:
- Where does data come from?
- How does it end up in the data warehouse?
- What does “raw data” vs. “cleaned/modelled data” mean?
Data Warehouse Architecture
We also caution participants to use data ethically and responsibly. We warn them not to query production systems directly, rely on modelled data when possible, and respect data privacy. We also give them tools that will help them find and understand datasets (even if that “tool” is to ask a data scientist for help). By the time they are ready to start querying, they should have an appreciation for how datasets have been prepared and what they can do with them.
2. Use Real Data and Production Tools
At Shopify, anyone in RnD has access to tools to help them query and analyze data extracted from the ecosystem of apps and services that power Shopify. Our workshop doesn’t play around with dummy data—participants go right into using real data from our production systems, and learn the tools they’ll be using throughout their career at Shopify.
If you have multiple tools available, choose one that has a simple querying interface, easy chart-building capabilities, and is connected (or can easily be connected) to your data source. We use Mode Analytics, which has a clean console UI, drag-and-drop chart and report builders, clean error feedback, and a pane for searching and previewing data sources from our data lake.
In addition to choosing the right tool, the dataset can make or break the workshop. Choosing a complex dataset with cryptic column names that is poorly documented, or one that requires extensive domain knowledge will draw the attention of the workshop away from learning to query. Instead, participants will be full of questions about what their data means. For these reasons, we often use our support tickets dataset. Most people understand the basic concepts of customer support: a customer who has an issue submits a ticket for support and an agent handles the question and closes the ticket once it’s been solved. That ticket’s information exists in a table where we have access to facts like when the ticket was created, when it was solved, who it was handled by, what customer it was linked to, and more. As a rule of thumb, if you can explain the domain and the structure of the data in 3 sentences or less, it’s a good candidate to use in a beginner exercise.
3. Identify Your Objectives
To figure out what your workshop should touch, it’s often helpful to think about the types of questions you want your participants to be able to answer. At the very least, you should introduce the SELECT, FROM, WHERE, ORDER BY, and LIMIT clauses. These are foundational for almost any type of question participants will want to answer. Additional techniques can depend on your organization’s goals.
Some of the most common questions we often answer with SQL include basic counts, trends over time, and segmenting metrics by certain user attributes. For example, a question we might get in Support is “How many tickets have we processed monthly for merchants with a Shopify Plus plan?”, or “What was the average handling time of tickets in January 2020?”
For our workshop, we teach participants foundations of SQL that include the keywords SELECT, DISTINCT, FROM, WHERE, JOIN, GROUP BY, and ORDER BY, along with functions like COUNT, AVG, and SUM. We believe this provides a solid foundation to answer almost any type of question someone outside of Data Science could be able to self-solve.
4. Translate Objectives Into Real Questions
Do you remember the most common question from your high school math classes? If it was anything like my classroom, it was probably, “When will we actually need to know this in the real world?” Linking the techniques to real questions helps all audiences grasp the “why” behind what we’re doing. It also helps participants identify real use cases of the queries they build and how they can apply queries to their own product areas, which motivates them to keep learning!
Once you have an idea what keywords and functions you want to include in your workshop, you’ll need to figure out how you want to teach them to your participants. Using the dataset you chose for the workshop, construct some interesting questions and identify the workshop objectives required for each question.
Identifying Objectives and Questions
It also helps to have one big goal question for participants to aim to answer. Ideally, answering the question should result in a valuable, actionable insight. For example, using the domain of support tickets, our goal question can be “How have wait times for customers with a premium plan changed over time?” Answers to this question can help identify trends and bottlenecks in our support service, so participants can walk away knowing their insights can solve real problems.
A beginner-friendly exercise should start with the simplest question and work its way toward answering the goal question.
5. Start With Exploration of Data Sources
An important part of any analysis is the initial exploration. When we encounter new data sources, we should always spend time understanding the structure and quality of the data before trying to build insights off it. For example, we should know what the useful columns for our analysis will be, the ranges for any numerical or date columns, the unique values for any text columns, and the filtering we’ll need to apply later, such as filtering out test tickets created by the development team.
The first query we run with participants is always a simple “SELECT * FROM {table}”. Not only does this introduce participants to their first keywords, but it gives us a chance to see what the data in the table looks like. We then learn how to select specific columns and apply functions like MIN, MAX, or DISTINCT to explore ranges.
6. Build on Each Query to Answer More Complex Questions
Earlier, we talked about identifying real questions we want to have participants answer with SQL. It turns out that it only really requires one or two additional keywords or functions to answer increasingly-difficult questions from a simple query.
We discussed how participants start with a simple “SELECT * FROM {tickets_table}” query to explore the data. This forms the foundational query used for the rest of the exercise. Next, we might start adding keywords or functions in this sequence:
Question → Objective
- How many tickets were created last month? → Add COUNT and WHERE.
- What is the daily count of tickets created over the last month? → Add GROUP BY and date-parsing functions (e.g. Presto’s DATE_TRUNC function to convert a timestamp to a day).
- How do the metrics reported in #2 change by the customer’s plan type? → Add JOIN and GROUP BY multiple columns.
Each question above can be built from the prior query and the addition of up to two new concepts, and already participants can start deriving some interesting insights from the data!
7. Provide Resources to Keep the Learning Journey Going
Unfortunately, there’s only so far we can go in a single 90-minute workshop and we won’t be able to answer every question participants may have. However, there are many free resources out there for those who want to keep learning. Before we part, here’s a list of my favourite learning resources from beginner to advanced levels.
Why Teach SQL to Non-Technical Users?
I haven’t run into anyone in RnD at Shopify who hasn’t used SQL in some capacity. Those who invest time into learning it are often able to use the available data tools to 10x their work. When employees are empowered to self-serve their data, they have the ability to quickly gather information that helps them make better decisions.
There are many benefits to learning SQL, regardless of discipline. SQL is a tool that can help us troubleshoot issues, aggregate data, or identify trends and anomalies. As a developer, you might use SQL to interact with your app’s databases. A UX designer can use SQL to get some basic information on the types of users using their feature and how they interact with the product. SQL helps technical writers explore pageview and interaction data of their documents to inform how they should spend their time updating documents that may be stale or filling resource gaps. The opportunities are endless when we empower everyone.
Learn SQL!
Learn SQL from the ground up with these free resources! Don’t be discouraged by the number of resources shared below - pick and choose which resources best fit your learning style and goals. See the comments section to understand the key concepts you should learn. If you choose to skip a resource, make sure you understand any concepts listed before moving on.
Database theory
Database management systems allow data to be stored and accessed in a computer system. At Shopify, many applications store their data inRelational Database Management Systems (RDMS). It can be useful to understand the basics behind RDMS since much of the raw data we encounter will come from these databases, which can be “spoken to” with SQL. Learning about RDMS also exposes you to concepts like tables, keys, and relational operations that are helpful in learning SQL.
Source |
Comments |
This free course introduces databases for developers. Complete the 1st lesson to learn about data and tables. Complete the rest of the lessons if they interest you, or continue on with the rest of the resources |
|
This free MOOC from Stanford University is a great resource for tailoring your learning path to your specific goals. Unfortunately as of March 2020, the Lagunita online learning platform was retired and they are working on porting the course over to edx.org. Stay tuned for these resources to come back online! When they become available, check out the following videos: Introduction, Relational Model, Querying Relational Databases Some terms you should be familiar with at this point: data model, schema, DDL (Data Definition Language), DML (Data Manipulation Language), Columns/attributes, Rows, Keys (primary and foreign keys), Constraints, and Null values |
|
While the Stanford MOOC is being migrated, this course may be the next best thing. It covers many of the concepts discussed above |
|
This is another comprehensive course that covers relational models and querying. This is part of a specialization focusing on business intelligence warehousing, so there’s also some exposure to data modelling and normalization (advanced concepts!) |
Querying with SQL
Before learning the SQL syntax, it’s useful to familiarize yourself with some of the concepts listed in the prior section. You should understand the concept of a table (which has columns/attributes and rows), data types, and keys (primary and foreign keys).
Beginner resources
Source |
Comments |
This series of interaction lessons will take you through the basics of SQL. Lessons 1-6 cover basic SQL exercises, and lessons 7-12 cover intermediate-level concepts. The remaining exercises are primarily for developers and focus on creating and manipulating tables. You should become familiar with SQL keywords like: SELECT, FROM, WHERE, LIMIT, ORDER BY, GROUP BY, and JOIN Intermediate-level concepts you should begin to understand include: aggregations, types of joins, unions |
|
This is a great read that breaks down all the keywords you’ve encountered so far in the most beginner-friendly way possible. If you’re still confused about what a keyword means, how to use it properly, or how to get the syntax right for the question you’re trying to answer (what order do these keywords go in?), this is a great resource to help you build your intuition. |
|
W3 has a comprehensive tutorial for learning SQL. I don’t recommend you start with w3, since it doesn’t organize the concepts in a beginner-friendly way as the other resources have done. But, even as an advanced SQL user, I still find their guides helpful as a reference. It’s also a good reference for understanding all the different JOIN types. |
|
For the audio-visual learners, this video lesson covers the basics of database management and SQL |
Intermediate resources
Source |
Comments |
|
Despite its name, this resource covers some topics I’d consider beyond the basic level. Learn about subqueries and aggregations.This site also has good notes on joins if you’re still trying to wrap your head around them. |
Complete the exercises here for some more hands-on practice |
|
Learn about the WITH keyword (also known as Common Table Expressions (CTE)) for writing more readable queries |
|
At Shopify, we use Mode to analyze data and create shareable reports. Mode has its own SQL tutorial that also shares some tips specific to working in the Mode environment. I find their section on the UNION operator more comprehensive than other sources. If you’re interested in advanced topics, you can also explore their advanced tutorials.> Complete some of their exercises in the Intermediate and Advanced sections. Notably, at this point you should start to become comfortable with aggregate functions - functions like COUNT, SUM, MIN, MAX, AVG, GROUP BY and HAVING keywords, CASE statements, joins, and date formats |
Here’s a full list of interactive exercises available:
- https://mode.com/sql-tutorial/sql-union/
- https://sqlzoo.net/wiki/SELECT_within_SELECT_Tutorial#Exercises
- https://sqlbolt.com/lesson/introduction
- http://www.sqlcourse.com/
- https://www.w3resource.com/sql-exercises/
- https://www.codewars.com/ (choose SQL as your language)
And finally, bookmark this SQL cheat sheet for future reference
Advanced resources
The journey to learning SQL can sometimes feel endless - there are many keywords, functions, or techniques that can help you craft readable, performant queries. Here are some advanced concepts that might be useful to add to your toolbelt.
Note: database systems and querying engines may have different functions built-in. At Shopify, we use Presto to query data across several datastores - consult the Presto docs on available functions
Source |
Comments |
Window functions are powerful functions that allow you to perform a calculation across a set of rows in a “window” of your data. The Mode SQL tutorial also shares some examples and practice problems for working with window functions |
|
Common Table Expressions, aka CTEs, allow you to create a query that can be reused throughout the context of a larger query. They are useful for enhancing the readability and performance of your query. |
|
|
The EXPLAIN keyword in Presto’s SQL engine allows you to see a visual representation of the operations performed by the engine in order to return the data required by your query, which can be used to troubleshoot performance issues. Other SQL engines may have alternative methods for creating an execution plan |
Grouping sets can be used in the GROUP BY clause to define multiple groupings in the same query |
|
Coalesce is used to evaluate a number of arguments and return the first non-NULL argument seen. For example, it can be used to set default values for a column holding NULL values. |
|
Explore functions for working with string values |
|
|
Explore functions for working with date and time values |
These “best practices” share tips for enhancing the performance of your queries |
|
Chris Saxon shares tips and tricks for advancing your SQL knowledge and tuning performance |
If you’re passionate about data at scale, and you’re eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.