Deleting the Undeletable

At Shopify, we analyze a variety of events from our buyers and merchants to improve their experience and the platform, and to empower their decision making. These events are collected via our streaming platform (Kafka) and stored in our data warehouse at a rate of tens of billions of events per day. Historically, these events were collected by different teams in an ad hoc manner and lacked any guaranteed structure, leading to usability and maintainability issues and to difficulties fulfilling data subject rights requests. The image below depicts how these events were collected, stored in our data warehouse, and used in other online dashboards.

[Animation: analytical events (yellow envelopes) flow through the Kafka pipeline and on to the data warehouse or the online dashboards.]
How events were collected, stored in our data warehouse, and used in other online dashboards in the old system.

Some of these events contained Personally Identifiable Information (PII), and in order to comply with regulations such as the European General Data Protection Regulation (GDPR), we needed to find data subjects' PII within our data warehouse and access or delete it (via privacy requests) in a timely manner. This quickly escalated into a very challenging task due to:

  • Lack of guaranteed structure and ownership: Most of these events were only meaningful to, and parsable by, their creators and didn't have a fixed schema. Further, there was no easy way to figure out who owned them. Hence, it was nearly impossible to automatically parse and search these events, let alone access and delete PII within them.

  • Missing data subject context: Even knowing where PII resided in this dataset wasn't enough to fulfill a privacy request. We needed a reliable way to know whom the PII belonged to and who the data controller was. For example, we act as a processor for our merchants when they collect customer data, so we're only able to process customer deletion requests when instructed by the merchant (the controller of that personal data).

  • Scale: The size of the dataset (on the order of petabytes) made any full search difficult, costly, and time consuming. In addition, the dataset continuously grows by billions of events per day, so any solution needs to be highly scalable to keep up with incoming online events as well as to process historic ones.

  • Missing dependency graph: Some of these events and datasets power critical tasks and jobs, and any disruption or change to them can severely affect our operations. However, because ownership and lineage information wasn't readily available for each event group, it was hard to determine the full scope of disruption should a dataset change.

So we were left with finding a needle in an ever-growing haystack. These challenges, as well as other maintainability and usability issues with this platform, presented a golden opportunity for the Privacy team and the Data Science & Engineering team to collaborate and address them together. The rest of this blog post focuses on that collaboration and the technical challenges we faced when addressing these issues in a large organization such as Shopify.

Context Collection

Lack of guaranteed schemas for events was the root cause of many of our challenges. To address this, we designed a schematization system that specifies the structure of each event, including the type of each field, evolution (versioning) context, ownership, and privacy context. The privacy context specifically covers marking sensitive data, identifying data subjects, and defining how PII should be handled.

Schemas are designed by data scientists or developers interested in capturing a new kind of event (or changing an existing one). They're proposed in a human-readable JSON format and then reviewed by team members for accuracy and privacy reasons. As of today, we have more than 4500 active schemas. This schema information is then used to enforce and guarantee the structure of every single event going through our pipeline at generation time.
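
To make this concrete, below is an illustrative sketch of what a trimmed signup event schema could look like. The exact field names, types, and privacy vocabulary here are simplified stand-ins for this post rather than our production format:

```json
{
  "name": "signup_event",
  "version": 2,
  "owner": "identity-team",
  "privacy_setting": {
    "data_controller": "shopify",
    "data_subject": "email"
  },
  "fields": [
    {
      "name": "email",
      "type": "string",
      "doc": "Email address the user signed up with",
      "privacy": {"pii_type": "email", "handling": "tokenize"}
    },
    {
      "name": "ip_address",
      "type": "string",
      "doc": "IP address of the signup request",
      "privacy": {"pii_type": "ip_address", "handling": "obfuscate"}
    },
    {
      "name": "shop_plan",
      "type": "string",
      "doc": "Plan selected at signup; contains no PII"
    }
  ]
}
```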

Let's read through this schema and see what we learn from it:

The privacy_setting section specifies whose PII this event includes by defining a data controller and a data subject. The data controller is the entity that decides why and how personal data is processed (Shopify in this example). The data subject designates whose data is being processed, which is tracked via the email (of the person in question) in this schema. It's worth mentioning that, generally, when we deal with buyer data, merchants are the data controller and Shopify plays the data processor role (a third party that processes personal data on behalf of a data controller).

Every field in a schema has a data type and a doc field, along with a privacy block indicating whether it contains sensitive data. The privacy block specifies what kind of PII is being collected under the field and how to handle that PII.

Our new schematization platform was successful in capturing the aforementioned context, and it significantly increased privacy education and awareness among our data scientists and developers, because schema proposals sparked discussions about identifying personal data fields. In the vast majority of cases, the proposed schema contained all the proper context, but when required, or in doubt, privacy advice was available. This showed that, given accurate and simple tooling, everyone is inclined to do the right thing and respect privacy. Lastly, the platform helped with reusability, observability, and streamlining common tasks for data scientists too. Our schematization platform signified the importance of capitalizing on shared goals across different teams in a large organization.

Personal Data Handling

At this point, we have flashy schemas that gather all the context we need regarding structure, ownership, and privacy for our analytical events. However, we still haven’t addressed the problem of how to handle and track personal information accurately in our data warehouse. In other words, after having received a deletion or access privacy request, how do we fetch and remove PII from our data warehouse?

The short answer: we won't store any PII in our data warehouse. To facilitate this, we perform two types of transformation on personal data before it enters the data warehouse. These transformations convert personal (identifying) data into non-personal (non-identifying) data, so there's no longer anything to remove or report. This sounds counterintuitive, since it seems the data might be rendered useless at this point. Instead, we preserve analytical value without storing raw personal data through what GDPR calls pseudonymisation: "the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information." In particular, we employ two types of pseudonymisation techniques: obfuscation and tokenization.

It's important to stress that personal data that has undergone pseudonymisation, and could still be attributed to a natural person by the use of additional information, directly or indirectly, is considered personal data under GDPR and requires proper safeguards. Hence, when we said we won't have any PII in our data warehouse, that wasn't entirely precise. However, this approach allows us to control personal data, reduce risk, and truly anonymize or remove PII when requested.

Obfuscation and Enrichment

In obfuscation, identifying parts of data are either masked or removed so that the people the data describes remain anonymous. Our obfuscation operators don't just remove identifying information; they also enrich the data with non-personal data, which often removes the need to store personal data at all. A crucial point, however, is to preserve the analytical value of these records so that they stay useful.

For example, when we obfuscate an IP address, we mask half of the bytes but include geolocation data at the city and country level. In most cases, this is what the raw IP address was intended to be used for in the first place. This had a big impact on adoption of our new platform and, in some cases, offered added value too.

Looking at different types of PII and how they're used, we quickly observed patterns. For instance, the main use case of a full user agent string is to determine the operating system, device type, and major version, which are shared among many users. But a user agent can also contain very detailed identifying information, including screen resolution, installed fonts, clock skew, and other bits that can identify a data subject, hence user agents are considered PII. So, during obfuscation, all identifying bits are removed and replaced with the generalized, aggregate-level data that data analysts actually seek. The table below shows some examples of different PII types and how they're obfuscated and enriched.

| PII Type | Raw Form | Obfuscated |
| --- | --- | --- |
| IP Address | 207.164.33.12 | {"masked": "207.164.0.0", "geo_country": "Canada"} |
| User agent | CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69 Instagram 8.4.0 (iPhone7,2; iPhone OS 9_3_2; nb_NO; nb-NO; scale=2.00; 750x1334 | {"Family": "Instagram", "Major": "8", "Os.Family": "iOS", "Os.Major": "9", "Device.Brand": "Apple", "Device.Model": "iPhone7"} |
| Latitude/Longitude | 45.4215° N, 75.6972° W | 45.4° N, 75.6° W |
| Email | john@gmail.com | REDACTED@gmail.com |
| Email | behrooz@example.com | REDACTED@REDACTED.com |

A keen observer might realize that some of the obfuscated data can still be unique enough to identify individuals. For instance, when a new device like an iPhone is released, there might be only a few people who own that device, leading to identification, especially when combined with other obfuscated data. To address these limitations, we hold a list of allowed devices, families, and versions that we're certain have enough unique instances (more than a set threshold) and gradually add to this list (as more unique individuals become part of that group). It's important to note that this still isn't perfect anonymization, and it's possible for an attacker to combine enough anonymized and other data to identify an individual. However, that risk and threat model isn't as significant within an organization where access to PII is already more easily available.
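
To make the obfuscation idea concrete, here's a minimal sketch of what an IP obfuscation operator and a device-model allowlist check might look like. The function names, the geolocation lookup, and the allowlist contents are hypothetical stand-ins for illustration, not our production implementation.

```python
# Hypothetical allowlist of device models with enough distinct users
# (above a set threshold) to be emitted as-is during obfuscation.
ALLOWED_DEVICE_MODELS = {"iPhone7", "iPhone8", "Pixel 4"}

def obfuscate_ip(raw_ip: str, geo_lookup) -> dict:
    """Mask the identifying half of an IPv4 address and enrich the record
    with coarse geolocation, which is usually what analysts wanted anyway."""
    octets = raw_ip.split(".")
    masked = ".".join(octets[:2] + ["0", "0"])  # 207.164.33.12 -> 207.164.0.0
    geo = geo_lookup(raw_ip)  # hypothetical lookup, e.g. {"country": "Canada", "city": "Ottawa"}
    return {
        "masked": masked,
        "geo_country": geo.get("country"),
        "geo_city": geo.get("city"),
    }

def generalize_device_model(model: str) -> str:
    """Only emit a device model once enough unique individuals share it;
    otherwise fall back to a generic bucket."""
    return model if model in ALLOWED_DEVICE_MODELS else "Other"
```

Calling obfuscate_ip("207.164.33.12", geo_lookup) would yield a record like the masked IP entry in the table above, with the raw address never reaching the data warehouse.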

Tokenization

Obfuscation is irreversible (the original PII is gone forever) and doesn't suit every use case. There are times when data scientists require access to the actual raw PII values (imagine preparing a list of emails for a promotional newsletter). To address these needs, we built a tokenization engine that exchanges PII for a consistent random token. We then store only tokens in the data warehouse, not the raw PII. A separate, secured vault service is in charge of storing the token-to-PII mapping. This way, if there's a delete request, only the mapping in the vault service needs removing, and all copies of the corresponding token across the data warehouse effectively become non-detokenizable (in other words, just random strings).

To understand the tokenization process better, let's go through an example. Let's say Hooman is a big fan of Allbirds and Gymshark products, and he purchases two pairs of shoes from Allbirds and a pair of shorts from Gymshark to hit the trails! His purchase data might look like the table below before tokenization:

| Email | Shop | Product | ... |
| --- | --- | --- | --- |
| hooman@gmail.com | allbirds | Sneaker | ... |
| hooman@gmail.com | Gymshark | Shorts | ... |
| hooman@gmail.com | allbirds | Running Shoes | ... |
After tokenization is applied, the table above looks like the table below:

| Email | Shop | Product | ... |
| --- | --- | --- | --- |
| Token123 | allbirds | Sneaker | ... |
| Token456 | Gymshark | Shorts | ... |
| Token123 | allbirds | Running Shoes | ... |

There are two important observations in the after tokenization table:

  1. The same PII (hooman@gmail.com) was replaced by the same token (Token123) under the same data controller (the allbirds shop) and data subject (Hooman). This is the consistency property of tokens.
  2. On the other hand, the same PII (hooman@gmail.com) got a different token (Token456) under a different data controller (the Gymshark shop) even though the actual PII remained the same. This is the multi-controller property of tokens, and it allows data subjects to exercise their rights independently with different data controllers (merchant shops). For instance, if Hooman wants to be forgotten or deleted from allbirds, that shouldn't affect his history with Gymshark.

Now let's take a look at how all of this information is stored within our tokenization vault service, shown in the table below.

| Data Subject | Controller | Token | PII |
| --- | --- | --- | --- |
| hooman@gmail.com | allbirds | Token123 | hooman@gmail.com |
| hooman@gmail.com | Gymshark | Token456 | hooman@gmail.com |
| ... | ... | ... | ... |
The vault service holds the token-to-PII mapping along with the privacy context, including the data controller and data subject. It uses this context to decide whether to generate a new token for the given PII or reuse an existing one. The consistency property of tokens allows data scientists to perform analysis without requiring access to the raw PII values. For example, all of Hooman's orders from Gymshark can be tracked just by looking for Token456 across the tokenized orders dataset.
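
As a rough illustration of this behaviour (an in-memory toy, not our actual vault service), the core logic might look like the following sketch:

```python
import secrets

class TokenizationVault:
    """Minimal sketch: maps (controller, data subject, PII) to a consistent random token."""

    def __init__(self):
        self._mapping = {}  # (controller, data_subject, pii) -> token

    def tokenize(self, controller: str, data_subject: str, pii: str) -> str:
        key = (controller, data_subject, pii)
        if key not in self._mapping:
            self._mapping[key] = "Token" + secrets.token_hex(8)
        return self._mapping[key]  # consistency: same privacy context and PII, same token

    def delete(self, controller=None, data_subject=None):
        """Drop every mapping matching the given privacy context. Once a mapping is gone,
        copies of its token in the data warehouse are just random strings."""
        self._mapping = {
            key: token for key, token in self._mapping.items()
            if not ((controller is None or key[0] == controller)
                    and (data_subject is None or key[1] == data_subject))
        }

vault = TokenizationVault()
t1 = vault.tokenize("allbirds", "hooman@gmail.com", "hooman@gmail.com")
t2 = vault.tokenize("allbirds", "hooman@gmail.com", "hooman@gmail.com")
t3 = vault.tokenize("Gymshark", "hooman@gmail.com", "hooman@gmail.com")
assert t1 == t2   # consistency property
assert t1 != t3   # multi-controller property

# "Forget Hooman at Gymshark": only the vault mappings are removed.
vault.delete(controller="Gymshark", data_subject="hooman@gmail.com")
```

Detokenization would be the reverse lookup, restricted to the vault; the data warehouse itself only ever sees tokens.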

Now, back to our original goal: let's review how all of this helps with deletion of PII from our data warehouse (access and reporting requests work similarly, except that instead of deleting the target records, they're reported back). If we store only obfuscated and tokenized PII in our datasets, there's essentially nothing left in the data warehouse to delete after removing the mapping from the tokenization vault. To understand this, let's go through some examples of deletion requests and how they affect our datasets as well as the tokenization vault.

| Data Subject | Controller | Token | PII |
| --- | --- | --- | --- |
| hooman@gmail.com | allbirds | Token123 | hooman@gmail.com |
| hooman@gmail.com | Gymshark | Token456 | hooman@gmail.com |
| hooman@gmail.com | Gymshark | Token789 | 222-333-4444 |
| eva@hotmail.com | Gymshark | Token011 | IP 76.44.55.33 |

Assume the table above shows the current contents of our tokenization vault, and these tokens are stored across our data warehouse in multiple datasets. Now Hooman sends a deletion request to Gymshark (the controller), and subsequently Shopify (the data processor) receives it. At this point, all that's required to delete Hooman's PII under Gymshark is to locate rows matching the following condition:

DataSubject == 'hooman@gmail.com' AND Controller == 'Gymshark'

Which results in the rows identified with a star (*) in the table below:

|   | Data Subject | Controller | Token | PII |
| --- | --- | --- | --- | --- |
|   | hooman@gmail.com | allbirds | Token123 | hooman@gmail.com |
| * | hooman@gmail.com | Gymshark | Token456 | hooman@gmail.com |
| * | hooman@gmail.com | Gymshark | Token789 | 222-333-4444 |
|   | eva@hotmail.com | Gymshark | Token011 | IP 76.44.55.33 |

Similarly, if Shopify needed to delete all of Hooman's PII across all controllers (shops), it would only need to look for rows that have Hooman as the data subject, highlighted below:

|   | Data Subject | Controller | Token | PII |
| --- | --- | --- | --- | --- |
| * | hooman@gmail.com | allbirds | Token123 | hooman@gmail.com |
| * | hooman@gmail.com | Gymshark | Token456 | hooman@gmail.com |
| * | hooman@gmail.com | Gymshark | Token789 | 222-333-4444 |
|   | eva@hotmail.com | Gymshark | Token011 | IP 76.44.55.33 |

Last but not least, the same approach applies to merchants too. For instance, assume (let's hope that never happens!) Gymshark (here the data subject) decides to close their shop and asks Shopify (the data controller) to delete all PII controlled by them. In this case, we could do a search with the following condition:

Controller == 'Gymshark'

Which results in the rows indicated in the table below:

|   | Data Subject | Controller | Token | PII |
| --- | --- | --- | --- | --- |
|   | hooman@gmail.com | allbirds | Token123 | hooman@gmail.com |
| * | hooman@gmail.com | Gymshark | Token456 | hooman@gmail.com |
| * | hooman@gmail.com | Gymshark | Token789 | 222-333-4444 |
| * | eva@hotmail.com | Gymshark | Token011 | IP 76.44.55.33 |

Notice that in all of these examples, there was nothing to do in the actual data warehouse: once the token ↔ PII mapping is deleted, the tokens effectively become consistent random strings. In addition, all of these operations take fractions of a second, whereas doing any task in a petabyte-scale data warehouse can be very challenging, time consuming, and resource intensive.

Schematization Platform Overview

So far we’ve learned about details of schematization, obfuscation, and tokenization. Now it’s time to put all of these pieces together in our analytical platform. The image below shows an overview of the journey of an event from when it’s fired until it’s stored in the data warehouse:

[Animation: an analytical event travels from the producer through the Kafka pipeline, where it's checked against its schema from the Schema Repository and exchanges PII for tokens with the Tokenization Vault, before landing in the data warehouse.]

In this example (a simplified sketch of the scrubbing step follows the list):

  1. A SignUp event is triggered into the messaging pipeline (Kafka).
  2. A tool, the Scrubber, intercepts the message in the pipeline and applies pseudonymisation to the content using the predefined schema fetched from the Schema Repository for that message.
  3. The Scrubber identifies that the SignUp event contains tokenization operations too. It then sends the raw PII and privacy context to the Tokenization Vault.
  4. The Tokenization Vault exchanges the PII and privacy context for a token and sends it back to the Scrubber.
  5. The Scrubber replaces the PII in the content of the SignUp event with the token.
  6. The new anonymized and tokenized SignUp event is put back onto the message pipeline.
  7. The PII-free SignUp event is stored in the data warehouse.
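
Putting the pieces together, the scrubbing step might look roughly like the sketch below. The schema layout, field names, and vault interface are simplified placeholders consistent with the earlier sketches, not the real Scrubber code.

```python
def scrub_event(event: dict, schema: dict, vault, obfuscators: dict) -> dict:
    """Apply each field's privacy handling, as declared in the schema, before the
    event is written back to the pipeline and, eventually, the data warehouse.
    `vault` is a tokenization client; `obfuscators` maps PII types to operators
    like the ones sketched earlier."""
    privacy = schema["privacy_setting"]
    controller = privacy["data_controller"]         # e.g. "shopify"
    data_subject = event[privacy["data_subject"]]   # e.g. the raw email field

    scrubbed = {}
    for field in schema["fields"]:
        name = field["name"]
        value = event.get(name)
        handling = field.get("privacy", {}).get("handling")
        if handling == "obfuscate":
            scrubbed[name] = obfuscators[field["privacy"]["pii_type"]](value)
        elif handling == "tokenize":
            scrubbed[name] = vault.tokenize(controller, data_subject, value)
        else:
            scrubbed[name] = value  # non-sensitive fields pass through unchanged
    return scrubbed
```

The result is an event with the same shape as the original, but whose sensitive fields now hold only obfuscated values or tokens, so it's safe to land in the data warehouse.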

In theory, this schematization platform allows for a PII-free data warehouse for all new incoming events; in practice, however, some challenges still needed to be addressed.

Lessons from Managing PII at Shopify Scale

Despite having a sound technical solution for classifying and handling PII in our data warehouse, Shopify's scale made adoption and reprocessing of our historic data a difficult task. Here are some lessons that helped us in this journey.

Adoption

Having a solution and adopting it are two different problems. Initially, with a sound prototype ready, we struggled to get approval and commitment from all stakeholders to implement the new tooling, and rightly so. Looking back at all of these proposed changes and tools to an existing platform, it does seem like open heart surgery, and of course you'd likely face resistance. There's no bulletproof solution to this problem, or at least none that we knew of! Let's review a few factors that significantly helped us.

Make the Wrong Thing the Hard Thing

Make the right thing the default option. A big factor in the success and adoption of our tooling was making it the default and easy option. Nowadays, creating and collecting unstructured analytical events at Shopify is difficult and goes through a tedious process with several layers of approval, whereas creating structured, privacy-aware events is a quick, well documented, and automated task.

“Trust Me, It Will Work” Isn’t Enough!

Proving the scalability and accuracy of the proposed tooling was critical to building trust in our approach. We proved correctness with the same mechanism the Data Science & Engineering team already uses: reconciliation. We showed the scalability of our tooling by testing it on real datasets and stress testing it under loads orders of magnitude higher than production.

Make Sure the Tooling Brings Added Value

Our new tooling is not only the default and easy way to collect events, but also offers added value and benefits such as:

  • Streamlined workflow: No need for multiple teams to worry about compliance and fulfilling privacy requests.
  • Increased data enrichment: For instance, geolocation data derived from IP addresses, or family and device info derived from user agent strings, is often the information data scientists were after in the first place.
  • Shared privacy education: Our new schematization platform encourages asking about and discussing privacy concerns, ranging from what counts as PII to what can or can't be done with it. It brings clarity and education that wasn't easily available before.
  • Increased dataset discoverability: Schemas for events allow us to automatically integrate with query engines and existing tooling, making datasets quick to use and explore.

These benefits were a big driver of, and encouragement for, the adoption of our new tooling.

Capitalizing on Shared Goals

Schematization isn’t only useful for privacy reasons, it helps with reusability and observability, reduces storage cost, and streamlines common tasks for the data scientists too. Both privacy and data teams are important stakeholders in this project and it made collaboration and adoption a lot easier because we capitalized on shared goals across different teams in a large organization.

Historic Datasets

There are several petabytes of historic events that were collected in our data warehouse prior to the schematization platform. Even after implementing the new platform, the challenge of dealing with these large historic datasets remained. What made it formidable was the sheer amount of data for which it was hard to identify an owner, and which had to be reprocessed and migrated without disturbing the production platform. In addition, it's not particularly the most exciting kind of work, so it's easy for it to get deprioritized.

[Image: a dependency graph with many branching, interconnected nodes.]
Intricate interdependencies between some of the analytical jobs depending on these datasets

The above image shows a partial view of the intricate interdependency between some of the analytical jobs depending on these datasets. Similar to adoption challenges, there’s no easy solution for this problem, but here are some practices that helped us in mitigating this challenge.

Organizational Alignment

Any task of this scale goes beyond the affected individuals, projects, or even teams, so organizational commitment and alignment are required to get it done. People, teams, priorities, and projects might change, but if there's organizational support and commitment for addressing privacy issues, the task can survive. Organizational alignment helped us put out consistent messaging to various team leads, which meant everyone understood the importance of the work. With this alignment in place, it was usually just a matter of working with leads to find the right balance of completing their contributions in a timely fashion without completely disrupting their roadmap.

Dedicated Task Force

These kinds of projects are slow and time consuming; in our case, it took over a year, during which several changes happened at the individual and team levels. We understood the importance of having a dedicated team and project, so that we didn't depend on individuals. People come and go, but the project must carry on.

Tooling, Documentation, and Support

One of our goals was to minimize the amount of effort individual dataset owners and users needed to migrate their datasets to the new platform. We documented the required steps, built automation for tedious tasks, and created integrations with tooling that data scientists and librarians were already familiar with. In addition, having Engineering support for hurdles was important: on many occasions when performance or other technical issues came up, Engineering support was available to solve the problem. The time spent on building the tooling, documentation, and support procedures easily paid off in the long run.

Regular Progress Monitoring

Regularly questioning dependencies, priorities, and blockers paid off because it helped us find better ways forward. For instance, in a situation where x is considered a blocker for y, maybe:

  • we can ask the team working on x to reprioritize and unblock y earlier.
  • both x and y can happen at the same time if the teams owning them align on some shared design choices.
  • there's a way to reframe x or y or both so that the dependency disappears.

We were able to do this kind of reevaluation because we had regular and constant progress monitoring to identify blockers.

New Platform Operational Statistics

Our new platform has been in production use for over two years. Nowadays, we have over 4500 distinct analytical schemas for events, each designed to capture certain metrics or analytics, and each with its own unique privacy context. On average, these schemas generate roughly 20 billion events per day, or approximately 230K events per second, with peaks of over 1 million events per second during busy times. Every single one of these events is processed by our obfuscation and tokenization tools in accordance with its privacy context before becoming accessible in the data warehouse or anywhere else.

Our tokenization vault holds more than 500 billion distinct PII-to-token mappings (approximately 200 terabytes), from which tens to hundreds of millions are deleted daily in response to privacy or shop purge requests. The magical part of this platform is that deletion happens instantaneously in the tokenization vault alone, without requiring any operation in the data warehouse or any other place where tokens are stored. This is the superpower that enables us to delete data that used to be very difficult to identify: the undeletable. These metrics and the ease of fulfilling privacy requests proved the efficiency and scalability of our approach and new tooling.

As part of onboarding our historic datasets onto the new platform, we rebuilt roughly 100 distinct datasets (tens of petabytes of data in total) feeding hundreds of jobs in our analytical platform. Development, rollout, and reprocessing of our historical data altogether took about three years, with help from 94 different individuals, signifying the scale of effort and commitment we put into this project.

We believe sharing the story of a metamorphosis in our data analytics platform to facilitate privacy requests is valuable, because when we looked for industry examples, there were very few available. In our experience, schematization, together with a platform that captures context including privacy and evolution, is beneficial in analytical event collection systems. It enables a variety of opportunities for treating sensitive information and for educating developers and data scientists on data privacy. In fact, our adoption story showed that people are highly motivated to respect privacy when they have the right tooling at their disposal.

Tokenization and obfuscation proved to be effective tools for handling, tracking, and deleting personal information. They enabled us to efficiently delete the undeletable at a very large scale.

Finally, we learned that solving the technical challenges isn't the entire problem. Organizational challenges, such as driving adoption and dealing with historic datasets, remain tough to address. While we didn't have a bulletproof solution, we learned that bringing new value, capitalizing on shared goals, streamlining and automating processes, and having a dedicated task force to champion big cross-team initiatives like this are effective and helpful techniques.

Additional Information

Behrooz is a staff privacy engineer at Shopify where he works on building scalable privacy tooling and helps teams to respect privacy. He received his MSc in Computer Science at University of Waterloo in 2015.  Outside of the binary world, he enjoys being upside down (gymnastics) 🤸🏻, on a bike  🚵🏻‍♂️ , on skis ⛷, or in the woods. Twitter: @behroozshafiee

Shipit! Presents: Deleting the Undeletable

On September 29, 2021, Shipit!, our monthly event series, presented Deleting the Undeletable. Watch Behrooz Shafiee and Jason White as they discuss the new schematization platform and answer your questions.

