5 data strategies for getting started with review environments
Review environments are a powerful tool in the Spaceship platform designed to make it easier than ever to test and evaluate new features before releasing them to production.
Quite simply, review environments enable you to set up a temporary, private copy of your application for testing. While they sound similar to staging environments, the big difference is that they aren't contested resources: your team can easily and automatically spin them up and down as needed.
That all sounds well and good, but how do you actually get started with review environments? These environments are easy to spin up and use at a moment's notice, but there is one prerequisite – you need a strategy for what data you'll use. Having this strategy in place is critical because there are several different ways to bring data into a review environment, and the nuances of each one will dictate how your team collaborates and tests within these environments.
What are these different strategies? And which one is right for your team? Here’s a look at what you need to know about each one.
1) No pre-existing data
Starting from scratch is usually the default stance for teams. In this case, you’ll use new data for each review environment, which means your team will have to onboard and sign up for accounts in the product you’re testing each time.
One of the biggest positives to this approach is that it forces your team to make onboarding easy, in large part because you'll go through it constantly. Companies rarely pay regular attention to onboarding because any individual user only goes through it once, but most companies have more customers in onboarding at any given time than in any other part of the product. That makes it an important experience to pay attention to and simplify where possible. Importantly, this is the only approach that makes it easy to catch issues in the onboarding process.
Beyond making onboarding easy, starting without any pre-existing data also offers the simplest process for implementing review environments.
The downside is that you need to build up data to test anything of complexity, so manufacturing new data every time can take significant effort and slow down testing. In turn, the longer it takes to get everything set up, the less likely you'll be to test smaller changes.
Additionally, starting with new data each time is generally a bad idea for data-driven applications like BI tools and executive dashboards where data is the most meaningful part of the solution. That’s because beyond requiring significant data entry, using mock data can make it difficult to distinguish a good user experience from a bad one since you won’t be able to find issues in how certain data points relate to one another.
Finally, this approach doesn’t test for issues with how existing data gets migrated (since there is no existing data) and doesn’t offer good support for load testing (since the load on the database is not a realistic representation of what you’d expect to see in production).
2) Seed data
Using seed data to power your review environments makes testing faster than starting from scratch. In this case, you'll create a unique set of seed data that mimics your actual data, then load it into the database for each review environment depending on the situation at hand.
Seed data lets you codify common scenarios into good, reproducible test cases shared across team members. Whereas rebuilding a scenario from scratch takes a while, with seed data you only have to create data related to whatever issue you're debugging. This makes the process much faster and enables easier collaboration across environments.
Further, seed data allows you to easily tie in user personas employed throughout the product development process (i.e. Jim the administrator, Jane the owner, Bill the user, etc.) for even more consistency throughout the entire development lifecycle.
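As a sketch of what this can look like, here's a minimal seed script in Python (assuming a SQLite database and the hypothetical personas above; your schema, database, and personas will differ):

```python
import sqlite3

# Hypothetical personas; swap in whatever your team actually uses.
PERSONAS = [
    ("Jim", "administrator"),
    ("Jane", "owner"),
    ("Bill", "user"),
]

def seed(db_path=":memory:"):
    """Create a users table and seed it with the standard personas."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS users (name TEXT, role TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", PERSONAS)
    conn.commit()
    return conn

conn = seed()
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)  # 3
```

Because the personas live in one place, running the script against a fresh review environment gives every team member the same starting point.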
The biggest con to using seed data is that it creates yet another asset your team needs to develop and maintain – the script to seed the database. Interestingly, this maintenance can be both a pro and con if you tie it into your development team’s user personas. That’s because it requires the whole team to know those personas well, which aids in communication across the team around what you’re developing and who it’s for.
Similar to starting without any pre-existing data, using seed data for your review environments can prove tricky for most data-driven applications like BI tools – unless you have a subject matter expert who can create meaningful seed data. Even then, it’s important to keep in mind that the primary purpose of those tools is typically to find surprising things within data and seed data is still simply manufactured information.
This approach also does not test the migration of existing data, and it's typically not great for load testing unless the seed data closely matches what's in production (otherwise it won't generate a realistic load).
3) User databases
The next option allows individual users to link their personal databases to a review environment so they can QA against their own dataset, which lowers the level of maintenance compared to using seed data.
The most obvious benefit to this approach is that most people are intimately familiar with the data they developed themselves. On top of that familiarity, there is also no script to maintain since everyone will use their personal database.
Notably, this approach enables you to identify issues in the migration of existing data if those issues are present in the individual’s dataset.
The downside is that allowing individuals to use their own databases creates complexity in managing the process of regularly linking those databases to review environments and then unlinking them.
Next, this approach ties testing quality to the quality of each individual's database, including how much data they've created and how varied it is. Everyone might be using wildly different sets of users and scenarios, and some individuals may even end up with data in unrealistic states compared to production. When this happens, it can be difficult to determine whether issues stem from problems in the code or problems in the data. These differences also make it difficult to reproduce issues across team members.
Lastly, this approach is still not ideal for load testing due to the differences in quantity and types of data compared to what’s in production. And while using seed data enables you to test integrations with external services, using individual user databases does not offer a good way to test these integrations.
4) Staging data
Another option is to bring in data from your staging environment, which is typically closer to real data across many scenarios than either seed data or data from developer databases.
Because staging data better mimics real data (it's typically held to the same standards as production data), it will be possible to catch issues related to data migration.
Compared to the previous option, this approach will also be more exhaustive than any one person’s individual dataset. And since all team members will use the same set of data, it becomes much easier to reproduce any issues across the team.
Staging data is the closest representation of real data among the options so far, but it can still be out of date with production data. It is also only an approximation of how users behave, as it's populated by internal product teams and is not based on real actions. This difference is important to note since users typically find new ways to use products that end up surprising the teams that built them, and these surprising use cases are often where issues creep in. Without looking at real user behaviors, you're likely to miss any issues that occur in these unexpected use cases.
Once again, staging data still doesn’t have the appropriate scale to offer a good representation for load testing. It is also not the best approach for testing integrations with external services, since those integrations typically don’t get linked properly in the staging data.
5) Anonymized product data
The last option is to anonymize your actual product data for use in review environments. This approach provides by far the most realistic setup, but it still has its cons.
Using real user data, rather than an approximation of what users might do, offers the most diverse and realistic scenarios for testing – including covering those unexpected use cases with which actual customers will inevitably surprise your team. This data also provides the best basis for reproducing customer issues.
Additionally, anonymized product data delivers the only solid foundation for load testing, since row counts and tables should actually reflect what’s in production. Along the same lines, because it reflects real data in production, it is also the best approach for testing data-driven applications.
Anonymized product data is the most complex of all the options for powering review environments. That’s because sanitizing production data to make it safe for use in this testing environment requires going field by field to scrub sensitive elements (e.g. names) without making everything invalid. Ultimately, this scrubbing will require a significant amount of work.
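As a rough sketch of that field-by-field idea, here's what a scrubbing step could look like in Python. The field list and pseudonym format here are hypothetical; real sanitization needs a careful review of your actual schema:

```python
import hashlib

# Hypothetical set of sensitive fields to scrub; in practice this
# list comes from the field-by-field schema review described above.
SENSITIVE_FIELDS = {"name", "email"}

def scrub(record):
    """Replace sensitive values with stable pseudonyms so related
    rows still match up after anonymization, and so fields like
    email keep a valid shape instead of becoming garbage."""
    out = dict(record)
    for field in SENSITIVE_FIELDS & out.keys():
        digest = hashlib.sha256(str(out[field]).encode()).hexdigest()[:8]
        if field == "email":
            out[field] = f"user-{digest}@example.com"
        else:
            out[field] = f"user-{digest}"
    return out

row = {"name": "Ada Lovelace", "email": "ada@example.com", "plan": "pro"}
print(scrub(row)["plan"])  # unchanged: pro
```

Hashing (rather than randomizing) keeps the scrubbing deterministic, so the same customer maps to the same pseudonym across tables and across runs.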
Using real data also carries the risk that you might accidentally leak information or interact with customers in an unexpected way, such as sending a bulk notification to real users or accidentally charging them real money.
The rewards of using this real data are high, but so are the risks. As a result, unless you’re doing load testing or have a data-driven application, anonymizing product data is typically not the best idea.
Getting started with review environments
Each approach for getting data into review environments has its pros and cons, and which one is right for your team will depend on a number of factors. This makes it essential to evaluate each approach based on the needs of your team and the structure of your application.
That said, the most common approach we see is to start with seed data because it makes it easy for your team to reproduce issues across environments and can simplify the process for getting new team members started once they’re educated on your user personas. The other big attraction to this approach is that you always have an ephemeral data store, so there’s no ongoing cost once you spin down a review environment.
No matter which way you go, once you land on a strategy that works for your team, review environments offer a powerful way to support your development lifecycle.
Interested in learning more about review environments? Contact Spaceship today to get the full details and see how you can get started.