Learning to Talk About Inaccuracy for a New Data Engineer
About a month ago, the engineering team at Mighty Networks was impacted by China’s now-defunct one child policy. A parent of our data engineer was having health problems. Because there were no other children to help out, he was forced to relocate his family back to China.
It took us around six months to hire him. With a bunch of data projects now in the pipeline, we couldnt’t go through the hiring process again. Fortunately, I was excited to step into the breach. I’ve never formally trained as a data engineer, but I built a data warehouse from scratch for another startup, and I’ve always had a passion for numbers.
Still, I’ve definitely struggled a little bit in the new role. One of the things I’ve struggled with most is how to communicate numbers to the business team. It’s fine to make pretty visualizations, but how do I communicate the subtlety in the data? How do I communicate the fact that the numbers are as accurate as we can get, but there are still some sources of error ever-present in the system?
I came up with the following guidelines to help me talk to the business team, and I thought they might be useful to other programmers who are in a similar position.
Sources of Error
There are two categorical sources of error in any data analysis system.
- Data warehouse replication problems
- Bugs and algorithmic errors
Data Replication Issues
Inaccuracy of the first type is unavoidable, and is a universal problem with data warehouses. Data warehouses are typically pulling in huge amounts of data from many sources, then transforming it and analyzing it. In our case, we have jobs that should pull data hourly, but these jobs can fail due to infrastructural errors, such as an inability to requisition server resources from Amazon. So we have jobs that run daily as a fallback mechanism, and we have jobs to pull all the data for each table that can be run manually.
Typically, the data should be no more than an hour off of real time.
When ingestion jobs fail, the data can be recovered by future jobs. Typically, data replication errors do not result in any long-term data loss.
Bugs and Algorithmic Errors
It’s important to remember that the data analysis system is ultimately just software. As with any software project, bugs are inevitable. Bugs can arise in several ways and in several different places.
- Instrumentation. The instrumentation can be wrong in many ways. New features may not have been instrumented at all. Instrumentation may be out of date with the assumptions in the latest release. Instrumentation could be conditionally incorrect, leading to omitted data or semi-correct data.
- Ingestion. The ingestion occurs in multiple steps. The data has to be correctly propagated from the database, to the replicated database, to the data pipeline, to the data warehouse. Errors in ingestion often occur when only part of this process has been updated. In our case, fields must be added to RedShift, to Kinesis Firehose (for events), to Data Pipeline (for db records), then they must be exposed in Looker.
- Transformation and Analysis. The presentation of advanced statistics rests on several layers of analysis and aggregation. A small typo, or mistake in one place can lead to a cascade of errors when that mistake effects a huge amount of data.
How to Talk About Inaccuracy
The best way to talk about inaccuracy is to talk about what steps you have taken to validate the data.
- What did you do to validate the instrumentation? How did you communicate the requirements and purpose of the new events to the developers? Did you review their pull requests and ensure that the events were actually instrumented?
- What did you do to validate the ingestion? Did you see events coming in on a staging environment? Did you participate in testing the new feature then verify that your tests percolated through to staging analytics? Did you read the monitoring logs?
- What did you do to validate the analysis? Did you compare the resulting data to the data in another system? Did you talk through the results with a colleague? Did you double-check the calculations that underlie your charts? Even when they were created by other/former developers? Did you create an intermediate chart and verify the correctness at that level of analysis? When you look at the data from another angle/table, do the results make sense with your new results?
Don’t dwell on the sources of error. Talk about what you have done to minimize the sources of error. In the end, this is software. Software evolves. The first release is always buggy, and we are always working to refine, fix bugs, and improve.
Make a plan to validate each data release like the rest of the team validates the consumer-facing product. Use unit tests, regression tests, and spot=checks with production to validate your process.
Top Line Numbers vs. Other Numbers
In general, you can never be sure that a number is absolutely 100% correct due to the assumptions in the process, and the fact that you must rely on the work of many other developers. Most charts should be used in aggregate to paint a picture of what is happening. No single number should be thought of as absolute. If possible, you should try to present confidence intervals in charts or use other tools that represent the idea of fuzziness.
But as in everything, there are exceptions.
With particularly important numbers, if the amount of data that goes into them is relatively small, then we can manually validate the process by comparing the results with the actual production database. The point is that for the most important, top-line calculations, you should be extremely confident in your process. You should have reviewed each step along the way and ensured to the best of your ability that the number is as close as possible to the real number.
When you’re trying to communicate the accuracy of your data to the business team…
- Focus on what you have done to validate the numbers.
- Keep in mind that the data analysis process is software that evolves toward correctness as all software does.
- Validate data analysis like you validate any other software.
- Where it’s possible and important, do manual validation against production so you can have high confidence in your top line numbers.