Learning to Talk About Inaccuracy for a New Data Engineer

About a month ago, the engineering team at Mighty Networks was impacted by China’s now-defunct one child policy. A parent of our data engineer was having health problems. Because there were no other children to help out, he was forced to relocate his family back to China.

It took us around six months to hire him. With a bunch of data projects now in the pipeline, we couldnt’t go through the hiring process again. Fortunately, I was excited to step into the breach. I’ve never formally trained as a data engineer, but I built a data warehouse from scratch for another startup, and I’ve always had a passion for numbers.

Still, I’ve definitely struggled a little bit in the new role. One of the things I’ve struggled with most is how to communicate numbers to the business team. It’s fine to make pretty visualizations, but how do I communicate the subtlety in the data? How do I communicate the fact that the numbers are as accurate as we can get, but there are still some sources of error ever-present in the system?

I came up with the following guidelines to help me talk to the business team, and I thought they might be useful to other programmers who are in a similar position.

Sources of Error

There are two categorical sources of error in any data analysis system.

  1. Data warehouse replication problems
  2. Bugs and algorithmic errors

Data Replication Issues

Inaccuracy of the first type is unavoidable, and is a universal problem with data warehouses. Data warehouses are typically pulling in huge amounts of data from many sources, then transforming it and analyzing it. In our case, we have jobs that should pull data hourly, but these jobs can fail due to infrastructural errors, such as an inability to requisition server resources from Amazon. So we have jobs that run daily as a fallback mechanism, and we have jobs to pull all the data for each table that can be run manually.

Typically, the data should be no more than an hour off of real time.

When ingestion jobs fail, the data can be recovered by future jobs. Typically, data replication errors do not result in any long-term data loss.

Bugs and Algorithmic Errors

It’s important to remember that the data analysis system is ultimately just software. As with any software project, bugs are inevitable. Bugs can arise in several ways and in several different places.

  1. Instrumentation. The instrumentation can be wrong in many ways. New features may not have been instrumented at all. Instrumentation may be out of date with the assumptions in the latest release. Instrumentation could be conditionally incorrect, leading to omitted data or semi-correct data.
  2. Ingestion. The ingestion occurs in multiple steps. The data has to be correctly propagated from the database, to the replicated database, to the data pipeline, to the data warehouse. Errors in ingestion often occur when only part of this process has been updated. In our case, fields must be added to RedShift, to Kinesis Firehose (for events), to Data Pipeline (for db records), then they must be exposed in Looker.
  3. Transformation and Analysis. The presentation of advanced statistics rests on several layers of analysis and aggregation. A small typo, or mistake in one place can lead to a cascade of errors when that mistake effects a huge amount of data.

How to Talk About Inaccuracy

The best way to talk about inaccuracy is to talk about what steps you have taken to validate the data.

  • What did you do to validate the instrumentation? How did you communicate the requirements and purpose of the new events to the developers? Did you review their pull requests and ensure that the events were actually instrumented?
  • What did you do to validate the ingestion? Did you see events coming in on a staging environment? Did you participate in testing the new feature then verify that your tests percolated through to staging analytics? Did you read the monitoring logs?
  • What did you do to validate the analysis? Did you compare the resulting data to the data in another system? Did you talk through the results with a colleague? Did you double-check the calculations that underlie your charts? Even when they were created by other/former developers? Did you create an intermediate chart and verify the correctness at that level of analysis? When you look at the data from another angle/table, do the results make sense with your new results?

Don’t dwell on the sources of error. Talk about what you have done to minimize the sources of error. In the end, this is software. Software evolves. The first release is always buggy, and we are always working to refine, fix bugs, and improve.

Make a plan to validate each data release like the rest of the team validates the consumer-facing product. Use unit tests, regression tests, and spot=checks with production to validate your process.

Top Line Numbers vs. Other Numbers

In general, you can never be sure that a number is absolutely 100% correct due to the assumptions in the process, and the fact that you must rely on the work of many other developers. Most charts should be used in aggregate to paint a picture of what is happening. No single number should be thought of as absolute. If possible, you should try to present confidence intervals in charts or use other tools that represent the idea of fuzziness.

But as in everything, there are exceptions.

With particularly important numbers, if the amount of data that goes into them is relatively small, then we can manually validate the process by comparing the results with the actual production database. The point is that for the most important, top-line calculations, you should be extremely confident in your process. You should have reviewed each step along the way and ensured to the best of your ability that the number is as close as possible to the real number.

TLDR

When you’re trying to communicate the accuracy of your data to the business team…

  • Focus on what you have done to validate the numbers.
  • Keep in mind that the data analysis process is software that evolves toward correctness as all software does.
  • Validate data analysis like you validate any other software.
  • Where it’s possible and important, do manual validation against production so you can have high confidence in your top line numbers.

When Code Duplication is not Code Duplication

Duplicating code is a bad thing. Any engineer worth his salt knows that the more you repeat yourself, the more difficult it will be to maintain your code. We’ve enshrined this in a well-known principle called the DRY principle, where DRY is an acronym standing for Don’t Repeat Yourself. So code duplication should be avoided at all costs. Right?

At work I recently came across an interesting case of code duplication that merits more thought, and shows how there is some subtlety needed in application of every coding guideline, even the bedrock ones.

Consider the following CSS, which is a simplified version of a common scenario.

.title {
  color: #111111;
}
.text-color-gray-1 {
  color: #111111;
}

This looks like code duplication, right? If both classes are applying the same color, then they do the same thing. If the do the same thing, then they should BE the same thing, right?

But CSS and markup in general presents an interesting case. Are these rules really doing the same thing? Are they both responsible for making the text gray? No.

The function of these two rules is different, even though the effect is the same. The first rule styles titles on the website, while the second rule styles any arbitrary div. The first rule is a generalized style, while the second rule is a special case override. The two rules do fundamentally different things.

Imagine a case where we optimized those two classes by removing the title class and just using the latter class. Then the designer changes the title color to dark blue. To change the title color, the developer now has to replace each occurrence of .text-color-gray-1 where it styles a title. So, by optimizing two things with different purposes, the developer has actually made more work.

It’s important to recognize in this case that code duplication is not always code duplication. Just because these two CSS classes are applying the same color doesn’t mean that they are doing the same thing. In this case, the CSS classes are more like variables than methods. They hold the same value, but that is just a coincidence.

What looks like code duplication is not actually code duplication.

But… what is the correct thing?

There is no right answer here. It’s a complex problem. You could solve it in lots of different ways, and there are probably three or four different approaches that are equally valid, in the sense that they result in the same amount of maintenance.

The important thing is not to insist that there is one right way to solve this problem, but to recognize that blithely applying the DRY principle here may not be the path to less maintenance.

How Can Comments be Controversial?

Comments add information to your code. They don’t impact the execution of that code. So how can they be bad? I believe that more commenting is better, and that comments are a vital tool in maintaining an agile codebase, but I’ve met many experienced developers who argue with me about comments.

The Pragmatic Programmer is an excellent book. It outlines best practices that have solidified over the last thirty years or so as software development has grown as a field. The authors share these practices as a series of tips that are justified by real world anecdotes. In a section on comments the authors say

“The DRY principle tells us to keep the low-level knowledge in the code, where it belongs, and reserve the comments for other, high-level explanations. Otherwise, we’re duplicating knowledge, and every change means changing both the code and the comments. The comments will inevitably become out of date, and untrustworthy comments are worse than no comments at all”

They make the point that obsolete comments are worse than no comments at all. This is undoubtably true, but it has been bastardized by many programmers as an excuse for not commenting. I’ve heard this excuse, along with two others to justify a lack of comments in code.

  1. The code is self-commenting.
  2. Out of date comments are worse than no comments.
  3. You shouldn’t rely on comments.

In this post I want to address each of these points. I think that thorough commenting speeds up the rate of consumption of code, and that obsolete comments are a failure in maintenance, rather than in comments in general.

1. The code is self-commenting

This is the excuse I hear most often from developers who don’t comment their code. The thinking is that clear variable and method names replace the need for comments. There are situations where this is true. For very short methods with a single purpose, I see that no additional comments may be necessary.

But most methods are not very short, and most methods achieve several things that relate to one overall purpose. So in the first place, very few methods fit the bill.

This is not a valid complaint because good comments only make the code easier to read. While the code may be legible on its own, a comment that explains the purpose of a method, or the reason for
a string operation, can only make it easier to read. The author of Clean Code agrees due to the amount of time we spend reading code.

“the ratio of time spent reading vs. writing is well over 10:1. We are constantly reading old code as part of the effort to write new code. Because this ratio is so high, we want the reading of code to be easy, even if it makes the writing harder.”

Because we spend more time reading code than writing it, a responsible programmer knows that the extra comment in a method which takes very little time to write may well save another programmer many hours of learning.

2. Out of date comments are worse than no comments

To refute this argument, lets take cars as an analogy. If you buy a car, then drive it for twenty thousand miles without an oil change, it will break down. If you did that, the fault wouldn’t be in cars in general. Rather, the fault is in the lack of maintenance.

Similary, obsolete comments are not a failure of comments in general, but a failure of the programmer who updated the code without updating the comments. That other programmers have bad practices does not justify continuing that bad practice.

Obsolete comments are a broken window. If you see them, it is your job to fix them.

So while obsolete comments ARE worse than no comments, that argument is an argument against bad coding practices, not comments in general.

3. You shouldn’t rely on comments

The argument that a programmer shouldn’t rely on comments is a subtle dig at another programmer’s skill. They’re saying “a good programmer can understand this code with no comments.”

There is a grain of a valid point there. If you consider yourself a senior programmer, then you should be accustomed to wading through large volumes of bad code and parsing out the real meaning and functionality.

But even for senior programmers, comments make it easier to read and understand the code. It’s much easier to interpret a one line summary of a block of code than to parse that block line by line. It’s much easier to store that one line in your head than it is to store the whole block in your head.

The real reason why this is the worst of excuses is that not all programmers working on your code are senior programmers. Some are green graduates straight out of college. Yes, a more experienced programmer could understand the code without comments, but it will cost that graduate hours to figure out what they could have gotten in minutes with good comments.

My Commenting Principle

Comment your code so that a green programmer can understand it.

Or…

Comment your code so that your boss can understand it.

What does it take to make your code readable by a green graduate? Imagine them going through it line by line. If there are no comments, then they need to understand every function call, and every library routine. They need to understand a bit about your architecture and date model.

Now imagine that every single line is commented.

In that scenario, a green programmer can read your file by reading one comment after the next. If they need more information, they can still drill into the code and get the details, but they can understand the overview just through comments.

The closer you can get to that, the better.

Don’t listen to the arguments against thorough commenting habits. Just point the detractors here.

Why are Side Effects Bad?

Side effects are any observable change to the state of an object. Some people might qualify this by saying that side effects are implicit or unintended changes to state.

Mutator methods, or setters, which are designed to change the state of an object, should have side effects. Accessor methods, or getters, should not have side effects. Other methods should generally try to avoid changing state beyond what is necessary for the intended task.

Why are these guidelines best practices though? What makes side effects so bad? Students have asked me this question many times, and I’ve worked with many experienced programmers who don’t seem to understand why minimizing side effects is a good goal.

For example, I recently commented out a block of my peer’s code because it had detrimental side effects. The code was intended to add color highlighting to log entries to make his debugging easier. The problem with the code was that the syntax highlighting xml was leaking into our save files. He was applying the highlight to a key, then using that key both in the log and in the save file. Worse still, he wrote this code so that it only occurred on some platforms. When I was debugging a cross platform feature, I got very unpredictable behavior and ultimately traced it back to this block.

This is an example where code was intended for one purpose, but it also did something else. That something else is the side effect, and you can see how it caused problems for other developers. Since his change was undocumented and uncommented, I spent hours tracking it down. As a team, we lost significant productivity due to a side effect.

Side effects are bad because they make a code base less agile. Side effects cause bugs that are difficult to find, and lead to code that is more difficult to maintain.

Before I continue, let me be clear that all code style guidelines should be broken sometimes. For each situation, a guideline or design pattern may be better or worse, and I recognize that we are always working in shades of gray.

Generally, a block of code should be written for one purpose. If it is a method, then it should do one thing. If another thing needs to be done with an object, then that should be encapsulated in another method.

Here’s a hypothetical example that I’ve seen played out hundreds of times in my career.

A class needs to do task X. A programmer may write a method to do task X, but he accidentally includes logic that also does task Y. Later, he may see that he needs to do task Y all alone. So he writes a method to do task Y. That’s where the problem is compounded.

Later still, the definition of task Y changes. So another programmer has to rewrite task Y. He goes to the class, changes the method for task Y alone, does a few quick tests, and proceeds on his merry way.

Then mysterious bugs start occurring. QA can’t really track them down to one thing because they only occur sporadically after task Y. Finally, it takes many man-hours to remove task Y from the method for task X.

In this example, the side effect led to code duplication, which led to trouble when updating the code, which led to bugs that cost many hours to track down. The fewer side effects you introduce, the easier your code will be to maintain.

These two examples show how side effects can derail development and why they are so inimical. We all write code with side effect occasionally, but it’s our job to figure out how to do it in a way that doesn’t make the code more difficult to maintain.