Tips for Managing Joins in Looker

Looker is a fantastic product. It really makes data and visualizations much more manageable. The main goal of Looker is to allow people who aren’t data analysts to do some basic data analysis. To some extent, it achieves this, but there are limits to how far this can go. Ultimately, Looker is a big graphical user interface for writing SQL and generating charts. Under-the-hood, it’s programmable by data engineers, but it’s limited by the fact that non-technical users are using it.

The major design challenge for Looker is joins. A data engineer writes the joins into what Looker calls “explores”. Explores are rules for how data can be explored, but ultimately just a container for joins. When someone creates a new chart, they start by selecting an explore, and thus selecting the joins that will be used in the chart.

They pick the join from a dropdown under the word “Explore”. This is the main design bottleneck. Such a UI encourages users to have only a limited number of joins that can fit in the vertical resolution of the screen. This means limiting the number of explores, and hence limiting the ways tables are joined. This encourages using pre-existing joins for new charts.

This creates two problems.

  1. A non-technical user will not understand the implication of choosing an explore. They may not see that the explore they chose limits how the data can be analyzed. In fact, a non-savvy user may pick the wrong explore entirely, and create a chart that is entirely wrong.
  2. The joins may evolve over time. A programmer might change a join for a new chart, and this may make old charts incorrect.

The problem is that SQL joins are fundamentally interpretations of the data. Unless a join occurs on id fields AND is a one-to-one relationship, then a join interprets the data in some way.

So how can you limit the negative impact of re-using joins?

1. Encourage simple charts

Encourage your teammates to make charts as simple as possible. If possible, a chart should show a single quantity as it changes over a single dimension. This should eliminate or minimize the use of joins in the chart, thus making it far more future-proof.

2. Give explores long, verbose names

Make explore names as descriptive as possible. Try to communicate the choice that a user is making when they choose an explore. For instance, you might name one explore “Products Today” and another one “Product Events Over Time”. These names might indicate that the first explore looks at the products table, but the second explore shows events relating to products joined with a time dimension.

One of the mistakes I made while first starting out with Looker is naming the explores with single word names. I now see that short names create maintenance nightmares. Before assessing the problems with a given chart, I need to know which explore the maker chose for it, and because the names were selected so poorly, the choice was often incorrect.

I hope these ideas help you find a path to a maintainable data project. To be honest, I have a lot of digging-out to do!

Learning to Talk About Inaccuracy for a New Data Engineer

About a month ago, the engineering team at Mighty Networks was impacted by China’s now-defunct one child policy. A parent of our data engineer was having health problems. Because there were no other children to help out, he was forced to relocate his family back to China.

It took us around six months to hire him. With a bunch of data projects now in the pipeline, we couldnt’t go through the hiring process again. Fortunately, I was excited to step into the breach. I’ve never formally trained as a data engineer, but I built a data warehouse from scratch for another startup, and I’ve always had a passion for numbers.

Still, I’ve definitely struggled a little bit in the new role. One of the things I’ve struggled with most is how to communicate numbers to the business team. It’s fine to make pretty visualizations, but how do I communicate the subtlety in the data? How do I communicate the fact that the numbers are as accurate as we can get, but there are still some sources of error ever-present in the system?

I came up with the following guidelines to help me talk to the business team, and I thought they might be useful to other programmers who are in a similar position.

Sources of Error

There are two categorical sources of error in any data analysis system.

  1. Data warehouse replication problems
  2. Bugs and algorithmic errors

Data Replication Issues

Inaccuracy of the first type is unavoidable, and is a universal problem with data warehouses. Data warehouses are typically pulling in huge amounts of data from many sources, then transforming it and analyzing it. In our case, we have jobs that should pull data hourly, but these jobs can fail due to infrastructural errors, such as an inability to requisition server resources from Amazon. So we have jobs that run daily as a fallback mechanism, and we have jobs to pull all the data for each table that can be run manually.

Typically, the data should be no more than an hour off of real time.

When ingestion jobs fail, the data can be recovered by future jobs. Typically, data replication errors do not result in any long-term data loss.

Bugs and Algorithmic Errors

It’s important to remember that the data analysis system is ultimately just software. As with any software project, bugs are inevitable. Bugs can arise in several ways and in several different places.

  1. Instrumentation. The instrumentation can be wrong in many ways. New features may not have been instrumented at all. Instrumentation may be out of date with the assumptions in the latest release. Instrumentation could be conditionally incorrect, leading to omitted data or semi-correct data.
  2. Ingestion. The ingestion occurs in multiple steps. The data has to be correctly propagated from the database, to the replicated database, to the data pipeline, to the data warehouse. Errors in ingestion often occur when only part of this process has been updated. In our case, fields must be added to RedShift, to Kinesis Firehose (for events), to Data Pipeline (for db records), then they must be exposed in Looker.
  3. Transformation and Analysis. The presentation of advanced statistics rests on several layers of analysis and aggregation. A small typo, or mistake in one place can lead to a cascade of errors when that mistake effects a huge amount of data.

How to Talk About Inaccuracy

The best way to talk about inaccuracy is to talk about what steps you have taken to validate the data.

  • What did you do to validate the instrumentation? How did you communicate the requirements and purpose of the new events to the developers? Did you review their pull requests and ensure that the events were actually instrumented?
  • What did you do to validate the ingestion? Did you see events coming in on a staging environment? Did you participate in testing the new feature then verify that your tests percolated through to staging analytics? Did you read the monitoring logs?
  • What did you do to validate the analysis? Did you compare the resulting data to the data in another system? Did you talk through the results with a colleague? Did you double-check the calculations that underlie your charts? Even when they were created by other/former developers? Did you create an intermediate chart and verify the correctness at that level of analysis? When you look at the data from another angle/table, do the results make sense with your new results?

Don’t dwell on the sources of error. Talk about what you have done to minimize the sources of error. In the end, this is software. Software evolves. The first release is always buggy, and we are always working to refine, fix bugs, and improve.

Make a plan to validate each data release like the rest of the team validates the consumer-facing product. Use unit tests, regression tests, and spot=checks with production to validate your process.

Top Line Numbers vs. Other Numbers

In general, you can never be sure that a number is absolutely 100% correct due to the assumptions in the process, and the fact that you must rely on the work of many other developers. Most charts should be used in aggregate to paint a picture of what is happening. No single number should be thought of as absolute. If possible, you should try to present confidence intervals in charts or use other tools that represent the idea of fuzziness.

But as in everything, there are exceptions.

With particularly important numbers, if the amount of data that goes into them is relatively small, then we can manually validate the process by comparing the results with the actual production database. The point is that for the most important, top-line calculations, you should be extremely confident in your process. You should have reviewed each step along the way and ensured to the best of your ability that the number is as close as possible to the real number.

TLDR

When you’re trying to communicate the accuracy of your data to the business team…

  • Focus on what you have done to validate the numbers.
  • Keep in mind that the data analysis process is software that evolves toward correctness as all software does.
  • Validate data analysis like you validate any other software.
  • Where it’s possible and important, do manual validation against production so you can have high confidence in your top line numbers.