SoundCloud, I love you, but you’re terrible

I finally started using SoundCloud for a new jazz/electro project called Fynix. I casually used it in the past under my own name, to share WIP tracks or odd stuff that didn’t fit on Bandcamp. But I never used it seriously until recently. Now I am using it every day and trying to connect with other artists: I am remixing one track a week, listening to everything on The Upload, and liking and commenting as much as I can.

SoundCloud is the best social network for musicians right now. But it still has a terrible identity crisis: most of its features seem to be aimed at listeners, or at nobody in particular.

So in this post, I’m going to vent about SoundCloud. It’s a good platform, but with a few changes it could be great.

1. I am an artist. Stop treating me like a listener.

Is it really that difficult for you to recognize that I am a musician, and not a listener? I’ve uploaded 15 tracks. It seems like a pretty simple conditional check to me. So why is my home feed cluttered up with reposts? Why can’t I easily find the new tracks by my friends?

This is the core underlying problem with SoundCloud. It has two distinct types of users, and yet it treats all users the same.

2. Your “Who to Follow” recommendations suck. They REALLY suck.

I’ve basically stopped checking “Who to Follow” even though I want to connect with as many musicians as possible. The recommendations seem arbitrary and just plain stupid.

The main problem is that, as a musician, I want to follow other musicians. I want to follow people who will interact with me, and who will promote my work as much as I promote theirs. Yet, the “Who to Follow” list is full of seemingly random people.

Is this person from the same city as me? No. Do they follow lots of people / will they follow back? No. Are they working in a genre similar to mine? No. Do they like and comment on lots of tracks? No.

So why the heck would I want to follow them?

3. Where are my friends’ latest tracks?

This last one is just infuriating. When I log in, I want to see the latest tracks posted by my friends. So I go to my home screen, and it is pure luck if I can find something posted by someone I actually talk to on SoundCloud. It’s all reposts. Even if I unfollow all the huge repost accounts, I am stuck looking at reposts by my friends, rather than their new tracks.

Okay, so let’s click the dropdown and go to the list of users I am “following”. Are they sorted by recent activity? No. They are sorted by the order in which I followed them. To find out if they have new tracks, I must click on them individually and check their profiles. Because that is really practical.

Okay, so maybe there’s a playlist of my friends’ tracks on the Discover page? Nope. It’s all a random collection of garbage.

As far as I can tell, there is no way for me to listen to my friends’ recent tracks. This discourages real interactions.

Ultimately, the problem is data, and intelligence. SoundCloud has none.

You could blame design for these problems. The website shows a lack of direction, as if competing committees are pulling the product in different directions. SoundCloud seems to want to focus on listeners and compete in the same space as Spotify.

But even if that’s the case, it should be trivial to see that I don’t use the website like a regular listener. I use it like a musician. I want to connect and interact with other musicians.

And this is such a trivial data/analytics problem that I can only conclude they aren’t led by data at all. Maybe this is just what I see because I lead our data team, but it seems apparent to me that data is either not used, or used poorly, in all of these features.

For instance, shouldn’t the “Who to Follow” list be based on who I have followed in the past? I’ve followed lots of people who make jazz/electro music, yet no jazz/electro artists are in my “Who to Follow” list. I follow people who like and comment on my tracks, yet I am told to follow people who follow 12 people and have never posted a comment.
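If I were sketching a first pass myself, it could be as simple as a weighted score over signals SoundCloud already has. Everything below is hypothetical (Python, with made-up fields and weights, not SoundCloud’s actual data model), but it shows how little intelligence this would take:

```python
# Hypothetical "Who to Follow" scoring for a musician's account.
# Every field and weight here is invented for illustration; this is
# not SoundCloud's actual data model.

def follow_score(me, candidate):
    """Higher score = better recommendation for a musician like me."""
    score = 0.0
    # Musicians want to follow other musicians, not empty accounts.
    if candidate["track_count"] > 0:
        score += 2.0
    # Genre overlap with what I post and who I already follow.
    score += 3.0 * len(set(me["genres"]) & set(candidate["genres"]))
    # People who actually like and comment are worth following.
    score += 1.5 * min(candidate["comments_per_week"], 10) / 10
    # A high follow ratio is a rough proxy for "will follow back".
    if candidate["following_count"] > 0.5 * candidate["follower_count"]:
        score += 1.0
    # Same city is a nice tiebreaker.
    if candidate["city"] == me["city"]:
        score += 0.5
    return score

me = {"genres": ["jazz", "electro"], "city": "San Francisco"}
candidates = [
    {"track_count": 12, "genres": ["electro"], "comments_per_week": 8,
     "following_count": 300, "follower_count": 400, "city": "Berlin"},
    {"track_count": 0, "genres": ["pop"], "comments_per_week": 0,
     "following_count": 12, "follower_count": 5000, "city": "London"},
]
candidates.sort(key=lambda c: follow_score(me, c), reverse=True)
print([round(follow_score(me, c), 2) for c in candidates])
```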

The most disappointing thing is that none of this is hard.

4. Oh yeah, and your browser detection sucks.

When I am browsing your site on my tablet, I do not want to use the app. I do not want your very limited mobile site. I just want the regular site (and yes, I know I can get it with a few extra clicks, but it should be the default).

Tips for Managing Joins in Looker

Looker is a fantastic product. It really makes data and visualizations much more manageable. The main goal of Looker is to let people who aren’t data analysts do some basic data analysis. To some extent it achieves this, but there are limits to how far it can go. Ultimately, Looker is a big graphical user interface for writing SQL and generating charts. Under the hood it’s programmable by data engineers, but it’s constrained by the fact that non-technical users are the ones using it.

The major design challenge for Looker is joins. A data engineer writes the joins into what Looker calls “explores”. Explores are rules for how data can be explored, but ultimately they are just containers for joins. When someone creates a new chart, they start by selecting an explore, and thus selecting the joins that will be used in the chart.

They pick the explore from a dropdown under the word “Explore”. This is the main design bottleneck. Such a UI encourages keeping only as many explores as will fit in the vertical resolution of the screen, which means limiting the number of explores, and hence the ways tables can be joined. The result is that pre-existing joins get reused for new charts.

This creates two problems.

  1. A non-technical user will not understand the implications of choosing an explore. They may not see that the explore they chose limits how the data can be analyzed. In fact, a non-savvy user may pick the wrong explore entirely and create a chart that is simply wrong.
  2. The joins may evolve over time. A programmer might change a join for a new chart, and this may make old charts incorrect.

The problem is that SQL joins are fundamentally interpretations of the data. Unless a join occurs on id fields AND expresses a one-to-one relationship, the join interprets the data in some way.
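Here’s a toy illustration of what I mean, in Python with pandas (the tables are made up): a one-to-many join silently duplicates rows, so an aggregate computed after the join no longer means what it meant before the join.

```python
import pandas as pd

# Made-up tables: one product, two events referencing it.
products = pd.DataFrame({"product_id": [1], "price": [100]})
events = pd.DataFrame({"event_id": [10, 11], "product_id": [1, 1]})

# The join is one-to-many, so the single product row fans out into two.
joined = events.merge(products, on="product_id")

print(products["price"].sum())  # 100 -- "revenue" per the products table
print(joined["price"].sum())    # 200 -- the join quietly doubled it
```

A chart built on the joined explore isn’t wrong, exactly; it has just reinterpreted “price” as “price per event”, and a non-technical user has no way to see that.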

So how can you limit the negative impact of re-using joins?

1. Encourage simple charts

Encourage your teammates to make charts as simple as possible. If possible, a chart should show a single quantity as it changes over a single dimension. This should eliminate or minimize the use of joins in the chart, thus making it far more future-proof.

2. Give explores long, verbose names

Make explore names as descriptive as possible. Try to communicate the choice that a user is making when they choose an explore. For instance, you might name one explore “Products Today” and another one “Product Events Over Time”. These names might indicate that the first explore looks at the products table, but the second explore shows events relating to products joined with a time dimension.

One of the mistakes I made when first starting out with Looker was giving explores single-word names. I now see that short names create maintenance nightmares. Before I can assess the problems with a given chart, I need to know which explore its maker chose, and because the names communicated so little, the choice was often incorrect.

I hope these ideas help you find a path to a maintainable data project. To be honest, I have a lot of digging-out to do!

Pride in Software Craftsmanship

As I spend more and more time in Silicon Valley, my views on software management are changing. I read Radical Candor recently, and while I agree with everything in it, I feel like it overcomplicates things.

This meditation has been prompted in part by my passion for food. I like going to new restaurants. It brings me joy to try something new, even if it’s not a restaurant that would ever be considered for a Michelin Star. Even crappy looking restaurants can serve great food.

I am often awed by the disconnect between various parts of the restaurant business and the quality of the food. Some restaurants are spotlessly clean, have beautiful decor, and offer amazing service… but the food is mediocre. The menu is bland and uninspired, and the food itself is prepared with all the zeal that a minimum wage employee can manage.

Then I’ll go to a dirty-looking Greek joint down the road, and the service will be awful… but the menu is inspired. It’s not the standard Greek menu; it’s got little variations on the dishes. And when the food comes out (finally), maybe it isn’t beautiful on the plate, but the flavors come together to make something greater than the ingredients and the recipe.

What seems to distinguish a good restaurant from a crappy one is pride. At restaurants that I return to, there is someone there, maybe a manager, maybe a cook, maybe the chef who designed the menu, who takes great pride in his work.

There’s a diner by my old house, for instance, where the food is… diner food. There’s no reason to go back to the restaurant… except for the manager: the man who runs the floor, seats the patrons, deals with the kitchen, and does all the little things that make a restaurant tick. He manages to make that particular diner worth going to. And for a guy who has two young kids, that’s terrific.

I am starting to think that the same basic principle applies to software engineers. I’ve met brilliant engineers with all sorts of characteristics. Some of them have a lot of education and read all the latest guides. Others have little education, and don’t read at all. The main thing that makes them good engineers is that they take pride in their work. They care about the quality of their work, regardless of how many people are going to use it, or how much time they put into it. They write quality code because their work matters.

So when it comes to managing software projects, I’m starting to think that all of these systems boil down to two basic steps.

  1. Put your engineers in a position to take pride in their work.
  2. Get out of the way.

Obviously, the first step is non-trivial. It’s why there are so many books on the topic. But at the end of the day, pride is what matters.

Sometimes It’s Okay to NOT Write Unit Tests

I recently lost about two-and-a-half days to unit/integration tests. At Mighty Networks, we are pretty proud of our test coverage, and we make writing tests part of the development process. Developers are required to write tests for every feature they implement, but in the past few days I’ve seen that this policy needs to be applied flexibly.

A few months ago we wrote a pretty expansive integration with the iTunes Store. Since we allow people to sell subscriptions through our app, we needed a fairly complex integration. Apple’s developer APIs are notoriously crappy, so this took a full team effort. One developer wrote a series of tests for our Apple integration.

In theory, the tests are very thorough. But getting real data for testing is virtually impossible. So the developer faked up a JSON file, then wrote a preprocessor to generate fake data in a format that looked like Apple’s. Then he wrote tests.

You may see where this is going already.

The tests he wrote essentially tested his preprocessor. Rather than testing the actual methods used in the integration with Apple, the tests looked at the values generated by the preprocessor. Essentially, by writing a clever object to fake Apple data, he removed the actual integration from the tests.

The tests looked correct. They seemed to show that our Apple code worked. But really they were mostly testing the test code itself.
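Our codebase is Rails, but the anti-pattern distills down to something like this Python sketch (all names invented). The first test can only fail if the fake-data generator breaks, because the real parsing code is never called; the second actually pushes the fake payload through the code under test.

```python
# Distilled sketch of the anti-pattern; all names are invented.

def fake_apple_receipt(product_id, status=0):
    """Test-only preprocessor that fabricates Apple-like data."""
    return {"status": status, "receipt": {"product_id": product_id}}

def parse_receipt(payload):
    """The real integration code that should be under test."""
    if payload["status"] != 0:
        raise ValueError("invalid receipt")
    return payload["receipt"]["product_id"]

def test_receipt_anti_pattern():
    # Asserts against the fake generator's own output: this exercises
    # the preprocessor, not the integration.
    payload = fake_apple_receipt("com.example.monthly")
    assert payload["receipt"]["product_id"] == "com.example.monthly"

def test_receipt_better():
    # Feeds the fake payload through the real code path and asserts
    # on the integration's behavior instead.
    payload = fake_apple_receipt("com.example.monthly")
    assert parse_receipt(payload) == "com.example.monthly"
```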

So when I modified a related system, and added a few tests, I suddenly saw a massive cascade of failures all over the place. The failures were of different types too. Sometimes there was a null value, or an unexpected ID, or an error seemingly from Apple.

It took me a day to figure out that the Apple integration itself wasn’t failing; the preprocessor just wasn’t set up to work with the rest of the system. It took another day and a half to pull out the worst part of the system and replace the non-tests.

I don’t blame the developer who wrote the tests. It’s a common mistake, and we’ve all made it at least once.

In part I blame Rails, because it encourages black-and-white thinking about software development.

The developer followed the rule that he needs to write tests for every new feature. When he integrated with Apple, he diligently wrote his tests.

The problem arose when he realized that he couldn’t run production code to get real data. He didn’t know how to write a test for the algorithm, so he wrote code that generates Apple-like data, then tested that.

The developer failed to see that writing tests is a guideline, rather than a rule. In this case, it is very difficult to test every part of the integration. It’s acceptable to write tests of the core process, without testing specific return values and specific pieces of data. The tests gave the impression of working code, and full test coverage. But they hid a few problems with the integration by testing for specific values, rather than algorithmic correctness.

So what can we do?

Senior developers need to encourage junior developers to talk about problems that arise when following “the rules.” Senior developers need to foster an environment where it’s okay to admit that a portion of the code just can’t be tested, or at least that it can’t be tested in the same way as most of the code. Senior developers need to encourage critical thinking in situations where a strict interpretation of the rules may not lead to the best results for the development process.

Learning to Talk About Inaccuracy for a New Data Engineer

About a month ago, the engineering team at Mighty Networks was impacted by China’s now-defunct one-child policy. A parent of our data engineer was having health problems, and because there were no other children to help out, he was forced to relocate his family back to China.

It took us around six months to hire him, and with a bunch of data projects now in the pipeline, we couldn’t go through the hiring process again. Fortunately, I was excited to step into the breach. I’ve never formally trained as a data engineer, but I built a data warehouse from scratch for another startup, and I’ve always had a passion for numbers.

Still, I’ve definitely struggled a little bit in the new role. One of the things I’ve struggled with most is how to communicate numbers to the business team. It’s fine to make pretty visualizations, but how do I communicate the subtlety in the data? How do I communicate the fact that the numbers are as accurate as we can get, but there are still some sources of error ever-present in the system?

I came up with the following guidelines to help me talk to the business team, and I thought they might be useful to other programmers who are in a similar position.

Sources of Error

There are two broad categories of error in any data analysis system.

  1. Data warehouse replication problems
  2. Bugs and algorithmic errors

Data Replication Issues

Inaccuracy of the first type is unavoidable; it is a universal problem with data warehouses. Data warehouses typically pull in huge amounts of data from many sources, then transform and analyze it. In our case, we have jobs that should pull data hourly, but these jobs can fail due to infrastructure errors, such as an inability to requisition server resources from Amazon. So we have jobs that run daily as a fallback mechanism, and we have manually triggered jobs that can re-pull all of the data for any table.

Typically, the data should be no more than an hour behind real time.

When ingestion jobs fail, the data can be recovered by future jobs. Typically, data replication errors do not result in any long-term data loss.
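A sketch of the mechanism (Python, with hypothetical job state and table names): each run pulls everything since the last successful watermark, so a failed hourly run is automatically covered by the next run, the daily fallback, or a manual backfill.

```python
from datetime import timedelta

# Hypothetical watermark-based ingestion; the state storage and
# fetch function are stand-ins for illustration.

watermarks = {}   # persistent job state in a real system
warehouse = {}    # the warehouse, keyed by (table, primary key)

def ingest(table, fetch_since, now):
    since = watermarks.get(table, now - timedelta(hours=1))
    for row in fetch_since(table, since, now):
        warehouse[(table, row["id"])] = row   # idempotent upsert
    watermarks[table] = now                   # advance only on success

def backfill(table, fetch_since, start, now):
    # Manual recovery: reset the watermark and re-pull everything.
    watermarks[table] = start
    ingest(table, fetch_since, now)
```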

Bugs and Algorithmic Errors

It’s important to remember that the data analysis system is ultimately just software. As with any software project, bugs are inevitable. Bugs can arise in several ways and in several different places.

  1. Instrumentation. The instrumentation can be wrong in many ways. New features may not have been instrumented at all. Instrumentation may be out of date with the assumptions in the latest release. Instrumentation could be conditionally incorrect, leading to omitted data or semi-correct data.
  2. Ingestion. The ingestion occurs in multiple steps. The data has to be correctly propagated from the database, to the replicated database, to the data pipeline, to the data warehouse. Errors in ingestion often occur when only part of this process has been updated. In our case, fields must be added to Redshift, to Kinesis Firehose (for events), and to Data Pipeline (for db records), and then they must be exposed in Looker.
  3. Transformation and Analysis. The presentation of advanced statistics rests on several layers of analysis and aggregation. A small typo or mistake in one place can lead to a cascade of errors when that mistake affects a huge amount of data.

How to Talk About Inaccuracy

The best way to talk about inaccuracy is to talk about what steps you have taken to validate the data.

  • What did you do to validate the instrumentation? How did you communicate the requirements and purpose of the new events to the developers? Did you review their pull requests and ensure that the events were actually instrumented?
  • What did you do to validate the ingestion? Did you see events coming in on a staging environment? Did you participate in testing the new feature, then verify that your tests percolated through to staging analytics? Did you read the monitoring logs?
  • What did you do to validate the analysis? Did you compare the resulting data to the data in another system? Did you talk through the results with a colleague? Did you double-check the calculations that underlie your charts? Even when they were created by other/former developers? Did you create an intermediate chart and verify the correctness at that level of analysis? When you look at the data from another angle/table, do the results make sense with your new results?

Don’t dwell on the sources of error. Talk about what you have done to minimize the sources of error. In the end, this is software. Software evolves. The first release is always buggy, and we are always working to refine, fix bugs, and improve.

Make a plan to validate each data release the way the rest of the team validates the consumer-facing product. Use unit tests, regression tests, and spot-checks against production to validate your process.
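For example, one spot-check could compare a top-line count in the warehouse directly against the production database. The table name and tolerance below are hypothetical:

```python
# Hypothetical spot-check: compare a top-line count in the warehouse
# against production. Cursors come from your database drivers; the
# table name and tolerance are invented for illustration.

def row_count(cursor, table, where="1=1"):
    cursor.execute(f"SELECT COUNT(*) FROM {table} WHERE {where}")
    return cursor.fetchone()[0]

def spot_check_subscriptions(prod_cur, warehouse_cur, tolerance=0.005):
    prod = row_count(prod_cur, "subscriptions", "status = 'active'")
    wh = row_count(warehouse_cur, "subscriptions", "status = 'active'")
    drift = abs(prod - wh) / max(prod, 1)
    assert drift <= tolerance, (
        f"warehouse drifted {drift:.2%} from production ({wh} vs {prod})"
    )
```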

Top Line Numbers vs. Other Numbers

In general, you can never be sure that a number is absolutely 100% correct due to the assumptions in the process, and the fact that you must rely on the work of many other developers. Most charts should be used in aggregate to paint a picture of what is happening. No single number should be thought of as absolute. If possible, you should try to present confidence intervals in charts or use other tools that represent the idea of fuzziness.
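As a toy example (the numbers are made up), a conversion-rate chart could carry a normal-approximation interval instead of a bare point estimate:

```python
import math

def proportion_ci(successes, total, z=1.96):
    """95% normal-approximation confidence interval for a rate."""
    p = successes / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p - margin, p + margin

# Made-up numbers: 230 conversions out of 4,120 signups.
low, high = proportion_ci(230, 4120)
print(f"conversion: {230 / 4120:.1%} (95% CI {low:.1%} to {high:.1%})")
```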

But as in everything, there are exceptions.

With particularly important numbers, if the amount of data that goes into them is relatively small, you can manually validate the process by comparing the results with the actual production database. The point is that for the most important, top-line calculations, you should be extremely confident in your process. You should have reviewed each step along the way and ensured, to the best of your ability, that the number is as close as possible to the real number.

TL;DR

When you’re trying to communicate the accuracy of your data to the business team…

  • Focus on what you have done to validate the numbers.
  • Keep in mind that the data analysis process is software that evolves toward correctness as all software does.
  • Validate data analysis like you validate any other software.
  • Where it’s possible and important, do manual validation against production so you can have high confidence in your top line numbers.