Big companies won't fix their bugs

Can't? Won't? Tomato, tomato.

Aug 28, 2019

Hi. Can here. Today we talk about buggy software.

The Bug That Got Us Here

We started The Margins as a podcast around two years ago when Ranjan and I first met. Yet, as soon as we recorded our second episode, I was unable to upload it to Apple’s Podcasts directory. The website simply didn’t let me log in. I filed a support ticket, which got me nowhere. I asked friends who work over in Cupertino to fast-track it, and they casually ignored my requests. It took almost a month of back-and-forth with more than a handful of people to get it fixed. In the meantime, life happened, and I left New York. Our podcast project took a hiatus.

I was tempted to tweet, as I often do when I am frustrated. “Wow”, it would go. “Apple is a trillion-dollar company yet it cannot get its podcasts page right”. It’d feel great, as I collected all the internet points. Then, I remembered that I am an adult who has better things to do. I instead tweeted a joke about venture capitalists, before asking one to refer me to his portfolio company for a job.

The rant-cum-joke format is familiar. You see it often. On its face, it’s a good question. Why are things always broken? Why do huge tech companies not seem to be able to fix minor issues? Don’t they have all the resources in the world? There are thousands of engineers working there! What do they all do all day?

There are three main issues. First, at any tech firm, a significant amount of work is simply wasted. Second, behind the scenes at any tech product, there’s a lot more than meets the eye. And third, and most importantly, things are always broken in one way or another, and that’s expected. It might even be fine.

I’ve been there. Let me tell you why this stuff happens.

Technical Bureaucracy

Let’s talk about wasted effort. One big culprit is the technological red tape that you need to wade through. At a big enough firm, you do not simply check in a piece of code and see it go into “production”, as code that serves users is generally called.

Getting code out in the wild in a big company is less like doing it live and more like completing the 12 Labors of Hercules. Instead of satisfying capricious gods with arbitrary quests, you appease fellow engineers in code reviews. There are no monsters to slay, but there are style guidelines to abide by, build systems to turn green, and privacy reviews to pass (if you are lucky). Your resolve gets tested often and violently.

More hearts have been broken over the wrong type of delimiter in a logging script than you’d think; people really get hot and bothered about this stuff. I once received an email from my manager titled “I would prefer that you didn't submit this” and thought I was being fired. Turns out it was a templated message for a change request on my code submission. That was a minor nuisance then and is just a funny story to tell now. But any engineer worth his or her salt will have a few stories of code review skirmishes. It’s a rite of passage.

Of course, none of those systems are individually bad, nor do they exist to slow people down. Ideally, they are there to ensure some modicum of quality and to let multiple people work on a shared code base, both simultaneously and in the long term.

Sometimes, however, they are just bandages over battle scars from previous outages that no one ever took off. Unfortunately, and just as likely, those various compliance and regulatory systems exist to worship a cargo cult whose teachings come in the form of authoritative blog posts. Even if they keep things running and ensure momentum in the long run, they create considerable friction in the short run and add to the immediate cost. Order does not come cheap.

Duplicated Effort

A more insidious form of wasted effort, however, comes in the form of duplication of effort. Join any sizable tech company, and you’ll see there are 5 different teams doing the same thing in 10 different ways. A common hazing ritual is asking a new hire to go through the wikis to figure out which storage system he or she should use for a first project. My friend once told me that there are always two types of internal systems at Google: one that is in beta and doesn’t have the features you want yet, and another that is deprecated and you shouldn’t be using anymore. Whichever you choose, you lose.

A human reason for duplication of effort is individual people wanting to own more. Organizational theory people sometimes talk about “empire building”, where a person wants to increase his or her authority against the larger goals of the organization. Often, this happens through padding your headcount.

However, in a tech company, fiefdoms are ranked not just in headcount (since we are in the business of automating people, generally) but also in technical reach. Many a time, teams build their own systems in spite of existing systems, to amass more power. More experienced empire builders do not just stop there, but make the case that they need more people to actually decommission the other competing systems. People, generally, are the worst.

Deadlines

But people also respond to incentives. While not always the best motivator or a source of good vibes, a reward coupled to an unrealistic deadline is an incentive. Things need to get done, and fast. You get to work.

As we discussed before, most tech companies operate in a service-oriented fashion now. A good chunk of the work is simply figuring out which parts to use and tying them together in a productive way. This sounds great in theory, but in practice, those different parts rarely work exactly as you want them to, or even as described. There’s always a part that’s missing, or broken in some subtle way (more on this in a bit), or slow.

Whatever the reason, you are then faced with two choices. You either go and interact with the people who maintain that piece of code, or you just build it yourself. Again, ideally, you do the former, and this helps the company overall. However, since people are generally busy (and The Worst, as we discussed), this is an annoying exercise.

Other teams are often toiling on their own unrealistic deadlines, and barely have time to listen to your demands, let alone satisfy them. No manager gets measured on how convincing or diplomatic he or she was to other teams, but on his or her output (and his or her output only). Then, it only becomes natural to simply build your own systems and call it a day. If I had to guess, this is probably why most duplication happens in most companies.

Internal Competition

But there’s also a special set of companies that seemingly encourage this duplication. I’ve never worked at Google, but I have heard many times that the company does encourage such internal competition - or at least used to. This, coupled with an (un)healthy dose of empire building, is why you end up with competing messaging products and competing music stores.

Hiding the Complexity

But duplication is not the only reason companies never seem to have enough people on board. One view of technology is that it absorbs complexity and only exposes a simplified interface.

An automatic gearbox is the token example here; while it is mechanically extremely complex compared to a manual transmission, it exposes a much simpler interface. You might pay a little in terms of less personalization, or limited customization, but that’s not a foregone conclusion. More sophisticated systems do not just automate, but also allow tinkering, like the Tiptronic gearbox. And, not surprisingly, those systems are even more complex than those that just automate. I hate the term manumatic, but it’s a thing. Any mechanic will tell you that those are more complicated than a simple manual gearbox.

Imagine a service like Reddit. In terms of functionality, it’s a simple list of links where people vote things up and down and comment. But dig a couple of levels deeper, and you realize there’s a decent amount of complexity. Any user-generated content system has to handle spam and porn, which often come together.

So you end up building moderation tools and automated systems (again, humans are expensive, the worst etc etc). You need to sell ads to keep the lights on, so you build an ad delivery system. You hire ad sales people, but the real sustainable margins are in self-serve advertising, so you build that too. Soon, you realize ads don’t pay well unless they are targeted, so you end up needing to know more about your users and tracking them all over the web.

Things work well for a while. Maybe your competitor blows up spectacularly and you end up with lots of users. Then, what used to work at your previous scale stops working, so you need to rebuild things from the ground up, while trying to make sure you do not blow up spectacularly yourself. You end up rebuilding a plane while in flight. There’s a ton of complexity in being able to even serve a list of links. You want to do all this with as few people as possible to protect The Software Margins, but those engineers that build the stuff are still people that need to get paid.

It’s just code

Again, a good model of Lyft (or Uber, sure) is an app that allows you to call a ride. Yet, there’s an increasing amount of complexity where you need to interact with various identity verification systems, import and export data to and from various third parties, allow several hundred different city teams to model their traffic flows, handle arrears, build maps, acquire and retain drivers, and calculate incentive and trip payments. Most of this stuff isn’t brain surgery, by any means, but all this needs to be built, and debugged, and rebuilt at increasing scale.

We talked a lot about why companies need so many people, and barely scratched the surface. Obviously, some companies are pathologically bloated, and their profitability simply hides it. It’s also true that most growth stage companies err on the side of a bit of extra headcount, in an often vain hope of skating where the puck will be. My co-host loves to view the world through the lens of Zero Interest Rate Policy, or ZIRP for short. There’s a hint of truth there: in the age of practically endless capital, the bottom line is rarely a concern. Top line, your revenues, is where the action is.

OK, fine. But why are things always broken? Isn’t there some end state, where things just work?

Probably not.

Broken is Natural

Complex systems run in degraded mode (PDF link). Moreover, complex systems that are in flux by design, such as agile software, do not have a stable, error-free state that can be well defined, let alone achieved. The combined natural and artificial forces of constant technological change, ever-changing user requirements, shifting organizational roadmaps, and an evolving competitive landscape make it impossible to define specifications in anything more solid than quicksand.

You take a deep breath, throw out your best effort and hope it sticks to the wall long enough for you to make a profit (or not), and then move on. After all, even the best-laid product requirement documents do not survive first contact with the users.

This is not a fatalist view of the world, but simply an acknowledgement of the business realities of our industry. There’s a lot more money to be made by putting out something that works well enough than by trying to roll out something perfect. Facebook gets a well-deserved bad rap for accidentally breaking our liberal democracies or fanning genocides with its long-term product and policy choices, but its day-to-day bugginess has been a feature from day one. No one wants to fix things that don’t matter.

No matter how you spin it, agility and stability are in conflict in software. And, yes, ZIRP does play into this also. In an era of extremely low margin products competing with each other, scale is not just how you win, but how you survive. You need to reach scale, and this means reaching as many people as possible, instead of satisfying a few fully. Growth stage firms especially are beholden to this power. Maturity demands stability, but expansion requires creative destruction.

Software is hard. As a somewhat seasoned practitioner of this illustrious form of science and art, it amazes me daily that anything works. I do not miss the days of shrink-wrapped software, but I do often live in fear of my favorite software going haywire after a benign update. Being able to steer the ship in a sea that goes from boiling to still at a moment’s notice is what attracted me to product management in the first place.

As I am now interviewing (say hi) for jobs, one question I always ask is “What is your bottleneck today?”. I ask this both to understand if a firm is really growing and to see if the person I am talking to is looking at things systematically. The most common answer I get is “talent”. There are just not enough people to fix all the bugs and also move forward. Of course, maybe someone is really trying to hire me (somewhat likely), and saying nice things. Or maybe stuff takes a lot more people than you think (more likely).

What I’m Reading

Deconstructing Google’s excuses on tracking protection: Privacy is big news now. Apple’s WebKit team stirred the pot recently by announcing more aggressive protections against many forms of tracking. Obviously, there has been a lot of pushback. Ben Thompson at Stratechery wrote one (which I disagree with, more on that later). And of course, so did Google, which makes its money by selling targeted ads, which does require a bunch of tracking. Arvind Narayanan and his team of researchers wrote about Google’s take on Apple’s take. They are not impressed.

There is nothing new about these ideas. Privacy preserving ad targeting has been an active research area for over a decade. One of us (Mayer) repeatedly pushed Google to adopt these methods during the Do Not Track negotiations (about 2011-2013). Google’s response was to consistently insist that these approaches are not technically feasible. For example: “To put it simply, client-side frequency capping does not work at scale.” We are glad that Google is now taking this direction more seriously, but a few belated think pieces aren’t much progress.

Streaming Video Will Soon Look Like the Bad Old Days of TV: Since I moved to the US in 2006, I have paid for cable for only 1 year. I do not watch sports other than Formula 1 and tennis, and my TV intake is limited to putting Scrubs and Frasier re-runs on as background noise and watching whatever Armando Iannucci puts out. So I am not the best person to comment on the new streaming wars. But Adam Bell is, having worked at Amazon Studios. He thinks we are now building so many streaming services that we’ll end up again paying for TV shows we never watch. And Netflix is in trouble, because they’ve bet the farm on not having ads.

Behind this bill is the cost of making high-quality programming. Although much has been said about how Netflix and Amazon have disrupted the video business, no media company has figured out how to make premium movies or TV shows significantly more cheaply. In fact, competition has driven production budgets even higher. Ultimately, these costs are paid by viewers (especially if they choose to watch without ads).

Margins by Ranjan Roy and Can Duruk
