27 min read

ProdFund 1.8: Quantified Leap

The idea that everything is a test, and software is driven by data, is a very recent notion. Indeed, rigorous experimentation is a pretty new idea in every context. Let's see how it became ubiquitous in modern software development.
[Image: Digital collage of a scientist at a workbench with computer, lab equipment, and abstract shapes in the background]

Product Fundamentals is a podcast dedicated to spreading the core knowledge software product people need in order to succeed. Season 1 is structured as a history of how we make software.


The 2010s were characterized by a huge shift toward rigorous experimentation and quantification in many aspects of software development. Big data and data science, A/B testing, quantified OKRs – it was the decade when we all needed to brush up on our statistics.

This episode, we cover the history behind this shift and the technologies that made it possible. Plus, we take a little detour into the contentious issue of how to make a proper cup of English tea.

The audio for this episode is embedded below, and the episode transcript follows.

You can also find this episode of the Product Fundamentals podcast on the show website and through all the usual podcast services.

Transcript

Hello friends, and welcome back to the Product Fundamentals Podcast, episode 8: Quantified Leap

Last time, we discussed the concept of the Minimum Viable Product, and the 2010s startup wave that has so shaped our contemporary software industry.

This episode, we’ll look at that recent period through a slightly different historical lens.

Periodization

In looking back on this history of software development, we can apply a rough periodization that captures the mix of cultural, economic, and technological forces that dominated each period.

To my mind, the first of these would be the era of Big Science, running from the earliest computers in the mid 1940s through to roughly the NATO conference of 1968. 

From 1968 to roughly 1980 might be the period of professionalization, as the intellectual class sought to turn software into a real engineering discipline, and the technology became increasingly relevant to large corporate customers.

From 1980 to 1995, miniaturization brought hardware increasingly within reach of individual users, while corporatized software development methods generally floundered.

Then we might see the period from 1995 through about 2010 as the era of developer class consciousness and the brash entrepreneur. 

In this structure, I might tentatively label the period from 2010 to 2020 as the era of experimentation and quantification. In this phase, nearly everything comes to be the subject of measurement and analysis. I call this “experimental” and “quantitative” rather than “scientific,” because the early computing pioneers were absolutely scientists, and as we saw last episode, writers like Steve Blank were certainly investigative. But until this point, software development had mostly happened without the trappings of statistics and data that we live and breathe today.

This most recent phase of software has encouraged us to become rigorous experimentalists and actuaries; its shibboleths are A/B tests, statistically significant results, and tracking percentage to goal for a KPI. We can even see this trend toward quantification in the wider culture and in the popular press; Google Books shows terms like “data-driven,” as in “data-driven approach” or “data-driven decision,” steadily increasing in use by a total of 350% between 1999 and 2019.

But with any periodization scheme like this, the seeds of each phase are rooted in what came before. So this episode, we’ll return to my favorite form, and dig way back into the deeper history of these trends, and then trace how the hallmarks of the quantified software period came together.

There are several different threads to follow here. First, we’ll discuss how software teams tested their products through alpha and beta testing, dating back to well before the Internet. Then we’ll cover the emergence of web analytics and “big data” in the early 2000s, before jumping to the scientific precursors of the modern A/B test. We’ll touch on the rapid spread of the quantified OKR, and finally, we’ll talk through the uneasy relationship that Agile methods and their forebears have had with the leap to quantification.

Alpha and Beta testing

The story of alpha and beta testing is straightforward, though the sources available to us involve some personal recollections from participants rather than totally documented history. There’s a little more apocrypha here than I’d like, but that’s what we’ve got to work with.

Pre-Internet

As best we can tell, alpha and beta testing for computer systems began at IBM, perhaps as early as the 1950s, though they didn’t use the Greek lettering yet.

Fun fact: the terms “Alpha test” and “Beta test” had been used since World War I as the names for two aptitude tests for recruits in the US Army. In that context, the Army Alpha test was for recruits who could read; the Army Beta test was for those who could not. The first versions of these tests were widely criticized and retired after the First World War, but subsequent versions, which also used the name "Alpha,” were employed until at least the late 1930s.

Perhaps because of this existing usage, the labels “alpha” and “beta” were not applied to technology testing until the mid-1970s.

Remember that through the early decades of commercial computing, hardware and software were still tightly coupled. There was no easy distribution channel to get software updates to customers. As we discussed in Episode 4, heavy testing before delivery was seen as critical.

So, when IBM was preparing a new product, they would go through at least two rounds of testing: First was “A Testing” – like, tests labeled with the letter A – which happened inside the company, before a product was even announced to the public. During the A tests, IBM employees would test a product concept to see whether a solution was viable at all. Once a rough working version of the hardware or software existed, even though it wasn’t yet ready for customers, the company could announce publicly that the product was in development.

Next came Field Testing, which may have internally been called B testing at IBM. In a field test, IBM deployed the new system to a limited set of sites, both within IBM and at customers. Especially for systems that involved new mainframes or new peripheral equipment, one of the purposes of the field test was to ensure the hardware would function outside the controlled laboratory or workshop environment where the alpha testing had taken place. 

Due to the very high sticker price of systems, and the fact that they were often customized to particular client needs, these field tests weren’t like modern beta testing. They weren’t necessarily for collecting client feedback that would inform an upcoming general product release. Rather, IBM field engineers would be deployed to the test site along with the hardware to smooth the installation for the client, integrate the new system alongside the client’s existing ones, and measure real-world performance of the system outside the lab. Basically, they were more like today’s solution engineers. In a world of relatively low-volume, high-priced, integrated hardware and software systems, this makes sense.

There doesn’t seem to have been a watershed moment when alpha or beta testing took off in this early period. Instead, the practice diffused slowly into the industry. By the mid-1970s, the industry press occasionally reported on beta tests of new computer systems, using that language. My best guess is that the switch from the labels “A Test” and “B Test” to “alpha” and “beta” was just a natural evolution to remove the spoken ambiguity of the letters “A” and “B,” which sound like the words “a” and “be.”

Consumer Beta tests

It wasn’t until the consumerization of software that the beta test took on anything like its modern form.

Microsoft stands out as one of the first large consumer companies to perform a beta test with members of the general public. From mid-1994 to its general release in August 1995, Microsoft went through several public betas of its Windows 95 operating system, then code-named Chicago.

Microsoft ran this beta by physically mailing floppy discs with draft versions of the operating system to members of the public who had opted in. The scale of the beta was massive: by late 1994 there were already 50,000 testers, and by March 1995 there were 450,000 testers.

It seems that a significant number of these testers had actually paid for the privilege of participating: for the very cute price of $19.95, users in the US could actually pay Microsoft in order to receive the beta version of Windows 95. Microsoft called this beta a “preview.” The customer-turned-tester would receive a set of floppy discs that they could use to install the new operating system over an existing Windows 3.1 installation. That preview OS was scheduled to stop working just after the general release of Windows 95.  

Microsoft’s Preview program shows that consumer software beta tests have been as much a marketing gimmick as an actual testing protocol for as long as they’ve been around. The Windows 95 preview generated buzz and news coverage – I found short articles in the New York Times about it – and the program even generated net new revenue! Testers paid nearly $20 to use the beta for less than a year, and then paid $109 to upgrade in the general release.

Google iterated on the Microsoft preview strategy with the launch of Gmail in 2004, which limited growth using peer-to-peer invitations to a limited beta. This gave Gmail membership extra cachet and a social cool-factor, while giving Google some control over the speed of product adoption. Gmail even retained the “beta” label after the service became open to anyone in early 2007. At that point, “beta” no longer indicated a testing phase with a limited userbase – beta was a status symbol of sorts for users, while letting the company retain a modicum of strategic flexibility in how they positioned the product.

Another prominent early user of the consumer beta test was Netscape, which released their Netscape Navigator browser – the forerunner of Mozilla’s Firefox – as free software for download over the Internet in October 1994. 

Netscape made substantial use of the beta as marketing. A beta version of Navigator was to be free, while Netscape said they would charge most customers for the general release version. As Netscape explored its business plan opportunities, the freely available beta morphed into a way to “try before you buy.”

Netscape’s use of the beta also indicates an understanding of the error-tolerant nature of technology early-adopters. Netscape iterated up from version 1.0 to 3.0 in less than two years between 1994 and 1996, releasing a regular stream of free public beta versions along the way. Many early adopters were happy to accept the stability tradeoff in order to get access to the cutting-edge.

In January 1998, Netscape announced that it had given up on the idea of charging for its software, committing that future versions would be free, including stable general releases. This removed the financial incentive for users to try the beta software, but the draw of getting the newest version of free software had been established. To this day, every major browser – Chrome, Firefox, Edge, Opera, Brave, you name it – always has an unlimited open public beta available. Thanks to Netscape, the perpetual beta is here to stay. 

The always-present open beta evolved into a new approach to software versioning: that is, release channels. To the roster of stable and beta, Google added the additional always-available channels of “Dev” and “Canary” to its Chrome browser and later its ChromeOS operating system. Firefox and some other installed software have followed suit.

Some vendors targeting corporate clients, especially those offering variants of the Linux operating system, began offering “Long Term Support” or “LTS” versions of the software in the mid-2000s, which in a certain sense are the anti-Beta test: in these cases, the provider is promising to make fewer-than-usual changes for years to come in order to prioritize stability.

Data suggest that beta testing has been in an apparent decline since around the year 2000 (trends, ngram). This makes sense; beta testing is essential for packaged software, where once the software is delivered, bugs are very hard to address. But in the age of Software as a Service, and of the lean startup, maintaining multiple versions mostly becomes more trouble than it’s worth. It’s classic “waste,” from a lean perspective.

These days, we’re rarely testing whether the software works in the sense of simply running smoothly on the hardware. That sort of in situ testing was a necessity for the complex integrated hardware and software systems of the mainframe age, when beta testing emerged. Now, web and app platform standards greatly reduce the space for the many compatibility issues that drove beta testing in the first place. Instead, our analysis is at a much finer-grained level of resolution. We’ll get to that shortly.

But for all that much of the software industry seems to be moving away from beta testing, the beta remains alive and well in operating systems, browsers, and some installed software – though it is now perhaps as much a marketing and customer engagement tool as it is a real form of testing.

Big Data

Analytics

The first efforts to measure and understand user behavior on commercial websites date back to the early 1990s. Initially, the main data available to scratch this itch were server logs, which recorded the requests the server received from visitors, and some details about what the server returned.

Software to aggregate and analyze these log files followed, with prominent early vendors including WebTrends in 1993 and NetGenesis in 1994.

After the introduction of JavaScript into Netscape Navigator in 1995, it became possible for the web browser to return far more detailed information about the user and their behavior to the server. Increasing bandwidth and evolving JavaScript capabilities made it ever-easier to track details we now take for granted, like users’ click paths through a website, session lengths, and dwell time on individual pages.

An important player in the emerging analytics industry was Urchin Software, which started in 1998. Urchin’s software combined a server-side log reader with a browser-side JavaScript package, and then synthesized those data to provide businesses with more comprehensive information than had been available before.

In early 2005, Urchin was bought by Google, and its capabilities were rolled into a new Google Analytics product in November of that year. Appropriately enough, Google Analytics also went through a limited beta, apparently to manage technical capacity – but Google still seized the opportunity for an invitation-based system, turning that resource scarcity into additional buzz.

By August 2006, Google Analytics was available to everyone. Suddenly, every website could have access to robust browser-based data about their traffic and user demographics. The basic table stakes for businesses to understand and manage their traffic had been raised.

In the years that followed the release of Google Analytics, new market entrants would expand the analytics tool chain, with players like Mixpanel specializing in cohort analysis and funnel measurement, and others capturing similar analytics data from mobile apps. By the late 2010s, a raft of tools even offered the ability to replay an individual user’s session, enabling us to watch every interaction a user has with our web page or app.

Hadoop

All of this tracking data, though, posed new technical challenges. While it may have been possible to collect data about some user interactions on the early Internet, it would have been prohibitively expensive to store, organize, and analyze data at anything close to the scale that is common today.

The hard problem of handling such vast amounts of data picked up the moniker “Big Data” in the late 1990s. Plenty of people have claimed credit for coining the term, and it’s a generic enough phrase that it’s hard to pick a clear winner for who first used it in the sense we mean it today. But suffice it to say, between about 1997 and 2000, lots of people were noticing that the world was about to be awash in collections of data far too large for anyone to reasonably store or process on one computer. 

Shortly thereafter, the concept of “data science” coalesced as a distinct academic discipline, applying statistical methods to analyze large amounts of computer data. The creatively named Data Science Journal and The Journal of Data Science launched in 2002 and 2003, respectively.

Critical technologies for wrangling this tidal wave of data emerged from Google in the early 2000s, with public papers on the Google File System and MapReduce released in 2003 and 2004. These papers inspired the development of Hadoop, a set of open source utilities that made it possible to use large amounts of relatively cheap hardware to store and process huge amounts of data in parallel. After Hadoop’s initial release in 2006, the technology spread rapidly. As technologies like Hadoop and its descendants were adopted into the cloud infrastructure-as-a-service offerings that we discussed last episode, the pieces suddenly fell into place for many more businesses to engage in serious data analysis.
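
To give a feel for the programming model those papers described – and that Hadoop implemented at datacenter scale – here is a toy, single-machine sketch of a map-and-reduce word count in Python. The function names and data are purely illustrative; this is not Hadoop’s actual API, just the shape of the idea.

```python
from collections import defaultdict

# Hypothetical input: in a real cluster these "documents" would be blocks of a
# huge file spread across many machines, each processed by a separate mapper.
documents = ["big data big ideas", "data about data"]

def map_phase(doc):
    # Emit (key, value) pairs; here, one (word, 1) pair per word.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Combine all values for one key; here, a simple sum of counts.
    return key, sum(values)

all_pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(all_pairs).items())
print(counts)  # {'big': 2, 'data': 3, 'ideas': 1, 'about': 1}
```

The appeal is that each map and reduce step only ever sees a slice of the data, so the same program can be fanned out across thousands of cheap machines.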

It certainly hadn’t been impossible to store and analyze larger amounts of data than could fit on a single drive before Hadoop. But the cost, complexity, and overhead all plummeted with these new technologies. Queries that had once taken hours to resolve, or were even too unwieldy to run at all, became answerable in seconds or minutes.

A/B Testing

With the ability to store and process huge amounts of data diffusing by the late 2000s, the capabilities were in place for the widespread adoption of the software A/B test. But this now ubiquitous practice was building on a set of ideas, not just technology. 

As I started working on this topic, I joked to myself that I’d have to resist the temptation to go deep into history, because I’d end up doing a history of all of science.

Buuuut, as it turns out, experimentation is actually a pretty recent phenomenon, at least at the level of sophistication we apply in A/B tests. Humans have been trying stuff and seeing what happened since time immemorial, and observational studies looking at differences between preexisting populations have been conducted since the 1800s. For example, in 1855, John Snow – not that one – showed that cholera spread through contaminated water by observing that outbreaks in London were tied to which water company served each of the city’s neighborhoods.

Fisher and the Randomized Controlled Trial

But the idea of running experiments over groups of humans, exposing them to some stimulus, and quantitatively measuring the results is actually shockingly recent. Medical treatments were a major driver of early 20th century experimentation, but random assignment of subjects to control and test groups was a hard sell. Many physicians believed that they had special insight into the suitability of treatments for patients based on each patient’s unique context, and that vital information would be lost if treatments were assigned to a test or control treatment randomly. Besides, these were scarce medical treatments and there were sick people in need – giving medicine to people the clinicians knew were not sick, or giving placebos to people who were sick, just seemed wrong. 

The theoretical work credited with the proliferation of randomized controlled trials was done by the brilliant British statistician Ronald Fisher. His books, especially 1935’s The Design of Experiments, laid out a methodology that remains with us today, including segmenting experiment subjects into blocks for analysis, randomly assigning subjects to test and control treatments, ensuring reproducibility of the test and its analysis, and the use of a null hypothesis.

Fisher illustrates the idea of a null hypothesis – that is, an initial assumption that a test condition will have no effect – with the example of an English woman who claims she can tell by taste whether a cup of tea was prepared by pouring the milk or the tea into the cup first. Fisher walks through how, by serving the woman eight cups of tea (half prepared tea-first, half milk-first) and having her state how she believes they were prepared, he could then use statistical methods to evaluate her ability. The null hypothesis would be that she had no ability; by working through the combinatorics, Fisher said we should only reject the null hypothesis if she correctly identified all eight cups – an outcome with a probability of just 1 in 70, or about 1.4%, if she were merely guessing.
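
Just to make that arithmetic concrete, here is a minimal sketch of the calculation in Python – nothing from Fisher himself, only the combinatorics restated in code.

```python
from math import comb

# She knows four of the eight cups were made milk-first, so a guess amounts
# to choosing which 4 of the 8 cups those were.
possible_guesses = comb(8, 4)         # 70 equally likely guesses under the null hypothesis
p_all_correct = 1 / possible_guesses  # chance of identifying every cup correctly by luck alone

print(possible_guesses)               # 70
print(round(p_all_correct, 3))        # 0.014, i.e. about 1.4%
```

Because a single mislabeled cup forces a second error (one false milk-first and one false tea-first), anything short of a perfect score is far too likely to happen by chance, which is why Fisher set the bar at all eight.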

As an aside, researching this bit took me down a rabbit hole I never knew existed. It turns out that some British people have very strongly-held positions on whether to add tea or milk first. George Orwell was for tea-first, and then stirring the cup as milk is added. ISO standard 3103, adopted in 1980, defines a standard for how to brew tea, and it is a milk-first protocol. This has no bearing whatsoever on the history of software development – I just thought you should know about it.

Anyway, Fisher’s proposed methodology for what is now the randomized controlled trial took off in agriculture, education, and elsewhere, buoyed by statistical innovations of the early 20th century, including the now ubiquitous Student’s t-test and the confidence interval.

Still, rigorous experimentation wouldn’t become a required part of the medical industry for decades more. In 1962, the law that had created the United States Food and Drug Administration was updated to require that all medications sold must have specific health benefits, and that drug manufacturers do the work to demonstrate those benefits; before that, medications needed only to be shown to be safe for health. In the subsequent review of all medications then being sold, fully 70% of the 16,000 claims of medical benefits made by drug manufacturers were false, resulting in many drugs becoming illegal. In 1970, the FDA issued guidelines for what drug companies had to do to satisfy this requirement, and those guidelines effectively required the Fisher model of a controlled experiment. (source)

Scientific Advertising

So that’s how we got the modern scientific experiment. But rigorous experimentation in business developed from a rather different angle than medicine: it came from advertising.

For much of his career, Claude Hopkins was a Chicago-based ad writer and executive at the advertising firm Lord & Thomas. He worked to develop ads for clients selling consumer goods like baked beans, toothpaste, and cigarettes.

In addition to supposedly inventing the advertising slogan – not any one specific slogan, but the entire concept of slogans – Hopkins’ biggest business innovation was the analytical use of the coupon in the first decades of the 20th century.

Ads in Hopkins’ campaigns, especially those sent out through the mail to prospective customers, would include distinctive coupons, which were unique to the channel they were sent through and the broader advertisement copy and imagery they were paired with. Those coupons entitled the customer to a free sample of the product, and when those coupons were redeemed at stores, the totals would be tallied. 

These tests weren’t random control trials per se, but they were certainly experimental, and they were certainly quantitatively analyzed. In his 1927 autobiography, Hopkins describes his work selling Pepsodent toothpaste:

“We keyed every ad by the coupon. We tried out hundreds of ads. Week by week the results were reported to me, and with each report came the headline we employed. Thus I gradually learned the headlines that appealed and the headlines which fell flat.”

Hopkins was single-minded that the point of all advertising was to increase sales in an attributable way. There was no value to abstract concepts like brand-building unless concrete sales could be tied to the campaign.

In this way, Hopkins’ work presages the modern A/B tester, endlessly optimizing the copy, colors, and layout of a landing page or a checkout funnel for conversions. Hopkins really was operating at a similar level of granularity, despite relying on an analog toolchain, and he faced the same pushback from people who valued holistic aesthetics over results that every homepage optimizer faces today. In his 1923 book, Scientific Advertising, Hopkins writes,

“... the ads you see today are the final result of all those experiments. Note the picture he uses, the headlines, the economy of space, the small type. Those ads are as near perfect for their purpose as an ad can be. … You may not like them. You may say they are unattractive, crowded, hard to read – anything you will. But the test of results has proved those ads the best salesmen those lines have yet discovered. And they certainly pay.”

Hopkins isn’t treating every campaign as a totally independent experiment, though. He continues later in Scientific Advertising:

“In a large ad agency coupon returns are watched and recorded on hundreds of different lines. In a single line they are sometimes recorded on thousands of separate ads. Thus we test everything pertaining to advertising. We answer nearly every possible question by multitudinous traced returns. Some things we learn in this way apply only to particular lines. But even those supply basic principles for analogous undertakings. Others apply to all lines. They become fundamentals for advertising in general. They are universally applied. No wise advertiser will ever depart from those unvarying laws.”

Every once in a while, you have to stop and appreciate the earnest, idealistic modernism of early 20th century thinkers like this. Hopkins, an advertising executive selling beans and cigarettes, is positioning his work as part of the optimistic, rigorous, 20th century scientific project: to discover and harness universal truth in service of human goals.

Hopkins’ Scientific Advertising became required reading for generations of advertisers. The split testing that Hopkins pioneered went on to be a staple of direct mail advertising and catalog retailer strategy from his time through to the present. And it took the intersection of Hopkins’ commercial motivation, Fisher’s structure of the randomized controlled trial, and the technological capabilities of distributed computing to create the modern A/B test.

The A/B Test in Software

The first known A/B test of the consumer Internet age began on February 27, 2000, at Google. The young search engine tested varying the number of items shown on the results page, with the default of 10 results as a control group, and test groups each with a random 0.1% allocation of traffic showing 20, 25, and 30 results. (source)

Apparently the test was an unexpected initial failure – something in the test’s implementation caused all of the experimental treatments to load results much more slowly than the control, hurting the site’s core metrics. Still, the genie was out of the bottle. A/B testing quickly became an essential part of the modus operandi at Google, as well as at other early Internet giants like Amazon, Microsoft, Netflix, and Intuit.
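
To ground what these experiments actually compute, here is a minimal sketch of how a two-variant test might be evaluated – made-up traffic numbers and a plain two-proportion z-test, with none of the machinery (sequential analysis, many variants, guardrail metrics) that real experimentation platforms layer on top.

```python
from math import sqrt, erf

def z_test_two_proportions(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (illustrative only)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                # pooled rate under the null hypothesis
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-tailed, via the normal CDF
    return p_a, p_b, z, p_value

# Hypothetical traffic split: control vs. one test variant.
p_a, p_b, z, p = z_test_two_proportions(conv_a=1_000, n_a=50_000, conv_b=1_100, n_b=50_000)
print(f"control {p_a:.2%}, variant {p_b:.2%}, z = {z:.2f}, p = {p:.3f}")
```

The platforms that emerged in this period wrap that calculation in random assignment, logging, and dashboards, but the statistical core is the same family of tests that Fisher’s generation worked out.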

As an aside, the term “A/B test” itself has no clear origin; it may have evolved organically in the statistics discipline, or among audiophiles doing side-by-side hardware comparisons. In the early days, a randomized controlled trial was called a live traffic experiment at Google, a flight at Microsoft, and a bucket test at Yahoo.

The practice gained wider awareness – and perhaps a standardized name – thanks to media coverage of the extensive and successful use of A/B testing in Barack Obama’s 2008 campaign, especially in fundraising emails and landing page optimization.

The Obama work was led by Google product manager Dan Siroker, who had taken a leave of absence to work on the campaign.

Siroker went on to co-found the A/B testing framework company Optimizely in 2009; he was initially competing with offerings from Adobe and Google. Many more entrants would come to market in the early 2010s, providing easy tools for the companies of the startup wave to use A/B testing from their inception.

Ever more mature platforms for A/B testing, whether internal tools at the giant companies or open tools for small companies, drove down the cost of experimentation. At the same time, the massive and growing scale of consumer software companies meant that experiments which improved business metrics by mere fractions of a percent could still pay for themselves quickly. Even at smaller-volume businesses, persistently low financial discount rates could make testing for marginal improvements rational, even if it would take years for those changes to pay back the cost of implementing the experiment.
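
As a back-of-the-envelope illustration of that discount-rate point – with entirely invented numbers – here is a tiny sketch of how long a one-time experiment cost takes to pay back out of a stream of discounted annual gains.

```python
def payback_years(annual_gain, upfront_cost, discount_rate):
    """Years until discounted annual gains cover a one-time cost (illustrative only)."""
    total, year = 0.0, 0
    while total < upfront_cost and year < 100:
        year += 1
        total += annual_gain / (1 + discount_rate) ** year
    return year if total >= upfront_cost else None  # None: never pays back within a century

# Hypothetical experiment: a marginal win worth $50k/year that cost $200k to build and analyze.
print(payback_years(50_000, 200_000, discount_rate=0.02))  # 5 years when money is cheap
print(payback_years(50_000, 200_000, discount_rate=0.20))  # 9 years when money is expensive
```

When rates are low, even a slow payback can be rational; when they rise, the same marginal test looks much less attractive.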

By the mid-2010s, Google, Microsoft, Amazon, Netflix, and other tech giants were each running ten thousand experiments or more per year, almost all as A/B tests, and the idea of software as a scientific experiment was firmly enmeshed in the industry culture. 

All of this testing and data inevitably changed the way teams work. “Data science” moved from a field of academic inquiry to a profession, with the first “data scientist” roles opening at Facebook and LinkedIn by 2008. A 2012 article in Harvard Business Review, co-written by one of the people credited with coining the term “data scientist,” called it “the sexiest job of the 21st century.”

Degree-granting programs quickly followed. In 2014, UC Berkeley began offering a Master’s program in data science; in 2015, NYU opened an undergraduate degree in data science.

By the late 2010s, this wave of quantification had significantly changed how we make software. A new function had been added to the roster at many of the most sophisticated companies. Existing roles, such as product managers, picked up new responsibilities to analyze tests. New features came to be routinely rolled out as tests, often with multiple variants running in parallel. After all, everyone has heard some story of the test variant no one thought would work that turned out to be the best.

This is new. Through the mid-1990s, software was built to satisfy the vision of the wise architect, or to meet the explicit requirements of the customer. For the Agile precursors and many Agile flavors, the customer was meant to be in the room directly or by proxy, providing feedback and prioritizing features. In some sense, we might see the A/B test as just another form of user input. But it’s also clearly very different – in almost all modern testing, independent random populations are evaluating independent variations on a product. They’re not making informed trade-offs between packages; indeed, each individual user isn’t making a choice at all. The user is simply seeing one thing, and being rolled into aggregated statistics. This is a very different relationship between the software team and the customer than we’ve seen before.

John Doerr and OKRs

Beyond our development methodology, software business management was also further quantified in the 2000s. One factor behind this was likely just scale: consumer software, the Internet, smartphones, globalization – the scale of everything was just bigger in the early 21st century than ever before! Quantified tools make sense. 

When we discussed the ascent of Management by Objectives in episode 3, we saw the entry of MBO into tech through HP and Intel. Those were both hardware companies, first and foremost, with relatively long cycle times, high-sticker-price products, and relatively low volumes.

In that era, management objectives and their key results had a tendency to be qualitative and binary. That is, they were often “build the thing” objectives. Peter Drucker’s example objectives were big and long-lived. They could be quantitative, but they were simple, like: “Become the number 2 player by sales market share by the end of next year.”

Andy Grove added the notion of key results, but the examples from his book were still qualitative. An objective might be “build a new factory,” and the key results had to do with finalizing plans and approvals. They were objective and measurable, but not always very quantitative.

This changed thanks to venture capitalist John Doerr, who learned about management by objectives while working directly under Andy Grove at Intel in the 1970s.

In 1999, Doerr evangelized his version of Management by Objectives – which he gave the modern moniker of “OKRs” – to Google’s founders and early employees. Google was a portfolio company of Kleiner Perkins, the venture capital firm where Doerr was a long-time partner. Google’s founders, neither of whom had had a full-time job before, let alone executive management experience, enthusiastically adopted the practice of OKRs as a way to organize their company and align their workers. Doerr advocated for OKRs at other Kleiner Perkins portfolio companies including Zynga, and as Google’s success led to a diaspora of ex-Googlers becoming leaders across the software industry, OKRs quickly spread.

In his 2018 book, Measure What Matters, Doerr generally characterizes his approach to OKRs as being taken directly from Grove’s work. That said, I think there are some differences between Doerr’s model and Grove’s, at least as Grove presented his thinking in the 1983 book, High Output Management, which we discussed in episode 3.

Wherever the differences came from, the version of OKRs that took off at Google and elsewhere put greater emphasis on quantifying key results than HP or Intel had, based on the scant sources I can find. Now, key results were much more likely to be specific numbers reached on an established KPI.

Doerr also writes that Intel had the common practice of having each OKR-owner assign a score from 0 to 1 for the progress made on each key result at the end of its time window. This practice certainly took off at Google, and has since spread elsewhere.

Of course, there are consequences for how the business prioritizes once you choose to assign a numerical score to results: If you think scoring is valuable, quantitative OKRs become all the more appealing – it’s just easier to map the measured value of some KPI onto a 0 to 1 scale. With increasingly ubiquitous data flowing in from consumer Internet products in the 2000s, the quantification of, well, everything was becoming more possible than ever.
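
As a concrete (and entirely made-up) illustration of why quantified key results score so cleanly, here is the kind of mapping a team might use: linear progress from a baseline toward a target, clamped to the 0-to-1 range.

```python
def score_key_result(baseline, target, actual):
    """Map a measured KPI value onto a 0-to-1 progress score (illustrative only)."""
    if target == baseline:
        raise ValueError("target must differ from baseline")
    progress = (actual - baseline) / (target - baseline)
    return max(0.0, min(1.0, progress))  # clamp so the score stays in [0, 1]

# Hypothetical key result: grow weekly signups from 4,000 to 10,000.
print(score_key_result(baseline=4_000, target=10_000, actual=8_500))  # 0.75
```

A qualitative key result like “finalize the factory plans” has no such natural mapping, which is part of why the quantified flavor spread so easily once the data was flowing.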

The risks of quantification

Thus by the beginning of the 2020s, the software industry had taken a huge leap toward quantification, with effects on how we build, what we build, and how we coordinate and evaluate ourselves. This shift obviously paid dividends, as it became easy for participants to demonstrate and claim the positive impact of their work through A/B testing. Rigorously measured and tested changes, whether they be the “Easy wins” of tweaking button colors, or competing variants of machine-learning models, have become the key currency of any software worker’s self-evaluation. 

But there are risks to all this quantification. Some of these dangers are frequently discussed, like the risk of A/B testing our way to a local maximum and getting stuck there, or making decisions based on short-term effects rather than long-term ones. While these are important, we can largely set them aside as matters to be addressed through best practice. 

More relevant to the historical orientation of this series, though, is understanding the risks that earlier generations of business and software leaders saw in the analytical path we’ve since taken.

One such risk is the McNamara fallacy, named after Robert McNamara, the famously analytical Ford Motor Company president and Vietnam-era Secretary of Defense. Public opinion pollster and social scientist Daniel Yankelovich, who coined the term, defined it thus:

“The fallacy is: If you’re confronted by a complex problem that is full of intangibles, you decide to measure only those aspects of the problem that lend themselves to easy quantification, either because you find the other aspects difficult to measure or because you assume that they can’t be very important or don’t even exist… This is suicide. It is a short, fatal step from the statement, ‘There are many intangibles and imponderables that we can’t put on our computers,’ to the statement, ‘Let’s measure what we can and forget about the intangibles.’”

Yankelovich was writing in the magazine Sales Management in 1971. The context he pointed to was that companies were too focused on monitoring their current sales, current customers, and so on. These were well-understood and had abundant data. But the risks to businesses could come from unexpected places that weren’t being watched and were hard to measure. For example, shifting social attitudes around health-consciousness led to advertising bans against cigarettes. This changing context was a lethal threat to cigarette companies, but it would never show up in their operational measurements until it was too late to reshape the business.

Dr. William Edwards Deming, whose work on quality measurement has featured in multiple episodes of this series, might seem an unlikely critic of quantification. But by the 1980s, he believed that American businesses were substituting what we might now call “data-driven decision-making” for proper long-term thinking and investment.

Indeed, Deming’s formative influence on post-war Japanese businesses, and eventually on iterative software development, came through his focus on improving process quality: finding and addressing root causes rather than fixating on output metrics.

In this light, it is perhaps unsurprising to see that Deming’s list of the deadly diseases plaguing American corporations includes,

“Management by use only of visible figures, with little or no consideration of figures that are unknown or unknowable.”

That’s from Deming’s 1982 book, Out of the Crisis, the whole of which is a broadside fired at most of the conventions of American management, across industries and in the government sector as well. I’ll keep the excerpts to a minimum here, because they’re a bit off of our current topic, but Deming brings the fire and it makes for a great read, in a management consultant business book kind of way.

Deming makes an explicit attack on Management by Objectives. Among his 14 Points for Management, he states directly,

“Eliminate management by objective. Eliminate management by numbers, numerical goals. Substitute leadership.”

By leadership, Deming means active involvement from managers in the success of their subordinates. Deming sees the function of the manager as understanding the capabilities of each worker, and aligning the capabilities of the worker with the work of the company. Because Deming believes firms should have stable long-term strategies and that workers should be retained for many years, this higher investment in coordinating the efforts of each worker can work. The workers don’t need to be steered by objectives – they need to be given work that fits their strengths and then encouraged to do it.

As a slight aside, Deming is serious about retaining workers for the long-haul. Everything significant is on a very long time cycle for Deming. Along with OKRs, he rejects the idea of annual performance evaluations, because he thinks the fear, conflict, and frustration created by those reviews is far too high for the benefit, and the random noise in evaluations between workers will be too large for an annual frequency to be revealing. In a video interview that I’ve linked in the transcript, he notes that only after 10 years might it be possible to rigorously evaluate a worker’s overall performance.

One more quotation, this time from Deming’s 1993 book The New Economics for Industry, Government, Education, gives more of Deming’s concrete rationale for rejecting Management by Objectives. He writes,

“In M.B.O., as practiced, the company’s objective is parceled out to the various components or divisions. The usual assumption in practice is that if every component or division accomplishes its share, the whole company will accomplish the objective. This assumption is not in general valid: the components are almost always interdependent. Unfortunately, efforts of the various components do not add up. There is interdependence. Thus, the purchasing people may accomplish a saving of 10 percent over last year, and in doing so raise the costs of manufacture and impair quality. They may take advantage of high-volume discount and thus build up inventory, which will hamper flexibility and responsiveness to meet unforeseen changes in the business.”

As we’ve discussed previously, Deming was deeply influential in the post-war Japanese economic recovery, which in turn informed the Agile movement generally and Lean software development in particular.

Thus, it is perhaps unsurprising that the relationship between most Agile methodologies and the OKR system has been arms-length. None of the Agile texts that we’ve discussed have advocated for Management by Objectives or OKRs, despite the practice’s deep roots from well before the 2001 Agile Manifesto. Not even 2011’s The Lean Startup mentions OKRs.

Yet they’re everywhere at purportedly Agile companies. What gives?

There are a few factors to remember. One is timing: while “management by objectives” has been in the air since the 1950s, and Andy Grove popularized his approach with a 1983 book, MBO had actually been in decline in popular literature from the 1980s until it exploded back into the public consciousness in the late 2010s. During the window when the early Agile methods were getting formulated, MBO wasn’t a super salient idea.

But there are also substantive tensions between the recent strand of rigorous quantitative OKRs and most Agile methodologies. As a perspective on business strategy, the Agile Manifesto was calling for small teams partnering closely with customers to discover the customer’s real needs and meet them. The customer’s needs are only loosely known, and the time and means needed to satisfy them aren’t known either, so wrapping the work in OKRs is pretty silly.

And as a moment of cultural expression, the Agile Manifesto was telling the suits to shove off, that work should be enjoyable and overtime forbidden, and that empowered teams should be given resources and trusted to do the right thing. A management system where requirements are inherited from above rather than the customer, where every outcome is scored, and where goals should always be impossible to deliver in the time allowed, simply was never going to fit the cultural ethos of early Agile.

Wrapping up

But the meteoric ascent of OKRs, despite their tension with the purportedly dominant software methodology, was a sign of the changing times in the mid- to late 2010s. Software organizations were growing far larger than the small departments and independent teams that the early Agile advocates had written about. Standardization and quantification had great appeal as tools to manage the complexity of software orgs that could now span many thousands of individuals.

The new scale of modern software teams would create strong incentives to iterate on the way we build software. And the results have been – well, we’ll see. Be sure to join us next episode as we take on Agile at Scale.

That’s all for now.

As always, your comments and feedback on this episode are very welcome. You can find a transcript, links to sources, and ways to reach me on the show website at prodfund.com. 

And if you like this series, and you want to hear more, do me a favor and share it with someone you think would enjoy it too.

Thank you very much for listening.