<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Rumman Chowdhury]]></title><description><![CDATA[Thoughts on human cognition and artificial intelligence. Inspired by my podcast, Thinking about Thinking, and the space where I work through my ideas for my upcoming book. CEO of Humane Intelligence PBC. More about me on www.rummanchowdhury.com]]></description><link>https://rummanchowdhury1.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!7quN!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb1f0d1-1e4e-4a35-94f5-2423f379477c_2048x2048.png</url><title>Rumman Chowdhury</title><link>https://rummanchowdhury1.substack.com</link></image><generator>Substack</generator><lastBuildDate>Mon, 11 May 2026 14:24:13 GMT</lastBuildDate><atom:link href="https://rummanchowdhury1.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Thinking about Thinking]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[rummanchowdhury1@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[rummanchowdhury1@substack.com]]></itunes:email><itunes:name><![CDATA[Rumman Chowdhury]]></itunes:name></itunes:owner><itunes:author><![CDATA[Rumman Chowdhury]]></itunes:author><googleplay:owner><![CDATA[rummanchowdhury1@substack.com]]></googleplay:owner><googleplay:email><![CDATA[rummanchowdhury1@substack.com]]></googleplay:email><googleplay:author><![CDATA[Rumman Chowdhury]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[We've Been Asking the Wrong Questions About Artificial Intelligence]]></title><description><![CDATA[On benchmarks, Bach, and the impending jobs apocalypse]]></description><link>https://rummanchowdhury1.substack.com/p/weve-been-asking-the-wrong-questions</link><guid isPermaLink="false">https://rummanchowdhury1.substack.com/p/weve-been-asking-the-wrong-questions</guid><dc:creator><![CDATA[Rumman Chowdhury]]></dc:creator><pubDate>Mon, 04 May 2026 12:34:20 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/195930193/675776cceba289a290d7ca584443b2dc.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>There is a phrase that came up in the first episode of <em>Thinking about Thinking</em> that I have not been able to stop thinking about since. It arrived almost offhandedly, mid-conversation, as the best ideas often do: <em>benchmarks are obsolete by design.</em></p><p>The implication is sharper than it first sounds. We have spent years arguing about whether AI benchmarks accurately measure intelligence: whether MMLU captures reasoning, whether passing the bar exam means a system can &#8220;do law,&#8221; whether benchmark saturation means models have genuinely plateaued or merely gamed the test. These are real debates, and they feed the current mania about the &#8216;jobs apocalypse&#8217;. But they rest on the premise that benchmarks were ever meant to be permanent measures in the first place.</p><p>A benchmark, in AI, is scaffolding. It is a temporary structure erected to improve the building, not to stand beside it forever as its mirror. 
The point of a benchmark is to induce better performance in the system being tested. Once the system has absorbed and exceeded the benchmark, the benchmark is discarded and a harder one takes its place. This is not a bug in the methodology; it is the methodology. The benchmark that isn&#8217;t eventually surpassed is a benchmark that wasn&#8217;t set ambitiously enough.</p><p>This reframes the entire AI evaluation discourse. The conversation around benchmark saturation &#8212; the observable phenomenon by which frontier models cluster near the ceiling of any given evaluation dataset within months of its release &#8212; has largely been framed as a crisis of measurement: our tests are too easy, too leakable, too narrow. But if we take seriously the idea that obsolescence is the intended endpoint, then benchmark saturation is not evidence of a broken system. It is evidence of a system working exactly as designed. The benchmark served its purpose. We should be making a new one, not lamenting the death of the old one.</p><h2>The Bach Problem</h2><p>In cognitive science and philosophy of mind, the debate over whether generative AI systems genuinely &#8220;understand&#8221; or merely &#8220;pattern match&#8221; has become something of a cottage industry, producing landmark arguments from John Searle&#8217;s Chinese Room (first articulated in &#8220;Minds, Brains and Programs&#8221; in 1980) to, more recently, the debates surrounding emergent capabilities in GPT-4 and its successors. What these debates share is a tendency to adjudicate on behavioral grounds: can the system produce outputs indistinguishable from human outputs?</p><p>The Turing test, despite decades of philosophical critique, remains an implicit benchmark. In the course of the seminar episode, I reflected on my time as a data science teacher. Each year I&#8217;d have a student who wanted to create an &#8220;AI music generator.&#8221; Consider what happens when such a student builds one trained on Bach. The system produces music, and that music sounds like Bach&#8217;s. It is, by any Turing-adjacent standard, a success. And yet something is obviously missing. If Johann Sebastian Bach were alive today, he would not simply produce more Bach-sounding Bach. The whole reason we still perform the Goldberg Variations and the Well-Tempered Clavier three centuries after his death is that he did something completely novel. In AI-speak, he didn&#8217;t optimize within a prior distribution; he changed the distribution.</p><p>This &#8220;Bach Problem&#8221; is a formulation of something that existing intelligence theory gestures at but rarely pins down. The challenge is not whether a system can produce outputs that are statistically plausible continuations of a training corpus. Nearly any sufficiently large language model can do this with impressive fluency. The challenge is whether a system can produce outputs that have no adequate prior in the training data &#8212; outputs that aren&#8217;t interpolations or extrapolations but genuine departures.</p><p>The distinction matters because it maps onto something deeper than style imitation. Howard Gardner&#8217;s theory of multiple intelligences, first articulated in <em>Frames of Mind: The Theory of Multiple Intelligences</em>, proposed that what we call &#8220;intelligence&#8221; is not a unitary faculty but a cluster of relatively independent competencies &#8212; linguistic, musical, logical-mathematical, spatial, and so forth. 
What Gardner did not dwell on, but what the Bach Problem brings into focus, is that each of these intelligences, in its highest expression, is characterized not by competent performance within a domain but by genuine contribution to it. Einstein did not solve physics problems faster than other physicists. He reconceived what a physics problem was. Coltrane did not improvise jazz better than his contemporaries. He changed what jazz permitted itself to do.</p><p>By this standard, current AI systems, despite their extraordinary capabilities across text, code, mathematics, and image generation, are still, fundamentally, making more Bach-sounding Bach. This is not a dismissal of their power, but a more accurate description of their current nature and an indicator of what the next threshold would have to look like.</p><h2>Atomization and the Ghost in the System</h2><p>The seminar closed with an argument from systems theory that deserves more attention than it typically receives in AI discourse. The claim is this: AI development as currently conceived attempts to decompose human capability into atomized individual tasks &#8212; writing, reasoning, coding, image recognition, planning &#8212; to achieve superhuman performance on each task individually, and then to sum those performances as a proxy for human-level general intelligence. This is, structurally, a category error.</p><p>Systems theory, from Norbert Wiener&#8217;s foundational work in <em>Cybernetics, or Control and Communication in the Animal and the Machine</em> through the later contributions of Donella Meadows in <em>Thinking in Systems: A Primer</em> in 2008, establishes the principle that the output of a system is not reducible to the sum of its components. Emergent properties &#8212; the behaviors and capacities that arise from the interaction of parts within a system &#8212; cannot be reconstructed by analyzing parts in isolation. This is as true of human cognition as it is of ecosystems, economies, or ant colonies.</p><p>What makes a skilled lawyer valuable is not the sum of their performance on discrete subtasks but the way those capacities interact under conditions of uncertainty, stakes, and client relationship. The same is true of what makes a great teacher, a creative scientist, or a gifted manager. These are emergent properties of systems, not additive properties of components.</p><p>A society that reduces work to a checklist of discrete, individually measurable tasks is the society that AI evaluation systems are being built to replicate, not the society we might want to build. We are designing intelligence benchmarks in the image of performance improvement plans.</p><h2>The Goalpost Principle</h2><p>There is a pattern in the history of AI milestones that the seminar identified with unusual clarity. In 1975, the economist Charles Goodhart observed that &#8220;any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes&#8221; &#8212; now widely known as Goodhart&#8217;s Law, and later crystallized by the anthropologist Marilyn Strathern as &#8220;When a measure becomes a target, it ceases to be a good measure.&#8221;</p><p>When IBM&#8217;s Deep Blue beat Garry Kasparov in 1997, the response was not &#8220;AI has achieved general intelligence&#8221; but &#8220;chess is not a real test of intelligence.&#8221; When AlphaGo defeated Lee Sedol in 2016, the response was the same. GPT-4 passed the bar exam, the USMLE, and numerous other professional licensing assessments. 
The response should not be flailing about a jobs apocalypse, but a deeper understanding of what human intelligence is once aspects of a profession are automated.</p><p>However, the pattern of perpetual goalpost movement cannot be indefinitely sustained without acknowledging what it implies: that we do not have a theory of intelligence robust enough to survive contact with a system that meets our stated criteria.</p><h2>What a Real Benchmark Would Have to Measure</h2><p>If we take the Bach Problem seriously, a genuine benchmark for AI intelligence would need to assess not fluency within a distribution but contribution beyond it. If we take the systems theory argument seriously, it would need to assess not component performance in isolation but emergent behavior under conditions of integration, uncertainty, and stakes. If we take Goodhart&#8217;s Law seriously, it would need to be defined before the system is tested, with criteria that do not shift once the system approaches them. These are not easy problems to solve, but articulating them reframes how we consider, build, and deploy our benchmark scaffolding.</p><p><em>Thinking about Thinking is a six-episode podcast series recorded live at Harvard&#8217;s Berkman Klein Center for Internet and Society. Learn more on our website <a href="https://www.rummanchowdhury.com/podcast">here</a>, and subscribe to our podcast on <a href="https://podcasts.apple.com/us/podcast/thinking-about-thinking/id1895874539">Apple Podcasts</a> and <a href="https://open.spotify.com/show/3OCsZAOy6RLzzf2azUiYGg?si=86c5abd1f35f4ddc">Spotify</a>. Episode 2 launches May 18th.<br></em></p><p><em>Sources:</em></p><p><em>Searle, J. (1980). &#8220;Minds, Brains and Programs.&#8221; Behavioral and Brain Sciences, 3, 417&#8211;457.<br>Gardner, H. (1983). Frames of Mind: The Theory of Multiple Intelligences. Basic Books.<br>Wiener, N. (1948). Cybernetics, or Control and Communication in the Animal and the Machine. MIT Press/Wiley.<br>Meadows, D. (2008). Thinking in Systems: A Primer. Chelsea Green Publishing.<br>Goodhart, C. (1975). Problems of Monetary Management: The U.K. Experience.<br>Flynn, J. (1984). &#8220;The Mean IQ of Americans: Massive Gains 1932 to 1978.&#8221; Psychological Bulletin, 95(1), 29&#8211;51.<br></em></p>]]></content:encoded></item><item><title><![CDATA[Thinking about Thinking]]></title><description><![CDATA[I'm launching a podcast and writing a book on AI and Human Cognition]]></description><link>https://rummanchowdhury1.substack.com/p/thinking-about-thinking</link><guid isPermaLink="false">https://rummanchowdhury1.substack.com/p/thinking-about-thinking</guid><dc:creator><![CDATA[Rumman Chowdhury]]></dc:creator><pubDate>Sun, 03 May 2026 11:49:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7quN!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb1f0d1-1e4e-4a35-94f5-2423f379477c_2048x2048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few years ago, I read an astounding fact on a subreddit. It wasn&#8217;t until 1987 that it became standard practice to anesthetize babies prior to surgery. Yes, that&#8217;s right. If you or I &#8212; or most people who subscribe to this newsletter &#8212; required surgery soon after birth, we were left screaming on a table while a doctor cut us open.</p><p>I had many questions. 
But mainly: why?</p><p>It was a widely held belief in the medical profession that newborns did not feel pain because their nervous systems were immature, and that any expression of distress was simply reflexive. It was also believed that they retained no memories, and therefore there were no meaningful ethical considerations &#8212; pain without recall was pain without consequence.</p><p>What changed this? In 1987, Anand and Hickey published a pivotal paper in The Lancet documenting the biological responses of infants to surgery: dramatic increases in adrenaline and cortisol, shifts in glucose and fat metabolism, unstable cardiovascular patterns. The science made the suffering credible. That same year, it was declared unethical to operate on babies without anesthesia unless medically impossible.</p><p>Here is what compels me about this story. The expression of pain was insufficient to justify relief. The pain had to be scientifically proven &#8220;real&#8221; before it mattered.</p><div class="callout-block" data-callout="true"><p>In the society we have built, the rights we expect as living creatures are a function of whether our intelligence and capabilities can be justified.</p></div><p>Babies were deemed incapable of recall, so we withheld pain relief. Persons with intellectual disabilities were denied basic rights on the basis of cognitive assessments. &#8220;Scientific&#8221; measurements of intelligence were used to exclude Black Americans from citizenship and women from professions. Rights have always been downstream of whoever controls the measurement.</p><p>Human cognition is the ultimate battlefield on which the integration of artificial intelligence into our society will be contested. This is not a philosophical discussion &#8212; it is one rooted deeply in sociotechnical reality. Any conversation about the future of work or the future of education is actually a conversation about the measurement constructs we use to align human intelligence to economically valuable labor: SAT tests or bar exams, for example. Any conversation about intelligence is a conversation about human worth, and any conversation about human worth is political.</p><p>Our current fascination with measuring AI capabilities &#8212; and the growing existential dread around an AI apocalypse, whether economic, societal, or otherwise &#8212; is rooted in that same construct. We are not debating the technical properties of a language model. We are debating how we measure human worth, and what we will do if technological systems surpass the highest capacity of that measurement.</p><p>A deeper understanding of artificial intelligence requires a deeper understanding of what intelligence means &#8212; as a scientific, social, political, and economic construct. 
That is what the podcast <a href="https://www.rummanchowdhury.com/podcast">Thinking about Thinking</a> explores: six episodes, recorded as a live Harvard seminar, unscripted, with a room full of people who disagree with each other in productive ways. The podcast launches tomorrow, and a new episode will post every two weeks.</p><p>This newsletter launches alongside it. Each issue will go deeper than the episode &#8212; into the academic literature, the historical antecedents, and the things I think about as I listen back. A book is in progress at Bloomsbury Press, and this is where I work out the thinking behind it in public. If you&#8217;re new here: subscribe, and you&#8217;ll get each essay the day an episode drops. The podcast is on <a href="https://podcasts.apple.com/us/podcast/thinking-about-thinking/id1895874539">Apple Podcasts</a> and <a href="https://open.spotify.com/show/3OCsZAOy6RLzzf2azUiYGg?si=23610f9da316446c">Spotify</a>.</p><p>We are not afraid of AI because it might surpass us. We are afraid because we have always tied capability to worth &#8212; and we know, better than we admit, what we have done to the things we decided didn&#8217;t measure up.</p>]]></content:encoded></item></channel></rss>