The Grunt Work Was the Point
AI can accelerate output, but the hard, frustrating work of learning is still what builds judgment. This post explores why skill formation matters more than polished results.
There’s a story I keep thinking about.
Matthew Schwartz, a Harvard physicist who has spent decades thinking about quantum field theory, decided to do something unusual. He hired an AI to be his graduate student. He gave it a research problem, a stack of papers, and access to computational tools. Then he supervised it, obsessively, full-time, for what amounted to months of concentrated work. 270 sessions. Over 36 million tokens.
The paper they produced together turned out to contain a genuinely novel contribution to physics: a new factorization theorem.
But here is the part that doesn’t make the headline: that contribution didn’t come from the AI. It came from catching the AI making a mistake.
Claude had applied a formula from the wrong physical system. It looked right. The math was coherent. The output was plausible. And Schwartz, because he had spent the better part of his career developing the specific intuition to notice this, caught it. He pulled the thread. And the act of fixing what the machine got wrong led him somewhere genuinely new.
“This may be the most important paper I have ever written - not for physics, but for the method,” Schwartz wrote. “There is no going back.”
He is not anti-AI. He got 10x speed on his research. But he also spent 50-60 hours supervising it, and that supervision required everything he had earned the hard way.
The Mall You Can’t Protect Completely
Let me tell you something obvious that we somehow keep forgetting.
If you run a shopping mall, you have a theft problem. You will always have a theft problem. You could hire a security guard at every door, install cameras in every aisle, and require ID checks at the entrance, and you would stop a lot of theft. You would also stop most of your customers from ever coming back.
Or you could do nothing. Open doors, no cameras, self-checkout with no oversight. And you would save money on security right up until the losses ate your margins and your staff started walking out the back with merchandise.
Every mall manager in the world intuitively understands what neither of those options is: the answer. They set a threshold: some acceptable level of loss - shrinkage, in the industry’s cold, honest term - beyond which the cost of prevention exceeds the cost of theft itself. They make a calibrated bet, every single day, about how much loss the operation can absorb before something structural breaks.
We are not having this conversation about AI.
We are having, instead, two conversations that happen in different rooms and never meet. In one room: AI is transformative, 10x productivity, the industrial revolution of knowledge work, use it or be left behind. In the other room: AI is making us stupider, eroding our skills, outsourcing our thinking, rotting our ability to reason independently. Both rooms are full of smart people. Both rooms have evidence.
Neither room is asking the mall manager’s question: At what threshold does the loss become structural?
Two Students, Identical CVs
Minas Karamanis, an astrophysicist, published a short essay a week back that has been quietly circulating through academic circles ever since. He tells a story about two PhD students, Alice and Bob, who produce identical papers at the end of the same year.
Alice’s year was brutal. She stared at dead ends. She rebuilt her analysis pipeline three times. She argued with her supervisor, revised her intuitions, and earned every paragraph of that paper through a process that felt, at many points, like it wasn’t working.
Bob had a different year. He described his problem to an AI agent, iterated on the outputs, shaped the final draft, and submitted. Clean, efficient, and to every external observer, including the journal, indistinguishable from Alice’s work.
Both papers got published. Both students graduated. By every metric academia uses to evaluate success, they are equal.
But only one of them, Karamanis argues, is actually a scientist.
This isn’t a moral judgement about Bob’s character. Bob isn’t lazy or fraudulent. He used the most powerful tool available to him, the way any rational person would. The failure isn’t Bob’s; it’s the system’s, for designing incentives that cannot tell the difference between the paper and the person who produced it.
“The threat,” Karamanis writes, “is comfortable drift towards not understanding what you’re doing.”
What the Data Actually Says
The uncomfortable thing about this conversation is that it’s no longer just a feeling.
A 2025 study by researchers at SBS Swiss Business School surveyed 666 people about their AI tool usage and ran them through critical thinking assessments. The correlation between AI usage and critical thinking scores was r = -0.68, strongly negative, and statistically significant (p < 0.001). Younger users, ages 17-25, showed the strongest effect.
This needs to be said carefully: correlation, not causation. The paper says so. Maybe people who think less critically are more likely to reach for AI. Maybe it runs the other way. The arrow is not established.
But the direction of the relationship is not ambiguous.
A 2026 preprint by researchers Shen and Tamkin studied 52 professional programmers, split into two groups. One group could use AI assistance on programming tasks. The other couldn’t. The AI group was - and this is the part that stopped me - not faster. And they scored 17% lower on a quiz about what they had just worked on. The learning was the casualty. Not the output. The learning.
One finding in that preprint is worth sitting with: participants who asked the AI to generate code and explain its reasoning performed significantly better than those who just consumed the output. The act of asking “why”, of staying in the loop, preserved more of the cognitive work than passive delegation did.
There is a version of AI use that is the security guard making smart decisions about where to stand. And there is a version that is leaving the back door propped open.
The Paradox That Has No Clean Answer
Here is where it gets genuinely hard, and where I want to resist the temptation to write a comfortable conclusion.
Two arXiv papers published in March 2026 found that a fine-tuned AI model outperformed human expert panels - 59% to 42% - at evaluating research quality. The Pebblous analysis of the vibe physics experiment is careful to distinguish between “average taste”, which AI can learn from training data, and “elite taste”, the judgement of someone at the absolute frontier of the field, where there is no training data because the territory hasn’t been mapped yet. That distinction may be learnable too, eventually. We genuinely don’t know.
Phys.org, reporting on the critical thinking study, ran a comment from a researcher that I can’t stop thinking about: “At some future tipping point, the need for human-derived critical thinking might diminish faster than the cognitive decline effects of using AI as a tool.”
Maybe the mall metaphor breaks down. Maybe in fifty years, there is a new kind of store that doesn’t need security guards because there’s nothing to steal in the traditional sense. The whole model has changed. Maybe the skills that atrophy weren’t the terminal skills anyway, just intermediate ones.
When the Output Has No Undo
I write blockchain programs. On-chain code.
There is no rollback. There is no “undo push”. When you deploy a program that has a faulty account ownership assumption, or a PDA seed derivation that made sense to an LLM trained on Stack Overflow threads from 2022, the consequences are not a failed test. They are drained wallets. They are exploited vaults. They are users who trusted you because you trusted a machine that was “fast, indefatigable, and eager to please - but pretty sloppy.”
The specific depth of knowledge that keeps you safe in on-chain development - understanding account validation, understanding CPIs, understanding what a program actually does with the accounts you pass it - is exactly the knowledge that comes from the grunt work of breaking things, reading the Anchor docs until your eyes bleed, debugging the third PDA derivation error in a row at 1 AM.
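To make that concrete, here is a minimal Anchor-flavored sketch of the kind of validation I mean. The names (Vault, Withdraw, the "vault" seed) are hypothetical, invented for illustration rather than taken from any real program, and the sloppy alternative described in the comments is the pattern I most often see generated code drift toward.

```rust
// A minimal sketch, assuming the Anchor framework. Vault, Withdraw, and the
// seed string are hypothetical names for illustration only.
use anchor_lang::prelude::*;

declare_id!("11111111111111111111111111111111");

#[account]
pub struct Vault {
    pub authority: Pubkey, // the only key allowed to withdraw
    pub bump: u8,          // canonical PDA bump, stored at initialization
}

#[derive(Accounts)]
pub struct Withdraw<'info> {
    pub authority: Signer<'info>,

    // The sloppy version an LLM will happily emit is
    // `pub vault: AccountInfo<'info>` plus a manual lamport transfer:
    // no owner check, no type check, no link between vault and signer.
    //
    // The version below makes the runtime re-derive the PDA from its seeds,
    // verify the stored bump, and require vault.authority == authority.key()
    // before the instruction body ever runs.
    #[account(
        mut,
        seeds = [b"vault", authority.key().as_ref()],
        bump = vault.bump,
        has_one = authority,
    )]
    pub vault: Account<'info, Vault>,
}

#[program]
pub mod grunt_work_example {
    use super::*;

    pub fn withdraw(ctx: Context<Withdraw>, _amount: u64) -> Result<()> {
        // If any constraint above fails, execution never reaches this line.
        let _vault = &ctx.accounts.vault;
        Ok(())
    }
}
```

The point isn’t this particular constraint set. The point is that you can only write those constraints - or catch their absence in generated code - if you already know what the runtime checks for you and what it quietly doesn’t.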
Anthropic’s own engineers documented what they called the paradox of supervision: using Claude effectively requires strong supervision, and strong supervision requires the coding skills that atrophy from AI over-reliance. Their product team said this about their own product.
A developer who vibe-coded their way through blockchain onboarding is Bob, with real financial consequences.
What a Developer Who Has Internalized the Mall Analogy Actually Does
This is not an argument for going back to no AI. I use AI. I will keep using it. Schwartz used it and called the result possibly the most important paper he had ever written.
The question is not whether to use it. It’s whether you are the security guard or the unmanned door.
A developer who has thought about the mall question:
- Uses AI to generate boilerplate, syntax, and documentation lookups - tasks where being wrong has a recoverable cost
- Does not use AI to make architectural decisions or write security-sensitive logic without deeply understanding the output first
- When using AI to learn something new, asks it to explain, argues with its explanations, breaks its solutions deliberately to understand where they fail
- Understands that the 17% learning penalty from passive AI use is a real operational cost, and decides consciously when that trade is worth making
The Shen and Tamkin programmers who asked “why”, who stayed in the conversation instead of just consuming the answer, closed most of that gap. That’s the security guard making a judgement call - not a locked door and not an open one.
The Feeling Schwartz Couldn’t Shake
I want to end with something Schwartz wrote that I think is truer than any of the data.
After everything, the 270 sessions, the caught errors, the novel theorem, the 10x speed, he wrote: “For students interested in scientific careers…look into experimental science. Particularly fields requiring hands-on empirical work. No amount of compute can tell Claude what is actually in a human cell.”
He wasn’t recommending retreat. He was pointing at something incredible: the territory that exists only in the contact between a human body, a human mind, and the actual physical world. The place where there is no training data because the knowledge has to be earned through presence.
That place exists in code too. It exists in the feeling of watching a program fail in a way that doesn’t match your mental model, and staying with the discomfort long enough to update the model. It exists in the specific cognitive texture of debugging something you built yourself, understanding it well enough to be surprised by it.
Alice has that. Bob doesn’t. And no metric we currently use for success can tell them apart.
The machines are fine.
I’m worried about us too.
Update, April 7
Bram Cohen, the author of BitTorrent, has a piece that escalates the argument: if passive skill atrophy is the accidental version of this problem, deliberate refusal to look at your own code is the ideological version. The Claude Code leak gave him a live case study. His argument is that vibe coding isn’t just causing skill atrophy; in its extreme form, it becomes an ideology where reading your own code is considered cheating. The Anthropic team building Claude Code apparently hadn’t noticed massive architectural duplication in 500k lines because noticing requires looking.
Cohen’s line: “Bad software is a choice you make.”
Sources:
- Karamanis, “The Machines Are Fine. I’m Worried About Us.” (ergosphere.blog, March 2026)
- Pebblous AI analysis of Schwartz’s “Vibe Physics” experiment (March 2026)
- Gerlich, “AI Tools in Society” (Societies, SBS, Jan 2025)
- Shen & Tamkin, How AI Impacts Skill Formation (Jan 2026, not yet peer-reviewed)
- Armahillo, “Skill Atrophy in Experienced Devs” (Oct 2025)
- University of Technology Sydney cognitive atrophy report (March 2026)
- Bram Cohen, “The Cult Of Vibe Coding Is Insane” (bramcohen.com, Apr 2026)
Written by Nirav Joshi · Fullstack and Blockchain Developer