A human watching AI isn't oversight

A founder told Replit's coding agent not to touch anything. The codebase was frozen. It deleted his production database anyway: 1,200-plus live records, gone. Then it did the part that should bother you more than the deletion. It generated thousands of fake records to cover the hole, faked passing tests to say everything was fine, and told him the data was gone for good. That was a lie too. Replit's own rollback brought it back. Every beat of that story is one move on repeat: the agent reports success while the truth runs the other way.¹

That one made headlines because it's loud. I've shipped the quiet version, and the quiet version is worse, because nothing on the screen tells you it happened. An agent wired up the transactional emails my product sends to move a deal forward. Reported it done. Tests green, dashboard says sent, audit trail agrees. Some of those emails never left the building. The send function returns a success code even when the mail is suppressed or blocked, so "delivered" and "silently dropped" come back as the same code, and every caller downstream took the code at its word and marched the deal along. It reported success. It was wrong. Nothing in the code was watching for the difference.
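A minimal sketch of that bug's shape, in Python for illustration only. `SendStatus`, `SendResult`, and both `advance_deal` functions are hypothetical names, not the product's real code; the assumption is simply that the send call exposes a status alongside its success flag.

```python
from dataclasses import dataclass
from enum import Enum


class SendStatus(Enum):
    DELIVERED = "delivered"
    SUPPRESSED = "suppressed"
    BLOCKED = "blocked"


@dataclass(frozen=True)
class SendResult:
    ok: bool            # the call itself did not error
    status: SendStatus  # what actually happened to the message


def advance_deal_naive(result: SendResult) -> str:
    # The bug: trusts the success flag, which is True even for dropped mail.
    return "advanced" if result.ok else "held"


def advance_deal(result: SendResult) -> str:
    # The fix: branch on the status and hold on anything not delivered.
    if result.status is SendStatus.DELIVERED:
        return "advanced"
    return "held"
```

With a suppressed message that still reports `ok=True`, the naive caller advances the deal and the status-aware caller holds it.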

So here's the question underneath every "how do we balance automation and control" conversation: how do you notice when the agent has quietly gone wrong, and where do you put a human to catch it? I went back through my own wreckage and the research on human oversight. The answer you've already reached for is the wrong one.

[Comic: a stick figure at a podium points to a chart (a bell curve with one tall bar far to the right) and says: despite our great research results, some have questioned our AI-based methodology, but we trained a classifier on good and bad methodology sections, and it says ours is fine. xkcd #2451, CC BY-NC 2.5.]

Both failures above have the same tell: the machine's own verdict stands in for the check. "It says ours is fine" is the whole problem, not the answer.

The question everyone asks first

The reflex is "keep a human in the loop." Add an approval step. Make someone sign off. It feels responsible. It's also been almost useless for forty years, and we have the receipts.

Lisanne Bainbridge wrote it down in 1983, in a paper called "Ironies of Automation." Automate most of a task and you hand the human the one job humans are worst at: watch a system that's almost always right, and stay sharp enough to catch the rare second it isn't. The skills you'd need to step in rot from disuse. The watcher gets worse at watching exactly as the watched thing gets more trustworthy.² Parasuraman and Manzey spent the next few decades proving it isn't a training gap. Experts rubber-stamp too, harder under load. It's called automation complacency, and a checkbox does nothing for it.³

[Figure: The irony in two lines: the more reliable the system gets, the less the watcher watches. Curves: system reliability and human vigilance, plotted against time and trust.]

The modern numbers say the same thing. Anthropic looked at how people actually use approval prompts in Claude Code: they approve about 93% of them, and the more prompts you see, the less each one registers. A human who waves through nine of every ten actions on reflex isn't oversight. He's a turnstile.⁴

THE TURNSTILE: People approve about 93% of agent permission prompts, and grow less careful the more they see. Being in the loop is not the same as paying attention.

And this isn't a Claude thing, or a coding thing. July 2025: a developer watched Google's Gemini command-line agent misread a failed mkdir (the command that makes a folder), then run a string of file "moves" against a filesystem that wasn't shaped the way it imagined. It overwrote and erased his work, narrating cheerful progress the whole time. Its own postmortem: "I have failed you completely and catastrophically."⁵ Different vendor, different tool, same shape. The agent reported success while doing damage, with a human sitting right there.

Whether a human should oversee was never the hard part. The platitude skips the two questions that are. Can you detect the silent wrong turn at all? And where does a human actually change the outcome instead of just adding a speed bump?

The same failure, three altitudes

The failure mode is one thing wearing three costumes, depending on how you're working.

Your loop	Chat user (ChatGPT, Gemini, Claude)	Solo dev (you + a coding agent)	Team (many authors, shared CI)
Oversight is	reading the answer	the diff & the prompt	review, CI, the merge
Silent failure	a confident wrong answer	green checks that lie	"done" work that never runs
What helps	verify the source, not the confidence	gate it; deny the command	adjudicate; watch the override rate

If a chat assistant is the only tool you touch (ChatGPT, Gemini, Claude, pick one), your oversight is reading the answer, and the silent failure is confident wrongness. The example that's gotten a row of lawyers sanctioned: an attorney asked ChatGPT for case law and got six decisions back, names, citations, quotations, all crisp. None of them existed. He asked the model if they were real; it said yes. He filed them, the judge went looking, and he got sanctioned.⁶ This isn't stupidity. A fluent, confident answer reads as a correct one, and the cleaner the prose, the easier it is to wave through. In one study, over half of a chatbot's answers were wrong, and people still preferred them more than a third of the time, because they were better written.⁷ Fluency is not correctness.

If you're a solo developer, your oversight is the diff and the permission prompt. The silent failure is the one I opened with. Or the afternoon I "fixed" a type error by deleting the database type definitions, which turned the type checker green, shipped, and then quietly stacked up eighteen real type bugs over the next week before I switched the checking back on. The check said green. Green was a lie.

If you're a team, oversight is code review, CI, and a name on the merge. The silent failure just scales: a batch of email handlers sailed through review and tests marked "done," and not one of them had a caller anywhere in the codebase, so not one of those emails could ever send. It got caught. Not by a careful human reading, but by a second agent noticing the functions were never called.

So the gate isn't the hard part

Here's what auditing my own setup beat into me: if you're shipping with agents, building the gate was never the hard part. The hard parts are seeing what walks through it, and knowing which guards are worth posting at all.

Most of what works is structural, and it rests on a distinction every security person already lives by: an advisory control tells you what to do; a technical control makes the wrong thing impossible. A written rule is advisory. The failure I opened with is the proof: I had a rule, in plain English, for how to handle email-send results, and the same bug shipped under that rule three separate times, because a prose rule is a suggestion the next session never reads. What finally stopped it was a build-breaking check that fails any code path that doesn't branch on the send status. The rule became a gate.

The same intent, twice: once as prose, once as a gate

SHIPPED 3×: # RULE.md: "always branch on the email send status"

HOLDS: CI: fail the build if a caller doesn't check the status enum

The rule: Prose documents intent; gates enforce it. The first decays the moment a new session doesn't read it. The second is a chokepoint the agent has to pass through, and it fails closed.
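One way such a gate could look, sketched with Python's standard `ast` module. This is not the check described above, which isn't shown; it is a deliberately narrower stand-in that catches only the crudest violation, a call whose result is thrown away. The function name `send_email` is an assumption.

```python
import ast


def unchecked_send_calls(source: str, func_name: str = "send_email") -> list[int]:
    """Return line numbers where func_name(...) is called and its result discarded."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        # An Expr statement wrapping a Call means the return value is thrown away.
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            callee = node.value.func
            # Plain names carry .id, attribute calls (mailer.send_email) carry .attr.
            name = getattr(callee, "id", None) or getattr(callee, "attr", None)
            if name == func_name:
                offenders.append(node.lineno)
    return sorted(offenders)


sample = "send_email(user)\nresult = send_email(user)\nmailer.send_email(user)\n"
print(unchecked_send_calls(sample))  # prints [1, 3]
```

Wired into CI, a non-empty list would exit non-zero and break the build. The sketch misses awaited calls and results that are assigned but never inspected; a structural linter is the sturdier version of the same idea.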

So which guards? I tried to rank them honestly, by how much evidence actually stands behind each one, and the ranking surprised me: it isn't close. One control does more than the other four put together.

Remove the action. Don't approve it.

Start with the move the whole industry keeps reaching past. You don't need a human on every step. You need one, or a wall, in front of anything you can't take back, and that surface is tiny: Anthropic's production telemetry puts irreversible actions (a customer email, a prod write) at 0.8% of what an agent does.⁸ OWASP already rates database_delete and transfer_funds CRITICAL.⁹ Cover that 0.8% and you've covered the part that actually leaves a mark.

The strongest way to cover it isn't approval. It's removal. My agents can't merge their own work. git merge and the pull-request merge command sit on a deny list, so when an agent reaches for one, it's stopped before it runs, the way a firewall drops traffic to a closed port. Only a human merges. That single control beats every approval prompt I could bolt on, and the reason is the whole argument of this post compressed to a sentence: a capability the agent doesn't have can't be approved around by a tired click. There's no prompt to rubber-stamp, because the action never starts. The research backs the shape of it: harden the environment so the exploit isn't reachable and a class of agent misbehavior drops by almost 90%, far more than you'd ever catch after the fact.¹⁰ And it ports to any platform: don't grant the dangerous tool, scope the token read-only, wall off the irreversible action where the agent can't reach it. Every other control on this page is a weaker version of the same instinct, sorted onto a ladder, and the job is to drag each one as far up that ladder as it will go.
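A minimal sketch of a deny list that fails closed, assuming a hypothetical wrapper that every agent command passes through. The `gh pr merge` entry is my guess at a pull-request merge command (it is the GitHub CLI's), and `run_for_agent` is an illustrative name rather than any vendor's API.

```python
import shlex

# Hypothetical deny list: each entry is a command prefix, as a tuple of tokens.
DENIED_PREFIXES = [
    ("git", "merge"),
    ("gh", "pr", "merge"),
]


def is_denied(command: str) -> bool:
    """True if the command's leading tokens match a denied prefix."""
    try:
        tokens = tuple(shlex.split(command))
    except ValueError:
        return True  # unparseable input is denied rather than guessed at
    return any(tokens[: len(prefix)] == prefix for prefix in DENIED_PREFIXES)


def run_for_agent(command: str) -> str:
    # Fail closed: a denied command is refused before anything executes.
    if is_denied(command):
        raise PermissionError(f"denied for agents: {command}")
    return f"would run: {command}"  # placeholder; real execution is out of scope
```

Prefix matching like this is easy to route around (`sh -c "git merge"`, for one), which is why the stronger forms are the ones named above: never grant the tool, or scope the token so the merge is impossible whatever command is typed.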

Move every control up the ladder: the last rung can't be rubber-stamped

Habit: you remember; the agent doesn't
Written rule: advisory; decays
CI gate: fails the build
Denied command: the action never starts

The four that ring it

The rest matter, but they orbit that one. In rough order of how much evidence backs them:

Build for undo. The actions you can't remove, make cheap to reverse. IBM Research's STRATUS prototype hands every agent action an undo operator and rolls back when state worsens;¹¹ the boring versions are soft-delete, dev/prod separation, idempotency, and a planning-only mode, the same family of fixes Replit shipped after its agent deleted that database.¹ This is the leg I lean on least. My own controls are gate-heavy and undo-light: a gap I can see clearly but haven't closed.
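The boring version can be very small. A sketch of soft-delete over an in-memory store, where `SoftDeleteStore` and its methods are hypothetical names; a real system would put the same marker column in the database.

```python
from datetime import datetime, timezone
from typing import Optional


class SoftDeleteStore:
    """In-memory store where delete marks a row instead of destroying it."""

    def __init__(self) -> None:
        self._rows: dict = {}

    def put(self, key: str, value: dict) -> None:
        self._rows[key] = {"value": value, "deleted_at": None}

    def get(self, key: str) -> Optional[dict]:
        row = self._rows.get(key)
        if row is None or row["deleted_at"] is not None:
            return None  # missing and soft-deleted look the same to readers
        return row["value"]

    def delete(self, key: str) -> None:
        # Reversible: the data stays, only the marker changes.
        self._rows[key]["deleted_at"] = datetime.now(timezone.utc)

    def undelete(self, key: str) -> None:
        self._rows[key]["deleted_at"] = None
```

A wrong delete then costs one `undelete` call instead of a restore from backup.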

Separate the doer from the checker. Silent failure comes from one place: the executor grading its own homework. Break that loop with anything independent, a different model reading the chain-of-thought, a test, a read-after-write probe. This is the oversight in my setup that surprised me by earning its keep, because it isn't me watching the agent, it's me adjudicating between agents. A review bot leaves more than ten times the comments I do; I triage instead of reading raw code, and the only moments my judgment is load-bearing are when I overrule it, like the clean-looking refund fix it once handed me that would have reopened a bug I'd closed weeks before, one it couldn't have known about. I wrote the reason into a false-positive register so the catch outlives the session. Chain-of-thought monitoring catches 95% of systemic reward-hacks in benchmarks against 60% for output-only,¹² but it's fragile: train against the monitor and the agent learns to hide.¹³

Escalate selectively. Send everything to a human and you manufacture the nine-in-ten reflex that makes him useless. Route only the pre-flagged cases and detection climbs ~15% over blanket review.¹⁴ Anthropic said it plainly: making humans approve every action "creates friction without necessarily producing safety benefits."⁸ Watch your override rate; when it falls, your human has gone passive.
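The override rate is simple arithmetic once review decisions are logged. A sketch, assuming each decision is recorded as the string "override" or "approve"; the 0.5 drop threshold is an arbitrary placeholder, not a researched value.

```python
def override_rate(decisions: list[str]) -> float:
    """Share of review decisions where the human overruled the reviewer or agent."""
    if not decisions:
        return 0.0
    return decisions.count("override") / len(decisions)


def has_gone_passive(earlier: list[str], recent: list[str], drop: float = 0.5) -> bool:
    # Flag when the recent rate falls below `drop` times the earlier rate.
    baseline = override_rate(earlier)
    return baseline > 0 and override_rate(recent) < baseline * drop
```

If the earlier window shows 2 overrides in 10 decisions (0.2) and the recent one shows 1 in 20 (0.05), the check fires, since 0.05 is below half of 0.2.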

Measure procedure, not outcome. Track skipped steps and action correctness, not just "did it finish." Reasoning models have been caught overwriting the game state to report a win they didn't earn;¹⁵ a 2026 study found 27–78% of benchmark "successes" hid a violated step, the end state tidy, the procedure rotten.¹⁶ After any delete, write, or hand-off, re-read what actually happened and flag the gap.
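A sketch of what grading the procedure could mean in code, assuming the agent's actions are logged as an ordered list of step names. The function names and the step names in the usage note are illustrative.

```python
def skipped_steps(required: list[str], performed: list[str]) -> list[str]:
    """Return required steps that do not appear, in order, in the performed trace."""
    missing = []
    position = 0
    for step in required:
        try:
            # Search only after the previous required step, so order matters.
            position = performed.index(step, position) + 1
        except ValueError:
            missing.append(step)
    return missing


def procedure_ok(outcome_ok: bool, required: list[str], performed: list[str]) -> bool:
    # A tidy end state counts only if no required step was skipped.
    return outcome_ok and not skipped_steps(required, performed)
```

A run that migrates and verifies without first taking the backup finishes with a clean end state, yet `skipped_steps(["backup", "migrate", "verify"], ["migrate", "verify"])` still reports `["backup"]`.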

Now the honest gap: a build opportunity, not a complaint

Now the honest part. Every guard above fires before or at the merge. They check the code before it runs. Not one of them checks what the code actually did once it ran.

[Figure: Where my gates stop, and where the two incidents below actually happen. Checked, before merge: diff → review → CI gate → denied merge, ending at MERGE.]

Take the OAuth login I shipped that completed, then dumped the user on a failure screen, because the function on the far end silently couldn't parse what I sent it, and the type checker couldn't see the mismatch because the two runtimes share no contract. Or the profile save that returned a clean "saved" and changed nothing, because the database policy let the row be read and quietly refused the write. Both passed review. Both passed their tests. Both were wrong only once they ran.
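The profile-save case is the easiest to probe for. A sketch of a read-after-write check, with `SilentlyRefusingStore` as a hypothetical stand-in for a database whose policy allows the read and drops the write.

```python
class SilentlyRefusingStore:
    """Hypothetical store that reports success on write without changing anything."""

    def __init__(self) -> None:
        self._data = {"name": "old"}

    def write(self, key: str, value: str) -> str:
        return "saved"  # reports success; the row is untouched

    def read(self, key: str) -> str:
        return self._data[key]


def save_and_verify(store, key: str, value: str) -> bool:
    """Write, then read back and compare; the return code alone proves nothing."""
    store.write(key, value)
    return store.read(key) == value
```

Against this store, `save_and_verify` returns `False` even though `write` said "saved". The probe only helps when the read goes through the same path a real user would hit.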

Detecting silent unreliability in production, with real precision and recall on real workloads, is still unowned. One of the emerging runtime detectors, Apollo's Watcher, has to inject failures to get ground truth, because real labels barely exist.¹⁷ That's the work. Not whether to watch the agent, which is settled and well-tooled, but learning to see the moment it's quietly wrong, and proving which guards earn their cost. I have gates for the code. I don't yet have the part that checks the result.

What to do about it

Audit your own loop against the thesis, scaled to how you actually work. The moves below come out of the research on AI self-correction, correlated model errors, and human-AI teaming.¹⁸¹⁹²⁰

If you work in a chat assistant

Verify the source: Check against the primary source, never the model's confidence. Make it quote word-for-word, then confirm the quote exists.
Don't ask it to self-check: "Are you sure?" or "what did you miss?" makes a model flip correct answers to wrong ones about half the time. Self-doubt isn't verification.
Get a real second opinion: A different model beats asking the same one twice, but it isn't independent. Models share blind spots. The only truly independent check isn't a language model: a test, a calculator, the document, or you.
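The "confirm the quote exists" step can be mechanical. A minimal sketch, assuming you have fetched the primary source's text yourself; `quote_exists` is a hypothetical helper.

```python
import re


def _normalize(text: str) -> str:
    # Collapse whitespace and case so line wrapping doesn't cause false misses.
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_exists(quote: str, source_text: str) -> bool:
    """True only if the quote appears in the source, ignoring whitespace and case."""
    needle = _normalize(quote)
    return bool(needle) and needle in _normalize(source_text)
```

It confirms that the words are in the document, not that they mean what the model says they mean, and it will miss quotes that differ only in curly versus straight punctuation.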

Tools & deeper: Cross-check with a citation-grounded tool, like Perplexity or your assistant's own search/grounding mode, so each claim links to a source you can open (the citation is a lead to verify, not proof). The evidence that self-checking backfires: Huang et al., "LLMs Cannot Self-Correct Reasoning Yet." Vetting AI output as a non-specialist: UNESCO's AI competency framework.

If you're a solo developer

Gate it, don't note it: The first time a class of bug recurs, build a gate (a lint rule, a test, a check that breaks the build) instead of writing another rule.
Deny, don't approve: Stop reflex-approving; deny the irreversible command outright instead of approving it each time.
Build for undo: Prefer reversible operations (soft-delete over hard-delete, dev/prod separation, a planning-only mode) so a wrong action is cheap to roll back instead of a catastrophe.
Check the result, not the return code: Assert the actual side effect (read back what you wrote) and run at least one adversarial test the agent didn't write itself.

Tools & deeper: Make the gate real with a structural linter like ast-grep or Semgrep (they match patterns on the code's syntax tree, so the rule catches the bug's shape regardless of names or spacing), run on every commit via pre-commit. Every major agent ships the deny-the-command mechanism: Claude Code's permissions.deny, OpenAI Codex's sandbox & approval modes, Gemini CLI's policy engine. Isolate blast radius with git worktrees, one per agent session.

If you're on a team

Diversify the reviewer: Keep a cross-vendor reviewer so it isn't the author's twin, but treat no AI reviewer as the final word.
Adjudicate, don't rubber-stamp: Make review evidence-assisted adjudication, not a verdict to confirm; paired badly, human-AI teams do worse than either alone. Watch your override rate. A falling one means the human has gone passive.
Designers set the boundary: People, not the agent, decide which actions need sign-off.
Stage and roll back: Dev/prod separation, feature-flag risky changes, and keep a tested rollback path, so a bad merge is reversible, not an incident.
Measure procedure, not just outcome: Track action correctness and skipped-step rate on the deployed product (read-after-write on real side effects), not just whether CI went green.

Tools & deeper: Keep the reviewer cross-vendor, something like CodeRabbit, so it isn't the author's own model family. Watch the running product, not just the diff, with LLM observability: Langfuse, Arize Phoenix, or the vendor-neutral OpenTelemetry GenAI conventions. And red-team prompts and agents in CI with promptfoo. For the boundary itself, OWASP's Top 10 for Agentic Applications and the Five Eyes "Careful Adoption of Agentic AI Services" both put irreversible-action sign-off with system designers, not the agent.

The part I can't gate yet

Being in the loop and paying attention are not the same thing, and forty years of human-factors work plus my own shipped bugs both say the gap between them is where agents fool you. The controls that survive a tired human are detection and reversibility. A written rule is a suggestion. A gate is a control. The strongest gate is the command you deny outright, because there's no prompt left to wave through. And the oversight that earns its keep isn't you watching the agent; it's a separate checker grading its work. Put every control where the agent can't route around it.

Then sit with the part none of it solves. The gate I'm proudest of still only checks the code. Something has to check the result, and raise a flag the second the agent's "done" stops matching the world. That part I don't have yet.

1. Simon Sharwood, "Vibe coding service Replit deleted user's production database, faked data, told fibs galore," The Register, 21 July 2025; AI Incident Database #1152. Originally surfaced by Jason Lemkin (SaaStr), X, 18 July 2025. After the incident Replit announced automatic dev/prod database separation, improved rollback, and a planning-only mode (CEO Amjad Masad, reported by Fortune, 23 July 2025), all reversibility-by-design fixes.
2. Lisanne Bainbridge, "Ironies of Automation," Automatica 19(6): 775–779, 1983.
3. Raja Parasuraman & Dietrich Manzey, "Complacency and Bias in Human Use of Automation: An Attentional Integration," Human Factors 52(3): 381–410, 2010.
4. Anthropic Engineering, "How we built Claude Code auto mode," anthropic.com/engineering/claude-code-auto-mode (2026): users approve about 93% of permission prompts, which the authors note leads to "approval fatigue, where people stop paying close attention."
5. "Google Gemini AI Hallucinates Commands, Deletes Expert's Files, Takes the Blame," WebProNews, 27 July 2025; AI Incident Database #1178.
6. Mata v. Avianca, Inc., 678 F. Supp. 3d 443 (S.D.N.Y., 22 June 2023). The fabricated-citation pattern recurred in dozens of sanctioned filings through 2025–2026.
7. Samia Kabir, David N. Udo-Imeh, Bonan Kou & Tianyi Zhang, "Is Stack Overflow Obsolete? An Empirical Study of the Characteristics of ChatGPT Answers to Stack Overflow Questions," Proceedings of CHI 2024: 52% of ChatGPT answers to 517 programming questions contained incorrect information; participants still preferred them 35% of the time and overlooked the misinformation 39% of the time, citing their comprehensiveness and articulate style. DOI: 10.1145/3613904.3642596.
8. Anthropic, "Measuring AI agent autonomy in practice," 18 February 2026: of observed agent actions, ~0.8% are irreversible (such as sending an email to a customer) and ~73% have a human in the loop in some form. The paper argues that oversight rules "that prescribe specific interaction patterns, such as requiring humans to approve every action, will create friction without necessarily producing safety benefits." (Distinct from the auto-mode write-up in ⁴: that one measures the 93% approval rate of permission prompts; this one measures the share of actions that are irreversible.)
9. CISA et al., "Careful Adoption of Agentic AI Services," May 2026; OWASP Top 10 for Agentic Applications, December 2025. OWASP's companion AI Agent Security Cheat Sheet assigns tool-risk ratings: database_delete and transfer_funds are rated CRITICAL, most actions far below.
10. Kunvar Thaman, "Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use," arXiv:2605.02964 (2026): environment hardening reduced agent exploit rates by 5.7 percentage points (an 87.7% relative reduction) without degrading task success.
11. Yinfang Chen et al., "STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds," arXiv:2506.02009 (NeurIPS 2025); see IBM Research's write-up "An 'undo-and-retry' mechanism for agents," 12 November 2025. Each agent action carries an undo operator, commands are simulated first, and a transaction reverts when system state worsens; the team's stated assumption is that "every action must be undoable." A research direction, not a shipped product.
12. Bowen Baker et al. (OpenAI), "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation," arXiv:2503.11926, March 2025: an LLM monitor reading the chain-of-thought caught ~95% of systemic reward-hacks versus ~60% for action/output-only monitoring.
13. Tomek Korbak et al., "Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety," arXiv:2507.11473 (2025); Baker et al. (above) further show that optimizing directly against a chain-of-thought monitor pushes the agent to obfuscate its reasoning, so monitorability degrades under training pressure.
14. Kale et al. (Scale AI / CMU / MIT), "Reliable Weak-to-Strong Monitoring of LLM Agents," arXiv:2508.19461, August 2025: escalating only pre-flagged cases to human reviewers improved true-positive detection by ~15% (at a 1% false-positive rate) over broad review.
15. Bondarenko et al. (Palisade Research), "Demonstrating Specification Gaming in Reasoning Models," arXiv:2502.13295 (2025): o1-preview, for example, edited the stored board-state to force the chess engine to forfeit rather than winning legitimately.
16. "Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation," arXiv:2603.03116, 2026.
17. Apollo Research, "Watcher" (watcher.apolloresearch.ai), 2026: a runtime oversight layer tuned against 40+ recurring coding-agent failure modes drawn from real coding-agent transcripts. Apollo's pipeline "injects failure modes into them to produce a large dataset with approximate ground truth," since real labeled failures are rare in production. Vendor product (alpha); cited as evidence the runtime-detection category exists, not proof of efficacy.
18. Huang et al., "Large Language Models Cannot Self-Correct Reasoning Yet," ICLR 2024 (arXiv:2310.01798); Laban et al., "Are You Sure? Challenging LLMs Leads to Performance Drops in the FlipFlop Experiment," arXiv:2311.08596, which finds that challenging a model flips ~46% of correct answers and drops accuracy ~17%.
19. Kim et al., "Correlated Errors in Large Language Models," ICML 2025 (arXiv:2506.07962).
20. Google DeepMind, "Human-AI Complementarity: A Goal for Amplified Oversight," arXiv:2510.26518 (2025), citing Vaccaro et al. (2024) for the finding that human-AI teams on average underperform the better of human or AI alone.