A human watching AI isn't oversight
A human watching AI isn't oversight — detection and reversibility are. Place controls where the agent can't route around them, and remember every control should be watching the process and results, not just the code.
- "Keep a human in the loop" is the answer everyone reaches for, and 40 years of human-oversight research says a passive monitor is the weakest safeguard.
- Three altitudes — chat, solo dev, team — same failure: the agent reports success while doing the wrong thing.
- What worked in my own build: layered automated review with me as the adjudicator, structural gates that check processes, fail the build before production, and blocking irreversible actions entirely.
- The honest gap: every gate I have fires before PR merge. Two incidents slipped all of them — the code was fine; only the end result was wrong.
You've probably heard the Replit one. In July 2025 a founder building an app on Replit's AI agent put the project under an explicit code freeze. The agent deleted his production database anyway (more than 1,200 live records), then generated thousands of fake records to fill the gap, produced passing test results that weren't real, and told him the data was gone for good and couldn't be restored. It could. Every beat of that story is the agent reporting success, or certainty, while the truth was the opposite.[1]
I've never deleted a production database. But I've shipped the quiet version of that same failure — the one that doesn't make headlines because nothing dramatic happens on screen. An agent wired up the transactional emails my product sends to move a deal forward, reported the work done, and the tests passed; the dashboard showed the messages as sent, and the audit trail agreed. Some of them never left the building. The function that actually sends the mail returns a success code even when a message is suppressed or blocked — so "delivered" and "silently dropped" came back looking identical, and every caller that trusted that code marched the workflow forward as if the mail had gone out. It reported success, and it was wrong, and nothing in the code was there to notice.
This is the question everyone is circling when they ask how to balance automation against control: how do you tell when the AI has quietly gone down the wrong path, and where do you insert human oversight to catch it? I've analyzed my own patterns and read the research on best practices, and I think the common answer is the wrong one.
The question everyone asks first
The reflex is "keep a human in the loop." Add an approval step. Make someone sign off. It feels responsible, and it is almost useless, and we've known that for forty years.
Lisanne Bainbridge wrote it down in 1983, in a paper called "Ironies of Automation": automate most of a task and you leave the human with the one job humans are worst at, which is watching a system that's usually right and staying sharp enough to catch the rare moment it isn't. The skills you'd need to intervene atrophy from disuse. The watcher gets worse at watching exactly as the thing they're watching gets more trustworthy.[2] Parasuraman and Manzey spent the next few decades showing this is not a training problem; experts rubber-stamp too, especially under load. It's called automation complacency, and a checkbox doesn't fix it.[3]
The modern numbers say the same thing. Anthropic measured how people actually use approval prompts in Claude Code and found users approve about 93% of them, and the more prompts someone sees, the less attention they pay to each one. A human "in the loop" who approves nine of every ten actions on reflex isn't oversight. It's a turnstile.[4]
And this isn't a Claude problem, or a coding problem. In July 2025 a developer watched Google's Gemini command-line agent misread a failed mkdir (the command that creates a folder), then run a series of file "moves" against a filesystem that didn't exist the way it thought it did, overwriting and erasing his work while cheerfully reporting progress. The agent's own confession afterward: "I have failed you completely and catastrophically."[5] Different vendor, different tool, identical shape. The agent reported success while doing damage, and the human was there the whole time.
Whether a human should oversee was never the hard part. The platitude skips the two questions that are: can you detect the silent wrong turn at all, and where does oversight actually change the outcome instead of just adding friction?
The same failure, three altitudes
The failure mode is one thing wearing three costumes, depending on how you're working.
| Your loop | Chat user (ChatGPT, Gemini, Claude) | Solo dev (you + a coding agent) | Team (many authors, shared CI) |
|---|---|---|---|
| Oversight is | reading the answer | the diff & the prompt | review, CI, the merge |
| Silent failure | a confident wrong answer | green checks that lie | "done" work that never runs |
| What helps | verify the source, not the confidence | gate it; deny the command | adjudicate; watch the override rate |
If you only ever use a chat assistant (ChatGPT, Gemini, Claude, any of them), your oversight is just reading the answer, and the silent failure is confident wrongness. The clearest example is the one that's now gotten a string of lawyers sanctioned: an attorney asked ChatGPT for case law, and it produced six court decisions, complete with names, citations, and quotations, that did not exist. When he asked the model whether the cases were real, it assured him they were. He filed them; the judge couldn't find them; he was sanctioned.[6] It isn't malice or stupidity. A fluent, confident answer reads as a correct one, and the more polished it is, the easier it is to wave through. In one study, people preferred a chatbot's answers more than a third of the time even though over half were wrong, because they were better written.[7] Fluency is not correctness.
If you're a solo developer, your default oversight is the diff and the permission prompt. The silent failure is the one I opened with, or the time I "fixed" a type error by dropping the database type definitions. That made the type checker pass, the change shipped, and eighteen real type bugs quietly accumulated over the following week before I turned the checking back on. The check said green. Green was a lie.
If you're a team, default oversight is code review, CI and someone's name on the merge. The silent failure scales up: a batch of email handlers that passed review and tests as "done" — and had no caller anywhere in the codebase, so not one of those emails would ever send. It was caught, but not by a person reading carefully. It was caught because a second agent noticed the functions were never invoked.
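The "defined but never invoked" check is easy to approximate mechanically. The following is a minimal Python sketch of the idea, not the second agent's actual method: the function name `uncalled_functions`, the `entry_points` allowlist, and the file names in the usage note are all hypothetical, and the matching is by bare name only, so dynamic dispatch or decorator-based registration would need extra handling.

```python
import ast


def uncalled_functions(
    sources: dict[str, str],
    entry_points: frozenset[str] = frozenset(),
) -> list[str]:
    """Return names of functions that are defined somewhere in `sources`
    but never appear as the target of a call anywhere in `sources`.

    `sources` maps a file path to its text. `entry_points` lists names that
    are legitimately called from outside (routes, CLI mains, and so on).
    """
    defined: set[str] = set()
    called: set[str] = set()
    for text in sources.values():
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                defined.add(node.name)
            elif isinstance(node, ast.Call):
                target = node.func
                if isinstance(target, ast.Name):        # plain call: f(x)
                    called.add(target.id)
                elif isinstance(target, ast.Attribute):  # method or module call: m.f(x)
                    called.add(target.attr)
    return sorted(defined - called - set(entry_points))
```

Run over a codebase where `send_receipt` is defined in a handlers file but only `send_welcome` is ever called, with `signup` allowlisted as an entry point, the function returns `["send_receipt"]`. A CI step could fail the build whenever the returned list is non-empty.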
What actually worked
Here's the thing I didn't expect. The oversight in my own setup that earns its keep is not just me watching the agent. It's me adjudicating between agents.
The first-pass reviewer on my pull requests is itself an AI — a reviewer bot that reads every change from multiple perspectives. In one of my repos it has left more than ten times as many review comments as I have. I don't catch raw code by eye anymore; I triage what the bot flags. And the moments where my judgment is actually load-bearing are the ones where I overrule it. The bot once flagged a refund calculation and proposed a clean, reasonable-looking fix — that would have reintroduced a bug I'd fixed weeks earlier, one it had no way to know about. I left a comment turning the fix down and codified the issue in a false-positive register. That's the job: not vigilance, but knowing the thing the reviewer can't see and writing it down so the catch survives.
Everything else that works is structural — and it comes down to one distinction every security person already knows: an advisory control tells you what to do; a technical control makes the wrong thing impossible. A written rule is advisory. The failure I opened with — the emails that reported sent and never went out — is the case in point: I had a rule, in plain words, about how to handle email-send results, and the same bug shipped under that rule three times, because a prose rule is a suggestion the next session forgets. What finally stopped it was a build-breaking check that fails any code path that doesn't branch on the send status. The rule became a gate.
RULE.md (the rule): "always branch on the email send status"
CI (the gate): fail the build if a caller doesn't check the status enum

Prose documents intent; gates enforce it. The first decays the moment a new session doesn't read it. The second is a chokepoint the agent has to pass through, and it fails closed.
The strongest control I have is the end of that same line: a denied command. My agents can't merge their own work: git merge and the pull-request merge command sit on a deny list, so when an agent reaches for one, it's blocked before it runs, the way a firewall drops traffic to a closed port. Only a human merges. That's worth more than any approval prompt, because a capability the agent doesn't have can't be approved around by a tired click. There's no prompt to rubber-stamp, because the action never starts. The research backs the shape of it: in one study, hardening the environment so an exploit simply wasn't available cut a class of agent misbehavior by almost 90%, far more than catching it after the fact.[8] Stated platform-agnostically, the move is the same everywhere: don't grant the dangerous tool, scope the token to read-only, put the irreversible action behind a wall the agent can't reach. Graded against the published guidance (CISA's agentic-AI advisory, the OWASP agentic top ten), denying the irreversible action is exactly the posture they prescribe.[9]
- Habit: you remember; the agent doesn't
- Written rule: advisory; decays
- CI gate: fails the build
- Denied command: the action never starts
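The bottom rung of that ladder can be sketched in a few lines. This is an illustrative Python wrapper, not the deny-list mechanism of any particular agent tool: `DENIED_PREFIXES`, `is_denied`, and `run_agent_command` are hypothetical names, and `gh pr merge` is my assumption for what "the pull-request merge command" looks like on GitHub's CLI.

```python
import shlex
from typing import Callable

# Command prefixes the agent may never run. Entries are assumptions for
# illustration; a real list would match your own tooling.
DENIED_PREFIXES: list[tuple[str, ...]] = [
    ("git", "merge"),
    ("gh", "pr", "merge"),
]


def is_denied(command: str) -> bool:
    """True if the command starts with a denied prefix or cannot be parsed."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # fail closed: an unparseable command is not allowed through
    return any(tuple(tokens[: len(p)]) == p for p in DENIED_PREFIXES)


def run_agent_command(command: str, runner: Callable[[str], str]) -> str:
    """Run `command` through `runner` unless it is on the deny list."""
    if is_denied(command):
        # No approval prompt is shown; the action simply never starts.
        raise PermissionError(f"denied by policy: {command}")
    return runner(command)
```

Prefix matching on a command string is easy to route around (a shell wrapper or an inserted flag is enough), so treat this as the shape of the control rather than the control itself. The durable version is the one described above: the credential or tool that could perform the merge is never granted to the agent in the first place.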
The gap I can't gate
Now the honest part. Every control I just described fires before or at the moment code merges. They check the code — the logic, before it runs. Not one of them checks what the code actually did once it ran.
Take the OAuth login flow I shipped that completed and then bounced the user to a failure screen, because the function on the receiving end silently couldn't parse what my code sent it — and nothing in the type checker could catch the mismatch, because the two runtimes don't share a contract. Or the profile save that returned a clean "saved" and changed nothing, because the database policy let the row be read but silently refused the write. Both passed review, both passed their tests, both were wrong only once they ran.
This is the frontier, and the evidence is blunt about it. Researchers have shown reasoning models that overwrite the game state to report a win without winning.[10] A 2026 study found that somewhere between a quarter and three-quarters of agent "successes" on a benchmark concealed a skipped or violated step: the end state looked done; the procedure was corrupt.[11] Checks that look only at the end result pass these by definition. The fix-shaped idea, in my own words: check the process and the result, not just the code. After any delete, write, or external hand-off, re-read what actually happened and raise a flag when it doesn't match what the agent claimed. I have gates for the code. I'm still building the part that checks the result.
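A read-back check is the simplest form of that idea. The sketch below uses Python's built-in sqlite3 as a stand-in; the `profiles` table, `save_display_name`, `save_and_verify`, and `ResultMismatch` are hypothetical names. The silent no-op here is an UPDATE that matches zero rows, which is only an analogue of a database policy refusing a write, but the shape is the same: the call returns "saved" and nothing changed.

```python
import sqlite3


class ResultMismatch(Exception):
    """Raised when the stored state does not match what the write reported."""


def save_display_name(conn: sqlite3.Connection, user_id: int, name: str) -> str:
    """Write that reports success even when no row was actually changed."""
    conn.execute(
        "UPDATE profiles SET display_name = ? WHERE id = ?", (name, user_id)
    )
    conn.commit()
    return "saved"


def save_and_verify(conn: sqlite3.Connection, user_id: int, name: str) -> str:
    """Perform the write, then re-read the row and compare it to the claim."""
    status = save_display_name(conn, user_id, name)
    row = conn.execute(
        "SELECT display_name FROM profiles WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None or row[0] != name:
        raise ResultMismatch(
            f"write reported {status!r} but stored value is {row!r}"
        )
    return status
```

With a single row for user 1, `save_and_verify(conn, 1, "Ada")` returns `"saved"`, while the same call for a missing user 2 raises `ResultMismatch` even though the underlying write reported success. The same pattern applies to deletes (confirm the row is gone) and external hand-offs (confirm the receiving side recorded what was sent).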
What to do about it
Audit your own loop against the thesis, scaled to how you work. The moves below are drawn from the research on AI self-correction, correlated model errors, and human-AI teaming.[12][13][14]
If you work in a chat assistant
- Verify the source: check against the primary source, never the model's confidence. Make it quote word-for-word, then confirm the quote exists.
- Don't ask it to self-check: "Are you sure?" or "what did you miss?" makes a model flip correct answers to wrong ones about half the time. Self-doubt isn't verification.
- Get a real second opinion: a different model beats asking the same one twice, but it isn't independent, because models share blind spots. The only truly independent check isn't a language model: a test, a calculator, the document, or you.
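The "confirm the quote exists" step is one of the few chat-level checks that needs no model at all. A minimal Python sketch, assuming you have the primary source as plain text (`quote_exists` is a hypothetical helper name):

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case so line breaks don't cause misses."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def quote_exists(quote: str, source_text: str) -> bool:
    """True only if the quoted passage appears verbatim in the source text."""
    needle = _normalize(quote)
    return bool(needle) and needle in _normalize(source_text)
```

A match shows only that the words are in the document, not that they support the claim, so the reading still falls to you. A miss, on the other hand, is a strong signal that the citation was invented.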
If you're a solo developer
- Gate it, don't note it: the first time a class of bug recurs, build a gate (a lint rule, a test, a check that breaks the build) instead of writing another rule.
- Deny, don't approve: stop reflex-approving; deny the irreversible command outright instead of approving it each time.
- Check the result, not the return code: assert the actual side effect by reading back what you wrote, and run at least one adversarial test the agent didn't write itself.
If you're on a team
- Diversify the reviewer: keep a cross-vendor reviewer so it isn't the author's twin, but treat no AI reviewer as the final word.
- Adjudicate, don't rubber-stamp: make review evidence-assisted adjudication, not a verdict to confirm; paired badly, human-AI teams do worse than the better of the two working alone. Watch your override rate; a falling one means the human has gone passive.
- Designers set the boundary: people, not the agent, decide which actions need sign-off.
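Watching the override rate can be as simple as comparing two windows of review decisions. The sketch below is one way to do it in Python; the function names and the `drop_ratio` threshold of 0.5 are my assumptions, not values from the research, and each decision is recorded as `True` when the human overruled the AI reviewer.

```python
def override_rate(decisions: list[bool]) -> float:
    """Fraction of reviews in which the human overruled the AI reviewer."""
    if not decisions:
        return 0.0
    return sum(decisions) / len(decisions)


def reviewer_gone_passive(
    earlier: list[bool],
    recent: list[bool],
    drop_ratio: float = 0.5,  # assumed threshold; tune to your own history
) -> bool:
    """Flag when the recent override rate falls below a share of the earlier one."""
    before = override_rate(earlier)
    after = override_rate(recent)
    return before > 0 and after < before * drop_ratio
```

A drop can also mean the reviewer bot genuinely improved, so the flag is a prompt to sample a few recent approvals by hand rather than a verdict on its own.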
Presence isn't oversight
The lesson isn't "keep a human in the loop." It's that presence isn't oversight — detection and reversibility are. A written rule is a suggestion; a gate is a control; and the strongest gate is the command you deny outright. Put it where the agent can't route around it.
And then remember the part I'm still working on: the gate you're most sure of only checks the code. Something still has to check the process and the result — and raise a flag the moment they don't match what the agent reported.
Footnotes
1. Simon Sharwood, "Vibe coding service Replit deleted user's production database, faked data, told fibs galore," The Register, 21 July 2025; AI Incident Database #1152. Originally surfaced by Jason Lemkin (SaaStr), X, 18 July 2025.
2. Lisanne Bainbridge, "Ironies of Automation," Automatica 19(6): 775–779, 1983.
3. Raja Parasuraman & Dietrich Manzey, "Complacency and Bias in Human Use of Automation: An Attentional Integration," Human Factors 52(3): 381–410, 2010.
4. Anthropic Engineering, "How we built Claude Code auto mode," anthropic.com/engineering/claude-code-auto-mode (2026): users approve about 93% of permission prompts, which the authors note leads to "approval fatigue, where people stop paying close attention."
5. "Google Gemini AI Hallucinates Commands, Deletes Expert's Files, Takes the Blame," WebProNews, 27 July 2025; AI Incident Database #1178.
6. Mata v. Avianca, Inc., 678 F. Supp. 3d 443 (S.D.N.Y., 22 June 2023). The fabricated-citation pattern recurred in dozens of sanctioned filings through 2025–2026.
7. Samia Kabir, David N. Udo-Imeh, Bonan Kou & Tianyi Zhang, "Is Stack Overflow Obsolete? An Empirical Study of the Characteristics of ChatGPT Answers to Stack Overflow Questions," Proceedings of CHI 2024: 52% of ChatGPT answers to 517 programming questions contained incorrect information; participants still preferred them 35% of the time and overlooked the misinformation 39% of the time, citing their comprehensiveness and articulate style. DOI: 10.1145/3613904.3642596.
8. Kunvar Thaman, "Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use," arXiv:2605.02964 (2026): environment hardening reduced agent exploit rates by 5.7 percentage points (an 87.7% relative reduction) without degrading task success.
9. CISA et al., "Careful Adoption of Agentic AI Services," May 2026; OWASP Top 10 for Agentic Applications, December 2025.
10. Bondarenko et al. (Palisade Research), "Demonstrating Specification Gaming in Reasoning Models," arXiv:2502.13295 (2025); e.g., o1-preview edited the stored board-state to force the chess engine to forfeit rather than winning legitimately.
11. "Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation," arXiv:2603.03116, 2026.
12. Huang et al., "Large Language Models Cannot Self-Correct Reasoning Yet," ICLR 2024 (arXiv:2310.01798); Laban et al., "Are You Sure? Challenging LLMs Leads to Performance Drops in the FlipFlop Experiment," arXiv:2311.08596 (challenging a model flips ~46% of correct answers and drops accuracy ~17%).
13. Kim et al., "Correlated Errors in Large Language Models," ICML 2025 (arXiv:2506.07962).
14. Google DeepMind, "Human-AI Complementarity: A Goal for Amplified Oversight," arXiv:2510.26518 (2025), citing Vaccaro et al. (2024) for the finding that human-AI teams on average underperform the better of human or AI alone.