Never make a user wait on an AI

RiskScanAI was my first product: a small business answers about twenty-five questions about their security posture, and a Claude model writes them an assessment. The worst bug had nothing to do with whether the AI was right. It was about time.

The collision

Generating the summary (a five-section structured JSON document, a couple thousand tokens) takes the model somewhere between twenty and sixty seconds. The serverless platform hosting the app gives a single web request ten seconds by default, twenty-six on the paid plan, before it terminates the request and hands the user a raw error page. Those numbers cannot coexist: any generation that runs past the limit dies mid-request.

For six days I tried to make them. Tighter timeouts. Faster models. Retries. Racing a deterministic fallback against the gateway's clock. Each attempt either failed the same way or produced a worse summary — and one "fix" made it worse still: the timeout page the platform returns is HTML, which my frontend then choked on while trying to parse it as JSON.
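
The HTML-parsed-as-JSON failure can be guarded against on the frontend. The following is a minimal sketch, not the code from RiskScanAI: `readJsonOrError` is a hypothetical helper name, and it assumes a fetch-style response object exposing `ok`, `status`, `headers.get()` and `json()`.

```javascript
// Hypothetical helper: parse a fetch-style response as JSON only when the
// server actually says it is JSON.
async function readJsonOrError(response) {
  const contentType = response.headers.get("content-type") || "";
  if (!response.ok || !contentType.includes("application/json")) {
    // A gateway timeout page is usually text/html. Report it as a plain
    // failure instead of letting the JSON parser throw on markup.
    return { ok: false, status: response.status, body: null };
  }
  return { ok: true, status: response.status, body: await response.json() };
}
```

With a guard like this, a platform timeout shows up as `{ ok: false, status: 504 }` rather than as a parse exception somewhere deeper in the page.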

The fix that stuck

The answer was to stop fighting the constraint and restructure around it. The commit that ended the war put it plainly: don't wait for Claude in the HTTP request.

The moment the user finishes the questionnaire, the request returns in under a second — HTTP 202 Accepted, the status code for "I've taken the work, it isn't done yet."
The actual model call moves to a background function with a fifteen-minute budget, which writes its result to the database when it finishes.
The frontend polls every few seconds and renders a solid, instantly-available fallback summary — computed without the AI — then upgrades the page in place when the real one lands.
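
The three steps above can be sketched in a few functions. This is an illustration under stated assumptions, not the production code: the `Map` stands in for the database, `callModel` stands in for the Claude call, and `fallbackSummary`, `handleComplete`, `generateInBackground` and `handleStatus` are hypothetical names. On a real serverless platform the job would run as a separate background function invocation rather than as an un-awaited promise in the same process.

```javascript
// Hypothetical in-memory stand-in for the database table of summaries.
const summaries = new Map();

// Hypothetical deterministic fallback: computed from the answers, no model call.
function fallbackSummary(answers) {
  const gaps = answers.filter((a) => a.answer === "no").length;
  return { source: "fallback", text: `${gaps} of ${answers.length} controls need attention.` };
}

// Background job: the slow model call lives here, outside the web request.
// `callModel` is an assumed async function standing in for the Claude call.
async function generateInBackground(id, answers, callModel) {
  try {
    const text = await callModel(answers);
    summaries.set(id, { status: "ready", summary: { source: "model", text } });
  } catch (err) {
    // On failure, keep the fallback so the page never goes blank.
    summaries.set(id, { status: "failed", summary: fallbackSummary(answers) });
  }
}

// Request handler: store the fallback, start the job, acknowledge at once.
function handleComplete(id, answers, callModel) {
  summaries.set(id, { status: "pending", summary: fallbackSummary(answers) });
  const job = generateInBackground(id, answers, callModel); // deliberately not awaited
  return { statusCode: 202, body: { id, status: "pending" }, job };
}

// Poll endpoint: returns whatever is stored right now.
function handleStatus(id) {
  return summaries.get(id) ?? { status: "unknown", summary: null };
}
```

The handler returns the `job` promise only so a caller or test can wait on it; the HTTP response itself needs nothing but the 202 and the id to poll with.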

snapshot-complete.js · where the Claude call lives

504s: const summary = await claude(…) // 20–60s, inside a 26s request

HOLDS: return 202; background(() => claude(…)) // then poll for the result

The rule: the web request is for acknowledgment, never for work. Return 202 in under a second, run the model in a background function with a real time budget, and poll for the result, with a deterministic fallback on screen until it arrives.

Then a tuning pass — kicking off the background work before the "done" screen even renders, halving the token budget for a faster first token, first poll at 300 milliseconds — cut the perceived wait from about sixty seconds to about fifteen.
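
The polling side of that tuning might look like the sketch below. The 300 ms first poll comes from the text; the 2-second steady interval, the attempt cap, and the names `pollDelayMs` and `pollForSummary` are assumptions made for illustration. `fetchStatus` is assumed to be an async function resolving to `{ status, summary }`.

```javascript
// Delay before poll number `attempt` (0-based). The 300 ms first poll is from
// the text; the 2 s steady interval is an assumed value for "every few seconds".
function pollDelayMs(attempt, firstMs = 300, steadyMs = 2000) {
  return attempt === 0 ? firstMs : steadyMs;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Polls until the job leaves "pending" or `maxAttempts` is reached.
// `render` is called with the final summary so the page upgrades in place.
// `wait` is injectable so the loop can be exercised without real timers.
async function pollForSummary(fetchStatus, render, { maxAttempts = 60, wait = sleep } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    await wait(pollDelayMs(attempt));
    const result = await fetchStatus();
    if (result.status !== "pending") {
      render(result.summary);
      return result;
    }
  }
  // Nothing arrived in time: the caller keeps the fallback on screen.
  return { status: "timeout", summary: null };
}
```

A short first delay catches the cases where the background job was started early and is already done; the slower steady interval keeps the request count down while the model is still working.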

Constraints are design instructions

"Requests die at twenty-six seconds" sounds like a limitation. It's actually the platform telling you where long-running work belongs — and it isn't inside the request. The user-facing request is for acknowledgment; the work happens behind it.

That became a permanent architecture rule for everything I've built since: acknowledge instantly, do the slow work in the background, and always keep something useful on screen while the better answer is still cooking. My later products never had to fight this war, because the first one lost it properly.