AI Confidence Thresholds: When to Escalate to a Human

Created time

Jun 9, 2026 02:49 PM

Author

Mike Heap

Why "set a confidence threshold at 80%" is a pre-LLM relic

⚡ TL;DR: A numeric threshold made sense for classical machine learning, which output a real probability you could calibrate. A generative LLM's confidence score is the model rating itself, so a fixed percentage is shakier than it looks.

The standard advice is tidy. Give the AI a confidence score on every answer, then act in tiers: respond directly above 90%, add caveats or ask a clarifying question between 60 and 90%, and escalate below 60%. Then calibrate by pulling a few hundred escalated tickets, checking what confidence the AI reported, and nudging the threshold until you're happy with the trade-off.

For a classical machine-learning classifier (the world this advice grew up in), that's exactly right. Those models output a genuine probability, and a probability is something you can threshold and calibrate with a straight face.
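For reference, the tiered rule described above fits in a few lines. This is a minimal sketch, assuming the input is a calibrated probability between 0 and 1; the function name and the action labels are illustrative, and the 0.90 and 0.60 cut-offs are simply the ones quoted in the standard advice.

```python
def classic_tier(confidence: float) -> str:
    """Map a calibrated probability (0.0-1.0) to the classic three-tier action."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence > 0.90:
        return "respond"            # answer directly
    if confidence >= 0.60:
        return "caveat_or_clarify"  # add caveats or ask a clarifying question
    return "escalate"               # hand off to a human
```

The rule itself is trivial. Everything depends on whether the number going in is a real, calibrated probability.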

Generative LLMs don't work that way. Most "confidence scores" in a modern support agent are produced by prompting the model to say how confident it is, and that's more of a judgment call than a measurement.

It's a bit of pseudo-science, really (and I mean that literally): ask two different models to rate the same answer and you get two different numbers, and ask the same model at two times of day and the number moves again. More art than science, which makes a hard threshold on top of it shakier than it looks.

A two-column comparison of a classical machine-learning classifier versus a generative LLM agent across what the score is, asking twice, and whether it is safe to threshold.

That vagueness, though, is mostly an upgrade. A modern model applies far smarter judgment to "do I actually know this?" than a binary cut-off ever could, because it can read the situation, weigh what it's been given, and reason about what it does and doesn't know. What you need isn't a sharper number but a better model of the decision.

A single global threshold has a second problem (and it's the bigger one). It treats every ticket as carrying the same risk. At 80%, the same dial that confidently answers a "where's my order?" question also confidently answers a "can I cancel my policy, and am I still covered?" question, even though getting the second one subtly wrong costs far more.

As CX Today notes, an escalation strategy that's too eager or too sticky erodes customer trust either way. A number also ignores every reason to escalate that has nothing to do with confidence: a frustrated customer, a topic you've decided a human must own, or someone who simply asks for a person.

Replicant's guide on escalation rules makes the same point from the other direction, that escalation is intentional design and "it's important not to treat every bump in the road as a reason to escalate." Even Microsoft's bot-handoff documentation frames the question as when a human is needed, rather than which number to dial in. None of them frame the answer as a single percentage, because there isn't one.

The framework: two thresholds, four triggers, and one question first

⚡ TL;DR: Replace the one confidence number with two thresholds (can the AI ground an answer? and should it, given the stakes?), four escalation triggers, and a clarifying-question step before any handoff.

If a single confidence number is the wrong model, here's the one we'd use instead. It has two thresholds that most teams collapse into one, four triggers that fire a handoff, and a clarifying-question step in the middle that recovers answers a naive threshold would have thrown away.

A framework breakdown showing two thresholds (can it ground an answer, should it given the stakes), a clarifying-question step, and four escalation triggers.

Threshold one: can the AI ground an answer?

This is the threshold I actually care about, and you don't set it as a number at all. It comes down to whether the information needed to answer is available to the AI. If the knowledge is connected and the data is reachable, the AI can answer; if it isn't, the AI should say so instead of guessing.

You don't tune this dial; you raise it by giving the AI more to work with. Connect more of your help center, wire up live customer data through an API, and the set of questions the AI can ground grows. And if you don't have written docs yet, this isn't a dead end: our Train on Historic Tickets feature auto-generates starter knowledge from your last few thousand resolved tickets, so you can get going from scratch.

The clearest proof of this is Edel Optics. They lifted AI resolution from 25% to 79% not by loosening an escalation rule but by adding a User Data API so the AI could see order, delivery and return information. They raised the real threshold, and fewer tickets tripped the "I can't answer" line as a result.

Threshold two: should it answer, given the stakes?

The second threshold is one we lean on hard, and it's independent of confidence. Even when the AI can answer perfectly well, some topics should go to a human anyway because the cost of being wrong is too high. Security and safety issues, legal and fraud questions, billing disputes, account deletion: a person owns those by design.

This is a decision you make up front and revisit only when your weekly review gives you a reason to (and I'd argue it matters more than the first). You're deciding which categories of conversation a human always handles, no matter how confident the AI is. Kriptomat does exactly this: they route legal and fraud topics to humans through handover guidance, while the AI resolves 62% of tickets on its own.

The bit the consensus misses is that these are two different questions. Capability ("can it answer?") and stakes ("should it?") aren't the same axis, and a single threshold conflates them. Tune one number and you'll be over-escalating easy tickets and under-escalating risky ones at the same time, with no way to fix both at once.

The clarifying-question step

Before any handoff, there's a tier most guides skip (and I think it's the one that does the quiet heavy lifting). When the AI isn't sure, or there's genuine judgment involved, the right move is often neither escalate nor guess: it's to ask the customer a clarifying question. "Which order are you referring to?" or "Do you mean the annual or monthly plan?" frequently unlocks an answer the AI can then give cleanly.

So the flow runs in three steps: answer; if it can't, clarify; and if it still can't, offer a human. A surprising number of conversations a blunt threshold would have bounced straight to an agent get recovered at that clarify step.

A three-step flow: answer if the AI can ground it, clarify with a question if unsure, then escalate to a human if it still can't.
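The three-step flow can be sketched as a tiny decision function. This is an illustration rather than how any particular product implements it: `can_ground` is a hypothetical boolean for "the information needed is available to the AI", and the sketch assumes a single clarifying question before a human is offered.

```python
from enum import Enum


class Step(Enum):
    ANSWER = "answer"
    CLARIFY = "clarify"
    OFFER_HUMAN = "offer_human"


def next_step(can_ground: bool, already_clarified: bool) -> Step:
    """Pick the next move: answer, ask one clarifying question, or offer a person."""
    if can_ground:
        return Step.ANSWER
    if not already_clarified:
        # Assumption: one clarifying question is attempted before any handoff.
        return Step.CLARIFY
    return Step.OFFER_HUMAN
```

A typical pass calls `next_step(False, False)` first, asks the question, re-checks grounding with the customer's reply, and only then falls through to `Step.OFFER_HUMAN`.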

The four triggers

Confidence is only one of four reasons to hand off, and a good setup fires on all four instead of limiting escalation to "the AI didn't know."

Video preview — AI to Human Handoff for Customer Support

It can't ground an answer. The information isn't there, so the AI says so, clarifies if it can, and otherwise offers a person. (This is the closest thing to the classic "low confidence" trigger, reframed around information instead of a self-rated score.)

The topic belongs to a human. A risk or policy boundary the AI shouldn't cross (security, safety, legal, fraud, billing), escalated regardless of how confident the AI is.

Sentiment or frustration. If the customer is upset, we escalate even when the AI could technically answer. The goal is the right outcome, rather than a defended resolution rate.

An explicit request for a person. Always honor it, immediately.
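To make the "fires on all four" idea concrete, here is a minimal sketch that checks every trigger independently and returns the ones that fired. The `Conversation` fields, the topic set, and the function name are all hypothetical; in particular, how you detect frustration or an explicit request is left to whatever signal your tooling provides.

```python
from dataclasses import dataclass
from enum import Enum


class Trigger(Enum):
    CANNOT_GROUND = "cannot ground an answer"
    HUMAN_OWNED_TOPIC = "topic belongs to a human"
    FRUSTRATION = "sentiment or frustration"
    EXPLICIT_REQUEST = "explicit request for a person"


# Hypothetical list; use the categories your team has decided a human owns.
HUMAN_OWNED_TOPICS = {"security", "safety", "legal", "fraud", "billing"}


@dataclass
class Conversation:
    topic: str
    can_ground: bool           # still False after any clarifying question
    customer_frustrated: bool  # from whatever sentiment signal you use
    asked_for_human: bool


def fired_triggers(conv: Conversation) -> list[Trigger]:
    """Return every trigger that fires; a non-empty list means hand off."""
    fired = []
    if not conv.can_ground:
        fired.append(Trigger.CANNOT_GROUND)
    if conv.topic.strip().lower() in HUMAN_OWNED_TOPICS:
        fired.append(Trigger.HUMAN_OWNED_TOPIC)
    if conv.customer_frustrated:
        fired.append(Trigger.FRUSTRATION)
    if conv.asked_for_human:
        fired.append(Trigger.EXPLICIT_REQUEST)
    return fired
```

Returning the full list rather than the first match keeps the handoff reason visible to the agent who picks the ticket up. Note that three of the four checks never look at whether the AI could answer.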

Calibration without a dial

You'll notice none of this involves picking a percentage (I promise that's deliberate). What you're actually calibrating is which topics a human owns and how easy the way out is, and the lens for that is the cost of being wrong in each direction.

There are two failure modes, and in our rollouts they're rarely symmetrical. The first is false deflection: the AI answers, gets it wrong, and the customer quietly gives up or churns. It's expensive and it's invisible, because a deflection metric counts it as a win.

The second is over-escalation: the AI hands off something it could have solved, which is wasteful and slow and undercuts the reason you bought AI in the first place. So the rule is simple: where false deflection is the costlier error (anything irreversible, regulated, or high-value), send it to a human by design; where over-escalation is the costlier error (high-volume, low-stakes, easily reversible), let the AI try, backed by the clarify step. You're designing routes rather than turning a knob.

Mapped onto real ticket types, that calibration looks like this:

Ticket type	Costlier error	Who owns it
Order status, WISMO	Over-escalation	AI, with the clarify step
Returns, address changes	Over-escalation	AI, with the clarify step
Refund eligibility	False deflection	AI only if it can ground it; otherwise a human
Billing disputes, cancellations	False deflection	A human, by design
Security, safety, legal, fraud	False deflection	A human, always
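If you prefer to keep that map as data rather than a document, a sketch like the one below works. The entries are transcribed from the table above, with each row's ticket types split into separate keys; the lookup function and its "unmapped" fallback are assumptions added for illustration, on the principle that an unknown ticket type should be flagged rather than guessed.

```python
# Ticket type -> (costlier error, owner), transcribed from the table above.
AI_CLARIFY = ("over-escalation", "AI, with the clarify step")
HUMAN_BY_DESIGN = ("false deflection", "A human, by design")
HUMAN_ALWAYS = ("false deflection", "A human, always")

ROUTES = {
    "order status": AI_CLARIFY,
    "wismo": AI_CLARIFY,
    "returns": AI_CLARIFY,
    "address changes": AI_CLARIFY,
    "refund eligibility": (
        "false deflection",
        "AI only if it can ground it; otherwise a human",
    ),
    "billing disputes": HUMAN_BY_DESIGN,
    "cancellations": HUMAN_BY_DESIGN,
    "security": HUMAN_ALWAYS,
    "safety": HUMAN_ALWAYS,
    "legal": HUMAN_ALWAYS,
    "fraud": HUMAN_ALWAYS,
}


def owner_for(ticket_type: str) -> str:
    """Look up who owns a ticket type; unknown types are flagged, not guessed."""
    entry = ROUTES.get(ticket_type.strip().lower())
    if entry is None:
        # Assumption: unmapped types need a human decision before the AI answers.
        return "Unmapped: decide an owner first"
    return entry[1]
```

There is no percentage anywhere in the map, which is the point: the routing is a design decision per ticket type.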

One more thing, because it's how we keep our own numbers straight. We count a conversation as resolved when it wasn't escalated to a human, and that only holds up because escalation is genuinely easy to reach.

We don't pretend to know an issue was truly solved without the customer confirming it. The trustworthiness comes from the easy, multi-path exit rather than a clever score.

What this looks like in real rollouts

⚡ TL;DR: The high-CSAT rollouts we've watched each designed the escalation path on purpose: Edel Optics by information, Kriptomat by topic, and Sofar Sounds by escalating most of the inbox deliberately.

The framework isn't theoretical. The rollouts we've watched that get escalation right each lean on a different one of the triggers, and together they show the range.

Edel Optics: raise the real threshold

Edel Optics set up an "I don't know" handover, so any question the AI couldn't answer went straight to a person. That's threshold one in action, and probably the cleanest example we have: the AI doesn't guess, it routes.

The part I find telling is how they improved it. Rather than loosening the rule, they added a User Data API so the AI could see live order, delivery, return and tracking information, which lifted AI resolution from 25% to 79% and pushed AI CSAT to 92% across 4,067 tickets. They raised the information available to the AI (no escalation rule touched), and the resolution rate followed.

Kriptomat: own the risky topics

Kriptomat configured handover guidance to send legal and fraud topics to humans regardless of confidence. That's threshold two in our framework: a stakes call, where capability doesn't enter into it. The AI resolves 62% of tickets and saves the team 172 hours a month, while the conversations that should never be automated get routed away by design.

They also kept the AI useful after the handoff (a detail I like). Their agents use the AI Copilot inside Intercom once a conversation has been passed over, so the human is faster even on the tickets the AI deliberately didn't answer.

Sofar Sounds: escalate on purpose

Sofar Sounds is the counter-example I always reach for. They run roughly 750 monthly tickets through Zendesk and deliberately escalate around 74% of the inbox to humans, with the AI resolving only about 26%. By a headline resolution number, that looks low.

But it's a deliberate calibration rather than a failure. The AI triages and prepares context so the small team can respond faster to everything else, and the result is 85% AI CSAT and around 16 hours saved every month. When the stakes and the experience matter more than the headline rate, deliberately keeping the AI's share of answers low is the right call.

For context, the field-wide AI resolution rate sits at a median of around 70%. These are self-reported stats across roughly 55 vendors, so treat them as directional rather than like-for-like (fun fact: the competitor-only median lands in almost exactly the same place). Edel Optics and TravelJoy sit above that center; Sofar Sounds sits well below it on purpose. The number on its own tells you almost nothing about whether the escalation design is good.

A spectrum from 0 to 100 percent AI resolution plotting Sofar Sounds at 26 percent, the field median at 70 percent, and Edel Optics at 79 percent.

What to do this week

⚡ TL;DR: Don't pick a percentage. Close your information gaps, decide which topics a human owns, turn on clarify-then-human, and review handover rate and CSAT weekly.

Notice that none of the actions below say "set your threshold to X", and that's intentional. Here's where I'd spend the time instead.

Close the information gaps. Pull the questions your AI couldn't answer last month, and connect the missing knowledge or wire up the live data behind them. (No help-center content yet? Generate a starter set from your historic tickets.) This raises the real threshold and tends to lift resolution more than any escalation rule. Budget around half a day.

Decide which topics a human owns. List the irreversible, regulated and high-stakes categories (security, safety, legal, billing disputes, cancellations) and add a topic-based escalation rule for them. Around 30 minutes, and the work is design rather than dial-tuning.

Turn on "can't answer, then clarify, then human." Let the AI ask a clarifying question before it gives up, and offer a person only if that fails. Around 30 minutes, and you'll watch false deflection drop without sending more tickets to agents than you need to.

Make escalation easy and multi-path. Wire sentiment and explicit requests so they escalate instantly, beyond the "didn't know" case. Around 15 minutes, and probably the cheapest win on this list.

Review weekly on the right metrics. Look at your handover rate, your AI CSAT, and a sample of the tickets the AI actually answered, rather than your deflection number alone. Adjust which topics humans own based on what you see (around 30 minutes a week).
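The weekly review in the last step can be pulled together with a short script. This is a sketch under stated assumptions: the `Ticket` fields are hypothetical stand-ins for whatever your helpdesk export contains, CSAT is assumed to be a 1-5 rating, and "AI CSAT" is assumed to mean the share of rated, AI-answered tickets scoring 4 or 5. Check how your own tool defines it.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ticket:
    ticket_id: str
    escalated: bool
    csat: Optional[int] = None  # assumed 1-5 rating; None if the customer left none


def weekly_review(
    tickets: list[Ticket], sample_size: int = 20, seed: Optional[int] = None
) -> dict:
    """Summarise handover rate, AI CSAT, and a random sample of AI-answered tickets."""
    if not tickets:
        raise ValueError("no tickets to review")
    ai_answered = [t for t in tickets if not t.escalated]
    rated = [t for t in ai_answered if t.csat is not None]
    handover_rate = (len(tickets) - len(ai_answered)) / len(tickets)
    # Assumption: a rating of 4 or 5 counts as satisfied.
    ai_csat = sum(t.csat >= 4 for t in rated) / len(rated) if rated else None
    rng = random.Random(seed)
    sample = rng.sample(ai_answered, min(sample_size, len(ai_answered)))
    return {
        "handover_rate": handover_rate,
        "ai_csat": ai_csat,
        "sample_ids": [t.ticket_id for t in sample],
    }
```

The sampled ticket IDs matter most: reading those conversations by hand is how you catch the wrong answers that a deflection number counts as wins.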

How do I get an AI to sort my tickets into AI-handled and human-owned?

Want a head start on the first two steps? Paste your ticket types into the prompt below and let an LLM draft the first version of your map. It's only a starting point (it can't judge your real answer quality, so you still need to test on live tickets), but it turns a blank page into something to react to.

You are helping a customer support team decide when their AI agent should answer a ticket and when it should hand off to a human.

Here are our most common ticket types:
[paste your top 10-15 ticket types, one per line]

For each ticket type, do the following:
1. Rate the cost of the AI getting it wrong (a "false deflection") as low, medium, or high. Anything irreversible, regulated, financial, or safety-related is high.
2. Rate the cost of the AI needlessly escalating it (an "over-escalation") as low, medium, or high. High-volume, low-stakes, easily reversible questions are high.
3. Recommend an owner: "AI, with a clarify-then-human fallback" when over-escalation is the costlier error, or "Human, by design" when false deflection is the costlier error.
4. Note what knowledge or live data the AI would need connected to answer it well.

If you can't tell from the ticket-type name, write "need more detail" instead of guessing.

Output a table with these columns: Ticket type | False-deflection cost | Over-escalation cost | Owner | Data the AI needs.
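If you'd rather fill the prompt in from an exported list than by hand, a small helper can do the substitution. This is a sketch: the function name is made up, it assumes you have saved the prompt above as a string with the bracketed placeholder line intact, and it deliberately stops short of calling any particular LLM API.

```python
PLACEHOLDER = "[paste your top 10-15 ticket types, one per line]"


def fill_prompt(template: str, ticket_types: list[str]) -> str:
    """Swap the placeholder line in the prompt for one ticket type per line."""
    if PLACEHOLDER not in template:
        raise ValueError("template does not contain the expected placeholder")
    # Drop blank entries and stray whitespace from the exported list.
    cleaned = [t.strip() for t in ticket_types if t.strip()]
    if not cleaned:
        raise ValueError("provide at least one ticket type")
    return template.replace(PLACEHOLDER, "\n".join(cleaned))
```

Paste the result into whichever model you use, and treat the table it returns as a draft to check against live tickets.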

"But shouldn't I be able to tune this?"

⚡ TL;DR: Mostly you shouldn't have to. Tuning the confidence line is the vendor's answer-quality job; you control the escalation design. Early rollout, regulated support, and tiny teams are the real exceptions.

Here's the answer I'd give that most vendors won't: mostly, no. Deciding where the confidence line sits is the vendor's answer-quality problem to solve, never the customer's. A tool that hands you a confidence-threshold slider is quietly handing you its own job, and you should be able to trust that you don't need to fiddle with confidence parameters to get a good outcome.

What you should control (and what we actually expose) is the escalation design: which topics a human owns, and how easy the way out is. That's real configuration with a real effect. A confidence percentage gives you neither.

There are three genuine exceptions where more control helps. The first is early rollout, where deliberately escalating a lot while you learn is sensible (day one is the worst your AI will ever be, so erring toward humans early is a fair trade).

The second is high-stakes or regulated support, where so much is costly to get wrong that "escalate by default" is simply correct. The third is very small teams, where a human answering is sometimes faster than configuring anything at all (and that's fine).

And one warning in the other direction, because I see this one constantly. Chasing a low escalation rate is itself a mistake. The goal is the fastest, best, right answer for the customer, and sometimes that means handing off immediately, so minimizing handoffs is optimizing the wrong number.

The takeaway

⚡ TL;DR: Split the confidence threshold into information availability and stakes, fire four triggers with a clarify step, and pick a tool whose answer quality you trust enough to leave the slider alone.

Stop looking for the magic confidence percentage. With generative AI that number is mostly the model rating itself, and in our experience it's too unstable to carry the weight teams put on it.

Split the decision into the two thresholds that count: whether the AI can ground an answer, which you raise by connecting knowledge and data, and whether it should answer at all, which you decide by topic. Fire on four triggers, ask a clarifying question before you escalate, and make the way out easy.

Then pick a tool whose answer quality you trust enough that you're not babysitting a slider. If you want to go deeper on the mechanics of the handoff itself, our guide on AI-to-human handoff covers the triggers, the transfer and what a good handoff looks like. And if you'd rather see it working than configure it, you can try My AskAI free.

FAQs

When should an AI agent escalate to a human?

In our model, a handoff should fire on any of four triggers: the AI can't ground an answer, the topic is one a human should own (legal, billing, security), the customer is frustrated, or the customer asks for a person. Confidence is only one of those four, and we'd lead with whether the AI actually has the information to answer instead of a self-reported score.

What is a good confidence threshold for AI handoff?

There isn't a magic number for a generative AI agent, much as everyone wants one. The "confidence score" a modern LLM produces is the model rating itself, so a fixed percentage rests on a shaky foundation. The threshold that genuinely governs this is whether the information needed to answer is available to the AI.

What is an AI confidence score, and is it standardized across vendors?

It's a measure of how certain the AI is about an answer, but for generative models it's usually produced by asking the model to rate its own confidence, which is closer to a judgment than a calibrated probability. It's not standardized: ask two models, or the same model twice, and you can get different numbers. We treat it as a soft signal, and never a dial that governs escalation.

How do you stop an AI from answering when it shouldn't?

Make sure it says "I can't answer" when the information isn't available, instead of guessing, and let it ask a clarifying question before it gives up. The biggest single lever we see is the knowledge itself: when the AI can ground answers in connected docs and live data, false deflection drops sharply.

Should I try to minimize my AI's escalation or handover rate?

No, and chasing that number is a classic trap. The goal is the fastest, best, right answer, and sometimes that means handing straight to a person. We've watched teams deliberately keep AI-resolution low and CSAT high because escalating the harder inbox, with context prepared, was the better experience.

How do I escalate certain topics, like billing or legal, regardless of confidence?

Set a topic-based escalation rule so those categories always go to a human, independent of how confident the AI is. In our product this is done with natural-language Handover and Escalation guidance, and it works across whichever helpdesk you already run.

How do I know whether my escalation is set right?

Watch your handover rate, your AI CSAT and a sample of the tickets the AI actually answered, rather than your deflection number alone. Deflection counts a wrong answer as a win, so it hides exactly the failure mode you care about.

Does My AskAI let me set a confidence threshold?

No, and that's deliberate. We don't expose a confidence number to tune, because we don't think you should have to. The real threshold is whether the AI can ground an answer; on top of that, our guidance rules let the AI clarify and then offer a human, route topics like security or legal to a person, and escalate on frustration or an explicit request.
