AI Customer Service Vendor Selection: Buyer Checklist

Created: Jun 2, 2026 10:25 AM

Author: Mike Heap

Why does most AI vendor selection go wrong?

⚡

TL;DR: Most buyers pick on the demo, the model name, or brand familiarity. None of those predict how a tool handles your real tickets, or what it costs as it improves, which is what actually decides your resolution rate a year in.

Read the guides that rank for "how to evaluate an AI customer service vendor" and they all have the same shape (I have read a lot of them). They lead with enterprise-technical signals: model explainability, drift monitoring, performance-testing methodology.

A lot of the loudest ones are written by vendors, so the criteria quietly bend toward whatever that vendor happens to be good at. The rest are generic AI-procurement checklists that were never written for a support team at all.

Three instincts run most real buying conversations, and all three are weak predictors of whether the thing will actually work for you.

Four common mistakes buyers make when choosing an AI customer service vendor: picking on the demo, the model, brand familiarity, and day-one answer quality.

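The first is the demo. A demo shows the vendor's best case on questions the vendor chose, which tells you little about how the tool handles your own tickets.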

The second is the model. Buyers ask whether it's GPT-5.5 or fine-tuned, as if the answer settles anything. It doesn't.

The best model gives poor answers when it's set up badly, and a humbler model gives great answers when the team behind it knows what they're doing. So judge the quality of the answers you get, and let the engine be the engine's problem.

The third is brand familiarity, and I get the pull. The native AI inside your helpdesk feels like the safe default, and sometimes it genuinely is. But "we already pay for it" is a reason to test it first, not a reason to skip the test.

What none of those three instincts measure is the thing that decides your resolution rate a year from now: how the tool improves after launch, what it costs as it does, and whether it'll clear your security review without stalling the deal. So that's what we're going to score.

The three-gate vendor scorecard

⚡

TL;DR: Score every vendor against three gates, in the order the deal moves: will it work (the champion's gate), will the bill hold (the economic buyer's gate), and will it clear review (the security gate). Skipping one is how deals die at finance or get vetoed by security.

Here's the organizing idea, and it comes straight from how these deals actually move. An AI customer service tool only gets bought when three different people say yes, in this order:

The champion (the support or CX lead running the trial) has to believe it'll work day to day.

The economic buyer (the director or VP signing the check) needs the bill to be predictable.

The security reviewer (IT, InfoSec, or your DPO) needs it to clear review.

Most failed selections skip a gate. A tool that wins the champion but blows up the budget dies at finance, and a tool the champion and buyer both love can be quietly killed by a security reviewer in week three.

So instead of one long undifferentiated list, score each vendor against three short ones, in the order the deal travels. Each criterion below has a "what good looks like" and a question to put to the vendor. Score each out of five, and the winner is the one with the fewest weak spots across all three gates (not the one with the prettiest demo).

The three gates of AI vendor selection in deal order: will it work (champion), will the bill hold (economic buyer), will it clear review (security).
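The scoring rule above (score each criterion out of five, fewest weak spots wins) can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the threshold for a "weak spot" (any score below 3), the tie-break on total score, and the vendor names and scores are all assumptions invented for the example.

```python
# Hypothetical scorecards: gate -> criterion -> score out of five.
# Vendor names and scores are made up for illustration only.
WEAK_BELOW = 3  # assumption: a score of 1 or 2 counts as a weak spot

scorecards = {
    "Vendor A": {  # strong answers, but fails the billing gate
        "will it work": {"knowledge ingestion": 4, "answer quality": 5, "escalation": 4,
                         "improvement loop": 4, "channel fit": 3, "setup and support": 4},
        "will the bill hold": {"meter": 2, "billable event": 2, "total cost": 2},
        "will it clear review": {"compliance": 4, "data handling": 4, "residency and access": 3},
    },
    "Vendor B": {  # less flashy, no weak spots
        "will it work": {"knowledge ingestion": 4, "answer quality": 4, "escalation": 4,
                         "improvement loop": 4, "channel fit": 3, "setup and support": 3},
        "will the bill hold": {"meter": 4, "billable event": 4, "total cost": 4},
        "will it clear review": {"compliance": 4, "data handling": 3, "residency and access": 3},
    },
}


def weak_spots(card, weak_below=WEAK_BELOW):
    """Return (gate, criterion) pairs that score below the threshold."""
    return [
        (gate, criterion)
        for gate, criteria in card.items()
        for criterion, score in criteria.items()
        if score < weak_below
    ]


def total_score(card):
    """Sum of every criterion score across all three gates."""
    return sum(score for criteria in card.values() for score in criteria.values())


def rank(cards):
    """Fewest weak spots first; higher total breaks ties (tie-break is an assumption)."""
    return sorted(
        cards,
        key=lambda name: (len(weak_spots(cards[name])), -total_score(cards[name])),
    )


for name in rank(scorecards):
    card = scorecards[name]
    print(name, "weak spots:", len(weak_spots(card)), "total:", total_score(card))
```

In this made-up data, Vendor A has the best answer-quality score but three weak spots in Gate 2, so Vendor B ranks first.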

Gate 1: will it actually work?

This is the gate the person running the trial owns. Here's the scorecard.

Criterion	What good looks like	The question to put to the vendor
Knowledge ingestion	Direct connectors to your help center, website, docs and wikis; sensible limits; auto re-sync when content changes	"Can you ingest my actual sources through a direct connector, not just a crawl?"
Answer quality on your tickets	A free test on 50 to 250 of your own real tickets before you commit	"Will you run 50 to 250 of my real tickets, free, first?"
Escalation and handover	A clean handover to a human with the conversation summarized, plus an easy path for the customer to ask for a person	"When the AI can't help, exactly how does the customer reach a person?"
The improvement loop	Visibility into misses and the tools to fix them, in both knowledge and actions (see below)	"Show me how I close one gap I find in week one, and an action the AI can take beyond answering."
Channel and pattern fit	Direct reply, internal notes, and copilot modes across your channels, matched to how your team works	"Can I run this in notes mode alongside my current setup before going live?"
Setup and support	Live in days, real onboarding help, support hours that cover yours	"Who helps me set this up, and how fast are most customers live and replying directly?"

Most of these are self-explanatory once you see them written down, so I'll pause on the two that aren't.

Escalation and handover should be the first thing you check, because no AI on the market resolves everything yet. You want a clean handover to a human with the conversation summarized, and an obvious way for a customer to reach a person whenever they want one.

There's a sharp edge here too. A resolution rate is only as honest as the escalation path behind it, and a tool that makes it hard to reach a human can flatter its own numbers by simply not handing off. We count a conversation as resolved only when it wasn't escalated, and we keep escalation deliberately easy precisely so that number stays honest.
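A small sketch shows why the escalation path changes the number. It assumes a hypothetical conversation log with two fields, `escalated` and `asked_for_human`; neither the field names nor the sample data come from any real product.

```python
def resolution_rate(conversations):
    """Share of conversations that were not escalated to a human."""
    if not conversations:
        return 0.0
    resolved = sum(1 for c in conversations if not c["escalated"])
    return resolved / len(conversations)


def blocked_handoffs(conversations):
    """Conversations where the customer asked for a person but was never handed off."""
    return [c for c in conversations if c["asked_for_human"] and not c["escalated"]]


# Hypothetical logs: 10 conversations each, 4 customers ask for a human.
easy_escalation = (
    [{"asked_for_human": False, "escalated": False}] * 6
    + [{"asked_for_human": True, "escalated": True}] * 4
)
hard_escalation = (
    [{"asked_for_human": False, "escalated": False}] * 6
    + [{"asked_for_human": True, "escalated": True}] * 1
    + [{"asked_for_human": True, "escalated": False}] * 3  # request ignored
)

print(resolution_rate(easy_escalation))         # 0.6
print(resolution_rate(hard_escalation))         # 0.9
print(len(blocked_handoffs(hard_escalation)))   # 3
```

Both tools faced the same ten conversations. The second one reports 90% instead of 60% only because three customers who asked for a person never got one, which is why it is worth asking a vendor for the blocked-handoff count alongside the headline rate.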

The improvement loop is the one I'd tell you to weight above everything else, the demo included. Day one is the worst your AI will ever be. Where it lands by month six comes down to how far and how fast you can close the gap afterwards, and the gap has two halves.

The first half is knowledge. Can you see the questions the AI couldn't answer, the topics driving handovers, and the conversations customers rated poorly? And can you act on them fast, by adding a custom answer, fixing a help-center article, or letting the tool draft new knowledge from how your agents actually replied?

(Ours does that last one through Self-Learning and surfaces the misses through Insights, but the capability matters far more than our names for it.)

The second half is the one most checklists miss entirely: action tools. Past a point, answering questions better stops moving your resolution rate at all.

What closes the remaining gap toward full automation is the AI being able to act: look up an order, process a refund, change an address, read a customer's live account data. A tool that can only answer from documents will cap out. Give it the ability to take actions through tasks and APIs (we built ours as Tasks plus a User Data API, but the principle holds whoever you buy from), and it keeps climbing.

When you score the improvement loop, score both halves.

Gate 2: will the bill hold?

This is the gate the check-signer owns, and it's where deals most often unravel after the champion has fallen in love. I've watched more than one die right here. Three criteria.

Criterion	What good looks like	The question to put to the vendor
The meter	A unit you can forecast from numbers you already track, like tickets	"Can I forecast next month's bill from numbers I already have?"
What counts as billable	A clear, narrow definition of the billable event, with no ambiguous "resolutions"	"What exactly do you count as the billable event, and what doesn't count?"
Total cost and the improvement tax	A bill that stays flat or falls per resolved ticket as the AI improves	"What's my bill at double my current resolution rate?"

The meter matters more than the headline rate, because some meters you can forecast and some you can't. You already know roughly how many tickets you get each month, so a per-ticket meter rides a number you can predict. A per-resolution meter rides your resolution rate, which moves around.

That second point is the trap I most want you to flag. Under per-resolution pricing, your bill goes up as your AI gets better, because you pay for every extra resolution.

And here's the part vendors don't volunteer: most of what lifts your resolution rate is your own work. You write the knowledge, connect the tools, tune the guidance, so you end up paying more for improvements your own team created. A flat per-ticket meter stays put, and your cost per resolved ticket falls as the agent gets better (we're firmly in the per-ticket camp, for exactly this reason).
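To put numbers on the improvement tax, here is a rough sketch comparing the two meters as the resolution rate doubles. The prices are round illustrative figures (a hypothetical $1.00 per resolution and a hypothetical $1,300 flat monthly plan at a fixed ticket volume), not any vendor's actual rates.

```python
TICKETS = 10_000              # monthly ticket volume (illustrative)
PRICE_PER_RESOLUTION = 1.00   # hypothetical per-resolution price, USD
FLAT_MONTHLY = 1_300.00       # hypothetical plan priced on ticket volume, USD


def per_resolution_bill(tickets, resolution_rate, price=PRICE_PER_RESOLUTION):
    """Monthly bill when every resolved ticket is a billable event."""
    return tickets * resolution_rate * price


def cost_per_resolved(bill, tickets, resolution_rate):
    """Effective cost of each resolved ticket for a given monthly bill."""
    return bill / (tickets * resolution_rate)


for rate in (0.30, 0.60):  # resolution rate before and after doubling
    metered = per_resolution_bill(TICKETS, rate)
    flat_unit = cost_per_resolved(FLAT_MONTHLY, TICKETS, rate)
    print(
        f"resolution rate {rate:.0%}: "
        f"per-resolution bill ${metered:,.0f}, "
        f"flat bill ${FLAT_MONTHLY:,.0f} (${flat_unit:.2f} per resolved ticket)"
    )
```

Under these assumptions the per-resolution bill doubles from $3,000 to $6,000, while the flat plan stays at $1,300 and its cost per resolved ticket falls from about $0.43 to about $0.22. Swap in your own volume and the vendor's quoted prices to answer the "double my resolution rate" question before the call.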

As a reference point, at 10,000 tickets a month and a 75% resolution rate, our Scale plan runs about $1,299 a month. The big helpdesk-native tools land far higher on their per-resolution maths: roughly $7,425 for Intercom Fin, $6,750 for Gorgias Automate, and $11,250 for Zendesk AI. The gap comes down to the meter.

Monthly cost comparison at 10,000 tickets and a 75% resolution rate: My AskAI Scale $1,299, Gorgias Automate $6,750, Intercom Fin $7,425, Zendesk AI $11,250.
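The comparison above is plain arithmetic: 10,000 tickets at a 75% resolution rate is 7,500 resolutions. The sketch below backs out the cost per resolved ticket implied by each quoted monthly figure. Only the $0.99 Intercom Fin rate is stated elsewhere in this article; the other per-resolution values are derived here from the quoted totals, not taken from vendor price lists, so treat them as assumptions to confirm with each vendor.

```python
TICKETS = 10_000
RESOLUTION_RATE = 0.75
resolutions = round(TICKETS * RESOLUTION_RATE)  # 7,500 resolved tickets

# Monthly figures as quoted in the comparison above (USD).
quoted_monthly = {
    "My AskAI Scale": 1_299,
    "Gorgias Automate": 6_750,
    "Intercom Fin": 7_425,
    "Zendesk AI": 11_250,
}

# Implied cost per resolved ticket, derived from the totals (not a price list).
implied_per_resolved = {
    vendor: monthly / resolutions for vendor, monthly in quoted_monthly.items()
}

for vendor, monthly in quoted_monthly.items():
    print(f"{vendor}: ${monthly:,}/month, "
          f"${implied_per_resolved[vendor]:.2f} per resolved ticket")
```

Running the same calculation at your own volume and resolution rate is a quick way to sanity-check any quote you are given.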

Gate 3: will it clear review?

This is the gate that quietly kills deals everyone else has already approved, and I've watched it happen more than once. Bring it in early, by week two, and remember the reviewer wants documentation. Three criteria.

Criterion	What good looks like	The question to put to the vendor
Compliance	SOC 2 Type II and GDPR as a baseline, plus whatever your industry actually requires	"Send me your trust portal and the certifications relevant to my requirements."
Data handling	A clear no on using your data to train models, a clean sub-processor list, sensible retention	"Is my data ever used to train a model, and who are your sub-processors?"
Residency and access	Known data residency, SSO, role-based access, and an audit log for mid-market and up	"Where is data stored, and do you support SSO and audit logging?"

A word on compliance, because the vendor-written guides love to inflate it (we field this one constantly). SOC 2 Type II is the genuine baseline in 2026, plus GDPR or UK GDPR if you handle European or UK personal data.

Beyond that, work out what you actually need before you ask. Regulated and larger enterprises may require ISO 27001, ISO 42001, PCI-DSS, or HIPAA, and plenty of good vendors hold none of those, so name yours up front.

We're SOC 2 Type II and GDPR compliant, and I'll tell you straight that we don't hold ISO 27001, PCI-DSS, or HIPAA. A vendor being clear about what it isn't certified for is itself a trust signal.

On data handling, you want a clear no on training and a clean sub-processor list. (For what it's worth, our customer data is never used for model training or anything beyond serving that customer's own tickets, with isolated containers, AES-256 at rest and TLS in transit.) Ask the question anyway, of every vendor.

What does good vendor selection look like in practice?

⚡

TL;DR: The customers who chose well ran a real-ticket bake-off first, testing several vendors on a couple hundred of their own questions. Zeffy tested eight and only one passed, and the other teams here ran the same play before signing.

The pattern across the customers who chose well is boring and consistent: they ran a real-ticket bake-off before they signed. They tested several vendors on a couple of hundred of their own questions and let the scores on the doors decide.

These happen to be our customers, so take the logos with the appropriate pinch of salt. The takeaway is the method they used to choose, which is the boring-but-effective option almost nobody runs properly.

Three customer proof points from running a real-ticket bake-off: Zeffy tested eight vendors and one passed, Customer.io saved 55 hours in week one, Freecash resolves 82% of 70,000 monthly queries.

Zeffy, the free fundraising platform for non-profits, stress-tested eight AI vendors with more than 200 varied questions. Only one cleared their bar. They now deflect 84% of support tickets, and their seven-person CX team still splits its time evenly between support and strategic work despite growing fast.

"My AskAI was the only solution that allowed us to integrate AI seamlessly into our existing systems. It's been a win-win for us." Ella Roy, Customer Success Manager at Zeffy.

I point buyers at this next one a lot. Customer.io ran the same exercise against a complex legacy Zendesk stack: eight vendors, 200-plus questions. The result was 68% AI deflection and 55 hours of human time saved in the very first week of full deployment.

"My AskAI blew everybody else out of the water, making the selection process easy for us." The Customer.io team.

Freecash, one of the highest-volume deployments we run, took Gate 2 seriously. They tested Intercom Fin first and rejected it at $0.99 per resolution as uneconomical at their volumes, then ran a competitive selection over several months before choosing on answer accuracy and backend data integration. Their agent now resolves 82% of more than 70,000 monthly queries.

Edel Optics is my favorite proof of the improvement-loop criterion. They started after underwhelming results from their helpdesk's native AI, ran My AskAI in internal-notes mode to watch it, then went direct.

The decisive move was connecting their order data through the User Data API, which lifted resolution from roughly 25% to around 79% almost overnight. (I bring this one up on half my demo calls.) The gap closed because the AI could finally look things up and act, exactly the second half of Gate 1.

None of these teams picked on the demo. They picked on the bake-off.

What should you do this week?

⚡

TL;DR: Pull 50 to 250 of your real tickets, shortlist two to five vendors on the Gate 1 minimums, and run a free bake-off before you book a single full demo. It out-predicts everything else and costs nothing.

If you do just one thing I suggest here, build the test set and run the bake-off before you sit through a single full sales demo. It out-predicts everything else, and it costs nothing.

Pull 50 to 250 real, unedited tickets into a spreadsheet. Don't tidy them up. Include a spread: simple questions, complex ones, greetings, vague one-liners, technical questions, questions with a single buried answer, questions you wouldn't want answered, questions about a competitor, and multi-part questions. This is your scoring set (the messier, the better). About two hours.

Shortlist two to five vendors that clear your Gate 1 minimums on paper: knowledge fit, a real escalation path, and a genuine improvement loop in both knowledge and tools. About half a day.

Run the free bake-off. Give each tool the same knowledge and the same questions, then score the answers on accuracy, any outright wrong answers, conciseness, source quality, tone, whether it asks for clarification when a question is vague, and whether it stays on topic. Any good vendor will let you test free, and some (us included) will run it for you. About a sprint.

Fill the three-gate scorecard, and hand Gate 2 and Gate 3 to your buyer and security reviewer early. Any deal where only one person has been involved by week two is at risk. About an hour.

Grab the printable version. Our AI support vendor checklist packages these criteria into something you can take into a procurement meeting.
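If the bake-off scores live in a spreadsheet, a short script can roll them up per vendor. The sketch below is one possible shape, assuming you score each answer from 1 to 5 on the criteria listed in the bake-off step and flag outright wrong answers separately. The data layout, the `answer` helper, and the sample scores are all hypothetical.

```python
from statistics import mean

# Criteria scored 1-5 per answer, taken from the bake-off step above.
CRITERIA = [
    "accuracy", "conciseness", "source_quality",
    "tone", "clarifies_when_vague", "stays_on_topic",
]


def answer(vendor, ticket, wrong, *scores):
    """Build one scored answer; scores follow the order of CRITERIA."""
    return {"vendor": vendor, "ticket": ticket, "wrong": wrong,
            "scores": dict(zip(CRITERIA, scores))}


# Hypothetical results: the same two tickets put to two vendors.
results = [
    answer("Vendor A", 1, False, 5, 4, 4, 4, 3, 5),
    answer("Vendor A", 2, True,  1, 4, 2, 4, 1, 3),
    answer("Vendor B", 1, False, 4, 3, 4, 4, 4, 5),
    answer("Vendor B", 2, False, 4, 4, 3, 4, 5, 4),
]


def summarize(rows):
    """Per vendor: answer count, outright wrong answers, mean score per criterion."""
    by_vendor = {}
    for row in rows:
        by_vendor.setdefault(row["vendor"], []).append(row)
    return {
        vendor: {
            "answers": len(vendor_rows),
            "wrong_answers": sum(r["wrong"] for r in vendor_rows),
            "mean_by_criterion": {
                c: mean(r["scores"][c] for r in vendor_rows) for c in CRITERIA
            },
        }
        for vendor, vendor_rows in by_vendor.items()
    }


for vendor, stats in summarize(results).items():
    print(vendor, "wrong answers:", stats["wrong_answers"],
          "mean accuracy:", stats["mean_by_criterion"]["accuracy"])
```

Keeping wrong answers as their own count, rather than folding them into an average, stops one confident mistake from hiding behind a run of good scores.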

How do I get AI to research these vendors for me?

The desk research is the part you can hand off (I run something like this before most vendor calls). Most of the three gates are questions you answer from a vendor's own pricing, docs, and trust pages, so ChatGPT, Gemini or Claude can give you a first-pass comparison in minutes.

Paste your shortlist into the prompt below, add a line about your setup, and let it fill the scorecard.

You are helping me choose an AI customer service vendor. I'll give you a shortlist; research each one and build me a comparison report.

VENDORS: [paste your shortlist, e.g. My AskAI, Intercom Fin, Zendesk AI, Gorgias Automate]

MY SETUP: [your helpdesk, monthly ticket volume, industry, and any compliance you need, e.g. "Zendesk, ~8,000 tickets/month, ecommerce, need SOC 2 + GDPR"]

For each vendor, use their own pricing page, docs, and security/trust pages as primary sources. Answer every question below. Where you cannot verify something, write "unverified, ask the vendor" instead of guessing.

GATE 1 - WILL IT WORK?
1. Does it connect to my sources through a direct integration, or only by crawling a website?
2. Will they run a free test on my own real tickets before I commit?
3. When the AI can't answer, how does the customer reach a human?
4. Can I see what it failed to answer and fix it, and can it take actions (refunds, order lookups, account changes) through tools or APIs, not just reply with text?
5. Which modes does it support: direct reply, internal-note drafts, copilot, across my channels?
6. How long does setup take, and how much help do they give?

GATE 2 - WILL THE BILL HOLD?
7. What is the pricing meter: per ticket, per resolution, per conversation, per seat, or credits?
8. What exactly counts as a billable event, and what does not?
9. What is my estimated monthly cost at my ticket volume, and what happens to the bill if my resolution rate doubles?

GATE 3 - WILL IT CLEAR REVIEW?
10. Which certifications do they hold (SOC 2 Type II, GDPR, and anything I listed above)?
11. Is my data used to train models? Who are the sub-processors? What is the retention policy?
12. Where is data stored, and do they support SSO and audit logging?

THEN GIVE ME:
- A comparison table: one row per vendor, one column per question, with "unverified" where you could not confirm.
- A 1-to-5 score for each gate, per vendor, with a one-line reason.
- The two or three strongest fits for my setup, and the single most important question to put to each on a call.
- A reminder that this is desk research only: it cannot judge how good the answers actually are, so I still need to run my own tickets through each tool.

Treat what comes back as a first draft. It will get pricing and compliance roughly right, but it cannot tell you whether the answers are any good on your tickets, which is still the bake-off's job.

When is this checklist overkill?

⚡

TL;DR: Skip the full scorecard if you're a sub-50-person team (a 30-minute bake-off beats an RFP), if your incumbent helpdesk AI might already be good enough, or if a 40-criterion RFP would stall the decision for a quarter.

A full three-gate scorecard is the right tool for a mid-market or enterprise selection. I won't pretend it always fits, though. It's the wrong tool in three cases, and ignoring that would make this just another vendor checklist.

If you're a founder or a small team under about 50 people, the three gates collapse into one person: you (I've been that one person). A 30-minute bake-off on 30 tickets and a clear look at the pricing page will out-decide a formal RFP, and speed to value matters more than procurement rigor here. Don't build a process you then have to staff.

If you're deep inside a single helpdesk at modest volume, the native AI might simply be good enough. Its integration depth can outweigh a third-party tool's quality edge for a while, so put it in the bake-off too and let it compete rather than ruling it in or out on reputation.

And I'd watch hard for over-engineering the decision. A 40-criterion RFP can stall a choice for a quarter while tickets pile up and your team burns out. The scorecard is a filter, so pick the eight criteria that matter most to you, score those, and move.

The takeaway

⚡

TL;DR: Score vendors on the three gates and let a real-ticket bake-off break the tie. Weight the improvement loop, how fast the agent gets better after launch, above the demo.

Score vendors on the three gates (will it work, will the bill hold, will it clear review) and let a real-ticket bake-off break the tie. The demo measures the vendor's best case. Your tickets measure your worst, which is exactly the case you're buying for.

And I'll say it one more time: weight the improvement loop above the rest. The answer quality you see in a demo is a snapshot; what you're really buying is the slope, how fast the agent improves as you close gaps in knowledge and add the tools that let it act.

The teams that get this right are buying the slope. They pick the tool they can make smartest over the next year, whatever its demo happened to look like.

If you want the longer version of the thinking behind each gate, the AI Customer Service Complete Guide 2026 walks through the whole journey, and the case studies above show what a bake-off looks like in the wild.

FAQs

How do you evaluate AI customer service tools, and what should you look for?

Score each vendor across three gates: will it work (knowledge ingestion, answer quality on your own tickets, escalation, the improvement loop, channel fit, setup), will the bill hold (the pricing meter, what counts as billable, total cost at your volume), and will it clear review (compliance, data handling, access). The single best test is a free bake-off on 50 to 250 of your real tickets. And weight the improvement loop, how fast the AI gets better after launch, above the demo.

How much does AI customer support software cost?

It depends on the meter. Per-resolution tools typically run $0.90 to $1.50 per resolution, while per-ticket pricing comes in far lower (we work out around $0.10 per ticket), and across the market you'll see anything from roughly $0.05 to $1.99 per interaction. Forecastability matters more than the headline rate: a meter tied to a unit you already track, like tickets, is far more predictable than one tied to a resolution rate that moves.

What resolution rate should I expect from AI customer support?

For repetitive, document-answerable tickets, 60 to 80% is a reasonable target, and across our customer base agents resolve 72%+ on a rolling 30-day basis. The real answer is that it depends on your knowledge and your tooling as much as the vendor you pick. Be careful comparing resolution rates between tools, too, because a "resolution" usually just means the customer wasn't escalated to a human, so a tool that makes it hard to reach a person can look better than it is.

Can I test AI customer support tools for free before committing?

Yes, and you should. Any good vendor will let you run a set of your real questions through the tool free of charge, and some will do it for you. The best vendors also offer a proper trial (ours is 30 days, no credit card) so you can see a full month of real volume and cost before you pay anything.

What's the difference between an AI chatbot and an AI agent for customer support?

A legacy chatbot follows decision trees and scripted flows you build by hand. A modern AI agent answers from your knowledge in natural language and, the important part, can take actions: looking up an order, processing a refund, updating an account. That action piece is what separates a tool that deflects FAQs from one that resolves tickets end to end, so confirm which one you're actually evaluating.

Can I set up AI customer support without any coding?

For a standard setup, yes. Connecting your knowledge (a help center, a website, a Shopify store) needs no developer and is usually live within minutes to hours. Where you might need engineering is custom integrations, connecting your own backend APIs so the AI can read live account data or take actions, and that's gated by your dev team's time rather than the product.

How do I add AI to my support without replacing my helpdesk?

Look for a tool that integrates with your existing helpdesk rather than replacing it. Several AI agents (ours among them) run inside Zendesk, Intercom, Freshdesk, Gorgias, or HubSpot, so you keep your stack, your macros, your tags, and your routing. You're adding an AI layer on top and keeping the platform you have.

Is AI customer support software GDPR and SOC 2 compliant?

The good ones are, but check rather than assume. SOC 2 Type II and GDPR are the baseline most mid-market teams need, and you should ask for the trust portal plus a clear statement that your data isn't used to train models. If you're in a regulated industry, work out whether you also need ISO 27001, PCI-DSS, or HIPAA before you shortlist, because plenty of vendors hold none of them.
