The shortlist looks identical. The architecture is not.

Every AI pen test vendor on your shortlist will tell you their false positive rate is under five percent. Their demos will look impressive. Their decks will name the same frontier models.

This is the problem. Frontier model access is commoditizing. Any team can wire an Anthropic, OpenAI, or Google API to a target environment and call the result an AI pen testing platform. The demos work because they were built to demo. The trouble starts in production, where the system has to validate exploits against live systems, chain findings into multi-stage attacks, survive a CIO-led security review, and not break what it’s testing.

The ten questions below are designed to surface what’s actually under the hood, in 30 minutes or less. Send them to vendors in writing before the demo. The written answer reveals the architecture. The demo reveals the polish. The two are not the same thing.

This is the checklist we wish more buyers used when they evaluate us. The good news is, if you ask these questions, the shortlist narrows itself.

1. What is your false positive rate, and how was it measured?

Start here. Do not end here.

False positive rate is the single highest-signal metric in AI pen testing because it tells you whether the platform is reporting hypotheses or validated findings. Without an exploit validation layer, LLM-generated security output sits between 40 and 70 percent false positives. That is the same range as traditional DAST. It is not a model quality problem. It is an architecture problem. The model can reason about a vulnerability all day. Until something actually fires the payload against the target and checks the response, every finding is a guess.

Production-grade Agentic AI platforms run under 2 percent. The methodology behind that number should be specific: how the validation pipeline works, what evidence is captured per finding, and how the rate was sampled. If a vendor compares themselves to “industry average” or will not share methodology, you have your answer.
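One way to pressure-test "how the rate was sampled" is to put a confidence interval around the number the vendor gives you. The sketch below is illustrative only: the function name and the sample figures are hypothetical, and it assumes findings were sampled at random and manually adjudicated. It uses the standard Wilson score interval for a proportion.

```python
import math


def wilson_interval(false_positives: int, sampled: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a false positive rate.

    z = 1.96 corresponds to roughly 95 percent confidence.
    """
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    if not 0 <= false_positives <= sampled:
        raise ValueError("false_positives must be between 0 and sampled")
    p = false_positives / sampled
    denom = 1 + z**2 / sampled
    centre = (p + z**2 / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z**2 / (4 * sampled**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical figures: 3 false positives among 200 adjudicated findings.
low, high = wilson_interval(3, 200)
print(f"observed 1.5%, plausible range {low:.1%} to {high:.1%}")
```

With these made-up figures the observed rate is 1.5 percent, but the upper bound of the interval lands above 4 percent. A small sample cannot support an "under 2 percent" claim on its own, which is why the sample size belongs in the written answer.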

2. How do you validate that a finding is actually exploitable?

This is the follow-up that catches most wrappers off guard. “Our model evaluates the response and decides” is not a validation pipeline. “We use the latest reasoning model” is an answer about the engine, not the brakes.

A real validation layer does four things in sequence. It executes the proposed exploit against the live target. It captures the request and response. It runs deterministic assertions on the response (status codes, payload reflection, oracle conditions). And only then does it promote the finding to a report.
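The four steps can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: every name here (Exchange, Finding, validate, the send callable) is hypothetical, and the transport is injected so the example runs without touching a network.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Exchange:
    """Captured evidence for one attempt: what was sent and what came back."""
    request: str
    status: int
    body: str


@dataclass(frozen=True)
class Finding:
    title: str
    evidence: Exchange


Assertion = Callable[[Exchange], bool]
Sender = Callable[[str], "tuple[int, str]"]


def validate(title: str, request: str, send: Sender,
             assertions: Sequence[Assertion]) -> Optional[Finding]:
    status, body = send(request)                 # 1. execute against the target
    exchange = Exchange(request, status, body)   # 2. capture request and response
    # 3. deterministic assertions; no assertions means nothing was proven
    if assertions and all(check(exchange) for check in assertions):
        return Finding(title, exchange)          # 4. only now promote to a report
    return None


# Usage with a stand-in target that reflects a unique marker.
def fake_send(request: str) -> tuple[int, str]:
    return (200, "echo MARKER123") if "MARKER123" in request else (200, "echo")


checks = [lambda e: e.status == 200, lambda e: "MARKER123" in e.body]
print(validate("reflected input", "GET /?q=MARKER123", fake_send, checks))
```

The design point is that the model proposes the request, but code decides whether the finding exists. An empty assertion list promotes nothing.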

Every finding should ship with reproduction steps, the captured exploit trace, and a working proof-of-concept in code that a developer can run independently in Python, curl, or whatever the team already uses. The PoC removes your AppSec team from the validation loop. If your engineers still have to confirm each finding manually, the platform shifted the burden. It did not reduce it.

3. Show me your public benchmark performance

Three benchmarks are the floor for any vendor claiming autonomous web application pen testing capability: XBEN, Acuart (Vulnweb), and DVWA. The current bar is 100 percent.

The FireCompass agent scores 104 of 104 on XBEN across Easy, Medium, and Hard. Twelve of twelve PoC-validated findings on Acuart. Full coverage across all three DVWA difficulty levels. Fully autonomous. No manual steering. No human hints.

XBEN was built by a direct competitor. We score 100 percent on it. The benchmark was not designed in our favor.

If a vendor cannot point to specific numbers on these suites, ask why. Then ask the harder follow-up: what evaluation question did you move to once public benchmarks saturated? A vendor still in benchmark-chasing mode has not built an evaluation that tells them anything new. Public benchmarks plateau fast. They stop telling you anything useful about complex, stateful, real-world applications about six months in.

4. How do your agents compare against top human researchers?

This is the question that separates research demos from production platforms. Public benchmarks tell you the system can do real work in a controlled environment. They do not tell you whether it can outperform a skilled human researcher on a complex, messy, stateful application. That is a different test, and it is the one that matters.

The methodology to look for: researcher-versus-agent on blind targets, scored on unique critical findings, false positive rate, and time-to-finding. FireCompass agents now beat top in-house researchers in 60 to 70 percent of these head-to-head evaluations, while staying under 2 percent false positives. When researchers find something the agent missed, we re-train within 24 hours. The feedback loop compounds.

A vendor whose only claim is “100 percent on benchmarks” has not yet built an evaluation harder than benchmarks. That is fine for a research demo. It is not enough for production deployment.

5. Can your agents chain findings into a multi-stage attack path?

This is the question every other AI pentest checklist misses. It is also the cleanest way to tell a scanner with AI marketing from a real agent.

Scanners report findings. Real attackers chain them. The difference between “medium severity information disclosure” and “full database compromise” is often four steps of chaining that no scanner reports.

A recent FireCompass engagement: the agent ran surface mapping on a customer web application. It discovered an exposed .git directory. It reconstructed the source repository. It found database credentials in a config file inside the repo. The database port was blocked externally, so the agent hypothesized credential reuse and tried the same credentials against SSH. It got root. It pivoted from the server to the internal network and connected to the database directly. Sensitive data was exfiltrated.

What a scanner would have reported: one medium severity finding, .git exposure.

What actually happened: full database compromise across roughly 30 steps, switching protocols from HTTP to SSH to the database, testing a hypothesis (credential reuse) that no signature captures.

Ask the vendor to walk you through their longest validated chain in a customer environment. If they cannot, the platform reports findings, not attack paths. The architectural question underneath is whether they have an attack state machine that maintains coherent state across long chains, or whether each finding is generated in isolation. The first is a platform. The second is a faster scanner.
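To make "coherent state across long chains" concrete, here is a toy sketch. It is an assumption-laden illustration, not a description of any real platform: the Step and AttackState names and the fact labels are hypothetical, and each step is reduced to the facts it requires and the facts it yields. The chain mirrors the engagement described above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterable


@dataclass(frozen=True)
class Step:
    name: str
    requires: frozenset[str]   # facts that must already be known
    yields: frozenset[str]     # facts learned if the step succeeds


@dataclass
class AttackState:
    facts: set[str] = field(default_factory=set)
    path: list[str] = field(default_factory=list)

    def apply(self, step: Step) -> bool:
        if not step.requires <= self.facts:
            return False
        self.facts |= step.yields
        self.path.append(step.name)
        return True


def run_chain(steps: Iterable[Step], initial: Iterable[str]) -> AttackState:
    """Keep applying any step whose preconditions are met until nothing moves."""
    state = AttackState(facts=set(initial))
    pending = list(steps)
    progressed = True
    while pending and progressed:
        progressed = False
        for step in list(pending):
            if state.apply(step):
                pending.remove(step)
                progressed = True
    return state


CHAIN = [
    Step("find exposed .git", frozenset({"target_url"}), frozenset({"git_exposed"})),
    Step("rebuild repository", frozenset({"git_exposed"}), frozenset({"source_code"})),
    Step("extract db credentials", frozenset({"source_code"}), frozenset({"db_credentials"})),
    Step("reuse credentials on SSH", frozenset({"db_credentials"}), frozenset({"server_root"})),
    Step("pivot to internal network", frozenset({"server_root"}), frozenset({"internal_access"})),
    Step("connect to database", frozenset({"internal_access", "db_credentials"}),
         frozenset({"database_access"})),
]

print(run_chain(CHAIN, {"target_url"}).path)
```

A finding generated in isolation only ever sees the first step. The state object is what lets a credential found in step three become the input to step four.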

6. How do you discover the actual attack surface before testing?

The scope gap is the most consistently underestimated risk in offensive security. Roughly 20 percent of breaches start through a peripheral asset the security team did not know existed. Yet most vendors test what you tell them to test, full stop.

“You give us the URL, and we test it” is the wrapper answer. “You give us a domain, and we discover everything that belongs to it, then prioritize what to test” is the platform answer.

A real discovery layer maps the full external attack surface: subdomains, exposed APIs, forgotten staging environments, cloud assets, third-party assets that resolve back to your infrastructure. It tracks change over time and prioritizes targets by exposure and exploitability. One Fortune 500 customer went from testing roughly 200 apps per year to over 2,000 continuously. The 10x coverage gain came from finding the assets, not from testing them faster.

Ask vendors what fraction of their critical findings in a typical engagement come from assets the customer did not originally include in scope. A vendor doing real discovery will name a number. A vendor without discovery will deflect.
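The number you are asking for is simple to compute once findings are tagged by asset. A minimal sketch, assuming a hypothetical record shape with "asset" and "severity" keys and made-up hostnames:

```python
def out_of_scope_critical_fraction(findings: list[dict], declared_scope: set[str]) -> float:
    """Share of critical findings that sit on assets the customer never listed."""
    critical = [f for f in findings if f["severity"] == "critical"]
    if not critical:
        return 0.0
    outside = [f for f in critical if f["asset"] not in declared_scope]
    return len(outside) / len(critical)


# Hypothetical engagement data.
scope = {"app.example.com"}
findings = [
    {"asset": "app.example.com", "severity": "critical"},
    {"asset": "staging.example.com", "severity": "critical"},
    {"asset": "old-api.example.com", "severity": "critical"},
    {"asset": "app.example.com", "severity": "medium"},
]
print(out_of_scope_critical_fraction(findings, scope))
```

A vendor that cannot produce this figure either does not track which assets were discovered rather than declared, or does not discover any.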

7. How does the platform handle authenticated workflows and business logic?

Most exploits do not live on the unauthenticated surface. They live behind login walls, inside multi-step workflows, and in the business logic of the application. If a vendor cannot speak credibly to credential and session management, multi-step auth flows like OAuth or SAML, or business logic testing beyond OWASP Top 10 injections, you are looking at an unauthenticated scanner with better marketing.

A real business logic finding from a FireCompass engagement on an e-commerce application: the agent identified the multi-step checkout workflow, then tested whether it could skip the payment step entirely by POST-ing directly to the order confirmation endpoint. Order confirmed and fulfilled, no payment processed. Then the agent tested negative payment amounts. Refund credited to the customer account.

There is no CVE for this. There is no DAST signature. It falls under OWASP A04 Insecure Design. A scanner cannot find it. An agent that understands the workflow can.
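The check itself is a workflow question, not a payload question. The sketch below uses a deliberately flawed in-memory stand-in for a checkout flow; ToyCheckout and both probe functions are hypothetical and exist only to show what the oracle looks like.

```python
from dataclasses import dataclass


@dataclass
class ToyCheckout:
    """In-memory stand-in for an order workflow. Flawed unless hardened=True."""
    hardened: bool = False
    paid: bool = False

    def pay(self, amount: float) -> bool:
        if self.hardened and amount <= 0:
            return False          # hardened flow rejects non-positive amounts
        self.paid = True
        return True

    def confirm(self) -> bool:
        if self.hardened and not self.paid:
            return False          # hardened flow enforces step order
        return True


def skips_payment(make_checkout) -> bool:
    """True when an order confirms without any payment step."""
    return make_checkout().confirm()


def accepts_negative_amount(make_checkout) -> bool:
    """True when the payment step accepts a negative amount."""
    return make_checkout().pay(-10.0)


print(skips_payment(ToyCheckout), accepts_negative_amount(ToyCheckout))
```

Neither probe contains anything a signature could match. What makes them findings is knowing that confirmation is supposed to follow payment.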

8. Can the platform test continuously, aligned to release cycles?

The cadence gap between attacker speed and defender pentest schedule is structural. Critical CVEs get weaponized in roughly three days. Most enterprises pentest once a year. That is 362 days of unmonitored exposure, every year.

“We can run pentests in days instead of weeks” is still point-in-time. It is a faster clock, not a continuous clock.

What continuous actually requires: integration with CI/CD, a defined cadence (daily, weekly, per-release), automatic re-test when the attack surface changes, findings delivered into existing developer workflows (Jira, ServiceNow, Slack), and a re-test workflow that confirms remediation.
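"Automatic re-test when the attack surface changes" reduces to comparing snapshots. A minimal sketch, assuming each asset is summarized by some fingerprint string (how that fingerprint is computed is left open, and the hostnames are made up):

```python
def assets_to_retest(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Assets that are new, or whose fingerprint changed, since the last snapshot."""
    return sorted(asset for asset, fingerprint in current.items()
                  if previous.get(asset) != fingerprint)


# Hypothetical snapshots keyed by hostname.
yesterday = {"app.example.com": "build-41", "api.example.com": "build-7"}
today = {"app.example.com": "build-42", "api.example.com": "build-7",
         "staging.example.com": "build-1"}
print(assets_to_retest(yesterday, today))
```

Anything returned here is queued for testing without waiting for the next scheduled run. An unchanged asset stays on its regular cadence.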

This is the operating model we call Continuous Offensive Security, or COST. Gartner has tracked the category across four consecutive Hype Cycles. It is the only model that actually closes the cadence gap. Anything less continuous is a faster version of the wrong model.

9. What safety controls, scope enforcement, and audit trail does the platform provide?

This is the question your CIO and your auditor will eventually ask, so ask it first. An agent that can find a vulnerability can also break production if scope and safety are not enforced at the architecture level.

“We have safety guardrails in the prompt” is not safety enforcement. “The model has been trained not to cause damage” is not safety enforcement. Either answer tells you the platform is one prompt injection away from operating outside its lane.

Real safety lives at the execution layer, not the prompt layer. What to look for:

Scope boundaries enforced at the agent level, with every action checked against an asset whitelist before dispatch
Rate limiting built into the runtime to prevent unintended DoS
A kill switch that halts every concurrent agent action within seconds
Safe payload enforcement that blocks modify, update, and delete operations by default
Credential scope guards that stop UAT credentials from being tested against production
An append-only audit log with cryptographic timestamps that survives compliance review under DORA, PCI DSS 4.0, SOC 2 Type II, and ISO 27001
Full chain-of-thought visibility so every agent decision is recorded and reviewable
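Several of the controls above can be sketched as one gate that every action passes through before dispatch. This is an illustration under stated assumptions, not a reference design: the ExecutionGuard and Action names are hypothetical, "safe payload" is approximated by allowing only read-style HTTP methods, and the clock is passed in so the behavior is deterministic.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from urllib.parse import urlsplit

# Assumption: read-only HTTP methods stand in for "no modify, update, delete".
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


@dataclass
class Action:
    method: str
    url: str
    credential_env: str | None = None   # e.g. "uat" or "prod"


@dataclass
class ExecutionGuard:
    allowed_hosts: frozenset[str]        # the in-scope asset list
    host_env: dict[str, str]             # which environment each host belongs to
    max_actions_per_window: int
    window_seconds: float = 1.0
    halted: bool = False
    _sent: list[float] = field(default_factory=list)

    def kill(self) -> None:
        """Kill switch: every later check fails until the guard is replaced."""
        self.halted = True

    def check(self, action: Action, now: float) -> tuple[bool, str]:
        if self.halted:
            return False, "kill switch engaged"
        host = urlsplit(action.url).hostname
        if host not in self.allowed_hosts:
            return False, "out of scope"
        if action.method.upper() not in SAFE_METHODS:
            return False, "unsafe method blocked by default"
        if action.credential_env is not None and action.credential_env != self.host_env.get(host):
            return False, "credential environment mismatch"
        # Sliding-window rate limit over recently allowed actions.
        self._sent = [t for t in self._sent if now - t < self.window_seconds]
        if len(self._sent) >= self.max_actions_per_window:
            return False, "rate limit"
        self._sent.append(now)
        return True, "ok"


guard = ExecutionGuard(frozenset({"app.example.com"}), {"app.example.com": "prod"},
                       max_actions_per_window=2)
print(guard.check(Action("GET", "https://other.example.net/"), now=0.0))
```

None of these checks depend on what the model was told in its prompt. A prompt injection can change what the agent wants to do, but the request still has to pass the gate.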

The mental model: the LLM is the engine. You cannot ship a car with just an engine. You need brakes, steering, a dashboard, and a seatbelt. The safety architecture is what makes the platform deployable in regulated environments. Without it, you have a research tool that should not see production traffic.

Ask the vendor to show you a sample audit log from a real engagement. The level of detail tells you whether they built for enterprise from day one or are retrofitting under sales pressure.
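When you read that sample log, one property to look for is tamper evidence. The sketch below shows a hash-chained log, which is a swapped-in technique for illustration: it makes edits and deletions detectable, but it is not the same thing as a trusted cryptographic timestamp, which would need an external timestamping authority. The field names and events are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64


def _digest(timestamp: str, event: str, prev: str) -> str:
    payload = json.dumps({"ts": timestamp, "event": event, "prev": prev}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, timestamp: str, event: str) -> None:
    """Append an entry whose hash covers its content and the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"ts": timestamp, "event": event, "prev": prev,
                "hash": _digest(timestamp, event, prev)})


def verify(log: list) -> bool:
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or _digest(entry["ts"], entry["event"], prev) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


log = []
append_entry(log, "2025-01-01T00:00:00Z", "scope check passed for app.example.com")
append_entry(log, "2025-01-01T00:00:01Z", "GET https://app.example.com/ dispatched")
print(verify(log))
```

Changing any recorded event after the fact invalidates every hash that follows it, which is the property an auditor is really asking about when they ask whether the log is append-only.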

10. What is the architectural moat versus the model you use?

This is the tiebreaker if two platforms have passed the first nine.

If a vendor’s only differentiator is “we use the best model” or “we built our own model,” they are selling the engine. Frontier models commoditize on a quarterly cycle. Anyone can wire one to a target. The two to four years of engineering around the model is the actual moat.

A production-grade AI pen testing platform requires roughly ten engineering components, and the previous nine questions map directly onto them: an execution runtime that delivers exploits against live systems and captures evidence; an exploit validation pipeline; an attack state machine for long chains; multi-agent orchestration so recon, vulnerability assessment, and chaining run in parallel; a discovery layer; credential and session management; continuous testing infrastructure; scope and safety enforcement at the agent layer; an evidence chain of custody; and enterprise RBAC for governance.

A vendor who can speak credibly to each of these is selling a platform. A vendor who cannot is selling an LLM with a UI.

Scoring the answers

Send the ten questions in writing before the demo. Score each answer on a 0 to 2 scale.

Score	Answer profile
0	Evasive answer, marketing copy, or “we use the best model”
1	Addresses the question but without specifics (round numbers, no methodology, no customer example)
2	Specific, measurable, evidence-backed. Names the architecture, the methodology, and the sample size.

Total possible: 20. A real platform should score 16 or higher. A wrapper will struggle to break 8.
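The tally is easy to automate if several reviewers are scoring in parallel. A small sketch: the thresholds of 16 and 8 come from the rubric above, while the function name and the "needs follow-up" label for the middle band are assumptions added for illustration.

```python
def score_vendor(scores: list[int]) -> tuple[int, str]:
    """Total ten 0-to-2 answers and map the total onto the rubric."""
    if len(scores) != 10:
        raise ValueError("expected one score per question (10 total)")
    if any(s not in (0, 1, 2) for s in scores):
        raise ValueError("each score must be 0, 1, or 2")
    total = sum(scores)
    if total >= 16:
        verdict = "platform"
    elif total <= 8:
        verdict = "likely wrapper"
    else:
        verdict = "needs follow-up"   # middle band is not defined by the rubric
    return total, verdict


# Hypothetical scorecard for one vendor.
print(score_vendor([2, 2, 1, 2, 2, 1, 2, 2, 2, 1]))
```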

If your evaluation comes down to two platforms both scoring above 16, the architectural moat questions (9 and 10) become the tiebreaker. If your evaluation includes vendors failing half the checklist, you are looking at wrappers, and the right move is to narrow the shortlist before the demo phase. Demos are designed to make wrappers look like platforms. The written checklist is designed to do the opposite.


What the next five years actually look like

In five years, every serious enterprise will run continuous AI pen testing. The vendors who survive will be the ones who built the platform, not the ones who shipped a model with a dashboard. The architecture is what compounds. The models commoditize on a quarterly cycle.

Send the questions. Score the answers. Trust the written record more than the demo.

The category has a name now. We call it Continuous Offensive Security: AI Penetration Testing across Web, API, and Infrastructure. Anything less continuous is a faster version of the wrong model.

“FireCompass’ approach to automating penetration testing of complex, multi-stage attacks is the next level of penetration testing. Agent AI is a promising way to solve this otherwise hard problem.”

Bruce Schneier, Advisor at FireCompass

FAQ

What is an AI pen test vendor?

An AI pen test vendor provides a platform that uses AI agents (typically built on large language models) to autonomously discover, validate, and report security vulnerabilities in web applications, APIs, and infrastructure. The strongest platforms run as Agentic AI systems, not single-shot LLM calls, and include validation, chaining, discovery, and safety layers in addition to the model itself.

How do I evaluate an AI pen testing platform versus an LLM wrapper?

Ask architecture-level questions, not feature questions. The ten questions in this checklist surface whether a vendor has built the supporting engineering (exploit validation, attack state machine, discovery layer, safety enforcement) or is reselling an LLM API with a UI. The written answers reveal the architecture. The demo reveals only the polish.

What is a good false positive rate for AI pen testing?

Production-grade Agentic AI platforms operate under 2 percent. Without an exploit validation layer, LLM-generated findings sit between 40 and 70 percent false positives, the same range as traditional DAST. The false positive rate, paired with the methodology used to measure it, is the single highest-signal metric in AI pen test evaluation.

Which benchmarks should AI pen test vendors meet?

Three public benchmarks are the floor: XBEN, Acuart (Vulnweb), and DVWA. The current bar is 100 percent. The FireCompass agent scores 104 of 104 on XBEN, 12 of 12 PoC-validated findings on Acuart, and full coverage across all three DVWA difficulty levels, fully autonomously.

Can AI agents outperform human pen testers?

On complex, stateful, real-world applications, the strongest Agentic AI platforms now beat top in-house researchers in 60 to 70 percent of blind head-to-head evaluations, while maintaining under 2 percent false positives. Where researchers still find things the agent misses, the gap is closed in a 24-hour retraining cycle.

What is Continuous Offensive Security (COST)?

Continuous Offensive Security (COST) is the operating model where AI agents run penetration testing continuously across web, API, and infrastructure (rather than once or twice a year), integrated with CI/CD and re-tested when the attack surface changes. Gartner has tracked the category across four consecutive Hype Cycles. It is the only model that closes the structural cadence gap between attacker speed and defender pentest schedule.

What safety controls should an AI pen testing platform have?

Safety must be enforced at the execution layer, not the prompt layer. Look for: scope boundaries checked against an asset whitelist before every action; runtime rate limiting; a kill switch that halts all concurrent actions within seconds; safe payload enforcement that blocks destructive operations by default; credential scope guards that prevent cross-environment credential testing; and an append-only, cryptographically timestamped audit log that survives DORA, PCI DSS 4.0, SOC 2 Type II, and ISO 27001 review.

How is FireCompass different from other AI pen testing vendors?

FireCompass is an Agentic AI platform for autonomous penetration testing and red teaming across Web, API, and infrastructure. It discovers shadow assets, validates what is exploitable, and chains findings into multi-stage attack paths with near-zero false positives. Recognized in 30+ analyst reports from Gartner, Forrester, IDC, and GigaOm. Trusted by Fortune 500 enterprises. Featured in the Gartner Hype Cycle four cycles in a row. Bruce Schneier is an advisor.

About FireCompass

FireCompass is an Agentic AI platform for autonomous penetration testing and red teaming across Web, API, and infrastructure. It discovers shadow assets and web applications, safely validates what is exploitable, and connects findings into multi-stage attack paths with near-zero false positives. Unlike traditional scanners, FireCompass uncovers credential reuse, business-logic flaws, privilege escalation, and app-to-app or app-to-network lateral movement. It can operate autonomously or with expert-in-the-loop validation. FireCompass has 30+ analyst recognitions across Gartner, Forrester, and IDC, and is trusted by Fortune 1000 enterprises.

Hack Yourself Before AI Does.

See what is actually exploitable in your environment. Claim free AI pen testing credits: firecompass.com/explorer