Gartner published a research note in March 2026 that quietly reshaped the offensive security market. Titled “The Future of Pen Testing Is Continuous Offensive Security Testing” (Dhivya Poole, Carlos De Sola Caraballo, Mitchell Schneider, 6 March 2026, ID G00845606), it introduces a new category: Continuous Offensive Security Testing, or COST.
FireCompass was named in the vendor matrix. We’re proud of that. But the recognition isn’t the interesting part of this story. The interesting part is what Gartner is actually saying, and what most of the vendors who get listed are going to fundamentally misunderstand.
Because COST isn’t a marketing rebrand of annual pen testing. It’s a clean break from it. And the gap between “we do continuous testing” as a slide bullet and actually being architecturally built for continuous validation is much larger than the industry is letting on.
What COST Means
Strip away the analyst framing and here’s what Gartner is calling for: penetration testing that activates when material risk changes, not when the calendar says it’s time.
The current model is broken in a way most security leaders already know. You scope an engagement, you wait two to six weeks for a vendor, they test a slice of your environment, you get a PDF, you remediate some findings, and twelve months later you do it again. Meanwhile, your applications shipped 3,000 changes, your attack surface grew by 40%, three critical CVEs dropped, and an attacker probed every public endpoint you own. Twice.
COST replaces that with four ideas that, taken together, change the economics of offensive security:
Trigger-driven testing. Tests fire when something changes (a deployment, a new asset, a fresh CVE, a configuration drift), not when a contract starts. Risk-tiered triggers (high, medium, low) determine both the urgency of the test and what kind of test runs.
Unified discipline. Pen testing, red teaming, bug bounty, and control validation stop being separate purchases with separate vendors and separate dashboards. COST treats them as one continuously operating capability.
Closed-loop remediation. Findings flow directly into ticketing, fix verification, and revalidation. The output isn’t a report; it’s a measurable reduction in exposure window.
Outcome metrics that aren’t “test complete.” Mean time to mitigate. Exposure window length. Validation speed. The KPIs are about how fast risk is actually closed, not how many tests were scheduled.
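To make the outcome metrics above concrete, here is a minimal sketch of how exposure window and mean time to mitigate could be computed from finding timestamps. The `Finding` record and its field names are hypothetical, and the definitions used (exposure measured from the moment the weakness became reachable, mitigation measured from detection to a revalidated fix) are assumptions for illustration, not definitions taken from the Gartner note.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Finding:
    """Hypothetical finding record; field names are illustrative."""
    exposed_at: datetime              # when the weakness became reachable (e.g. deploy time)
    detected_at: datetime             # when a test confirmed it
    mitigated_at: Optional[datetime]  # when the fix was revalidated; None if still open


def exposure_window(finding: Finding, now: datetime) -> timedelta:
    """How long the weakness was (or has been so far) exploitable."""
    return (finding.mitigated_at or now) - finding.exposed_at


def mean_time_to_mitigate(findings: list[Finding]) -> timedelta:
    """Average detection-to-fix time, counting closed findings only."""
    closed = [f for f in findings if f.mitigated_at is not None]
    if not closed:
        raise ValueError("no mitigated findings to average")
    seconds = mean((f.mitigated_at - f.detected_at).total_seconds() for f in closed)
    return timedelta(seconds=seconds)


# Made-up sample data: two closed findings and one still open.
t0 = datetime(2026, 3, 1)
findings = [
    Finding(t0, t0 + timedelta(hours=1), t0 + timedelta(hours=25)),
    Finding(t0, t0 + timedelta(hours=2), t0 + timedelta(hours=50)),
    Finding(t0, t0 + timedelta(hours=3), None),
]
now = t0 + timedelta(hours=72)

print("MTTM:", mean_time_to_mitigate(findings))
print("Longest exposure:", max(exposure_window(f, now) for f in findings))
```

Tracking these two numbers before and after a pilot is one way to check whether a continuous program is actually shrinking exposure rather than just running more tests.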
Gartner predicts that by 2028, over 60% of enterprise pen test programs will operate as continuous validation embedded within DevSecOps pipelines, replacing annual assessments as the primary proof of resilience. That’s a two-year timeline. Most vendors are not ready for it.
Why Most Vendors Will Struggle With COST, Even The Ones On The List
Here’s what the matrix doesn’t tell you: getting named in a Gartner vendor matrix means you’re in the conversation. It doesn’t mean you can deliver the model.
A category called “continuous offensive security testing” creates an immediate temptation. Every vendor with a SaaS interface and a scheduling button can now claim they’re “continuous.” Every PTaaS provider can argue that quarterly testing with a customer portal counts. Every DAST vendor with a Jira integration will repackage its scanner as a COST platform.
It won’t work. Because actual continuous offensive security testing requires four things that most platforms structurally cannot do. And never could, regardless of how their pricing page is rewritten.
It requires testing without a human in the critical path of every engagement. If your model depends on a human pen tester to scope, configure, and run each test, you cannot fire a test in response to a deployment that happened at 2 AM. The economics of human-driven testing, even when offered as a service, are inherently incompatible with trigger-driven cadence. This is not a UX problem. It’s a cost-per-test problem. Manual pen testing runs $2,400–$10,000 per app. You cannot run that against every code change.
It requires evidence-backed validation, not theoretical findings. A continuous program that emits thousands of unvalidated findings is worse than no program at all. DAST scanners produce 40–70% false positive rates. If you 10x the test frequency, you 10x the noise. And your engineering team starts ignoring everything the platform emits. The only version of continuous testing that works is one where every finding comes with a working proof-of-concept exploit, a reproducible request/response pair, and runtime confirmation that the vulnerability is actually exploitable in production.
It requires closing the Scope, Depth, and Speed gap simultaneously. Most “continuous” platforms tackle one of these and ignore the other two. Continuous DAST gives you speed but no business logic depth. Continuous PTaaS gives you depth but limited scope and slow turnaround. Continuous ASM gives you scope but no exploit validation. COST demands all three: full attack surface coverage, real attack-chain depth, and a cadence that matches deployment velocity. There is no shortcut.
It requires governance and safety controls that auditors will accept. This is the part the market underestimates most. If you’re running offensive security testing continuously against production, you need to prove to regulators, your board, and your own security team that every action was scoped, every payload was safe, every credential was contained, and every decision is auditable. An LLM with a network connection and a goal is not a COST platform. It’s a liability.
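Two of the four requirements above are, at bottom, arithmetic arguments. The sketch below runs the numbers using the figures quoted in this article (3,000 changes, $2,400–$10,000 per manual test, 40–70% DAST false positives). The `FINDINGS_PER_CYCLE` value is a made-up assumption, and pricing every code change as a full manual engagement is a deliberate simplification to show the shape of the problem.

```python
# Figures quoted in the article.
CHANGES_PER_YEAR = 3_000                  # code changes shipped between annual tests
MANUAL_COST_PER_TEST = (2_400, 10_000)    # USD per app, low and high
DAST_FALSE_POSITIVE_PCT = (40, 70)        # percent, low and high

# Assumption for illustration only; not a figure from the article.
FINDINGS_PER_CYCLE = 500                  # hypothetical raw findings from one scan cycle


def manual_cost_range(tests: int) -> tuple[int, int]:
    """Total cost in USD if every test were a manual engagement."""
    low, high = MANUAL_COST_PER_TEST
    return tests * low, tests * high


def false_positive_range(findings: int, frequency_multiplier: int) -> tuple[int, int]:
    """Expected false positives when the same scanner runs more often."""
    low, high = DAST_FALSE_POSITIVE_PCT
    total = findings * frequency_multiplier
    return total * low // 100, total * high // 100


print("Manual test per change:", manual_cost_range(CHANGES_PER_YEAR))
print("False positives at 1x:", false_positive_range(FINDINGS_PER_CYCLE, 1))
print("False positives at 10x:", false_positive_range(FINDINGS_PER_CYCLE, 10))
```

Under these assumptions, a manual test for every change would cost between 3,000 × $2,400 = $7.2 million and 3,000 × $10,000 = $30 million a year, and a tenfold increase in scan frequency turns 200–350 false positives per cycle into 2,000–3,500.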
What Continuous Looks Like When You Build For It
This is the architecture the COST category implicitly demands. And what we’ve been building toward at FireCompass for years, before the category had a name.
It starts with an autonomous AI agent network that doesn’t need a human to scope, launch, or interpret each test. Tests fire on triggers: a new asset discovered by ASM, a CVE drop, a deployment notification from CI/CD, a configuration change in a cloud account. The agent picks up the scope, runs the test, validates the finding, and emits an evidence-backed result. The cost per test drops from $5,000 to under $1,000. The lead time drops from two weeks to effectively zero. Test cadence stops being a budget constraint.
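As a rough illustration of trigger-driven, risk-tiered dispatch, the sketch below maps an incoming event to a tier and then to a test plan. Everything in it is hypothetical: the `Trigger` fields, the classification rules, the test type names, and the start-time targets in `POLICY` are assumptions chosen for the example, not FireCompass’s or Gartner’s actual policy.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class Trigger:
    kind: str                        # "deployment", "new_asset", "cve", "config_drift"
    asset: str
    business_critical: bool = False
    internet_facing: bool = False


@dataclass(frozen=True)
class PlannedTest:
    tier: Tier
    test_type: str
    start_within_minutes: int


# Hypothetical policy table: tier -> (test type, how soon the test must start).
POLICY = {
    Tier.HIGH: ("full_exploit_validation", 15),
    Tier.MEDIUM: ("targeted_retest", 240),
    Tier.LOW: ("baseline_scan", 1440),
}


def classify(trigger: Trigger) -> Tier:
    """Assign a risk tier from the event type and the asset's exposure."""
    if trigger.kind == "cve" and trigger.internet_facing:
        return Tier.HIGH
    if trigger.business_critical and trigger.kind in {"deployment", "new_asset"}:
        return Tier.HIGH
    if trigger.internet_facing or trigger.business_critical:
        return Tier.MEDIUM
    return Tier.LOW


def plan_for(trigger: Trigger) -> PlannedTest:
    """Turn a change event into a test plan with an urgency target."""
    tier = classify(trigger)
    test_type, minutes = POLICY[tier]
    return PlannedTest(tier, test_type, minutes)


print(plan_for(Trigger("cve", "api.example.com", internet_facing=True)))
print(plan_for(Trigger("config_drift", "internal-batch")))
```

The point of the table is that the tier decides both what runs and how quickly, so the same event stream can drive an urgent exploit validation for one asset and a routine scan for another.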
It requires what we call an Engine-and-Vehicle architecture. The LLM is the engine. Extraordinary at reasoning about attacks, generating payloads, interpreting responses. But an engine without brakes, steering, a dashboard, and safety systems is not a vehicle. It’s a danger. The vehicle is what surrounds the model: scope boundary enforcement that checks every agent action against an asset whitelist before dispatch, rate limiting to prevent unintended denial-of-service, a kill switch that halts all agents instantly, an AI firewall architecture where non-deterministic LLM output is governed by deterministic rule-based systems, safe payload enforcement that blocks modify/update/delete operations by default, credential scope guards that prevent UAT credentials from hitting production, RBAC with granular permissions, and append-only audit logs with cryptographic timestamps that satisfy DORA, PCI DSS 4.0, SOC 2 Type II, and ISO 27001 requirements.
This is not a feature list. It’s the architectural minimum for running autonomous offensive testing continuously in a production enterprise environment. Without it, you’re not doing COST. You’re doing reckless automation with a marketing label.
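To show what “deterministic rules governing non-deterministic output” can look like in the small, here is a minimal sketch of a guard that every agent action must pass before dispatch. It is an illustration under stated assumptions, not FireCompass’s implementation: the `Guard` class is hypothetical, “safe payload” is approximated by read-only HTTP methods, the rate limiter rejects rather than queues, and a SHA-256 hash chain stands in for an append-only log with cryptographic timestamps.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlsplit

# Assumption: "safe payload" is approximated here by read-only HTTP methods.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}
GENESIS = "0" * 64  # placeholder hash that starts the audit chain


@dataclass
class Guard:
    allowed_hosts: set[str]
    max_requests_per_second: float = 5.0
    killed: bool = False
    audit_log: list[dict] = field(default_factory=list)
    _last_dispatch: float = field(default=float("-inf"), repr=False)

    def kill(self) -> None:
        """Kill switch: every later check is refused."""
        self.killed = True

    def check(self, method: str, url: str, now: Optional[float] = None) -> tuple[bool, str]:
        """Decide whether an agent-proposed request may be sent, and log the decision."""
        now = time.monotonic() if now is None else now
        host = urlsplit(url).hostname or ""
        if self.killed:
            verdict = (False, "kill switch engaged")
        elif host not in self.allowed_hosts:
            verdict = (False, "host outside scope")
        elif method.upper() not in SAFE_METHODS:
            verdict = (False, "state-changing method blocked by default")
        elif now - self._last_dispatch < 1.0 / self.max_requests_per_second:
            verdict = (False, "rate limit exceeded")
        else:
            self._last_dispatch = now
            verdict = (True, "allowed")
        self._append_audit(method, url, verdict)
        return verdict

    def _append_audit(self, method: str, url: str, verdict: tuple[bool, str]) -> None:
        """Append an entry whose hash covers its content and the previous entry's hash."""
        prev = self.audit_log[-1]["hash"] if self.audit_log else GENESIS
        entry = {"ts": time.time(), "method": method, "url": url,
                 "allowed": verdict[0], "reason": verdict[1], "prev": prev}
        entry["hash"] = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        self.audit_log.append(entry)


def chain_is_intact(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != digest:
            return False
        prev = entry["hash"]
    return True


guard = Guard(allowed_hosts={"app.example.com"})
print(guard.check("GET", "https://app.example.com/login"))
print(guard.check("DELETE", "https://app.example.com/users/1"))
print(guard.check("GET", "https://prod.example.com/"))
print("audit chain intact:", chain_is_intact(guard.audit_log))
```

The design choice worth noting is that the guard never asks the model for an opinion. Scope, method, rate, and kill-switch checks are plain rules, and refusals are logged alongside approvals so the record shows what the agent attempted as well as what it was allowed to do.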
It requires evidence-backed validation as a gate, not as a nice-to-have. At FireCompass, every finding the AI proposes is treated as a hypothesis. It doesn’t reach a customer dashboard until the execution layer has confirmed the exploit against the live target and captured the proof. That’s how we hit a false positive rate below 2%, against the 40–70% baseline of traditional DAST. It’s not model accuracy. It’s engineering discipline at the validation layer.
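The hypothesis-then-confirm pattern described above can be sketched as a simple gate. This is an illustrative outline only: the `Hypothesis`, `Evidence`, and `ConfirmedFinding` types are hypothetical, and `stub_verifier` is a stand-in for an execution layer that would actually replay the request against the target.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Hypothesis:
    target: str
    vuln_class: str
    request: str          # proof-of-concept request proposed by the model


@dataclass(frozen=True)
class Evidence:
    request: str          # the request that was actually replayed
    response: str         # the captured response that demonstrates the exploit


@dataclass(frozen=True)
class ConfirmedFinding:
    hypothesis: Hypothesis
    evidence: Evidence


# A verifier replays the hypothesis and returns Evidence only if the exploit reproduces.
Verifier = Callable[[Hypothesis], Optional[Evidence]]


def validation_gate(hypotheses: list[Hypothesis], verify: Verifier) -> list[ConfirmedFinding]:
    """Only hypotheses with captured evidence pass; the rest never reach the dashboard."""
    confirmed = []
    for hypothesis in hypotheses:
        evidence = verify(hypothesis)
        if evidence is not None:
            confirmed.append(ConfirmedFinding(hypothesis, evidence))
    return confirmed


def stub_verifier(hypothesis: Hypothesis) -> Optional[Evidence]:
    """Stand-in for a real execution layer: pretends only the SQL injection reproduces."""
    if hypothesis.vuln_class == "sqli":
        return Evidence(hypothesis.request, "HTTP/1.1 200 OK (captured response body)")
    return None


proposed = [
    Hypothesis("app.example.com", "sqli", "GET /items?id=1'--"),
    Hypothesis("app.example.com", "xss", "GET /search?q=<script>"),
]
for finding in validation_gate(proposed, stub_verifier):
    print(finding.hypothesis.vuln_class, "confirmed with evidence")
```

Because the gate depends only on whether evidence was captured, the reported false positive rate is set by how strict the verifier is, not by how often the model guesses right.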
And it requires closing the Scope-Depth-Speed gap as one problem, not three. Discovery agents that find shadow apps, forgotten subdomains, exposed APIs, and leaked credentials. Pen testing agents that execute OWASP Top 10 coverage with authenticated and unauthenticated paths, business logic testing, and privilege escalation. Red teaming agents that chain findings into multi-stage attack paths (credential reuse across services, app-to-app pivots, app-to-network lateral movement), and report them with MITRE ATT&CK-aligned narratives. Continuous cadence that matches CI/CD release velocity, validates day-1 CVEs, and revalidates fixes with one click.
That’s what a Fortune 500 technology company we work with adopted in place of its consulting firm. 2,000+ apps continuously tested instead of 200 annually. 80% cost reduction per test. Two-week lead time collapsed to one day. Chained attack paths the consultants had scoped out of their engagement, found autonomously. This isn’t a thought experiment. It’s a production deployment, running today, on the model COST describes.
The Real Test for COST Vendors Over the Next Two Years
Gartner’s prediction (60% of enterprise programs running continuous validation by 2028) sets a clock. Two years is short for a category that requires a re-architecture. Most vendors named in early matrices will be replaced or absorbed before the prediction window closes, because the gap between “we have a continuous-sounding product page” and “we are architecturally built for continuous” is going to become brutally visible the moment customers try to operationalize it.
The questions to ask any vendor claiming COST capability are simple:
Can you run a test in response to a deployment event, without a human in the loop, against any asset in scope, within minutes?
Is every finding you report backed by a working exploit and reproducible evidence?
Can you prove to a regulator that every agent action was governed, every payload was safe, and every decision is logged?
Can you cover the full attack surface (apps, APIs, infrastructure, internal and external) on the same platform with the same governance?
And can you run all of this continuously without the cost curve breaking your customer’s budget?
If the answer to any of those is “well, sort of, with some manual setup,” that vendor is going to lose this category. Not in five years. In two.
What This Means For Your Program
You don’t need to wait for the 2028 deadline to start. The shift to continuous offensive security testing is one of those rare transitions where the architecture is mature enough today to start replacing the calendar-based model immediately. And the economics work even at the pilot stage.
Start with your most business-critical applications. Establish a continuous cadence: monthly, weekly, or triggered by deployment events. Validate the evidence quality. Measure the exposure window before and after. Expand into APIs, then infrastructure, then continuous red teaming. The Fortune 500 case study didn’t happen because someone bought a bigger contract. It happened because they replaced a $5,000-per-test consulting motion with a sub-$1,000-per-test continuous platform, and then asked: now that each additional test costs a fraction of what it did, why are we still testing 10% of our portfolio?
That’s the question COST is forcing every security leader to answer.
Try FireCompass against your own environment. Run a free, evidence-backed pen test on a business-critical web app at firecompass.com/explorer. For the architectural detail behind why LLMs alone cannot deliver COST and what an enterprise-grade execution layer requires, read our Beyond Mythos whitepaper.
