Looking for the full COST definition? This post covers the market analysis. For the complete framework, triggers, and how FireCompass delivers Continuous Offensive Security Testing (COST), see: www.firecompass.com/continuous-offensive-security-testing

When someone tells you they do continuous offensive security testing, or COST, ask two questions.

Q1: Can it prove the exploit with reproduction steps and a PoC, or does it just flag a maybe?

Q2: Can it run autonomously without you losing control of what it touches?

A new label is showing up across the security market: continuous offensive security testing, or COST.

After years of CTEM, ASM, BAS, and CART each carving off a piece of the problem, the industry is finally converging on the obvious idea: offensive testing should run continuously, not once a year.

Here is the problem. “Continuous” is the easiest word in security to fake. Most people who adopt this label are going to bolt it onto the same point-in-time tooling they already sell, run it a little more often, and call it a category shift. It is not. Below are the three places where the gap between the label and the real thing will show up, and the one capability almost nobody is going to put in the demo.

Continuous Offensive Security Testing Is Not a Faster Pen Test

The first failure mode is the most common: take an annual pen test, run it quarterly, and market the cadence as “continuous.” Four calendars a year is still a calendar.

The reason the calendar fails has nothing to do with how often you run it and everything to do with the structural mismatch underneath. Most programs test roughly 20% of the attack surface. Crown-jewel apps get attention; peripheral assets, shadow apps, forgotten subdomains, and API endpoints buried in JavaScript do not. Attackers probe 100%. Meanwhile, the typical gap between tests is around 365 days, and new CVEs get weaponized in roughly three days. Every release widens the window.

Continuous means matching deployment velocity, not beating last year’s audit date. That means testing weekly, on demand, or wired into CI/CD. It means day-one validation when a new CVE drops. And it means delta analysis, so each run focuses on what actually changed instead of re-scanning everything from scratch. If a product cannot do those things, the cadence is cosmetic.
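To make delta analysis concrete, here is a minimal sketch, not a description of any product's internals. It assumes each run records a snapshot of assets keyed by hostname, and that hashing each asset's attributes is enough to tell whether it changed. The snapshot shape, the function names, and the example hostnames are all hypothetical.

```python
import hashlib
import json


def fingerprint(asset: dict) -> str:
    """Stable hash of the asset attributes that matter for testing."""
    canonical = json.dumps(asset, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def delta(previous: dict, current: dict) -> dict:
    """Compare two snapshots keyed by asset name and report what changed."""
    prev_fp = {name: fingerprint(a) for name, a in previous.items()}
    curr_fp = {name: fingerprint(a) for name, a in current.items()}
    return {
        "added": sorted(n for n in curr_fp if n not in prev_fp),
        "changed": sorted(
            n for n in curr_fp if n in prev_fp and curr_fp[n] != prev_fp[n]
        ),
        "removed": sorted(n for n in prev_fp if n not in curr_fp),
    }


# Hypothetical snapshots from two consecutive runs.
last_run = {
    "app.example.com": {"build": "1.4.2", "endpoints": ["/login", "/orders"]},
    "api.example.com": {"build": "2.0.0", "endpoints": ["/v1/users"]},
}
this_run = {
    "app.example.com": {"build": "1.4.3", "endpoints": ["/login", "/orders", "/export"]},
    "api.example.com": {"build": "2.0.0", "endpoints": ["/v1/users"]},
    "staging.example.com": {"build": "0.9.0", "endpoints": ["/debug"]},
}

result = delta(last_run, this_run)
# Only new and changed assets go into the next test run.
to_test = result["added"] + result["changed"]
print(to_test)
```

In this example the unchanged API host is skipped, while the new staging host and the redeployed app are queued. A real system would need a richer notion of "changed" than a hash of a few fields.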

A Finding Without Proof Is Just a Guess

The second gap is depth. Speed without accuracy just produces noise faster.

This is where most automated testing collapses. DAST scanners routinely produce false positive rates of 40-70%. That number is not an inconvenience; it is the whole problem. When more than half of what lands in the queue is wrong, your team spends its day triaging maybes instead of fixing risk, and people stop trusting the tool entirely.

A real finding comes with proof. The exact HTTP request that demonstrated the exploit. The server response that confirms it. Step-by-step reproduction. Working proof-of-concept code that a developer can run to verify it independently. Anything short of that is a hypothesis, and hypotheses should be discarded, not reported. Hitting a sub-2% false positive rate is not a function of a smarter model. It comes from the discipline of re-executing every candidate finding against the live target and throwing away everything that does not exploit. Tie that validation to a real framework like the OWASP Top 10, and you get coverage that maps to risk a CISO can actually act on.
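The re-execute-and-discard discipline can be sketched in a few lines. This is an illustration under stated assumptions, not a real validation pipeline: the `Candidate` shape, the `replay` callable, and the idea that a single marker string in the response proves exploitation are all simplifications invented for the example. The replay function here is a stub with canned responses, so nothing touches a network.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    finding_id: str
    request: str          # the exact request to replay against the target
    success_marker: str   # text that must appear in the response to count as exploited


@dataclass
class ValidatedFinding:
    finding_id: str
    request: str          # kept as evidence
    response: str         # kept as evidence


def validate(candidates: List[Candidate], replay: Callable[[str], str]) -> List[ValidatedFinding]:
    """Replay every candidate and keep only those whose exploit demonstrably fired."""
    confirmed = []
    for c in candidates:
        response = replay(c.request)
        if c.success_marker in response:
            confirmed.append(ValidatedFinding(c.finding_id, c.request, response))
        # Anything that does not exploit is dropped, not reported as a "maybe".
    return confirmed


def fake_replay(request: str) -> str:
    """Stub transport with canned responses (hypothetical, for illustration only)."""
    canned = {
        "GET /search?q=<script>alert(1)</script>":
            "<p>Results for <script>alert(1)</script></p>",
        "GET /item?id=1'": "HTTP 200 OK: item not found",
    }
    return canned.get(request, "")


candidates = [
    Candidate("XSS-1", "GET /search?q=<script>alert(1)</script>", "<script>alert(1)</script>"),
    Candidate("SQLI-1", "GET /item?id=1'", "SQL syntax"),
]
confirmed = validate(candidates, fake_replay)
print([f.finding_id for f in confirmed])
```

Only the reflected-script candidate survives, and it carries the request and response as evidence. The second candidate never produced the marker, so it is discarded rather than queued for a human to triage.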

The Real Risk Lives in the Chain, Not the Single Finding

The third gap is the one scanners are structurally incapable of closing. They report findings in isolation. Attackers do not think in isolation.

Real breaches are chains. A low-severity disclosure feeds a credential, the credential gets reused against another service, that service pivots to an app that was never in scope, and the app opens a path into the network. Around 22% of breaches start with credential abuse, and roughly 20% begin through a peripheral asset nobody was watching. None of that shows up on a list of standalone CVEs sorted by CVSS.

Workflow abuse, state manipulation, authorization boundary violations, app-to-app pivots, app-to-network lateral movement: these are the things that map to a real MITRE ATT&CK kill chain, and they are exactly what point-in-time scanning scopes out. A category that calls itself “offensive” has to model the attack the way an attacker would, end-to-end. Otherwise, it is a vulnerability list with a more aggressive name.
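One way to picture the difference between a list of findings and a chain is to treat each finding as an edge that moves an attacker from one foothold to the next, then search for a path. The sketch below is a toy model under that assumption. The footholds, severities, and descriptions are invented to mirror the chain described above, and a breadth-first search stands in for whatever path analysis a real platform performs.

```python
from collections import deque

# Each finding moves an attacker from one foothold to another (hypothetical data).
FINDINGS = [
    ("internet", "leaked-credential", "low: verbose error page discloses a credential"),
    ("leaked-credential", "billing-service", "medium: credential reused on a second service"),
    ("billing-service", "legacy-admin-app", "medium: pivot to an app that was never in scope"),
    ("legacy-admin-app", "internal-network", "high: admin app can reach internal hosts"),
    ("internet", "marketing-site", "low: outdated JavaScript library"),
]


def find_chain(findings, start, goal):
    """Breadth-first search for the shortest sequence of findings from start to goal."""
    graph = {}
    for src, dst, desc in findings:
        graph.setdefault(src, []).append((dst, desc))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for dst, desc in graph.get(node, []):
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [desc]))
    return None  # no chain connects start to goal


chain = find_chain(FINDINGS, "internet", "internal-network")
for step, desc in enumerate(chain, start=1):
    print(step, desc)
```

Sorted by severity, three of the four links in this chain would sit near the bottom of a report. Connected, they form a path from the internet to the internal network.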

An LLM Can Plan an Attack. It Cannot Safely Run One.

Here is where the AI hype is going to lead a lot of buyers astray. There is a wave of products that wrap a large language model around a scanner and call the result autonomous. It is worth being precise about what an LLM actually does.

An LLM generates text. It can reason about a target, propose an attack path, and draft an exploit. That is genuinely useful, and it is also where the model’s job ends. It cannot send the request, hold authenticated state across fifty steps, confirm the exploit fired, capture the evidence, or stop itself from touching something it should not. Those are execution problems, and they are pure engineering.

The way to think about it: the model is the engine, and the platform around it is the vehicle. An engine on a stand is impressive and goes nowhere. Without brakes, steering, and a dashboard, a powerful engine is not fast, it is uncontrollable. The hard part of this category was never the reasoning. It is the execution runtime, the state machine, the validation pipeline, and the safety architecture that turn reasoning into something you can actually run against production.

Governance Is the Part Nobody Puts in the Demo

This is the capability that separates a production platform from a clever prototype, and it is the one most vendors will quietly skip past.

Point an autonomous agent at your production environment with no controls, and you have built a liability, not a security program. Governance cannot be a policy document bolted on afterward. It has to be a mandatory gateway that sits between the AI and the execution runtime, so no agent action reaches a live target without passing through it first.

In practice, that means a specific set of controls, each doing a specific job:

Scope enforcement. Every proposed action is checked against an asset whitelist before dispatch. Out-of-scope targets never get touched.
Rate limiting. A per-host ceiling prevents an enthusiastic agent from turning a test into an accidental denial of service.
A kill switch. One signal halts every running agent immediately.
An AI firewall. An input layer filters incoming requests, and an output layer governs every action the model proposes, so non-deterministic LLM output is always controlled by deterministic, rule-based systems.
Safe payload enforcement. Read and create operations within scope are allowed. Modify, update, and delete are blocked by default and never run autonomously.
Credential scope guards. Environment tags keep UAT credentials from ever hitting production.
RBAC and append-only audit logs with cryptographic timestamps, granular across user roles, asset groups, programs, and data classification, and built to satisfy PCI DSS 4.0, SOC 2 Type II, and ISO 27001.
Full chain-of-thought visibility, so every decision an agent makes is recorded and reviewable.
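
Several of the controls above can be sketched as a single deterministic gateway that every proposed action must pass before dispatch. This is a minimal illustration, not FireCompass's implementation. The `Action` fields, the mapping of "read and create" to the HTTP methods GET, HEAD, and POST, and the 60-second sliding window are all assumptions made for the example.

```python
import time
from dataclasses import dataclass

# Assumption: "read and create" maps to these HTTP methods; PUT, PATCH, DELETE are blocked.
ALLOWED_METHODS = frozenset({"GET", "HEAD", "POST"})


@dataclass
class Action:
    host: str
    method: str           # HTTP method the agent proposes
    credential_env: str   # environment tag on the credential, e.g. "uat" or "prod"
    target_env: str       # environment tag on the target


class Gateway:
    def __init__(self, allowlist, max_per_minute=60):
        self.allowlist = set(allowlist)
        self.max_per_minute = max_per_minute
        self.halted = False
        self._sent = {}  # host -> timestamps of recently allowed actions

    def kill(self):
        """Kill switch: every later check is refused."""
        self.halted = True

    def check(self, action, now=None):
        """Return (allowed, reason). Rule-based, with no model in the loop."""
        now = time.monotonic() if now is None else now
        if self.halted:
            return False, "kill switch engaged"
        if action.host not in self.allowlist:
            return False, "out of scope"
        if action.method.upper() not in ALLOWED_METHODS:
            return False, "unsafe operation blocked by default"
        if action.credential_env != action.target_env:
            return False, "credential environment mismatch"
        recent = [t for t in self._sent.get(action.host, []) if now - t < 60.0]
        if len(recent) >= self.max_per_minute:
            self._sent[action.host] = recent
            return False, "rate limit exceeded"
        recent.append(now)
        self._sent[action.host] = recent
        return True, "allowed"


gw = Gateway(allowlist={"app.example.com"}, max_per_minute=2)
read = Action("app.example.com", "GET", "uat", "uat")
print(gw.check(read, now=0.0))
print(gw.check(Action("other.example.com", "GET", "uat", "uat"), now=0.5))
print(gw.check(Action("app.example.com", "DELETE", "uat", "uat"), now=0.6))
print(gw.check(Action("app.example.com", "GET", "uat", "prod"), now=0.7))
print(gw.check(read, now=1.0))
print(gw.check(read, now=2.0))   # third allowed-path request inside the window
gw.kill()
print(gw.check(read, now=120.0))
```

The design point is that every branch is a plain rule. Whatever the model proposes, the decision to dispatch is made by code whose behavior can be read, tested, and audited.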

This is what lets you prove to an auditor, a regulator, and your own leadership exactly what was tested, what happened, and that it was done safely. Without it, “autonomous” is a word you should be afraid of.
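
As an illustration of what an auditable record could look like, the sketch below chains each log entry to the hash of the one before it, so any later edit is detectable. A hash chain is a technique swapped in here for the example. It is not the same thing as a cryptographic timestamp from a trusted authority, and the class name and entry fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log in which each entry commits to the previous entry's hash."""

    def __init__(self):
        self._entries = []

    def append(self, actor: str, action: str, outcome: str, timestamp: str) -> dict:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        body = {
            "actor": actor,
            "action": action,
            "outcome": outcome,
            "timestamp": timestamp,
            "prev_hash": prev_hash,
        }
        entry = dict(body, hash=_digest(body))
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash and check each entry links to its predecessor."""
        prev_hash = GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True


log = AuditLog()
log.append("agent-7", "GET https://app.example.com/login", "allowed", "2025-01-01T00:00:00Z")
log.append("agent-7", "DELETE https://app.example.com/users/1", "blocked", "2025-01-01T00:00:05Z")
print(log.verify())

# Simulate tampering with a stored entry; verification now fails.
log._entries[1]["outcome"] = "allowed"
print(log.verify())
```

The first check passes and the second fails, because the edited entry no longer matches its stored hash. In practice the chain head would also need to be anchored somewhere the platform operator cannot rewrite.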

Run COST against your own attack surface.

Start free, or connect with a FireCompass expert. Use the FireCompass AI pen test agent (with free credits) to run a continuous offensive test against an asset you choose: firecompass.com/start-free-explorer/

What It Looks Like When the Category Is Built Right

None of the above is theoretical, and this is the point in the post where I will be direct about how we think about it at FireCompass.

A Fortune 500 technology company replaced a large consulting firm’s manual program with continuous AI-driven testing. They went from 200 applications tested annually to full coverage across 2,000+ applications, cut per-app cost by about 80% (roughly 11x cheaper, under $1,000 per test versus around $5,000), dropped lead time from two-plus weeks to a single day, and saw false positives fall from 70% to under 2%. The platform also surfaced chained attack paths that the prior consultants had scoped out entirely, on assets that had never been tested at all.

The reason it works is the architecture, not the slogan. Validation runs fully autonomously: 104/104 on the XBEN suite, 12/12 on Acuart, all difficulty levels on DVWA, with no manual steering. Costs land at under $1,000 per app against $10,000 for manual testing. And because the system routes each task to the strongest available model rather than betting on a single provider, the intelligence layer keeps improving as the frontier does, while the governance and execution layers stay constant.

That is the shape of the category done properly: continuous cadence, validated exploits, real attack chains, and a governance layer that makes autonomy safe enough to point at production.