What our firsthand experience building pentest agents taught us about verifiability, benchmark saturation, and where human researchers still matter most
Our firsthand experience building application pentest agents at FireCompass has surprised us.
When we started building them, I assumed AI would become a useful force multiplier for researchers. I did not expect it to start challenging strong researchers this quickly.
The progression has been hard to ignore.
- Version 1: our web application pentest agents were maybe half as good as our researchers.
- Version 2: the agents started hitting 100% on public test benches like XBen.
- Version 3: public benchmarks stopped being enough, so we moved to researcher-versus-agent evaluation.
- Version 4: the agents are now beating our top researchers 60–70% of the time, in some cases by very wide margins, while staying under a 2% false-positive rate.
AI is likely to disrupt application pentesting earlier than many people expect.
This is not an argument that AI has solved all of penetration testing. It is not an argument that all of cybersecurity will be reordered by offense first. It is a more specific thesis:
AI is becoming very strong in the repeatable, validation-heavy, mechanically testable layers of application pentesting.
That is where we are seeing the shift first.
The key idea: application pentesting has unusually strong feedback loops
The most important concept here is not “AI is smart.” It is verifiability.
AI systems improve fastest when they can take an action and get relatively fast, objective feedback on whether that action worked.
Application pentesting has a surprising amount of this built in.
A system can often test and learn from questions like:
- Did the endpoint exist?
- Did the request bypass authorization?
- Did the role boundary break?
- Did the injection payload work?
- Did the state transition happen?
- Did the finding validate with evidence?
- Did the attack chain move from one step to the next?
These are not perfectly binary in every case, but they are often much more mechanically testable than the kinds of subjective, delayed, or ambiguous tasks where AI tends to struggle.
That matters.
Because once you have a tight loop between hypothesis -> execution -> observation -> validation, agentic systems start to compound.
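As a sketch, that loop can be surprisingly small. Everything below is invented for illustration (the hypotheses, the pretend server, the verifier); it is not our production code, just the shape of the loop:

```python
# Minimal sketch of the hypothesis -> execution -> observation -> validation
# loop. `execute` and `validate` are toy stand-ins for sending a request and
# checking objective evidence.

def agent_loop(hypotheses, execute, validate):
    """Try each hypothesis and keep only the ones that validate with evidence."""
    findings = []
    for hypothesis in hypotheses:
        observation = execute(hypothesis)      # execution
        if validate(observation):              # fast, objective feedback
            findings.append((hypothesis, observation))  # validated finding
    return findings

# Toy run: hypotheses are candidate requests; the verifier checks whether an
# admin-only endpoint answered a request that carried no session at all.
candidates = [
    {"path": "/admin/export", "session": None},
    {"path": "/admin/export", "session": "user"},
]
responses = {None: 200, "user": 403}           # pretend server behaviour
findings = agent_loop(
    candidates,
    execute=lambda h: responses[h["session"]],
    validate=lambda status: status == 200,
)
print(len(findings))  # 1
```

The point is not the code; it is that every pass through the loop produces an objective signal the system can learn from.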
Why this does not mean pentesting is “easy”
This is where I think a lot of people oversimplify the discussion.
Real pentesting value does not come only from a single exploit succeeding.
It comes from higher-order reasoning:
- choosing the right objective under uncertainty
- forming the right hypothesis from incomplete signals
- understanding how business logic changes the risk
- deciding when a weak signal is worth pursuing
- knowing when to pivot, abandon, or escalate
- distinguishing a technically interesting issue from an operationally important one
Those are still difficult problems.
But the full workflow is not one giant reasoning task. It is a chain of smaller subproblems, and many of those subproblems are locally verifiable.
That is the real unlock.
An agent does not need to solve “pentesting” in one shot. It needs to solve the next verifiable step:
- discover a route
- map a parameter
- test a trust boundary
- retry with a different state
- validate the signal
- carry the result forward
When enough of those local loops are measurable, performance starts to improve faster than many people expect.
Why benchmarks stop being useful after a point
Public test benches are useful early.
They are good for establishing baseline competence. They tell you whether the system can do real work in a controlled environment.
But once an agent reaches benchmark saturation, the benchmark stops being a meaningful proxy for real capability.
It no longer tells you:
- whether the system can handle messy applications
- whether it can deal with stateful flows
- whether it can reason across ambiguity
- whether it can pursue the right branch in a large search space
- whether it can validate without generating noise
- whether it can outperform a strong researcher in open-ended conditions
That was the turning point for us.
Hitting 100% on public benches was not the finish line. It was the point where we had to move to a more operational question:
How does the system perform against strong humans in real testing conditions?
That is a harder bar. It is also the bar that matters.
Why agents are getting stronger so quickly
From our perspective, there are three reasons.
1. They can go deep and wide at the same time
A good human researcher can go very deep.
A strong agent can often go deep, wide, and persistently at the same time. It can enumerate broadly, test multiple branches, retry endlessly, correlate weak signals, and continue exploring without fatigue.
A lot of application pentesting value comes from sustained exploration across a branching graph. Agents are increasingly well-suited for that shape of work.
2. They benefit from execution, not just reasoning
The real power is not just in generating an idea.
It is in generating an idea, executing it, observing the result, and adapting quickly.
A good pentest agent is not just an LLM with prompts. It is a system with tools, execution control, state management, memory, replay, and validation.
That loop is much more powerful than static reasoning alone.
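A hedged sketch of that execute/observe/adapt loop, with retry and a replayable memory of attempts. The `mutate` function and the toy target are hypothetical; a real system would derive payload variants from the observed response:

```python
# Execute a payload; on failure, adapt it from feedback and retry, keeping a
# replayable record of every attempt. All names here are illustrative.

def adapt_until_validated(seed, execute, validate, mutate, max_retries=5):
    """Return (validated_payload, memory) or (None, memory) if retries run out."""
    memory = []                                  # evidence trail for replay
    payload = seed
    for _ in range(max_retries):
        observation = execute(payload)           # act
        memory.append((payload, observation))    # remember
        if validate(observation):                # validate
            return payload, memory
        payload = mutate(payload, observation)   # adapt from feedback
    return None, memory

# Toy target: the "application" only accepts uppercase payloads.
found, trail = adapt_until_validated(
    "probe",
    execute=lambda p: p.isupper(),
    validate=lambda ok: ok,
    mutate=lambda p, obs: p.upper(),
)
print(found, len(trail))  # PROBE 2
```

Static reasoning would only guess at the target's behaviour; the loop discovers it in two attempts and keeps the evidence.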
3. They improve as the system improves
The progress is not only about the base model.
Agents improve as you improve:
- scaffolding
- tool routing
- state handling
- decomposition of tasks
- evidence validation
- retry logic
- memory
- training and feedback
That is why the improvement curve can feel nonlinear.
- The model gets better.
- The system gets better.
- The verifier gets better.
And once all three improve together, capability moves fast.
What AI is likely to take over first
I do not think AI will absorb all of application pentesting uniformly.
The first major shift is likely in the parts of the workflow that are:
- repeatable
- telemetry-rich
- execution-heavy
- evidence-driven
- decomposable into verifiable steps
That includes a lot of:
- route and endpoint discovery
- parameter exploration
- auth and access-control validation
- exploit retries and variations
- attack-path enumeration
- evidence gathering
- retesting and confirmation
- reducing noisy findings into validated outcomes
This is where AI can create a real advantage.
Not because these tasks are trivial, but because they create strong feedback loops.
What remains stubbornly human
Human researchers still matter enormously.
In fact, I think the more capable the agents become, the more valuable high-quality human judgment becomes.
Humans are likely to keep an edge in areas like:
- objective selection
- unusual edge cases
- multi-system reasoning under ambiguity
- interpreting partial signals
- deciding what matters commercially and operationally
- designing the right testing strategy based on specific business context
- supervising and improving the agentic system itself
The role does not disappear. But it changes.
More of the manual execution shifts to machines. More of the human value moves up the stack.
The real shift: from manual testing to human-directed systems
That, to me, is the deeper implication.
The future pentester will not just be the person who can manually test the most branches.
It will increasingly be the person who can:
- define the right objectives
- design and supervise the system
- interpret ambiguous outcomes
- distinguish noise from strategic signal
- direct and improve agent fleets better than others
That is a different craft from purely manual pentesting.
And I believe it will define the next era of application security testing.
Closing thought
AI is likely to disrupt application pentesting earlier than many people expect, because large parts of the workflow are mechanically verifiable, execution-heavy, and composed of repeatable subproblems.
That is different from saying all of pentesting is solved. It is different from saying all of cybersecurity will move in the same order.
The most important thing we have learned is this: Application pentesting is not becoming interesting for AI because it is easy. It is becoming interesting because enough of it is verifiable. That creates the kind of feedback loops AI systems learn from unusually well. And when those loops are strong, capability tends to move faster than the market expects.
We are building toward that future at FireCompass.
