What our firsthand experience building pentest agents taught us about verifiability, benchmark saturation, and where human researchers still matter most
Our firsthand experience building application pentest agents at FireCompass has surprised us.
When we started building them, I assumed AI would become a useful force multiplier for researchers. I did not expect it to start challenging strong researchers this quickly.
The progression has been hard to ignore.
- Version 1: our web application pentest agents were maybe half as good as our researchers.
- Version 2: the agents started hitting 100% on public test benches like XBen.
- Version 3: public benchmarks stopped being enough, so we moved to researcher-versus-agent evaluation.
- Version 4: the agents are now beating our top researchers 60–70% of the time, in some cases by very wide margins, while staying under a 2% false-positive rate.
AI is likely to disrupt application pentesting earlier than many people expect.
This is not an argument that AI has solved all of penetration testing. It is not an argument that all of cybersecurity will be reordered by offense first. It is a more specific thesis:
AI is becoming very strong in the repeatable, validation-heavy, mechanically testable layers of application pentesting.
That is where we are seeing the shift first.
The key idea: application pentesting has unusually strong feedback loops
The most important concept here is not “AI is smart.” It is verifiability.
AI systems improve fastest when they can take an action and get relatively fast, objective feedback on whether that action worked.
Application pentesting has a surprising amount of this built in.
A system can often test and learn from questions like:
- Did the endpoint exist?
- Did the request bypass authorization?
- Did the role boundary break?
- Did the injection payload work?
- Did the state transition happen?
- Did the finding validate with evidence?
- Did the attack chain move from one step to the next?
These are not perfectly binary in every case, but they are often much more mechanically testable than the kinds of subjective, delayed, or ambiguous tasks where AI tends to struggle.
That matters.
Because once you have a tight loop between hypothesis -> execution -> observation -> validation, agentic systems start to compound.
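As a sketch, that loop can be surprisingly small. Everything below is invented for illustration (the hypotheses, the pretend server, the verifier); it is not our production code, just the shape of the loop:

```python
# Minimal sketch of the hypothesis -> execution -> observation -> validation
# loop. `execute` and `validate` are toy stand-ins for sending a request and
# checking objective evidence.

def agent_loop(hypotheses, execute, validate):
    """Try each hypothesis and keep only the ones that validate with evidence."""
    findings = []
    for hypothesis in hypotheses:
        observation = execute(hypothesis)      # execution
        if validate(observation):              # fast, objective feedback
            findings.append((hypothesis, observation))  # validated finding
    return findings

# Toy run: hypotheses are candidate requests; the verifier checks whether an
# admin-only endpoint answered a request that carried no session at all.
candidates = [
    {"path": "/admin/export", "session": None},
    {"path": "/admin/export", "session": "user"},
]
responses = {None: 200, "user": 403}           # pretend server behaviour
findings = agent_loop(
    candidates,
    execute=lambda h: responses[h["session"]],
    validate=lambda status: status == 200,
)
print(len(findings))  # 1
```

The point is not the code; it is that every pass through the loop produces an objective signal the system can learn from.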
Why this does not mean pentesting is “easy”
This is where I think a lot of people oversimplify the discussion.
Real pentesting value does not come only from a single exploit succeeding.
It comes from higher-order reasoning:
- choosing the right objective under uncertainty
- forming the right hypothesis from incomplete signals
- understanding how business logic changes the risk
- deciding when a weak signal is worth pursuing
- knowing when to pivot, abandon, or escalate
- distinguishing a technically interesting issue from an operationally important one
Those are still difficult problems.
But the full workflow is not one giant reasoning task. It is a chain of smaller subproblems, and many of those subproblems are locally verifiable.
That is the real unlock.
An agent does not need to solve “pentesting” in one shot. It needs to solve the next verifiable step:
- discover a route
- map a parameter
- test a trust boundary
- retry with a different state
- validate the signal
- carry the result forward
When enough of those local loops are measurable, performance starts to improve faster than many people expect.
Why benchmarks stop being useful after a point
Public test benches are useful early.
They are good for establishing baseline competence. They tell you whether the system can do real work in a controlled environment.
But once an agent reaches benchmark saturation, the benchmark stops being a meaningful proxy for real capability.
It no longer tells you:
- whether the system can handle messy applications
- whether it can deal with stateful flows
- whether it can reason across ambiguity
- whether it can pursue the right branch in a large search space
- whether it can validate without generating noise
- whether it can outperform a strong researcher in open-ended conditions
That was the turning point for us.
Hitting 100% on public benches was not the finish line. It was the point where we had to move to a more operational question:
How does the system perform against strong humans in real testing conditions?
That is a harder bar. It is also the bar that matters.
Why agents are getting stronger so quickly
From our perspective, there are three reasons.
1. They can go deep and wide at the same time
A good human researcher can go very deep.
A strong agent can often go deep, wide, and persistently at the same time. It can enumerate broadly, test multiple branches, retry endlessly, correlate weak signals, and continue exploring without fatigue.
A lot of application pentesting value comes from sustained exploration across a branching graph. Agents are increasingly well-suited for that shape of work.
2. They benefit from execution, not just reasoning
The real power is not just in generating an idea.
It is in generating an idea, executing it, observing the result, and adapting quickly.
A good pentest agent is not just an LLM with prompts. It is a system with tools, execution control, state management, memory, replay, and validation.
That loop is much more powerful than static reasoning alone.
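A hedged sketch of that execute/observe/adapt loop, with retry and a replayable memory of attempts. The `mutate` function and the toy target are hypothetical; a real system would derive payload variants from the observed response:

```python
# Execute a payload; on failure, adapt it from feedback and retry, keeping a
# replayable record of every attempt. All names here are illustrative.

def adapt_until_validated(seed, execute, validate, mutate, max_retries=5):
    """Return (validated_payload, memory) or (None, memory) if retries run out."""
    memory = []                                  # evidence trail for replay
    payload = seed
    for _ in range(max_retries):
        observation = execute(payload)           # act
        memory.append((payload, observation))    # remember
        if validate(observation):                # validate
            return payload, memory
        payload = mutate(payload, observation)   # adapt from feedback
    return None, memory

# Toy target: the "application" only accepts uppercase payloads.
found, trail = adapt_until_validated(
    "probe",
    execute=lambda p: p.isupper(),
    validate=lambda ok: ok,
    mutate=lambda p, obs: p.upper(),
)
print(found, len(trail))  # PROBE 2
```

Static reasoning would only guess at the target's behaviour; the loop discovers it in two attempts and keeps the evidence.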
3. They improve as the system improves
The progress is not only about the base model.
Agents improve as you improve:
- scaffolding
- tool routing
- state handling
- decomposition of tasks
- evidence validation
- retry logic
- memory
- training and feedback
That is why the improvement curve can feel nonlinear.
- The model gets better.
- The system gets better.
- The verifier gets better.
And once all three improve together, capability moves fast.
What AI is likely to take over first
I do not think AI will absorb all of application pentesting uniformly.
The first major shift is likely in the parts of the workflow that are:
- repeatable
- telemetry-rich
- execution-heavy
- evidence-driven
- decomposable into verifiable steps
That includes a lot of:
- route and endpoint discovery
- parameter exploration
- auth and access-control validation
- exploit retries and variations
- attack-path enumeration
- evidence gathering
- retesting and confirmation
- reducing noisy findings into validated outcomes
This is where AI can create a real advantage.
Not because these tasks are trivial, but because they create strong feedback loops.
What remains stubbornly human
Human researchers still matter enormously.
In fact, I think the more capable the agents become, the more valuable high-quality human judgment becomes.
Humans are likely to keep an edge in areas like:
- objective selection
- unusual edge cases
- multi-system reasoning under ambiguity
- interpreting partial signals
- deciding what matters commercially and operationally
- designing the right testing strategy based on specific business context
- supervising and improving the agentic system itself
The role does not disappear. But it changes.
More of the manual execution shifts to machines. More of the human value moves up the stack.
The real shift: from manual testing to human-directed systems
That, to me, is the deeper implication.
The future pentester will not just be the person who can manually test the most branches.
It will increasingly be the person who can:
- define the right objectives
- design and supervise the system
- interpret ambiguous outcomes
- distinguish noise from strategic signal
- direct and improve agent fleets better than others
That is a different craft from purely manual pentesting.
And I believe it will define the next era of application security testing.
Closing thought
AI is likely to disrupt application pentesting earlier than many people expect, because large parts of the workflow are mechanically verifiable, execution-heavy, and composed of repeatable subproblems.
That is different from saying all of pentesting is solved. It is different from saying all of cybersecurity will move in the same order.
The most important thing we have learned is this: Application pentesting is not becoming interesting for AI because it is easy. It is becoming interesting because enough of it is verifiable. That creates the kind of feedback loops AI systems learn from unusually well. And when those loops are strong, capability tends to move faster than the market expects.
We are building toward that future at FireCompass.
