AI has already changed offensive security. The open question for security leaders is what to do about it. A recent EC-Council CyberTalks session of the same name addressed the question directly, featuring FireCompass co-founders Bikash Barai and Arnab Chattopadhayay, along with a CISO from a global financial group. What follows is the part a CISO can act on.
What this session covered
- What makes this new class of AI different: agency, not just speed.
- Why the real exposure is your custom web apps and APIs, not OS and browser CVEs.
- Measuring the hypothesis surface, not only the attack surface.
- The defender response: speed, coverage, and governed autonomy.
- The practical questions: WAFs as virtual patches, why an unreleased model still matters, and shadow AI.
For years, defenders were protected by something they rarely named: the scarcity of elite offensive talent. Forming a hypothesis about how weaknesses combine, writing a working exploit, then pivoting deeper took a skilled human and time. A new class of frontier AI is removing that buffer, and it changes the assumptions most security programs are still built on, including how penetration testing has to work.
What is actually different: agency, not speed
The shift is not that attacks got faster. It is that the tooling now has agency. A frontier model can autonomously chain several weaknesses into a working exploit with no human steering, and it performs at an expert level the moment it starts rather than slowly learning the domain. One such model was capable enough that Anthropic chose not to release it publicly, on cybersecurity grounds. When a vendor decides a capability is too dangerous to ship, that is the signal worth reading.
What makes this possible is a system that holds a coherent, revisable attack hypothesis across a multi-hour autonomous session, reasoning about which paths to prune and which to pursue. One reported result was a memory-corruption flaw that had survived years of fuzzing, found and exploited by a general-purpose model never built for security. The practical consequence for defenders is that the time to weaponize a fresh flaw collapses toward less than an hour. That is not theoretical: agentic AI penetration testing already finds a vulnerability, writes the exploit, and produces a reproduction report in hours.
The CVEs will get patched. Your custom web apps will not get tested.
This is where most of the worry is misdirected. The operating systems, browsers, and standard libraries an AI probes will keep getting hardened, because the vendors that own that code will find and fix those flaws. Your patch backlog grows, but it is a manageable kind of growth.
The real exposure sits in your custom web applications and your third-party dependencies. No vendor is fuzzing your bespoke checkout flow or your internal API gateway. A reasoning attacker can walk a logical path through your application that no human tester ever traversed, find a business-logic flaw or an authorization gap, and prove it. Because these AI systems are probabilistic, the same target probed twice can surface a path nobody anticipated.
That is the gap traditional programs leave wide open. Most enterprises test roughly 20 percent of their attack surface, while attackers probe 100 percent of it. Business-logic flaws and credential abuse are invisible to scanners by design, yet 22 percent of breaches start with credential abuse and 20 percent begin through a peripheral asset nobody scoped in. A program that tests once a year falls further behind with every weekly release.
Stop measuring attack surface. Start measuring what a reasoning attacker would pursue.
Defensive metrics built for the old world, such as patch SLA, scan counts, and mean time to detect, were designed against attackers who executed a fixed plan. A reasoning system does not execute your plan faster. It decides what to try next, so the natural rate limiter on an attack is gone. The question is no longer only what vulnerabilities exist on your attack surface. It is what a reasoning system would find worth pursuing here, and why. That is a hypothesis surface, and almost no enterprise can say how long its environment stays interesting to an adaptive attacker.
This is the gap between a scanner and a real test. A scanner lists vulnerabilities. An attacker walks a path. Consider a chain that autonomous agents have discovered in the field: an authentication token left in a JavaScript file, decoded from Base64 to reveal credentials, and those same credentials accepted on restricted production endpoints. The result is full production access from a single misplaced file that a scanner would have logged as a low-severity information disclosure. The starting signal is trivial in isolation. The chain makes it critical.
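One way a defender could look for the starting signal in that chain is sketched below. It is a minimal illustration, not how any particular scanner or agent works: it searches JavaScript source for Base64-looking strings, decodes them, and flags any that decode to something shaped like a user:secret pair. The function name, both regular expressions, the 16-character threshold, and the sample credential are assumptions made for the example.

```python
import base64
import binascii
import re

# Runs of 16+ Base64 characters with optional padding. The minimum length is
# an arbitrary threshold chosen for this sketch.
BASE64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")

# Hypothetical heuristic: decoded text shaped like "user:secret".
CREDENTIAL_SHAPE = re.compile(r"^[\w.@-]{3,}:\S{6,}$")


def find_embedded_credentials(js_source: str) -> list[str]:
    """Return decoded strings in js_source that look like user:secret pairs."""
    hits = []
    for match in BASE64_CANDIDATE.finditer(js_source):
        token = match.group(0)
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not text once decoded
        if CREDENTIAL_SHAPE.match(decoded):
            hits.append(decoded)
    return hits


# Build a made-up token so the example carries no real secret.
sample_token = base64.b64encode(b"svc-deploy:example-password").decode("ascii")
sample_js = f'const AUTH = "{sample_token}"; fetch("/api/orders");'
print(find_embedded_credentials(sample_js))  # ['svc-deploy:example-password']
```

A hit from a check like this is only the first link. Whether the decoded credentials are accepted anywhere that matters still has to be validated, within an authorized scope, before the finding is rated.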
What to do about it: speed, coverage, and governed autonomy
Under this pressure, the defender metrics that matter most are mean time to detect, mean time to repair, and blast-radius containment, all of which AI can compress. But the program-level answer is to close the scope, depth, and speed gaps at the same time, and no team can staff that with humans alone. The only way to do it now is to run agents the way attackers already do, on an on-demand or scheduled cadence rather than as an annual project.
Agents that find and exploit flaws are dangerous if they are not contained. Autonomy without governance is a liability. The controls worth demanding of any platform you build or buy are specific: explicit action, input, and output constraints on every agent; rule-based, deterministic enforcement rather than one AI policing another; a complete audit trail of the agent’s reasoning and its actions; real-time visibility with a kill switch; and role-based access to sensitive artifacts such as proof-of-exploit code.
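Several of those controls can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not any platform's implementation: every class, method, action name, and host is hypothetical. It shows explicit action and scope constraints, rule-based deterministic enforcement (plain set membership, no model in the loop), an audit trail that records the agent's stated reasoning with each decision, and a kill switch. Input and output constraints and role-based access to artifacts are left out to keep it short.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AgentPolicy:
    """Deterministic rules for one agent; all names are illustrative."""
    allowed_actions: frozenset[str]
    allowed_hosts: frozenset[str]


@dataclass
class GovernedAgentGate:
    """Every agent action must pass through authorize() before it runs."""
    policy: AgentPolicy
    killed: bool = False
    audit_log: list[dict] = field(default_factory=list)

    def kill(self) -> None:
        """Kill switch: every later request is refused."""
        self.killed = True

    def authorize(self, action: str, host: str, reasoning: str) -> bool:
        """Apply the rules in a fixed order and record the decision."""
        if self.killed:
            decision = "denied: kill switch engaged"
        elif action not in self.policy.allowed_actions:
            decision = "denied: action not allowed"
        elif host not in self.policy.allowed_hosts:
            decision = "denied: host out of scope"
        else:
            decision = "allowed"
        # Denied requests are logged too, so the trail is complete.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "host": host,
            "reasoning": reasoning,
            "decision": decision,
        })
        return decision == "allowed"


policy = AgentPolicy(
    allowed_actions=frozenset({"http_get", "port_scan"}),
    allowed_hosts=frozenset({"staging.example.com"}),
)
gate = GovernedAgentGate(policy)
print(gate.authorize("http_get", "staging.example.com", "enumerate routes"))
print(gate.authorize("delete_data", "staging.example.com", "clean up"))
gate.kill()
print(gate.authorize("http_get", "staging.example.com", "retry"))
```

The design choice worth noting is that the gate never asks a model whether an action is safe. The same request against the same policy always produces the same decision, which is what makes the audit trail reviewable.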
Validating every finding end to end is also what keeps the noise out. Because each finding is checked this way, the false-positive rate stays under 2 percent, against 40 to 70 percent for traditional scanners, and the platform clears standard benchmarks at 100 percent, including 104 of 104 XBEN challenges with no manual steering. As FireCompass advisor Bruce Schneier has put it, automating penetration testing of complex, multi-stage attacks is the next level of penetration testing, and agentic AI is a promising way to solve an otherwise hard problem.
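To put the false-positive rates quoted above in triage terms, the short calculation below applies them to a batch of reported findings. The batch size of 1,000 is a hypothetical figure chosen for round numbers, and the function name is an assumption. Only the percentages come from the text.

```python
def expected_false_positives(reported_findings: int, rate_percent: int) -> float:
    """Number of reported findings expected to be false at a given rate."""
    if not 0 <= rate_percent <= 100:
        raise ValueError("rate_percent must be between 0 and 100")
    return reported_findings * rate_percent / 100


BATCH = 1000  # hypothetical number of reported findings

validated_ceiling = expected_false_positives(BATCH, 2)   # "under 2 percent"
scanner_low = expected_false_positives(BATCH, 40)        # low end for scanners
scanner_high = expected_false_positives(BATCH, 70)       # high end for scanners

print(f"Validated findings: fewer than {validated_ceiling:.0f} false positives")
print(f"Traditional scanner: {scanner_low:.0f} to {scanner_high:.0f} false positives")
```

On those assumptions, a team would discard fewer than 20 findings in the first case and between 400 and 700 in the second.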
Questions every CISO is now asking
- Can a web application firewall virtually patch an unknown web vulnerability? Today’s WAFs are signature-based, so they can catch a known pattern like a textbook SQL injection but not a business-logic flaw that no signature describes.
- If this class of model is not public, why worry? Because similar capability is already within attackers’ reach, and something close to it can be assembled from several smaller, specialized models.
- What about shadow AI? Awareness alone will not hold the line. Detection has to be built into the access controls your teams already run.
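The WAF answer above can be made concrete with a toy example. The sketch below is not a real WAF rule set: the two regular expressions are simplified stand-ins for signatures, and the request bodies, parameter names, and the negative-quantity flaw are all hypothetical. It shows a signature match catching a textbook SQL injection while a business-logic abuse passes, because nothing in the request looks malicious on its own.

```python
import re

# Simplified stand-ins for WAF signatures; real rule sets are far larger.
SIGNATURES = [
    re.compile(r"'\s*or\s+1\s*=\s*1", re.IGNORECASE),  # textbook SQL injection
    re.compile(r"union\s+select", re.IGNORECASE),
]


def waf_blocks(request_body: str) -> bool:
    """Return True if any signature matches the request body."""
    return any(sig.search(request_body) for sig in SIGNATURES)


# A known attack pattern that a signature describes.
sql_injection = "username=admin' OR 1=1 --&password=x"

# Hypothetical business-logic abuse: a negative quantity that an application
# might turn into a credit. Every field is well-formed, so no signature fires.
logic_abuse = "item_id=42&quantity=-3&coupon=WELCOME10"

print(waf_blocks(sql_injection))  # True
print(waf_blocks(logic_abuse))    # False
```

Whether the second request is harmful depends entirely on how the application handles it, which is why finding it takes a test that exercises the application's logic rather than a pattern match on the traffic.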
The divide is between programs that test like attackers and programs that test on a calendar
For years, defenders benefited from the scarcity of elite offensive talent. As that capability becomes scalable, security can no longer be a periodic exercise. It has to be a continuous discipline. The organizations that keep running an annual pentest against a fifth of their assets are not just slower than the attacker. They are playing a different game entirely.
