FireCompass Research has published a new IEEE TechRxiv paper examining a technical weakness increasingly visible in autonomous cybersecurity systems built around large language models.
The work, Belief-State Engine: Augmenting LLMs for Principled Planning Under Partial Observability, was authored by two members of the FireCompass research team, including FireCompass co-founder and Head of Research, Arnab Chattopadhayay.
The paper starts from a practical observation familiar to many security practitioners who have experimented with autonomous agents over the past eighteen months. The systems often appear highly capable during short demonstrations, yet become inconsistent during longer engagements where information is incomplete and conditions evolve continuously.
An agent commits too early to an incorrect attack path. A noisy scan result changes the direction of reasoning disproportionately. A sequence of individually plausible decisions eventually leads the system away from operational reality. None of these behaviours is especially surprising to experienced operators. The question addressed by the research is whether these failures are merely implementation issues or whether they arise from deeper architectural limitations.
The authors argue the latter.
Security environments are inherently uncertain
Most real security operations take place under partial visibility.
An external attacker does not know the true configuration of a target estate. A defender rarely possesses complete telemetry. Detection signals arrive late. Logs are noisy. Scan results contain false positives and false negatives. Asset inventories drift. Even basic assumptions about system state may change during an engagement.
Human operators handle this uncertainty continuously, revising assumptions as evidence accumulates.
Current LLM agents attempt something different. Most maintain reasoning through growing text histories. Previous observations, actions, and explanations are serialised into prompts and passed repeatedly back into the model. Reflection loops, scratchpads, ReAct-style orchestration, and long-context approaches all broadly follow this pattern.
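As a rough sketch of that pattern (not taken from any particular framework; the function names below are placeholders, not real library calls), a transcript-driven loop typically looks like this:

```python
# Illustrative sketch of a transcript-driven agent loop (ReAct-style).
# call_llm and execute_tool are placeholders standing in for a model client
# and for scan/exploit tooling; they are assumptions for illustration only.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a chat-completion client here")

def execute_tool(action: str) -> str:
    raise NotImplementedError("plug in scanning or exploitation tooling here")

def run_transcript_agent(task: str, max_steps: int = 10) -> list[str]:
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        # The entire history is re-serialised into the prompt on every step;
        # any notion of uncertainty lives only implicitly in this growing text.
        action = call_llm("\n".join(transcript) + "\nNext action:")
        observation = execute_tool(action)
        transcript += [f"Action: {action}", f"Observation: {observation}"]
    return transcript
```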
The industry assumption has largely been that larger context windows and improved prompting will gradually stabilise these systems.
The paper challenges that assumption rather directly.
The central argument is that text history is not the same thing as belief. Two operational histories may imply the same underlying state while looking different in natural language form. Equally, histories that appear similar may correspond to very different security conditions underneath.
In practical terms, the model may generate coherent reasoning while internally maintaining an unstable understanding of the environment.
For cybersecurity systems operating for hours or days, that distinction becomes important.
The proposed architecture
The Belief-State Engine introduces a separate probabilistic layer outside the language model.
Instead of asking the LLM to infer uncertainty implicitly from conversational history, the framework maintains an explicit probability distribution representing possible hidden states of the environment. Each observation updates that distribution mathematically through Bayesian belief updates derived from Partially Observable Markov Decision Process theory.
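For readers who want that mechanism made concrete, the sketch below implements the standard discrete POMDP belief update from the textbook literature: the new belief is proportional to the observation likelihood times the one-step prediction under the transition model. The array shapes and variable names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def belief_update(belief: np.ndarray,
                  transition: np.ndarray,
                  obs_likelihood: np.ndarray) -> np.ndarray:
    """Discrete Bayes filter step: b'(s') ∝ P(o | s') * Σ_s P(s' | s, a) b(s).

    belief:         shape (n,), current distribution over hidden states
    transition:     shape (n, n), transition[s, s2] = P(s2 | s, action taken)
    obs_likelihood: shape (n,), P(observation received | s2, action taken)
    """
    predicted = belief @ transition            # predict the state after the action
    unnormalised = obs_likelihood * predicted  # weight by how well each state explains the observation
    return unnormalised / unnormalised.sum()   # renormalise to a probability distribution
```

In the architecture the paper describes, it is the output of this kind of update, rather than the raw transcript, that the language model is conditioned on.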
The language model no longer receives the full operational transcript. It receives the current calibrated belief state.
This separation is deliberate.
The language model handles reasoning and action selection. The belief engine handles uncertainty tracking and state estimation. The two functions are no longer mixed together inside a single prompt history.
The paper formalises this structure through a set of axioms and a soundness theorem establishing that the combined system behaves as a valid policy over the belief-state representation of the environment.
The mathematics is not introduced for abstraction alone. It addresses a practical operational issue: if autonomous security systems are expected to make decisions under incomplete information, they need a stable mechanism for representing and updating that uncertainty.
Experimental evaluation
The architecture was evaluated against both classical planning environments and a security-oriented attack-graph environment designed around hidden vulnerable or hardened states.
Within the attack-graph environment, agents received noisy observations with calibrated false-positive and false-negative rates. The system then had to decide whether to scan, exploit, patch, or wait while uncertainty persisted.
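As a purely illustrative numerical example (the rates and threshold here are invented for the sketch, not drawn from the paper), a belief that a single host is vulnerable rather than hardened might be updated after each noisy scan like this:

```python
def update_vulnerable_belief(p_vuln: float, scan_says_vulnerable: bool,
                             tpr: float = 0.9, fpr: float = 0.2) -> float:
    """Bayes update of P(vulnerable) after one noisy scan result,
    assuming a 90% true-positive and 20% false-positive scan."""
    if scan_says_vulnerable:
        num = tpr * p_vuln
        den = tpr * p_vuln + fpr * (1 - p_vuln)
    else:
        num = (1 - tpr) * p_vuln
        den = (1 - tpr) * p_vuln + (1 - fpr) * (1 - p_vuln)
    return num / den

# Starting from a 50/50 prior, one positive scan raises the belief to ~0.82;
# a second positive scan raises it to ~0.95. An agent with a 0.9 exploit
# threshold would act only after the second confirmation, whereas a single
# noisy hit would prompt it to scan again or wait.
p = 0.5
p = update_vulnerable_belief(p, True)   # ≈ 0.818
p = update_vulnerable_belief(p, True)   # ≈ 0.953
```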
According to the reported results, the belief-augmented architecture produced more stable behaviour and improved calibration when compared against several existing approaches, including reactive LLM agents, Chain-of-Thought reasoning, ReAct-style systems, natural-language belief trackers, and classical planners such as QMDP and POMCP.
The study additionally included ablation experiments intended to isolate the contribution of individual architectural decisions.
Why this matters now
The timing of this work is notable.
The cybersecurity industry is moving rapidly toward autonomous and semi-autonomous operational systems. Offensive security platforms, continuous validation systems, AI-assisted SOC workflows, and autonomous testing tools are beginning to participate in environments where errors have operational consequences rather than benchmark penalties.
Under those conditions, fluent language generation alone is insufficient.
What matters increasingly is whether the system can preserve coherent reasoning while observations remain incomplete, deceptive, or contradictory.
This is likely to become more important as organisations attempt to operationalise long-duration autonomous workflows rather than isolated demonstrations.
The paper also raises a broader point relevant to buyers and evaluators of autonomous security tooling. Existing benchmark approaches tend to measure visible outputs such as exploit success or task completion. They reveal relatively little about how the system internally maintains state under uncertainty.
That may eventually change.
Over time, evaluation criteria are likely to expand toward calibration quality, behavioural consistency, auditability, and stability during extended engagements. Those questions sit much closer to enterprise operational reality.
From research to operational practice
The publication reflects a broader research direction within FireCompass around autonomous offensive security systems operating in noisy and adversarial environments.
Much of the practical difficulty in autonomous security does not arise from generating individual actions. Modern models already perform that reasonably well in constrained settings. The harder problem is maintaining reliable operational reasoning once ambiguity, deception, incomplete telemetry, and environmental drift become persistent characteristics of the engagement itself.
Belief-State Engines represent one attempt to address that problem through a more structured probabilistic foundation.
Whether this precise formulation becomes widely adopted remains uncertain. However, the underlying issue addressed by the research is unlikely to disappear. As autonomous security systems become more deeply integrated into operational environments, the ability to reason consistently under uncertainty may prove as important as the ability to generate actions in the first place.
Reference
Chattopadhayay, A. and Halder, D. (2026). Belief-State Engine: Augmenting LLMs for Principled Planning Under Partial Observability. Preprint.
Research Paper – https://doi.org/10.36227/techrxiv.177223113.31284773/v1
About FireCompass
FireCompass is an Agentic AI platform for autonomous penetration testing and red teaming across Web, API, and infrastructure. It discovers shadow assets and web applications, safely validates what is exploitable, and connects findings into multi-stage attack paths with near-zero false positives. Unlike traditional scanners, FireCompass uncovers credential reuse, business-logic flaws, privilege escalation, and app-to-app or app-to-network lateral movement. It can operate autonomously or with expert-in-the-loop validation. FireCompass has 30+ analyst recognitions across Gartner, Forrester, and IDC, and is trusted by Fortune 1000 enterprises.
See What’s Actually Exploitable in Your Environment. Claim Free AI Pen Testing Credits → firecompass.com/explorer
