In the course of my work with LLMs, I’ve been examining a recurring pattern in how large language models are used inside real systems. In many settings, LLMs are treated as planners: they are asked to generate multi-step workflows, remediation strategies, operational playbooks, and even “autonomous” action sequences.

These plans often look convincing. They are structured, coherent, and plausible.

This has led to a growing belief that LLMs themselves can function as planning machines. Unfortunately, they cannot.

The distinction is subtle but fundamental: producing a plan that sounds right is not the same as choosing the right action under uncertainty. That difference is easy to overlook in toy problems or low-stakes workflows. It becomes critical in production systems, e.g., infrastructure, security environments, medical decision support, or financial operations, where acting incorrectly can be costly, irreversible, or catastrophic.

This essay is an attempt to make that distinction precise, and to explain why fluency should not be mistaken for decision-making.

Why “Sounds Right” Is Not Planning

When people say “the LLM planned correctly”, they usually mean “the sequence of steps sounded reasonable to me.” But planning is not about sounding reasonable. Planning is about selecting actions under uncertainty, where being wrong has consequences.

The distinction matters most once planning moves beyond toy problems.

What LLMs Are Really Trained to Do

An LLM is trained to predict the next token in a text. It answers the following question:

Given what has been written so far, what text comes next?

It learns a probability distribution over text based on its training data. What it does not learn, though, is the following:

It does not learn about real-world states.
It does not optimize success, safety, or utility.
It does not consider the consequences.

The Hidden Averaging That Quietly Breaks Planning

A key idea that most LLM discussions miss:

LLMs implicitly average over all situations they have seen during training.

When you ask: “Generate a plan to handle situation X”, there are unstated truths:

Situation X occurs in many different real contexts.
Some contexts are safe.
Some are risky.
Some fail catastrophically if handled incorrectly.

The LLM cannot see which context you are in, so it generates a plan that works on average across all those contexts. The averaging happens implicitly during training. A natural objection is that the request to generate a plan arrives at runtime, so how could training have anything to do with it? To answer that, we need to look more closely at how an LLM learns and how it performs inference.

What the model actually learns during training

During training, the model sees many training examples of the form:

prompt/prefix p (a piece of text)
continuation y (the next tokens)

It learns the conditional distribution of the continuation given the prefix:

P(y | p)

Now define a latent variable ω (small omega) that represents the “true context” behind a piece of training text. This variable is not labelled in the data. The true data-generating process can be viewed as:

y ~ P(y | p, ω)

Because ω is never observed, the best the model can learn is the marginal:

P(y | p) = Σ_ω P(y | p, ω) P(ω | p)

We can now explain the notion of “implicit averaging”. When ω is not pinned down by the prompt, the learned conditional distribution P(y | p) collapses into a weighted mixture over the latent contexts present in the training data, with weights P(ω | p).

A natural question then arises: since LLMs are implemented using Transformer-based neural network architectures, in what sense is Bayesian reasoning relevant? The connection is not architectural but interpretive, as we explain next.

In terms of implementation, the LLM is a very large deterministic neural network (a Transformer) trained with gradient descent to minimize cross-entropy loss. The function that the trained network approximates, however, is a statistical distribution that can be described by a Bayesian model. To summarize:

How is the model implemented?
A deterministic neural network trained using SGD / Adam
What function does the trained model approximate?
The conditional distribution of continuations given text, marginalized over latent causes (Bayesian model)

We now illustrate these ideas with a concrete example involving an LLM trained on text about latency spikes in cloud-hosted systems. Before doing so, it is useful to briefly outline how such training text comes about. A human annotator examines the available data, infers the underlying context, and writes a response accordingly. Importantly, this inferred context remains implicit and is not explicitly encoded in the training data.

Training example 1: [latent context (ω_1): LOAD]

Human-written context:
Q: How do we handle latency spikes during peak traffic?
A: Scale out the service, add replicas, and monitor error rates.

Training example 2: [latent context (ω_2): DB]

Human-written context:
Q: How do we handle latency spikes during peak traffic?
A: Investigate database contention, throttle requests, and optimize queries.

Training example 3: [latent context (ω_3): BUG]

Human-written context:
Q: How do we handle latency spikes during peak traffic?
A: Roll back the recent deployment and disable the new feature flag.

Training example 4: [latent context (ω_4): CIRCUIT]

Human-written context:
Q: How do we handle latency spikes during peak traffic?
A: Check circuit breaker thresholds and adjust rate limits before scaling.

The model does not see:

“Context = LOAD”
“Root cause = DB contention”
“Failure is catastrophic if wrong.”

It only sees many pairs of (question, answer) with similar wording.

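The model learns the conditional distribution P(answer | “How do we handle latency spikes during peak traffic?”), which mathematically is:

P(answer | q) = Σ_ω P(answer | q, ω) P(ω | q)

Here q is the question and ω ranges over the latent contexts (LOAD, DB, BUG, CIRCUIT).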

What does implicit averaging mean here?

Suppose in the training corpus:

50% of examples were LOAD-related
25% DB
15% BUG
10% CIRCUIT

Then the learned distribution might look like:

“Scale out the service” maps to high probability
“Check DB contention” maps to medium
“Roll back deployment” maps to lower
“Check circuit breakers” maps to lower

This is not reasoning. It is frequency-weighted marginalization.
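
The frequency-weighted marginal can be sketched numerically. This is a minimal illustration, not a description of how any real model is trained: the context shares are the hypothetical corpus percentages from the example, and it assumes, purely for simplicity, that each latent context always produces its one characteristic answer.

```python
# Hypothetical corpus shares P(context | question) from the example above.
context_prior = {"LOAD": 0.50, "DB": 0.25, "BUG": 0.15, "CIRCUIT": 0.10}

# Simplifying assumption: each context deterministically yields one answer,
# i.e. P(answer | question, context) is 1 for that answer and 0 otherwise.
answer_given_context = {
    "LOAD": {"scale out the service": 1.0},
    "DB": {"check DB contention": 1.0},
    "BUG": {"roll back deployment": 1.0},
    "CIRCUIT": {"check circuit breakers": 1.0},
}


def marginal_answer_distribution(prior, likelihood):
    """P(answer | q) = sum over contexts of P(answer | q, ctx) * P(ctx | q)."""
    marginal = {}
    for context, p_context in prior.items():
        for answer, p_answer in likelihood[context].items():
            marginal[answer] = marginal.get(answer, 0.0) + p_answer * p_context
    return marginal


learned = marginal_answer_distribution(context_prior, answer_given_context)
most_likely = max(learned, key=learned.get)

# Print answers from most to least probable.
for answer, prob in sorted(learned.items(), key=lambda kv: -kv[1]):
    print(f"{prob:.2f}  {answer}")
print("Most likely answer:", most_likely)
```

Under these assumptions the most likely answer is “scale out the service”, no matter which context the person asking is actually in.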

Why “Average Plans” are Dangerous

An average plan can be the worst possible plan in real life.

Consider:

In 80% of cases, Action A works.
In 20% of cases, Action A causes severe failure.
Action B is slower but safe in all cases.

A planner that reasons about consequences will choose Action B when uncertainty is high. An LLM will usually propose Action A. Why? Because its training objective rewards what is most common, not what is most robust.
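
The gap between “most common” and “most robust” can be made concrete with a small expected-utility calculation. The 80/20 split comes from the example above; the utility numbers and the names `benign` and `fragile` are assumptions chosen only for illustration.

```python
# From the example: Action A works in 80% of cases, fails severely in 20%.
belief = {"benign": 0.8, "fragile": 0.2}

# Hypothetical utilities (assumptions, not from the essay).
utility = {
    "A": {"benign": 10.0, "fragile": -100.0},  # fast, but severe in the minority case
    "B": {"benign": 5.0, "fragile": 5.0},      # slower, safe in all cases
}


def expected_utility(action):
    """Probability-weighted utility of an action over all cases."""
    return sum(belief[case] * utility[action][case] for case in belief)


def most_common_choice():
    """Pick the action that is best in the single most likely case."""
    likeliest_case = max(belief, key=belief.get)
    return max(utility, key=lambda a: utility[a][likeliest_case])


def consequence_aware_choice():
    """Pick the action with the highest expected utility."""
    return max(utility, key=expected_utility)


print("Most-common choice:      ", most_common_choice())
print("Consequence-aware choice:", consequence_aware_choice())
for action in utility:
    print(f"EU({action}) = {expected_utility(action):.1f}")
```

With these assumed numbers, Action A has an expected utility of -12 and Action B of 5, so a consequence-aware planner selects B even though A is the right call in most individual cases.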

Iterative Prompts: Helpful but Fundamentally Limited

A common response to the limitations described above is: “We can just ask follow-up questions and refine the prompt.”

Iterative prompting does help, but only under specific conditions.

When a user adds more detail over multiple interactions, the prompt may partially disambiguate the latent context ω. In those cases, the model’s output distribution becomes narrower, and the generated plan may improve.

However, iterative prompting does not change the fundamental mechanism by which the LLM operates:

The model still does not maintain an explicit belief over possible contexts.
It does not represent multiple competing hypotheses simultaneously.
It does not decide whether it is safer to act, probe, or defer action.

Iterative prompting works if and only if:

All remaining plausible contexts recommend the same action, and
Acting incorrectly is cheap, reversible, or immediately detectable.

When these conditions are violated, which is common in real systems, implicit averaging persists. The LLM still converges toward the most frequent successful narrative seen during training, even if a minority context would make that action dangerous.

In other words, iterative prompting reduces ambiguity only when ambiguity is already mostly harmless.
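
The two conditions can be written down as an explicit check. This is only a sketch: the function name, the context labels, and the cost numbers are hypothetical, and in practice the set of remaining plausible contexts is exactly what is hard to know.

```python
def prompting_is_enough(plausible_contexts, recommended_action,
                        cost_if_wrong, tolerable_cost):
    """Return True when both conditions from the text hold.

    plausible_contexts: contexts not yet ruled out by the prompt so far
    recommended_action: maps each context to the action that suits it
    cost_if_wrong:      maps each context to the cost of a wrong action there
    tolerable_cost:     largest cost still considered cheap or reversible
    """
    actions = {recommended_action[c] for c in plausible_contexts}
    same_action = len(actions) == 1
    mistakes_cheap = all(
        cost_if_wrong[c] <= tolerable_cost for c in plausible_contexts
    )
    return same_action and mistakes_cheap


# Hypothetical values for illustration only.
recommended_action = {"LOAD": "scale_out", "DB": "throttle", "BUG": "roll_back"}
cost_if_wrong = {"LOAD": 1.0, "DB": 2.0, "BUG": 50.0}

# Before refinement, three contexts remain and they disagree on the action.
before_refinement = prompting_is_enough(
    ["LOAD", "DB", "BUG"], recommended_action, cost_if_wrong, tolerable_cost=5.0
)
# After refinement, only LOAD remains and a mistake there is cheap.
after_refinement = prompting_is_enough(
    ["LOAD"], recommended_action, cost_if_wrong, tolerable_cost=5.0
)
print(before_refinement, after_refinement)
```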

What Real Planning Requires (and What LLMs Lack)

Planning, in the decision-theoretic sense, requires capabilities that LLMs fundamentally lack.

A real planner must:

Maintain an explicit belief over hidden states of the world.
Update that belief as new observations arrive.
Evaluate actions based on expected outcomes and risks.
Choose information-gathering actions when uncertainty is high.
Decide not to act when the risk of being wrong is unacceptable.

In real planning systems, uncertainty is not ignored or averaged away. It is represented explicitly.

A planner maintains a belief state: a probability distribution over all plausible hidden contexts that could explain what is being observed.

Conceptually, the belief state answers the question:

“Given everything I’ve seen and done so far, how likely is each possible explanation of what’s really going on?”

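Formally, the belief at time t is the posterior over contexts ω given the observations o and actions a so far:

b_t(ω) = P(ω | o_1, …, o_t, a_1, …, a_{t-1})

The planner then selects actions by maximizing expected utility under that belief, where U(a, ω) is the utility of taking action a in context ω:

a* = argmax_a Σ_ω b_t(ω) U(a, ω)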

For example, in a cloud incident:

High latency could be caused by load, database contention, a bad deployment, or a network issue.
All of these explanations remain possible until evidence rules them out.

The belief state assigns a probability to each of these possibilities and updates those probabilities as new observations arrive or diagnostic actions are taken.
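
A belief update of this kind can be sketched with Bayes’ rule over a discrete set of contexts. The prior, the likelihood table, and the observation names below are hypothetical values chosen for illustration; a real system would have to estimate them.

```python
def update_belief(belief, likelihood, observation):
    """Bayes' rule: posterior(ctx) is proportional to
    P(observation | ctx) * prior(ctx)."""
    unnormalized = {
        ctx: likelihood[ctx][observation] * prior
        for ctx, prior in belief.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under every context")
    return {ctx: weight / total for ctx, weight in unnormalized.items()}


# Hypothetical prior over the four explanations named above.
belief = {"load": 0.50, "db_contention": 0.25, "bad_deploy": 0.15, "network": 0.10}

# Hypothetical P(observation | context) for one diagnostic signal:
# did latency rise right after a deployment?
likelihood = {
    "load": {"spike_after_deploy": 0.10, "no_deploy_correlation": 0.90},
    "db_contention": {"spike_after_deploy": 0.20, "no_deploy_correlation": 0.80},
    "bad_deploy": {"spike_after_deploy": 0.90, "no_deploy_correlation": 0.10},
    "network": {"spike_after_deploy": 0.10, "no_deploy_correlation": 0.90},
}

posterior = update_belief(belief, likelihood, "spike_after_deploy")
for ctx, prob in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{ctx:14s} {prob:.3f}")
```

With these assumed numbers, a single observation moves “bad deployment” from the third most likely explanation (15%) to the most likely one (roughly 55%), while the other explanations remain possible.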

This explicit belief is what enables real planning behavior:

Acting cautiously when confidence is low
Running probes to reduce uncertainty
Avoiding actions that are safe in most cases but catastrophic in a few
Choosing to wait when the risk of being wrong is too high
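
These behaviors follow from comparing the expected utility of acting now with the expected utility of gathering information first. The sketch below assumes a perfect probe that reveals the true context at a fixed cost; the belief, utilities, and probe cost are all hypothetical.

```python
# Hypothetical belief after some evidence: two explanations remain.
belief = {"load": 0.6, "bad_deploy": 0.4}

# Hypothetical utilities U(action, context).
utility = {
    "scale_out": {"load": 10.0, "bad_deploy": -50.0},
    "roll_back": {"load": -20.0, "bad_deploy": 10.0},
    "wait": {"load": -2.0, "bad_deploy": -2.0},
}
probe_cost = 1.0  # assumed cost of a diagnostic that reveals the context


def expected_utility(action, belief):
    return sum(belief[ctx] * utility[action][ctx] for ctx in belief)


def best_action_now(belief):
    """Best action if the planner must commit under the current belief."""
    return max(utility, key=lambda a: expected_utility(a, belief))


def value_if_probing_first(belief, cost):
    """Expected utility when a perfect probe reveals the context,
    after which the best action for that context is taken."""
    informed = sum(
        p * max(utility[a][ctx] for a in utility) for ctx, p in belief.items()
    )
    return informed - cost


act_now = best_action_now(belief)
act_now_value = expected_utility(act_now, belief)
probe_value = value_if_probing_first(belief, probe_cost)
should_probe = probe_value > act_now_value

print("Best action without probing:", act_now, f"({act_now_value:.1f})")
print("Value of probing first:     ", f"{probe_value:.1f}")
print("Probe before acting?        ", should_probe)
```

Under these assumptions, the best immediate action is to wait, because both interventions are too risky at the current belief, and probing first is better still. Neither choice appears if the uncertainty is collapsed into a single most likely story.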

LLMs do none of this.

They do not maintain competing hypotheses.
They do not update probabilities.
They do not value information.

They collapse all uncertainty into a single fluent response. That is why LLMs produce plausible plans while planners make safe decisions.

Why this matters in practice

In real systems, uncertainty is not a corner case; it is the default.

In cloud operations, the same symptoms can correspond to load, bugs, network failures, or cascading retries.
In security systems, identical signals can originate from benign services, compromised hosts, or deception environments.
In medicine, similar symptoms can indicate harmless conditions or life-threatening ones.

In these settings:

Minority contexts are rare but catastrophic.
Actions are often irreversible.
Failure is expensive and highly visible.

An average-case plan is therefore not a “good enough” plan. It is often the most dangerous one.

This is precisely where LLMs are most likely to fail, not because they are weak language models, but because they are optimized for plausibility, not consequence.

Key Takeaway

LLMs are not planning machines.

They generate fluent plans by implicitly averaging across many unseen contexts present in their training data. This makes them effective at producing plausible actions, but unreliable at choosing safe ones under uncertainty.

Real planning requires explicit belief, uncertainty management, and consequence-aware decision making. The correct architecture is not “better prompting,” but wrapping LLMs inside belief-aware planners, allowing them to propose, but never to decide.
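
One way to picture “propose, but never decide” is a wrapper in which the language model only supplies candidate actions and a separate component makes the choice. This is a minimal sketch under stated assumptions: `stub_llm_propose` is a hypothetical stand-in for a model call, and the belief, utilities, threshold, and fallback name are invented for illustration.

```python
from typing import Callable, Dict, List

Belief = Dict[str, float]
Utility = Dict[str, Dict[str, float]]


def stub_llm_propose(prompt: str) -> List[str]:
    """Hypothetical stand-in for an LLM call; returns candidate actions only."""
    return ["scale_out", "roll_back"]


def decide(prompt: str, propose: Callable[[str], List[str]], belief: Belief,
           utility: Utility, catastrophe_threshold: float,
           fallback: str = "escalate_to_human") -> str:
    """The planner, not the proposer, makes the final choice."""
    safe = []
    for action in propose(prompt):
        if action not in utility:
            continue  # no consequence model for this proposal, so ignore it
        # Worst case over every context that is still plausible.
        worst = min(utility[action][ctx] for ctx, p in belief.items() if p > 0)
        if worst >= catastrophe_threshold:
            safe.append(action)
    if not safe:
        return fallback
    return max(safe, key=lambda a: sum(belief[c] * utility[a][c] for c in belief))


# Hypothetical consequence model.
utility = {
    "scale_out": {"load": 10.0, "bad_deploy": -100.0},
    "roll_back": {"load": -5.0, "bad_deploy": 10.0},
}

uncertain = decide("latency spike", stub_llm_propose,
                   {"load": 0.8, "bad_deploy": 0.2}, utility, -50.0)
confident = decide("latency spike", stub_llm_propose,
                   {"load": 1.0, "bad_deploy": 0.0}, utility, -50.0)
print(uncertain, confident)
```

In this sketch the same proposals lead to different decisions: while a bad deployment is still plausible, scaling out is vetoed; once evidence rules it out, scaling out is selected. The design choice is that the veto and the ranking depend on the belief, which lives outside the model.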

Closing

This essay is not an argument against LLMs, nor against their use in planning-adjacent systems. It is an argument for architectural honesty.

If a system must operate under uncertainty, reason about consequences, and avoid catastrophic minority outcomes, then belief representation and decision logic must live outside the language model.

In future posts, I plan to explore belief-augmented planners, hybrid decision architectures, and what “planning” should mean for AI systems that operate beyond controlled or reversible environments.

If you’ve encountered this failure mode in production systems, especially where mistakes are expensive, I’d be interested in hearing where the line between plausible plans and safe decisions showed up in your own work.

