The Assumption Gap
Why LLMs produce slop and how to close the distance
Using LLMs for subjective tasks like writing is genuinely challenging, and the reasons run deep.
Yet it’s fairly easy to rapidly generate mediocre outputs that are “good enough”. This can be particularly deflating if you’re one of the strong writers within an organization (e.g. PMs, Engineering Managers, UX Designers) who approach the craft with a deep sense of aesthetics and care.
This new world puts you between a rock and a hard place. On the one hand, you feel that if you don’t adopt AI you’ll get left behind. Perhaps the PRDs or performance reviews that used to take you hours are being done by someone else within minutes. They’re clearly mediocre compared to your work, but they’re “good enough” within your company’s culture. Meanwhile, you don’t yet trust AI, or your skill in wielding it, to produce outputs you’d consider actually good. So it’s incredibly tempting to defer using it. On the other hand, you don’t want to produce that sloppy mediocrity yourself, since doing so would betray your deep sense of aesthetics and commitment to your craft.
People caught in this tension typically thrash, endlessly stressed and frustrated, until they give up and collapse into either over-control or under-control.
It’s tempting to read this frustration as a sign that you’re not suited for AI, or that achieving these new standards of “productivity” necessarily requires surrendering your standards. Neither is true.
The discomfort is indeed very real. But I’d offer that there’s a happy middle between these extremes. One that can rapidly improve your LLM’s outputs without jeopardizing your productivity.
Why is it so hard to produce good outputs?
Natural language is inherently underspecified: no sentence fully captures its intended meaning on its own. Effective communication between two individuals relies on a shared set of implicit assumptions between them.
Imagine reviewing a pull request from a brand-new hire and commenting only, “Clean it up and then I’ll approve it”. Without shared context, that instruction is wildly underspecified. You’d have to spell out what “clean” means in excruciating detail. Onboarding exists to build that shared set of assumptions so a vague comment like this becomes legible without explanation.
This onboarding heuristic can be helpful for working with LLMs. They’re expert role-players with access to all public knowledge that have no pre-existing relationship with you. Imagine that every time you submit a prompt a temp worker gets hired with a 200k context window. They have the sum of all human knowledge at their fingertips but absolutely no idea who you are. They’re oriented towards broad societal values but don’t know what your team values at high resolution. This is germane for highly subjective or context-sensitive tasks. It’s common for model outputs for such tasks to regress to the mean of the training distribution and produce slop.
Subjective tasks like creative writing are especially hard because nebulosity confronts you immediately at every layer of abstraction in a way that’s less true for tasks like code generation. Failure to sufficiently frame and communicate what you want rapidly leads to underspecification in the prompt and eventually slop.
Subjective tasks also lean far more on an embodied and intuitive sense of what’s “good”. You often can’t fully specify what you want because you don’t fully know until you see concrete instances and notice how you feel about them. So a frontier model’s propensity to generate slop isn’t just about missing “information”. It’s also about missing crucial details of the problem specification that don’t yet exist in articulable form. You feel something is off before you can articulate why. This isn’t a bug. It’s the nature of nebulous concepts like “good writing” or “helpful feedback”.
LLM folk-wisdom has devised role-playing prompts (e.g. “Act as a software engineering manager at a FAANG-caliber company.”) in an attempt to provide more of this framing. Such instructions do move the model closer to the right spirit. A software engineering manager and a professional artist may approach the same software engineering task very differently. But “software engineering manager” is still a broad category. It doesn’t capture what your team values, your coding standards or your communication norms. Role-play narrows the space but often insufficiently so for many subjective tasks like writing.
Endless iteration on a prompt without a coherent strategy to navigate the nebulosity within the space of iteration often fails, because it treats a stochastic system as deterministic. But it’s common because creating value with LLMs is uncomfortable.
What can you do about it?
You shouldn’t simply attempt to specify everything up-front in your instructions (i.e. over-control) and neither should you meekly accept whatever the model produces the first time around (i.e. under-control).
There’s a better third choice. The technique below provides a structured process to iteratively surface and correct key implicit assumptions that actually matter for your subjective task. This technique is quite general and can also be useful for complex objectively verifiable tasks.
Here’s the high-level core loop as a list of steps (a short code sketch follows the list):
Add instructions to your prompt asking the model to surface its implicit assumptions before generating its output.
Editing is often far easier than generating, and LLMs can rapidly generate lots of text.
Shifting the burden of generating these assumptions also shifts a substantial amount of onboarding effort onto the LLM. Your role shifts to iteratively critiquing the assumptions it generates.
Inspect the assumptions before evaluating the output. If any are wrong, add clarifying context to your prompt and re-run.
Once the assumptions look right, evaluate the output. If it’s still off, add instructions describing what “good” would look like and re-run.
Repeat until you’re satisfied or run out of patience.
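To make the loop concrete, here’s a minimal sketch in Python. The `call_llm` helper, the `BASE_PROMPT`, and the `input()`-driven critique are all stand-ins I’ve made up for illustration; swap in whatever model API and prompt you actually use. The point is the structure: correct assumptions first, re-run, and only then iterate on the output itself.

```python
# Minimal sketch of the core loop. `call_llm` is a placeholder, not a real API.

def call_llm(prompt: str) -> str:
    """Placeholder: replace with your actual model call (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

BASE_PROMPT = """\
## Why this matters
...
## What I want
...
## Response Format
Start your reply with a nested-bullet section called `Implicit Assumptions`.
"""

def build_prompt(base: str, assumptions: list[str], instructions: list[str]) -> str:
    """Re-assemble the prompt with whatever corrections have accumulated so far."""
    prompt = base
    if assumptions:
        prompt += "\n## Important Assumptions\n" + "\n".join(f"- {a}" for a in assumptions)
    if instructions:
        prompt += "\n## Important Instructions\n" + "\n".join(f"- {i}" for i in instructions)
    return prompt

def refine(base: str, max_iterations: int = 5) -> str:
    assumptions: list[str] = []   # corrections to the model's implicit assumptions
    instructions: list[str] = []  # positive instructions describing what "good" looks like
    reply = ""
    for _ in range(max_iterations):
        reply = call_llm(build_prompt(base, assumptions, instructions))
        print(reply)

        # Inspect the surfaced assumptions before judging the output.
        correction = input("Assumption to correct (blank if they all look right): ")
        if correction:
            assumptions.append(correction)
            continue  # re-run before evaluating the output itself

        # Evaluate the output; if it still feels "off", add a positive instruction.
        instruction = input("What should it have done instead? (blank if satisfied): ")
        if not instruction:
            break
        instructions.append(instruction)
    return reply

if __name__ == "__main__":
    print(refine(BASE_PROMPT))
```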
The technique works because it creates a feedback loop between two capacities:
Embodied discernment: Your intuitive sense that something is off in the output, even before you can articulate why.
Specified Patterns: Your provisional, explicit sense of what “good” means, accumulated in your prompt.
Your embodied discernment notices something’s off. You take that nebulous feeling and articulate it as a concrete pattern, which becomes a new instruction in the prompt. Articulating the pattern gives you sharper language to capture what “good” intuitively means to you. Sharper language makes it easier to notice finer-grained distinctions on the next iteration, which leads to more refined patterns, and so on.
Embodied discernment without the capacity to articulate concrete patterns stays vague (“I don’t like it but I can’t really say why”). Articulating concrete patterns without embodied discernment creates a tyranny of meaningless checklists. Integrated into a feedback loop, the two spiral upwards together.
This approach allows you to rapidly “onboard” the model onto your culture (i.e. the specific values, framings, quality standards, etc.) within the scope of the context window to make its outputs feel like they’re yours. In this approach, prompt engineering is inherently iterative.
Detailed step-by-step explanation of the technique
Setup (one-time per task)
Step 1
Structure your prompt with four explicit sections:
Why is this task valuable?
What outcome do you want from the model?
Any additional details on how you’d like it to be done.
Any other informational context or assumptions that the model should make.
How much to include is an empirical question that’s deeply task-dependent. As a useful heuristic, imagine a temp worker with 200k tokens of memory and access to all public knowledge. Would they find your description legible? Since they already know everything that’s public, you wouldn’t waste time explaining basic concepts (e.g. profit and loss, how gravity works) to them. Prioritize knowledge that’s esoteric or idiosyncratic to your specific context.
This is also why the overall process works. Getting the model to surface assumptions lets you lazily bootstrap context from the minimal dose the model needs to produce “good enough” outputs for iteration, rather than trying to specify everything upfront.
Here’s an example of what such a prompt looks like:
## Why this matters
I’m trying to turn a messy brainstorm into something I can share with my editor. The outline needs to be legible to someone who wasn’t in my head when I wrote the original.
## What I want
Convert my stream of consciousness document into a coherent outline.
## How I want it done
Use nested bullets. Group related ideas. Preserve my phrasing where possible; don’t sanitize the language.
## Additional assumptions
I’m writing for a technical audience familiar with LLMs. The final essay will be ~1500 words.
Step 2
Add instructions asking the model to surface its implicit assumptions before generating its reply.
## Response Format
The definition of “success” on this task is deeply contextual. It involves making a substantial amount of implicit assumptions about a wide array of things. Especially since natural language is really underspecified.
It’s easy for me to forget what context you may or may not have.
Please start your reply with a section of nested bullets called `Implicit Assumptions`. The goal of this section is to make explicit any implicit assumptions you’ve made based on the instructions I’ve provided in this user turn. Correcting these implicit assumptions would allow the two of us to get onto the same page.
Each assumption should be concrete, specific and load-bearing.
- The assumptions should not be redundant with the explicit context and instructions I’ve already offered here. After all, they’re meant to be *implicit* assumptions that I’m asking you to make *explicit*.
- The assumptions should be load-bearing in the sense that if I invalidate any of these implicit assumptions, one would necessarily expect the rest of the reply to materially change.
Please group these implicit assumptions based on:
- Why I value the response.
- Concrete characteristics of a good response.
There’s room to experiment with these instructions. I’ve arbitrarily chosen the dimensions of “Why I value the response” and “Concrete characteristics of a good response”. But those may not be the most germane dimensions for your task. Choose dimensions that would help the LLM best onboard onto your task.
Evaluating Assumptions
Essentially, we first check that both the categories of assumptions and the assumptions themselves seem correct and reasonable for this task, before we move on to directly changing the output.
Step 3: Run the prompt.
Step 4: Inspect the generated assumptions.
Step 5:
Do the default categories (i.e. value, concrete characteristics) of the generated assumptions seem germane and helpful for this task?
If not, consider how the categories could be improved, and try changing the original prompt. Then go back to step 3 to re-run it.
Step 6:
Do the assumptions within the current set of categories seem correct for this task?
If not, add a new section to the original prompt called ## Important Assumptions, and add bullet points of corrections in there. Then go back to step 3 and re-run it.
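For example, continuing the outline prompt from earlier, such a correction section might look like this (the specific bullets are illustrative):

## Important Assumptions
- The outline is a working document for my editor, not something to publish as-is.
- “Preserve my phrasing” applies to distinctive turns of phrase, not to every sentence.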
Evaluating Outputs
Step 7: Examine the output. Specifically, check if it feels “off” in any way. This is an example of embodied discernment.
Step 8:
If the output is “off”, imagine a specific pattern of what the model should have done instead. Add an instruction for it as a bullet point to the original prompt in a new section called ## Important Instructions. This crystallizes another pattern in the overall specification of the task.
Ensure this additional instruction is a positive instruction. Negative instructions of “don’t do X” are harder for models to follow than positive instructions (“do Y”). Most instructions to correct model mistakes can be written as positive instructions. It’s easiest to get what you want when you’re clear about what you actually want. For example:
Example 1
Bad - Your reply shouldn’t be too long.
Good - Your reply should be no more than five sentences.
Example 2
Bad - Don’t reply in full sentences and don’t be long-winded in your response.
Good - Your response should be entirely in nested bullets. It’s okay to sacrifice grammar for length. Ensure each bullet makes a direct and specific point.
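Accumulated in the prompt, such corrections end up as a section like this (bullets illustrative, reusing the examples above):

## Important Instructions
- Your response should be entirely in nested bullets.
- It’s okay to sacrifice grammar for length; ensure each bullet makes a direct and specific point.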
Suppose I’ve spent a bunch of time tuning the prompt. How can I be certain that my current attempts are the best that I can do for this task?
You can’t. Welcome to the nebulosity of stochastic systems. As discussed in my previous post, fault attribution and iterating on complex systems is extremely challenging.
Eventually, the output will stop triggering a reflexive “this feels off” gut response. Trust that instinct. Working effectively with AI places greater demands on your capacity for self-trust and embodied discernment. Similarly, having an unchecked sense of perfectionism can be an impediment to using AI since certainty is structurally unavailable.
It’s possible that your specific task is too complex for the method in this post. It’s helpful to approach all such stochastic tasks with a timebox. This creates space for check-ins to decide whether to keep pushing on the current approach, get help, or try a more sophisticated approach.
Closing
It’s not possible to create a prompt tuning process that can generate certainty for most subjective tasks. Moreover, crafting a process to seek such certainty is a non-goal. The process described above certainly doesn’t provide any such guarantees. This isn’t a limitation. Patterns are provisional. Grip them too tightly and you get over-control. Surrender to the model and you get under-control. The middle path involves forming patterns provisionally, staying connected to embodied discernment, and revising as necessary within an iterative loop.
Cultivating this capacity involves deliberate practice. Certain tools can make the practice easier: Custom GPTs, Claude Projects, Gemini Gems, etc. let you save and reuse accumulated patterns. A future post will provide concrete examples on GitHub, along with accompanying instructions.


