There's been a meme floating around for a while that "LLMs are bad at writing". I've said it, and I've had other people say it to me. When it's said out loud, there's something about it that immediately feels right. On the other hand, it's obviously not true. At least, it's not superficially true. For example, consider how good current SOTA models like Gemini 2.5 Pro are at generating text compared to the LSTMs of ~10-12 years ago. It's frankly amazing how good they are at generating long and coherent prose.
So what do I actually mean when I say that "LLMs are bad at writing"? I thought about this a bit and concluded that I'm actually talking about intimacy. Specifically, my real complaint is that I don't feel seen, heard, or understood when I ask an LLM to engage in complex writing tasks. For example, I feel a certain friction whether I ask an LLM to generate an email from a terse instruction, or ask it to take stream-of-consciousness notes and transform them into a blog post in my voice. There's always this nagging feeling of "that's not me", or perhaps "I wouldn't have written it like that".
So what would it take to feel seen by the machine in this way? It might be instructive to take a step back and examine what happens between two living systems (i.e. two people). Suppose I ask a friend to "take this bullet list draft and turn it into a blog post in my voice", along with the draft itself. All language is inherently underspecified and requires an overall backdrop for it to be properly understood. Therefore, for my friend to generate something that feels right to me, they'd have to be able to properly track my overall 4P stack of knowledge, which I discussed in a previous post on relevance realization. That is, they'd have to be able to track and understand how the various aspects of knowledge that I possess, and what about them I find relevant, evolve over time. And then use this knowledge to infer my overall framing of the problem and produce a "reasonably good" response to my prompt.
Maths, coding and reasoning tasks are an easier special case of this. The inherent power of mathematics, logic, coding, etc. is that they allow us to construct a fictional axiomatic universe that follows certain well-understood rules. By aligning that universe with inter-subjective agreement, we can construct "objective reality" and gain extremely profound predictive power. So when we communicate within those domains we're automatically able to import a substantial amount of framing that makes communication efficient. More importantly, it makes communication clear even when not all the parameters of the frame are explicitly specified in language. As an aside, this is related to why scientists prefer extremely specific and rigorous definitions of technical terms. For example, when we prompt an LLM to "add 12+7 and print only the result", there would likely be very little inter-subjective disagreement between two reasonable people about what the instruction means. You'd actually be quite surprised if they weren't sure whether "12" and "7" should be parsed as integers, or about the underlying rules governing the "+" operator, or if they somehow believed that "print" was referring to a printer.
This same shared "objective" meaning and clarity for instructions within the domains of maths, coding and reasoning makes it easy to create verifiers for RL training pipelines.
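To make the idea of a verifier concrete, here's a minimal sketch of what one might look like for the toy prompt above. The function name and the exact-match criterion are my own illustrative assumptions, not a description of any particular training pipeline:

```python
# A toy verifier for the prompt "add 12+7 and print only the result".
# Because the frame is shared and unambiguous, the reward can be a simple
# programmatic check rather than a human judgment call.

def verify_addition_output(model_output: str) -> float:
    """Return 1.0 if the model printed exactly the correct sum, else 0.0."""
    expected = str(12 + 7)  # "19"
    return 1.0 if model_output.strip() == expected else 0.0

print(verify_addition_output("19"))                 # 1.0
print(verify_addition_output("The answer is 19."))  # 0.0, didn't print *only* the result
```

The point isn't the code; it's that the success criterion can be written down at all. For "turn my notes into a blog post in my voice", there's no comparably crisp check.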
But even maths, coding and reasoning are rife with tasks that would likely lend themselves to far more inter-subjective disagreement. For example, how would one determine if a given proof was "elegant"? Or whether a given result was "trivial"? Superficially, we'd be tempted to say that the answers to such questions are normative in nature, whereas the answer to "add 12+7 and print only the result" is objective in nature. But notice that both sets of questions are fundamentally normative in nature. It's just that the latter implicitly presupposes the frame of our familiar subjective/objective duality.
But if we're determined to use this subjective/objective frame, then tasks at the far end of subjectivity would be the sort that I've discussed above. For example, asking an AI to take a stream of consciousness that I've written and turn it into coherent prose in my voice.
How would we go about bridging this gap, at least theoretically? What we're really talking about is the extent to which the entity that we're communicating with (i.e. a human or an AI system) is able to track and leverage the evolution of my 4P stack of knowledge. In the limit, for a system to have a complete understanding of us, it would literally need to be us in the world, whilst also being non-competitive with us. As an aside, there are some very interesting theological questions that one could ask here to explore the theoretical limits of intimacy, whether between people or in human-AI relationships.
Fortunately, not all tasks require this depth of intimacy for the interaction to be minimally functional. Your barista rarely needs to know your life story to make you a good cup of coffee, and vice versa. Your AI likely doesn't need to know your life story to turn mundane prompts into boilerplate emails. And you likely don't need to know the details of how it was trained, or have a theory of its "mind", to find such generations minimally functional. However, it likely needs a greater depth of intimacy into your cognition to assist you in tasks that seem inherently more inter-subjective in nature. And you likely need a proportionate theory of its mind to understand what it knows about you. I suspect that a breakdown in this reciprocal opening is what leads to what we often call the "jagged frontier".
What I find particularly exciting, fascinating and frightening is that I've been able to prompt existing SOTA models to simulate varying depths of such intimacy with me. I suspect that there's a "there" there. My current suspicion is that existing training pipelines are likely sufficient to train models that are "good at writing", that is, models with deep intimacy into the cognition of their users. It's just that existing recipes don't train on the right sort of data, or evaluate against the right sort of benchmarks.
Some future posts will explore what the right set of evaluations might be to give AI systems enough intimacy to generate reasonably good prose. This is closely related to the question of how one might construct appropriate verifiers to perform RL on this task.
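To gesture at the shape of that question, here's one deliberately naive sketch: score a candidate draft by how similar it is to a user's past writing. Everything here, the `voice_score` name, the bag-of-words stand-in, the averaging, is my own illustrative assumption; a real verifier would need something far richer (a learned embedding, a trained reward model, or something else entirely).

```python
# A naive "voice similarity" verifier sketch. In practice one would swap the
# bag-of-words stand-in below for a learned text embedding or a trained
# reward model; this is only meant to show the shape of the interface.

import math
from collections import Counter

def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def voice_score(candidate: str, past_writing: list[str]) -> float:
    """Score a candidate draft by its average similarity to the user's past posts.

    Returns a value in [0, 1]; higher means "closer to the user's voice"
    under this (very crude) notion of similarity.
    """
    cand_vec = Counter(candidate.lower().split())
    scores = [_cosine(cand_vec, Counter(post.lower().split())) for post in past_writing]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical usage with a tiny corpus of the user's previous posts:
corpus = [
    "I've been thinking about relevance realization and the 4P stack lately.",
    "There's a meme floating around that LLMs are bad at writing.",
]
print(voice_score("LLMs can't seem to capture my voice when they write.", corpus))
```

Even this toy makes the difficulty visible: surface word overlap is a terrible proxy for "my voice", and deciding what the right proxy would be seems to be where most of the hard work lives.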
Disclaimers
I should note that I was deliberately loose with my terms in this post. I wrote this mostly to clarify my thinking around the issue.
In particular, I know that some of my readers might find it challenging to see me use the word "intimacy" in this way, especially since LLM-based systems don't seem to be alive in any conventional sense. I fully acknowledge this. I mostly used the word to point directionally towards a specific phenomenon. If I were writing an academic paper or something more formal, I would have either come up with a different term like "cognitive alignment" or reused an existing one.
It's possible that the overall ontology of knowledge (i.e. 4P) that we've discussed here might not be sufficient for the sort of tasks we care about. I'm very open to that.