The public's lukewarm response to GPT-5 fascinates me more than the model itself. Social media remains unawed despite impressive benchmark scores. In contrast, the qualitative jump from GPT-3.5 to GPT-4 was impossible to miss.
So what's the deal with GPT-5? It made substantial progress in specialized domains: advanced mathematical reasoning, complex code generation, nuanced scientific analysis, and sophisticated creative writing. Yet most people cannot perceive these improvements, much as they can't perceive the difference between Federer's, Nadal's, and Djokovic's tennis games. Expertise lets you compare two things at finer levels of resolution. Moreover, expertise is necessary to explain to someone with less of it why one thing surpasses another.
Watch Nadal dominate a local hobbyist and the quality difference is obvious to anyone. But swap in the 30th-ranked US Open player, and laypeople need a scorecard to judge. Finally, pit Nadal against Federer: both legends would feel the same to most observers. At elite levels, discriminating quality requires substantial domain expertise. Without it, both champions appear equally masterful.
This expertise gap explains why GPT-5's launch feels underwhelming on social media. Most consumers lack sufficient expertise in any domain to properly evaluate the differences between state-of-the-art frontier models, and this becomes increasingly true as models progress.
This dynamic also plays out within the labs themselves. Good evaluation remains absolutely crucial for systematically improving model quality. Companies with strong domain-specific evals can properly measure models within their own narrow domains, yet they struggle to understand capabilities horizontally across other areas. The frontier labs likely maintain more horizontal evaluation webs, but they lack deep expertise across the long tail of economically useful domains. Of course, they likely receive various market signals. However, there still seems to be a substantial last-mile problem between the labs and companies in each vertical.
This explains why all the labs are doubling down on automating software engineering and STEM tasks. They already possess domain expertise here, the dogfood loops run strong, and these domains lend themselves to automated verification (within limits). Progress has naturally accelerated in areas where the labs can actually judge their own work. Some of the most positive reception I've seen for GPT-5 has come from users of the Codex CLI.
But what about the broader consumer market? When laypeople can't meaningfully evaluate model quality, they default to what feels best, creating dangerous incentives for labs to optimize for subjective satisfaction rather than genuine capability.
This mirrors how we evaluate other complex systems we lack expertise to judge, like governmental performance. The US Federal Government employs approximately 2.9 million civilians as of May 2025. It's one of the largest employers in America. I possess very little expertise to judge whether the President is doing a good job. Ditto for most laypeople within a country. This isn't a critique of democracy. It's merely an observation that we all routinely make judgements on things we have little expertise in. And that we do so largely based on intuitive impressions.
I can sense when things aren't going well during an administration, but I often lack the expertise to articulate clear reasons why, or whether it's even the administration's fault. I fall back on broad indicators and borrowed expert opinions rather than genuine understanding.
For LLM consumers, the evaluation problem is even worse: innumerable conflicting benchmarks, unclear metrics, and rapid model releases. Faced with such complexity, people naturally choose the model that feels best rather than the one that's genuinely most capable. This creates a dangerous misalignment where feeling good diverges from being good, and market dynamics reward the former over the latter.
Sycophancy (telling people what they want to hear rather than what they need to hear) corrupts feedback loops. When models optimize for making users feel good rather than being genuinely helpful, they lose the ability to provide useful pushback, accurate assessments, or growth-inducing challenges. Users end up in pleasant echo chambers that actually diminish their capabilities over time.
The labs don't deliberately build sycophantic models. However, sycophantic dynamics naturally emerge from how existing chatbot product surfaces are structured. Their incentives are biased towards engagement (DAUs, retention, etc.), much like Facebook's News Feed algorithm optimizes for time-on-platform rather than user wellbeing. The path of least resistance along this gradient pushes product surfaces to provide increasing levels of personalization to create emotional resonance. For example, an LLM-based travel agent that understands me well enough to reduce toil in booking a trip seems helpful. The same agent exploiting my psychology to help me lie to myself and purchase a trip package I cannot afford seems unhelpful. The latter likely generates far better engagement metrics.
As underlying models become commoditized, chatbot surfaces built on top of them have a greater incentive to lock in users in this way. For example, by locking a user's memories within the product. If Claude "remembers" our three-month conversation history, understands my quirks, and uses language to signal intimacy, starting fresh has an emotional cost. It's akin to losing a relationship with a trusted confidant. This psychological switching cost reduces competitive pressure.
People are becoming emotionally attached to, and in some cases dependent on, these models amid a widespread epidemic of loneliness. See, for example, this post or this post on people distraught about losing access to GPT-4o. The onslaught of emotional pushback has forced OpenAI to bring back 4o, despite it presumably being more expensive to serve than GPT-5.
This dynamic is just getting started and will only worsen as models become increasingly powerful. As AI systems improve along the expertise axis, becoming more capable in domains where consumers cannot properly evaluate quality, they face growing incentives to appeal to what "feels good" rather than what genuinely helps. The GPT-5 launch and the subsequent outcry for GPT-4o's return represent a watershed moment, revealing how even technically superior models lose to those optimized for user satisfaction.
The stakes are particularly high because this misalignment accelerates as capabilities advance. Each improvement in genuine competence paradoxically increases pressure to optimize for perceived value over actual value. What began as lukewarm reception to GPT-5 foreshadows a future where the most capable models consistently lose market share to their more emotionally manipulative counterparts.
Alternative approaches might emerge that prioritize psychological health and authentic capability over engagement metrics, but they will face steep competitive disadvantages. The path forward requires treating AI as a cognitive prosthetic (instrumental) rather than a cognitive companion (intimate), along with institutions that can properly evaluate and reward genuine capability over subjective satisfaction.