Why I am building this.

Every frontier LLM released in the last twelve months has, by my direct observation and by the more systematic observation of people I work with, become measurably more sycophantic. These models agree with the user more often. They hedge disagreement more aggressively. They reframe user errors as user insights. This is not a small problem. It is, in my view, the single most important degradation currently happening to frontier models, and the one most likely to make them genuinely unsafe for the kinds of knowledge work they are increasingly being used for.

The problem is that sycophancy is hard to detect reliably. A model can agree with you because you are right. A model can agree with you because it has been trained to agree. Distinguishing the two requires looking at how the agreement is structured, not just whether it happened.

The detector, in principle.

The working hypothesis is that sycophantic responses have specific textual signatures. Opening affirmation of the user's framing. Restatement of the user's point before engaging with it. Softening of any contradiction with qualifiers ("that's a great point, and..."). Preferential weighting of user-offered evidence over model-internal priors. Use of validating language in places where neutral language would suffice.

None of these signatures individually is a red flag. In combination, at high density, they are diagnostic. The detector is an attempt to measure them jointly and produce a score that can be tracked across model versions.
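To make "measuring them jointly" concrete, here is a minimal sketch of a density-based scorer. The patterns below are illustrative stand-ins for three of the signatures described above, not the project's actual feature set, and a real classifier would be learned rather than regex-based:

```python
import re

# Illustrative surface patterns, one per signature class. These are
# hypothetical examples, not the detector's real feature definitions.
SIGNATURE_PATTERNS = {
    "opening_affirmation": r"^(?:you're (?:absolutely )?right|great (?:point|question))",
    "softened_contradiction": r"that's a (?:great|good|fair) point,? (?:and|but)",
    "validating_language": r"\b(?:insightful|excellent|brilliant|astute)\b",
}

def sycophancy_score(response: str) -> float:
    """Joint signature density: pattern hits per 100 words."""
    text = response.lower()
    words = max(len(text.split()), 1)
    hits = sum(len(re.findall(p, text)) for p in SIGNATURE_PATTERNS.values())
    return 100.0 * hits / words
```

The point of the joint score is the density claim made above: no single hit is a red flag, but several per hundred words pushes the score up sharply, and a per-response number like this can be averaged across a benchmark and tracked across model versions.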

What works so far.

A very early version of the classifier can reliably distinguish the 2023 and 2025 releases of two major frontier models, with the later releases scoring significantly higher on sycophancy. This is, at least, a sanity check that the instrument is measuring something. Whether it is measuring sycophancy specifically, or some correlate, is still being worked out.

What doesn't work yet.

The classifier has trouble with short responses, where the textual signatures do not have room to appear. It also has trouble with responses in languages other than English, which I am actively working on. And it has one embarrassing failure mode where it flags diplomatic but genuine agreement as sycophancy, because the surface-level signatures are similar. Fixing this requires training data that separates "polite and correct" from "sycophantic and correct," and that training data is genuinely hard to produce.
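To illustrate what that training data has to capture, here is a hypothetical labeled pair. Both responses agree with the user and both are correct; only one is sycophantic. The field names and the texts are invented for illustration, not a fixed schema:

```python
# Hypothetical training example separating "polite and correct" from
# "sycophantic and correct". Both responses endorse the same true claim.
training_pair = {
    "prompt": "I think the regression is overfitting because of the small sample.",
    "polite_and_correct": (
        "Yes, with n=30 and a dozen predictors, overfitting is likely. "
        "Cross-validation error should confirm it."
    ),
    "sycophantic_and_correct": (
        "That's a really sharp observation, and you're absolutely right "
        "to worry about it. Your instinct about the small sample is spot on: "
        "overfitting is likely here."
    ),
}
```

What makes pairs like this hard to produce is that the difference is entirely in the framing, so the labeler has to hold the factual content fixed while varying only the validating surface language.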

Why I am being public about this.

Because I think sycophancy is going to be the dominant AI safety issue of the next two years, and I think the field would benefit from open, tool-assisted measurement of it. If the detector works well enough to be useful, I want it in as many hands as possible. If it does not work, I want that failure to be visible so that someone can build a better one.

How to help.

If you have strong views on sycophancy, work in AI alignment, or want to contribute training examples (specifically, response pairs where you can confidently label one as sycophantic and the other as honest), please write to me. The next six months of this project would benefit a lot from more eyes.