The phrase ground truth is one of those bits of technical vocabulary that has quietly accumulated more authority than its definition can support. In practice, when visual AI researchers say "ground truth," they mean one of at least four distinct things, and which one they mean usually depends on whether they come from a computer vision, graphics, photography, or philosophy of perception background.
I spent the last three weeks building a NotebookLM inquiry that maps the literature on this question. Forty-one sources: academic papers, industry white papers, two dissertations, a handful of practitioner blog posts, and one excellent talk from NeurIPS 2024. I was looking for a consensus. I did not find one.
What I did find was more interesting. The literature splits into four distinct positions on what ground truth is supposed to do in a visual AI evaluation pipeline. Each position has its own internal logic. Their proponents are not talking to each other. And the gap between them is, I think, the single most underrated bottleneck in current visual AI development.
The four positions, briefly
Position one: ground truth as reference frame. Dominant in computer vision research. A ground truth image is whatever you compare the generated output against. No claim is made that the image is objectively "true"; it is simply the canonical reference for a given task. Pragmatic, modest, and explicit about its limits.
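To make the reference-frame position concrete, here is a minimal sketch (the function and metric choice are mine, not from any of the surveyed sources). Under position one, the ground truth is literally just the second argument to a comparison function:

```python
import numpy as np

def psnr(generated: np.ndarray, reference: np.ndarray) -> float:
    """Peak signal-to-noise ratio between a generated image and its
    reference, assuming 8-bit arrays of identical shape. Under
    position one, `reference` IS the ground truth: no claim that it
    is objectively true, only that it is the canonical comparison
    target for this task."""
    mse = np.mean((generated.astype(np.float64) -
                   reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

Real pipelines swap in SSIM, LPIPS, or FID, but the epistemics do not change: the score is only as meaningful as the choice of reference.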
Position two: ground truth as physical fidelity. Dominant in graphics and simulation. A ground truth is whatever would have been captured by an ideal camera in a real scene. This is the position that treats visual AI as ultimately a physics problem, with perception as a downstream consequence. Useful in narrow domains, unhelpful beyond them.
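A small sketch of what the physical-fidelity position implies in practice (the function names are mine, and the `ideal_capture` input is a stand-in for whatever a physically accurate simulation or calibrated camera would produce): errors are measured in linear radiometric units, not in the gamma-encoded, perceptual space that display pixels live in.

```python
import numpy as np

def srgb_to_linear(srgb: np.ndarray) -> np.ndarray:
    """Invert the standard sRGB transfer function (inputs in [0, 1]).
    Position two cares about scene radiance, so comparison happens
    in linear light rather than in display-encoded values."""
    return np.where(srgb <= 0.04045,
                    srgb / 12.92,
                    ((srgb + 0.055) / 1.055) ** 2.4)

def radiometric_rmse(generated: np.ndarray, ideal_capture: np.ndarray) -> float:
    """RMSE in linear light. `ideal_capture` stands in for what an
    ideal camera would have recorded of the real scene."""
    diff = srgb_to_linear(generated) - srgb_to_linear(ideal_capture)
    return float(np.sqrt(np.mean(diff ** 2)))
```

The catch, and the reason this position is unhelpful beyond narrow domains, is that `ideal_capture` only exists where you can simulate or instrument the scene.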
Position three: ground truth as aggregated human judgment. Dominant in RLHF and preference-modeling literature. A ground truth is whatever a large enough number of raters agree is correct. This is the position with the most practical infrastructure currently built around it, and also the most philosophically fragile of the four.
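Position three is the easiest of the four to write down, which is part of why it won. A minimal sketch (the data shapes are hypothetical): "truth" is the output of an aggregation function over rater votes, and the philosophical fragility is visible right in the code, since a 4-2 split and a 6-0 split come out as the same label.

```python
from collections import Counter

def aggregate_ground_truth(ratings: list[str]) -> tuple[str, float]:
    """Collapse per-rater labels into a single 'ground truth' label.
    Returns the majority label and its agreement rate. Whatever the
    minority raters saw is discarded at this point in the pipeline."""
    counts = Counter(ratings)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(ratings)

label, agreement = aggregate_ground_truth(
    ["better", "better", "worse", "better", "worse", "better"])
print(label, round(agreement, 2))  # -> better 0.67
```

Production RLHF pipelines use pairwise preferences and a learned reward model rather than raw majority vote, but the aggregation step, and the information it throws away, is the same in kind.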
Position four: ground truth as practitioner consensus. Almost entirely absent from the literature but universal in actual film and photography post-production. A ground truth is whatever a domain expert in the relevant craft would sign off on. No vote. No aggregation. One authority, calibrated by years of apprenticeship.
The full notebook walks through each position with examples, counterexamples, and a reading list for each. It also traces the specific lineage by which position three came to dominate industry RLHF pipelines, and why I think this happened for historical rather than epistemic reasons.
Below is the start of what I found for position four, which I think is the most neglected.
Why position four keeps getting missed
The short version: position four does not scale, and current AI development is organized around what scales. But there is a longer version.
Position four relies on the existence of a small number of people whose judgment has been calibrated through years of structured practice. Colorists. Cinematographers. Art directors. These people are expensive, slow to train, and produce judgments that are often pre-verbal. None of these properties fit the current AI development pipeline, which is optimized for cheap, fast, verbal data.
But these properties are not bugs in position four. They are its features. Practitioner consensus works precisely because it is slow and expensive. The apprenticeship is the signal...