There is a particular way a generated frame can be wrong that I have started calling confident-but-flat. The lighting is plausible. The composition is recognizable. A model with ten billion parameters has produced something that, at a glance, looks like a photograph. And yet, to any working colorist, it reads immediately as not-quite-right. The failure is not in the pixels. It is in the argument the frame is making about its own light.

I spend a lot of my working week trying to articulate this failure to people whose training did not require them to articulate it. It is harder than it sounds. When I say "the key light is contradicting the ambient fill," I am not describing a rule. I am describing a decade of habit, compressed into a sentence, and the compression leaves a lot of air out. The engineers I work with are not wrong to ask for the rule. They just want the wrong shape of answer.

What I want to do in this piece is name three specific failure modes I see repeatedly, explain why each of them is almost invisible without a post-production background, and argue that the ceiling on current visual AI is going to turn out to be an aesthetic problem rather than a computational one. If that ceiling is real, then the people who will break it are not the people currently being hired to break it.

One. Light that contradicts itself

The most common failure is a frame where two or more light sources exist in the image but do not agree on what kind of room they are in. A window suggests 4 p.m. in Bombay. The warmth of the fill light suggests a tungsten practical. The shadows fall as if they were lit by neither. The model has successfully learned that frames often contain multiple light sources, and has confidently produced an image with multiple light sources, and has missed the quieter fact that those sources are supposed to be having a conversation with one another.
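If I had to translate the grossest version of this failure into something a machine could check, and I want to be clear that this is a toy rather than a description of what the eye is doing, it might look like the sketch below: split the frame into regions, estimate the warmth of each lit region from its red-to-blue balance, and ask whether the spread is wider than one coherent scene would plausibly produce. The grid size, the warmth proxy, and the threshold are all my own illustrative choices, not an established metric.

    import numpy as np

    def warmth_disagreement(rgb, grid=4, thresh=0.35):
        """Toy proxy for 'lights that do not agree': estimate per-region warmth
        as log(mean red / mean blue) over a coarse grid, ignoring near-black
        regions, and flag the frame if the spread exceeds a threshold.
        Expects an RGB array in [0, 1]. Illustrative only; the proxy and the
        threshold are arbitrary choices, not a validated measure."""
        h, w, _ = rgb.shape
        warmths = []
        for i in range(grid):
            for j in range(grid):
                cell = rgb[i * h // grid:(i + 1) * h // grid,
                           j * w // grid:(j + 1) * w // grid]
                if cell.mean() < 0.05:          # skip regions that are not lit
                    continue
                warmth = np.log((cell[..., 0].mean() + 1e-6) /
                                (cell[..., 2].mean() + 1e-6))
                warmths.append(warmth)
        return len(warmths) > 1 and (max(warmths) - min(warmths)) > thresh

A check like this would only catch the crudest cases, which is rather my point. The conversation between sources that I am actually responding to lives below the resolution of anything this blunt.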

A working colorist will clock this in under a second. The clock is pre-verbal. It is the same clock that lets a jazz musician flinch at a missed chord change half a bar before they could explain why. There is no rule being checked. There is a sense of rightness, built over thousands of hours, that the frame is either honouring or failing.

A small confession. I did not have this vocabulary either, when I started. I spent two years at Papercloud Films watching senior colorists reject frames I thought were fine. I got annoyed. Then I got quiet. Then I started seeing what they were seeing. This is what domain expertise is, really. A long apprenticeship to a sense you did not arrive with.

Two. Depth without occupancy

The second failure is more subtle and, I think, more important. Generative models have become extremely good at producing images with a sense of depth. Foreground. Midground. Background. A plausible falloff of focus. Good lens character, even. What they are still bad at is producing images where those depth zones are occupied by an argument. Every frame in professional cinematography has a hierarchy. The eye is meant to travel somewhere first, and then somewhere second, and the geography of that travel is the difference between a shot and a snapshot.

When I run a generative video model through an adversarial prompt that asks for narrative staging, the output is almost always spatially full but narratively empty. The depth is there. The occupancy is not. This is not because the model cannot produce a staged image. It is because nothing in the training signal has taught it to distinguish between a composed frame and a furnished frame, and for most datasets, furniture counts as composition.

This is the failure mode most likely to be missed by a non-expert evaluator. A furnished frame looks, on a screen, indistinguishable from a composed one until you ask "where am I supposed to be looking first?" The answer, for a composed frame, is unambiguous. The answer for a furnished frame is "I don't know, maybe the brightest thing?"
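For what it is worth, the bluntest proxy I can imagine for that question looks something like the following: build a crude saliency map from local contrast and ask how much of it is concentrated in a small fraction of the frame. A composed frame should concentrate; a furnished one should smear. The saliency model, the five percent cut-off, and the claim itself are all assumptions of mine, not a validated measure of composition.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def attention_concentration(luma, top_frac=0.05):
        """Toy proxy for 'where am I supposed to look first': take a luminance
        channel, build a crude saliency map as the difference from a heavy
        blur, and return the fraction of total saliency held by the top 5% of
        pixels. Higher should mean a clearer first place to look.
        Illustrative only."""
        sigma = min(luma.shape) / 8
        saliency = np.abs(luma - gaussian_filter(luma, sigma=sigma))
        flat = np.sort(saliency.ravel())[::-1]
        k = max(1, int(top_frac * flat.size))
        return float(flat[:k].sum() / (flat.sum() + 1e-9))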

Three. Skin that has never been alive

This one I find the most interesting and the most difficult to communicate. There is a specific thing skin does in real footage that generative models have not yet learned to produce, and that I do not have a neat technical name for. It is something like aliveness. Real skin has a history. Small variations in saturation across a face that do not follow the skin tone exactly. A slight unevenness in specular highlights that tracks the way light actually bounces off a living surface rather than a rendered one. The generated version is often smoother, more even, and in a technical sense more "correct," and it is exactly this correctness that makes it fail.

A colorist who has graded enough faces can tell you, within a second, which of two images was shot on a real person and which was generated. They cannot always tell you why. This is a problem for current evaluation pipelines, which tend to privilege legibility of judgment over accuracy of judgment. If you cannot articulate the rule, the rule does not get used to train the model, and the model stays bad in exactly the ways you cannot articulate.
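The nearest I can come to stating it as a measurement, and it is not very near, is something like the patch-to-patch drift of saturation across a face crop. The sketch below is that idea and nothing more: the patch size, the statistic, and the hunch that real skin drifts where over-even generated skin does not are all mine, not a validated result.

    import numpy as np

    def skin_saturation_drift(face_rgb, patch=16):
        """Rough proxy for the 'aliveness' described above: compute a crude
        HSV-style saturation over a face crop, average it over small patches,
        and return how much those patch means vary. Expects an RGB array in
        [0, 1]. Illustrative only; not a validated detector."""
        mx = face_rgb.max(axis=-1)
        mn = face_rgb.min(axis=-1)
        sat = (mx - mn) / (mx + 1e-6)
        h, w = sat.shape
        patch_means = [
            sat[i:i + patch, j:j + patch].mean()
            for i in range(0, h - patch + 1, patch)
            for j in range(0, w - patch + 1, patch)
        ]
        return float(np.std(patch_means))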

The ceiling on visual AI is going to turn out to be an aesthetic problem, not a computational one. Which means the people who will break it are not, mostly, the people currently being hired to break it.

What this implies

I think the industry has an interesting couple of years coming. The dominant assumption in RLHF work on visual models is still that more data and more raters will close these gaps. I am not sure they will. The gaps I am describing are not gaps in quantity of preference signal. They are gaps in resolution of preference signal. A thousand generalist raters saying "this image is fine" does not give you the signal that one colorist saying "the key is fighting the ambient" gives you. Not even close.
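A back-of-envelope way to see what I mean by resolution, leaving aside entirely the question of whether the label is right, which is the larger half of the problem, is to count the choices each kind of rater is making per frame. The rubric in the arithmetic below, one of the three failure modes above at one of three severities, or none, is my own illustration.

    import math

    # Illustrative arithmetic only, not a measurement of any real pipeline.
    # A generalist saying "fine" or "not fine" chooses between two outcomes.
    generalist_bits = math.log2(2)

    # A specialist tagging a frame as clean, or as one of three failure modes
    # at one of three severities, chooses among 1 + 3 * 3 outcomes.
    specialist_bits = math.log2(1 + 3 * 3)

    print(f"{generalist_bits:.2f} bits vs {specialist_bits:.2f} bits per frame")

And the bits are the smaller part of it. The specialist's label is also far more likely to be pointing at the thing that is actually wrong.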

The fix, I think, is structural. AI labs need to start staffing evaluation the way film post-production has always staffed colour grading. A small number of deep specialists, not a large number of generalists. The cost per hour is higher. The cost per correctly caught failure is much lower. This is not a new idea in other industries. It is only new in AI because AI has been, until recently, dominated by a data-scaling logic that treated domain expertise as a rounding error.

I have been arguing this, at varying volumes, inside xAI and out of it for several months. I do not expect to win the argument on my own. I do expect the argument to be won, in the next two to three years, by a specific failure mode going public in a way that a specialist could have caught and a generalist could not. That will be the moment the field adjusts. I would rather we adjust first.

In the meantime, I am collecting these failure modes in a private notebook, and occasionally writing them up here. If you work on visual AI and any of this resonates with you, you are welcome to write to me. If you work with post-production colorists and want to help me build a rubric they could actually contribute to, also write.
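When I say rubric, I do not mean anything elaborate. The sketch below is roughly the shape I have in mind; the field names, the severity scale, and everything else about it are placeholders rather than an existing schema, and the free-text note is the part I actually care about.

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class FailureMode(Enum):
        # The three failure modes named above; the enum itself is just a sketch.
        LIGHT_CONTRADICTION = "light that contradicts itself"
        DEPTH_WITHOUT_OCCUPANCY = "depth without occupancy"
        SKIN_NEVER_ALIVE = "skin that has never been alive"

    @dataclass
    class ColoristNote:
        """One specialist judgment on one frame. The fields are a guess at the
        minimum a colorist could fill in quickly and a training pipeline could
        consume; none of this is an existing schema."""
        frame_id: str
        mode: Optional[FailureMode]   # None means the frame reads as clean
        severity: int = 0             # 0 clean, 1 noticeable, 2 distracting, 3 unusable
        where: str = ""               # e.g. "key vs ambient", "left cheek highlight"
        note: str = ""                # the free text is the real payload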

The short version of my argument is this. The models have gotten good enough that the things they still cannot do are the things only practitioners know how to name. Which means the next jump in quality is, in a sense, a humanities problem. It is going to require the kind of attention that used to be paid to frames one at a time. The industry is not currently set up for that. I think it should be.