When AI Agents Lie About Their Own Code (Without Meaning To)
TL;DR: We looked at 23,247 pull requests written by AI coding agents and asked a simple question: does the description match the diff? In 1.7% of cases, no. Sounds tiny, until you see the consequences: inconsistent PRs get 51.7% lower acceptance rates and take 3.5× longer to merge. The code is fine. The story the agent tells about the code is the problem.

The part of a PR nobody measures

When we benchmark AI coding agents, we measure code. Does it compile? Does it pass the tests? Is it clean? ...