There Is No "Best" AI Coding Agent — And That's the Whole Point

TL;DR We pulled 7,156 real pull requests authored by Codex, Claude Code, Cursor, Devin, and Copilot, then sliced the data by task type. The headline: the gap between the best and worst task categories is 29 percentage points, far bigger than the gap between agents within any single category. Codex is the steady generalist. Claude Code wins documentation. Cursor wins bug fixes. Stop asking which agent is best. Start asking best at what.

The wrong question

Walk into any engineering Slack and you’ll see the same debate: Codex vs. Claude Code vs. Cursor vs. Devin vs. Copilot. Threads explode. Benchmarks fly. Someone screenshots a leaderboard. ...

April 14, 2026 · 4 min · Giovanni Pinna

When AI Agents Lie About Their Own Code (Without Meaning To)

TL;DR We looked at 23,247 pull requests written by AI coding agents and asked a simple question: does the description match the diff? In 1.7% of cases, no. Sounds tiny, until you see the consequences. Inconsistent PRs get 51.7% lower acceptance rates and take 3.5× longer to merge. The code is fine. The story the agent tells about the code is the problem.

The part of a PR nobody measures

When we benchmark AI coding agents, we measure code. Does it compile? Does it pass the tests? Is it clean? ...

April 14, 2026 · 5 min · Giovanni Pinna