There Is No "Best" AI Coding Agent — And That's the Whole Point

TL;DR We pulled 7,156 real pull requests authored by Codex, Claude Code, Cursor, Devin, and Copilot, then sliced the data by task type. The headline: the gap between the best and worst task category is 29 percentage points — far bigger than the gap between agents inside any single category. Codex is the steady generalist. Claude Code wins documentation. Cursor wins bug fixes. Stop asking which agent is best. Start asking best at what.

The wrong question

Walk into any engineering Slack and you'll see the same debate: Codex vs. Claude Code vs. Cursor vs. Devin vs. Copilot. Threads explode. Benchmarks fly. Someone screenshots a leaderboard. ...

April 14, 2026 · 4 min · Giovanni Pinna

When AI Agents Lie About Their Own Code (Without Meaning To)

TL;DR We looked at 23,247 pull requests written by AI coding agents and asked a simple question: does the description match the diff? In 1.7% of cases, no. Sounds tiny — until you see the consequences. Inconsistent PRs get 51.7% lower acceptance rates and take 3.5× longer to merge. The code is fine. The story the agent tells about the code is the problem.

The part of a PR nobody measures

When we benchmark AI coding agents, we measure code. Does it compile? Does it pass the tests? Is it clean? ...

April 14, 2026 · 5 min · Giovanni Pinna

Sometimes Your AI Agent Burns More Energy Optimizing Code Than the Code Will Ever Save

TL;DR AI coding agents burn 100,000+ tokens per task. When the task is “optimize this code’s performance,” the agent itself often costs more energy than the optimized code will ever save. We built GA4GC — Greener Agent for Greener Code — using NSGA-II to tune the agent’s own configuration against three objectives: code correctness, code speedup, and agent runtime. On a mini-SWE-agent powered by Gemini 2.5 Pro on the SWE-Perf benchmark, we got a 37.7% runtime reduction while also improving correctness, with a 135× hypervolume improvement over the defaults. Bonus finding: temperature is the single most important knob, and LLM hyperparameters control quality while agent constraints control cost — they can be tuned almost independently.

The energy paradox nobody talks about

Here’s a thing that should be obvious but isn’t: when you ask an AI agent to optimize the performance of your code, the agent’s own execution costs energy. A lot of energy. Often more than the code it’s optimizing will ever save. ...
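To make the three-objective trade-off concrete, here is a minimal sketch of the Pareto-dominance selection step that multi-objective optimizers like NSGA-II are built on, applied to agent configurations scored on (correctness, speedup, agent runtime). The configuration names and numbers are made-up illustrations, not values from the GA4GC experiments, and this toy filter omits NSGA-II's crowding distance and evolutionary loop.

```python
# Toy Pareto-front filter over hypothetical agent configurations,
# scored on three objectives: correctness (maximize), code speedup
# (maximize), and agent runtime in minutes (minimize).

def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly
    better on at least one (objectives expressed as minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(names, objectives):
    """Return the names whose objective vectors are not dominated."""
    return [n for i, n in enumerate(names)
            if not any(dominates(objectives[j], objectives[i])
                       for j in range(len(names)) if j != i)]

# Hypothetical candidates: (name, correctness, speedup, runtime_minutes).
candidates = [
    ("temp=0.0, 30-step budget", 0.62, 1.10, 18.0),
    ("temp=0.2, 50-step budget", 0.68, 1.25, 24.0),
    ("temp=0.8, 50-step budget", 0.55, 1.30, 26.0),
    ("temp=0.2, 80-step budget", 0.68, 1.25, 31.0),  # same quality, slower agent
]
names = [c[0] for c in candidates]
# Convert to pure minimization by negating the two maximized objectives.
objs = [(-corr, -spd, rt) for _, corr, spd, rt in candidates]

print(pareto_front(names, objs))
# → the last config is dominated (equal correctness and speedup,
#   strictly higher agent runtime) and drops out of the front.
```

The dropped candidate is exactly the kind of waste the post describes: it produces code no better than a cheaper configuration while burning more agent runtime to do it.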

October 13, 2025 · 5 min · Giovanni Pinna