Blog | Giovanni Pinna

There Is No "Best" AI Coding Agent — And That's the Whole Point

TL;DR We pulled 7,156 real pull requests authored by Codex, Claude Code, Cursor, Devin, and Copilot, then sliced the data by task type. The headline: the gap between best and worst task category is 29 percentage points, far bigger than the gap between agents inside any single category. Codex is the steady generalist. Claude Code wins documentation. Cursor wins bug fixes. Stop asking which agent is best. Start asking best at what.

The wrong question

Walk into any engineering Slack and you’ll see the same debate: Codex vs. Claude Code vs. Cursor vs. Devin vs. Copilot. Threads explode. Benchmarks fly. Someone screenshots a leaderboard. ...

When AI Agents Lie About Their Own Code (Without Meaning To)

TL;DR We looked at 23,247 pull requests written by AI coding agents and asked a simple question: does the description match the diff? In 1.7% of cases, no. Sounds tiny, until you see the consequences. Inconsistent PRs get 51.7% lower acceptance rates and take 3.5× longer to merge. The code is fine. The story the agent tells about the code is the problem.

The part of a PR nobody measures

When we benchmark AI coding agents, we measure code. Does it compile? Does it pass the tests? Is it clean? ...

Sometimes the Best Feature Engineering Is Throwing Features Away

TL;DR Classifying software hotfixes (the panic-mode patches you ship to fix something that’s broken in production right now) is hard for ML: tiny dataset (88 entries, 17 categories), brutal class imbalance, and expensive LLM features. HotCat reframes feature engineering as a search problem: NSGA-II evolves binary masks over 18 features, optimizing accuracy, NMI, and runtime simultaneously. A two-stage data augmentation lifts generalization from 55% → 72%. The Pareto front gives a balanced config: 59% accuracy, 0.58 NMI, 129 seconds. Most surprising: some features actively hurt; pruning them is both faster and more accurate.

Hotfixes are not normal bugs

In any normal software project, bugs queue up. They get triaged, prioritized, scheduled into sprints. Some sit there for months. ...
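
To make the idea of binary feature masks and a Pareto front concrete, here is a minimal sketch of the non-dominated filtering step alone. It is not HotCat's code and not a full NSGA-II (no crossover, mutation, or crowding distance). The feature names, the 4-bit masks (the post uses 18 features), and every score below are made up purely for illustration.

```python
from typing import List, Sequence, Tuple

# (mask, accuracy, nmi, runtime_seconds); all values are hypothetical.
Candidate = Tuple[Tuple[int, ...], float, float, float]


def apply_mask(features: Sequence[str], mask: Sequence[int]) -> List[str]:
    """Keep only the features whose mask bit is 1."""
    return [name for name, keep in zip(features, mask) if keep]


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on every objective and better on one.

    Accuracy and NMI are maximized; runtime is minimized.
    """
    no_worse = a[1] >= b[1] and a[2] >= b[2] and a[3] <= b[3]
    better = a[1] > b[1] or a[2] > b[2] or a[3] < b[3]
    return no_worse and better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates if other is not c)
    ]


# Hypothetical feature names and evaluation results.
features = ["f_a", "f_b", "f_c", "f_d"]
candidates: List[Candidate] = [
    ((1, 1, 1, 1), 0.50, 0.50, 300.0),  # all features: slow, not the most accurate
    ((1, 0, 1, 0), 0.55, 0.52, 120.0),  # pruned: faster and more accurate
    ((0, 0, 1, 0), 0.40, 0.45, 60.0),   # fastest, least accurate
    ((1, 1, 0, 0), 0.45, 0.40, 200.0),  # worse than the pruned config everywhere
]

front = pareto_front(candidates)
for mask, acc, nmi, runtime in front:
    print(apply_mask(features, mask), acc, nmi, runtime)
```

In this toy data the all-features mask is dominated by a pruned one, which mirrors the claim that dropping features can be both faster and more accurate. Picking a single "balanced" configuration from the front is a separate decision that this sketch does not cover.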

Sometimes Your AI Agent Burns More Energy Optimizing Code Than the Code Will Ever Save

TL;DR AI coding agents burn 100,000+ tokens per task. When the task is “optimize this code’s performance,” the agent itself often costs more energy than the optimized code will ever save. We built GA4GC (Greener Agent for Greener Code), using NSGA-II to tune the agent’s own configuration against three objectives: code correctness, code speedup, and agent runtime. On a mini-SWE-agent powered by Gemini 2.5 Pro on the SWE-Perf benchmark, we got a 37.7% runtime reduction while also improving correctness, with a 135× hypervolume improvement over defaults. Bonus finding: temperature is the single most important knob, and LLM hyperparameters control quality while agent constraints control cost; they can be tuned almost independently.

The energy paradox nobody talks about

Here’s a thing that should be obvious but isn’t: when you ask an AI agent to optimize the performance of your code, the agent’s own execution costs energy. A lot of energy. Often more than the code it’s optimizing will ever save. ...

The Text-to-SQL Field Has a Measurement Problem

TL;DR Text-to-SQL is everywhere, but we measure it badly. Exact Match punishes you for swapping in an alias like users AS u. Execution Accuracy doesn’t care if you got 99 of 100 rows right: wrong is wrong. We built QAS (Query Accuracy Score): a continuous score that combines code-aware semantic similarity (how close is the SQL?) with edit-distance table similarity (how close is the answer?). Tested on 11 models on BIRD, QAS surfaces huge differences that binary metrics flatten into the same number.

A field built on coin flips

Text-to-SQL is one of those areas where the demos look magical. Type a question in English, get a SQL query back, get an answer from your database. No DBA needed. The promise is enormous. ...
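
The contrast between binary and continuous scoring can be illustrated with a small sketch. This is not the QAS formula. It uses Python's `difflib.SequenceMatcher` as an assumed stand-in for an edit-distance style comparison of result rows, it omits the SQL-similarity half entirely, and the queries and rows are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Row = Tuple[object, ...]


def exact_match(pred_sql: str, gold_sql: str) -> bool:
    """Binary: the two query strings must be identical."""
    return pred_sql.strip() == gold_sql.strip()


def execution_accuracy(pred_rows: List[Row], gold_rows: List[Row]) -> bool:
    """Binary: the two result sets must be identical."""
    return pred_rows == gold_rows


def row_similarity(pred_rows: List[Row], gold_rows: List[Row]) -> float:
    """Continuous stand-in: similarity ratio in [0, 1] over the row sequences."""
    return SequenceMatcher(None, pred_rows, gold_rows, autojunk=False).ratio()


# Hypothetical example: an aliased query that returns 99 of the 100 gold rows.
gold_sql = "SELECT name FROM users"
pred_sql = "SELECT name FROM users AS u"
gold_rows = [(i,) for i in range(100)]
pred_rows = gold_rows[:99]

print(exact_match(pred_sql, gold_sql))            # False: the alias breaks string equality
print(execution_accuracy(pred_rows, gold_rows))   # False: one missing row
print(round(row_similarity(pred_rows, gold_rows), 3))
```

Both binary metrics report the same failure a completely wrong query would get, while the continuous ratio stays close to 1. That is the kind of difference a graded score can surface.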