The Text-to-SQL Field Has a Measurement Problem

TL;DR Text-to-SQL is everywhere, but we measure it badly. Exact Match punishes you for swapping users AS u. Execution Accuracy doesn’t care if you got 99 of 100 rows right — wrong is wrong. We built QAS (Query Accuracy Score): a continuous score that combines code-aware semantic similarity (how close is the SQL?) with edit-distance table similarity (how close is the answer?). Tested on 11 models on BIRD, QAS surfaces huge differences that binary metrics flatten into the same number.

A field built on coin flips

Text-to-SQL is one of those areas where the demos look magical. Type a question in English, get a SQL query back, get an answer from your database. No DBA needed. The promise is enormous. ...
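The metric combines two continuous signals: how close the generated SQL is to the gold query, and how close the returned table is to the gold table. The excerpt doesn’t reproduce the published QAS formula, so the sketch below is an assumption: a weighted mean of a caller-supplied SQL similarity and a row-level edit-distance table similarity. The names `table_similarity`, `qas`, and `alpha` are illustrative, not the paper’s.

```python
import difflib

def table_similarity(expected_rows, predicted_rows):
    """Edit-distance similarity between two result tables.

    Rows are serialized to strings so SequenceMatcher's ratio
    (2*matches / total length) gives partial credit: 99 of 100
    correct rows scores near 1.0 instead of a flat 0.
    """
    a = [str(r) for r in expected_rows]
    b = [str(r) for r in predicted_rows]
    return difflib.SequenceMatcher(None, a, b).ratio()

def qas(sql_similarity, expected_rows, predicted_rows, alpha=0.5):
    """Hypothetical QAS combination: a weighted mean of the
    code-aware SQL similarity (computed elsewhere) and the
    table similarity. alpha and the combination rule are
    assumptions, not the published formula."""
    return alpha * sql_similarity + (1 - alpha) * table_similarity(
        expected_rows, predicted_rows
    )
```

With this shape, a query that returns 99 of 100 correct rows scores roughly 0.99 on the table term rather than the flat 0 Execution Accuracy would assign.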

July 2, 2025 · 6 min · Giovanni Pinna

Making the LLM-Plus-Evolution Pipeline Actually Smart

TL;DR Our EuroGP 2024 work showed Genetic Improvement (GI) can rescue LLM-generated code. This follow-up makes the GI part itself smarter. Three upgrades: lexicase selection to keep specialists alive, 10% down-sampling to cut compute, and a refined fitness function (F_E) that gives partial credit instead of pass/fail. On four LLMs (GPT-4, ChatGPT, Code Llama 7B, LLaMA 3 8B) over three PSB2 problems, we improved 11 of 12 model-problem combinations. Smaller models gain the most. GI is, increasingly, a capability amplifier for cheap models.

What we left on the table last time

The EuroGP 2024 paper proved the basic idea: take an LLM’s buggy first draft, hand it to Grammatical Evolution, get back better code. Statistically significant gains on every model. ...
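Of the three upgrades, down-sampled lexicase selection is the most mechanical to illustrate. A minimal sketch under the usual definition: sample a random 10% of test cases, shuffle them, then filter candidates case by case, keeping only those with the best error on the current case. Function and parameter names here are mine, not the paper’s.

```python
import random

def lexicase_select(population, errors, downsample=0.1, rng=random):
    """Select one parent via down-sampled lexicase selection.

    population: list of individuals.
    errors[i][j]: error of individual i on test case j.
    Each call uses only a random fraction of the cases
    (down-sampling), so specialists that excel on a few cases
    can survive even with poor aggregate fitness.
    """
    n_cases = len(errors[0])
    k = max(1, int(downsample * n_cases))
    cases = rng.sample(range(n_cases), k)
    rng.shuffle(cases)
    candidates = list(range(len(population)))
    for c in cases:
        best = min(errors[i][c] for i in candidates)
        candidates = [i for i in candidates if errors[i][c] == best]
        if len(candidates) == 1:
            break
    return population[rng.choice(candidates)]
```

Because filtering happens per case rather than on a summed score, an individual that solves a rare edge case perfectly is kept alive even if it fails everywhere else — the “specialists” the post refers to.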

July 1, 2025 · 5 min · Giovanni Pinna

Improving LLM-Generated Code via Genetic Improvement: A Summary of Recent Advances

Abstract This paper provides a comprehensive summary of our research program on applying Genetic Improvement to code generated by Large Language Models, consolidating findings from two published studies (EuroGP 2024 and SN Computer Science 2025). Across both works, we demonstrate that neural and evolutionary approaches are fundamentally complementary: LLMs excel at rapidly generating structurally plausible code, while Genetic Improvement refines it toward precise specifications through grammar-based evolutionary search. A consistent finding is the “capability amplifier” effect — smaller open-source models benefit disproportionately from GI, narrowing the gap with larger proprietary models. We also discuss key limitations including oracle dependency, scalability constraints to multi-file projects, bias propagation from LLM-generated grammars, and the stochastic nature of evolutionary algorithms. Presented at Ital-IA 2025, the 5th National Conference on Artificial Intelligence, Rome, Italy.

Introduction

The intersection of Large Language Models and evolutionary computation represents one of the most promising frontiers in automated software engineering. Over the past two years, our research group has developed and refined a methodology for systematically improving code generated by LLMs using Genetic Improvement (GI) techniques. This paper, presented at Ital-IA 2025 (the 5th National Conference on Artificial Intelligence, organized by CINI), provides a comprehensive summary of this research program and its key findings. ...

June 23, 2025 · 5 min · Giovanni Pinna

GPT-4 Can Make Court Rulings Easier to Read. It Can Also Lie to You About Them, Confidently.

TL;DR Italian Constitutional Court rulings are written for lawyers. Most citizens can’t follow them. Can an LLM fix that? We ran a 75-person human study comparing four versions of the same legal content: original judgments, expert “massime” summaries, GPT-4o summaries, and a fine-tuned LLaMA 2 7B. Comprehension rates: expert summaries 45%, GPT-4o 38%, raw judgments 33%, fine-tuned LLaMA 30%. GPT-4o really does make legal text more readable. It also produces a worrying pattern: confident, fluent, wrong — readers leave with strongly held but incorrect understandings. Use LLMs for legal summarization. Don’t use them without human review.

A democratic problem dressed up as a technical one

Constitutional Court rulings are some of the most important documents a country produces. They define what your government can and can’t do. They shape what your rights are. They are also written by lawyers, for lawyers, in dense Italian legal prose that the average citizen has no chance of understanding. ...

December 10, 2024 · 5 min · Giovanni Pinna

What If We Stopped Asking ChatGPT to Fix Its Own Code?

TL;DR LLMs write code that almost works. The usual fix is to ask them again — “self-correction” — but it tends to repeat the same mistakes. We took a different route: treat the buggy code as a seed and evolve it. Using Grammatical Evolution with a grammar built on-the-fly from the LLM’s own output, we improved code from GPT-4, ChatGPT, LLaMA-2, Alpaca-13B, and Alpaca-7B on 25 PSB2 problems — with statistically significant gains (p < 0.001) for every model. The smaller the model, the bigger the win.

The trap of self-correction

Ask any modern LLM to write a Python function and you’ll get something that looks right. Run the tests and you’ll often discover it isn’t. ...
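The core mechanism is Grammatical Evolution’s genotype-to-phenotype mapping over a grammar whose productions are seeded from the LLM’s draft, so evolution searches in the neighborhood of the buggy code rather than from scratch. A toy sketch using the standard GE modulo mapping; the three-rule grammar below is a stand-in for one harvested from real LLM output, not the paper’s grammar-construction procedure.

```python
# Toy grammar standing in (hypothetically) for one built from an
# LLM's buggy draft: the terminals on the right-hand sides would
# be tokens harvested from the model's own code.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>": [["+"], ["-"], ["*"]],
    "<var>": [["x"], ["y"], ["1"]],
}

def ge_map(genome, grammar, start="<expr>", max_steps=50):
    """Standard GE genotype-to-phenotype mapping: each codon
    (integer gene) picks a production by taking it modulo the
    number of choices for the leftmost nonterminal."""
    out, stack, i = [], [start], 0
    while stack and i < max_steps:
        sym = stack.pop(0)
        if sym in grammar:
            rules = grammar[sym]
            choice = rules[genome[i % len(genome)] % len(rules)]
            i += 1
            stack = list(choice) + stack
        else:
            out.append(sym)
    return " ".join(out)
```

Mutating or crossing over the integer genome then explores program variants near the seed, which is what lets GE repair a draft instead of regenerating it blind.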

April 3, 2024 · 4 min · Giovanni Pinna