The Text-to-SQL Field Has a Measurement Problem

TL;DR Text-to-SQL is everywhere, but we measure it badly. Exact Match punishes you for swapping users AS u. Execution Accuracy doesn’t care if you got 99 of 100 rows right — wrong is wrong. We built QAS (Query Accuracy Score): a continuous score that combines code-aware semantic similarity (how close is the SQL?) with edit-distance table similarity (how close is the answer?). Tested on 11 models on BIRD, QAS surfaces huge differences that binary metrics flatten into the same number. A field built on coin flips Text-to-SQL is one of those areas where the demos look magical. Type a question in English, get a SQL query back, get an answer from your database. No DBA needed. The promise is enormous. ...

July 2, 2025 · 6 min · Giovanni Pinna