[{"content":" TL;DR We pulled 7,156 real pull requests authored by Codex, Claude Code, Cursor, Devin, and Copilot, then sliced the data by task type. The headline: the gap between best and worst task category is 29 percentage points — far bigger than the gap between agents inside any single category. Codex is the steady generalist. Claude Code wins documentation. Cursor wins bug fixes. Stop asking which agent is best. Start asking best at what. The wrong question Walk into any engineering Slack and you\u0026rsquo;ll see the same debate: Codex vs. Claude Code vs. Cursor vs. Devin vs. Copilot. Threads explode. Benchmarks fly. Someone screenshots a leaderboard.\nThe honest answer to \u0026ldquo;which is best\u0026rdquo; is it depends — but that answer feels like a cop-out. So we tried to make it concrete. We grabbed 7,156 pull requests these five agents had actually opened against real open-source repos, and we measured what really matters: did a human maintainer merge it?\nThen we did the part most evaluations skip. We split the PRs by what kind of work they were doing.\nWhy a single number lies Bug fixes, feature work, documentation, refactors, dependency bumps, tests — these are not the same task. They have wildly different difficulty profiles for an AI agent. Documentation is mostly natural-language generation against a soft target. Feature implementation requires holding a mental model of an architecture. Bug fixing demands navigating someone else\u0026rsquo;s code.\nIf you average across all of them, you get a number. You also lose the entire story.\nSo instead of one acceptance rate per agent, we computed an acceptance rate per agent, per task type.\nThe headline finding Look at the chart. The thing your eye should jump to is vertical, not horizontal.\nThe gap between the best task category and the worst is roughly 29 percentage points. The gap between agents within any one category is far smaller. 
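The stratified computation behind that finding is simple enough to sketch. A minimal version, using a handful of invented PR records purely for illustration (the study's real data is the 7,156 mined PRs, not these):

```python
from collections import defaultdict

# Hypothetical PR records — illustrative only, not the study's data.
prs = [
    {"agent": "codex", "task": "docs", "merged": True},
    {"agent": "codex", "task": "bugfix", "merged": False},
    {"agent": "claude_code", "task": "docs", "merged": True},
    {"agent": "claude_code", "task": "docs", "merged": True},
    {"agent": "cursor", "task": "bugfix", "merged": True},
    {"agent": "cursor", "task": "bugfix", "merged": False},
]

def acceptance_by_agent_and_task(prs):
    """One acceptance rate per (agent, task type) pair — not one per agent."""
    merged = defaultdict(int)
    total = defaultdict(int)
    for pr in prs:
        key = (pr["agent"], pr["task"])
        total[key] += 1
        merged[key] += pr["merged"]  # bool counts as 0/1
    return {key: merged[key] / total[key] for key in total}

rates = acceptance_by_agent_and_task(prs)
# e.g. rates[("cursor", "bugfix")] == 0.5
```

The point of the extra key dimension: averaging over `task` collapses exactly the variation the chart shows.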
The kind of work you give an agent predicts acceptance much more strongly than which agent you picked.\nThat reframes the entire conversation. Asking \u0026ldquo;is Cursor better than Claude Code?\u0026rdquo; is like asking \u0026ldquo;is a chef better than a sushi chef?\u0026rdquo; — the question is missing a noun. Better at what?\nEach agent has a beat When you slice properly, the agents stop blurring together. They develop personalities.\nCodex — the generalist. No spectacular peaks, no embarrassing valleys. If you only get to pick one agent and you don\u0026rsquo;t know what\u0026rsquo;s coming next, this is probably the safe default.\nClaude Code — documentation. Its language fluency translates straight into PR descriptions, README rewrites, and inline comments that maintainers actually want to merge.\nCursor — bug fixes. The IDE integration gives it deeper context about the surrounding code, and that context is exactly what bug fixing needs.\nDevin and Copilot land elsewhere on this map — the full breakdown is in the paper. The point isn\u0026rsquo;t the leaderboard, it\u0026rsquo;s that there\u0026rsquo;s no single ranking. There\u0026rsquo;s a shape.\nWhat to actually do with this If you ship code with AI agents, two things follow:\nRoute by task, not by preference. A team that hands every kind of work to its favorite agent is leaving acceptance rate on the table. Documentation PR? Send it to Claude Code. Subtle bug? Cursor. Random Tuesday cleanup? Codex. Treat your agents like a team with different specialties, because that\u0026rsquo;s what they are.\nCalibrate your expectations. When a feature-implementation PR fails review, that may not be the agent\u0026rsquo;s fault — it may be the task\u0026rsquo;s fault. A 40% acceptance rate on hard tasks isn\u0026rsquo;t a broken agent. A 70% rate on easy tasks isn\u0026rsquo;t a magical one. 
Both are noise around a baseline set by the work itself.\nWhat this changes for the field For agent developers: stop chasing aggregate benchmark scores. They hide where you\u0026rsquo;re losing. The path to a better agent runs through the task category where you currently underperform.\nFor researchers: report per-task numbers, always. A single acceptance rate is not a measurement, it\u0026rsquo;s a smoothing function over your most interesting variation.\nAnd for everyone else: when someone tells you their agent is the best, ask them at what. If they can\u0026rsquo;t answer, neither can their agent.\nReference This post is a divulgative summary of:\nPinna, G., Sarro, F. (2026). Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance. In: Proceedings of the 23rd International Conference on Mining Software Repositories (MSR 2026) — Mining Challenge Track.\nRead the original paper (PDF)\nResearch conducted at University College London (UCL), CREST, with Prof. Federica Sarro.\n","permalink":"https://giovannipinna.net/posts/msr2026-comparing-ai-agents/","summary":"\u003cdiv class=\"summary-box\"\u003e\n  \u003cdiv class=\"summary-box-header\"\u003e\n    \u003csvg xmlns=\"http://www.w3.org/2000/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\"\u003e\u003cline x1=\"4\" y1=\"6\" x2=\"20\" y2=\"6\"/\u003e\u003cline x1=\"4\" y1=\"12\" x2=\"14\" y2=\"12\"/\u003e\u003cline x1=\"4\" y1=\"18\" x2=\"18\" y2=\"18\"/\u003e\u003c/svg\u003e\n    \u003cspan\u003eTL;DR\u003c/span\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"summary-box-content\"\u003e\n    We pulled \u003cstrong\u003e7,156 real pull requests\u003c/strong\u003e authored by Codex, Claude Code, Cursor, Devin, and Copilot, then sliced the data by \u003cem\u003etask type\u003c/em\u003e. 
The headline: the gap between best and worst task category is \u003cstrong\u003e29 percentage points\u003c/strong\u003e — far bigger than the gap between agents inside any single category. Codex is the steady generalist. Claude Code wins documentation. Cursor wins bug fixes. Stop asking \u003cem\u003ewhich agent is best\u003c/em\u003e. Start asking \u003cem\u003ebest at what\u003c/em\u003e.\n  \u003c/div\u003e\n\u003c/div\u003e\n\n\u003ch2 id=\"the-wrong-question\"\u003eThe wrong question\u003c/h2\u003e\n\u003cp\u003eWalk into any engineering Slack and you\u0026rsquo;ll see the same debate: Codex vs. Claude Code vs. Cursor vs. Devin vs. Copilot. Threads explode. Benchmarks fly. Someone screenshots a leaderboard.\u003c/p\u003e","title":"There Is No \"Best\" AI Coding Agent — And That's the Whole Point"},{"content":" TL;DR We looked at 23,247 pull requests written by AI coding agents and asked a simple question: does the description match the diff? In 1.7% of cases, no. Sounds tiny — until you see the consequences. Inconsistent PRs get 51.7% lower acceptance rates and take 3.5× longer to merge. The code is fine. The story the agent tells about the code is the problem. The part of a PR nobody measures When we benchmark AI coding agents, we measure code. Does it compile? Does it pass the tests? Is it clean?\nBut a pull request is not a diff. It\u0026rsquo;s a diff plus a story. The reviewer reads the title, then the description, then maybe the code. The description sets expectations. It tells the reviewer what to look for. If the story is wrong, every line of code that follows is being read against the wrong template.\nA PR titled \u0026ldquo;fixed the auth bug\u0026rdquo; that actually refactors the database layer doesn\u0026rsquo;t just fail to inform. It actively misleads. 
And when reviewers detect that mismatch — even subconsciously — trust collapses.\nWhy agents are weirdly bad at this Writing code and writing an honest summary of what you just wrote are different cognitive tasks. The first is algorithmic. The second is meta-cognitive — you have to know what you intended, what you tried, and what actually came out the other end.\nAI agents are good at the first. They struggle at the second, and the failure has a specific shape:\nThe agent reads the task and forms a plan. It hits unexpected friction — failing tests, weird dependencies, edge cases. It iterates, debugs, takes detours, makes compromises. The final code is not quite what the plan was. When asked to write a description, the agent often describes the plan, not the result. That\u0026rsquo;s where message-code inconsistency is born. Not malice. Not laziness. A drift between intent and outcome that the agent never noticed.\nHow we measured it We built a metric — PR-MCI, Pull Request Message-Code Inconsistency — that measures the semantic distance between what the description says and what the diff actually does. It\u0026rsquo;s a continuous score, not a yes/no, so we can rank PRs by how inconsistent they are.\nThen we ran it across 23,247 AI-authored pull requests and looked at what happened to the inconsistent ones.\nThe 1.7% problem Only 1.7% of PRs scored as highly inconsistent. That sounds like a non-issue. It\u0026rsquo;s not, for two reasons.\nFirst, scale. 
In a company shipping thousands of agent-authored PRs a month, 1.7% is dozens of misleading descriptions per week dropping into reviewer inboxes.\nSecond, each one is expensive.\nThe high-inconsistency PRs:\nGet accepted 51.7% less often than consistent PRs Take 3.5× longer to merge when they do get accepted Often have technically fine code — the rejection is about the story, not the substance A reviewer who finds that the description lied to them does not give the agent the benefit of the doubt on the next paragraph, or the next file, or the next PR. Trust is paid in advance and refunded slowly.\nWhy the cost is so high Two mechanisms compound:\nRe-orientation cost. A reviewer who expected auth fixes and found database refactors has to throw away their mental model and build a new one. That\u0026rsquo;s the most expensive operation in code review. It also tends to surface defensive instincts: \u0026ldquo;what else is in here that I didn\u0026rsquo;t expect?\u0026rdquo;\nTrust contagion. If the description is wrong, the reviewer no longer trusts the description as a summary — meaning they have to read the code more carefully than they would have. Every PR after the first inconsistent one inherits a slight discount on trust, especially if it came from the same agent.\nThe end result: a small percentage of misleading PRs degrades the throughput of the entire review pipeline.\nWhat to do about it For agent developers, the fix is structural. The description should not be generated from the plan. It should be generated from the final diff, in a separate pass, by something that hasn\u0026rsquo;t seen the original task. Cheap interventions that already help:\nVerification pass. A second LLM call reads the diff and the description and flags the mismatch. Heuristic checks. Files mentioned in the description should appear in the diff. Stated bug categories should match the test failures being touched. Description regeneration. 
Throw away the description the agent wrote during planning. Generate a new one from the final diff alone. For teams using these agents: assume descriptions are unreliable until proven otherwise. Build a habit of skimming the diff first, the description second.\nFor the field: stop evaluating agents only on code. A pull request is a deliverable, and the description is part of the deliverable. An agent that writes correct code with a misleading PR is not a good agent — it\u0026rsquo;s a fast way to destroy reviewer trust.\nThe real story As agents become less assistants and more autonomous contributors, the bottleneck shifts. It\u0026rsquo;s not whether they can write code. They can. The question is whether they can be trusted to describe what they wrote.\nRight now, on 1.7% of attempts, they can\u0026rsquo;t. And that 1.7% is doing more damage to the relationship between humans and agents than any compile error ever could.\nReference This post is a divulgative summary of:\nPinna, G., Sarro, F., Sutton, C. (2026). Analyzing Message-Code Inconsistency in AI Coding Agent-Authored Pull Requests. 
In: Proceedings of the 23rd International Conference on Mining Software Repositories (MSR 2026) — Mining Challenge Track.\nRead the original paper (PDF)\nResearch conducted at University College London (UCL) and King\u0026rsquo;s College London.\n","permalink":"https://giovannipinna.net/posts/msr2026-message-code-inconsistency/","summary":"\u003cdiv class=\"summary-box\"\u003e\n  \u003cdiv class=\"summary-box-header\"\u003e\n    \u003csvg xmlns=\"http://www.w3.org/2000/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\"\u003e\u003cline x1=\"4\" y1=\"6\" x2=\"20\" y2=\"6\"/\u003e\u003cline x1=\"4\" y1=\"12\" x2=\"14\" y2=\"12\"/\u003e\u003cline x1=\"4\" y1=\"18\" x2=\"18\" y2=\"18\"/\u003e\u003c/svg\u003e\n    \u003cspan\u003eTL;DR\u003c/span\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"summary-box-content\"\u003e\n    We looked at \u003cstrong\u003e23,247 pull requests\u003c/strong\u003e written by AI coding agents and asked a simple question: does the description match the diff? In \u003cstrong\u003e1.7%\u003c/strong\u003e of cases, no. Sounds tiny — until you see the consequences. Inconsistent PRs get \u003cstrong\u003e51.7% lower acceptance rates\u003c/strong\u003e and take \u003cstrong\u003e3.5× longer to merge\u003c/strong\u003e. The code is fine. The story the agent tells about the code is the problem.\n  \u003c/div\u003e\n\u003c/div\u003e\n\n\u003ch2 id=\"the-part-of-a-pr-nobody-measures\"\u003eThe part of a PR nobody measures\u003c/h2\u003e\n\u003cp\u003eWhen we benchmark AI coding agents, we measure code. Does it compile? Does it pass the tests? 
Is it clean?\u003c/p\u003e","title":"When AI Agents Lie About Their Own Code (Without Meaning To)"},{"content":" TL;DR Classifying software hotfixes — the panic-mode patches you ship to fix something that\u0026rsquo;s broken in production right now — is hard for ML: tiny dataset (88 entries, 17 categories), brutal class imbalance, and expensive LLM features. HotCat reframes feature engineering as a search problem: NSGA-II evolves binary masks over 18 features, optimizing accuracy, NMI, and runtime simultaneously. A two-stage data augmentation lifts generalization from 55% → 72%. The Pareto front gives a balanced config: 59% accuracy, 0.58 NMI, 129 seconds. Most surprising: some features actively hurt — pruning them is both faster and more accurate. Hotfixes are not normal bugs In any normal software project, bugs queue up. They get triaged, prioritized, scheduled into sprints. Some sit there for months.\nHotfixes are the bugs that don\u0026rsquo;t get to wait. Authentication broke. Payments stopped. Customer data is leaking. The release pipeline gets bypassed and a patch goes out now. These are the most expensive bugs to ship — both in calm-Friday-evenings lost and in dollars.\nUnderstanding the shape of your hotfixes — what kinds of failures keep needing emergency patches — is enormously useful. It tells you where your codebase is brittle, what testing categories are failing you, what processes need to change. The classic tool for this is a bug taxonomy: a structured way of saying \u0026ldquo;this hotfix was a memory leak, that one was a race condition.\u0026rdquo;\nBuilding a good taxonomy automatically is hard. 
Hotfixes are sparse, the categories are wildly imbalanced, and the semantic analysis you need usually requires LLMs — which cost real money to run at scale.\nTwo problems at once We had two motivations stacked on top of each other.\nMethodologically: can we classify hotfixes well despite tiny data and class imbalance?\nEnvironmentally: can we do it without burning unnecessary LLM cycles?\nThese aren\u0026rsquo;t separate questions. The cheapest classifier is one that uses fewer features. The most accurate one is whatever set of features happens to actually carry signal. If those two things overlap — if some features cost a lot and don\u0026rsquo;t help — then both problems can be solved at the same time.\nSo we asked the obvious question: which features actually matter?\nHow HotCat works We use the HotBugs dataset — 88 hotfix entries across 17 bug categories from real-world projects. For each, we collect three flavors of features:\nCode-level features from the diff itself (lines added/removed, files changed, syntactic stuff) Process metadata from Jira (time to fix, number of participants, priority) LLM-generated summaries of each patch, embedded with Sentence-BERT, then organized via K-Means Eighteen features in total. That gives 2^18 ≈ 260,000 possible feature subsets. Manual selection is hopeless.\nSo we threw NSGA-II at it. Each candidate is a binary mask over the 18 features — keep or drop, on each one. Three objectives, optimized jointly:\nMaximize classification accuracy. Maximize NMI (Normalized Mutual Information — robust to class imbalance, unlike raw accuracy). Minimize runtime. Population of 20, evolved for 20 generations. Tiny by NSGA-II standards. Sufficient.\nThe data is too small. Here\u0026rsquo;s how we coped. 88 entries across 17 categories means some categories have a handful of examples. 
Generalization on the raw dataset capped at around 55%, which is borderline useful.\nWe added a two-stage augmentation strategy:\nCategory balancing — synthetic examples to even out the rare categories. Post-optimization record generation — additional data after feature selection, to harden the generalization. That brought generalization from 55% to 72%. A 17-point jump. Augmentation isn\u0026rsquo;t glamorous but it\u0026rsquo;s exactly what was needed here.\nResults The Pareto front gives a menu, not an answer. Two useful spots on it:\nBalanced config: 59% accuracy, 0.58 NMI, 129 seconds runtime. Best-accuracy config: 63% accuracy, 132 seconds — three more seconds buys you four points of accuracy.\nThe headline finding is structural and a bit counterintuitive. Some features were making things worse. Selectively removing them improved both accuracy and runtime. That\u0026rsquo;s the Green AI dream: cheaper because it\u0026rsquo;s better, not in spite of it.\nIt also flips the usual feature-engineering instinct. The default move when classification underperforms is to add more features. HotCat\u0026rsquo;s evidence says: measure first. Some of the features you added are noise, and noise hurts.\nWhy this matters beyond hotfixes Two takeaways travel beyond this specific problem.\nMulti-objective optimization on the feature space is underused. Most ML pipelines treat feature engineering as a one-time human exercise. NSGA-II makes it a continuous search problem with explicit trade-offs you can choose between. That framing applies any time you have many candidate features and a real cost-quality tradeoff.\nGreen AI is not a tax — it can be a guide. Treating runtime as a first-class objective rather than an afterthought changes which features survive. The result is leaner and better. 
As LLM-based analysis tools spread through software engineering pipelines, the org that bothers to do this kind of tuning will pay less and ship better.\nIf you\u0026rsquo;re staring at a classification task with too many features and not enough data, the right next move might not be more features. It might be a Pareto front.\nReference This post is a divulgative summary of:\nPinna, G., Sarro, F. (2025). HotCat: Green and Effective Feature Selection for Hotfix Bug Taxonomy. In: Proceedings of the 17th Symposium on Search-Based Software Engineering (SSBSE 2025) — Challenge Track on Hot Fixing Benchmark.\nRead the original paper (PDF)\nResearch conducted at University College London (UCL) and the University of Trieste.\n","permalink":"https://giovannipinna.net/posts/ssbse2025-hotcat/","summary":"\u003cdiv class=\"summary-box\"\u003e\n  \u003cdiv class=\"summary-box-header\"\u003e\n    \u003csvg xmlns=\"http://www.w3.org/2000/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\"\u003e\u003cline x1=\"4\" y1=\"6\" x2=\"20\" y2=\"6\"/\u003e\u003cline x1=\"4\" y1=\"12\" x2=\"14\" y2=\"12\"/\u003e\u003cline x1=\"4\" y1=\"18\" x2=\"18\" y2=\"18\"/\u003e\u003c/svg\u003e\n    \u003cspan\u003eTL;DR\u003c/span\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"summary-box-content\"\u003e\n    Classifying software \u003cem\u003ehotfixes\u003c/em\u003e — the panic-mode patches you ship to fix something that\u0026rsquo;s broken in production right now — is hard for ML: tiny dataset (88 entries, 17 categories), brutal class imbalance, and expensive LLM features. \u003cstrong\u003eHotCat\u003c/strong\u003e reframes feature engineering as a search problem: NSGA-II evolves binary masks over 18 features, optimizing accuracy, NMI, \u003cem\u003eand\u003c/em\u003e runtime simultaneously. 
A two-stage data augmentation lifts generalization from \u003cstrong\u003e55% → 72%\u003c/strong\u003e. The Pareto front gives a balanced config: \u003cstrong\u003e59% accuracy, 0.58 NMI, 129 seconds\u003c/strong\u003e. Most surprising: \u003cstrong\u003esome features actively hurt\u003c/strong\u003e — pruning them is both faster \u003cem\u003eand\u003c/em\u003e more accurate.\n  \u003c/div\u003e\n\u003c/div\u003e\n\n\u003ch2 id=\"hotfixes-are-not-normal-bugs\"\u003eHotfixes are not normal bugs\u003c/h2\u003e\n\u003cp\u003eIn any normal software project, bugs queue up. They get triaged, prioritized, scheduled into sprints. Some sit there for months.\u003c/p\u003e","title":"Sometimes the Best Feature Engineering Is Throwing Features Away"},{"content":" TL;DR AI coding agents burn 100,000+ tokens per task. When the task is \u0026ldquo;optimize this code\u0026rsquo;s performance,\u0026rdquo; the agent itself often costs more energy than the optimized code will ever save. We built GA4GC — Greener Agent for Greener Code — using NSGA-II to tune the agent\u0026rsquo;s own configuration against three objectives: code correctness, code speedup, and agent runtime. On a mini-SWE-agent powered by Gemini 2.5 Pro on the SWE-Perf benchmark, we got 37.7% runtime reduction while also improving correctness, with a 135× hypervolume improvement over defaults. Bonus finding: temperature is the single most important knob, and LLM hyperparameters control quality while agent constraints control cost — they can be tuned almost independently. The energy paradox nobody talks about Here\u0026rsquo;s a thing that should be obvious but isn\u0026rsquo;t: when you ask an AI agent to optimize the performance of your code, the agent\u0026rsquo;s own execution costs energy. A lot of energy. Often more than the code it\u0026rsquo;s optimizing will ever save.\nThink about the math. The agent reads files, plans, generates code, runs tests, debugs, iterates. 
A real run on a non-trivial repo eats six figures of tokens. Now suppose it shaves 50ms off a function. How many times does that function need to run to break even on the energy spent making it faster?\nFor some tasks: hundreds of thousands of runs. For others: never. Some \u0026ldquo;optimizations\u0026rdquo; are net energy losses.\nThat\u0026rsquo;s an unsettling thing to discover when you\u0026rsquo;re being told AI is going to make software more efficient.\nWhy the agent\u0026rsquo;s defaults are a bad starting point AI coding agents have surprisingly large configuration spaces. Temperature. Top_p. Max tokens per step. Max number of steps. Prompt template variants. These knobs interact, often counterintuitively. Higher temperature can help on creative tasks and waste budget on simple ones. Loose step limits give the agent room to iterate but also room to wander.\nThe defaults that ship with these agents are picked by humans for reasonable-looking averages. They are not picked for your task, your codebase, or your energy budget. Most of them are visibly suboptimal once you actually measure.\nSo we asked: what if we treat agent configuration as a search problem?\nGA4GC: a search loop on top of the agent The setup is simple in spirit, gnarly in practice.\nWe took a mini-SWE-agent running on Gemini 2.5 Pro and let NSGA-II — a multi-objective evolutionary algorithm — evolve its configuration. NSGA-II doesn\u0026rsquo;t try to find a single best config. It maps out a Pareto front: a frontier of configs where you can\u0026rsquo;t improve one objective without sacrificing another.\nThree objectives:\nMinimize incorrect patches. Correct code first, always. Maximize performance gain. The whole point of the task is to make the target code faster. Minimize agent runtime. Don\u0026rsquo;t let the optimizer cost more than the optimization is worth. The agent runs on SWE-Perf, a benchmark of real performance-tuning tasks from the astropy Python library. 
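The multi-objective selection NSGA-II performs rests on Pareto dominance, which is easy to sketch. The candidate names and objective scores below are invented for illustration; all three objectives are recast as "smaller is better" (incorrect patches, negated speedup, runtime seconds):

```python
# Hypothetical agent configs scored on GA4GC's three objectives,
# each minimized: (incorrect patches, -speedup, agent runtime in seconds).
candidates = {
    "default":    (3, -1.10, 1513.0),
    "low_temp":   (2, -1.05,  943.0),
    "high_steps": (3, -1.12, 1800.0),
    "tuned":      (1, -1.15,  990.0),
}

def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(scores):
    """Keep the configs no other config dominates."""
    return {
        name for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    }

front = pareto_front(candidates)
# "default" falls off the front: "tuned" beats it on all three objectives.
```

NSGA-II adds selection pressure and crowding on top of this dominance test, but the front it hands back is exactly this kind of set: configs where no objective can improve without another getting worse.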
Each candidate config is evaluated in an isolated Docker environment for reproducibility.\nNSGA-II handles the heterogeneous configuration space — continuous knobs (temperature, top_p), integer constraints (max tokens, step limits), categorical choices (prompt templates) — by applying the right operators to each.\nWhat we found in 25 evaluations Yes, 25. That\u0026rsquo;s the entire budget. The point of GA4GC isn\u0026rsquo;t to be expensive — it\u0026rsquo;s to be cheaper than the alternative of doing nothing.\nThe non-dominated configurations achieved:\n37.7% runtime reduction. Default config: 1,513 seconds. Best Pareto config: 943 seconds. Better correctness too. Not a tradeoff — actually better. 135× hypervolume improvement over the default baseline. (Hypervolume measures how much of the objective space the Pareto front covers — bigger is better.) The headline: the defaults aren\u0026rsquo;t just suboptimal, they\u0026rsquo;re badly suboptimal. Significant gains in both quality and efficiency are sitting there waiting for anyone who runs even a tiny tuning loop.\nThe structural finding that surprised us We ran a Random Forest regression to figure out which knobs actually matter. Two things popped out.\nTemperature dominates. Of all the knobs, temperature is the single most important one. That makes intuitive sense — it shapes the agent\u0026rsquo;s whole exploration style — but the magnitude of its influence was bigger than we expected.\nLLM hyperparameters drive quality. Agent constraints drive cost. They\u0026rsquo;re decoupled.\nThis is the actionable finding. If you tune temperature and top_p, you\u0026rsquo;re moving the dial on whether the agent produces good code. If you tune token caps and step limits, you\u0026rsquo;re moving the dial on how much it costs you. The two control surfaces don\u0026rsquo;t fight each other much. 
You can optimize quality and cost almost independently — which, methodologically, is great news.\nThree deployment recipes The Pareto front isn\u0026rsquo;t a single answer; it\u0026rsquo;s a menu. Three useful points on it:\nRuntime-critical. Low temperature, restrictive top_p. Less creative, faster, cheap. Use when you need answers quickly on relatively straightforward tasks.\nPerformance-critical. Moderate temperature (0.65–0.73), balanced top_p. The agent has room to actually find better solutions, at the cost of more compute. Use when the speedup you\u0026rsquo;re trying to extract is worth more than the agent\u0026rsquo;s runtime.\nContext-specific. Run GA4GC on your own codebase and task distribution. You\u0026rsquo;ll get a Pareto front tailored to your environment, which beats picking from generic recipes.\nWhy this is more than a benchmarking trick As AI coding agents move from cool demos to standard infrastructure, their cumulative compute footprint becomes a real sustainability question. An org running hundreds of agent tasks a day is spending serious money and serious energy. Most of it is preventable.\nThe lesson here is that configuration tuning is a sustainability lever, not just a performance one. You don\u0026rsquo;t need a smaller model or special hardware to make AI tooling greener — you need to stop accepting defaults that nobody picked for your situation.\nIf you\u0026rsquo;re shipping AI agents into production, run a small NSGA-II loop on your config space before you scale up. The energy you save will be its own reward, and the better correctness you\u0026rsquo;ll get is a free side effect.\nReference This post is a divulgative summary of:\nPinna, G., Sarro, F. (2025). GA4GC: Greener Agent for Greener Code. 
In: Proceedings of the 17th Symposium on Search-Based Software Engineering (SSBSE 2025) — Challenge Track on Green SBSE.\nRead the original paper (PDF)\nResearch conducted at University College London (UCL) and the University of Trieste.\n","permalink":"https://giovannipinna.net/posts/ssbse2025-ga4gc/","summary":"\u003cdiv class=\"summary-box\"\u003e\n  \u003cdiv class=\"summary-box-header\"\u003e\n    \u003csvg xmlns=\"http://www.w3.org/2000/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\"\u003e\u003cline x1=\"4\" y1=\"6\" x2=\"20\" y2=\"6\"/\u003e\u003cline x1=\"4\" y1=\"12\" x2=\"14\" y2=\"12\"/\u003e\u003cline x1=\"4\" y1=\"18\" x2=\"18\" y2=\"18\"/\u003e\u003c/svg\u003e\n    \u003cspan\u003eTL;DR\u003c/span\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"summary-box-content\"\u003e\n    AI coding agents burn 100,000+ tokens per task. When the task is \u0026ldquo;optimize this code\u0026rsquo;s performance,\u0026rdquo; the agent itself often costs more energy than the optimized code will ever save. We built \u003cstrong\u003eGA4GC\u003c/strong\u003e — Greener Agent for Greener Code — using \u003cstrong\u003eNSGA-II\u003c/strong\u003e to tune the agent\u0026rsquo;s own configuration against three objectives: code correctness, code speedup, and \u003cem\u003eagent runtime\u003c/em\u003e. On a mini-SWE-agent powered by Gemini 2.5 Pro on the SWE-Perf benchmark, we got \u003cstrong\u003e37.7% runtime reduction\u003c/strong\u003e while \u003cem\u003ealso\u003c/em\u003e improving correctness, with a \u003cstrong\u003e135× hypervolume improvement\u003c/strong\u003e over defaults. 
Bonus finding: temperature is the single most important knob, and LLM hyperparameters control quality while agent constraints control cost — they can be tuned almost independently.\n  \u003c/div\u003e\n\u003c/div\u003e\n\n\u003ch2 id=\"the-energy-paradox-nobody-talks-about\"\u003eThe energy paradox nobody talks about\u003c/h2\u003e\n\u003cp\u003eHere\u0026rsquo;s a thing that should be obvious but isn\u0026rsquo;t: when you ask an AI agent to \u003cem\u003eoptimize the performance of your code\u003c/em\u003e, the agent\u0026rsquo;s own execution costs energy. A lot of energy. Often more than the code it\u0026rsquo;s optimizing will ever save.\u003c/p\u003e","title":"Sometimes Your AI Agent Burns More Energy Optimizing Code Than the Code Will Ever Save"},{"content":" TL;DR Text-to-SQL is everywhere, but we measure it badly. Exact Match punishes you for swapping users AS u. Execution Accuracy doesn\u0026rsquo;t care if you got 99 of 100 rows right — wrong is wrong. We built QAS (Query Accuracy Score): a continuous score that combines code-aware semantic similarity (how close is the SQL?) with edit-distance table similarity (how close is the answer?). Tested on 11 models on BIRD, QAS surfaces huge differences that binary metrics flatten into the same number. A field built on coin flips Text-to-SQL is one of those areas where the demos look magical. Type a question in English, get a SQL query back, get an answer from your database. No DBA needed. The promise is enormous.\nThe progress is real. The measurement of the progress is not.\nLook at any text-to-SQL leaderboard and you\u0026rsquo;ll see two metrics doing all the work: Exact Match and Execution Accuracy. Both are binary. A query is correct or it\u0026rsquo;s not. There is no in-between.\nThis is a problem. Most queries that \u0026ldquo;fail\u0026rdquo; are not actually random nonsense — they\u0026rsquo;re 80% right, with a wrong filter, or a missing column, or an extra row. 
Treating them as identical to \u0026ldquo;completely wrong query against the wrong table\u0026rdquo; throws away exactly the information we need to make models better.\nWhy Exact Match is a bad joke Exact Match compares two SQL strings character by character. Same string, score 1. Different string, score 0.\n-- Reference\nSELECT name FROM users WHERE age \u0026gt; 25\n-- Generated\nSELECT u.name FROM users AS u WHERE u.age \u0026gt; 25\nThese return identical results on every row of every database in the universe. Exact Match scores the second one as a complete failure.\nSQL is a declarative language. There are dozens of valid ways to write the same query. Aliases, JOIN order, subquery vs. JOIN, WHERE vs. HAVING — all syntactic flexibility, all semantically equivalent, all torpedoed by a metric that compares strings.\nWhy Execution Accuracy is better but still wrong Execution Accuracy at least runs both queries and compares the result tables. If the rows match, score 1. Otherwise, 0.\nThis handles the alias problem elegantly. It also recreates a different one: no partial credit.\nA query that returns 99 of 100 correct rows scores zero. A query that selects the right columns from the right tables but with one slightly off filter scores zero. A query against completely the wrong tables — also zero.\nThese are not the same kind of failure. Treating them as one is destroying signal that researchers and practitioners desperately need.\nQAS: a continuous score with two eyes We built QAS — the Query Accuracy Score — to fix this. It\u0026rsquo;s a number between 0 and 1, and it has two components measuring different things:\nS_C — Semantic Similarity. How close are the queries themselves? We embed both queries with UAE-Code-Large-V1, a model trained specifically on code. General-purpose text embeddings don\u0026rsquo;t understand SQL — they don\u0026rsquo;t know that LEFT JOIN and RIGHT JOIN aren\u0026rsquo;t synonyms, or that subqueries can be functionally identical to JOINs. 
Code-specialized embeddings do. We take the cosine similarity of the embeddings. That\u0026rsquo;s S_C.\nS_T — Table Similarity. How close are the results? We run both queries and compare the result tables with edit distance — the minimum number of insertions, deletions, and substitutions to transform one table into the other, normalized by size. Off by one row? High S_T. Off by every value? Low S_T.\nThe final score is just:\nQAS = w · S_T + (1 − w) · S_C\nWe tested how sensitive the ranking is to w using Kendall distance and the answer was: not very. Rankings are stable for w ∈ [0.25, 0.75]. We picked w = 0.5 because we wanted intent and outcome to count equally.\nA small but important sub-result: simple proxies like \u0026ldquo;do the result tables have the same number of rows?\u0026rdquo; don\u0026rsquo;t work. We measured the correlation between table-shape similarity and actual content similarity and it was essentially zero. Two tables with identical shape can contain entirely different data. Edit distance over the actual content is necessary.\nWhat QAS shows that binary metrics hide We ran QAS against 11 text-to-SQL models on the BIRD benchmark — fine-tuned specialists, general-purpose GPT-4-class systems, open-source models of various sizes.\nTwo findings stood out.\nHidden differences in \u0026ldquo;equivalent\u0026rdquo; models. Two models with the same Execution Accuracy of ~65% can have completely different shapes of failure. One was a reliable mediocre — close every time, but rarely perfect. The other was bipolar — perfect or catastrophic, nothing in between. Binary metrics call these \u0026ldquo;the same model.\u0026rdquo; They are not.\nDiagnostic power. The two components of QAS act like a tiny diagnostic kit:\nHigh S_C, low S_T → the model understood the query but messed up the values (wrong filter, missing condition). It speaks SQL, but it can\u0026rsquo;t do the math. Low S_C, high S_T → structurally different query, similar output. 
Either a clever reformulation or an accidental match. Low S_C, low S_T → the model didn\u0026rsquo;t understand the question. That distinction matters. The first case is fixable with better grounding on database content. The third needs better intent understanding. Binary metrics give you neither — just \u0026ldquo;wrong.\u0026rdquo;\nWhat this enables For researchers: stop reporting one number. Report a distribution. Showing that your model has higher mean QAS on failures than the baseline is a meaningful claim, even when binary accuracy is identical.\nFor practitioners: a 70% accurate model with high mean QAS on its failures is very different from a 70% accurate model with low mean QAS on its failures. The first is \u0026ldquo;close most of the time, ship it carefully.\u0026rdquo; The second is \u0026ldquo;binary success, plan for fallback.\u0026rdquo; Deployment decisions hinge on this.\nFor training: QAS is continuous and differentiable in spirit. It could be used as a richer reward signal than the pass/fail signals models train against today. We don\u0026rsquo;t have to settle for binary supervision when our evaluation metric finally has texture.\nThe TL;DR is simple. You can\u0026rsquo;t optimize what you can\u0026rsquo;t see. Binary metrics can\u0026rsquo;t see the gradient of \u0026ldquo;almost right.\u0026rdquo; QAS can.\nReference This post is a divulgative summary of:\nPinna, G., Manzoni, L., De Lorenzo, A., Castelli, M. (2025). Beyond Exact Set and Execution Matches: Redefining Text-to-SQL Metrics with Semantic and Structural Similarity. 
Scientific Reports, 15(1): 22357.\nRead the original paper (PDF) — Code on GitHub\nResearch conducted at the University of Trieste and NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa.\n","permalink":"https://giovannipinna.net/posts/scireports2025-text-to-sql-metrics/","summary":"\u003cdiv class=\"summary-box\"\u003e\n  \u003cdiv class=\"summary-box-header\"\u003e\n    \u003csvg xmlns=\"http://www.w3.org/2000/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\"\u003e\u003cline x1=\"4\" y1=\"6\" x2=\"20\" y2=\"6\"/\u003e\u003cline x1=\"4\" y1=\"12\" x2=\"14\" y2=\"12\"/\u003e\u003cline x1=\"4\" y1=\"18\" x2=\"18\" y2=\"18\"/\u003e\u003c/svg\u003e\n    \u003cspan\u003eTL;DR\u003c/span\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"summary-box-content\"\u003e\n    Text-to-SQL is everywhere, but we measure it badly. \u003cstrong\u003eExact Match\u003c/strong\u003e punishes you for swapping \u003ccode\u003eusers AS u\u003c/code\u003e. \u003cstrong\u003eExecution Accuracy\u003c/strong\u003e doesn\u0026rsquo;t care if you got 99 of 100 rows right — wrong is wrong. We built \u003cstrong\u003eQAS (Query Accuracy Score)\u003c/strong\u003e: a continuous score that combines code-aware semantic similarity (how close is the SQL?) with edit-distance table similarity (how close is the answer?). Tested on 11 models on BIRD, QAS surfaces \u003cem\u003ehuge\u003c/em\u003e differences that binary metrics flatten into the same number.\n  \u003c/div\u003e\n\u003c/div\u003e\n\n\u003ch2 id=\"a-field-built-on-coin-flips\"\u003eA field built on coin flips\u003c/h2\u003e\n\u003cp\u003eText-to-SQL is one of those areas where the demos look magical. Type a question in English, get a SQL query back, get an answer from your database. No DBA needed. 
The promise is enormous.\u003c/p\u003e","title":"The Text-to-SQL Field Has a Measurement Problem"},{"content":" TL;DR Our EuroGP 2024 work showed Genetic Improvement (GI) can rescue LLM-generated code. This follow-up makes the GI part itself smarter. Three upgrades: lexicase selection to keep specialists alive, 10% down-sampling to cut compute, and a refined fitness function (F_E) that gives partial credit instead of pass/fail. On four LLMs (GPT-4, ChatGPT, Code Llama 7B, LLaMA 3 8B) over three PSB2 problems, we improved 11 of 12 model-problem combinations. Smaller models gain the most. GI is, increasingly, a capability amplifier for cheap models. What we left on the table last time The EuroGP 2024 paper proved the basic idea: take an LLM\u0026rsquo;s buggy first draft, hand it to Grammatical Evolution, get back better code. Statistically significant gains on every model.\nBut the evolution itself was crude. Tournament selection. A binary fitness function. A search budget that scaled badly with the number of test cases. We had a working pipeline that was leaving wins on the floor.\nThis paper is the audit. We rebuilt three pieces of the GI loop — selection, sampling, fitness — and asked whether smarter evolution buys you more.\nThe short answer: yes.\nWhy tournament selection is the wrong default Tournament selection picks individuals by mini-competitions: grab a few at random, keep the best. It\u0026rsquo;s fast and easy and it has a known weakness — it loves generalists and kills specialists.\nThat matters for code. Imagine two variants of an LLM\u0026rsquo;s draft:\nVariant A: passes 60% of test cases, fails the rest mediocrely. Variant B: aces every integer-related test case, fails on string handling. Variant A wins the tournament every time. 
Variant B carries valuable partial knowledge that crossover could have combined with another specialist on string handling — but it never makes it past round one.\nTournament selection treats program improvement like a single dimension. Real programs fail along many dimensions at once.\nLexicase: keeping the weirdos alive Lexicase selection evaluates candidates one test case at a time, in a random order, filtering out anyone who isn\u0026rsquo;t tied for the best on that case. The order is shuffled every selection event, so being a specialist on any subset of cases is a survival strategy.\nThis sounds expensive — and on 1,000 test cases per problem, it would be. So we paired it with down-sampling: at every generation, only 10% of test cases are used. Different 10% each generation, so the full test suite still exerts pressure over time, just spread out.\nThe combination keeps specialists alive without the compute bill of full lexicase on full test suites.\nGiving the search a finer compass The original fitness function was the fraction of test cases passed. Binary per case. A test expecting [1, 2, 3, 4, 5] rewards [1, 2, 3, 4, 6] (one digit off) the same as \u0026quot;hello\u0026quot; (semantic chaos).\nThat\u0026rsquo;s wasted gradient information. We built F_E, a fitness function that measures how close the output is to the expected one, per test case. For numbers, distance. For sequences, element-wise comparison. Now \u0026ldquo;almost right\u0026rdquo; is a different number from \u0026ldquo;completely wrong,\u0026rdquo; and the search can climb the right hill instead of treating the whole landscape as a cliff.\nWhat we ran Four LLMs spanning the spectrum: GPT-4, ChatGPT, Code Llama 7B, LLaMA 3 8B. Three PSB2 problems chosen for difficulty diversity. 
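The partial-credit idea behind F_E can be sketched in a few lines of Python. This is an illustrative sketch under assumed distance choices (reciprocal distance for numbers, element-wise averaging for sequences) — not the paper's exact F_E definition:

```python
def f_e(expected, actual):
    # Partial-credit fitness per test case: 1.0 for an exact match,
    # smoothly decreasing as the output drifts from the expectation.
    # Assumed distance choices, for illustration only.
    if isinstance(expected, (int, float)):
        d = abs(expected - actual)
        return 1.0 / (1.0 + d)  # numbers: reciprocal of absolute distance
    pairs = list(zip(expected, actual))
    if not pairs:
        return 1.0 if expected == actual else 0.0
    per_elem = [f_e(e, a) for e, a in pairs]
    # Sequences: average element-wise credit; unmatched trailing elements score 0.
    return sum(per_elem) / max(len(expected), len(actual))
```

Under a scheme like this, [1, 2, 3, 4, 6] scores far higher against an expected [1, 2, 3, 4, 5] than an output that is wrong everywhere — exactly the gradient a binary pass/fail fitness throws away.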
Population of 200 individuals (down from 1,000 — better selection means we don\u0026rsquo;t need brute force), up to 100 generations, 30 repeats each for statistical robustness.\nWhat we got 11 out of 12 model-problem combinations improved. That\u0026rsquo;s not luck.\nSome details worth highlighting:\nSmaller models gained the most, again. Code Llama 7B and LLaMA 3 8B saw the biggest relative jumps. GPT-4 also gained, but in absolute terms its starting point was already strong. Lexicase actually maintains diversity. We could see it in the population dynamics — multiple distinct specialists co-existing for many generations, recombining through crossover into hybrids that neither tournament-selection nor self-correction would ever discover. Down-sampling is essentially free. Cutting evaluations to 10% of test cases per generation didn\u0026rsquo;t degrade final solution quality on our problems. This matters: GI is much more deployable when the per-generation cost is bearable. F_E pays off most on hard problems. When the LLM\u0026rsquo;s seed is already close, partial credit and binary credit converge. When the seed is far, F_E gives the search something to follow. And once again, GI beats self-correction We re-ran the comparison with self-correction. Same conclusion as last year, with stronger evidence: the evolutionary loop finds fixes that the model can\u0026rsquo;t find by re-prompting itself, especially when the original code has structural issues the model is blind to.\nIf you\u0026rsquo;re already running self-correction in production, this isn\u0026rsquo;t a replacement — it\u0026rsquo;s a stack. Self-correct first if you want; then run GI on top. The two failure modes are different, and the gains compound.\nWhat this confirms The big-picture lesson hasn\u0026rsquo;t changed since EuroGP 2024, but it\u0026rsquo;s getting more solid: GI is a capability amplifier. It compresses the gap between cheap models and expensive ones. 
A 7B parameter model with a smart GI loop on top can land in the same neighborhood as a frontier model running raw — at a fraction of the inference cost.\nFor organizations that can\u0026rsquo;t afford to call GPT-4 on every code generation request, that\u0026rsquo;s not a footnote. That\u0026rsquo;s the headline.\nWhat\u0026rsquo;s still hard Three honest limitations:\nOracle dependency. GI needs a fitness signal. No test cases? You\u0026rsquo;re stuck. Generating tests automatically is a separate hard problem. Scale. PSB2 is small programs. We don\u0026rsquo;t yet know how this behaves on multi-file repository changes. Grammar bias. Building the mutation grammar from the LLM\u0026rsquo;s output means the grammar inherits the LLM\u0026rsquo;s blind spots. If the model never produces a while loop, the search will never explore one either. These are the next things we\u0026rsquo;re chasing.\nReference This post is a divulgative summary of:\nPinna, G., Manzoni, L., De Lorenzo, A., Castelli, M. (2025). Exploring the Effect of Genetic Improvement for Large Language Models generated Code. 
SN Computer Science, 6(7).\nRead the original paper (PDF)\nResearch conducted at the University of Trieste and NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa.\n","permalink":"https://giovannipinna.net/posts/sncs2025-exploring-gi-effect/","summary":"\u003cdiv class=\"summary-box\"\u003e\n  \u003cdiv class=\"summary-box-header\"\u003e\n    \u003csvg xmlns=\"http://www.w3.org/2000/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\"\u003e\u003cline x1=\"4\" y1=\"6\" x2=\"20\" y2=\"6\"/\u003e\u003cline x1=\"4\" y1=\"12\" x2=\"14\" y2=\"12\"/\u003e\u003cline x1=\"4\" y1=\"18\" x2=\"18\" y2=\"18\"/\u003e\u003c/svg\u003e\n    \u003cspan\u003eTL;DR\u003c/span\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"summary-box-content\"\u003e\n    Our EuroGP 2024 work showed Genetic Improvement (GI) can rescue LLM-generated code. This follow-up makes the GI part itself smarter. Three upgrades: \u003cstrong\u003elexicase selection\u003c/strong\u003e to keep specialists alive, \u003cstrong\u003e10% down-sampling\u003c/strong\u003e to cut compute, and a \u003cstrong\u003erefined fitness function (F_E)\u003c/strong\u003e that gives partial credit instead of pass/fail. On four LLMs (GPT-4, ChatGPT, Code Llama 7B, LLaMA 3 8B) over three PSB2 problems, we improved \u003cstrong\u003e11 of 12 model-problem combinations\u003c/strong\u003e. Smaller models gain the most. GI is, increasingly, a \u003cstrong\u003ecapability amplifier\u003c/strong\u003e for cheap models.\n  \u003c/div\u003e\n\u003c/div\u003e\n\n\u003ch2 id=\"what-we-left-on-the-table-last-time\"\u003eWhat we left on the table last time\u003c/h2\u003e\n\u003cp\u003eThe EuroGP 2024 paper proved the basic idea: take an LLM\u0026rsquo;s buggy first draft, hand it to Grammatical Evolution, get back better code. 
Statistically significant gains on every model.\u003c/p\u003e","title":"Making the LLM-Plus-Evolution Pipeline Actually Smart"},{"content":" Abstract This paper provides a comprehensive summary of our research program on applying Genetic Improvement to code generated by Large Language Models, consolidating findings from two published studies (EuroGP 2024 and SN Computer Science 2025). Across both works, we demonstrate that neural and evolutionary approaches are fundamentally complementary: LLMs excel at rapidly generating structurally plausible code, while Genetic Improvement refines it toward precise specifications through grammar-based evolutionary search. A consistent finding is the \u0026ldquo;capability amplifier\u0026rdquo; effect — smaller open-source models benefit disproportionately from GI, narrowing the gap with larger proprietary models. We also discuss key limitations including oracle dependency, scalability constraints to multi-file projects, bias propagation from LLM-generated grammars, and the stochastic nature of evolutionary algorithms. Presented at Ital-IA 2025, the 5th National Conference on Artificial Intelligence, Rome, Italy. Introduction The intersection of Large Language Models and evolutionary computation represents one of the most promising frontiers in automated software engineering. Over the past two years, our research group has developed and refined a methodology for systematically improving code generated by LLMs using Genetic Improvement (GI) techniques. This paper, presented at Ital-IA 2025 (the 5th National Conference on Artificial Intelligence, organized by CINI), provides a comprehensive summary of this research program and its key findings.\nThe Research Program Our work builds on a simple but powerful observation: LLM-generated code, even when it fails to fully meet specifications, typically contains valuable structural information. 
The generated programs use appropriate data types, implement reasonable algorithms, and capture the general shape of correct solutions. What they lack is precision — the exact conditions, edge case handling, and algorithmic details that separate \u0026ldquo;roughly right\u0026rdquo; from \u0026ldquo;fully correct.\u0026rdquo;\nGenetic Improvement exploits this observation by treating LLM outputs as starting points for evolutionary optimization. Rather than discarding incorrect code and starting from scratch, GI evolves it toward correctness through a process of guided variation and selection.\nKey Contributions Across Two Studies Study 1: GI with Grammatical Evolution (EuroGP 2024) Our first study introduced the three-phase pipeline of code extraction, dynamic grammar specialization, and evolutionary search. The key technical innovation was the automatic generation of BNF grammars tailored to each LLM-generated program, ensuring that mutations remain syntactically valid and focused on relevant code constructs.\nEvaluated on 25 PSB2 problems across 5 LLMs (GPT-4, ChatGPT, LLaMA-2 13B, Alpaca-13B, Alpaca-7B), the approach achieved statistically significant improvements (p \u0026lt; 0.001) for every model tested, consistently outperforming LLM self-correction.\nStudy 2: Enhanced GI with Lexicase Selection (SN Computer Science 2025) Our second study refined the evolutionary components. We introduced lexicase selection — a strategy that evaluates individuals on test cases sequentially, preserving specialist solutions that excel on specific subsets. 
Combined with 10% down-sampling for computational efficiency and a refined fitness function F_E providing partial credit rather than binary pass/fail, the enhanced pipeline improved performance in 11 out of 12 model-problem combinations.\nThe updated model roster included Code Llama 7B and LLaMA 3 8B, with results confirming that GI provides the greatest relative benefit to smaller, less capable models — effectively amplifying their capabilities.\nCross-Cutting Findings Several findings emerged consistently across both studies:\nThe Complementarity of Neural and Evolutionary Approaches LLMs and evolutionary algorithms have fundamentally different strengths. LLMs excel at rapid generation of plausible solutions by leveraging patterns learned from vast code corpora. Evolutionary algorithms excel at systematic, specification-driven refinement. Combining them yields results neither approach achieves alone.\nThe \u0026ldquo;Capability Amplifier\u0026rdquo; Effect GI\u0026rsquo;s relative benefit is inversely proportional to the initial capability of the LLM. Weaker models (Alpaca-7B, Code Llama 7B) show the most dramatic improvements, while stronger models (GPT-4) show smaller but still significant gains. This has important practical implications: organizations using smaller, open-source models for cost or privacy reasons can use GI to narrow the gap with larger proprietary models.\nGrammar Specialization as Search Space Design The dynamic generation of problem-specific grammars proved to be more than a technical convenience — it is a form of intelligent search space design. By leveraging the LLM\u0026rsquo;s structural choices as prior knowledge, the evolutionary search operates in a much smaller, more productive region of the program space.\nLimitations and Open Challenges We identified several limitations that define the boundaries of the current approach:\nOracle dependency: The fitness function requires test cases or a reference oracle for evaluation. 
Problems without clear specifications or test suites cannot be addressed with the current methodology.\nScalability constraints: Our evaluation focused on small, self-contained programs. Real-world software engineering involves multi-file projects with complex dependencies, which the current grammar-based approach does not handle.\nLLM bias propagation: Since the grammar is derived from the LLM\u0026rsquo;s output, structural biases in the generated code — such as preferring certain loop constructs or data structures — are inherited by the search space. This may prevent the discovery of solutions requiring fundamentally different architectural choices.\nLack of guarantees: Evolutionary algorithms are stochastic by nature. While our results show consistent improvements on average, there is no guarantee of improvement for any individual run or problem instance.\nFuture Directions The research program continues along several fronts: exploring grammar-free GI approaches that can operate directly on ASTs without the BNF intermediary; developing fitness approximation techniques to reduce the dependency on exhaustive test case evaluation; and investigating the integration of GI into larger-scale software engineering workflows where multi-file modifications are necessary.\nPresented at the 5th National Conference on Artificial Intelligence (Ital-IA 2025), organized by CINI, June 23-24, 2025, Rome, Italy. 
This research was conducted at the University of Trieste and NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa.\n","permalink":"https://giovannipinna.net/posts/italia2025-gi-summary/","summary":"\u003cdiv class=\"summary-box\"\u003e\n  \u003cdiv class=\"summary-box-header\"\u003e\n    \u003csvg xmlns=\"http://www.w3.org/2000/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\"\u003e\u003cline x1=\"4\" y1=\"6\" x2=\"20\" y2=\"6\"/\u003e\u003cline x1=\"4\" y1=\"12\" x2=\"14\" y2=\"12\"/\u003e\u003cline x1=\"4\" y1=\"18\" x2=\"18\" y2=\"18\"/\u003e\u003c/svg\u003e\n    \u003cspan\u003eAbstract\u003c/span\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"summary-box-content\"\u003e\n    This paper provides a comprehensive summary of our research program on applying Genetic Improvement to code generated by Large Language Models, consolidating findings from two published studies (EuroGP 2024 and SN Computer Science 2025). Across both works, we demonstrate that neural and evolutionary approaches are fundamentally complementary: LLMs excel at rapidly generating structurally plausible code, while Genetic Improvement refines it toward precise specifications through grammar-based evolutionary search. A consistent finding is the \u0026ldquo;capability amplifier\u0026rdquo; effect — smaller open-source models benefit disproportionately from GI, narrowing the gap with larger proprietary models. We also discuss key limitations including oracle dependency, scalability constraints to multi-file projects, bias propagation from LLM-generated grammars, and the stochastic nature of evolutionary algorithms. 
Presented at Ital-IA 2025, the 5th National Conference on Artificial Intelligence, Rome, Italy.\n  \u003c/div\u003e\n\u003c/div\u003e\n\n\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003eThe intersection of Large Language Models and evolutionary computation represents one of the most promising frontiers in automated software engineering. Over the past two years, our research group has developed and refined a methodology for systematically improving code generated by LLMs using Genetic Improvement (GI) techniques. This paper, presented at \u003cstrong\u003eItal-IA 2025\u003c/strong\u003e (the 5th National Conference on Artificial Intelligence, organized by CINI), provides a comprehensive summary of this research program and its key findings.\u003c/p\u003e","title":"Improving LLM-Generated Code via Genetic Improvement: A Summary of Recent Advances"},{"content":" TL;DR Italian Constitutional Court rulings are written for lawyers. Most citizens can\u0026rsquo;t follow them. Can an LLM fix that? We ran a 75-person human study comparing four versions of the same legal content: original judgments, expert \u0026ldquo;massime\u0026rdquo; summaries, GPT-4o summaries, and a fine-tuned LLaMA 2 7B. Comprehension rates: expert summaries 45%, GPT-4o 38%, raw judgments 33%, fine-tuned LLaMA 30%. GPT-4o really does make legal text more readable. It also produces a worrying pattern: confident, fluent, wrong — readers leave with strongly held but incorrect understandings. Use LLMs for legal summarization. Don\u0026rsquo;t use them without human review. A democratic problem dressed up as a technical one Constitutional Court rulings are some of the most important documents a country produces. They define what your government can and can\u0026rsquo;t do. They shape what your rights are. 
They are also written by lawyers, for lawyers, in dense Italian legal prose that the average citizen has no chance of understanding.\nThe Italian Corte Costituzionale already tries to fix this. For each judgment they publish a massima — a condensed summary written by legal experts whose entire job is making case law accessible. Massime are better than full judgments, but they still assume legal literacy most people don\u0026rsquo;t have.\nSo the question is straightforward: can an LLM help close the gap? Not by replacing judges or replacing legal experts, but by producing summaries that an ordinary citizen can actually read?\nWe ran the experiment.\nWhat we tested Four versions of the same legal content:\nOriginal judgments (sentenze) — the raw text from the court. Expert massime — the human-written summaries, our quality ceiling. GPT-4o summaries — generated by prompting OpenAI\u0026rsquo;s GPT-4o on each judgment. Fine-tuned LLaMA 2 7B — a smaller open-source model trained on 10,000 judgment-massima pairs scraped from the Court\u0026rsquo;s archives. We also tried Gemma 2B/7B and LLaMantino 7B (an Italian-specialized LLaMA) in the fine-tuning pipeline; LLaMA 2 7B was the best performer, so it represents the open-source side in the human study.\nHow we measured comprehension 75 participants. Roughly 25% with legal knowledge (law students, professionals), 75% without (general public). Each person read summaries across text types and answered comprehension questions on the actual content.\nWe ran it as a between-subjects design — each underlying case was seen by each participant in only one of the four formats — to kill learning effects. Differences across formats were tested with chi-squared.\nWhat we found The headline numbers:\nExpert massime: 45% comprehension. Our ceiling, as expected. GPT-4o: 38% comprehension. Significantly better than raw judgments. Original judgments: 33% comprehension. The status quo. Fine-tuned LLaMA 2 7B: 30% comprehension. 
Slightly worse than the raw judgment. That last one is worth a beat. Fine-tuning a small open model on 10,000 expert summaries didn\u0026rsquo;t help. It hurt. Capacity matters; for this task, 7B parameters appears to be too small to internalize the structural understanding that makes a good massima.\nGPT-4o, on the other hand, gives a meaningful 5-point lift over reading the judgment yourself. That\u0026rsquo;s real.\nAnd then it gets uncomfortable Here\u0026rsquo;s the part that should give you pause.\nWhen we looked at which kinds of wrong answers people gave, GPT-4o readers showed a much higher rate of confident incorrectness. They didn\u0026rsquo;t just misunderstand — they came away with strong, definite, wrong understandings of what the court had ruled.\nThe text was fluent. Authoritative. Smooth. It read like an expert wrote it. And in the cases where it was wrong, that fluency made the readers more, not less, sure that they understood.\nThis isn\u0026rsquo;t unique to legal summarization. It\u0026rsquo;s the well-known LLM pattern of fluent confabulation. But the stakes change radically when the topic is \u0026ldquo;what did the constitutional court say about your rights.\u0026rdquo; A confidently wrong reader of a court ruling is worse than a confused reader of one. Confusion prompts you to ask. Confidence does not.\nWhat educational background did to the picture Participants with legal knowledge had more uniform comprehension across all text types — they could read the original judgment about as well as the summary. The format mattered less because they brought their own grounding.\nParticipants without legal knowledge were enormously dependent on the format. They benefited the most from a good summary — and were the most vulnerable to a confidently wrong summary.\nIn other words: the people LLM summarization is supposed to help are also the people most exposed to its failure modes. 
That\u0026rsquo;s the design constraint anyone deploying this kind of tool needs to take seriously.\nWhat to actually do with this Three concrete takeaways.\nFor legal-tech builders: LLM summaries of legal text are a real win on accessibility, but raw deployment is dangerous. The right pattern is LLM drafts, expert review — use the model for scale, use the human for accuracy. The cost saving is in reviewing a draft instead of writing one from scratch.\nFor AI researchers: evaluation metrics that reward fluency and coherence will miss this entire class of failure. We need evaluation methods that probe for confident incorrectness specifically. A summary that reads beautifully but tells you the wrong thing is a worse failure than one that\u0026rsquo;s clunky but right.\nFor everyone else: when you read an AI-summarized legal document — or any high-stakes document — calibrate. The confidence in the prose is not evidence of the truth of the prose. The fluency is the package, not the contents.\nThe bigger picture The accessibility of legal information is, ultimately, a question of democratic participation. When citizens can\u0026rsquo;t read the rulings that govern their lives, the principles of transparency and accountability erode. LLMs can help close that gap. They can also, if deployed without care, widen a different gap — the one between what people think they understand and what they actually do.\nThe technology is ready to assist. It\u0026rsquo;s not ready to be left alone.\nReference This post is a divulgative summary of:\nPinna, G., Manzoni, L., De Lorenzo, A., Castelli, M. (2024). From Courts to Comprehension: Can LLMs Make Judgments More Accessible?. 
In: Proceedings of the 23rd IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2024), December 2024.\nRead the original paper (PDF)\nResearch conducted at the University of Trieste and NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa.\n","permalink":"https://giovannipinna.net/posts/wiat2024-courts-to-comprehension/","summary":"\u003cdiv class=\"summary-box\"\u003e\n  \u003cdiv class=\"summary-box-header\"\u003e\n    \u003csvg xmlns=\"http://www.w3.org/2000/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\"\u003e\u003cline x1=\"4\" y1=\"6\" x2=\"20\" y2=\"6\"/\u003e\u003cline x1=\"4\" y1=\"12\" x2=\"14\" y2=\"12\"/\u003e\u003cline x1=\"4\" y1=\"18\" x2=\"18\" y2=\"18\"/\u003e\u003c/svg\u003e\n    \u003cspan\u003eTL;DR\u003c/span\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"summary-box-content\"\u003e\n    Italian Constitutional Court rulings are written for lawyers. Most citizens can\u0026rsquo;t follow them. Can an LLM fix that? We ran a 75-person human study comparing four versions of the same legal content: original judgments, expert \u0026ldquo;massime\u0026rdquo; summaries, GPT-4o summaries, and a fine-tuned LLaMA 2 7B. Comprehension rates: \u003cstrong\u003eexpert summaries 45%\u003c/strong\u003e, \u003cstrong\u003eGPT-4o 38%\u003c/strong\u003e, \u003cstrong\u003eraw judgments 33%\u003c/strong\u003e, \u003cstrong\u003efine-tuned LLaMA 30%\u003c/strong\u003e. GPT-4o really does make legal text more readable. It also produces a worrying pattern: \u003cstrong\u003econfident, fluent, wrong\u003c/strong\u003e — readers leave with strongly held but incorrect understandings. Use LLMs for legal summarization. 
Don\u0026rsquo;t use them without human review.\n  \u003c/div\u003e\n\u003c/div\u003e\n\n\u003ch2 id=\"a-democratic-problem-dressed-up-as-a-technical-one\"\u003eA democratic problem dressed up as a technical one\u003c/h2\u003e\n\u003cp\u003eConstitutional Court rulings are some of the most important documents a country produces. They define what your government can and can\u0026rsquo;t do. They shape what your rights are. They are also written by lawyers, for lawyers, in dense Italian legal prose that the average citizen has no chance of understanding.\u003c/p\u003e","title":"GPT-4 Can Make Court Rulings Easier to Read. It Can Also Lie to You About Them, Confidently."},{"content":" TL;DR LLMs write code that almost works. The usual fix is to ask them again — \u0026ldquo;self-correction\u0026rdquo; — but it tends to repeat the same mistakes. We took a different route: treat the buggy code as a seed and evolve it. Using Grammatical Evolution with a grammar built on-the-fly from the LLM\u0026rsquo;s own output, we improved code from GPT-4, ChatGPT, LLaMA-2, Alpaca-13B, and Alpaca-7B on 25 PSB2 problems — with statistically significant gains (p \u0026lt; 0.001) for every model. The smaller the model, the bigger the win. The trap of self-correction Ask any modern LLM to write a Python function and you\u0026rsquo;ll get something that looks right. Run the tests and you\u0026rsquo;ll often discover it isn\u0026rsquo;t.\nThe reflex is obvious: paste the failing tests back into the chat and ask the model to try again. It\u0026rsquo;s intuitive, it\u0026rsquo;s free, and it sometimes works. But it has a ceiling — and the ceiling is the model itself. A model can\u0026rsquo;t easily debug what it can\u0026rsquo;t see. 
Self-correction loops tend to recycle the same blind spots, the same off-by-one errors, the same missed edge cases.\nSo we asked a different question: what if the LLM is just the first draft?\nCode as evolvable material Genetic Improvement (GI) is a search-based technique that treats programs the way evolution treats genomes. Mutate, recombine, select, repeat. It\u0026rsquo;s been used to fix bugs and shave runtime out of legacy systems. We wondered if it could rescue LLM code.\nThe intuition is that LLM-generated code is rarely wildly wrong. It\u0026rsquo;s usually structurally fine — right data types, sensible algorithm — but with a subtle defect. That makes it a great seed: there are probably good neighbors in the search space, you just need a smart way to find them.\nThe catch: random mutations on source code almost always produce gibberish that won\u0026rsquo;t even parse. Our trick was to constrain mutations through a grammar that\u0026rsquo;s specialized to each program.\nThe pipeline, in three moves 1. Extract. Pull the actual Python out of the LLM\u0026rsquo;s verbose, markdown-laced reply. Easier said than done — models love to wrap code in prose.\n2. Specialize. Parse the code into an Abstract Syntax Tree, then automatically build a BNF grammar from what you see there. Only for loops and integers in the original? Then the grammar only allows for loops and integers. The mutation space stays small and meaningful.\n3. Evolve. Hand the seed and grammar to PonyGE2 and let Grammatical Evolution do its thing — population of 1,000, up to 100 generations, fitness = fraction of PSB2 test cases passed.\nThe dynamic grammar is the secret sauce. A universal grammar for \u0026ldquo;all valid Python\u0026rdquo; would explode the search space; a hand-tuned grammar would be brittle. 
By generating it from the LLM\u0026rsquo;s own output, we use the model\u0026rsquo;s draft as prior knowledge about where the right answer probably lives.\nWhat we found We ran this across 25 PSB2 benchmark problems and five LLMs spanning the capability spectrum: GPT-4, ChatGPT, LLaMA-2, Alpaca-13B, Alpaca-7B. Each experiment was repeated 30 times. Wilcoxon signed-rank tests for significance.\nThe summary is short:\nEvery model improved. Statistically significant (p \u0026lt; 0.001) across the board. Smaller models gained the most. Alpaca-7B problems went from \u0026ldquo;0 tests passing\u0026rdquo; to \u0026ldquo;actually working\u0026rdquo; in many cases. Even GPT-4 benefited. Smaller absolute gains, because the seed was already strong, but still real. GI beat self-correction. Same starting code, same test feedback — evolution found better fixes than asking the model to try again. That last point is the one I\u0026rsquo;d put on a billboard. Self-correction is bounded by the model\u0026rsquo;s own representation of the problem. Evolutionary search isn\u0026rsquo;t.\nWhy this works Three things, stacked:\nThe seed is good enough. LLMs give you the right shape of solution. You don\u0026rsquo;t need to invent the algorithm — you need to nudge it.\nThe grammar focuses the search. Tailoring mutations to what the program actually contains kills 99% of the search space — the part that was never going to help anyway.\nPopulations don\u0026rsquo;t get stuck. Greedy \u0026ldquo;fix this one bug\u0026rdquo; approaches dead-end in local optima. A population keeps multiple bets alive and can recombine partial wins through crossover.\nWhat it suggests If you ship LLM-generated code, this hints that the right pipeline isn\u0026rsquo;t prompt → output → ship. It\u0026rsquo;s prompt → output → optimize → ship. 
The optimization stage doesn\u0026rsquo;t need a bigger model or more tokens — it needs a search loop with a smart fitness function.\nAnd there\u0026rsquo;s a bigger picture here: neural and evolutionary methods aren\u0026rsquo;t competitors, they\u0026rsquo;re complements. LLMs are fast, fluent, and creative; evolution is patient, systematic, and goal-directed. Stack them, and you get something neither does alone.\nReference This post is a plain-language summary of:\nPinna, G., Ravalico, D., Rovito, L., Manzoni, L., De Lorenzo, A. (2024). Improving Large Language Models Code Generation by Leveraging Genetic Improvement. In: Proceedings of the 27th European Conference on Genetic Programming (EuroGP 2024), part of EvoStar 2024, Aberystwyth, UK, April 3–5.\nRead the original paper (PDF)\nResearch conducted at the University of Trieste and NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa.\n","permalink":"https://giovannipinna.net/posts/eurogp2024-gi-for-llm-code/","summary":"\u003cdiv class=\"summary-box\"\u003e\n  \u003cdiv class=\"summary-box-header\"\u003e\n    \u003csvg xmlns=\"http://www.w3.org/2000/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\"\u003e\u003cline x1=\"4\" y1=\"6\" x2=\"20\" y2=\"6\"/\u003e\u003cline x1=\"4\" y1=\"12\" x2=\"14\" y2=\"12\"/\u003e\u003cline x1=\"4\" y1=\"18\" x2=\"18\" y2=\"18\"/\u003e\u003c/svg\u003e\n    \u003cspan\u003eTL;DR\u003c/span\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"summary-box-content\"\u003e\n    LLMs write code that \u003cem\u003ealmost\u003c/em\u003e works. The usual fix is to ask them again — \u0026ldquo;self-correction\u0026rdquo; — but it tends to repeat the same mistakes. We took a different route: \u003cstrong\u003etreat the buggy code as a seed and evolve it\u003c/strong\u003e. 
Using Grammatical Evolution with a grammar built on-the-fly from the LLM\u0026rsquo;s own output, we improved code from GPT-4, ChatGPT, LLaMA-2, Alpaca-13B, and Alpaca-7B on 25 PSB2 problems — with statistically significant gains (p \u0026lt; 0.001) for \u003cem\u003eevery\u003c/em\u003e model. The smaller the model, the bigger the win.\n  \u003c/div\u003e\n\u003c/div\u003e\n\n\u003ch2 id=\"the-trap-of-self-correction\"\u003eThe trap of self-correction\u003c/h2\u003e\n\u003cp\u003eAsk any modern LLM to write a Python function and you\u0026rsquo;ll get something that looks right. Run the tests and you\u0026rsquo;ll often discover it isn\u0026rsquo;t.\u003c/p\u003e","title":"What If We Stopped Asking ChatGPT to Fix Its Own Code?"},{"content":"Influence — Marketing meets AI Influence is a project that combines marketing strategies with artificial intelligence to analyze user data and generate social media content tailored to audience interests.\nFeatures Text Mining — Extracts information from social media to identify audience-preferred topics and recurring keywords Sentiment Analysis — Evaluates public sentiment as positive, negative, or neutral Social Network Analysis — Predicts content appeal and estimates viral potential Background Developed after completing the Digital Transformation course at the University of Trieste (July 2020) and further refined during the Contamination LAB competition, where it reached the finals.\nTechnologies Natural Language Processing (NLP) Machine Learning Social Media APIs Data Visualization Read the full article\n","permalink":"https://giovannipinna.net/projects/influence/","summary":"\u003ch2 id=\"influence--marketing-meets-ai\"\u003eInfluence — Marketing meets AI\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eInfluence\u003c/strong\u003e is a project that combines marketing strategies with artificial intelligence to analyze user data and generate social media content tailored to audience interests.\u003c/p\u003e\n\u003cdiv 
style=\"position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;\"\u003e\n      \u003ciframe allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen\" loading=\"eager\" referrerpolicy=\"strict-origin-when-cross-origin\" src=\"https://www.youtube.com/embed/NXmEs4L-rRg?autoplay=0\u0026amp;controls=1\u0026amp;end=0\u0026amp;loop=0\u0026amp;mute=0\u0026amp;start=0\" style=\"position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;\" title=\"YouTube video\"\u003e\u003c/iframe\u003e\n    \u003c/div\u003e\n\n\u003ch3 id=\"features\"\u003eFeatures\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eText Mining\u003c/strong\u003e — Extracts information from social media to identify audience-preferred topics and recurring keywords\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSentiment Analysis\u003c/strong\u003e — Evaluates public sentiment as positive, negative, or neutral\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSocial Network Analysis\u003c/strong\u003e — Predicts content appeal and estimates viral potential\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch3 id=\"background\"\u003eBackground\u003c/h3\u003e\n\u003cp\u003eDeveloped after completing the Digital Transformation course at the University of Trieste (July 2020) and further refined during the Contamination LAB competition, where it reached the finals.\u003c/p\u003e","title":"Influence"},{"content":"Project Overview Influence is a project that represents a synthesis between the world of marketing and artificial intelligence. 
The core idea is to analyze user data to generate social media content that matches the interests and preferences of a specific audience.\nThe platform offers detailed audience analysis regarding public sentiment and popular themes, providing actionable insights for content creators and marketers.\nTechnical Implementation The platform is powered by three key machine learning algorithms:\n1. Text Mining Text Mining techniques are used to extract meaningful information from written sources. This allows us to identify the topics preferred by the target audience and detect recurring keywords in social media comments and interactions.\n2. Sentiment Analysis Sentiment Analysis is a Natural Language Processing (NLP) technique that evaluates text to determine whether the expressed opinion is positive, negative, or neutral. This provides immediate feedback on the emotional responses of the audience to different types of content.\n3. Social Network Analysis Social Network Analysis examines the structure and dynamics of social networks to predict how appealing content will be and estimate its viral potential.\nProject Genesis I developed this concept after completing a Digital Transformation course in July 2020, offered by the University of Trieste in collaboration with the Friuli Venezia Giulia region. The course focused on user experience and online positioning strategies.\nFollowing this, I participated in the Contamination LAB at the University of Trieste, where I created formal business models and plans for the project, eventually reaching the finals of the competition.\nKey Takeaways This project taught me the importance of bridging different disciplines — combining technical skills in AI and machine learning with practical knowledge of marketing and user behavior. 
The intersection of these fields creates powerful tools for understanding and engaging audiences.\n","permalink":"https://giovannipinna.net/posts/influence-project/","summary":"\u003ch2 id=\"project-overview\"\u003eProject Overview\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eInfluence\u003c/strong\u003e is a project that represents a synthesis between the world of marketing and artificial intelligence. The core idea is to analyze user data to generate social media content that matches the interests and preferences of a specific audience.\u003c/p\u003e\n\u003cp\u003eThe platform offers detailed audience analysis regarding public sentiment and popular themes, providing actionable insights for content creators and marketers.\u003c/p\u003e\n\u003ch2 id=\"technical-implementation\"\u003eTechnical Implementation\u003c/h2\u003e\n\u003cp\u003eThe platform is powered by three key machine learning algorithms:\u003c/p\u003e","title":"Influence: Where Marketing Meets Artificial Intelligence"},{"content":"Overview \u0026ldquo;Thinking, Fast and Slow\u0026rdquo; by Daniel Kahneman is a masterpiece that explores how our decision-making processes are influenced by external factors rather than pure rationality. Kahneman, a Nobel Prize-winning psychologist, presents decades of research in an accessible format.\nThe Two Systems The book introduces two fundamental systems of thought:\nSystem 1 — Fast, instinctive, and emotional. It operates automatically but is easily influenced by biases and heuristics. System 2 — Slow, deliberate, and logical. However, it tends to be lazy and often endorses System 1\u0026rsquo;s choices without proper verification. 
Key Concepts Cognitive Biases Kahneman identifies 14 major cognitive biases, including:\nAnchoring Effect — Our tendency to rely heavily on the first piece of information encountered Availability Heuristic — Judging probability based on how easily examples come to mind Framing Effect — How the presentation of information influences our decisions Narrative Fallacy — Our tendency to construct stories even when evidence is lacking Risk and Decision-Making The book explores loss aversion and financial decision-making patterns, demonstrating that humans feel losses more intensely than equivalent gains.\nThe Two Selves Kahneman distinguishes between:\nThe experiencing self — how we feel in the moment The remembering self — how we recall past experiences My Takeaway This book fundamentally changed how I understand decision-making. We are not as rational as we like to believe, but understanding these patterns helps us make better choices — whether in personal life, professional contexts, or marketing strategy.\nA must-read for anyone interested in psychology, economics, or understanding human behavior.\n","permalink":"https://giovannipinna.net/posts/pensieri-lenti-e-veloci/","summary":"\u003ch2 id=\"overview\"\u003eOverview\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026ldquo;Thinking, Fast and Slow\u0026rdquo;\u003c/strong\u003e by Daniel Kahneman is a masterpiece that explores how our decision-making processes are influenced by external factors rather than pure rationality. Kahneman, a Nobel Prize-winning psychologist, presents decades of research in an accessible format.\u003c/p\u003e\n\u003ch2 id=\"the-two-systems\"\u003eThe Two Systems\u003c/h2\u003e\n\u003cp\u003eThe book introduces two fundamental systems of thought:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eSystem 1\u003c/strong\u003e — Fast, instinctive, and emotional. 
It operates automatically but is easily influenced by biases and heuristics.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSystem 2\u003c/strong\u003e — Slow, deliberate, and logical. However, it tends to be lazy and often endorses System 1\u0026rsquo;s choices without proper verification.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"key-concepts\"\u003eKey Concepts\u003c/h2\u003e\n\u003ch3 id=\"cognitive-biases\"\u003eCognitive Biases\u003c/h3\u003e\n\u003cp\u003eKahneman identifies 14 major cognitive biases, including:\u003c/p\u003e","title":"Book Review: Thinking, Fast and Slow by Daniel Kahneman"},{"content":"Overview \u0026ldquo;Don\u0026rsquo;t Make Me Think\u0026rdquo; by Steve Krug is a definitive guide to web usability. The book provides practical, common-sense advice for designing websites that are intuitive and easy to use.\nCore Principles Krug defines usability as: a person with average capacity and experience should understand how to use something to accomplish a goal without encountering more problems than it\u0026rsquo;s worth.\nKey Rules Don\u0026rsquo;t make users think — Navigation and actions should be self-evident Make important content visible — Users scan pages, they don\u0026rsquo;t read them word by word Use conventions and visual hierarchy — Leverage what users already know Simplify choices — Reduce cognitive load by limiting options Reduce word count — Eliminate half the words on every page, then eliminate half of what remains Chapter Highlights Site Navigation Good navigation answers three questions: Where am I? Where can I go? How do I get there?\nMobile-First Design Krug emphasizes the importance of designing for mobile devices first, then scaling up for larger screens.\nUsability Testing The book advocates for regular, informal usability testing — even testing with a small number of users reveals major issues.\nBuilding Trust Users need to trust your site before they\u0026rsquo;ll engage with it. 
Trust comes from professional design, transparency, and reliability.\nAccessibility Designing for accessibility benefits all users, not just those with disabilities. Good usability and good accessibility go hand in hand.\nMy Takeaway This book should be required reading for anyone building websites or digital products. The principle of clarity over consistency resonates deeply with how I approach software design. Every decision should reduce friction for the user.\n","permalink":"https://giovannipinna.net/posts/dont-make-me-think/","summary":"\u003ch2 id=\"overview\"\u003eOverview\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026ldquo;Don\u0026rsquo;t Make Me Think\u0026rdquo;\u003c/strong\u003e by Steve Krug is a definitive guide to web usability. The book provides practical, common-sense advice for designing websites that are intuitive and easy to use.\u003c/p\u003e\n\u003ch2 id=\"core-principles\"\u003eCore Principles\u003c/h2\u003e\n\u003cp\u003eKrug defines usability as: a person with average capacity and experience should understand how to use something to accomplish a goal without encountering more problems than it\u0026rsquo;s worth.\u003c/p\u003e\n\u003ch3 id=\"key-rules\"\u003eKey Rules\u003c/h3\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eDon\u0026rsquo;t make users think\u003c/strong\u003e — Navigation and actions should be self-evident\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMake important content visible\u003c/strong\u003e — Users scan pages, they don\u0026rsquo;t read them word by word\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eUse conventions and visual hierarchy\u003c/strong\u003e — Leverage what users already know\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSimplify choices\u003c/strong\u003e — Reduce cognitive load by limiting options\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eReduce word count\u003c/strong\u003e — Eliminate half the words on every page, then eliminate half of what 
remains\u003c/li\u003e\n\u003c/ol\u003e\n\u003ch2 id=\"chapter-highlights\"\u003eChapter Highlights\u003c/h2\u003e\n\u003ch3 id=\"site-navigation\"\u003eSite Navigation\u003c/h3\u003e\n\u003cp\u003eGood navigation answers three questions: Where am I? Where can I go? How do I get there?\u003c/p\u003e","title":"Book Review: Don't Make Me Think by Steve Krug"},{"content":" Hi, I'm Giovanni 👋 AI Researcher \u0026middot; ML Engineer \u0026middot; Ph.D.\nI'm an AI Researcher and Engineer based in Trieste, Italy, with a Ph.D. in Applied Data Science \u0026 Artificial Intelligence. My work sits at the intersection of NLP, Large Language Models, AI Agents, and Evolutionary Computation \u0026mdash; building systems that improve and evaluate AI-generated code.\nGet in Touch View My CV 🎯 My Story My journey in computer science started in high school, where I fell in love with programming and electronics. That curiosity led me to a B.Sc. in Electronic and Computer Engineering (2015\u0026ndash;2019) at the University of Trieste, where my first AI courses sparked a passion that shaped everything that followed \u0026mdash; continuing with an M.Sc. in Computer Engineering (2019\u0026ndash;2022) focused on Data Science and AI. A semester at Montanuniversit\u0026auml;t Leoben through Erasmus+ pushed me out of my comfort zone: applying AI to energy and logistics taught me how transferable these methods really are.\nFor my Master's thesis I turned to computer vision applied to the humanities, working alongside art historians and digital humanities experts. It was my first real contact with academic research, and I loved every part of it \u0026mdash; engaging with the state of the art, designing evaluations, and digging into how modern models actually work. The thesis was later published in IEEE Access (2023) and convinced me to stay on this road.\nI continued with an Industrial Ph.D. (2023\u0026ndash;2025), co-funded by PLUS S.r.l. within Area Science Park. 
I chose NLP \u0026mdash; another field I had always loved \u0026mdash; but my engineering mindset pulled me toward applied research useful for business. My thesis, Application of Large Language Models: Addressing Real-World Challenges, tackles three concrete questions: how to improve LLM outputs, how to evaluate them properly (with new metrics for Text-to-SQL), and how to make LLMs more sustainable through Green AI. Along the way I spent time as a visiting researcher at NOVA IMS in Lisbon and at UCL in London, attended 5 international summer and winter schools on NLP and ML, and published across journals, conferences and workshops.\nI am a proactive person with a deep desire to learn and push beyond my comfort zone. I believe that great engineering and great research share the same foundation: understanding a problem deeply, building solutions methodically, and measuring impact rigorously.\n💬 NLP \u0026 LLMs Extensive experience with Large Language Models, text classification, sentiment analysis, and building NLP pipelines for real-world applications from legal text to historical newspapers.\n🤖 AI Agents Developed AI agents for SQL code generation and analyzed the performance of AI agents in creating pull requests.\n🗄️ Text-to-SQL Proposed new evaluation metrics integrating semantic and structural similarity for Text-to-SQL systems, published in Scientific Reports (Nature/Springer).\n🔍 RAG Systems Designed and deployed a production RAG system using LangChain and LlamaIndex that reduced client call center volume by 30%, demonstrating real business impact.\n🧬 Genetic Improvement Developed novel pipelines using Grammatical Evolution to automatically correct and improve SQL code generated by LLMs, published at EuroGP and SN Computer Science.\n🔬 Research Interests My research is driven by a core question: how can we build AI systems that bridge the gap between human intent and structured information? 
Key areas include:\nNLP \u0026 Large Language Models \u0026mdash; Improving and evaluating LLM outputs for structured tasks like code generation Text-to-SQL \u0026mdash; Developing systems and metrics for natural language database querying Genetic Improvement of Code \u0026mdash; Using evolutionary computation as a post-processing layer for LLM-generated code AI Coding Agents \u0026mdash; Empirical analysis and optimization of AI agents for software development RAG \u0026 Retrieval Systems \u0026mdash; Building intelligent agents that interact with complex data environments 🌍 International Experience I believe the best research happens at the intersection of diverse perspectives:\n🇬🇧 Visiting Researcher at UCL, London 🇵🇹 Two visiting stays at NOVA IMS, Lisbon 🇦🇹 Erasmus semester at Montanuniversit\u0026auml;t Leoben 🎓 5 international summer/winter schools (Oxford, Lisbon, Athens, French Alps, Gran Canaria) Let's Connect! I'm always open to new collaborations, research opportunities, and interesting conversations. Whether you want to discuss a project, share ideas, or just say hi \u0026mdash; feel free to reach out.\n📧 Email Me ","permalink":"https://giovannipinna.net/about/","summary":"\u003cdiv class=\"about-hero\"\u003e\n  \u003cdiv class=\"about-hero-img\"\u003e\n    \u003cimg src=\"/images/profile.png\" alt=\"Giovanni Pinna\"\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"about-hero-text\"\u003e\n    \u003ch1\u003eHi, I'm Giovanni \u003cspan class=\"wave\"\u003e👋\u003c/span\u003e\u003c/h1\u003e\n    \u003cp class=\"about-tagline\"\u003eAI Researcher \u0026middot; ML Engineer \u0026middot; Ph.D.\u003c/p\u003e\n    \u003cp\u003eI'm an AI Researcher and Engineer based in \u003cstrong\u003eTrieste, Italy\u003c/strong\u003e, with a Ph.D. in Applied Data Science \u0026 Artificial Intelligence. 
My work sits at the intersection of \u003cstrong\u003eNLP\u003c/strong\u003e, \u003cstrong\u003eLarge Language Models\u003c/strong\u003e, \u003cstrong\u003eAI Agents\u003c/strong\u003e, and \u003cstrong\u003eEvolutionary Computation\u003c/strong\u003e \u0026mdash; building systems that improve and evaluate AI-generated code.\u003c/p\u003e\n    \u003cdiv class=\"about-cta\"\u003e\n      \u003ca href=\"/contact/\" class=\"btn-primary\"\u003eGet in Touch\u003c/a\u003e\n      \u003ca href=\"/cv/\" class=\"btn-secondary\"\u003eView My CV\u003c/a\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv class=\"about-section\"\u003e\n  \u003ch2\u003e🎯 My Story\u003c/h2\u003e\n  \u003cp\u003eMy journey in computer science started in high school, where I fell in love with programming and electronics. That curiosity led me to a \u003cstrong\u003eB.Sc. in Electronic and Computer Engineering\u003c/strong\u003e (2015\u0026ndash;2019) at the University of Trieste, where my first AI courses sparked a passion that shaped everything that followed \u0026mdash; continuing with an \u003cstrong\u003eM.Sc. in Computer Engineering\u003c/strong\u003e (2019\u0026ndash;2022) focused on Data Science and AI. A semester at \u003cstrong\u003eMontanuniversit\u0026auml;t Leoben\u003c/strong\u003e through Erasmus+ pushed me out of my comfort zone: applying AI to energy and logistics taught me how transferable these methods really are.\u003c/p\u003e","title":"About Me"},{"content":"Teaching \u0026amp; Mentoring Teaching Assistant — Database Systems, University of Trieste, Mar–Dec 2024. Designed lab exercises for SQL, relational algebra, ER modeling. Reviewed all student projects. Department Tutor — Data Science \u0026amp; AI, University of Trieste, Jan–Dec 2024. Restructured departmental websites for 200+ students. Department Tutor — Computer Engineering, University of Trieste, Aug 2020 – Feb 2021. Managed internship administration. 
Thesis Supervision — Mentored 4 master\u0026rsquo;s students and 1 bachelor\u0026rsquo;s student on NLP, LLM code generation, Genetic Improvement, and Text-to-SQL evaluation. Summer \u0026amp; Winter Schools AthNLP — Athens Natural Language Processing Summer School, Athens, Greece, Sep 2024. Selected from 200+ applicants. LxMLS — Lisbon Machine Learning School, Lisbon, Portugal, Jul 2024. One of the premier ML summer schools in Europe. ALPS — Advanced Language Processing Winter School, French Alps, France, Apr 2024. Highly selective (~50 attendees). DeepLearn — International Summer School on Deep Learning, Gran Canaria, Spain, Jul 2023. OxML — Oxford Machine Learning Summer School (NLP track), Oxford, UK, Jul 2023. Volunteering EACL 2024 — European Chapter of the Association for Computational Linguistics, St Julian\u0026rsquo;s, Malta, Mar 2024. Volunteer at one of the top NLP conferences in Europe. IEEE CCTA 2022 — Conference on Control Technology and Applications, Trieste, Italy, Aug 2022. Volunteer supporting conference operations. Professional Memberships AILC — Associazione Italiana di Linguistica Computazionale (Italian Association of Computational Linguistics). Member since 2023. AI2S — Trieste-based association promoting artificial intelligence. Member since 2020. Mentors4U — Europe\u0026rsquo;s largest non-profit mentoring association. Mentee since Jul 2022. Certifications McKinsey Forward Program — 10-week digital learning program by McKinsey.org covering adaptability, structured problem-solving, communication, and AI essentials. ","permalink":"https://giovannipinna.net/activities/","summary":"\u003ch2 id=\"teaching--mentoring\"\u003eTeaching \u0026amp; Mentoring\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eTeaching Assistant\u003c/strong\u003e — Database Systems, University of Trieste, Mar–Dec 2024. Designed lab exercises for SQL, relational algebra, ER modeling. 
Reviewed all student projects.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eDepartment Tutor\u003c/strong\u003e — Data Science \u0026amp; AI, University of Trieste, Jan–Dec 2024. Restructured departmental websites for 200+ students.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eDepartment Tutor\u003c/strong\u003e — Computer Engineering, University of Trieste, Aug 2020 – Feb 2021. Managed internship administration.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eThesis Supervision\u003c/strong\u003e — Mentored 4 master\u0026rsquo;s students and 1 bachelor\u0026rsquo;s student on NLP, LLM code generation, Genetic Improvement, and Text-to-SQL evaluation.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"summer--winter-schools\"\u003eSummer \u0026amp; Winter Schools\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eAthNLP\u003c/strong\u003e — Athens Natural Language Processing Summer School, Athens, Greece, Sep 2024. Selected from 200+ applicants.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eLxMLS\u003c/strong\u003e — Lisbon Machine Learning School, Lisbon, Portugal, Jul 2024. One of the premier ML summer schools in Europe.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eALPS\u003c/strong\u003e — Advanced Language Processing Winter School, French Alps, France, Apr 2024. Highly selective (~50 attendees).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eDeepLearn\u003c/strong\u003e — International Summer School on Deep Learning, Gran Canaria, Spain, Jul 2023.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eOxML\u003c/strong\u003e — Oxford Machine Learning Summer School (NLP track), Oxford, UK, Jul 2023.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"volunteering\"\u003eVolunteering\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eEACL 2024\u003c/strong\u003e — European Chapter of the Association for Computational Linguistics, St Julian\u0026rsquo;s, Malta, Mar 2024. 
Volunteer at one of the top NLP conferences in Europe.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eIEEE CCTA 2022\u003c/strong\u003e — Conference on Control Technology and Applications, Trieste, Italy, Aug 2022. Volunteer supporting conference operations.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"professional-memberships\"\u003eProfessional Memberships\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eAILC\u003c/strong\u003e — Associazione Italiana di Linguistica Computazionale (Italian Association of Computational Linguistics). Member since 2023.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAI2S\u003c/strong\u003e — Trieste-based association promoting artificial intelligence. Member since 2020.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMentors4U\u003c/strong\u003e — Europe\u0026rsquo;s largest non-profit mentoring association. Mentee since Jul 2022.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"certifications\"\u003eCertifications\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eMcKinsey Forward Program\u003c/strong\u003e — 10-week digital learning program by McKinsey.org covering adaptability, structured problem-solving, communication, and AI essentials.\u003c/li\u003e\n\u003c/ul\u003e","title":"Activities"},{"content":"Let\u0026rsquo;s Stay in Touch I\u0026rsquo;m always open to new opportunities, collaborations, and conversations. Feel free to reach out!\nDirect Contact 📧 giovanni.pinna.17l96@gmail.com 📍 Trieste, Italy Find Me Online GitHub LinkedIn Google Scholar YouTube Instagram 📍 Based in Trieste, Italy ","permalink":"https://giovannipinna.net/contact/","summary":"\u003ch2 id=\"lets-stay-in-touch\"\u003eLet\u0026rsquo;s Stay in Touch\u003c/h2\u003e\n\u003cp\u003eI\u0026rsquo;m always open to new opportunities, collaborations, and conversations. 
Feel free to reach out!\u003c/p\u003e\n\u003cdiv class=\"contact-grid\"\u003e\n\u003cdiv class=\"contact-card\"\u003e\n  \u003ch3\u003eDirect Contact\u003c/h3\u003e\n  \u003cdiv class=\"contact-links\"\u003e\n    \u003ca href=\"mailto:giovanni.pinna.17l96@gmail.com\"\u003e\u003cspan class=\"contact-icon\"\u003e📧\u003c/span\u003e giovanni.pinna.17l96@gmail.com\u003c/a\u003e\n    \u003cspan class=\"contact-location\"\u003e\u003cspan class=\"contact-icon\"\u003e📍\u003c/span\u003e Trieste, Italy\u003c/span\u003e\n  \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv class=\"contact-card\"\u003e\n  \u003ch3\u003eFind Me Online\u003c/h3\u003e\n  \u003cdiv class=\"contact-social\"\u003e\n    \u003ca href=\"https://github.com/giovannipinna96\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"social-link\"\u003e\n      \u003csvg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"0 0 496 512\"\u003e\u003cpath fill=\"currentColor\" d=\"M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8z\"/\u003e\u003c/svg\u003e\n      \u003cspan\u003eGitHub\u003c/span\u003e\n    \u003c/a\u003e\n    \u003ca 
href=\"https://www.linkedin.com/in/giovanni-pinna/\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"social-link\"\u003e\n      \u003csvg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"0 0 448 512\"\u003e\u003cpath fill=\"currentColor\" d=\"M416 32H31.9C14.3 32 0 46.5 0 64.3v383.4C0 465.5 14.3 480 31.9 480H416c17.6 0 32-14.5 32-32.3V64.3c0-17.8-14.4-32.3-32-32.3zM135.4 416H69V202.2h66.5V416zm-33.2-243c-21.3 0-38.5-17.3-38.5-38.5S80.9 96 102.2 96c21.2 0 38.5 17.3 38.5 38.5 0 21.3-17.2 38.5-38.5 38.5zm282.1 243h-66.4V312c0-24.8-.5-56.7-34.5-56.7-34.6 0-39.9 27-39.9 54.9V416h-66.4V202.2h63.7v29.2h.9c8.9-16.8 30.6-34.5 62.9-34.5 67.2 0 79.7 44.3 79.7 101.9V416z\"/\u003e\u003c/svg\u003e\n      \u003cspan\u003eLinkedIn\u003c/span\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://scholar.google.com/citations?user=bf2RiOAAAAAJ\u0026hl=en\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"social-link\"\u003e\n      \u003csvg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"0 0 24 25\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\"\u003e\u003cpath d=\"M5.242 13.769L0.5 9.5 12 1l11.5 9-5.242 3.769C17.548 11.249 14.978 9.5 12 9.5c-2.977 0-5.548 1.748-6.758 4.269zM12 10a7 7 0 1 0 0 14 7 7 0 0 0 0-14z\"/\u003e\u003c/svg\u003e\n      \u003cspan\u003eGoogle Scholar\u003c/span\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://www.youtube.com/channel/UCiVRlONi_xbN0Qro6C22y-A\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"social-link\"\u003e\n      \u003csvg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"0 0 576 512\"\u003e\u003cpath fill=\"currentColor\" d=\"M549.655 124.083c-6.281-23.65-24.787-42.276-48.284-48.597C458.781 64 288 64 288 64S117.22 64 74.629 75.486c-23.497 6.322-42.003 24.947-48.284 48.597-11.412 42.867-11.412 132.305-11.412 132.305s0 89.438 11.412 132.305c6.281 23.65 24.787 41.5 48.284 47.821C117.22 448 288 448 288 448s170.78 0 213.371-11.486c23.497-6.321 42.003-24.171 
48.284-47.821 11.412-42.867 11.412-132.305 11.412-132.305s0-89.438-11.412-132.305zm-317.51 213.508V175.185l142.739 81.205-142.739 81.201z\"/\u003e\u003c/svg\u003e\n      \u003cspan\u003eYouTube\u003c/span\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://www.instagram.com/pinna.giova/\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"social-link\"\u003e\n      \u003csvg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"0 0 448 512\"\u003e\u003cpath fill=\"currentColor\" d=\"M224.1 141c-63.6 0-114.9 51.3-114.9 114.9s51.3 114.9 114.9 114.9S339 319.5 339 255.9 287.7 141 224.1 141zm0 189.6c-41.1 0-74.7-33.5-74.7-74.7s33.5-74.7 74.7-74.7 74.7 33.5 74.7 74.7-33.6 74.7-74.7 74.7zm146.4-194.3c0 14.9-12 26.8-26.8 26.8-14.9 0-26.8-12-26.8-26.8s12-26.8 26.8-26.8 26.8 12 26.8 26.8zm76.1 27.2c-1.7-35.9-9.9-67.7-36.2-93.9-26.2-26.2-58-34.4-93.9-36.2-37-2.1-147.9-2.1-184.9 0-35.8 1.7-67.6 9.9-93.9 36.1s-34.4 58-36.2 93.9c-2.1 37-2.1 147.9 0 184.9 1.7 35.9 9.9 67.7 36.2 93.9s58 34.4 93.9 36.2c37 2.1 147.9 2.1 184.9 0 35.9-1.7 67.7-9.9 93.9-36.2 26.2-26.2 34.4-58 36.2-93.9 2.1-37 2.1-147.8 0-184.8zM398.8 388c-7.8 19.6-22.9 34.7-42.6 42.6-29.5 11.7-99.5 9-132.1 9s-102.7 2.6-132.1-9c-19.6-7.8-34.7-22.9-42.6-42.6-11.7-29.5-9-99.5-9-132.1s-2.6-102.7 9-132.1c7.8-19.6 22.9-34.7 42.6-42.6 29.5-11.7 99.5-9 132.1-9s102.7-2.6 132.1 9c19.6 7.8 34.7 22.9 42.6 42.6 11.7 29.5 9 99.5 9 132.1s2.7 102.7-9 132.1z\"/\u003e\u003c/svg\u003e\n      \u003cspan\u003eInstagram\u003c/span\u003e\n    \u003c/a\u003e\n  \u003c/div\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv class=\"contact-map\"\u003e\n  \u003ch3\u003e📍 Based in Trieste, Italy\u003c/h3\u003e\n  \u003cdiv class=\"map-container\"\u003e\n    \u003ciframe src=\"https://www.google.com/maps/embed?pb=!1m18!1m12!1m3!1d44537.89874413384!2d13.737732!3d45.6495264!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x477b6b06e4edf533%3A0x666a2484d4dd2b50!2sTrieste%2C%20Province%20of%20Trieste%2C%20Italy!5e0!3m2!1sen!2sus!4v1710000000000\" 
allowfullscreen=\"\" loading=\"lazy\" referrerpolicy=\"no-referrer-when-downgrade\"\u003e\u003c/iframe\u003e\n  \u003c/div\u003e\n\u003c/div\u003e","title":"Contact"},{"content":" Download a copy of my CV:\n📄 Download Full CV (PDF) 📋 Download Short CV (PDF) Last updated: April 2026\n💼 Experience Visiting Researcher University College London (UCL) Sep 2025 – Dec 2025 · London, UK Research on AI coding agents, green AI, and multi-objective optimization. Co-developed GA4GC framework (135× hypervolume improvement). Led empirical study analyzing 7,156 AI-authored pull requests. Visiting Researcher NOVA IMS, Universidade NOVA de Lisboa May–Aug 2025 · Lisbon, Portugal Research visit where I developed an AI agent for Text-to-SQL tasks using only small open-source models, enabling cost-effective natural language database querying without proprietary APIs. Teaching Assistant — Database Systems University of Trieste Mar 2024 – Dec 2024 · Trieste, Italy Guided students through database system development. Designed lab exercises for SQL, relational algebra, ER modeling. Reviewed and evaluated all student projects. Visiting Researcher NOVA IMS, Universidade NOVA de Lisboa May–Aug 2024 · Lisbon, Portugal Research visit where I created a new Text-to-SQL evaluation metric (published in Scientific Reports / Nature). Applied AI Scientist PLUS S.r.l., Area Science Park Jan 2023 – Dec 2025 · Trieste, Italy Designed and deployed a production RAG system (reducing call center volume by 30%). Developed Genetic Improvement pipelines for LLM code correction. Proposed novel Text-to-SQL evaluation metrics. Mentored 5 thesis students. Department Tutor — Data Science \u0026amp; AI University of Trieste Jan 2024 – Dec 2024 · Trieste, Italy Restructured departmental websites. Primary contact for 200+ students for academic guidance. 🎓 Education Ph.D. 
in Applied Data Science \u0026amp; Artificial Intelligence University of Trieste Jan 2023 – Dec 2025 · Trieste, Italy Thesis: \"Application of Large Language Models: Addressing Real-World Challenges\". Industrial Ph.D. co-funded by PLUS S.r.l. (Area Science Park). Supervisors: Prof. Luca Manzoni, Prof. Andrea De Lorenzo. International visits: UCL London, NOVA IMS Lisbon. M.Sc. in Computer Engineering University of Trieste Sep 2019 – Oct 2022 · Grade: 108/110 Thesis: \"An Automatic Tool for the Recognition of Punches in Late-Medieval Panel Paintings\" (published in IEEE Access). Erasmus exchange at Montanuniversität Leoben, Austria. B.Sc. in Electronic and Computer Engineering University of Trieste Sep 2015 – Mar 2019 · Trieste, Italy 🛠️ Technical Skills Programming Languages Python (5+ yrs)SQLJavaC++ ML \u0026amp; NLP Frameworks PyTorchHuggingFacescikit-learnspaCyNLTKBERTopic LLM \u0026amp; Agent Frameworks LangChainLlamaIndexLangGraphRAG PipelinesPrompt Engineering Tools \u0026amp; Infrastructure Git/GitHubDockerLinuxLaTeXStreamlitGradio 🌍 Languages Italian Native English C1 (Advanced) ","permalink":"https://giovannipinna.net/cv/","summary":"\u003cdiv class=\"cv-download-section\"\u003e\n  \u003cp\u003eDownload a copy of my CV:\u003c/p\u003e\n  \u003cdiv class=\"cv-buttons\"\u003e\n    \u003ca href=\"/files/Giovanni_Pinna_CV_long.pdf\" class=\"btn-primary\" download\u003e📄 Download Full CV (PDF)\u003c/a\u003e\n    \u003ca href=\"/files/Giovanni_Pinna_CV_short.pdf\" class=\"btn-secondary\" download\u003e📋 Download Short CV (PDF)\u003c/a\u003e\n  \u003c/div\u003e\n  \u003cp class=\"cv-updated\"\u003e\u003cem\u003eLast updated: April 2026\u003c/em\u003e\u003c/p\u003e\n\u003c/div\u003e\n\u003chr\u003e\n\u003ch2 id=\"-experience\"\u003e💼 Experience\u003c/h2\u003e\n\u003cdiv class=\"cv-timeline\"\u003e\n  \u003cdiv class=\"cv-entry\"\u003e\n    \u003cdiv class=\"cv-logo\"\u003e\u003cimg src=\"/images/orgs/ucl2.jpg\" alt=\"UCL\"\u003e\u003c/div\u003e\n    \u003cdiv 
class=\"cv-dot\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-card\"\u003e\n      \u003cdiv class=\"cv-title\"\u003eVisiting Researcher\u003c/div\u003e\n      \u003cdiv class=\"cv-org\"\u003e\u003ca href=\"https://www.ucl.ac.uk/\"\u003eUniversity College London (UCL)\u003c/a\u003e\u003c/div\u003e\n      \u003cdiv class=\"cv-meta\"\u003eSep 2025 – Dec 2025 · London, UK\u003c/div\u003e\n      \u003cdiv class=\"cv-desc\"\u003eResearch on AI coding agents, green AI, and multi-objective optimization. Co-developed GA4GC framework (135× hypervolume improvement). Led empirical study analyzing 7,156 AI-authored pull requests.\u003c/div\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"cv-entry\"\u003e\n    \u003cdiv class=\"cv-logo\"\u003e\u003cimg src=\"/images/orgs/nova.png\" alt=\"NOVA IMS\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-dot\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-card\"\u003e\n      \u003cdiv class=\"cv-title\"\u003eVisiting Researcher\u003c/div\u003e\n      \u003cdiv class=\"cv-org\"\u003e\u003ca href=\"https://www.novaims.unl.pt/\"\u003eNOVA IMS, Universidade NOVA de Lisboa\u003c/a\u003e\u003c/div\u003e\n      \u003cdiv class=\"cv-meta\"\u003eMay–Aug 2025 · Lisbon, Portugal\u003c/div\u003e\n      \u003cdiv class=\"cv-desc\"\u003eResearch visit where I developed an AI Agent for Text-to-SQL tasks using only small open-source models, enabling cost-effective natural language database querying without proprietary APIs.\u003c/div\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"cv-entry\"\u003e\n    \u003cdiv class=\"cv-logo\"\u003e\u003cimg src=\"/images/orgs/units.jpg\" alt=\"UniTS\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-dot\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-card\"\u003e\n      \u003cdiv class=\"cv-title\"\u003eTeaching Assistant — Database Systems\u003c/div\u003e\n      \u003cdiv class=\"cv-org\"\u003e\u003ca href=\"https://www.units.it/\"\u003eUniversity of 
Trieste\u003c/a\u003e\u003c/div\u003e\n      \u003cdiv class=\"cv-meta\"\u003eMar 2024 – Dec 2024 · Trieste, Italy\u003c/div\u003e\n      \u003cdiv class=\"cv-desc\"\u003eGuided students through database system development. Designed lab exercises for SQL, relational algebra, ER modeling. Reviewed and evaluated all student projects.\u003c/div\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"cv-entry\"\u003e\n    \u003cdiv class=\"cv-logo\"\u003e\u003cimg src=\"/images/orgs/nova.png\" alt=\"NOVA IMS\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-dot\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-card\"\u003e\n      \u003cdiv class=\"cv-title\"\u003eVisiting Researcher\u003c/div\u003e\n      \u003cdiv class=\"cv-org\"\u003e\u003ca href=\"https://www.novaims.unl.pt/\"\u003eNOVA IMS, Universidade NOVA de Lisboa\u003c/a\u003e\u003c/div\u003e\n      \u003cdiv class=\"cv-meta\"\u003eMay–Aug 2024 · Lisbon, Portugal\u003c/div\u003e\n      \u003cdiv class=\"cv-desc\"\u003eResearch visit where I created a new Text-to-SQL evaluation metric (published in Scientific Reports / Nature).\u003c/div\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n    \u003cdiv class=\"cv-entry\"\u003e\n    \u003cdiv class=\"cv-logo\"\u003e\u003cimg src=\"/images/orgs/plus.jpg\" alt=\"PLUS S.r.l.\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-dot\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-card\"\u003e\n      \u003cdiv class=\"cv-title\"\u003eApplied AI Scientist\u003c/div\u003e\n      \u003cdiv class=\"cv-org\"\u003e\u003ca href=\"#\"\u003ePLUS S.r.l., Area Science Park\u003c/a\u003e\u003c/div\u003e\n      \u003cdiv class=\"cv-meta\"\u003eJan 2023 – Dec 2025 · Trieste, Italy\u003c/div\u003e\n      \u003cdiv class=\"cv-desc\"\u003eDesigned and deployed a production RAG system (reducing call center volume by 30%). Developed Genetic Improvement pipelines for LLM code correction. Proposed novel Text-to-SQL evaluation metrics. 
Mentored 5 thesis students.\u003c/div\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"cv-entry\"\u003e\n    \u003cdiv class=\"cv-logo\"\u003e\u003cimg src=\"/images/orgs/units.jpg\" alt=\"UniTS\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-dot\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-card\"\u003e\n      \u003cdiv class=\"cv-title\"\u003eDepartment Tutor — Data Science \u0026amp; AI\u003c/div\u003e\n      \u003cdiv class=\"cv-org\"\u003e\u003ca href=\"https://www.units.it/\"\u003eUniversity of Trieste\u003c/a\u003e\u003c/div\u003e\n      \u003cdiv class=\"cv-meta\"\u003eJan 2024 – Dec 2024 · Trieste, Italy\u003c/div\u003e\n      \u003cdiv class=\"cv-desc\"\u003eRestructured departmental websites. Primary contact for 200+ students for academic guidance.\u003c/div\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n\u003c/div\u003e\n\u003ch2 id=\"-education\"\u003e🎓 Education\u003c/h2\u003e\n\u003cdiv class=\"cv-timeline\"\u003e\n  \u003cdiv class=\"cv-entry\"\u003e\n    \u003cdiv class=\"cv-logo\"\u003e\u003cimg src=\"/images/orgs/units.jpg\" alt=\"UniTS\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-dot\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-card\"\u003e\n      \u003cdiv class=\"cv-title\"\u003ePh.D. in Applied Data Science \u0026amp; Artificial Intelligence\u003c/div\u003e\n      \u003cdiv class=\"cv-org\"\u003e\u003ca href=\"https://www.units.it/\"\u003eUniversity of Trieste\u003c/a\u003e\u003c/div\u003e\n      \u003cdiv class=\"cv-meta\"\u003eJan 2023 – Dec 2025 · Trieste, Italy\u003c/div\u003e\n      \u003cdiv class=\"cv-desc\"\u003eThesis: \"Application of Large Language Models: Addressing Real-World Challenges\". Industrial Ph.D. co-funded by PLUS S.r.l. (Area Science Park). Supervisors: Prof. Luca Manzoni, Prof. Andrea De Lorenzo. 
International visits: UCL London, NOVA IMS Lisbon.\u003c/div\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"cv-entry\"\u003e\n    \u003cdiv class=\"cv-logo\"\u003e\u003cimg src=\"/images/orgs/units.jpg\" alt=\"UniTS\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-dot\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-card\"\u003e\n      \u003cdiv class=\"cv-title\"\u003eM.Sc. in Computer Engineering\u003c/div\u003e\n      \u003cdiv class=\"cv-org\"\u003e\u003ca href=\"https://www.units.it/\"\u003eUniversity of Trieste\u003c/a\u003e\u003c/div\u003e\n      \u003cdiv class=\"cv-meta\"\u003eSep 2019 – Oct 2022 · Grade: 108/110\u003c/div\u003e\n      \u003cdiv class=\"cv-desc\"\u003eThesis: \"An Automatic Tool for the Recognition of Punches in Late-Medieval Panel Paintings\" (published in IEEE Access). Erasmus exchange at Montanuniversität Leoben, Austria.\u003c/div\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"cv-entry\"\u003e\n    \u003cdiv class=\"cv-logo\"\u003e\u003cimg src=\"/images/orgs/units.jpg\" alt=\"UniTS\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-dot\"\u003e\u003c/div\u003e\n    \u003cdiv class=\"cv-card\"\u003e\n      \u003cdiv class=\"cv-title\"\u003eB.Sc. 
in Electronic and Computer Engineering\u003c/div\u003e\n      \u003cdiv class=\"cv-org\"\u003e\u003ca href=\"https://www.units.it/\"\u003eUniversity of Trieste\u003c/a\u003e\u003c/div\u003e\n      \u003cdiv class=\"cv-meta\"\u003eSep 2015 – Mar 2019 · Trieste, Italy\u003c/div\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n\u003c/div\u003e\n\u003ch2 id=\"-technical-skills\"\u003e🛠️ Technical Skills\u003c/h2\u003e\n\u003cdiv class=\"cv-skills-grid\"\u003e\n  \u003cdiv class=\"cv-skill-col\"\u003e\n    \u003ch4\u003eProgramming Languages\u003c/h4\u003e\n    \u003cdiv class=\"cv-skill-tags\"\u003e\n      \u003cspan\u003ePython (5+ yrs)\u003c/span\u003e\u003cspan\u003eSQL\u003c/span\u003e\u003cspan\u003eJava\u003c/span\u003e\u003cspan\u003eC++\u003c/span\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"cv-skill-col\"\u003e\n    \u003ch4\u003eML \u0026amp; NLP Frameworks\u003c/h4\u003e\n    \u003cdiv class=\"cv-skill-tags\"\u003e\n      \u003cspan\u003ePyTorch\u003c/span\u003e\u003cspan\u003eHuggingFace\u003c/span\u003e\u003cspan\u003escikit-learn\u003c/span\u003e\u003cspan\u003espaCy\u003c/span\u003e\u003cspan\u003eNLTK\u003c/span\u003e\u003cspan\u003eBERTopic\u003c/span\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"cv-skill-col\"\u003e\n    \u003ch4\u003eLLM \u0026amp; Agent Frameworks\u003c/h4\u003e\n    \u003cdiv class=\"cv-skill-tags\"\u003e\n      \u003cspan\u003eLangChain\u003c/span\u003e\u003cspan\u003eLlamaIndex\u003c/span\u003e\u003cspan\u003eLangGraph\u003c/span\u003e\u003cspan\u003eRAG Pipelines\u003c/span\u003e\u003cspan\u003ePrompt Engineering\u003c/span\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"cv-skill-col\"\u003e\n    \u003ch4\u003eTools \u0026amp; Infrastructure\u003c/h4\u003e\n    \u003cdiv class=\"cv-skill-tags\"\u003e\n      
\u003cspan\u003eGit/GitHub\u003c/span\u003e\u003cspan\u003eDocker\u003c/span\u003e\u003cspan\u003eLinux\u003c/span\u003e\u003cspan\u003eLaTeX\u003c/span\u003e\u003cspan\u003eStreamlit\u003c/span\u003e\u003cspan\u003eGradio\u003c/span\u003e\n    \u003c/div\u003e\n  \u003c/div\u003e\n\u003c/div\u003e\n\u003ch2 id=\"-languages\"\u003e🌍 Languages\u003c/h2\u003e\n\u003cdiv class=\"cv-languages\"\u003e\n  \u003cdiv class=\"cv-lang-item\"\u003e\n    \u003cspan class=\"cv-lang-name\"\u003eItalian\u003c/span\u003e\n    \u003cspan class=\"cv-lang-level\"\u003eNative\u003c/span\u003e\n  \u003c/div\u003e\n  \u003cdiv class=\"cv-lang-item\"\u003e\n    \u003cspan class=\"cv-lang-name\"\u003eEnglish\u003c/span\u003e\n    \u003cspan class=\"cv-lang-level\"\u003eC1 (Advanced)\u003c/span\u003e\n  \u003c/div\u003e\n\u003c/div\u003e","title":"Curriculum Vitae"},{"content":"Journal Articles Pinna, G., Perezhohin, Y., Manzoni, L., Castelli, M., De Lorenzo, A. (2025). \u0026ldquo;Redefining Text-to-SQL Metrics by Incorporating Semantic and Structural Similarity.\u0026rdquo; Scientific Reports 15.1. (Nature) DOI\nPinna, G., Ravalico, D., Rovito, L., Manzoni, L., De Lorenzo, A. (2025). \u0026ldquo;Exploring the Effect of Genetic Improvement for Large Language Models-Generated Code.\u0026rdquo; SN Computer Science 6.7. Paper\nZullich, M., Macovaz, V., Pinna, G., Pellegrino, F.A. (2023). \u0026ldquo;An Artificial Intelligence System for Automatic Recognition of Punches in Fourteenth-Century Panel Painting.\u0026rdquo; IEEE Access. DOI\nConference Papers Pinna, G., Gong, J., Williams, D., Sarro, F. (2026). \u0026ldquo;Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance.\u0026rdquo; arXiv:2602.08915. arXiv\nGong, J., Pinna, G. (2026). \u0026ldquo;Analyzing Message-Code Inconsistency in AI Coding Agent-Authored Pull Requests.\u0026rdquo; arXiv:2601.04886. 
🏆 Distinguished Mining Challenge Paper Award, MSR 2026 arXiv\nPinna, G., Ravalico, D., Rovito, L., Manzoni, L., De Lorenzo, A. (2024). \u0026ldquo;Enhancing Large Language Models-Based Code Generation by Leveraging Genetic Improvement.\u0026rdquo; EuroGP 2024. Lecture Notes in Computer Science, vol. 14631. Springer, Cham. DOI\nPinna, G., Tugnoli, D., Manzoni, L., De Lorenzo, A. (2024). \u0026ldquo;From Courts to Comprehension: Can LLMs Make Judgments More Accessible?\u0026rdquo; IEEE/WIC WI-IAT 2024. Paper\nGong, J., Bian, Y., de la Cal, L., Pinna, G., et al. (2025). \u0026ldquo;GA4GC: Greener Agent for Greener Code via Multi-Objective Configuration Optimization.\u0026rdquo; SSBSE 2025. Paper\nde la Cal, L., Cao, Y., Ercevik, A.I., Pinna, G., et al. (2025). \u0026ldquo;HotCat: Green and Effective Feature Selection toward Hotfix Bug Taxonomy.\u0026rdquo; SSBSE 2025. Paper\nWorkshop Papers Pinna, G., Ravalico, D., Rovito, L., Manzoni, L., De Lorenzo, A. (2025). \u0026ldquo;Improving LLM-Generated Code via Genetic Improvement: A Summary of Recent Advances.\u0026rdquo; CEUR Workshop Proceedings. Ital-IA 2025. ","permalink":"https://giovannipinna.net/publications/","summary":"\u003ch2 id=\"journal-articles\"\u003eJournal Articles\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cp\u003e\u003cstrong\u003ePinna, G.\u003c/strong\u003e, Perezhohin, Y., Manzoni, L., Castelli, M., De Lorenzo, A. (2025). \u0026ldquo;Redefining Text-to-SQL Metrics by Incorporating Semantic and Structural Similarity.\u0026rdquo; \u003cem\u003eScientific Reports\u003c/em\u003e 15.1. (Nature) \u003ca href=\"https://doi.org/10.1038/s41598-025-85645-0\"\u003eDOI\u003c/a\u003e\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003e\u003cstrong\u003ePinna, G.\u003c/strong\u003e, Ravalico, D., Rovito, L., Manzoni, L., De Lorenzo, A. (2025). 
\u0026ldquo;Exploring the Effect of Genetic Improvement for Large Language Models-Generated Code.\u0026rdquo; \u003cem\u003eSN Computer Science\u003c/em\u003e 6.7. \u003ca href=\"https://link.springer.com/content/pdf/10.1007/s42979-025-04281-x.pdf\"\u003ePaper\u003c/a\u003e\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eZullich, M., Macovaz, V., \u003cstrong\u003ePinna, G.\u003c/strong\u003e, Pellegrino, F.A. (2023). \u0026ldquo;An Artificial Intelligence System for Automatic Recognition of Punches in Fourteenth-Century Panel Painting.\u0026rdquo; \u003cem\u003eIEEE Access\u003c/em\u003e. \u003ca href=\"https://doi.org/10.1109/ACCESS.2023.3276538\"\u003eDOI\u003c/a\u003e\u003c/p\u003e","title":"Publications"}]