Introduction
The intersection of Large Language Models and evolutionary computation represents one of the most promising frontiers in automated software engineering. Over the past two years, our research group has developed and refined a methodology for systematically improving code generated by LLMs using Genetic Improvement (GI) techniques. This paper, presented at Ital-IA 2025 (the 5th National Conference on Artificial Intelligence, organized by CINI), provides a comprehensive summary of this research program and its key findings.
The Research Program
Our work builds on a simple but powerful observation: LLM-generated code, even when it fails to fully meet specifications, typically contains valuable structural information. The generated programs use appropriate data types, implement reasonable algorithms, and capture the general shape of correct solutions. What they lack is precision — the exact conditions, edge case handling, and algorithmic details that separate “roughly right” from “fully correct.”
Genetic Improvement exploits this observation by treating LLM outputs as starting points for evolutionary optimization. Rather than discarding incorrect code and starting from scratch, GI evolves it toward correctness through a process of guided variation and selection.
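The guided variation-and-selection process can be sketched as a minimal evolutionary loop. This is an illustrative simplification, not the paper's actual pipeline: `mutate` and `fitness` stand in for the grammar-guided mutation operator and test-based fitness function, and the toy demo at the bottom evolves an integer rather than a program.

```python
import random

def genetic_improvement(seed_program, mutate, fitness, generations=50, pop_size=20):
    """Minimal GI loop: evolve variants of an LLM-generated seed.

    `mutate` and `fitness` are placeholders for the grammar-guided
    mutation operator and the test-based fitness function; the real
    pipeline operates on program text, not arbitrary values.
    """
    population = [seed_program] + [mutate(seed_program) for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: pop_size // 2]   # keep the better half (elitist truncation)
        offspring = [mutate(random.choice(survivors))
                     for _ in range(pop_size - len(survivors))]
        population = survivors + offspring
    return max(population, key=fitness)

# Toy demo: "programs" are integers, the target behaviour is the value 42.
random.seed(0)  # reproducibility of the demo only
best = genetic_improvement(
    seed_program=30,
    mutate=lambda p: p + random.choice([-1, 1]),
    fitness=lambda p: -abs(p - 42),
)
```

Because the best survivor is always retained, fitness never decreases across generations: the evolved result is at least as close to the target behaviour as the seed, mirroring how GI can only improve on the LLM's starting point.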
Key Contributions Across Two Studies
Study 1: GI with Grammatical Evolution (EuroGP 2024)
Our first study introduced the three-phase pipeline of code extraction, dynamic grammar specialization, and evolutionary search. The key technical innovation was the automatic generation of BNF grammars tailored to each LLM-generated program, ensuring that mutations remain syntactically valid and focused on relevant code constructs.
Evaluated on 25 PSB2 problems across 5 LLMs (GPT-4, ChatGPT, LLaMA-2 13B, Alpaca-13B, Alpaca-7B), the approach achieved statistically significant improvements (p < 0.001) for every model tested, consistently outperforming LLM self-correction.
Study 2: Enhanced GI with Lexicase Selection (SN Computer Science 2025)
Our second study refined the evolutionary components. We introduced lexicase selection — a strategy that evaluates individuals on test cases sequentially, preserving specialist solutions that excel on specific subsets. Combined with 10% down-sampling for computational efficiency and a refined fitness function F_E providing partial credit rather than binary pass/fail, the enhanced pipeline improved performance in 11 out of 12 model-problem combinations.
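Down-sampled lexicase selection can be sketched as follows (a simplified illustration; parameter names are ours). Individuals are filtered one test case at a time, in random order, over a random 10% sample of the test suite; per-test errors are real-valued rather than binary, reflecting the partial-credit idea behind F_E.

```python
import random

def lexicase_select(population, errors, sample_frac=0.1):
    """Sketch of down-sampled lexicase selection.

    `errors[i][t]` is the error of individual i on test t (lower is
    better, 0 = pass), giving partial credit rather than pass/fail.
    """
    n_tests = len(errors[0])
    k = max(1, int(n_tests * sample_frac))        # 10% down-sampling
    cases = random.sample(range(n_tests), k)      # random subsample of tests
    random.shuffle(cases)                         # random filtering order
    candidates = list(range(len(population)))
    for t in cases:                               # filter by one test at a time
        best = min(errors[i][t] for i in candidates)
        candidates = [i for i in candidates if errors[i][t] == best]
        if len(candidates) == 1:
            break
    return population[random.choice(candidates)]

# Individual "a" dominates every test, so it survives each filtering step.
winner = lexicase_select(
    ["a", "b", "c"],
    [[0] * 10, [1] * 10, [2] * 10],
    sample_frac=1.0,
)
```

Because filtering proceeds test by test, an individual that excels on a specific subset of cases can win whenever those cases come first in the shuffle, which is how lexicase preserves specialists that aggregate fitness would discard.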
The updated model roster included Code Llama 7B and LLaMA 3 8B, with results confirming that GI provides the greatest relative benefit to smaller, less capable models — effectively amplifying their capabilities.
Cross-Cutting Findings
Several findings emerged consistently across both studies:
The Complementarity of Neural and Evolutionary Approaches
LLMs and evolutionary algorithms have fundamentally different strengths. LLMs excel at rapid generation of plausible solutions by leveraging patterns learned from vast code corpora. Evolutionary algorithms excel at systematic, specification-driven refinement. Combining them yields results neither approach achieves alone.
The “Capability Amplifier” Effect
GI’s relative benefit is inversely related to the initial capability of the LLM. Weaker models (Alpaca-7B, Code Llama 7B) show the most dramatic improvements, while stronger models (GPT-4) show smaller but still significant gains. This has important practical implications: organizations using smaller, open-source models for cost or privacy reasons can use GI to narrow the gap with larger proprietary models.
Grammar Specialization as Search Space Design
The dynamic generation of problem-specific grammars proved to be more than a technical convenience — it is a form of intelligent search space design. By leveraging the LLM’s structural choices as prior knowledge, the evolutionary search operates in a much smaller, more productive region of the program space.
Limitations and Open Challenges
We identified several limitations that define the boundaries of the current approach:
Oracle dependency: The fitness function requires test cases or a reference oracle for evaluation. Problems without clear specifications or test suites cannot be addressed with the current methodology.
Scalability constraints: Our evaluation focused on small, self-contained programs. Real-world software engineering involves multi-file projects with complex dependencies, which the current grammar-based approach does not handle.
LLM bias propagation: Since the grammar is derived from the LLM’s output, structural biases in the generated code — such as preferring certain loop constructs or data structures — are inherited by the search space. This may prevent the discovery of solutions requiring fundamentally different architectural choices.
Lack of guarantees: Evolutionary algorithms are stochastic by nature. While our results show consistent improvements on average, there is no guarantee of improvement for any individual run or problem instance.
Future Directions
The research program continues along several fronts: exploring grammar-free GI approaches that can operate directly on ASTs without the BNF intermediary; developing fitness approximation techniques to reduce the dependency on exhaustive test case evaluation; and investigating the integration of GI into larger-scale software engineering workflows where multi-file modifications are necessary.
Presented at the 5th National Conference on Artificial Intelligence (Ital-IA 2025), organized by CINI, June 23-24, 2025, Rome, Italy. This research was conducted at the University of Trieste and NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa.