Introduction
The intersection of Large Language Models and evolutionary computation represents one of the most promising frontiers in automated software engineering. Over the past two years, our research group has developed and refined a methodology for systematically improving code generated by LLMs using Genetic Improvement (GI) techniques. This paper, presented at Ital-IA 2025 (the 5th National Conference on Artificial Intelligence, organized by CINI), provides a comprehensive summary of this research program and its key findings.
The Research Program
Our work builds on a simple but powerful observation: LLM-generated code, even when it fails to fully meet specifications, typically contains valuable structural information. The generated programs use appropriate data types, implement reasonable algorithms, and capture the general shape of correct solutions. What they lack is precision — the exact conditions, edge case handling, and algorithmic details that separate “roughly right” from “fully correct.”
Genetic Improvement exploits this observation by treating LLM outputs as starting points for evolutionary optimization. Rather than discarding incorrect code and starting from scratch, GI evolves it toward correctness through a process of guided variation and selection.
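The guided variation-and-selection process can be sketched as a minimal evolutionary loop. This is an illustrative simplification, not the paper's actual pipeline: `mutate` and `fitness` stand in for the grammar-guided mutation operator and test-based fitness function, and the toy demo at the bottom evolves an integer rather than a program.

```python
import random

def genetic_improvement(seed_program, mutate, fitness, generations=50, pop_size=20):
    """Minimal GI loop: evolve variants of an LLM-generated seed.

    `mutate` and `fitness` are placeholders for the grammar-guided
    mutation operator and the test-based fitness function; the real
    pipeline operates on program text, not arbitrary values.
    """
    population = [seed_program] + [mutate(seed_program) for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: pop_size // 2]   # keep the better half (elitist truncation)
        offspring = [mutate(random.choice(survivors))
                     for _ in range(pop_size - len(survivors))]
        population = survivors + offspring
    return max(population, key=fitness)

# Toy demo: "programs" are integers, the target behaviour is the value 42.
random.seed(0)  # reproducibility of the demo only
best = genetic_improvement(
    seed_program=30,
    mutate=lambda p: p + random.choice([-1, 1]),
    fitness=lambda p: -abs(p - 42),
)
```

Because the best survivor is always retained, fitness never decreases across generations: the evolved result is at least as close to the target behaviour as the seed, mirroring how GI can only improve on the LLM's starting point.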
Key Contributions Across Two Studies
Study 1: GI with Grammatical Evolution (EuroGP 2024)
Our first study introduced the three-phase pipeline of code extraction, dynamic grammar specialization, and evolutionary search. The key technical innovation was the automatic generation of BNF grammars tailored to each LLM-generated program, ensuring that mutations remain syntactically valid and focused on relevant code constructs.
Evaluated on 25 PSB2 problems across 5 LLMs (GPT-4, ChatGPT, LLaMA-2 13B, Alpaca-13B, Alpaca-7B), the approach achieved statistically significant improvements (p < 0.001) for every model tested, consistently outperforming LLM self-correction.
Study 2: Enhanced GI with Lexicase Selection (SN Computer Science 2025)
Our second study refined the evolutionary components. We introduced lexicase selection — a strategy that evaluates individuals on test cases sequentially, preserving specialist solutions that excel on specific subsets. Combined with 10% down-sampling for computational efficiency and a refined fitness function F_E providing partial credit rather than binary pass/fail, the enhanced pipeline improved performance in 11 out of 12 model-problem combinations.
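Down-sampled lexicase selection can be sketched as follows (a simplified illustration; parameter names are ours). Individuals are filtered one test case at a time, in random order, over a random 10% sample of the test suite; per-test errors are real-valued rather than binary, reflecting the partial-credit idea behind F_E.

```python
import random

def lexicase_select(population, errors, sample_frac=0.1):
    """Sketch of down-sampled lexicase selection.

    `errors[i][t]` is the error of individual i on test t (lower is
    better, 0 = pass), giving partial credit rather than pass/fail.
    """
    n_tests = len(errors[0])
    k = max(1, int(n_tests * sample_frac))        # 10% down-sampling
    cases = random.sample(range(n_tests), k)      # random subsample of tests
    random.shuffle(cases)                         # random filtering order
    candidates = list(range(len(population)))
    for t in cases:                               # filter by one test at a time
        best = min(errors[i][t] for i in candidates)
        candidates = [i for i in candidates if errors[i][t] == best]
        if len(candidates) == 1:
            break
    return population[random.choice(candidates)]

# Individual "a" dominates every test, so it survives each filtering step.
winner = lexicase_select(
    ["a", "b", "c"],
    [[0] * 10, [1] * 10, [2] * 10],
    sample_frac=1.0,
)
```

Because filtering proceeds test by test, an individual that excels on a specific subset of cases can win whenever those cases come first in the shuffle, which is how lexicase preserves specialists that aggregate fitness would discard.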
The updated model roster included Code Llama 7B and LLaMA 3 8B, with results confirming that GI provides the greatest relative benefit to smaller, less capable models — effectively amplifying their capabilities.
Cross-Cutting Findings
Several findings emerged consistently across both studies:
The Complementarity of Neural and Evolutionary Approaches
LLMs and evolutionary algorithms have fundamentally different strengths. LLMs excel at rapid generation of plausible solutions by leveraging patterns learned from vast code corpora. Evolutionary algorithms excel at systematic, specification-driven refinement. Combining them yields results neither approach achieves alone.
The “Capability Amplifier” Effect
GI’s relative benefit is inversely related to the initial capability of the LLM. Weaker models (Alpaca-7B, Code Llama 7B) show the most dramatic improvements, while stronger models (GPT-4) show smaller but still significant gains. This has important practical implications: organizations using smaller, open-source models for cost or privacy reasons can use GI to narrow the gap with larger proprietary models.
Grammar Specialization as Search Space Design
The dynamic generation of problem-specific grammars proved to be more than a technical convenience — it is a form of intelligent search space design. By leveraging the LLM’s structural choices as prior knowledge, the evolutionary search operates in a much smaller, more productive region of the program space.
Limitations and Open Challenges
We identified several limitations that define the boundaries of the current approach:
Oracle dependency: The fitness function requires test cases or a reference oracle for evaluation. Problems without clear specifications or test suites cannot be addressed with the current methodology.
Scalability constraints: Our evaluation focused on small, self-contained programs. Real-world software engineering involves multi-file projects with complex dependencies, which the current grammar-based approach does not handle.
LLM bias propagation: Since the grammar is derived from the LLM’s output, structural biases in the generated code — such as preferring certain loop constructs or data structures — are inherited by the search space. This may prevent the discovery of solutions requiring fundamentally different architectural choices.
Lack of guarantees: Evolutionary algorithms are stochastic by nature. While our results show consistent improvements on average, there is no guarantee of improvement for any individual run or problem instance.
Future Directions
The research program continues along several fronts: exploring grammar-free GI approaches that can operate directly on ASTs without the BNF intermediary; developing fitness approximation techniques to reduce the dependency on exhaustive test case evaluation; and investigating the integration of GI into larger-scale software engineering workflows where multi-file modifications are necessary.
Presented at the 5th National Conference on Artificial Intelligence (Ital-IA 2025), organized by CINI, June 23-24, 2025, Rome, Italy. This research was conducted at the University of Trieste and NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa.