What If We Stopped Asking ChatGPT to Fix Its Own Code?

TL;DR: LLMs write code that almost works. The usual fix is to ask them again — “self-correction” — but it tends to repeat the same mistakes. We took a different route: treat the buggy code as a seed and evolve it. Using Grammatical Evolution with a grammar built on-the-fly from the LLM’s own output, we improved code from GPT-4, ChatGPT, LLaMA-2, Alpaca-13B, and Alpaca-7B on 25 PSB2 problems — with statistically significant gains (p < 0.001) for every model. The smaller the model, the bigger the win.

The trap of self-correction

Ask any modern LLM to write a Python function and you’ll get something that looks right. Run the tests and you’ll often discover it isn’t. ...

April 3, 2024 · 4 min · Giovanni Pinna