Microsoft study claims AI is still struggling to debug software

AI is great for generating code, but it’s still underperforming when it comes to simple debugging tasks.

Apr 11, 2025 - 11:30

  • AI promises a huge revolution for developers, but is it just for code creation?
  • Popular AI models from Anthropic and OpenAI aren’t great at debugging
  • Microsoft’s researchers are open-sourcing their tools to facilitate research

Although generative AI is increasingly being integrated into programming workflows, new research from Microsoft reveals that large language models still aren’t quite up to scratch when it comes to debugging.

The research suggests that even advanced models still struggle with debugging tasks that are pretty simple for experienced developers, highlighting the continued importance of human programmers.

AI does appear to have a solid use case in code generation, though: Google now claims that around 25% of its new code is AI-generated, and Meta has also noted the wide deployment of AI for coding.

AI is good for code creation, but not for debugging

The report explores how 11 Microsoft researchers tested nine AI models on SWE-bench Lite – a popular debugging benchmark. Claude 3.7 Sonnet offered the highest success rate at a far-from-perfect 48.4%. OpenAI’s o1 and o3-mini posted lower success rates of 30.2% and 22.1% respectively.

“Even with debugging tools, our simple prompt-based agent rarely solves more than half of the SWE-bench Lite issues,” the researchers wrote, blaming the suboptimal performance on a lack of data representing sequential decision-making behavior.

All hope is not lost, though. “We believe that training or fine-tuning LLMs can enhance their interactive debugging abilities,” they added. The researchers intend to fine-tune an info-seeking model specialized in gathering the necessary information to resolve bugs, but in the meantime, they promise to open-source debug-gym to make it easier for others to conduct similar research.

Debug-gym is described as an “environment that allows code-repairing agents to access tools for active information-seeking behavior.”
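To make the idea of "active information-seeking" concrete, here is a minimal toy sketch in Python. It is not debug-gym's API: the `ToyDebugEnv` class, its `view`, `run_test` and `rewrite` tools, and the `scripted_agent` function are all hypothetical names invented for illustration, and the hard-coded agent stands in for a language model that would choose which tool to call.

```python
# Toy illustration only; none of these names come from debug-gym.

BUGGY_SOURCE = '''
def mean(values):
    return sum(values) / 2
'''


class ToyDebugEnv:
    """Hypothetical environment that exposes a few tools to a repair agent."""

    def __init__(self, source: str):
        self.source = source
        self.log = []  # records which tools the agent called, in order

    def view(self) -> str:
        """Tool: return the current source code."""
        self.log.append("view")
        return self.source

    def run_test(self) -> str:
        """Tool: execute the code and report the outcome of one fixed test."""
        self.log.append("run_test")
        namespace = {}
        exec(self.source, namespace)
        result = namespace["mean"]([2, 4, 6])
        if result == 4:
            return "PASS"
        return f"FAIL: mean([2, 4, 6]) returned {result}, expected 4"

    def rewrite(self, old: str, new: str) -> str:
        """Tool: replace the first occurrence of `old` with `new`."""
        self.log.append("rewrite")
        if old not in self.source:
            return "rewrite failed: text not found"
        self.source = self.source.replace(old, new, 1)
        return "rewrite applied"


def scripted_agent(env: ToyDebugEnv, max_steps: int = 5) -> bool:
    """Stand-in for an LLM policy: gather information, then patch."""
    for _ in range(max_steps):
        if env.run_test() == "PASS":
            return True
        code = env.view()  # seek information before editing
        if "/ 2" in code:
            env.rewrite("/ 2", "/ len(values)")
    return env.run_test() == "PASS"


if __name__ == "__main__":
    env = ToyDebugEnv(BUGGY_SOURCE)
    print("fixed:", scripted_agent(env))
    print("tool calls:", env.log)
```

The point of the sketch is the sequence of decisions (run, inspect, edit, re-run) rather than the fix itself. That sequential behavior is what the researchers say is underrepresented in current training data.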

For now, however, AI might not be bringing as much value to developers’ lives as AI companies suggest. “Most developers spend the majority of their time debugging code,” the researchers wrote, indicating that even if developers benefit from code generation, it might not be saving them much time.
