Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey
Documentation Website | Full Paper | Tables & Resources
Based on a systematic review of 135 papers and online resources, this survey establishes a holistic theoretical framework for Issue Resolution in software engineering. We examine how Large Language Models (LLMs) are transforming the automation of GitHub issue resolution. Beyond the theoretical analysis, we have curated a comprehensive collection of datasets and model training resources, which are continuously synchronized with our GitHub repository and project documentation website.
Explore This Survey:
- Data: Evaluation and training datasets, data collection and synthesis methods
- Methods: Training-free (agent/workflow) and training-based (SFT/RL) approaches
- Analysis: Insights into both data characteristics and method performance
- Tables & Resources: Comprehensive statistical tables and resources
- Full Paper: Read the complete survey paper
We comprehensively survey evaluation benchmarks for issue resolution, categorizing them by programming language, multimodal support, and reproducible execution environments.
Key Datasets:
- SWE-bench: Python-based benchmark with 2,294 real-world issues from 12 repositories
- SWE-bench Lite: Curated subset of 300 high-quality instances
- Multi-SWE-bench: Multilingual extension covering 7+ programming languages
- SWE-bench Multimodal: Incorporates visual elements (JS, TS, HTML, CSS)
- Visual SWE-bench: Focus on vision-intensive issue resolution
→ Explore all evaluation datasets
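For orientation, the snippet below shows how one of these benchmarks can be loaded and inspected programmatically. It assumes the `datasets` library and the commonly used Hub identifier `princeton-nlp/SWE-bench_Lite`; the field names reflect the published SWE-bench schema, but check them against the current dataset card before relying on them.

```python
# Sketch: load SWE-bench Lite from the Hugging Face Hub and inspect one instance.
# Assumes `pip install datasets` and that the dataset id below is still current.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")  # 300 curated instances

example = ds[0]
# Each instance pairs a GitHub issue with the gold patch and the tests
# that must flip from failing to passing for the issue to count as resolved.
print(example["repo"])                      # source repository, e.g. "astropy/astropy"
print(example["instance_id"])               # unique task identifier
print(example["problem_statement"][:300])   # the issue text shown to the model
print(example["patch"][:300])               # gold (reference) patch
```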
We analyze trajectory datasets used for agent training, including both human-annotated and synthetically generated examples.
Notable Resources:
- R2E-Gym: 3,321 trajectories for reinforcement learning
- SWE-Gym: 491 expert trajectories for supervised fine-tuning
- SWE-Fixer: Large-scale dataset with 69,752 editing chains of thought
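Trajectory schemas differ across these datasets, but most record an alternating sequence of model actions and environment observations together with a final outcome. The sketch below is a hypothetical, simplified record structure for illustration only; it does not reproduce the actual format of R2E-Gym, SWE-Gym, or SWE-Fixer.

```python
# Hypothetical, simplified trajectory record; real datasets use their own schemas.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    thought: str      # the model's reasoning before acting
    action: str       # tool call issued by the model
    observation: str  # feedback returned by the environment

@dataclass
class Trajectory:
    instance_id: str                          # which issue this run addresses
    steps: List[Step] = field(default_factory=list)
    resolved: bool = False                    # did the final patch pass the tests?

# A made-up two-step trajectory.
traj = Trajectory(
    instance_id="example__repo-1234",
    steps=[
        Step("Locate the failing function.", "search('parse_date')", "src/utils.py:42"),
        Step("Fix the off-by-one error.", "edit_file('src/utils.py', ...)", "tests passed"),
    ],
    resolved=True,
)
print(len(traj.steps), traj.resolved)
```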
Agent-based approaches are autonomous agents that leverage tool use, memory, and planning to resolve issues without task-specific training; a generic sketch of such an agent loop follows the list below.
Representative Works:
- OpenHands: Multi-agent collaboration framework
- Agentless: Localization + repair pipeline without agent loops
- AutoCodeRover: Hierarchical search-based code navigation
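At their core, these systems run an iterative loop in which the model chooses a tool, observes the result, and accumulates the exchange as working memory. The sketch below is a generic illustration of that loop, not the actual OpenHands or AutoCodeRover implementation; `call_llm` and the tool functions are hypothetical stand-ins for a real model API and sandbox.

```python
# Generic training-free agent loop (illustrative only; not any specific framework).
def call_llm(messages):
    """Placeholder for a real chat-completion call; here it immediately submits."""
    return {"tool": "submit", "args": {"patch": "diff --git a/... (placeholder)"}}

# Hypothetical sandboxed tools the agent may invoke.
TOOLS = {
    "search": lambda query: f"(search results for {query!r})",
    "read_file": lambda path: f"(contents of {path})",
    "edit_file": lambda path, patch: "(edit applied)",
    "run_tests": lambda: "(test output)",
}

def resolve_issue(issue_text, max_steps=30):
    # The message history doubles as the agent's short-term memory.
    messages = [{"role": "user", "content": issue_text}]
    for _ in range(max_steps):
        action = call_llm(messages)           # e.g. {"tool": "search", "args": {...}}
        if action["tool"] == "submit":        # the agent decides it is done
            return action["args"]["patch"]
        observation = TOOLS[action["tool"]](**action.get("args", {}))
        messages.append({"role": "assistant", "content": str(action)})
        messages.append({"role": "user", "content": observation})
    return None  # step budget exhausted without a patch

print(resolve_issue("Example issue: fix the failing date parser."))
```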
Workflow-based approaches are structured pipelines that optimize specific stages of issue resolution; a minimal localize-then-repair sketch follows the list below.
Key Innovations:
- Meta-RAG: Code summarization for enhanced retrieval
- TestAider: Test-driven development integration
- PatchPilot: Automated patch validation and refinement
→ Explore training-free methods
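In contrast to the agent loop above, a workflow fixes the control flow in advance. The sketch below outlines a two-stage localize-then-repair pipeline loosely in the spirit of Agentless; `llm` is a hypothetical model call and the prompts are abbreviated, so treat this as a structural outline rather than the method itself.

```python
# Two-stage localize-then-repair workflow (structural sketch, not Agentless itself).
from pathlib import Path

def llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    return ""

def localize(repo_root: str, issue: str, top_k: int = 3) -> list:
    """Stage 1: ask the model to rank likely-relevant files from the repo layout."""
    files = [str(p) for p in Path(repo_root).rglob("*.py")]
    listing = "\n".join(files[:200])  # truncate very large repositories
    ranked = llm(f"Issue:\n{issue}\n\nFiles:\n{listing}\n\nList the {top_k} most relevant files.")
    return ranked.splitlines()[:top_k]

def repair(issue: str, file_path: str) -> str:
    """Stage 2: generate a candidate patch for one localized file."""
    source = Path(file_path).read_text() if Path(file_path).exists() else ""
    return llm(f"Issue:\n{issue}\n\nFile {file_path}:\n{source}\n\nReturn a unified diff fixing the issue.")

def run_pipeline(repo_root: str, issue: str) -> list:
    # No agent loop: fixed localization followed by patch generation.
    return [repair(issue, f) for f in localize(repo_root, issue)]
```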
Supervised fine-tuning (SFT) trains models on expert trajectories to internalize issue resolution patterns; a data-formatting sketch follows the list below.
Notable Models:
- Devstral (24B): 46.8% on SWE-bench Verified
- Co-PatcheR (14B): Multi-stage training with code editing focus
- SWE-Swiss (32B): Synthetic data augmentation for improved generalization
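A common ingredient behind such models is converting expert trajectories into chat-formatted training examples, one per agent step. The sketch below shows a generic version of that conversion; it is not the actual recipe used by any of the models listed above, and the trajectory fields mirror the simplified schema sketched earlier.

```python
# Generic sketch: turn one expert trajectory into per-step chat-format SFT examples.
import json

def trajectory_to_sft_examples(traj):
    """Yield one example per step: (issue + interaction history) -> next thought/action."""
    history = []
    for step in traj["steps"]:
        target = f"{step['thought']}\n{step['action']}"
        yield {
            "messages": [
                {"role": "system", "content": "You are a software engineering agent."},
                {"role": "user", "content": traj["issue"]},
                *history,
                {"role": "assistant", "content": target},  # what the model should learn to emit
            ]
        }
        history.append({"role": "assistant", "content": target})
        history.append({"role": "user", "content": step["observation"]})

traj = {
    "issue": "parse_date() returns the wrong month for ISO strings.",
    "steps": [
        {"thought": "Find the parser.", "action": "search('parse_date')",
         "observation": "src/utils.py:42"},
        {"thought": "Fix the index.", "action": "edit_file('src/utils.py', ...)",
         "observation": "tests passed"},
    ],
}
for example in trajectory_to_sft_examples(traj):
    print(json.dumps(example)[:120])
```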
Reinforcement learning (RL) optimizes models through environmental feedback and reward signals; a sketch of an outcome-based reward follows the list below.
State-of-the-Art:
- OpenHands Critic (32B): 66.4% on SWE-bench Verified
- Kimi-Dev (72B): 60.4% with outcome-based rewards
- DeepSWE (32B): Trained from scratch using RL on code repositories
→ Explore training-based methods
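The reward in these setups is typically derived from test execution: full credit only when the issue's fail-to-pass tests succeed and previously passing tests do not regress, mirroring the SWE-bench evaluation criterion. The sketch below illustrates that idea generically; it is not the actual reward implementation of any model listed above.

```python
# Outcome-based reward sketch: reward 1.0 only when the patch resolves the issue
# without regressions. Inputs are lists of booleans (True = test passed).
def outcome_reward(fail_to_pass, pass_to_pass):
    return 1.0 if all(fail_to_pass) and all(pass_to_pass) else 0.0

# A partially shaped variant: dense credit for progress on the target tests,
# but never when existing tests regress.
def shaped_reward(fail_to_pass, pass_to_pass):
    if not all(pass_to_pass):    # regressions are never rewarded
        return 0.0
    if not fail_to_pass:
        return 0.0
    return sum(fail_to_pass) / len(fail_to_pass)

print(outcome_reward([True, True], [True]))   # 1.0: resolved
print(shaped_reward([True, False], [True]))   # 0.5: partial progress
```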
- Quality vs. Quantity: Analysis of dataset characteristics and their impact on model performance
- Contamination Detection: Protocols for ensuring benchmark integrity (a simple overlap check is sketched after this list)
- Difficulty Spectrum: Stratification of issues by complexity
- Performance Trends: Comparative evaluation across model families and sizes
- Scaling Laws: Analysis of parameter count vs. performance gains
- Efficiency Metrics: Cost-benefit analysis of different approaches
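As a small illustration of contamination detection, the sketch below flags benchmark instances whose problem statements share long character n-grams with a training corpus. Real decontamination protocols are considerably stricter; the function names and thresholds here are illustrative assumptions.

```python
# Surface-level contamination check: flag benchmark instances whose problem
# statements overlap heavily (in long character n-grams) with training text.
def ngrams(text: str, n: int = 50) -> set:
    """Character n-grams of a whitespace-normalized, lowercased string."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def is_contaminated(problem_statement: str, training_docs, n: int = 50,
                    threshold: float = 0.2) -> bool:
    bench = ngrams(problem_statement, n)
    if not bench:
        return False
    for doc in training_docs:
        overlap = len(bench & ngrams(doc, n)) / len(bench)
        if overlap >= threshold:
            return True
    return False

# Example: an exact copy of the issue text in the corpus is flagged.
issue = "TypeError when calling DataFrame.merge with a categorical key ... " * 3
print(is_contaminated(issue, ["unrelated text"]))   # False
print(is_contaminated(issue, ["prefix " + issue]))  # True
```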
The scalability of SWE agents is bottlenecked by the high costs of sandboxed environments and long-context inference. Optimization strategies are required to streamline these resource-intensive loops without sacrificing performance.
Benchmarks often overlook efficiency, masking the high costs of techniques like inference-time scaling. Standardized reporting of latency and token usage is crucial for guiding the development of cost-effective agents.
Reliance on text proxies for UI interpretation limits effectiveness. Future research can adopt intrinsic multi-modal solutions, such as code-centric MLLMs, to better bridge the gap between visual rendering and underlying code logic.
High autonomy carries risks of destructive actions, such as accidental code deletion. Future systems should integrate safeguards, such as Git-based version control, to ensure autonomous modifications remain secure and reversible.
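As a concrete illustration of such a safeguard, the sketch below checkpoints a repository before an autonomous edit and rolls back when validation fails. It assumes the agent starts from a committed, clean working tree; the helper names are hypothetical.

```python
# Minimal sketch of a Git-based safeguard: checkpoint before an autonomous edit
# and roll back if validation fails. Assumes a committed, clean working tree.
import subprocess

def git(*args, cwd="."):
    return subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True).stdout.strip()

def guarded_edit(apply_edit, validate, cwd="."):
    """Run `apply_edit()`; keep its changes only if `validate()` returns True."""
    head = git("rev-parse", "HEAD", cwd=cwd)   # checkpoint commit
    apply_edit()                               # the agent modifies the working tree
    if validate():                             # e.g. the test suite still passes
        return True
    git("reset", "--hard", head, cwd=cwd)      # revert tracked changes
    git("clean", "-fd", cwd=cwd)               # remove files the agent created
    return False
```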
Reinforcement learning is hindered by sparse, binary feedback. Integrating fine-grained signals from compiler diagnostics and execution traces is necessary to guide models through complex reasoning steps.
As benchmarks approach saturation, evaluation validity is compromised by data leakage. Future frameworks must strictly enforce decontamination protocols to ensure fairness and reliability.
While current issue resolution tasks mirror development workflows, they represent only a fraction of the full Software Development Life Cycle (SDLC). Future research should broaden the scope of issue resolution tasks to develop more versatile automated software generation methods.
Visit our Tables & Resources page for comprehensive statistical tables including:
- Evaluation Datasets Overview: Detailed comparison of 30+ benchmarks
- Training Trajectory Datasets: Analysis of 5 major trajectory datasets
- Supervised Fine-Tuning Models: Performance metrics for 10+ SFT models
- Reinforcement Learning Models: Comprehensive analysis of 30+ RL-trained models
- General Foundation Models: Evaluation of 15+ general-purpose LLMs
We welcome contributions to this survey! If you'd like to add new papers or fix errors:
- Fork this repository
- Add paper entries to the corresponding YAML file under the `data/` directory (e.g., `papers_evaluation_datasets.yaml`, `papers_single_agent.yaml`, etc.)
- Follow the existing format with the fields `short_name`, `title`, `authors`, `venue`, `year`, and `links` (arxiv, github, huggingface); an illustrative entry is sketched after this list
- Run `python scripts/render_papers.py` to update the documentation
- Submit a PR with your changes
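For illustration, the snippet below emits an entry with the required fields; the values are made up, and the exact YAML layout should be copied from the existing entries under `data/`.

```python
# Illustrative only: field values are made up; match the layout of existing entries.
import yaml  # pip install pyyaml

entry = {
    "short_name": "ExampleBench",
    "title": "An Example Benchmark for Issue Resolution",
    "authors": "A. Author and B. Author",
    "venue": "arXiv",
    "year": 2025,
    "links": {
        "arxiv": "https://arxiv.org/abs/0000.00000",
        "github": "https://github.com/example/example-bench",
        "huggingface": "https://huggingface.co/datasets/example/example-bench",
    },
}
print(yaml.safe_dump([entry], sort_keys=False))
```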
If you use this project or related survey in your research or system, please cite the following BibTeX:
@misc{li2025awesome_issue_resolution,
title = {Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey},
author = {Caihua Li and Lianghong Guo and Yanlin Wang and Wei Tao and Zhenyu Shan and Mingwei Liu and Jiachi Chen and Haoyu Song and Duyu Tang and Hongyu Zhang and Zibin Zheng},
year = {2025},
howpublished = {\url{https://github.com/DeepSoftwareAnalytics/Awesome-Issue-Resolution}}
}

Once published on arXiv or at a conference, please replace the entry with the official citation information (authors, DOI/arXiv ID, conference name, etc.).
If you have any questions or suggestions, please contact us through:
- Email: noranotdor4@gmail.com
- GitHub Issues: Open an issue
This project is licensed under the MIT License - see the LICENSE file for details.
⭐ Star this repository if you find it helpful!
Made with ❤️ by the DeepSoftwareAnalytics team
Documentation | Paper | Tables | About | Cite