SWE-Bench Lite Results
Chronos’ benchmark performance on SWE-Bench Lite and comparison with leading models.
Chronos-1 achieves the top reported score on SWE-Bench Lite, a widely used 300-task benchmark for evaluating automated fixes to real software bugs.
Benchmark Results (Nov 2025)
- Rank 1: Chronos: 80.33% (241 / 300 solved)
- Rank 2: ExpeRepair v1.0 (Claude 4.5 Sonnet): 60.33%
- Rank 3: Refact.ai Agent: 60.00%
- Rank 4: KGCompass (Claude 4.5 Sonnet): 58.33%
- Rank 5: SWE Agent (Claude 4.5 Sonnet): 56.67%
General-purpose models (no agent frameworks)
- Claude 4.5 Sonnet (Bash-only): ~14%
- Claude 4.1 Opus (Bash-only): 14.2%
- GPT-4.1: 13.8%
Chronos’ lead of 20 percentage points over the second-ranked system (80.33% vs. 60.33%) is attributed to:
- Debugging-specific training on 15M sessions
- Persistent Debug Memory
- Adaptive Graph-Guided Retrieval
- Autonomous fix-test-refine loops
Together, these components enable Chronos-1 to fix bugs that general-purpose models fail to resolve.
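The headline figures can be checked with simple arithmetic. The sketch below is only an illustration: the function name `solve_rate` is hypothetical, and the inputs are the numbers reported above (241 of 300 tasks solved, and 60.33% for the second-ranked system).

```python
def solve_rate(solved: int, total: int) -> float:
    """Return the percentage of benchmark tasks solved, rounded to 2 decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * solved / total, 2)


# Figures taken from the results list above.
chronos_rate = solve_rate(241, 300)   # Chronos: 241 / 300 solved
runner_up_rate = 60.33                # ExpeRepair v1.0, as reported

# Absolute lead, measured in percentage points (not a relative improvement).
lead_points = round(chronos_rate - runner_up_rate, 2)

print(f"Chronos solve rate: {chronos_rate}%")
print(f"Lead over rank 2: {lead_points} percentage points")
```

Note that the lead is an absolute difference in percentage points; it is not a relative improvement over the second-ranked score.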