SWE-Bench Lite Results

Chronos-1’s benchmark performance on SWE-Bench Lite, compared with leading models.

Chronos-1 delivers state-of-the-art performance on SWE-Bench Lite, the industry-standard benchmark for evaluating debugging on real-world software issues.

Benchmark Results (Nov 2025)

  • Rank 1: Chronos: 80.33% (241 / 300 solved)
  • Rank 2: ExpeRepair v1.0 (Claude 4.5 Sonnet): 60.33%
  • Rank 3: Refact.ai Agent: 60.00%
  • Rank 4: KGCompass (Claude 4.5 Sonnet): 58.33%
  • Rank 5: SWE Agent (Claude 4.5 Sonnet): 56.67%
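
The percentages above are resolve rates: tasks solved divided by the 300 tasks in SWE-Bench Lite. As a minimal sketch of that arithmetic (the function and variable names are illustrative and not part of any Chronos tooling), the Rank 1 figure and its margin over Rank 2 can be reproduced as follows:

```python
TOTAL_TASKS = 300  # SWE-Bench Lite size, matching the "241 / 300" figure above


def resolve_rate(solved: int, total: int = TOTAL_TASKS) -> float:
    """Return the percentage of tasks solved, rounded to two decimals."""
    return round(100 * solved / total, 2)


chronos_rate = resolve_rate(241)   # Rank 1 entry: 241 of 300 solved
second_best_rate = 60.33           # Rank 2 entry, taken from the list above

# Absolute lead in percentage points (not a relative improvement).
lead = round(chronos_rate - second_best_rate, 2)

print(f"Resolve rate: {chronos_rate}%")
print(f"Lead over Rank 2: {lead} percentage points")
```

The lead is an absolute difference in percentage points, which is why it is computed by subtraction and not as a ratio.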

General-purpose models (no agent frameworks)

  • Claude 4.5 Sonnet (Bash-only): ~14%
  • Claude 4.1 Opus (Bash-only): 14.2%
  • GPT-4.1: 13.8%

Chronos’ 20-percentage-point absolute lead over the second-best system (80.33% vs. 60.33%) comes from:

  • Debugging-specific training on 15M sessions
  • Persistent Debug Memory
  • Adaptive Graph-Guided Retrieval
  • Autonomous fix-test-refine loops

Together, these components enable Chronos-1 to resolve failures that general-purpose models cannot.
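
To make the idea of a fix-test-refine loop concrete, here is a generic sketch of the pattern. It is not Chronos-1's implementation, whose internals are not described here. The names `fix_test_refine`, `propose_fix`, and `run_tests` are hypothetical, and the toy callbacks stand in for a model and a test runner.

```python
from typing import Callable, List, Optional


def fix_test_refine(
    propose_fix: Callable[[str, List[str]], str],
    run_tests: Callable[[str], Optional[str]],
    bug_report: str,
    max_iterations: int = 5,
) -> Optional[str]:
    """Return the first candidate patch that passes the tests, or None.

    propose_fix and run_tests are hypothetical callbacks: the first
    produces a patch from the report plus earlier failure messages,
    the second returns None on success or a failure message.
    """
    failures: List[str] = []  # feedback carried from one attempt to the next
    for _ in range(max_iterations):
        patch = propose_fix(bug_report, failures)
        error = run_tests(patch)
        if error is None:
            return patch          # tests pass: stop refining
        failures.append(error)    # tests fail: refine using the new feedback
    return None                   # iteration budget exhausted


# Toy stand-ins: the first candidate is wrong, the second is right.
candidates = iter(["return a - b", "return a + b"])


def toy_propose(report: str, failures: List[str]) -> str:
    return next(candidates)


def toy_run_tests(patch: str) -> Optional[str]:
    if patch == "return a + b":
        return None
    return "add(2, 3) returned -1, expected 5"


result = fix_test_refine(toy_propose, toy_run_tests, "add() gives the wrong sum")
print(result)
```

The key design choice in this pattern is that each failed test run is fed back into the next proposal, so later attempts are conditioned on earlier failures and do not start from scratch.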