Anthropic's Claude 3.7 Sonnet Hits 62% on SWE-bench — The Reasoning Wars Are On

The reasoning arms race just got a new leaderboard entry. Anthropic dropped Claude 3.7 Sonnet with a hybrid thinking mode — and the benchmark numbers are hard to ignore.

🍌 The Big Story

Anthropic released Claude 3.7 Sonnet, its first model with extended thinking built in. Toggle it on and the model reasons step-by-step before answering — similar to OpenAI’s o1, but available in the API and Claude.ai today. The headline number: 62.3% on SWE-bench, which measures real-world software engineering tasks. That’s up from Claude 3.5’s 49% and above GPT-4o’s 58%. For developers, this means Claude can now handle multi-file refactors and tricky bug diagnoses without hand-holding. read more

⚡ Quick Peels

Mistral — Released Mistral Small 3.1, a 24B open model that runs locally on an M3 MacBook Pro and outperforms GPT-4o-mini on most benchmarks — free to download via Ollama
Google — Gemini Advanced now includes a Research Mode that auto-generates multi-step reports with cited sources, available to all paid users at no extra cost
OpenAI — Quietly expanded GPT-4o’s context window to 256K tokens for all Plus users, putting it on par with Gemini 1.5 Pro for long-document work

🛠️ Tool of the Day

Cursor — 🍌 Ripe

The AI code editor that’s quietly become the default for developers who want to ship faster. Tab completion predicts your next edit across files (not just the next line), and Composer lets you describe a feature in plain English and watch it scaffold across your whole codebase. Free tier includes 2,000 completions per month — enough to get genuinely hooked before you hit the paywall.

🧠 Banana Tip

Stuck on a bug? Paste the error and the relevant code into Claude with extended thinking enabled and add: “Think through every possible cause before suggesting a fix — ranked from most to least likely, with one sentence explaining why each could be the culprit.” You’ll get a proper diagnosis instead of the usual “try checking your imports.”

The reasoning wars are just getting started — and the tools on the front lines are free to try. 🍌

— The Bonita Banana Team