I wasn't really hunting for a winner. I was curious where they'd disagree. Because that's usually where the interesting stuff lives. It did not disappoint.
Three Temperaments, Not Three Tools
It felt less like a contest and more like watching three colleagues with very different personalities squint at the same problem.
Claude played the architect, holding the whole plan in its head. It caught a structural flaw the others sailed straight past, staged the work sensibly, and, rather charmingly, admitted when it wasn't sure.
Copilot was the craftsman with its hands already in the code: fast, repo-aware, test code you could paste and run — though, heads-down as it was, it cheerfully broke a rule it had written itself two paragraphs earlier.
Gemini was the quick outsider: a sixty-second read, a sprawling spec boiled down to something you could react to, and — trained on different ground — it spotted a couple of things the other two had stopped noticing. Ask it to dig deep, though, and it lost interest.
"Ask each model which is best, and each quietly points at itself. They were trained on human behaviour, after all."
Hands in the Code: When "Agentic" Earns Its Keep
The bit that surprised me most had nothing to do with cleverness, and everything to do with whether a model was allowed to touch the code.
Give one tools and access to the repository, and it can read the files, run the tests, and check its own claims against what's really there. That changes everything. Copilot caught most of the real defects precisely because it could cross-reference spec, source and tests at once. The same sort of reasoning, unplugged from the files, wrote the sharpest critique of the lot — and then missed a one-line bug it simply couldn't see. It was thinking beautifully about a line it had no way of reading.
So now I reach for the agentic setup when the truth lives in the code. Building, refactoring, hunting gaps, fixing until the tests go green. And I leave the tools switched off when the truth lives in judgement. Architecture, trade-offs, picking holes in a plan before it exists.
Lovely thinking and confident fiction can look identical, right up until something runs.
What I'll Actually Do Differently
Here's the fun part: run all three in sequence and you get something better than any one alone a fast first read, a deep in-context pass, an independent second opinion, then a reasoning model brought in cold to pick holes in the result. Which is, of course, exactly how a decent team already reviews code: one author, one reviewer who knows the domain, one who deliberately doesn't. We didn't invent anything.
We rediscovered the value of a second pair of eyes and gave it three.
Was it rigorous? Not remotely... One codebase, one prompt, one curious afternoon. But the habit it left me with has earned its keep since: before writing anything tricky, talk it through with a model that's good at *thinking*. The cheapest minute in the whole cycle is the one that catches the flaw on the whiteboard instead of in production.
A Final Thought
There's no winner here, and honestly I'm glad. These models aren't rivals fighting over one seat, they're a small crew. And each with a strength, a quirk and a blind spot. The craft is in the casting.
Hence our motto: Commitment you can count on.
So if you're trying to fold these tools into how your team actually delivers and you'd happily skip the league-table arguments, that's a conversation we rather enjoy.