OpenAI released GPT-5 yesterday, promoting it as their best model yet for agentic coding. When it arrived in my GitHub Copilot this morning, I immediately decided to test it with a complex, long-running agentic coding task — and later gave the exact same task to Claude 4 Sonnet for comparison.
While this isn't a tightly controlled scientific comparison — more of a "vibe check" — both models impressed me with their results. It's worth noting that while Claude Sonnet has been an established coding model for a while, GPT-5 is brand new, still in preview with some kinks to iron out, and had only been out for a few hours when I tested it.
The Task
The challenge I set for both models: review the current implementation of Ruler (a tool I built with AI help in TypeScript) and port it to Rust. Ruler isn't especially sophisticated — it's primarily a tool for managing text and configuration, moving text from here to there — but its support for many different agents and formats gives it enough complexity to make it a stretch for a model to complete in one go.
Testing GPT-5: Intelligence and Agency
I opened Visual Studio Code, set GitHub Copilot Chat to GPT-5, and dictated my requirements pretty much stream-of-consciousness. I intentionally didn't do much planning myself — this model is advertised as good at following instructions with minimal prompting, and I wanted to see how good it was at planning.
The task is to create a port of Ruler, the tool implemented in this repository, in the Rust programming language. You will create a new branch to work in and create a new directory where the Rust implementation will live. You will analyze the code base and everything you have here to understand exactly how the tool ruler works. You will create a functional test that uses a shell script to test both the TypeScript implementation and the Rust implementation independently by running the tool in a temporary directory and seeing how it behaves. When you have that functional test in a shell script and it works correctly with the TypeScript implementation, you can go ahead and create the Rust implementation. The Rust implementation needs to work exactly the same as the TypeScript implementation. all the invocation parameters, all the locations of files and all of its behavior. Unlike the TypeScript implementation it doesn't need to have testing like unit tests or end-to-end tests. You will only use the functional test you wrote in a shell script to validate that it is working correctly. You need to continuously work on the Rust implementation and test it with your functional test until it works correctly exactly as the TypeScript implementation works. When you have a working implementation, you will commit the changes Remember that you're working in a different branch, not in the main branch and you'll push that branch to github but don't create a pull request. Before you begin, research the task based on everything I just told you. look at the code base and do research online if you need, and write up a very detailed plan for what you're going to do and put it in a markdown file plan.md and then stop and let me review that file.
The model immediately impressed. It reviewed my codebase, read online information about relevant packages and libraries, then produced a very nice detailed plan. After I reviewed and approved it, I instructed GPT-5 to execute.
# Ruler Rust Port: Detailed Plan
Date: 2025-08-08
## Goal
Create a faithful Rust port of the Ruler CLI that behaves identically to the TypeScript implementation, validated by an end-to-end functional shell test. Work happens on a dedicated branch and will be pushed without opening a PR.
## Scope
- Implement the Ruler CLI in Rust with the same commands, flags, config discovery, and outputs:
- Commands: `apply`, `init`, `revert`.
- Options parity (names, defaults, negations, behavior):
- `--project-root <path>` (default: cwd)
- `--agents <list>` (comma-separated identifiers or substrings)
- `--config <file>`
- `--mcp`/`--with-mcp` (default: true)
- `--no-mcp`
- `--mcp-overwrite` (strategy: overwrite vs merge)
- `--gitignore` / `--no-gitignore` (default: enabled)
- `--verbose` / `-v`
- `--dry-run`
- `--local-only`
- `revert` additional: `--keep-backups`
- `init` additional: `--global`
- Exit codes and error prefix: `[RulerError]` for errors.
- Output files and directory structure are identical.
...
> See https://github.com/intellectronica/ruler/blob/feat/rust-port/plan.md
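To make the flag-parity requirement concrete, here is a minimal sketch of how that CLI surface might be declared in Rust with the clap crate (derive feature). This is my own illustration covering a subset of the options listed in the plan above, not code taken from either port; the real implementations are in the linked branches.

```rust
// Hypothetical sketch only: a clap-based declaration of part of the CLI
// surface described in the plan. Not the code GPT-5 or Claude produced.
use clap::{Parser, Subcommand};

#[derive(Parser)]
#[command(name = "ruler")]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Concatenate rules and write agent configuration files
    Apply {
        /// Project root (default: current working directory)
        #[arg(long, default_value = ".")]
        project_root: std::path::PathBuf,
        /// Comma-separated agent identifiers or substrings
        #[arg(long)]
        agents: Option<String>,
        /// Explicit configuration file
        #[arg(long)]
        config: Option<std::path::PathBuf>,
        /// Disable MCP handling (enabled by default)
        #[arg(long)]
        no_mcp: bool,
        /// Overwrite MCP configuration instead of merging
        #[arg(long)]
        mcp_overwrite: bool,
        /// Skip .gitignore updates (enabled by default)
        #[arg(long)]
        no_gitignore: bool,
        /// Verbose output
        #[arg(short, long)]
        verbose: bool,
        /// Show what would change without writing anything
        #[arg(long)]
        dry_run: bool,
    },
    /// Scaffold a .ruler directory
    Init {
        #[arg(long)]
        global: bool,
    },
    /// Undo a previous apply
    Revert {
        #[arg(long)]
        keep_backups: bool,
    },
}

fn main() {
    match Cli::parse().command {
        Command::Apply { .. } => { /* apply logic would go here */ }
        Command::Init { .. } => { /* init logic */ }
        Command::Revert { .. } => { /* revert logic */ }
    }
}
```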
The first thing I noticed: GPT-5 is way more agentic than previous models. It went on and on — looking at different parts of the codebase, writing new things, all without stopping, doing lots of work with almost no intervention. It stopped twice during the interaction, declaring it had finished some work, explaining clearly what it did and what was left to do, and asking how to continue. This was acceptable — it gave me correct information on the project status — though I would have been fine with it continuing without returning to me.
I asked GPT-5 to take a particular approach: first write a functional test as a shell script, then use that to validate that the original and the port behaved identically. It did quite well, although it spent a disproportionate amount of time on the shell script compared to reading the existing codebase and writing Rust code. Perhaps I should have chosen a different strategy, but it did work.
Twice during the operation, it got stuck — the chat interface just displayed a spinner, not doing anything. Eventually I gave up, stopped it, and prompted it to continue from where it left off. This didn't create serious problems, but it meant more supervision was needed on my part. After babysitting the agent, restarting it a couple of times, and twice confirming it should continue with the implementation, it managed to create a complete port that verifiably behaved the same as the original.
The code it produced was somewhat disappointing. It chose to put everything in a single Rust file with lots of if-then spaghetti, even though its own plan called for a more structured approach. It worked absolutely fine, but this isn't something I would accept as a pull request. Still, I was hugely impressed by the level of intelligence, ability to understand context, follow instructions, and execute such a complex task.
See the complete GPT-5 port at github.com/intellectronica/ruler/tree/feat/rust-port/rust/ruler-rs
Testing Claude 4 Sonnet: Fast and Elegant, Less Intelligent and Disciplined
With GPT-5's task complete, I used the exact same prompt with Claude 4 Sonnet.
I immediately noticed Claude works faster and communicates well — verbose where it needs to be, with clear formatting. One consistent difference: while GPT-5 would do lots of thinking and then do something right the first time, Claude frantically tried different things — writing code, executing commands, making pretty dumb mistakes (like saving files with obvious syntax errors, or losing track of what exactly it was working on and having to reread the codebase), but then recovering. This meant it eventually got to a correct implementation, but with many more steps.
Claude was more stable — it never got stuck, which is expected from a production service that has been running for a while. However, it was less disciplined about following instructions. Most crucially, I had instructed it to let a functional test written as a shell script lead the implementation; Claude did that eventually, but it also improvised lots of its own testing methodologies. It seemed less disciplined overall, not following instructions as closely, just trying different things until it got acceptable results.
I was much more satisfied with the code quality and elegance. Unlike GPT-5's single messy Rust file, Claude created a very neat project structure with different modules for each part of the program — much more readable and maintainable code that I would consider working with.
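For a sense of what that looks like, here is a rough sketch of a modular layout along those lines. This is my own illustration, not Claude's actual structure (that is in the branch linked below); the modules are shown inline, whereas a real project would split them into separate files under src/.

```rust
// Illustration only: one module per concern instead of a single monolithic file.
mod config {
    // Configuration file discovery and parsing.
}
mod agents {
    // Per-agent output targets and file locations.
}
mod apply {
    // The `apply` command: concatenate rules, write agent files.
}
mod revert {
    // The `revert` command: restore backups, remove generated files.
}

fn main() {
    // Parse CLI arguments and dispatch into the modules above.
}
```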
Like GPT-5, Claude stopped a few times, which was okay. But unlike GPT-5, it didn't give reliable status reports. Each time it would say it was done, but when I'd point out the missing implementation, it would give its usual "you're absolutely right" and try to fix it. After five iterations of this, I gave up. The current implementation, though impressive and elegant, is incomplete. Claude didn't really understand the task from beginning to end — it kept trying different things and required constant feedback.
See the current Claude 4 Sonnet port at github.com/intellectronica/ruler/tree/feat/claude-rust-port/rust-ruler
Note that Claude 4 Sonnet isn’t the strongest model from Anthropic’s Claude series. Claude Opus is their most capable model for coding, but it seemed inappropriate to compare it with GPT-5 because it costs 10 times as much.
Key Comparisons
Working Style
- GPT-5: Does lots of thinking, then executes correctly the first time
- Claude: Tries different approaches, makes mistakes, recovers and iterates
Code Quality
- GPT-5: Complete, working implementation but poor structure (single file, spaghetti code)
- Claude: Beautiful, maintainable project structure with clean modules
Reliability
- GPT-5: Got stuck twice (probably technical issues during preview), but recovered easily
- Claude: Never got stuck, but unreliable status reporting
Instruction Following
- GPT-5: Highly disciplined, follows instructions precisely
- Claude: Improvises and deviates from instructions frequently
GitHub Copilot Chat Agent: Great but Not Yet Fully Autonomous
GitHub Copilot has come a long way and the current agent is really fun to work with. It has full support for everything — from its own rich toolset to MCP servers, plus terminal and IDE interaction. It can work agentically on long-running tasks, and the display and interaction feel high quality.
One persistent problem: even though Copilot Chat lets you allow actions initiated by tools and MCP servers for the duration of the session or for all activity in the workspace, terminal commands often require manual approval. For a long-running task, I had to manually approve even simple commands like reading directory listings. This demands constant supervision from the user.
Compared to agentic coding tools like Claude Code, Codex CLI and others, this is more hands-on work. I hope GitHub Copilot Chat eventually gets its own autonomous mode allowing it to run without requiring approvals for terminal commands. For now, this kind of task would probably be better suited for terminal-based agents or autonomous web-based agents like the GitHub Copilot web agent, OpenHands, or similar tools.
One thing I appreciate about the GitHub Copilot subscription is that it meters work by requests rather than by tokens. Rather than anxiously counting every interaction and hoping it won't cost me too much, I could predict the cost based on the number of requests I made in the chat. In this case, both GPT-5 and Claude 4 Sonnet have the same 1X multiplier for "premium requests".
Conclusion
Both models performed well. GPT-5 seems to be a very intelligent and capable model — able to understand tasks perfectly, plan well, and carry them out with intention. It definitely looks like a very strong model for coding that I'll probably rely on heavily in the future.
Stylistically, Claude still seems to be winning — it wrote more elegant code and was nicer to interact with, and that counts too.
Addendum: Only after completing this task and publishing the writeup did I find out that GitHub Copilot Chat in Visual Studio Code already has an experimental flag for auto-approving commands. This is obviously something you’d want to run in an isolated environment, but it’s going to save me a lot of time in the future.