The Real Cost of Routine Engineering Tasks
For over two years, the tech world has been shouting that AI is coming for software engineering. Autonomous coding agents. End-to-end pull requests. Engineers replaced by bots.
And yet, most engineering teams are still grinding through tickets, writing the same CRUD endpoints, and fixing the same event-routing bugs. Individual engineers may be more efficient, but there's been no 10x productivity explosion at the organizational level.
This, I believe, is all about to change.
What we believed before anyone cared
Last summer, in mid-2025, we started building Supernaut around a conviction most people thought was premature: fully autonomous engineering agents that work end-to-end, asynchronously. That means taking a task, writing the code, opening the merge request, and waiting for a human to review. No hand-holding, no context switching, no mental load on the engineers.
The tech wasn't fully there yet. The skepticism was constant. Even people on our own team would ask: "But can it actually work?"
We didn't have a satisfying answer. Not yet. I refused to launch Supernaut until we could confidently use it in production ourselves. I'm a firm believer in dogfooding, for better or worse.
Then, in January 2026, we deployed our latest agents. It was supposed to be a quick test to see if Supernaut could handle some of the annoying workload we had piling up. But this time it actually worked, and we ended up shipping our entire backlog in a day.
In February alone, Supernaut shipped 78 merge requests across our production services. Almost entirely hands-off.
The Stripe moment
In February 2026, Stripe revealed that their internal coding agents—they call them "Minions"—were merging over a thousand pull requests per week. Human-reviewed, but with no human-written code. Built on top of their existing CI/CD infrastructure, running in isolated environments, following the same processes their engineers follow.
When I read their blog, I immediately messaged my co-founder: "These people speak my language, it's basically us, almost 1:1."
The industry took notice. Not because the idea was new, but because Stripe isn't a startup chasing hype. They move over a trillion dollars in payments. Their codebase is humongous. When Stripe says "this works in production," the conversation changes.
What real-world data tells us
We rated every merge request on a 1-to-5 difficulty scale:

| Difficulty | Description |
|---|---|
| 1 — Trivial | Config changes, logging additions, small cleanup |
| 2 — Simple | Targeted bug fixes, schema extensions, straightforward feature work |
| 3 — Moderate | Multi-concern changes, refactors, integration work across components |
| 4 — Hard | New architectural components, cross-system plumbing, hard-to-reproduce edge cases |
| 5 — Very Hard | Foundational service refactors, major architectural consolidations |
Supernaut's February output at a glance:

| Metric | Supernaut |
|---|---|
| Total MRs | 78 |
| Avg Difficulty | 2.7 / 5 |
| Est. Credits | ~98,800 |
And broken down by difficulty:

| Difficulty | MRs | % | Est. Credits |
|---|---|---|---|
| 1 — Trivial | 9 | 12% | ~4,200 |
| 2 — Simple | 26 | 33% | ~24,400 |
| 3 — Moderate | 26 | 33% | ~36,600 |
| 4 — Hard | 13 | 17% | ~24,300 |
| 5 — Very Hard | 4 | 5% | ~9,400 |
| Total | 78 | 100% | ~98,800 |
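As a sanity check, the headline figures can be recomputed directly from the table above. A minimal Python sketch, using only the MR counts reported:

```python
# MR counts per difficulty tier (1-5), taken from the breakdown table.
mr_counts = {1: 9, 2: 26, 3: 26, 4: 13, 5: 4}

total_mrs = sum(mr_counts.values())
avg_difficulty = sum(d * n for d, n in mr_counts.items()) / total_mrs
# Share of work at difficulty 3 or below (the "bread-and-butter" band).
easy_share = sum(n for d, n in mr_counts.items() if d <= 3) / total_mrs

print(total_mrs)                  # 78
print(round(avg_difficulty, 1))   # 2.7
print(round(easy_share * 100))    # 78 (percent)
```

The 78% figure is what the text rounds to "nearly 80% at difficulty 3 or below."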
Nearly 80% of the work sits at difficulty-3 or below: bug fixes, integration wiring, event handling, feature scaffolding. Important, but predictable. The bread-and-butter engineering that keeps codebases moving.
Supernaut also took on 13 hard tasks and 4 that qualified as very hard. These required more oversight. Difficulty-4 work lands, but not always cleanly. Difficulty-5 tasks are at the very edge—it got there four times, but these aren't reliable yet, and it needed a lot of guidance.
One critical caveat: Supernaut was bottlenecked by us, not by itself. We ran out of tasks to feed it. With a deeper backlog, it could have closed a thousand MRs just as easily. We're a 2.5-engineer team; there's only so much task-definition and review bandwidth to go around.
(Note: the rework rate—human reviewers asking for MR updates—was 10 out of 78 MRs, or ~13%. That number means nothing in isolation, so we went looking for a baseline.)
What human engineers actually ship
Someone close to us agreed to share their engineering data. Same format: every merge request rated on the same 1-to-5 scale, over a four-week window, same standards.
The data covers two distinct engineer profiles: a mid-level group and a senior-level group. Both use AI-assisted coding tools daily: Claude Code, Cursor, OpenCode, the works. This isn't "engineers without AI" versus Supernaut. It's engineers with every AI tool available versus an autonomous agent.
Mid-level engineer group
Average of 29 merge requests in four weeks. Narrow focus.
| Difficulty | MRs | % | Est. Credits |
|---|---|---|---|
| 1 — Trivial | 1 | 3% | ~500 |
| 2 — Simple | 8 | 28% | ~7,500 |
| 3 — Moderate | 16 | 55% | ~22,500 |
| 4 — Hard | 4 | 14% | ~7,500 |
| 5 — Very Hard | 0 | 0% | 0 |
| Total | 29 | 100% | ~37,900 |
Average difficulty: 2.8 / 5. Over 85% at difficulty 3 or below. Their hardest tasks were legitimate complexity-4 features, but nothing that required holding an entire system in their head. Rework rate: 2 out of 29 MRs, or ~6.9%.
Senior-level engineer group
Average of 55 merge requests across multiple services/repositories. A mix of backend plumbing (event routing, state management, integration wiring) and deep core engine architecture (orchestration logic, inter-service bridges).
This group split their time between coordination work—context switching, unblocking others—and deep architectural thinking. Their MR output doesn't capture the full scope of their contribution, but it represents a blend of lower-difficulty throughput and high-complexity system design.
| Difficulty | MRs | % | Est. Credits |
|---|---|---|---|
| 1 — Trivial | 4 | 7% | ~1,700 |
| 2 — Simple | 17 | 31% | ~15,900 |
| 3 — Moderate | 18 | 33% | ~24,500 |
| 4 — Hard | 12 | 22% | ~22,200 |
| 5 — Very Hard | 4 | 7% | ~9,400 |
| Total | 55 | 100% | ~73,700 |
Average difficulty: 2.9 / 5. The senior group is the only human cohort to land difficulty-5 tasks. That kind of work requires understanding not just the code, but the why behind it. Rework rate: 1 out of 55 MRs, or ~2%. The best quality bar of anyone, human or AI.
The full picture side-by-side
| Metric | Mid Group | Senior Group | Supernaut |
|---|---|---|---|
| Total MRs | 29 | 55 | 78 |
| Avg Difficulty | 2.8 | 2.9 | 2.7 |
| Credit Equiv. | ~37,900 | ~73,700 | ~98,800 |
| Intervention/Rework | ~7% | ~2% | ~13% |
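The intervention/rework percentages in the table follow directly from the raw counts reported in each section. A quick sketch:

```python
# Rework rate = MRs sent back by reviewers / total MRs shipped.
# Counts come from the per-group sections above.
groups = {
    "mid":       {"reworked": 2,  "total": 29},
    "senior":    {"reworked": 1,  "total": 55},
    "supernaut": {"reworked": 10, "total": 78},
}

for name, g in groups.items():
    rate = g["reworked"] / g["total"] * 100
    print(f"{name}: {rate:.1f}%")
# mid: 6.9%, senior: 1.8%, supernaut: 12.8%
```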
The real cost
Now let's put dollar signs on this.
At Supernaut's current pricing, 100,000 credits maps to $500. That means Supernaut's entire four-week output—78 merge requests—cost roughly $494.
| Difficulty | MRs | Est. Cost |
|---|---|---|
| 1 — Trivial | 9 | ~$21 |
| 2 — Simple | 26 | ~$122 |
| 3 — Moderate | 26 | ~$183 |
| 4 — Hard | 13 | ~$122 |
| 5 — Very Hard | 4 | ~$47 |
| Total | 78 | ~$494 |
That's roughly $6.33 per merge request. A difficulty-2 bug fix costs about $4.70. A moderate integration task, about $7.00.
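These dollar figures follow mechanically from the stated rate of $500 per 100,000 credits:

```python
# Stated pricing: 100,000 credits = $500, i.e. $0.005 per credit.
USD_PER_CREDIT = 500 / 100_000

total_credits = 98_800
total_mrs = 78
total_cost = total_credits * USD_PER_CREDIT

print(round(total_cost))                 # 494 dollars for the month
print(round(total_cost / total_mrs, 2))  # 6.33 dollars per MR

# Difficulty-2 tier: ~24,400 credits across 26 MRs.
print(round(24_400 * USD_PER_CREDIT / 26, 2))  # 4.69 dollars per bug fix
```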
The difficulty 1–3 range is where this gets almost absurd. Those 61 merge requests cost us roughly $326 total. That's the work that fills most engineering sprints, the kind your team spends entire weeks on. For less than the cost of a decent team lunch.
That's insane.
The inconvenient truth
The vast majority of normal engineering work is, frankly, menial.
Not useless. Not unskilled. It all needs to happen. But across two engineer profiles using every AI tool available, the average difficulty was 2.8 to 2.9 on a five-point scale. The mid group shipped over 85% of its merge requests at difficulty 3 or below. Even the senior group landed 38% of its MRs at difficulty 2 or lower.
That's what your engineering P&L is paying for. The plumbing. And when an autonomous agent can do it at a fraction of the cost and at a comparable quality bar—the economics of human-powered plumbing stop making sense.
What this means for engineers
"AI is replacing engineers" is a lazy take. It misses what's actually happening.
AI is replacing the tasks that most engineers spend their time on. That's a different statement, and it leads to a different conclusion.
The engineers who thrive won't be the ones who grind through the most tickets. They'll be the ones who do the things AI can't:
- Architectural thinking. Holding six interacting systems in your head and making judgment calls about which abstractions will survive the next quarter. Only the senior group consistently operated at difficulty 5. That kind of reasoning is human territory, and it's the most valuable engineering work that exists.
- Product judgment. Talking to customers. Defining what to build next. Understanding why a feature matters, not just how to implement it. Knowing what not to build. Supernaut can ship the MR, but it can't reliably decide whether the MR should exist.
- Closing the loop. The best engineering teams don't just write code. They shape the vision, define iterations based on real user feedback, and make the calls that turn a codebase into a product. No agent does that right now.
The future isn't engineers versus agents. It's engineers with agents. Your mid engineers stop burning months on integration plumbing and start leveling up on harder problems. Your senior engineers stop drowning in difficulty-2 bug fixes and context switching and start designing systems.
The transition will be messy
I'd be lying if I said this shift will be smooth.
There will be teams that over-correct—cut too many engineers too fast, hand Supernaut (or tools like it) tasks it can't handle, and watch quality crater. There will be engineers who resist the change entirely, convinced that AI can't touch "real" engineering, right up until it handles 78 merge requests in a month at their quality bar.
There will be uncomfortable conversations about what a mid-level engineering role even means when the difficulty 2s and 3s can be done autonomously for a few dollars each.
These conversations are coming. Stripe's Minions are merging a thousand PRs a week. We shipped 78 production MRs in our first full month—and ran out of work before Supernaut ran out of capacity. The trend line isn't subtle.
Where we go from here
At Supernaut, we've been working toward this moment since last summer. Through the skepticism, the technical dead ends, the months where nothing worked quite right. We believe autonomous end-to-end engineering is the future, and that future is no longer hypothetical. The inflection point is here.
The question is: how long will you keep feeding your best engineers to the backlog?