The Real Cost of Routine Engineering Tasks
For over two years, the tech world has been shouting that AI is coming for software engineering. Autonomous coding agents. End-to-end pull requests. Engineers replaced by bots.
And yet, most engineering teams are still grinding through tickets, writing the same CRUD endpoints, and fixing the same event-routing bugs. Individual engineers may be more efficient, but there's been no 10x productivity explosion at the organizational level.
This, I believe, is all about to change.
What we believed before anyone cared
Last summer, in mid-2025, we started building Supernaut around a conviction most people thought was premature: fully autonomous engineering agents that work end-to-end, asynchronously. That means taking a task, writing the code, opening the merge request, and waiting for a human to review. No hand-holding, no context switching, no mental load on the engineers.
The tech wasn't fully there yet. The skepticism was constant. Even people on our own team would ask: "But can it actually work?"
We didn't have a satisfying answer. Not yet. I refused to launch Supernaut until we could confidently use it in production ourselves. I'm a firm believer in dogfooding, for better or worse.
Then, in January 2026, we deployed our latest agents. It was supposed to be a quick test to see if Supernaut could handle some of the annoying workload we had piling up. But this time it actually worked, and we ended up shipping our entire backlog in a day.
In February alone, Supernaut shipped 78 merge requests across our production services. Almost entirely hands-off.
The Stripe moment
In February 2026, Stripe revealed that their internal coding agents—they call them "Minions"—were merging over a thousand pull requests per week. Human-reviewed, but with no human-written code. Built on top of their existing CI/CD infrastructure, running in isolated environments, following the same processes their engineers follow.
When I read their blog, I immediately messaged my co-founder: "These people speak my language, it's basically us, almost 1:1."
The industry took notice. Not because the idea was new, but because Stripe isn't a startup chasing hype. They move over a trillion dollars in payments. Their codebase is humongous. When Stripe says "this works in production," the conversation changes.
What real-world data tells us
We rated every merge request on a 1-to-5 difficulty scale:

| Difficulty | Description |
|---|---|
| 1 — Trivial | Config changes, logging additions, small cleanup |
| 2 — Simple | Targeted bug fixes, schema extensions, straightforward feature work |
| 3 — Moderate | Multi-concern changes, refactors, integration work across components |
| 4 — Hard | New architectural components, cross-system plumbing, hard-to-reproduce edge cases |
| 5 — Very Hard | Foundational service refactors, major architectural consolidations |
Supernaut's February output at a glance:

| Metric | Supernaut |
|---|---|
| Total MRs | 78 |
| Avg Difficulty | 2.7 / 5 |
| Est. Credits | ~98,800 |
And broken down by difficulty:

| Difficulty | MRs | % | Est. Credits |
|---|---|---|---|
| 1 — Trivial | 9 | 12% | ~4,200 |
| 2 — Simple | 26 | 33% | ~24,400 |
| 3 — Moderate | 26 | 33% | ~36,600 |
| 4 — Hard | 13 | 17% | ~24,300 |
| 5 — Very Hard | 4 | 5% | ~9,400 |
| Total | 78 | 100% | ~98,800 |
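As a sanity check, the headline figures can be recomputed directly from the table above. A minimal Python sketch, using only the MR counts reported:

```python
# MR counts per difficulty tier (1-5), taken from the breakdown table.
mr_counts = {1: 9, 2: 26, 3: 26, 4: 13, 5: 4}

total_mrs = sum(mr_counts.values())
avg_difficulty = sum(d * n for d, n in mr_counts.items()) / total_mrs
# Share of work at difficulty 3 or below (the "bread-and-butter" band).
easy_share = sum(n for d, n in mr_counts.items() if d <= 3) / total_mrs

print(total_mrs)                  # 78
print(round(avg_difficulty, 1))   # 2.7
print(round(easy_share * 100))    # 78 (percent)
```

The 78% figure is what the text rounds to "nearly 80% at difficulty 3 or below."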
Nearly 80% of the work sits at difficulty-3 or below: bug fixes, integration wiring, event handling, feature scaffolding. Important, but predictable. The bread-and-butter engineering that keeps codebases moving.
Supernaut also took on 13 hard tasks and 4 that qualified as very hard. These required more oversight. Difficulty-4 work lands, but not always cleanly. Difficulty-5 tasks are at the very edge—it got there four times, but these aren't reliable yet, and it needed a lot of guidance.
One critical caveat: Supernaut was bottlenecked by us, not by itself. We ran out of tasks to feed it. With a deeper backlog, it could have closed a thousand MRs just as easily. We're a 2.5-engineer team; there's only so much task-definition and review bandwidth to go around.
(Note: the rework rate—human reviewers asking for MR updates—was 10 out of 78 MRs, or ~13%. That number means nothing in isolation, so we went looking for a baseline.)
What human engineers actually ship
Someone close to us agreed to share their engineering data. Same format: every merge request rated on the same 1-to-5 scale, over a four-week window, same standards.
The data covers two distinct engineer profiles: a mid-level group and a senior-level group. Both use AI-assisted coding tools daily: Claude Code, Cursor, OpenCode, the works. This isn't "engineers without AI" versus Supernaut. It's engineers with every AI tool available versus an autonomous agent.
Mid-level engineer group
Average of 29 merge requests in four weeks. Narrow focus.
| Difficulty | MRs | % | Est. Credits |
|---|---|---|---|
| 1 — Trivial | 1 | 3% | ~500 |
| 2 — Simple | 8 | 28% | ~7,500 |
| 3 — Moderate | 16 | 55% | ~22,500 |
| 4 — Hard | 4 | 14% | ~7,500 |
| 5 — Very Hard | 0 | 0% | 0 |
| Total | 29 | 100% | ~37,900 |
Average difficulty: 2.8 / 5. Over 85% at difficulty 3 or below. Their hardest tasks were legitimate complexity-4 features, but nothing that required holding an entire system in their head. Rework rate: 2 out of 29 MRs, or ~6.9%.
Senior-level engineer group
Average of 55 merge requests across multiple services/repositories. A mix of backend plumbing (event routing, state management, integration wiring) and deep core engine architecture (orchestration logic, inter-service bridges).
This group split their time between coordination work—context switching, unblocking others—and deep architectural thinking. Their MR output doesn't capture the full scope of their contribution, but it represents a blend of lower-difficulty throughput and high-complexity system design.
| Difficulty | MRs | % | Est. Credits |
|---|---|---|---|
| 1 — Trivial | 4 | 7% | ~1,700 |
| 2 — Simple | 17 | 31% | ~15,900 |
| 3 — Moderate | 18 | 33% | ~24,500 |
| 4 — Hard | 12 | 22% | ~22,200 |
| 5 — Very Hard | 4 | 7% | ~9,400 |
| Total | 55 | 100% | ~73,700 |
Average difficulty: 2.9 / 5. The senior group is the only human cohort to land difficulty-5 tasks. That kind of work requires understanding not just the code, but the why behind it. Rework rate: 1 out of 55 MRs, or ~2%. The best quality bar of anyone, human or AI.
The full picture side-by-side
| Metric | Mid Group | Senior Group | Supernaut |
|---|---|---|---|
| Total MRs | 29 | 55 | 78 |
| Avg Difficulty | 2.8 | 2.9 | 2.7 |
| Credit Equiv. | ~37,900 | ~73,700 | ~98,800 |
| Intervention/Rework | ~7% | ~2% | ~13% |
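The intervention/rework percentages in the table follow directly from the raw counts reported in each section. A quick sketch:

```python
# Rework rate = MRs sent back by reviewers / total MRs shipped.
# Counts come from the per-group sections above.
groups = {
    "mid":       {"reworked": 2,  "total": 29},
    "senior":    {"reworked": 1,  "total": 55},
    "supernaut": {"reworked": 10, "total": 78},
}

for name, g in groups.items():
    rate = g["reworked"] / g["total"] * 100
    print(f"{name}: {rate:.1f}%")
# mid: 6.9%, senior: 1.8%, supernaut: 12.8%
```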
The real cost
Now let's put dollar signs on this.
At Supernaut's current pricing, 100,000 credits maps to $500. That means Supernaut's entire four-week output—78 merge requests—cost roughly $494.
| Difficulty | MRs | Est. Cost |
|---|---|---|
| 1 — Trivial | 9 | ~$21 |
| 2 — Simple | 26 | ~$122 |
| 3 — Moderate | 26 | ~$183 |
| 4 — Hard | 13 | ~$122 |
| 5 — Very Hard | 4 | ~$47 |
| Total | 78 | ~$494 |
That's roughly $6.33 per merge request. A difficulty-2 bug fix costs about $4.70. A moderate integration task, about $7.00.
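These dollar figures follow mechanically from the stated rate of $500 per 100,000 credits:

```python
# Stated pricing: 100,000 credits = $500, i.e. $0.005 per credit.
USD_PER_CREDIT = 500 / 100_000

total_credits = 98_800
total_mrs = 78
total_cost = total_credits * USD_PER_CREDIT

print(round(total_cost))                 # 494 dollars for the month
print(round(total_cost / total_mrs, 2))  # 6.33 dollars per MR

# Difficulty-2 tier: ~24,400 credits across 26 MRs.
print(round(24_400 * USD_PER_CREDIT / 26, 2))  # 4.69 dollars per bug fix
```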
The difficulty 1–3 range is where this gets almost absurd. Those 61 merge requests cost us roughly $326 total. That's the work that fills most engineering sprints, the kind your team spends entire weeks on. For less than the cost of a decent team lunch.
That's insane.
The inconvenient truth
The vast majority of normal engineering work is, frankly, menial.
Not useless. Not unskilled. It all needs to happen. But across two engineer profiles using every AI tool available, the average difficulty was 2.8 to 2.9 on a five-point scale. The mid group shipped over 85% of its merge requests at difficulty 3 or below. Even the senior group landed 38% of its MRs at difficulty 2 or lower.
That's what your engineering P&L is paying for. The plumbing. And when an autonomous agent can do it at a fraction of the cost and at a comparable quality bar—the economics of human-powered plumbing stop making sense.
What this means for engineers
"AI is replacing engineers" is a lazy take. It misses what's actually happening.
AI is replacing the tasks that most engineers spend their time on. That's a different statement, and it leads to a different conclusion.
The engineers who thrive won't be the ones who grind through the most tickets. They'll be the ones who do the things AI can't:
- Architectural thinking. Holding six interacting systems in your head and making judgment calls about which abstractions will survive the next quarter. Only the senior group consistently operated at difficulty 5. That kind of reasoning is human territory, and it's the most valuable engineering work that exists.
- Product judgment. Talking to customers. Defining what to build next. Understanding why a feature matters, not just how to implement it. Knowing what not to build. Supernaut can ship the MR, but it can't reliably decide whether the MR should exist.
- Closing the loop. The best engineering teams don't just write code. They shape the vision, define iterations based on real user feedback, and make the calls that turn a codebase into a product. No agent does that right now.
The future isn't engineers versus agents. It's engineers with agents. Your mid engineers stop burning months on integration plumbing and start leveling up on harder problems. Your senior engineers stop drowning in difficulty-2 bug fixes and context switching and start designing systems.
The transition will be messy
I'd be lying if I said this shift will be smooth.
There will be teams that over-correct—cut too many engineers too fast, hand Supernaut (or tools like it) tasks it can't handle, and watch quality crater. There will be engineers who resist the change entirely, convinced that AI can't touch "real" engineering, right up until it handles 78 merge requests in a month at their quality bar.
There will be uncomfortable conversations about what a mid-level engineering role even means when the difficulty 2s and 3s can be done autonomously for a few dollars each.
These conversations are coming. Stripe's Minions are merging a thousand PRs a week. We shipped 78 production MRs in our first full month—and ran out of work before Supernaut ran out of capacity. The trend line isn't subtle.
Where we go from here
At Supernaut, we've been working toward this moment since last summer. Through the skepticism, the technical dead ends, the months where nothing worked quite right. We believe autonomous end-to-end engineering is the future, and that future is no longer hypothetical. The inflection point is here.
The question is: how long will you keep feeding your best engineers to the backlog?