What is Claude Code? It Refactored 1,847 Lines I Feared

By Lazar Milicevic · Published June 30, 2026 · 8 min read

You have a module somewhere that prints money and terrifies you. You wrote it at 2am, the tests are thin, the function names lie, and every new feature has to thread through it. That's the exact code I gave Claude Code last Saturday — and the unglamorous job is the one worth talking about.

What Claude Code actually is, in one paragraph

Claude Code is an agentic coding tool that runs in your terminal. You point it at a repo, it reads the whole codebase, edits files in place, runs commands, reads the errors those commands produce, and decides what to do next. It is not a chat window. It is not a copilot autocompleting your next line. The closest analogy: a junior engineer who never sleeps, never gets bored, reads faster than you, and will happily run pytest four hundred times in a row. You install it once, run claude in your repo root, and you talk to it the way you'd brief a contractor — domain, constraints, definition of done.

That's the whole pitch. The interesting question is what you point it at.

The real job: 9 files, 23 templates, two years of fear

I run an invoicing platform for small businesses in the US. Inside it lives a PDF generation module — 9 files, 23 invoice templates, multi-language support including Cyrillic edge cases (some clients invoice EU partners), tax logic baked into render functions. Average cyclomatic complexity per function: 14. In English: too many branches, too many if statements nested inside other if statements, nobody — including me — wanted to open those files.

That's the code you don't touch. That's also the code that quietly kills your velocity, because every new feature has to thread through it. Every PDF tweak meant two days of debugging Cyrillic escape sequences I'd already debugged eight months earlier.

Here's the actual numbers from one Saturday morning, not estimates:

Metric	Before	After
Files	9	9
Lines rewritten	—	1,847
Avg. function complexity	14	4
Templates preserved	23	23 (pixel-identical)
Git commits	—	11
Wall-clock time	—	38 min
Token cost	—	~$4.00
My hands on keyboard	—	~6 min total

The templates still render byte-for-byte identical to production. I diffed the PDFs.

The setup: three things I gave it, then I made coffee

I opened a fresh tmux pane, ran claude in the repo root, and gave it exactly three inputs. Not ten. Three.

1. A CLAUDE.md at the project root. This is the context file the agent reads first. It is the single most important thing in this entire post. Mine looked roughly like this:

# PDF Module Context

## Domain
This is an invoicing platform for US small businesses.
An "invoice" has: line items, tax (per-state US sales tax + EU VAT
for export clients), totals, payment terms, issuer + recipient.

## Templates
23 templates live in /pdf/templates/. They inherit from BaseTemplate.
All 23 must render pixel-identical after any change.
Cyrillic strings appear in 4 of them — escape via \u, never raw bytes.

## Sacred behaviors
- Tax rounding is half-up to 2 decimals. Do NOT change.
- Currency is always USD unless `invoice.currency` is set.
- Template IDs are stable; never rename.

## Commands
- Tests: `pytest tests/pdf/ -x`
- Lint: `ruff check pdf/`
- Render diff: `python scripts/render_all.py --diff`

## Rules for you
- Refactor for readability. Do not add features.
- Run the test suite after every file.
- If any test fails, stop and ask.
- Commit after each green file with a clear message.

2. A single prompt. Not a wall of instructions. Just:

Refactor the PDF module for readability. Preserve all 23 template behaviors. Run the test suite after each file. Stop and ask if any test fails.

3. A clean git branch. git checkout -b refactor/pdf-claude. If it went sideways, git branch -D and I'm back to where I started.

Then I made coffee.

What it did while I wasn't watching

Read all 9 files and mapped the dependency graph (which template inherits from which base, where tax logic actually lives, which utility is called from where).
Refactored leaf files first — the ones nothing else depends on — so a failure couldn't cascade.
Ran pytest tests/pdf/ -x after each file.
On a green pass: committed with a descriptive message.
On a red pass (one Cyrillic escape failure in template_invoice_eu.py): read the failure output, rolled back that specific change, tried a different approach using \u escaping instead of a raw string, ran tests again, green, committed.

38 minutes. Eleven commits. I reviewed the diff in about 20 minutes. Merged.

The four guardrails that kept it from blowing up production

This is the part most explainers skip. An agent that edits your code in place is dangerous if you don't fence it in. Here are the four guardrails I use on every client repo, no exceptions:

The non-negotiables

Clean git branch. Never run an agent on main. If the run goes sideways, git reset --hard or delete the branch. This single rule has saved me more times than I can count.
A working test suite as the oracle. The agent only knows it succeeded if pytest (or your equivalent) tells it so. No tests = the agent is guessing whether it broke something. If you have zero tests, ask the agent to write characterization tests first, in a separate run, and you review them before the refactor.
A CLAUDE.md that names the sacred behaviors. "Don't change tax rounding" is more useful than 500 lines of architectural prose. Be specific about what must not move.
Scoped tasks, not "fix the codebase." One module, one outcome, one branch. A 38-minute run with 11 commits is reviewable. A 6-hour run touching 80 files is not.

If any one of those four is missing, I don't run the agent. I write the missing one first.

The mindset shift: you're reviewing PRs now, not typing

This is what people miss when they first try Claude Code. You are not writing code anymore. You are reviewing pull requests from an agent that types faster than you and has read more of the codebase than you remember.

The skill that matters has shifted. It is no longer:

How fast can you type?
How many frameworks do you know by heart?
Can you write a regex from memory?

The skill that matters now is:

How clearly can you write a CLAUDE.md?
How carefully do you read a diff?
Do you know which behaviors in your system are sacred and which are flexible?

Writing good context is the new senior engineering skill. Vague context → the agent guesses → you get plausible-looking code that subtly breaks an edge case six weeks later. Specific context → the agent executes → you ship a refactor on a Saturday morning that you'd been postponing for two years.

This isn't a hypothetical. Anthropic published engineering guidance on agentic coding that says basically the same thing in their own words: the quality of the context file determines the quality of the output. Believe them on this one.

Honest limits — what Claude Code still gets wrong

I run this on three client repos daily. I'm not selling you anything. Here's what still breaks:

It hallucinates library APIs. Not often, but often enough. It'll call pandas.read_csv_with_schema() which doesn't exist. You only catch it because the test fails. Without tests, you ship the bug.
It struggles with deeply implicit context. If a behavior is "obvious" because of how your team has historically used a function — but it's nowhere in the code or docs — the agent can't see it. Write it into CLAUDE.md.
Token cost is real on large tasks. $4 for a 1,847-line refactor is fine. A 20,000-line monorepo sweep is $40-80. Not enormous, not free. Budget for it.
It's not a replacement for understanding your system. If you don't know what the code is supposed to do, neither does the agent. It will produce confident, well-tested code that does the wrong thing.
Large context windows degrade. Past a certain size, performance drops. Break work into modules.

If someone tells you it just works, they're either selling something or haven't shipped with it.

The second-order effect for small teams

This is where it gets interesting for solo founders and 2-5 person teams. I run three tmux panes in parallel:

# Pane 1
cd ~/repos/client-a && claude
> refactor the billing module per CLAUDE.md

# Pane 2
cd ~/repos/client-b && claude
> add characterization tests for the auth flow

# Pane 3
cd ~/repos/client-c && claude
> draft the Postgres -> Postgres 16 migration script

I'm the reviewer for all three. That's not a future scenario, that's how the work shipped this week. A two-person team can now do what a six-person team did two years ago — if and only if you invest in context files and guardrails instead of in typing faster.

The bottleneck shifted. It used to be: how much code can you write? Now it's: how many diffs can you carefully review per day? My honest ceiling is about 1,200 lines of agent-generated code reviewed per day before I get sloppy. That's the new constraint.

Where bizflowai.io fits in

We use Claude Code internally on every client repo we touch at bizflowai.io — refactors, test scaffolding, migration drafts, the unsexy infrastructure work that nobody wants to quote a fixed price for. When small business owners ask us to automate an operations workflow, half the job is often cleaning up the legacy module that the workflow has to talk to. That's where agentic coding earns its keep, and where we apply the same four guardrails on client code that I described above. No mystery, no proprietary magic — just the discipline of context files, test oracles, scoped tasks, and clean branches.

The takeaway

Claude Code isn't a better autocomplete. It's a different shape of work. If you have a module you've been afraid to touch for a year, the bottleneck isn't your typing speed and it isn't the agent's capability. It's the 200 lines of CLAUDE.md you haven't written yet. Spend an hour on that file. Then go make coffee.

Want more like this?

I publish practical AI automation, GenAI engineering, and faceless content workflows on YouTube every week.

Subscribe to bizflowai.io on YouTube — never miss a new tutorial.

Planning an AI automation project or need a second opinion on your architecture?

Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.

Visit bizflowai.io for our services, case studies, and AI consulting.

Frequently asked questions

What is Claude Code?

Claude Code is an agentic coding tool that runs in your terminal, not a chat window. You point it at a repository and it reads the entire codebase, edits files in place, runs commands, reads the errors those commands produce, and decides what to do next. It operates autonomously, functioning like a junior engineer that can work through multi-file tasks without constant supervision.

How do I use Claude Code to refactor a legacy module?

Set up three things before running it. First, create a CLAUDE.md file at the project root explaining the domain, sacred behaviors, and which test commands to run. Second, give it a single clear prompt with constraints, like preserve all behaviors and run tests after each file. Third, work on a clean git branch so you can discard the run if it goes wrong.

Why does the CLAUDE.md context file matter?

The CLAUDE.md file is what determines whether the agent executes correctly or guesses. If your context file is vague about the domain, constraints, and sacred behaviors, the agent will hallucinate decisions. If it specifies exactly what the system does and which behaviors must be preserved, the agent executes reliably. Writing good context files is described as the new senior engineering skill, replacing typing speed.

What are the limits of Claude Code?

Claude Code still hallucinates library APIs sometimes, calling functions that don't exist, which you only catch when tests fail. It requires guardrails: a clean git branch, a working test suite, and tasks small enough that one failure won't poison the whole run. It also costs tokens, which adds up on large refactors, and it cannot replace understanding your own system.

How does Claude Code change a developer's workflow?

You stop writing code directly and instead review pull requests from an agent. The valuable skills become writing clear context files and carefully reading diffs, not typing speed. A second-order effect for small teams is parallelism: a single developer can run Claude Code across multiple repositories in separate tmux panes simultaneously, acting as the reviewer for refactors, test writing, and migrations happening at the same time.