HN Developers Are Using LLMs Wrong for Coding. Here's What

By Lazar Milicevic · Published July 5, 2026 · 8 min read

A developer asked Hacker News how people actually use LLMs for coding beyond autocomplete. 181 upvotes later, the answers reveal a clear split: people who paste vague prompts into a blank chat and people who build structured systems around the model. The first group ships broken code. The second group ships working software.

I build AI-assisted systems for small businesses — invoicing automation, CRM integrations, lead pipelines. The gap between toy usage and production usage is massive, and it has almost nothing to do with which model you pick.

The Chat Window Is a Prototype, Not a Production System

The Hacker News thread exposes the core mistake most developers and founders make: treating an LLM chat window as a coding environment. When someone says "we tried ChatGPT for our workflow and it didn't work," the failure isn't the model. The failure is the integration design.

A blank chat has no file access, no structured output enforcement, no iteration loop, and no guardrails. You're asking a language model to hold your entire codebase in context through a text box. That's not a workflow — it's a party trick.

Here's what the production version looks like instead:

import anthropic
import openai

# What "didn't work" looks like
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "fix the invoicing bug"}]
)
# No files. No schema. No retry. No context. 40% accuracy.

# What actually works
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,  # required by the Messages API
    system=SYSTEM_PROMPT_WITH_RULES,
    # Each entry is a tool definition dict: name, description, input_schema
    tools=[read_file, run_test, write_file, search_code],
    messages=structured_input_with_codebase_context
)
# File access. Tool use. Retry on failure. 90%+ accuracy.

The accuracy jump from ~40% to consistently above 90% isn't from switching models. It's from switching architectures. You give the model the same tools a human developer needs: the ability to read files, run tests, see errors, and iterate.

Agentic Access Beats Copy-Paste Every Time

The top answers in the Hacker News thread fall into a few camps, but the developers shipping real work share one thing in common: they give the model direct repository access through agentic tools like Claude Code or Cursor.

This works because the model can:

Read the actual codebase — not a summary you pasted, but the real files with real imports and real dependencies
Run commands — execute tests, lint, compile, and see the actual error output
Iterate — make a change, see it fail, fix the failure, and try again without you in the loop
Narrow its scope — work on one function or one file with full surrounding context
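To make the read, run, iterate cycle concrete, here is a minimal sketch of the loop an agentic tool runs on your behalf. Everything in it is an assumption for illustration: `run_agent_loop`, the action format, and the scripted stand-in for the model are hypothetical and do not reflect how Claude Code or Cursor are implemented.

```python
from typing import Callable

# Hypothetical names throughout: `run_agent_loop`, `model_step`, and the
# tool registry are illustrative, not part of any real SDK.
def run_agent_loop(
    model_step: Callable[[list[dict]], dict],
    tools: dict[str, Callable[..., str]],
    task: str,
    max_steps: int = 10,
) -> str:
    """Feed tool results back to the model until it declares it is done."""
    history: list[dict] = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model_step(history)
        if "done" in action:
            return action["done"]
        tool = tools.get(action["tool"])
        if tool is None:
            result = f"error: unknown tool {action['tool']!r}"
        else:
            try:
                result = tool(**action.get("args", {}))
            except Exception as exc:  # errors go back to the model, not the user
                result = f"error: {exc}"
        history.append({"role": "tool", "name": action["tool"], "content": result})
    raise RuntimeError(f"no result after {max_steps} steps")


# A scripted stand-in for the model so the loop runs without an API key.
def scripted_model(history: list[dict]) -> dict:
    tool_turns = [m for m in history if m["role"] == "tool"]
    if not tool_turns:
        return {"tool": "read_file", "args": {"path": "app.py"}}
    if len(tool_turns) == 1:
        return {"tool": "run_test", "args": {}}
    return {"done": f"tests reported: {tool_turns[-1]['content']}"}


fake_tools = {
    "read_file": lambda path: f"<contents of {path}>",
    "run_test": lambda: "3 passed",
}
summary = run_agent_loop(scripted_model, fake_tools, "fix the bug in app.py")
print(summary)
```

The part worth noticing is that tool errors are appended to the history instead of raised, which is what lets the model see a failure and try again. The `max_steps` cap is the guardrail that keeps a confused run from looping forever.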

When I use Claude Code on client projects, I'm not typing prompts into a void. I'm giving it a specific task inside a specific repository:

# Claude Code in a real project — narrow, scoped, verifiable
claude "The Stripe webhook handler in /api/webhooks/stripe.py is 
missing idempotency checks. Read the file, check how we handle 
duplicate event IDs, add a guard clause that returns 200 early 
if we've already processed the event. Run the test suite after."

That prompt works because the model can read the file, understand the existing patterns, write code that matches them, and verify the result by running tests. A chat window can't do any of that.
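For readers who have not written one, the guard clause that prompt asks for might look roughly like the sketch below. It is not the client's code: `handle_webhook_event`, `apply_event`, and the in-memory `processed_event_ids` set are hypothetical, and a real handler would keep processed IDs in a durable store such as a database table with a unique constraint.

```python
# Hypothetical handler: `processed_event_ids` stands in for a durable store.
processed_event_ids: set[str] = set()


def apply_event(event: dict) -> None:
    """Placeholder for the real side effects (mark the invoice paid, etc.)."""


def handle_webhook_event(event: dict) -> tuple[int, str]:
    event_id = event.get("id")
    if not event_id:
        return 400, "missing event id"
    # Guard clause: a redelivered event is acknowledged without being re-applied.
    if event_id in processed_event_ids:
        return 200, "duplicate ignored"
    apply_event(event)
    processed_event_ids.add(event_id)
    return 200, "processed"


first = handle_webhook_event({"id": "evt_123", "type": "invoice.paid"})
second = handle_webhook_event({"id": "evt_123", "type": "invoice.paid"})
print(first, second)
```

Returning 200 for the duplicate matters: the sender treats a non-2xx response as a failed delivery and retries, so rejecting duplicates with an error would only generate more of them.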

The Narrow Pipeline: When You Can't Give Full Repo Access

Not every use case allows full repository access. For client work — invoicing, CRM, lead pipelines — you often need a narrower pipeline where the LLM handles one specific decision per call with tightly controlled input.

This is the second workflow the Hacker News thread touches on: pre-processing your code into structured chunks, defining the output format explicitly, and constraining the model to a single decision.
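As one possible reading of "structured chunks" for a Python codebase, the sketch below splits a module into one chunk per top-level function or class using the standard `ast` module. The chunk fields and the `billing/invoice.py` path are assumptions made for the example; the thread does not prescribe a particular chunking scheme.

```python
import ast


def chunk_python_source(source: str, path: str) -> list[dict]:
    """Split a module into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "path": path,
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "code": ast.get_source_segment(source, node),
            })
    return chunks


example = '''import os

def load(path):
    return open(path).read()

class Invoice:
    total = 0
'''
chunks = chunk_python_source(example, "billing/invoice.py")
for chunk in chunks:
    print(chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk carries its path and line range, so a model answer that refers to a chunk can be traced back to the exact place in the file.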

Here's a real example from an invoicing classification system:

# Narrow pipeline: one decision, structured input, enforced output
EXTRACTION_PROMPT = """
You are an invoice data extractor. Given the raw text from an 
invoice PDF, extract these fields:
- vendor_name (string)
- invoice_number (string)
- total_amount (float, USD)
- due_date (YYYY-MM-DD)
- line_items (array of {description, quantity, unit_price})

Output valid JSON only. No commentary.
"""

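response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2000,
    system=EXTRACTION_PROMPT,  # the prompt above has to be sent to take effect
    # A custom tool with a JSON schema, forced via tool_choice, constrains the
    # output shape. INVOICE_JSON_SCHEMA is assumed to be a JSON Schema dict
    # mirroring the fields listed in the prompt.
    tools=[{
        "name": "extract_invoice",
        "description": "Record the fields extracted from one invoice.",
        "input_schema": INVOICE_JSON_SCHEMA,
    }],
    tool_choice={"type": "tool", "name": "extract_invoice"},
    messages=[{
        "role": "user",
        "content": f"Raw invoice text:\n{ocr_output}"
    }]
)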

# Validate against schema before accepting
parsed = validate_against_schema(response, InvoiceSchema)
if not parsed.valid:
    log_failure(parsed.errors)
    route_to_human_review()

The difference between this and a vague chat prompt is structural:

Dimension	Vague Chat	Narrow Pipeline
Input	Whatever you paste	Pre-processed, structured
Output	Free text	JSON schema enforced
Scope	"Fix the bug"	One decision per call
Failure mode	Silent wrong answer	Schema validation + human review
Accuracy	~40%	90%+

Both approaches, agentic tools and narrow pipelines, solve the same root problem: they remove the vague prompt that causes most LLM coding attempts to fail.
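The validation step in the invoicing example is where "schema enforced" stops being a slogan. Below is a minimal standard-library sketch of what such a check could do. The field names come from the extraction prompt above, but `validate_invoice` and its error messages are hypothetical stand-ins, not the actual `validate_against_schema` helper.

```python
import json
import re
from datetime import date

# Field names follow the extraction prompt; the expected types are assumptions.
REQUIRED_FIELDS = {
    "vendor_name": str,
    "invoice_number": str,
    "total_amount": (int, float),
    "due_date": str,
    "line_items": list,
}
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate_invoice(raw_json: str) -> list[str]:
    """Return a list of problems; an empty list means the payload is acceptable."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif isinstance(data[field], bool) or not isinstance(data[field], expected):
            errors.append(f"wrong type for {field}")
    due = data.get("due_date")
    if isinstance(due, str):
        try:
            if not ISO_DATE.fullmatch(due):
                raise ValueError(due)
            date.fromisoformat(due)  # also rejects dates like 2026-02-30
        except ValueError:
            errors.append("due_date is not a valid YYYY-MM-DD date")
    return errors


good = ('{"vendor_name": "Acme", "invoice_number": "INV-1", '
        '"total_amount": 120.5, "due_date": "2026-08-01", "line_items": []}')
bad = ('{"vendor_name": "Acme", "total_amount": "120.5", '
       '"due_date": "08/01/2026", "line_items": []}')
print(validate_invoice(good))
print(validate_invoice(bad))
```

Collecting every problem instead of stopping at the first one gives the human reviewer the full picture in a single pass, and gives the failure log something to group by.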

The Wrong Question People Keep Asking

The Hacker News thread's question — "how do you use LLMs for coding beyond autocomplete?" — is slightly wrong. People keep searching for the universal prompt, the one setup that works for everything. There isn't one.

The developers winning in that thread stopped searching for a universal answer and started building purpose-specific setups. The right question is: what does my specific workflow need?

That depends on three things:

What you're building — A SaaS codebase needs agentic file access. An invoice classifier needs a narrow extraction pipeline. A bug-fix workflow needs test execution.
What your codebase looks like — A 200k-line monorepo needs chunking and search. A 500-line prototype needs full-context reads.
What failure costs your business — A wrong line item on an invoice costs money. A wrong function name in a prototype costs five minutes.

When I build systems for clients at bizflowai.io, I don't start with a model choice. I start with the failure mode and work backward to the architecture that prevents it.

Code Review and Debugging: The Underrated Use Case

One of the most interesting patterns in the Hacker News thread is developers using LLMs for code review and debugging rather than writing new code from scratch. This is where LLMs consistently deliver high value with lower risk.

The reason is simple: reviewing existing code is a constrained problem. The code already exists, the context is finite, and the feedback loop (does the suggested fix actually work?) is immediate. Writing a new feature from scratch is an open-ended problem where the model has to make architectural decisions it isn't equipped to make.

# High-value, low-risk: LLM as code reviewer
claude "Read /api/handlers/invoice_create.py. Check for: 
1. Missing input validation
2. SQL injection vectors  
3. Race conditions in the payment creation block
4. Any place where a None value could reach the database layer
Report findings as a list with file:line references. Don't 
rewrite the file — just identify the issues."

This works because:

The scope is bounded (one file, specific vulnerability classes)
The model doesn't need to make creative decisions
The human reviews every finding before any code changes
False positives cost seconds; true positives prevent production bugs
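To keep the human in the loop on every finding, it helps to turn the model's report into a checklist. The sketch below assumes the report uses lines shaped like `path:line - message`; that format, the `FINDING` pattern, and `parse_findings` are all hypothetical, and real model output may need a looser parser.

```python
import re

# Assumed report format: an optional list marker, then "path:line", then a
# separator ("-" or ":"), then the message. Real model output may differ.
FINDING = re.compile(
    r"^\s*(?:[-*]|\d+\.)?\s*(?P<path>\S+?):(?P<line>\d+)\s*[-:]\s*(?P<message>.+)$"
)


def parse_findings(report: str) -> list[dict]:
    """Keep only lines that carry a file:line reference."""
    findings = []
    for raw_line in report.splitlines():
        match = FINDING.match(raw_line)
        if match:
            findings.append({
                "path": match["path"],
                "line": int(match["line"]),
                "message": match["message"].strip(),
                "status": "needs_human_review",
            })
    return findings


report = """1. /api/handlers/invoice_create.py:42 - amount is not validated before insert
- /api/handlers/invoice_create.py:88: query built with string formatting
Overall the file looks fine otherwise."""
findings = parse_findings(report)
for finding in findings:
    print(finding["path"], finding["line"], finding["message"])
```

Lines without a file:line reference are dropped on purpose, since a finding you cannot locate is one you cannot verify.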

For a solo founder maintaining a SaaS without a dev team, this is the highest-ROI LLM coding workflow available. You get a second pair of eyes on every commit without hiring anyone.

What the Thread Missed: Feedback Loops

The Hacker News thread covers input and interaction patterns well, but most answers miss the most important piece of production LLM systems: the feedback loop.

A chat window is one-shot. You ask, the model answers, you move on. A production system logs the input, the output, whether it was correct, and feeds that back into the pipeline.

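# Production systems log everything
import hashlib
from datetime import datetime, timezone

result = process_invoice(raw_text)

log_entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    # hash() is salted per process for strings, so use a stable digest instead
    "input_hash": hashlib.sha256(raw_text.encode("utf-8")).hexdigest(),
    "model": "claude-sonnet-4-20250514",
    "output": result.parsed_data,
    "confidence": result.confidence_score,
    "human_verified": False,
    "error_type": None
}

if result.confidence_score < 0.85:
    log_entry["error_type"] = "LOW_CONFIDENCE"
    route_to_human_queue(result, log_entry)
    track_failure_pattern(raw_text, result)

# Hypothetical sink: persist every entry, not only the low-confidence ones
append_to_log(log_entry)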

Without this loop, you can't answer the most important question for any business workflow: how often is this wrong, and in what way? The Hacker News developers talking about agentic tools get this implicitly — Claude Code runs tests and reports failures. But the people still using chat windows have no feedback data at all.

The pattern I see across client work: systems without feedback loops degrade silently. A pipeline that runs at 92% accuracy in January might drop to 78% by April because a vendor changed their invoice format. Without logging and confidence scoring, nobody notices until a client complains about a billing error.
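Catching that kind of drop does not require much machinery. The sketch below computes accuracy per month from human-verified log entries and flags months that fall well below the first one. It assumes entries are stored with ISO-format timestamp strings and that a verified entry with `error_type` of `None` was correct. The function names, the 5-point threshold, and the sample data (built to mirror the 92% and 78% figures above) are made up for illustration.

```python
from collections import defaultdict


def monthly_accuracy(entries: list[dict]) -> dict[str, float]:
    """Accuracy per month, counting only entries a human has checked."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [correct, checked]
    for entry in entries:
        if not entry["human_verified"]:
            continue
        month = entry["timestamp"][:7]  # "2026-01-15T12:00:00" -> "2026-01"
        totals[month][1] += 1
        if entry["error_type"] is None:
            totals[month][0] += 1
    return {m: correct / checked for m, (correct, checked) in sorted(totals.items())}


def flag_drift(accuracy: dict[str, float], max_drop: float = 0.05) -> list[str]:
    """Months whose accuracy fell more than `max_drop` below the first month."""
    months = list(accuracy)
    if not months:
        return []
    baseline = accuracy[months[0]]
    return [m for m in months[1:] if baseline - accuracy[m] > max_drop]


def make_entries(month: str, correct: int, wrong: int) -> list[dict]:
    """Build synthetic verified log entries for one month."""
    stamp = f"{month}-15T12:00:00"
    ok = {"timestamp": stamp, "human_verified": True, "error_type": None}
    bad = {"timestamp": stamp, "human_verified": True, "error_type": "WRONG_TOTAL"}
    return [ok] * correct + [bad] * wrong


log = make_entries("2026-01", 46, 4) + make_entries("2026-04", 39, 11)
accuracy = monthly_accuracy(log)
print(accuracy)
print(flag_drift(accuracy))
```

Only verified entries count, which is the catch: the numbers are only as good as the share of output a human actually checks, so a small random sample of high-confidence results should go to review as well.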

How This Plays Out in Practice

For the specific workflows I automate for clients — invoicing, lead intake, CRM enrichment — the architecture looks like this:

Pre-processing layer — OCR, file parsing, data normalization. No LLM involved. Pure deterministic code.
LLM decision layer — Narrow, schema-constrained, one decision per call. Each call has a clear input contract and output schema.
Validation layer — Schema enforcement, business rule checks, confidence scoring. Rejections route to human review.
Feedback layer — Every decision logged, every correction tracked, failure patterns monitored weekly.
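Wired together, the four layers can be sketched in a few dozen lines. This is a toy version under stated assumptions, not the production stack: the function names, the in-memory `decision_log`, the single business rule, and the `fake_llm` stub that replaces the real model call are all hypothetical.

```python
import json
from typing import Callable, Optional

decision_log: list[dict] = []  # stand-in for a durable log store


def preprocess(raw_text: str) -> str:
    """Layer 1: deterministic cleanup only (here, whitespace normalization)."""
    return " ".join(raw_text.split())


def validate(payload: str) -> tuple[Optional[dict], list[str]]:
    """Layer 3: shape check plus one business rule on the totals."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return None, ["not valid JSON"]
    if not isinstance(data, dict) or not {"total_amount", "line_items"} <= set(data):
        return None, ["missing required fields"]
    try:
        items_total = sum(i["quantity"] * i["unit_price"] for i in data["line_items"])
        mismatch = abs(items_total - data["total_amount"]) > 0.01
    except (KeyError, TypeError):
        return None, ["malformed line items or total"]
    if mismatch:
        return None, ["line items do not sum to total_amount"]
    return data, []


def run_pipeline(raw_text: str, llm_extract: Callable[[str], str]) -> dict:
    """Run the four layers for one document and return the logged decision."""
    cleaned = preprocess(raw_text)
    payload = llm_extract(cleaned)  # Layer 2: one narrow decision per call
    data, errors = validate(payload)
    entry = {
        "input": cleaned,
        "output": data,
        "errors": errors,
        "status": "accepted" if not errors else "human_review",
    }
    decision_log.append(entry)  # Layer 4: every decision is logged
    return entry


def fake_llm(text: str) -> str:
    """Stub for the model call so the example runs offline."""
    return json.dumps({
        "total_amount": 30.0,
        "line_items": [{"description": "Widget", "quantity": 3, "unit_price": 10.0}],
    })


result = run_pipeline("ACME   Corp\nInvoice  total 30.00", fake_llm)
print(result["status"])
```

The model is passed in as a plain callable, which keeps it swappable and makes the other three layers testable without any API access.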

This stack runs on my home server alongside multiple AI agent projects in parallel. It handles real client data, real invoices, real lead pipelines — not demos. The LLM is one component in a larger system, not the entire system.

The Hacker News thread is useful because it shows developers are waking up to this. The era of pasting snippets from ChatGPT and hoping they work is ending. The era of structured LLM systems with file access, guardrails, and feedback loops is here. The developers and founders who build those systems will ship working software. The ones still in the chat window will keep asking why it didn't work.

Frequently asked questions

How do I use LLMs for coding effectively?

Use one of two workflows. First, adopt an agentic tool like Claude Code or Cursor that can read your actual files, run commands, and iterate on output. Second, build a narrow pipeline where you pre-process code into structured chunks, define the output format explicitly, and constrain the model to a single decision per call. Both approaches eliminate the vague-prompt problem that causes most LLM coding attempts to fail in a blank chat window.

Why does interaction structure matter for LLM coding accuracy?

The bottleneck isn't model capability — it's how you structure the interaction. Giving an LLM narrow, well-scoped tasks with full context, structured input schemas, clear guardrails, and a feedback loop raises accuracy from roughly 40% to consistently above 90%. That improvement comes from an architecture upgrade, not a model upgrade. Developers who pasted vague prompts into a chat window without file access or iteration loops consistently got poor results.

When should I use an agentic tool vs a custom LLM pipeline for coding?

Use an agentic tool like Claude Code when you want a ready-made solution that can read files, run commands, and iterate without custom engineering. Build a custom pipeline when you need precise control — pre-processing your codebase into structured chunks, defining explicit output formats, and constraining the model to a single decision per call. Choose the agentic route for speed and the custom pipeline for repeatability in production systems.

What are the most common ways developers use LLMs for coding?

Developers fall into four main camps. Some use agentic tools like Claude Code or Cursor that give the model direct repository access. Others build custom tooling that pre-processes codebases into structured context. A third group uses LLMs for code review and debugging rather than writing new code. Finally, some still treat LLMs as enhanced search engines for quick answers. The developers getting the most value give models narrow, well-scoped tasks with full context rather than asking for entire features from scratch.

Why did my ChatGPT integration fail for coding tasks?

Nine times out of ten, the failure is in integration design, not the model. Most teams paste a vague prompt into a chat window with no system prompt, no file access, no structured output, and no iteration loop. When you instead wire up a system that gives the model codebase access, structured input schemas, clear guardrails, and a feedback loop, accuracy jumps from roughly 40% to above 90%. The chat window is a prototype, not a production system.