Using LLMs for Code Review and Testing
What I’ve Found
I’ve been using LLMs increasingly for software engineering, especially from late 2024 into early 2025: Codeium, Windsurf, ChatGPT, and Claude, with mixed results at first. Codeium got completions right often enough to be tempting but wrong often enough that fixing its output became more work than writing the code myself. Windsurf was impressive but didn’t fit my workflow. ChatGPT was okay but inconsistent. Claude has been consistently useful, but only once I stopped trying to use it for generation and started using it differently.
The pattern that emerged: LLMs work exceptionally well for code review and analysis. They’re unreliable for extended code generation.
This isn’t obvious from demos or marketing; it took months of production use to see where the value actually lies.
Why This Pattern Exists
LLMs are linguistic tools trained on massive amounts of code. They match patterns reliably to surface issues worth focusing on. They’re very good at this—comparable to a skilled code reviewer for pattern recognition.
They fail at generation when the task requires sustained goal coherence or novel problem-solving. When generating code, they predict tokens sequentially. Each choice is probabilistic and errors propagate. They default to the most common implementation in their training data, which is often generic or wrong for your requirements.
The critical difference: analysis is bounded, generation is unbounded.
When you give an LLM existing code and ask “what’s wrong here?”, the analysis space is finite. The code exists. Patterns can be matched. Known issues can be spotted.
When you ask it to generate code, it predicts tokens without feedback or iteration. Errors don’t stop generation—they accumulate and compound.
The Core Mechanism: Explanation Forces Discovery
The most useful thing I’ve found: explaining your code in review reveals your own bugs.
Not because an LLM’s or a person’s feedback is always correct, but because the act of explaining your logic makes gaps visible. When you articulate what you did and why, you discover the logical errors, missed edge cases, and wrong assumptions yourself.
Traditional code review has drifted toward gatekeeping and style checking. Explaining your reasoning to something that listens without interruption is different: it forces focus and clarity. LLM feedback prompts clarification, and in clarifying I find the problems through that iteration.
This used to happen naturally in code reviews, walkthroughs, and pair programming, and it’s increasingly rare in contemporary work.
Method Emerging From Generated Chaos
This is the system that’s emerged from practice:
1: Code + Explanation
Drop your code into the LLM and ask for a review. Explain what you were building and why.
The LLM will spot anti-patterns, common security issues, missed edge cases, and inconsistencies. Its responses immediately surface what you haven’t articulated and what’s missing.
This is fast and catches low-hanging fruit before you invest further.
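As a rough sketch of what this step can look like when scripted rather than pasted into a chat window, here is a minimal prompt builder in Python. The prorate snippet and the explanation are hypothetical stand-ins for your own code and intent.

```python
# Step 1 as a prompt builder. The example code and explanation are hypothetical;
# the important part is pairing the code with a statement of intent.
def build_review_prompt(code: str, explanation: str) -> str:
    """Combine the code under review with the author's explanation of intent."""
    return (
        "Review the following code. Point out anti-patterns, common security "
        "issues, missed edge cases, and any inconsistencies between my "
        "explanation and what the code actually does.\n\n"
        f"What I was building and why:\n{explanation}\n\n"
        f"Code:\n{code}\n"
    )

prompt = build_review_prompt(
    code="def prorate(amount, days_used, days_in_month): ...",
    explanation="Monthly invoicing; proration should apply only on plan upgrades.",
)
# Paste `prompt` into a chat session, or send it through whatever client you use.
```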
2: Clarification & Iteration
Read the feedback and iterate. Often you’ll spot your own error while explaining. Clarify, justify, or correct. Iterate until logic is solid or you’ve confirmed the approach is intentional.
This catches the majority of errors before code runs—mostly logical inconsistencies, architectural misalignment, missing cases.
3: Testing
Write tests. Generate them with the LLM for complex logic if useful. The deterministic gate is the test suite, not the review.
This catches remaining issues: off-by-one errors, state bugs, integration problems. The test suite is the baseline litmus test for completion and deployment.
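For a hypothetical illustration of the deterministic gate, this is the kind of edge-case suite worth writing (or having the LLM draft), shown here in pytest; the prorate helper and its cases are made up for the sketch.

```python
# test_prorate.py: the suite passes or it doesn't, regardless of how
# plausible the review conversation sounded.
import pytest

def prorate(amount: float, days_used: int, days_in_month: int) -> float:
    """Hypothetical function under test: charge proportionally to days used."""
    if days_in_month <= 0:
        raise ValueError("days_in_month must be positive")
    days_used = max(0, min(days_used, days_in_month))
    return round(amount * days_used / days_in_month, 2)

def test_full_month_charges_full_amount():
    assert prorate(30.00, 30, 30) == 30.00

def test_zero_days_charges_nothing():
    assert prorate(30.00, 0, 30) == 0.00

def test_usage_is_clamped_to_the_month():
    assert prorate(30.00, 45, 30) == 30.00  # edge case surfaced during review

def test_invalid_month_length_is_rejected():
    with pytest.raises(ValueError):
        prorate(30.00, 10, 0)
```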
4: Execution
Run the code in its actual environment. Real data, real constraints, real failures. Use the application, try the feature, see it work.
This catches what testing didn’t: concurrency issues, resource constraints, actual third-party behavior, environmental configuration.
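Where a scripted check is useful alongside manually exercising the feature, an execution-stage smoke check can be as small as this sketch; the base URL and endpoints are hypothetical placeholders for your own deployment.

```python
# smoke_check.py: a minimal execution-stage check against a running instance.
import sys
import urllib.request

BASE_URL = "http://localhost:8000"  # hypothetical local deployment

def check(path: str) -> bool:
    """Hit a real endpoint on the running application and report the result."""
    try:
        with urllib.request.urlopen(f"{BASE_URL}{path}", timeout=5) as resp:
            status = resp.status
    except OSError as exc:  # connection refused, timeouts, HTTP errors
        print(f"{path}: FAILED ({exc})")
        return False
    ok = 200 <= status < 300
    print(f"{path}: {'OK' if ok else 'FAILED'} (HTTP {status})")
    return ok

if __name__ == "__main__":
    results = [check(p) for p in ("/health", "/invoices/latest")]  # hypothetical paths
    sys.exit(0 if all(results) else 1)
```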
When This Works
This system works when:
- You’re solving problems within your domain expertise
- Solution scope is bounded (small features, utilities, refactoring)
- The problem is well-represented in LLM training data (common patterns, standard libraries)
- You can articulate requirements clearly
- You have a test strategy
It fails when:
- The problem is poorly represented in training data (novel or niche work)
- Solution requires architectural decisions across many components
- Scope is large and requires coordination
- Requirements are unclear
- The problem is new to you (you lack pattern recognition too)
Why Autonomous/Agentic Tools Are Risky
LLMs generate probabilistically. They’re guessing machines. When you use an LLM in a loop—generate code, run it, observe failure, generate fix, run again—errors compound.
Each probabilistic choice feeds the next. Without hard deterministic feedback (test suite, compiler error, hard constraint), the system can diverge. Each iteration is locally plausible but globally incorrect.
Humans maintain goal coherence across sequences. We remember why we started. We course-correct. LLMs maintain context in a window but don’t maintain goal—they maintain token sequence.
This is why I don’t use LLM tools for extensive generation. Not because it can’t work, but because the tools lack the features that would make it likely to succeed or economically viable.
Defense: deterministic gates. Human oversight at decision points. Tests that fail hard. No autonomous loops.
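A sketch of those defenses wired together, assuming a hypothetical ask_llm_for_analysis callable standing in for whatever client you use: the LLM only produces analysis, a human applies any change, and the test suite plus a hard iteration cap are the gates.

```python
# gate.py: a sketch of "deterministic gates, no autonomous loops".
# ask_llm_for_analysis is a hypothetical callable standing in for your client.
import subprocess

MAX_ITERATIONS = 3  # hard constraint: stop and rethink instead of looping forever

def tests_pass() -> bool:
    """Deterministic gate: the test suite's exit code, not the LLM's opinion."""
    return subprocess.run(["pytest", "-q"]).returncode == 0

def review_cycle(ask_llm_for_analysis) -> bool:
    for attempt in range(1, MAX_ITERATIONS + 1):
        if tests_pass():
            return True
        failures = subprocess.run(
            ["pytest", "-q", "--tb=short"], capture_output=True, text=True
        ).stdout
        analysis = ask_llm_for_analysis(failures)
        print(f"Attempt {attempt}: review this analysis and apply a fix yourself.\n{analysis}")
        input("Press Enter after applying (or rejecting) a fix to re-run the gate...")
    return False  # cap reached: escalate to a human decision, not another loop
```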
Practical Impact
Since implementing this methodology:
- Delivery time feels substantially shorter
- Major logic errors surface several iterations earlier than they used to
- Test coverage is better—fewer edge cases missed
- Less rework and fewer refactoring errors
The time savings come from:
- LLM analysis is faster than human review (no context-switching, no waiting, no politics, no fatigue, no coordination, no overhead)
- Explanation as forcing function catches errors human review usually misses
Remaining errors get caught in testing and execution—the deterministic gates where they belong.
What I’ve Learned About LLM Capabilities
LLMs excel at:
- Pattern recognition in existing code
- Spotting anti-patterns and common vulnerabilities
- Generating test cases for well-defined logic and existing code
- Explaining concepts clearly (goal is clarity; clarity patterns are well-represented)
- Code review (goal is pattern matching; that’s what they do)
LLMs struggle with:
- Generation beyond the trivial (the goal must be maintained; they lose coherence)
- Novel architectural problems (no training data patterns to match)
- Coordinating changes across many files
- Maintaining context over medium to long sequences
- Generating code for uncommon frameworks or private codebases
Understanding this, that they’re linguistically sophisticated but cognitively limited, determines how you use them. Use them where language patterns suffice or where your existing code supplies the patterns to match. Don’t use them where goal maintenance matters.
Implementation
This methodology should work with any capable LLM (I’ve only confirmed it with Claude), any IDE, and any testing framework.
A capable LLM provides immediate, iterative feedback that lets work move quickly. It’s comparable to autocomplete, but applied to whatever step of software development you’re at when you need it: review, test generation, debugging, explanation. No waiting for CI, no scheduling code review, no context-switching overhead.
What matters:
- Clear explanation of your work (forces clear thinking)
- Fast feedback loop (deterministic, no waiting)
- Tests as actual quality gate (deterministic pass/fail)
- Hard constraints on what LLM controls (analysis and feedback only; generation only when bounded)
Tools are secondary. System is primary.
Why This Matters Now
Most discussion about LLMs and code focuses on generation speed: “How do we use LLMs to code faster?”
Better question: “What part of the development process actually benefits from LLM capabilities?”
Answer: The parts that involve pattern recognition and analysis. Not the parts that require novel thinking or sustained goal coherence.
LLMs are pattern-matching engines. Humans are goal-maintenance engines. Each does what it’s good at.
The value isn’t in replacing developers. It’s in eliminating friction (waiting for review, context-switching, organizational overhead) and forcing better practices (explanation, testing).
If the approach I’ve outlined ships more stable features at several times the prior rate, and at lower compute cost, then I’m both more productive and delivering substantially higher value overall.
This methodology emerged from production use over the past year. Your experience may differ. Test it, adapt it, use what works. jimmont.com