What Is Code Health? A Practical Guide to the 12 Biomarkers
What is code health? It is a property of a codebase that indicates whether the system will stay easy to change next week, next month, and next quarter. Code health is not a vibe, and it is not a single score you can paste into a dashboard and trust blindly. In practice, code health metrics work best when they combine structure, dependency shape, testing signal, and repository history. That is the idea behind the 12 code biomarkers in this guide. You will see how to read them, why aggregate averages hide trouble, and how to turn a weak code health score into a repair plan. For a live view of these ideas on a real repo, see our live examples.
A working definition
Code health explained in one sentence: it is a measurement of how much friction the code will create for the next change.
That sounds simple until you try to measure it.
A file can be short but tangled. A module can be clean but surrounded by fragile callers. A repo can have good tests and still be hard to change because the same few files absorb most edits. So the useful question is not “Is this code good?” It is “Where will future work slow down, and why?”
A practical definition needs four ingredients:
- Structure — how hard the code is to read and reason about.
- Coupling — how many things depend on it, and what it depends on.
- Test signal — whether changes are guarded by meaningful tests.
- Churn signal — whether history says this area keeps changing, breaking, or attracting risk.
Modern static analysis tools already measure pieces of this. SonarQube tracks maintainability, reliability, security, duplication, coverage, and complexity metrics in its analysis model. GitHub’s CODEOWNERS feature gives a simple ownership signal. The Model Context Protocol (MCP), which gives AI tools structured access to repo context, has a spec revision dated 2025-11-25, and the protocol’s working groups are still evolving the standard in 2026. (docs.sonarsource.com)
The problem is not lack of data. The problem is bad aggregation.
Why aggregate scores hide problems
A single repo-wide average makes unhealthy code look acceptable.
If one module scores 9/10 and another scores 2/10, the mean may look fine. That average can also improve while the actual risk gets worse, if healthy files get healthier and the truly fragile file stays ignored. CodeScene’s own product page calls out this exact failure mode: hotspots may be healthy, averages may hide low-scoring files, and a legacy module can remain a long-term risk even if it changes less often. (codescene.com)
That is why “code health” works better as a profile than as a number.
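A minimal sketch of that effect, using made-up file names and scores on a 0 to 10 scale (all hypothetical): the mean rises between two snapshots while the weakest file does not move.

```python
from statistics import mean

def summarize(scores, worst_n=1):
    """Return the repo-wide mean alongside the lowest-scoring files."""
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return {"mean": mean(scores.values()), "worst": ranked[:worst_n]}

# Hypothetical per-file scores at two points in time.
before = {"core/billing.py": 2.0, "utils/dates.py": 8.0, "utils/text.py": 8.0}
after = {"core/billing.py": 2.0, "utils/dates.py": 9.5, "utils/text.py": 9.5}

for label, snapshot in (("before", before), ("after", after)):
    print(label, summarize(snapshot))
```

Reporting the worst files next to the mean keeps the fragile file visible even when the headline number improves.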
Averages miss three common failure modes
| Failure mode | What the average says | What is really happening |
|---|---|---|
| One hot file is broken | Looks fine | A high-churn file is dragging delivery down |
| One legacy module is stale | Looks fine | Future changes will be expensive |
| Tests cover the wrong areas | Looks fine | Refactors will still be risky |
You want a code health score that can be broken apart, not just rolled up.
That is also why dependency graphs matter. If a file sits high in the dependency chain, poor health there costs more than the same score in a leaf module. Repowise’s dependency graph demo shows how imports become a directed graph, which is the right shape for finding shared risk and blast radius. See the FastAPI dependency graph demo to see it in action, or read repowise’s architecture to understand how the graph and other intelligence layers fit together.
[Figure: Code Health Overview Dashboard]
The 12 biomarkers, grouped
The 12 biomarkers below are not magic. They are a compact way to read a codebase from four angles. The grouping matters more than the exact math.
Complexity (4)
1) Cyclomatic complexity
Counts the number of linearly independent paths through a function, which works out to roughly one plus the number of decision points. More branches means more states to test and more paths to miss.
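As a rough illustration, the count can be approximated for Python source with the standard `ast` module. This is a simplified sketch, not how any particular tool computes it: the set of node types treated as decision points is an assumption, and real analyzers handle more constructs.

```python
import ast

# Node types treated as decision points in this simplified count (an assumption).
_DECISIONS = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)

def cyclomatic_complexity(source):
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    count = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _DECISIONS):
            count += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds one extra branch per additional operand.
            count += len(node.values) - 1
    return count

SAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
'''

print(cyclomatic_complexity(SAMPLE))
```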
2) Cognitive complexity
Estimates how hard a function is to follow in your head. Deep nesting and abrupt control flow raise the cost even if cyclomatic complexity stays modest.
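One way to see the difference is a nesting-weighted count, where each control structure costs one point plus one more per level of enclosing nesting. The sketch below is a deliberately simplified stand-in for cognitive complexity, not the published metric: for example, it treats `elif` as a nested `if`, because that is how Python's `ast` represents it.

```python
import ast

# Constructs that add cost and deepen nesting (a simplifying assumption).
_NESTING = (ast.If, ast.For, ast.While, ast.Try)

def nesting_weighted_cost(source):
    """Sum of (1 + nesting depth) over every control structure."""
    def visit(node, depth):
        total = 0
        for child in ast.iter_child_nodes(node):
            if isinstance(child, _NESTING):
                total += 1 + depth
                total += visit(child, depth + 1)
            else:
                total += visit(child, depth)
        return total
    return visit(ast.parse(source), 0)

# Both snippets contain three branching constructs.
FLAT = "if a:\n    pass\nif b:\n    pass\nif c:\n    pass\n"
NESTED = "if a:\n    for x in y:\n        if b:\n            pass\n"

print(nesting_weighted_cost(FLAT), nesting_weighted_cost(NESTED))
```

The flat and nested snippets have the same number of branches, but the nested one costs twice as much under this weighting, which matches the intuition that depth is what strains a reader.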
3) File size
Big files are often a sign that several responsibilities got merged. Size alone is not proof of bad code, but it is a good screening signal.
4) Function length and density
A long function with many local variables, conditionals, and nested loops is usually a change tax. The real issue is not lines. It is the amount of state a reader must keep in memory.
Coupling (3)
5) Fan-in
How many other files or modules depend on this one. High fan-in means a change can ripple outward.
6) Fan-out
How many dependencies this file pulls in. High fan-out usually means the file is an orchestration layer, or it has grown into a god object.
7) Cycles
A cycle means two or more modules depend on each other, either directly or through a chain of imports. Cycles make refactors slow because you cannot move one piece without touching the others.
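All three coupling biomarkers can be read off a directed import graph. The sketch below assumes the graph is already available as a list of `(importer, imported)` pairs; the module names are hypothetical, and extracting the edges from real source is out of scope here.

```python
from collections import defaultdict

def fan_counts(edges):
    """Return (fan_in, fan_out) dicts from (importer, imported) pairs."""
    fan_in, fan_out = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        fan_out[src] += 1
        fan_in[dst] += 1
    return dict(fan_in), dict(fan_out)

def find_cycle(edges):
    """Return one dependency cycle as a list of modules, or None."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    unvisited, in_progress, done = 0, 1, 2
    state = defaultdict(int)
    stack = []

    def dfs(node):
        state[node] = in_progress
        stack.append(node)
        for nxt in graph[node]:
            if state[nxt] == in_progress:
                # Back edge: the cycle is the stack from nxt onward.
                return stack[stack.index(nxt):] + [nxt]
            if state[nxt] == unvisited:
                found = dfs(nxt)
                if found:
                    return found
        stack.pop()
        state[node] = done
        return None

    for node in list(graph):
        if state[node] == unvisited:
            found = dfs(node)
            if found:
                return found
    return None

# Hypothetical import edges.
EDGES = [("api", "billing"), ("api", "users"), ("billing", "db"),
         ("users", "db"), ("db", "billing")]
```

The recursive search is fine for a sketch; a very deep graph would call for an iterative traversal or a strongly connected components algorithm instead.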
Test signal (2)
8) Coverage on changed code
Overall coverage can lie. Coverage on new and changed lines is more honest because it checks the code you actually touched.
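A small sketch of the gap, assuming you already have two sets of line numbers for a file: the lines a change touched and the lines the test run executed. How those sets are obtained (diff parsing, coverage reports) is left out, and the numbers are invented.

```python
def changed_line_coverage(changed, covered):
    """Fraction of changed lines that were executed by tests, or None if nothing changed."""
    if not changed:
        return None
    return len(changed & covered) / len(changed)

# Hypothetical file: 100 executable lines, tests execute lines 1-90.
all_lines = set(range(1, 101))
covered = set(range(1, 91))
# The change touched only the untested tail of the file.
changed = set(range(91, 101))

overall = len(covered) / len(all_lines)
on_change = changed_line_coverage(changed, covered)
print(overall, on_change)
```

Here the file reports 90% coverage overall while none of the changed lines are exercised, which is the case the overall number cannot show.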
9) Test proximity
A file with nearby, focused tests is easier to change than one that depends on a distant integration suite. The signal is not just “Are there tests?” It is “Do the tests sit close enough to fail fast?”
SonarQube explicitly separates test coverage from complexity and maintainability metrics, which is the right model. Coverage alone does not describe maintainability. Complexity alone does not describe correctness. (docs.sonarsource.com)
Churn signal (3)
10) Churn rate
How often the file changes over time. High churn means people keep revisiting the same code, often because it is unstable or central to product work.
11) Churn x complexity
This is the hotspot idea in one number. A complex file that changes often deserves more attention than a complex file that has been untouched for years.
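One simple way to get that number is to multiply a churn count by a complexity figure and sort. The plain product is an assumption made for illustration; tools may normalize or weight the two inputs differently, and the file names and values below are invented.

```python
def rank_hotspots(files):
    """files maps path -> (commits touching the file, complexity). Highest risk first."""
    scored = {path: churn * complexity for path, (churn, complexity) in files.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Hypothetical inputs: (commits in the last year, complexity score).
FILES = {
    "billing.py": (40, 30),        # complex and frequently edited
    "legacy_report.py": (1, 45),   # most complex, but almost never touched
    "routes.py": (50, 5),          # busy, but simple
}

for path, score in rank_hotspots(FILES):
    print(path, score)
```

The most complex file lands last because it rarely changes, while the complex file that is edited often rises to the top.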
12) Co-change pressure
If file A keeps changing with files B and C, you likely have hidden coupling. That is a sign the current module boundaries do not match reality.
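Co-change can be estimated by counting how often pairs of files appear in the same commit. The sketch below assumes the commit history is already available as lists of file paths per commit (for example, parsed from a git log); the parsing step and the file names are not from the original text.

```python
from collections import Counter
from itertools import combinations

def co_change_counts(commits):
    """Count how many commits touched each pair of files together."""
    pairs = Counter()
    for files in commits:
        # Sort so that (a, b) and (b, a) land on the same key.
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical history: each inner list is the set of files in one commit.
COMMITS = [
    ["a.py", "b.py", "c.py"],
    ["a.py", "b.py"],
    ["a.py", "b.py"],
    ["c.py"],
]

print(co_change_counts(COMMITS).most_common(1))
```

In practice you would probably skip very large commits, since bulk renames and formatting passes pair up files that have no real relationship.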
Research on software maintainability has long treated size, complexity, and churn as important signals rather than isolated measures. The ACM’s summary of empirical work on open-source maintainability notes that complexity figures and churn are standard metrics used to study code quality trends. (ubiquity.acm.org)
A quick summary table
| Biomarker | Group | Best use |
|---|---|---|
| Cyclomatic complexity | Complexity | Branch-heavy functions |
| Cognitive complexity | Complexity | Human readability |
| File size | Complexity | Large-file triage |
| Function length and density | Complexity | State-heavy logic |
| Fan-in | Coupling | Shared risk |
| Fan-out | Coupling | Over-orchestration |
| Cycles | Coupling | Refactor blockers |
| Coverage on changed code | Test signal | Trust in current change |
| Test proximity | Test signal | Local safety net |
| Churn rate | Churn signal | Change frequency |
| Churn x complexity | Churn signal | Hotspot ranking |
| Co-change pressure | Churn signal | Hidden coupling |
Per-file scoring
A per-file score is useful only if it answers one question: “What would make this file safer to touch?”
A good file-level score should:
- penalize deep nesting and large functions,
- penalize high fan-in and cycles,
- reward local test coverage,
- factor in recent churn,
- and surface the reason a file scored poorly.
That last point matters. A number without a reason turns into trivia.
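A minimal sketch of a score that carries its reasons with it. Every threshold, penalty size, and field name here is a hypothetical choice for illustration, not a formula from any tool.

```python
from dataclasses import dataclass

@dataclass
class FileMetrics:
    max_nesting: int
    longest_function: int    # lines
    fan_in: int
    in_cycle: bool
    changed_coverage: float  # 0.0 to 1.0
    recent_commits: int

def score_file(m):
    """Return (score out of 10, list of reasons). All thresholds are illustrative."""
    score, reasons = 10.0, []

    def penalize(points, reason):
        nonlocal score
        score -= points
        reasons.append(reason)

    if m.max_nesting > 4:
        penalize(2, "deep nesting")
    if m.longest_function > 80:
        penalize(1, "large function")
    if m.fan_in > 10:
        penalize(2, "high fan-in")
    if m.in_cycle:
        penalize(2, "part of a dependency cycle")
    if m.changed_coverage < 0.5:
        penalize(2, "low coverage on changed code")
    if m.recent_commits > 20:
        penalize(1, "high recent churn")
    return max(score, 0.0), reasons

risky = FileMetrics(6, 40, 15, False, 0.2, 30)
healthy = FileMetrics(2, 20, 1, False, 0.9, 3)
print(score_file(risky))
print(score_file(healthy))
```

Returning the reasons alongside the number means the report can say why a file is at 3 out of 10, not only that it is.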
A useful workflow looks like this:
- Sort by lowest score.
- Filter by high churn or high fan-in.
- Inspect the top 3–5 files.
- Check whether the problem is structural or historical.
- Decide between refactor, test addition, or boundary split.
This is where tools like repowise help. The hotspot analysis demo shows how churn and complexity combine into a ranked risk view, and the auto-generated docs for FastAPI show the other side of the same problem: docs, symbols, and module context in one place.
[Figure: Per-File Score Breakdown]
What to do when a file scores poorly
| Problem pattern | Likely fix |
|---|---|
| High complexity, low churn | Refactor only when you touch it |
| High complexity, high churn | Schedule dedicated cleanup |
| High fan-in | Add tests first, then split carefully |
| High fan-out | Extract interfaces or adapters |
| Low coverage on changed code | Add focused tests before refactor |
| Co-change pressure | Revisit module boundaries |
Module rollup
A module rollup should answer a different question: “Which parts of the system are accumulating risk?”
This is where per-file metrics get promoted into architecture decisions.
A module score can be a weighted average of file scores, but weight matters. If you average everything equally, a big package with many trivial leaf files can hide one core module that everyone depends on. Better rollups give more weight to:
- files with high fan-in,
- files on important dependency paths,
- files with many co-change partners,
- files with low test signal,
- and files that have declined over time.
Dependency graphs are the cleanest way to do this. Repowise’s architecture page shows how the dependency graph, git intelligence, and code health layer fit together, while the live examples page lets you compare those layers on real repositories.
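To make the weighting concrete, here is a sketch that weights each file by `1 + fan_in`. That weight is an assumption chosen for simplicity; the other factors listed above (co-change partners, test signal, trend) could feed the weight in the same way. The scores are invented.

```python
def plain_mean(files):
    """Unweighted average of file scores. files is a list of (score, fan_in)."""
    return sum(score for score, _ in files) / len(files)

def module_score(files):
    """Average of file scores weighted by 1 + fan_in (an illustrative weight)."""
    if not files:
        raise ValueError("module has no files")
    total_weight = sum(1 + fan_in for _, fan_in in files)
    return sum(score * (1 + fan_in) for score, fan_in in files) / total_weight

# Hypothetical package: nine trivial leaf files and one core file everyone imports.
PACKAGE = [(9.0, 0)] * 9 + [(2.0, 9)]

print(round(plain_mean(PACKAGE), 2), round(module_score(PACKAGE), 2))
```

With equal weights the package looks healthy; once the core file counts in proportion to its dependents, the module score drops sharply.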
Rollup rules that work
- Do not average away hotspots. Keep the worst 5–10% visible.
- Track the trend, not just the snapshot.
- Weight by dependency importance.
- Separate stable debt from active risk.
- Surface a reason code for the module score.
CodeScene’s product material makes a similar point by separating hotspot code health, average code health, and the surrounding context rather than collapsing everything into one KPI. (codescene.com)
Reading a declining trend
A declining trend matters more than a static low score.
A file that drops from 8 to 6 in two months is telling you something different from a file that has sat at 6 for a year. The first one is getting worse. The second one is a known debt bucket.
How to read the trend
- Short drop, high churn: likely an active feature area under pressure.
- Slow decline, low churn: likely neglected debt.
- Flat low score, high fan-in: likely a shared core module that needs protection.
- Flat low score, low fan-in: probably lower priority unless the code is mission-critical.
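The four readings above can be sketched as a small decision function. The thresholds and parameter names are assumptions, and the sketch simplifies the list by using churn alone to separate the two declining cases rather than measuring how fast the drop happened.

```python
def read_trend(delta, current, high_churn, high_fan_in,
               low_threshold=6.0, drop_threshold=-1.0):
    """Classify a file from its score change (delta) and current score.

    Thresholds are illustrative defaults, not recommended values.
    """
    declining = delta <= drop_threshold
    if declining and high_churn:
        return "active feature area under pressure"
    if declining:
        return "neglected debt"
    if current <= low_threshold and high_fan_in:
        return "shared core module that needs protection"
    if current <= low_threshold:
        return "lower priority unless mission-critical"
    return "no action"

# A file that dropped from 8 to 6 while being edited heavily.
print(read_trend(delta=-2.0, current=6.0, high_churn=True, high_fan_in=False))
# A file that has sat at 6 with many dependents.
print(read_trend(delta=0.0, current=6.0, high_churn=False, high_fan_in=True))
```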
Declining health should trigger a specific response:
- Check whether tests fell behind.
- Check whether the module picked up new responsibilities.
- Check whether ownership is unclear.
- Check whether recent changes introduced cycles or extra fan-out.
- Decide whether to fix the code or freeze it behind a better boundary.
GitHub’s CODEOWNERS feature is a lightweight ownership signal, but it is only a starting point. It tells you who should review changes. It does not tell you whether the file is becoming harder to maintain. That is where git intelligence and dependency signals add value. (docs.github.com)
Worked example on a real repo
Let’s use a small FastAPI-shaped example.
Imagine three files:
- app/main.py
- app/routes/users.py
- app/services/billing.py
A shallow view might say the repo is healthy because coverage is decent and no file is huge. The code health profile says more:
| File | Complexity | Coupling | Test signal | Churn | Result |
|---|---|---|---|---|---|
| app/main.py | Low | High fan-out | Medium | Low | Structural glue |
| app/routes/users.py | Medium | Medium | High | High | Active feature surface |
| app/services/billing.py | High | High fan-in | Low | High | Priority hotspot |
The billing module is the one to fix first. It has the worst mix: complex logic, many dependents, poor tests, and frequent edits.
A sensible plan looks like this:
- Add tests around the current behavior.
- Cut one boundary at a time.
- Remove a cycle if one exists.
- Reduce fan-out, where billing calls external APIs directly, by pushing those calls behind adapters.
- Re-score after each change.
That is the practical version of code health. The score is not the goal. The decision is the goal.
If you want to see how this kind of context appears in tooling, check the ownership map for Starlette and the FastAPI dependency graph demo. If you want the automated docs side, the FastAPI docs example shows how file-level context can sit beside architectural context.
[Figure: Hotspot Trend and Refactor Workflow]
Why a code health score needs context
A score is useful when it starts a conversation, not when it ends one.
If a code health score is low because the code is old but stable, that is different from a low score on a change-heavy path. If a file is low because it has high cyclomatic complexity, that is different from a file that is low because nobody owns it. If a module is declining because tests are missing, the fix is obvious. If it is declining because it is the main integration seam in the system, the fix may be architectural.
That is why code biomarkers beat a single dashboard number.
Repowise’s auto-generated wiki, git intelligence, dependency graph, and health layer are built around that same idea: put the reason next to the number. If you want to see the full workflow, start with repowise’s architecture, then compare it with the live examples, and finally inspect the FastAPI hotspot analysis demo.
FAQ
What is code health in software engineering?
Code health is a measure of how costly future changes will be. It combines structure, coupling, testing, and history so you can spot risk before it turns into slow delivery.
What are code health metrics?
Code health metrics are the measurements used to estimate maintainability and change risk. Common examples include complexity, fan-in, fan-out, cycles, coverage on changed code, churn, and co-change patterns.
What are code biomarkers?
Code biomarkers are the individual signals that make up a code health profile. In this guide, the 12 biomarkers are grouped into complexity, coupling, test signal, and churn signal.
Is a code health score enough on its own?
No. A score is a summary, not a diagnosis. You need the underlying biomarkers to know whether the problem is complexity, coupling, missing tests, or churn.
How do I improve code health without a big refactor?
Start with the worst hotspot. Add tests, remove one cycle, reduce fan-out, and split responsibilities only where the score says the risk is highest.
What tools help measure code health?
Static analysis tools, dependency graph tools, ownership maps, and git history analysis are the useful ones. GitHub’s CODEOWNERS gives ownership hints, SonarQube tracks maintainability and coverage metrics, and MCP gives AI tools a standard way to access structured repo context. (docs.github.com)


