How a 12-Biomarker Code Health Scorer Actually Works

repowise team · 15 min read
Tags: code health scorer, 12 biomarkers, code health algorithm, how code health is calculated, health score internals

Why one number isn’t enough

A single health number is only useful if it is built from signals that map to real maintenance cost. A file can look clean and still sit at the center of churn. Another file can be ugly but stable enough that it should stay low on your list. A useful score needs to account for structure, ownership, change history, and the parts of the codebase that people actually touch. That is the point of the 12-biomarker model: reduce a messy set of maintenance signals into one score without pretending the score is magic. CodeScene’s public product docs make a similar argument, saying its Code Health is an aggregated metric built from 25+ factors and that a single KPI is not enough on its own. (codescene.com)

That framing also fits how the Model Context Protocol is used in agent tools today: MCP servers expose structured capabilities and let clients ask for exactly the data they need, rather than stuffing a model with a blob of text. MCP’s spec defines tools as named actions with schemas, and current spec pages describe versioned protocol identifiers and tool discovery. (modelcontextprotocol.io)

For a repo intelligence system like repowise, that matters. If you want the score to be useful in code review, architecture work, or refactoring triage, you need the score to be explainable at file level, roll up cleanly at module level, and change in a predictable way over time. See what repowise generates on real repos in our live examples, or check repowise’s architecture if you want to see how the data flows end to end.

[Figure: Code Health Score Pipeline]

The 12 biomarkers and why each is on the list

The 12 biomarkers are not random code smells. They are chosen because each one can be measured automatically and each one tends to correlate with maintainability pain. A good code health algorithm avoids subjective labels like “messy” and instead asks concrete questions: how complex is this file, how much does it depend on others, how many paths does it expose, how often does it change, and how old is the last cleanup?

Here is the list I use when I design a scorer like this.

| Biomarker | What it measures | Why it matters |
|---|---|---|
| 1. Cyclomatic complexity | Branching paths in a function or file | Harder to reason about, harder to test |
| 2. Cognitive complexity | Nested control flow and mental load | Captures “this feels hard” better than raw branch count |
| 3. File size | Lines or tokens per file | Large files hide unrelated behavior |
| 4. Function count | How much is packed into one unit | High density often means poor separation |
| 5. Parameter count | API surface per function | Too many inputs often signal muddled responsibilities |
| 6. Duplication | Repeated blocks or near-duplicates | Copies diverge and raise bug risk |
| 7. Coupling out | Imports, calls, or references to other units | High fan-out makes changes expensive |
| 8. Coupling in | Dependents or reverse references | Central files carry more blast radius |
| 9. Change frequency | How often the file changed in git history | Frequent edits can indicate churn or instability |
| 10. Recent churn | Change volume in a recent window | Finds active hot spots, not just old trouble |
| 11. Ownership concentration | How many people touch it | Bus factor and review risk |
| 12. Test coverage proxy | Presence of tests or nearby test linkage | Low confidence code should score lower |

A scorer can use more than one signal per biomarker. For example, “coupling” can come from dependency edges, call graphs, or co-change history. “Ownership concentration” can come from git authorship and recent reviewer patterns. “Test coverage proxy” can be a real coverage report if the repo has one, or a weaker heuristic when it does not.

The key design choice is this: each biomarker should be individually explainable. If a file scores poorly, the report should say why. “High coupling out” is useful. “Bad vibes” is not.

CodeScene’s public docs describe Code Health as an aggregated metric built from a set of factors scanned from source code, and their product pages note that Code Health is weighted and normalized across codebases. That is the same shape of problem, even if the factor set differs. (codescene.com)

How code health is calculated at file level

A file-level score is usually the first place people look, so the formula needs to be stable and boring. The common pattern is:

  1. Measure each biomarker on a normalized scale.
  2. Convert each biomarker into a risk contribution.
  3. Weight the contributions.
  4. Aggregate to one file score.
  5. Clamp the result to a fixed range, often 0–100.

A simple version looks like this:

file_health = 100 - Σ(weight_i × risk_i)

Where risk_i is already normalized. That normalization step matters more than the formula. A raw cyclomatic complexity of 18 means very different things in a 40-line helper versus a 2,000-line orchestration file. Good scoring systems normalize against file type, language, and sometimes repository baseline.
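
As a minimal sketch of the formula above, assume each risk_i has already been normalized to the range 0 to 1 and each weight is the maximum number of points that biomarker may deduct. The function name, the biomarker keys, and the weight values are illustrative assumptions, not repowise's actual configuration.

```python
from typing import Mapping


def file_health(risks: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Return 100 - sum(weight_i * risk_i), clamped to the range 0-100.

    risks:   biomarker name -> normalized risk in [0, 1]
    weights: biomarker name -> maximum points that biomarker may deduct
    """
    penalty = sum(
        weights[name] * min(max(risk, 0.0), 1.0)  # guard against out-of-range input
        for name, risk in risks.items()
    )
    return max(0.0, min(100.0, 100.0 - penalty))


# Illustrative values only; a real scorer would calibrate weights per repository.
WEIGHTS = {"cyclomatic_complexity": 20.0, "coupling_out": 15.0, "change_frequency": 15.0}
RISKS = {"cyclomatic_complexity": 0.8, "coupling_out": 0.5, "change_frequency": 0.2}

score = file_health(RISKS, WEIGHTS)
print(f"file health: {score:.1f}")
```

With these made-up inputs the penalty is 16 + 7.5 + 3 = 26.5 points, so the file lands at 73.5. The clamp is what keeps a pathological file from reporting a negative score.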

A practical normalization path

For each biomarker:

  • Map raw values to a percentile or z-score against the repo.
  • Apply thresholds for “safe,” “watch,” and “risk.”
  • Convert the normalized value to a penalty using a curve.
  • Cap extreme penalties so one bad metric does not drown everything else.

That last part is important. If a scorer punishes one giant file too hard, the score becomes a size detector. If it punishes too softly, it hides real risk. The curve needs to be nonlinear enough to catch outliers and smooth enough that a small change in a function does not swing the score by 30 points.
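
One way to sketch the normalization path above: rank a raw value against the rest of the repo, label it with a zone, and map it to a capped quadratic penalty. The threshold values (0.75 and 0.90), the cap of 0.8, and the quadratic shape are assumptions chosen for illustration; any smooth, nonlinear, bounded curve would fit the description.

```python
from bisect import bisect_right
from typing import Sequence


def percentile_rank(value: float, repo_values: Sequence[float]) -> float:
    """Fraction of files in the repo whose raw value is <= value."""
    if not repo_values:
        raise ValueError("repo_values must not be empty")
    ordered = sorted(repo_values)
    return bisect_right(ordered, value) / len(ordered)


def zone(percentile: float, watch: float = 0.75, risk: float = 0.90) -> str:
    """Label a percentile as "safe", "watch", or "risk"."""
    if percentile < watch:
        return "safe"
    return "watch" if percentile < risk else "risk"


def penalty(percentile: float, watch: float = 0.75, cap: float = 0.8) -> float:
    """Zero below the watch threshold, then a quadratic ramp that tops out at cap."""
    if percentile <= watch:
        return 0.0
    t = (min(percentile, 1.0) - watch) / (1.0 - watch)
    return cap * t * t


# Hypothetical cyclomatic complexity values for eight files in one repo.
repo_complexities = [2, 4, 5, 7, 9, 12, 18, 30]
p = percentile_rank(18, repo_complexities)
print(p, zone(p), penalty(p))
```

The quadratic ramp keeps small movements near the threshold cheap while still separating outliers, and the cap keeps any one biomarker from dominating the file score.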

Example

Imagine a 420-line TypeScript file:

  • Complexity: high
  • Duplication: moderate
  • Coupling out: high
  • Change frequency: high
  • Ownership concentration: one person
  • Tests: weak

A file like that should score poorly even if it has decent formatting. Formatting is not maintainability. That is why the biomarkers include structural and historical signals, not just lint-style checks.

If you want to see this kind of output on a real codebase, the FastAPI dependency graph demo shows how structural context feeds later scoring, and the hotspot analysis demo shows why churn belongs in the same conversation as complexity.

[Figure: File Score Breakdown]

Per-file aggregation

A code health scorer is only as good as its file aggregation. If the per-file score is noisy, everything above it becomes noise too.

There are three common ways to aggregate:

1. Weighted average

The simplest approach is to average biomarker penalties with weights. This works if you want transparency and easy debugging.

2. Max-risk bias

Some scorers intentionally bias toward the worst biomarker. That makes sense for files with one glaring issue, such as extreme coupling or extreme size.

3. Hybrid aggregation

This is usually the best choice. Use a weighted average, then add a penalty for any biomarker that crosses a critical threshold.

A hybrid model lets the scorer say:

  • “This file is mostly fine.”
  • “This file has one severe issue.”
  • “This file is structurally unhealthy across several dimensions.”

That last category matters because maintenance pain is often cumulative. A file with moderate complexity, moderate churn, and weak tests can be more expensive than a file with one obvious smell.
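
A hybrid aggregator can be sketched as a weighted average plus a flat surcharge for each biomarker over a critical threshold. The threshold of 0.9 and the 10-point surcharge are assumptions for illustration, and here the weights are relative (they are divided by their sum), which is a different convention from treating weights as point budgets.

```python
from typing import List, Mapping, Tuple


def hybrid_health(
    risks: Mapping[str, float],
    weights: Mapping[str, float],
    critical: float = 0.9,
    critical_penalty: float = 10.0,
) -> Tuple[float, List[str]]:
    """Return (score, critical_biomarkers) for one file.

    The base penalty is the weighted average risk scaled to 0-100. Each
    biomarker at or above `critical` adds a flat `critical_penalty`.
    """
    if not risks:
        raise ValueError("risks must not be empty")
    total_weight = sum(weights[name] for name in risks)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    average_risk = sum(weights[name] * risk for name, risk in risks.items()) / total_weight
    critical_hits = [name for name, risk in risks.items() if risk >= critical]
    total_penalty = 100.0 * average_risk + critical_penalty * len(critical_hits)
    return max(0.0, 100.0 - total_penalty), critical_hits


equal = {"complexity": 1.0, "coupling_out": 1.0, "churn": 1.0}

mostly_fine = hybrid_health({"complexity": 0.1, "coupling_out": 0.1, "churn": 0.1}, equal)
one_severe = hybrid_health({"complexity": 0.1, "coupling_out": 0.95, "churn": 0.1}, equal)
broad_decay = hybrid_health({"complexity": 0.5, "coupling_out": 0.5, "churn": 0.5}, equal)
print(mostly_fine, one_severe, broad_decay)
```

Returning the list of critical biomarkers alongside the score keeps the result explainable: the report can name the severe issue instead of only showing a lower number.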

Why file size is usually a weight, not a hard rule

Large files are not always bad. Generated code is large by design. Parsing logic can be large and still manageable. That is why file size should affect the score, but not dominate it. It is a contributor, not a verdict.

Why normalization is repo-specific

A 150-line function might be huge in one repo and normal in another. A scorer should compare a file against its neighbors first. Absolute thresholds can still exist, but they should be a guardrail, not the whole model.

Module rollup

Once you have file scores, you can roll them up to module, package, or service level. This is where a view into the score’s internals becomes more useful than a dashboard number.

A good module rollup should answer three questions:

  1. Which files dominate the module score?
  2. Is the module score dragged down by one hotspot or by broad decay?
  3. Is the module score hiding a critical dependency hub?

A good rollup formula

A simple average is not enough. You want the larger and more central files to matter more.

A practical rollup might use:

  • file score
  • file size
  • dependency centrality
  • churn
  • internal fan-in

That gives you something like:

module_health = weighted_mean(file_health, file_weight)

Where file_weight is influenced by both size and centrality.
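
A minimal sketch of that rollup, assuming file_weight is the product of a size term and a centrality term. The logarithmic damping, the FileScore type, and the sample files are all illustrative choices rather than a documented repowise formula.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FileScore:
    path: str
    health: float  # file score, 0-100
    lines: int     # file size
    fan_in: int    # number of internal dependents (a simple centrality proxy)


def file_weight(f: FileScore) -> float:
    """Weight grows with size and centrality; logs stop giant files from dominating."""
    size_term = math.log1p(f.lines)
    centrality_term = 1.0 + math.log1p(f.fan_in)
    return size_term * centrality_term


def module_health(files: Sequence[FileScore]) -> float:
    """Weighted mean of file health, using file_weight as the weight."""
    weights = [file_weight(f) for f in files]
    total = sum(weights)
    if total == 0:
        raise ValueError("module has no weighted files")
    return sum(w * f.health for w, f in zip(weights, files)) / total


# Hypothetical module: a small unhealthy hub and a large healthy leaf.
files = [
    FileScore("core/registry.py", health=40.0, lines=120, fan_in=25),
    FileScore("utils/strings.py", health=90.0, lines=800, fan_in=0),
]
print(round(module_health(files), 1))
```

In this example the plain average of the two files is 65, but the weighted result is pulled toward the small hub because 25 other files depend on it.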

Why centrality matters

A small file at the center of many dependencies can be more important than a large leaf file. If that file changes, many callers can break. If it is unhealthy, the blast radius is high. That is one reason a module rollup should include dependency graph data, not just averages.

This is also where code intelligence platforms help more than plain linters. repowise parses dependency graphs across 10+ languages, which lets a module score reflect actual structure instead of directory shape. If you want to see the underlying data model, check our architecture page or see all 8 MCP tools in action on a real codebase.

Trend computation

A static score is fine for a snapshot. A trend is what makes the score operational.

Trend computation should answer:

  • Is the score improving?
  • Is it flat but unstable?
  • Is it degrading slowly?
  • Did a recent change cause a cliff drop?

The basic trend model

At minimum, keep a time series per file and per module.

trend = score_t - score_(t-n)

But that is too coarse by itself. Better systems smooth the line with a rolling average or exponential moving average, then compare current value to baseline.
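
The smoothing step can be sketched with an exponential moving average, comparing the smoothed current value with the smoothed value n snapshots earlier. The alpha of 0.3 and the sample series are assumptions for illustration.

```python
from typing import List, Sequence


def ema(scores: Sequence[float], alpha: float = 0.3) -> List[float]:
    """Exponential moving average; the first value seeds the series."""
    smoothed: List[float] = []
    for score in scores:
        if not smoothed:
            smoothed.append(float(score))
        else:
            smoothed.append(alpha * score + (1.0 - alpha) * smoothed[-1])
    return smoothed


def trend(scores: Sequence[float], n: int, alpha: float = 0.3) -> float:
    """Smoothed score now minus the smoothed score n snapshots ago."""
    if n <= 0 or n >= len(scores):
        raise ValueError("n must be between 1 and len(scores) - 1")
    smoothed = ema(scores, alpha)
    return smoothed[-1] - smoothed[-1 - n]


# Hypothetical snapshots, oldest first: a slow decline versus a one-off dip.
slow_decline = [84, 83, 83, 82, 82, 81]
one_off_dip = [84, 84, 84, 84, 84, 81]
print(trend(slow_decline, n=5), trend(one_off_dip, n=5))
```

Both series end at 81, so the raw difference score_t - score_(t-n) is identical. The smoothed trend is more negative for the slow decline, which matches the intuition that sustained decay is a different signal from a single dip.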

What to track

  • last 7 days
  • last 30 days
  • last 90 days
  • since last release
  • since ownership change

That gives both short-term and medium-term views. A file that dipped yesterday after a refactor should not be treated the same as a file that has been declining for six months.

Why trend beats raw score in review

A PR that drops health from 84 to 81 may be fine if the repo was at 20 last month and climbing. A PR that drops from 84 to 81 after three weeks of decline is different. The trend turns the score into a management tool instead of a badge.

What we deliberately did NOT include

A serious code health algorithm should exclude signals that are easy to measure but hard to defend.

1. Style-only lint rules

Formatting issues can be fixed by automation. They are useful for consistency, but they do not tell you much about long-term maintenance cost.

2. Commit message quality

Interesting for process hygiene. Weak as a code health signal.

3. Subjective labels

Words like “clean,” “messy,” or “ugly” do not help. They are impossible to audit and hard to tune.

4. Raw lines changed in a single PR

Big PRs are not always bad, and tiny PRs are not always safe. What matters is where the change lands and how often that area already moves.

5. Generic repository popularity

Stars, forks, and contributor count may matter for project health, but they do not say much about maintainability inside the codebase.

That last point matters in open source. repowise is AGPL-3.0 licensed, and the AGPL is designed for network software where users interact with a modified version over a network. That keeps the code available to the community in the server setting. (gnu.org)

Tradeoffs vs SonarQube SQALE and CodeScene Code Health

There are two useful comparison points here: SonarQube’s SQALE-style maintainability model and CodeScene’s Code Health.

SonarQube SQALE

SQALE is centered on technical debt and remediation effort. It gives teams a maintainability view that maps issues to estimated effort, which is useful if your main question is “how much work is behind this issue set?” SonarQube’s public docs organize maintainability around debt ratios, remediation estimates, and quality gates. (codescene.com)

CodeScene Code Health

CodeScene’s Code Health is more opinionated about code-level risk. The product says the metric is aggregated from 25+ factors and is used alongside hotspots to prioritize refactoring candidates. It also stresses that the score is weighted and normalized, not just a raw issue count. (codescene.com)

How the tradeoffs look in practice

| Dimension | SQALE-style debt | CodeScene-style Code Health | 12-biomarker scorer |
|---|---|---|---|
| Main question | How much effort to fix issues? | How healthy is this code, structurally and historically? | How healthy is this file or module across 12 measurable signals? |
| Strength | Clear remediation framing | Strong prioritization with hotspots | Easier to explain and implement in-house |
| Weakness | Can become issue-count heavy | Can feel proprietary or opaque | Needs careful calibration |
| Best use | Quality gates and technical debt accounting | Refactoring prioritization | Repo intelligence, PR review, and internal dashboards |

The practical choice is not “which is perfect.” It is “which model matches the decisions your team needs to make.” If you need a lightweight internal system that you can audit and extend, a 12-biomarker scorer is a good fit.

Open-source implementation walkthrough

Here is how I would build the scorer in an open-source repo intelligence platform.

Step 1: Parse the codebase

Walk the repo, parse files, and build symbol indexes. For supported languages, extract imports, declarations, and references. repowise already supports Python, TypeScript, JavaScript, Go, Rust, Java, C++, C, Ruby, and Kotlin, so the same pipeline can feed the score in mixed-language repos.

Step 2: Compute the 12 biomarkers

For each file:

  • complexity metrics from AST traversal
  • duplication from token or block similarity
  • coupling from import and call edges
  • churn from git history
  • ownership from author concentration
  • test proximity from file path and coverage data
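
Two of these measurements are small enough to sketch for Python source using only the standard library. The complexity count is a simplified approximation (one plus the number of decision points), not a full implementation of any published metric, and the ownership measure assumes a list of commit author names has already been extracted from git history.

```python
import ast
from collections import Counter
from typing import Sequence

# Node types treated as decision points in this simplified count.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.comprehension)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + decision points found in the AST."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


def ownership_concentration(authors: Sequence[str]) -> float:
    """Share of commits made by the most active author (1.0 means a single owner)."""
    if not authors:
        raise ValueError("authors must not be empty")
    return Counter(authors).most_common(1)[0][1] / len(authors)


SAMPLE = '''
def f(x):
    if x > 0 and x < 10:
        return 1
    for i in range(x):
        pass
    return 0
'''

print(cyclomatic_complexity(SAMPLE))
print(ownership_concentration(["ana", "ana", "ana", "ben"]))
```

The sample function counts one `if`, one `and`, and one `for` on top of the base path, giving 4. Other languages need their own parsers, but the shape of the traversal stays the same.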

Step 3: Normalize against the repo

Use repo baselines, not fixed global thresholds. A scorer should learn the shape of the repository it is judging.

Step 4: Aggregate to file and module scores

Combine the biomarkers into a file score, then roll that score upward with dependency-aware weights.

Step 5: Store history

Persist score snapshots so you can chart trends, spot regressions, and show before/after deltas in PRs.
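
A minimal sketch of snapshot storage using SQLite from the Python standard library. The table name, the columns, and keying snapshots by commit SHA are assumptions for illustration; the article does not specify how repowise persists history.

```python
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the snapshot store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS score_snapshots (
            path       TEXT NOT NULL,
            commit_sha TEXT NOT NULL,
            taken_at   TEXT NOT NULL,  -- ISO 8601 timestamp
            health     REAL NOT NULL,
            PRIMARY KEY (path, commit_sha)
        )
        """
    )
    return conn


def record(conn: sqlite3.Connection, path: str, commit_sha: str, taken_at: str, health: float) -> None:
    """Insert or overwrite the snapshot for one file at one commit."""
    conn.execute(
        "INSERT OR REPLACE INTO score_snapshots VALUES (?, ?, ?, ?)",
        (path, commit_sha, taken_at, health),
    )
    conn.commit()


def delta(conn: sqlite3.Connection, path: str, before_sha: str, after_sha: str) -> Optional[float]:
    """Before/after change in health for one file, or None if a snapshot is missing."""
    rows = dict(
        conn.execute(
            "SELECT commit_sha, health FROM score_snapshots "
            "WHERE path = ? AND commit_sha IN (?, ?)",
            (path, before_sha, after_sha),
        ).fetchall()
    )
    if before_sha not in rows or after_sha not in rows:
        return None
    return rows[after_sha] - rows[before_sha]


conn = open_store()
record(conn, "src/app.py", "abc123", "2024-01-01T00:00:00Z", 84.0)
record(conn, "src/app.py", "def456", "2024-01-08T00:00:00Z", 81.0)
print(delta(conn, "src/app.py", "abc123", "def456"))
```

Keying on path and commit makes re-runs idempotent, and the same table can feed both the PR delta and the longer trend windows.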

Step 6: Expose it through tools

This is where MCP helps. MCP servers expose tools with defined schemas, and the current spec defines tool discovery and versioned protocol behavior. That makes it a good fit for agent-facing score lookups, risk summaries, and dependency paths. (modelcontextprotocol.io)
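
To make the tool shape concrete, here is a sketch of what a score lookup could look like as an MCP tool. The descriptor uses the fields MCP tool definitions carry (name, description, and a JSON Schema inputSchema), and the handler returns a result with a content list and an isError flag. The tool name get_file_health, the in-memory SCORES table, and the handler are hypothetical; they are not one of repowise's actual tools, and no MCP SDK is used here.

```python
import json
from typing import Any, Dict

# Hypothetical tool descriptor, in the shape a server would return from tool discovery.
GET_FILE_HEALTH_TOOL: Dict[str, Any] = {
    "name": "get_file_health",
    "description": "Return the health score and top risk biomarkers for one file.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Repo-relative file path"},
        },
        "required": ["path"],
    },
}

# Stand-in for the real score store; values are made up.
SCORES: Dict[str, Dict[str, Any]] = {
    "src/app.py": {"health": 73.5, "top_risks": ["cyclomatic_complexity", "coupling_out"]},
}


def handle_get_file_health(arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a tool-call result: text content plus an error flag."""
    path = arguments.get("path")
    if path not in SCORES:
        return {
            "content": [{"type": "text", "text": f"No score recorded for {path!r}"}],
            "isError": True,
        }
    return {
        "content": [{"type": "text", "text": json.dumps(SCORES[path])}],
        "isError": False,
    }


result = handle_get_file_health({"path": "src/app.py"})
print(result["content"][0]["text"])
```

Because the input schema is explicit, an agent can ask for exactly one file's score and its top risks rather than receiving a whole report.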

If you want a concrete starting point, try repowise on your own repo — the MCP server is configured automatically — or use pip install repowise && repowise init.

[Figure: Module Rollup and Trend View]

FAQ

How is code health calculated?

A typical code health scorer calculates a file score by normalizing several biomarkers, weighting them, and converting the result into a fixed score range such as 0–100. The important part is not the exact formula. It is the calibration: the same raw metric can mean different things in different repos.

What are the 12 biomarkers in a code health scorer?

A practical 12-biomarker set includes cyclomatic complexity, cognitive complexity, file size, function count, parameter count, duplication, coupling out, coupling in, change frequency, recent churn, ownership concentration, and a test coverage proxy. The exact list can vary, but each biomarker should be measurable and explainable.

Why not use only cyclomatic complexity?

Because one metric misses too much. A file can have moderate complexity and still be high risk if it changes often, sits at the center of the graph, and has weak tests. A useful scorer combines structural and historical signals.

How does a module score differ from a file score?

A file score looks at one unit. A module score rolls multiple file scores together, usually with weights for size, centrality, and churn. This helps surface hidden risk in a package that looks fine at the file level but is unhealthy overall.

Why track the trend as well as the score?

Trends show direction. A flat score can hide decay, while a slightly lower score may still be acceptable if the repo is improving. Trend lines make the scorer useful for PR review and release tracking.

How does this differ from SonarQube or CodeScene?

SonarQube’s maintainability model is centered on technical debt and remediation effort, while CodeScene’s Code Health emphasizes aggregated code-level health and hotspot-aware prioritization. A 12-biomarker scorer can borrow the useful parts of both ideas while staying smaller, more transparent, and easier to adapt. (codescene.com)

Is the Model Context Protocol relevant here?

Yes. MCP gives you a standard way to expose tool-backed repo intelligence to AI agents. The spec defines tools, discovery, and versioned protocol identifiers, which makes it a good fit for code health lookups and dependency queries. (modelcontextprotocol.io)

Can I build this in an open-source repo?

Yes. AGPL-3.0 is a common choice for server software that you want to keep open over the network, and repowise uses that license. The practical benefit is simple: if the scorer becomes a shared internal service, the implementation remains inspectable and modifiable. (gnu.org)

If you want to see the outputs before installing anything, start with the live examples and the auto-generated docs for FastAPI.

Try repowise on your repo

One command indexes your codebase.