How a 25-Marker Code Health Scorer Actually Works

repowise team · May 20, 2026 · 15 min read

code health scorer · 25 markers · code health algorithm · how code health is calculated · health score internals

Why one number isn’t enough

A single health number is only useful if it is built from signals that map to real maintenance cost. A file can look clean and still sit at the center of churn. Another file can be ugly but stable enough that it belongs low on your list. A useful score needs to account for structure, ownership, change history, and the parts of the codebase that people actually touch. That is the point of the 25-marker model: reduce a messy set of maintenance signals into one score without pretending the score is magic. CodeScene’s public product docs make a similar argument, saying its Code Health is an aggregated metric built from 25+ factors and that a single KPI is not enough on its own. (codescene.com)

That framing also fits how the Model Context Protocol is used in agent tools today: MCP servers expose structured capabilities and let clients ask for exactly the data they need, rather than stuffing a model with a blob of text. MCP’s spec defines tools as named actions with schemas, and current spec pages describe versioned protocol identifiers and tool discovery. (modelcontextprotocol.io)

For a repo intelligence system like repowise, that matters. If you want the score to be useful in code review, architecture work, or refactoring triage, you need the score to be explainable at file level, roll up cleanly at module level, and change in a predictable way over time. See what repowise generates on real repos in our live examples, or check repowise’s architecture if you want to see how the data flows end to end. For the full picture, see the complete code-health guide.

Figure: Code Health Score Pipeline

The 25 markers and why each is on the list

The 25 markers are not random code smells. They are chosen because each one can be measured automatically and each one tends to correlate with maintainability pain. A good code health algorithm avoids subjective labels like “messy” and instead asks concrete questions: how complex is this file, how much does it depend on others, how many paths does it expose, how often does it change, and how old is the last cleanup?

The table below shows 12 of the markers I use when I design a scorer like this.

Marker	What it measures	Why it matters
1. Cyclomatic complexity	Branching paths in a function or file	Harder to reason about, harder to test
2. Cognitive complexity	Nested control flow and mental load	Captures “this feels hard” better than raw branch count
3. File size	Lines or tokens per file	Large files hide unrelated behavior
4. Function count	How much is packed into one unit	High density often means poor separation
5. Parameter count	API surface per function	Too many inputs often signal muddled responsibilities
6. Duplication	Repeated blocks or near-duplicates	Copies diverge and raise bug risk
7. Coupling out	Imports, calls, or references to other units	High fan-out makes changes expensive
8. Coupling in	Dependents or reverse references	Central files carry more blast radius
9. Change frequency	How often the file changed in git history	Frequent edits can indicate churn or instability
10. Recent churn	Change volume in a recent window	Finds active hot spots, not just old trouble
11. Ownership concentration	How many people touch it	Bus factor and review risk
12. Test coverage proxy	Presence of tests or nearby test linkage	Low-confidence code should score lower

A scorer can use more than one signal per marker. For example, “coupling” can come from dependency edges, call graphs, or co-change history. “Ownership concentration” can come from git authorship and recent reviewer patterns. “Test coverage proxy” can be a real coverage report if the repo has one, or a weaker heuristic when it does not.
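As one illustration of turning a marker into a number, ownership concentration could be sketched as the share of commits made by the most active author. The function name and the input shape (one author name per commit that touched the file) are assumptions for this example, not repowise's API.

```python
from collections import Counter


def ownership_concentration(commit_authors: list[str]) -> float:
    """Share of commits by the single most active author, in [0, 1].

    1.0 means one person wrote every commit (highest bus-factor risk).
    Returns 0.0 for a file with no recorded commits.
    """
    if not commit_authors:
        return 0.0
    counts = Counter(commit_authors)
    top_author_commits = counts.most_common(1)[0][1]
    return top_author_commits / len(commit_authors)
```

A value near 1.0 would feed the marker as high risk; reviewer patterns could be blended in as a second signal in the same way.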

The key design choice is this: each marker should be individually explainable. If a file scores poorly, the report should say why. “High coupling out” is useful. “Bad vibes” is not.

CodeScene’s public docs describe Code Health as an aggregated metric built from a set of factors scanned from source code, and their product pages note that Code Health is weighted and normalized across codebases. That is the same shape of problem, even if the factor set differs. (codescene.com)

How code health is calculated at file level

A file-level score is usually the first place people look, so the formula needs to be stable and boring. The common pattern is:

Measure each marker on a normalized scale.
Convert each marker into a risk contribution.
Weight the contributions.
Aggregate to one file score.
Clamp the result to a fixed range, often 0–100.

A simple version looks like this:

file_health = 100 - Σ(weight_i × risk_i)

Where risk_i is already normalized. That normalization step matters more than the formula. A raw cyclomatic complexity of 18 means very different things in a 40-line helper versus a 2,000-line orchestration file. Good scoring systems normalize against file type, language, and sometimes repository baseline.
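The formula above can be sketched in a few lines of Python. The marker names, weights, and risk values here are hypothetical and chosen only to make the arithmetic easy to follow; a real scorer would calibrate them.

```python
def file_health(risks: dict[str, float], weights: dict[str, float]) -> float:
    """Compute 100 - sum(weight_i * risk_i), clamped to [0, 100].

    risks:   marker name -> normalized risk in [0, 1]
    weights: marker name -> penalty points applied at full risk
    """
    penalty = sum(weights[name] * risk for name, risk in risks.items())
    return max(0.0, min(100.0, 100.0 - penalty))


# Hypothetical weights and risks, for illustration only.
weights = {"complexity": 30.0, "coupling_out": 20.0, "churn": 25.0}
risks = {"complexity": 0.8, "coupling_out": 0.5, "churn": 0.2}
score = file_health(risks, weights)  # 100 - (24 + 10 + 5), about 61
```

The clamp matters once weights sum to more than 100: without it, a file that is bad on every marker would go negative.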

A practical normalization path

For each marker:

Map raw values to a percentile or z-score against the repo.
Apply thresholds for “safe,” “watch,” and “risk.”
Convert the normalized value to a penalty using a curve.
Cap extreme penalties so one bad metric does not drown everything else.

That last part is important. If a scorer punishes one giant file too hard, the score becomes a size detector. If it punishes too softly, it hides real risk. The curve needs to be nonlinear enough to catch outliers and smooth enough that a small change in a function does not swing the score by 30 points.
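A minimal sketch of that normalization path, assuming a percentile rank against the repo and a piecewise-linear penalty curve. The thresholds (0.75 for "watch", 0.90 for "risk") and the cap of 0.9 are illustrative assumptions, not calibrated values.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(value: float, repo_values: list[float]) -> float:
    """Fraction of repo values below `value` (ties count half), in [0, 1]."""
    ordered = sorted(repo_values)
    if not ordered:
        return 0.0
    below = bisect_left(ordered, value)
    at_or_below = bisect_right(ordered, value)
    return (below + at_or_below) / (2 * len(ordered))


def penalty(percentile: float, watch: float = 0.75, risk: float = 0.90,
            cap: float = 0.9) -> float:
    """Map a percentile to a penalty in [0, cap].

    Up to `watch`: no penalty ("safe").
    Between `watch` and `risk`: gentle linear ramp up to 0.3.
    Above `risk`: steeper ramp, then capped so one marker cannot dominate.
    """
    if percentile <= watch:
        return 0.0
    if percentile <= risk:
        return 0.3 * (percentile - watch) / (risk - watch)
    steep = 0.3 + 0.7 * (percentile - risk) / (1.0 - risk)
    return min(cap, steep)
```

The curve is continuous at both thresholds, so a small change in a raw metric moves the penalty by a small amount, while the steeper upper segment still separates outliers from the merely above-average.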

Example

Imagine a 420-line TypeScript file:

Complexity: high
Duplication: moderate
Coupling out: high
Change frequency: high
Ownership concentration: one person
Tests: weak

A file like that should score poorly even if it has decent formatting. Formatting is not maintainability. That is why the markers include structural and historical signals, not just lint-style checks.

If you want to see this kind of output on a real codebase, the FastAPI dependency graph demo shows how structural context feeds later scoring, and the hotspot analysis demo shows why churn belongs in the same conversation as complexity.

Figure: File Score Breakdown

Per-file aggregation

A code health scorer is only as good as its file aggregation. If the per-file score is noisy, everything above it becomes noise too.

There are three common ways to aggregate:

1. Weighted average

The simplest approach is to average marker penalties with weights. This works if you want transparency and easy debugging.

2. Max-risk bias

Some scorers intentionally bias toward the worst marker. That makes sense for files with one glaring issue, such as extreme coupling or extreme size.

3. Hybrid aggregation

This is usually the best choice. Use a weighted average, then add a penalty for any marker that crosses a critical threshold.

A hybrid model lets the scorer say:

“This file is mostly fine.”
“This file has one severe issue.”
“This file is structurally unhealthy across several dimensions.”

That last category matters because maintenance pain is often cumulative. A file with moderate complexity, moderate churn, and weak tests can be more expensive than a file with one obvious smell.
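One way to sketch the hybrid approach: take a weighted average of marker risks, then subtract a fixed penalty for every marker above a critical threshold. Here the weights act as relative importance rather than penalty points, and the threshold (0.9) and per-marker penalty (10 points) are assumptions for illustration.

```python
def hybrid_file_health(risks: dict[str, float], weights: dict[str, float],
                       critical: float = 0.9,
                       critical_penalty: float = 10.0) -> float:
    """Weighted-average score with an extra penalty per critical marker.

    risks:   marker name -> normalized risk in [0, 1]
    weights: marker name -> relative importance (any positive scale)
    """
    if not risks:
        return 100.0
    total_weight = sum(weights[name] for name in risks)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_risk = sum(weights[name] * risk
                        for name, risk in risks.items()) / total_weight
    # Count markers that cross the critical threshold.
    critical_hits = sum(1 for risk in risks.values() if risk >= critical)
    score = 100.0 * (1.0 - weighted_risk) - critical_penalty * critical_hits
    return max(0.0, min(100.0, score))
```

With equal weights, two markers at 0.5 give 50, while one marker at 1.0 and one at 0.0 give 40: the same average risk, but the single severe issue is called out.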

Why file size is usually a weight, not a hard rule

Large files are not always bad. Generated code is large by design. Parsing logic can be large and still manageable. That is why file size should affect the score, but not dominate it. It is a contributor, not a verdict.

Why normalization is repo-specific

A 150-line function might be huge in one repo and normal in another. A scorer should compare a file against its neighbors first. Absolute thresholds can still exist, but they should be a guardrail, not the whole model.

Module rollup

Once you have file scores, you can roll them up to module, package, or service level. This is where a view into the score's internals becomes more useful than a single dashboard number.

A good module rollup should answer three questions:

Which files dominate the module score?
Is the module score dragged down by one hotspot or by broad decay?
Is the module score hiding a critical dependency hub?

A good rollup formula

A simple average is not enough. You want the larger and more central files to matter more.

A practical rollup might use:

file score
file size
dependency centrality
churn
internal fan-in

That gives you something like:

module_health = weighted_mean(file_health, file_weight)

Where file_weight is influenced by both size and centrality.
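A minimal sketch of that rollup, assuming a weight that grows logarithmically with file size and linearly with fan-in. The `FileStats` type and the weight formula are hypothetical; churn could be folded into the weight the same way.

```python
import math
from dataclasses import dataclass


@dataclass
class FileStats:
    health: float  # file score, 0-100
    lines: int     # file size in lines
    fan_in: int    # number of files that depend on this one


def file_weight(f: FileStats) -> float:
    """Hypothetical weight: grows slowly with size, faster with centrality."""
    return math.log1p(f.lines) * (1 + f.fan_in)


def module_health(files: list[FileStats]) -> float:
    """Weighted mean of file scores; an empty module is treated as healthy."""
    total = sum(file_weight(f) for f in files)
    if total == 0:
        return 100.0
    return sum(f.health * file_weight(f) for f in files) / total
```

For example, a hub file scoring 40 with nine dependents and a leaf file scoring 90 of the same size roll up to roughly 44.5, well below the plain average of 65.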

Why centrality matters

A small file at the center of many dependencies can be more important than a large leaf file. If that file changes, many callers can break. If it is unhealthy, the blast radius is high. That is one reason a module rollup should include dependency graph data, not just averages.
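Fan-in is the simplest centrality signal to derive from a dependency graph. The sketch below assumes the graph is available as (importer, imported) pairs of file paths; that edge format is an assumption for this example.

```python
from collections import defaultdict


def fan_in(edges: list[tuple[str, str]]) -> dict[str, int]:
    """Count distinct dependents per file from (importer, imported) edges.

    Duplicate edges and self-imports are ignored.
    """
    dependents: dict[str, set[str]] = defaultdict(set)
    for importer, imported in edges:
        if importer != imported:
            dependents[imported].add(importer)
    return {path: len(deps) for path, deps in dependents.items()}
```

Files that never appear as an import target are simply absent from the result, which callers can treat as a fan-in of zero.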

This is also where code intelligence platforms help more than plain linters. repowise parses dependency graphs across 10+ languages, which lets a module score reflect actual structure instead of directory shape. If you want to see the underlying data model, check our architecture page or see all 9 MCP tools in action on a real codebase.

Trend computation

A static score is fine for a snapshot. A trend is what makes the score operational.

Trend computation should answer:

Is the score improving?
Is it flat but unstable?
Is it degrading slowly?
Did a recent change cause a cliff drop?

The basic trend model

At minimum, keep a time series per file and per module.

trend = score_t - score_(t-n)

But that is too coarse by itself. Better systems smooth the line with a rolling average or exponential moving average, then compare current value to baseline.
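As a sketch of the smoothing step, the following applies an exponential moving average to a score series and then takes the n-step difference on the smoothed line. The smoothing factor of 0.3 is an arbitrary illustrative default.

```python
from typing import Optional


def ema(scores: list[float], alpha: float = 0.3) -> list[float]:
    """Exponential moving average; the first point seeds the series."""
    smoothed: list[float] = []
    for s in scores:
        if not smoothed:
            smoothed.append(s)
        else:
            smoothed.append(alpha * s + (1 - alpha) * smoothed[-1])
    return smoothed


def trend(scores: list[float], n: int, alpha: float = 0.3) -> Optional[float]:
    """Smoothed score_t minus smoothed score_(t-n).

    Returns None when there is not enough history to look back n steps.
    """
    if n < 1 or len(scores) <= n:
        return None
    smoothed = ema(scores, alpha)
    return smoothed[-1] - smoothed[-1 - n]
```

A larger alpha reacts faster to recent snapshots; a smaller one damps single-commit noise at the cost of lag.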

What to track

last 7 days
last 30 days
last 90 days
since last release
since ownership change

That gives both short-term and medium-term views. A file that dipped yesterday after a refactor should not be treated the same as a file that has been declining for six months.
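The day-based windows can be sketched as a lookup over dated snapshots: for each window, compare the latest score with the most recent score recorded at or before the window start. The function name and the (date, score) history format are assumptions.

```python
from datetime import date, timedelta
from typing import Optional


def window_deltas(history: list[tuple[date, float]], today: date,
                  windows: tuple[int, ...] = (7, 30, 90)
                  ) -> dict[int, Optional[float]]:
    """Current score minus the baseline score for each window, in days.

    The baseline is the last snapshot taken at or before (today - window).
    A window with no snapshot that old maps to None.
    """
    ordered = sorted(history)
    if not ordered:
        return {days: None for days in windows}
    current = ordered[-1][1]
    deltas: dict[int, Optional[float]] = {}
    for days in windows:
        cutoff = today - timedelta(days=days)
        baseline = None
        for day, score in ordered:
            if day <= cutoff:
                baseline = score
            else:
                break
        deltas[days] = None if baseline is None else current - baseline
    return deltas
```

A file can show a negative 7-day delta and a positive 90-day delta at the same time, which is exactly the "dipped yesterday but improving" case described above. Event-based windows such as "since last release" would use a release date as the cutoff instead of a day count.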

Why trend beats raw score in review

A PR that drops health from 84 to 81 may be fine if the repo was at 20 last month and climbing. A PR that drops from 84 to 81 after three weeks of decline is different. The trend turns the score into a management tool instead of a badge.

What we deliberately did NOT include

A serious code health algorithm should exclude signals that are easy to measure but hard to defend.

1. Style-only lint rules

Formatting issues can be fixed by automation. They are useful for consistency, but they do not tell you much about long-term maintenance cost.

2. Commit message quality

Interesting for process hygiene. Weak as a code health signal.

3. Subjective labels

Words like “clean,” “messy,” or “ugly” do not help. They are impossible to audit and hard to tune.

4. Raw lines changed in a single PR

Big PRs are not always bad, and tiny PRs are not always safe. What matters is where the change lands and how often that area already moves.

5. Generic repository popularity

Stars, forks, and contributor count may matter for project health, but they do not say much about maintainability inside the codebase.

That last point matters in open source. repowise is AGPL-3.0 licensed, and the AGPL is designed for network software where users interact with a modified version over a network. That keeps the code available to the community in the server setting. (gnu.org)

Tradeoffs vs SonarQube SQALE and CodeScene Code Health

There are two useful comparison points here: SonarQube’s SQALE-style maintainability model and CodeScene’s Code Health.

SonarQube SQALE

SQALE is centered on technical debt and remediation effort. It gives teams a maintainability view that maps issues to estimated effort, which is useful if your main question is “how much work is behind this issue set?” SonarQube’s public docs organize maintainability around debt ratios, remediation estimates, and quality gates. (codescene.com)

CodeScene Code Health

CodeScene’s Code Health is more opinionated about code-level risk. The product says the metric is aggregated from 25+ factors and is used alongside hotspots to prioritize refactoring candidates. It also stresses that the score is weighted and normalized, not just a raw issue count. (codescene.com)

How the tradeoffs look in practice

Dimension	SQALE-style debt	CodeScene-style Code Health	25-marker scorer
Main question	How much effort to fix issues?	How healthy is this code, structurally and historically?	How healthy is this file or module across 25 measurable signals?
Strength	Clear remediation framing	Strong prioritization with hotspots	Easier to explain and implement in-house
Weakness	Can become issue-count heavy	Can feel proprietary or opaque	Needs careful calibration
Best use	Quality gates and technical debt accounting	Refactoring prioritization	Repo intelligence, PR review, and internal dashboards

The practical choice is not “which is perfect.” It is “which model matches the decisions your team needs to make.” If you need a lightweight internal system that you can audit and extend, a 25-marker scorer is a good fit.

Open-source implementation walkthrough

Here is how I would build the scorer in an open-source repo intelligence platform.

Step 1: Parse the codebase

Walk the repo, parse files, and build symbol indexes. For supported languages, extract imports, declarations, and references. repowise already supports Python, TypeScript, JavaScript, Go, Rust, Java, C++, C, Ruby, and Kotlin, so the same pipeline can feed the score in mixed-language repos.

Step 2: Compute the 25 markers

For each file:

complexity metrics from AST traversal
duplication from token or block similarity
coupling from import and call edges
churn from git history
ownership from author concentration
test proximity from file path and coverage data
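To make the first item concrete, here is a rough cyclomatic complexity count for Python source using the standard `ast` module. It is an approximation (1 plus the number of decision points) and covers Python only; other languages would need their own parsers, and the set of node types counted here is a simplifying assumption.

```python
import ast

# Node types treated as one decision point each (a simplified set).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.AsyncFor,
                 ast.ExceptHandler, ast.IfExp, ast.comprehension)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` has three operands, adding two extra paths.
            complexity += len(node.values) - 1
    return complexity
```

Running this per function instead of per file only requires walking each `ast.FunctionDef` subtree separately.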

Step 3: Normalize against the repo

Use repo baselines, not fixed global thresholds. A scorer should learn the shape of the repository it is judging.

Step 4: Aggregate to file and module scores

Combine the markers into a file score, then roll that score upward with dependency-aware weights.

Step 5: Store history

Persist score snapshots so you can chart trends, spot regressions, and show before/after deltas in PRs.
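A minimal sketch of snapshot storage using Python's built-in `sqlite3`. The table name, columns, and helper functions are hypothetical; a production store would likely also keep per-marker values.

```python
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the snapshot store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS health_snapshots (
               commit_sha TEXT NOT NULL,
               file_path  TEXT NOT NULL,
               taken_at   TEXT NOT NULL,  -- ISO 8601 timestamp
               score      REAL NOT NULL,
               PRIMARY KEY (commit_sha, file_path)
           )"""
    )
    return conn


def record(conn: sqlite3.Connection, commit_sha: str, file_path: str,
           taken_at: str, score: float) -> None:
    """Insert or overwrite one file's score at one commit."""
    conn.execute(
        "INSERT OR REPLACE INTO health_snapshots VALUES (?, ?, ?, ?)",
        (commit_sha, file_path, taken_at, score),
    )
    conn.commit()


def delta(conn: sqlite3.Connection, file_path: str,
          before_sha: str, after_sha: str) -> Optional[float]:
    """Score change between two commits, or None if either is missing."""
    rows = dict(conn.execute(
        "SELECT commit_sha, score FROM health_snapshots "
        "WHERE file_path = ? AND commit_sha IN (?, ?)",
        (file_path, before_sha, after_sha),
    ).fetchall())
    if before_sha not in rows or after_sha not in rows:
        return None
    return rows[after_sha] - rows[before_sha]
```

Keying on commit and path keeps re-runs idempotent, and `delta` gives the before/after number a PR comment would show.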

Step 6: Expose it through tools

This is where MCP helps. MCP servers expose tools with defined schemas, and the current spec defines tool discovery and versioned protocol behavior. That makes it a good fit for agent-facing score lookups, risk summaries, and dependency paths. (modelcontextprotocol.io)
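To show the shape of such a tool without depending on any SDK, the sketch below writes a tool definition as a plain dictionary with the `name`, `description`, and `inputSchema` fields that MCP tool listings use, plus a handler that returns a text content result. The tool name `get_file_health`, the in-memory score table, and the handler are hypothetical and are not one of repowise's actual tools; wiring this into a real MCP server is left out.

```python
import json

# Hypothetical tool definition; inputSchema is ordinary JSON Schema.
GET_FILE_HEALTH_TOOL = {
    "name": "get_file_health",
    "description": "Return the health score and marker breakdown for one file.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Repo-relative file path"},
        },
        "required": ["path"],
    },
}

# Stand-in for the real score store (illustrative data only).
_SCORES = {"src/app.py": {"score": 61.0, "markers": {"complexity": 0.8}}}


def call_get_file_health(arguments: dict) -> dict:
    """Handle a tool call and return an MCP-style result with text content."""
    path = arguments.get("path")
    found = _SCORES.get(path)
    if found is None:
        return {
            "content": [{"type": "text", "text": f"No score for {path!r}"}],
            "isError": True,
        }
    return {
        "content": [{"type": "text", "text": json.dumps(found)}],
        "isError": False,
    }
```

Because the schema names the one required argument, an agent can discover the tool and request a single file's score instead of being handed the whole report.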

If you want a concrete starting point, try repowise on your own repo — the MCP server is configured automatically — or use pip install repowise && repowise init.

Figure: Module Rollup and Trend View

FAQ

How is code health calculated?

A typical code health scorer calculates a file score by normalizing several markers, weighting them, and converting the result into a fixed score range such as 0–100. The important part is not the exact formula. It is the calibration: the same raw metric can mean different things in different repos.

What are the 25 markers in a code health scorer?

A practical 25-marker set includes complexity, file size, function count, parameter count, duplication, coupling in, coupling out, change frequency, recent churn, ownership concentration, test proxy, and dependency centrality. The exact list can vary, but each marker should be measurable and explainable.

Why not use only cyclomatic complexity?

Because one metric misses too much. A file can have moderate complexity and still be high risk if it changes often, sits at the center of the graph, and has weak tests. A useful scorer combines structural and historical signals.

How does a module score differ from a file score?

A file score looks at one unit. A module score rolls multiple file scores together, usually with weights for size, centrality, and churn. This helps surface hidden risk in a package that looks fine at the file level but is unhealthy overall.

How do trends help more than a single score?

Trends show direction. A flat score can hide decay, while a slightly lower score may still be acceptable if the repo is improving. Trend lines make the scorer useful for PR review and release tracking.

How does this differ from SonarQube or CodeScene?

SonarQube’s maintainability model is centered on technical debt and remediation effort, while CodeScene’s Code Health emphasizes aggregated code-level health and hotspot-aware prioritization. A 25-marker scorer can borrow the useful parts of both ideas while staying smaller, more transparent, and easier to adapt. (codescene.com)

Is the Model Context Protocol relevant here?

Yes. MCP gives you a standard way to expose tool-backed repo intelligence to AI agents. The spec defines tools, discovery, and versioned protocol identifiers, which makes it a good fit for code health lookups and dependency queries. (modelcontextprotocol.io)

Can I build this in an open-source repo?

Yes. AGPL-3.0 is a common choice for server software that you want to keep open over the network, and repowise uses that license. The practical benefit is simple: if the scorer becomes a shared internal service, the implementation remains inspectable and modifiable. (gnu.org)

If you want to see the outputs before installing anything, start with the live examples and the auto-generated docs for FastAPI.