Benchmark proof

How task packets hold up on public repos.

These scorecards turn the benchmark suite into a product readout. The question is not just whether Cartograph emits a packet. The question is whether the packet changes what a developer or coding agent reads next.

At a glance

Where Cartograph is already strong, and where it is still being refined.

Strong today

Concrete bug-fix packets with explicit change surfaces
Mixed-language repos when the task names the real adapter or converter path
Small-repo ergonomics now that `analyze --static --json` stays compact by default

Still being refined

Trace-flow validation targeting in large monorepos
Broad task packets that span frontend and backend simultaneously
Framework-heavy repos where many tests match the same generic terms

Scorecards

Four benchmark cases that show the current shape of the product.

Strong

`llama.cpp` bug-fix

The packet stays on the Dots OCR adapter and GGUF conversion surface, then lands on the exact Dots OCR GGUF test as the first validation target.

Good with caveats

`fastapi` bug-fix

Source-side focus is good. The packet prioritizes dependency and exception surfaces correctly, but test targeting still drifts toward less relevant tests.

Open benchmark JSON

Good with caveats

`next.js` trace-flow

The packet stays on router and route code instead of collapsing into fixtures, but the first validation targets are still more e2e-heavy than ideal.

Open benchmark JSON

Useful, not clean enough yet

`open-webui` task

Backend state and Redis surfaces rank highly, but broad task packets still admit generic structural files like `README.md` too easily.

Open benchmark JSON
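
To make the `README.md` problem concrete, here is a minimal sketch of one way a ranked read list could demote generic structural files. It is not Cartograph's actual ranking logic: the `demote_generic` helper, the `GENERIC_NAMES` set, the penalty value, and the example paths and scores are all hypothetical and chosen only for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical set of file names treated as "generic structural" files.
GENERIC_NAMES = {"readme.md", "license", "changelog.md", "contributing.md"}


def demote_generic(ranked, penalty=0.5):
    """Return a re-sorted copy of (path, score) pairs.

    Scores of files whose base name is in GENERIC_NAMES are multiplied
    by `penalty` before sorting, so they sink below task-specific files.
    """
    adjusted = []
    for path, score in ranked:
        name = PurePosixPath(path).name.lower()
        if name in GENERIC_NAMES:
            score *= penalty
        adjusted.append((path, score))
    # Highest score first.
    return sorted(adjusted, key=lambda item: item[1], reverse=True)


# Illustrative input only: these paths and scores are made up.
ranked = [
    ("README.md", 0.82),
    ("backend/state.py", 0.80),
    ("backend/redis_client.py", 0.75),
]
result = demote_generic(ranked)
for path, score in result:
    print(f"{score:.2f}  {path}")
```

A multiplicative penalty keeps a generic file in the list when nothing else scores well, rather than filtering it out entirely, which suits broad tasks where a README occasionally is the right next read.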

What it means

Why this matters as a product signal.

Small repos

Cartograph should get out of the way.

Compact `analyze` output is the right default. Direct reads are often faster, and the product should acknowledge that.

Medium repos

`packet` and `context` start paying off.

This is where the product shifts from “interesting” to “time saver.”

Large repos

`analyze` becomes triage and handoff infrastructure.

The product value is not a raw summary. It is cutting the next read list down to the files worth opening first.