METHODOLOGY · AI ACCURACY
How we measure window-counting accuracy.
The accuracy claims on the pricing page are grounded in a real benchmark, not vibes. This page documents how we measure, what the dataset looks like, and where the numbers come from.
The dataset
The dataset lives at benchmarks/window-count-v1.csv in the public GitHub repo: 50 UK addresses with ground-truth window counts. Each row carries the address, lat/lng, floor count, verified window count, and property type.
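For orientation, one row maps onto roughly this shape (a sketch; field names here are ours, the CSV header is canonical):

```ts
// Rough shape of one row of benchmarks/window-count-v1.csv.
// Field names are illustrative; the CSV header is canonical.
interface BenchmarkRow {
  address: string;             // full UK address
  lat: number;
  lng: number;
  floorCount: number;
  verifiedWindowCount: number; // the ground truth we score against
  propertyType: "terraced" | "semi-detached" | "detached" | "flat";
  notes?: string;              // e.g. flags obscured rear elevations on Street View counts
}
```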
Ground truth comes from one of three sources, in order of preference:
- Site visit · the founder counted windows from outside the property. Most reliable.
- Operator job report · counts confirmed by a window-cleaning round that actually serviced the address.
- Manual Street View count · careful manual count from all four angles. Weakest because rear elevations are often obscured; flagged in the dataset notes column.
The dataset spans urban London, regional UK cities (Manchester, Bristol, Edinburgh, Glasgow, Cardiff, Bath), and suburban Surrey / Hertfordshire, with a mix of terraced, semi-detached, and detached houses plus flats. v1 excludes conservatories and solar panels · those are v2.
How we score
The engine receives the lat/lng, fetches imagery, runs the vision model, and returns a window count. We score the absolute error against the ground-truth count on three metrics (sketched in code after the list):
- ±1 hit rate · share of properties where the engine’s count is within 1 of ground truth.
- ±2 hit rate · within 2.
- Mean absolute error · average of |aiCount − groundTruth|.
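As a concrete sketch of the scoring, assuming each benchmark result pairs the engine's count with the row's ground truth (names here are ours, not the repo's):

```ts
interface ScoredRun {
  hitRate1: number; // share of properties within ±1 of ground truth
  hitRate2: number; // share within ±2
  mae: number;      // mean absolute error
}

// Hypothetical scorer over one benchmark run.
function score(results: { aiCount: number; groundTruth: number }[]): ScoredRun {
  const errors = results.map((r) => Math.abs(r.aiCount - r.groundTruth));
  const within = (n: number) =>
    errors.filter((e) => e <= n).length / errors.length;
  return {
    hitRate1: within(1),
    hitRate2: within(2),
    mae: errors.reduce((sum, e) => sum + e, 0) / errors.length,
  };
}
```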
Each tier (Solo, Operator, Fleet) is scored separately because each ships a different model and imagery configuration.
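In practice that is one pass over the dataset per tier, along these lines (reusing BenchmarkRow and score from the sketches above; countWindows and the config shape stand in for the real engine entry point, which we haven't named here):

```ts
// Sketch: one benchmark pass per tier, each with its own model +
// imagery configuration. Config shape and engine call are illustrative.
type Tier = "solo" | "operator" | "fleet";

interface TierConfig {
  model: string;            // e.g. an Anthropic model identifier
  imagerySources: string[]; // which imagery this tier fetches
}

// Stand-in for the real engine entry point.
declare function countWindows(
  lat: number,
  lng: number,
  config: TierConfig,
): Promise<number>;

async function benchmarkTier(
  rows: BenchmarkRow[],
  config: TierConfig,
): Promise<ScoredRun> {
  const results = await Promise.all(
    rows.map(async (row) => ({
      aiCount: await countWindows(row.lat, row.lng, config),
      groundTruth: row.verifiedWindowCount,
    })),
  );
  return score(results);
}
```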
When the benchmark runs
- On every PR that touches the AI pipeline (lib/claude.ts or lib/aiTier.ts) · drift greater than 3 percentage points blocks the deploy.
- Quarterly against the production model · catches drift from upstream model changes (Anthropic version bumps, new Sonnet / Opus releases).
Results are written to benchmarks/results-<tier>-<date>.json and committed to the repo so historical drift is auditable.
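The PR gate itself can be a small comparison against the last committed results file. We haven't pinned the drift rule to one metric on this page, so this sketch watches the ±2 hit rate; the 3-point threshold is the one stated above:

```ts
import { readFileSync } from "node:fs";

// Sketch of the PR drift gate: compare a fresh run against the last
// committed results-<tier>-<date>.json and fail CI if the ±2 hit rate
// drops by more than 3 percentage points. A thrown error fails the build.
const DRIFT_LIMIT = 0.03; // 3 percentage points

function assertNoDrift(tier: string, fresh: ScoredRun, baselinePath: string): void {
  const baseline: ScoredRun = JSON.parse(readFileSync(baselinePath, "utf8"));
  const drop = baseline.hitRate2 - fresh.hitRate2;
  if (drop > DRIFT_LIMIT) {
    throw new Error(
      `${tier}: ±2 hit rate dropped ${(drop * 100).toFixed(1)}pp vs baseline`,
    );
  }
}
```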
What we currently claim
Until the v1 dataset is fully populated and one full benchmark run completes, the pricing page uses non-numeric language (“typically within ±2 windows” rather than a percentage). We’ll update this page with the measured numbers and a link to the latest results JSON the moment that ships.
The non-numeric copy matches what we observe on the development set, but it hasn't been validated in the formal statistical sense. We'd rather be honest about that than invent percentages.
Customer count is always authoritative
Whatever the AI says, the customer’s own window count is what the quote bills on. The AI read sits in the operator’s pre-arrival brief (an audit trail) but never overrules the customer-facing number. We don’t want operators arguing with customers over a count the AI got wrong.
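In code terms the precedence rule is deliberately boring (a sketch; names are ours):

```ts
// Sketch of the precedence rule: the quote always bills on the
// customer's own count; the AI count only feeds the pre-arrival brief.
interface QuoteInput {
  customerWindowCount: number; // what the customer told us
  aiWindowCount: number;       // what the engine estimated
}

function buildQuote(input: QuoteInput) {
  return {
    billedWindowCount: input.customerWindowCount, // always authoritative
    preArrivalBrief: {
      aiWindowCount: input.aiWindowCount, // audit trail only
      discrepancy: input.aiWindowCount - input.customerWindowCount,
    },
  };
}
```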