Architecture · 5 min read

Why OpenRegistry returns raw upstream data — and refuses to summarise

Published 2026-04-29

OpenRegistry has one design rule that's rare in the AI-tools space: every byte of every tool response comes verbatim from the upstream registry. We do not summarise, normalise field names, infer industry classifications, derive risk scores, translate non-English fields, or pad missing values with sensible-looking defaults. If a registry has stale data, we return stale data. If a registry has a quirky field name in Polish, we return the Polish name.

This is a deliberate choice that costs us some marketing-friendly polish and makes the outputs harder to skim. It's also the reason agents, compliance teams, and third-party developers can rely on the result.

The opposite approach

Most company-data products — Bureau van Dijk Orbis, Refinitiv World-Check, Dun & Bradstreet, Crunchbase, OpenCorporates — sit between raw registry feeds and end users. They scrape, parse, normalise, classify, dedupe across sources, and serve their own cleaned-up version of the world. That's valuable when the consumer is a human analyst who wants a single tidy dossier, but it's lossy and opinionated in four ways:

- Normalisation collapses registry-specific values into generic buckets, discarding nuance the registry deliberately published.
- Classification layers on opinions (industry codes, risk scores) that the source never asserted.
- Deduplication merges records and registers that genuinely disagree, hiding the disagreement.
- Silent correction of upstream errors means the consumer can never tell repaired data from wrong data.

For an AI agent, all four of these hurt more than they help. The agent already has the world's best natural-language post-processor wired in — its own LLM. Pre-cooking the data for it is double work and forces the agent's model to second-guess what was lost in normalisation.

What "verbatim" actually means

Concretely, every tool in OpenRegistry has a jurisdiction_data object on its response that contains the raw upstream payload, byte-for-byte.
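As an illustrative sketch — the company number and nested key names here mimic the UK Companies House API, and other registries differ — a response looks roughly like:

```python
# Illustrative tool response. The unified fields are elided; the point is
# that everything under jurisdiction_data is the upstream payload, untouched.
response = {
    # ...unified fields (jurisdiction, company_id, ...) live at this level...
    "jurisdiction_data": {
        "company_number": "00000006",
        "type": "ltd",                     # not expanded to "private limited"
        "date_of_creation": "1889-06-27",  # upstream key name, unchanged
        "jurisdiction": "england-wales",
    },
}

# No renaming, no normalisation, no dropped fields.
assert response["jurisdiction_data"]["type"] == "ltd"
```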

On top of that, every record carries a unified shape — jurisdiction, company_id, company_name, status, incorporation_date, registered_address — so a cross-jurisdictional agent can reason without knowing the upstream key names. Both layers are present. The agent picks whichever it needs.
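A minimal sketch of cross-jurisdictional reasoning over the unified layer alone — the records and the upstream key names inside jurisdiction_data are invented for illustration:

```python
from datetime import date

# Two hypothetical records from different registries. The agent sorts on the
# unified incorporation_date and never touches the upstream key names, which
# differ per registry (illustrated inside jurisdiction_data).
records = [
    {"jurisdiction": "gb", "company_name": "EXAMPLE LTD",
     "incorporation_date": "1999-03-01",
     "jurisdiction_data": {"date_of_creation": "1999-03-01"}},
    {"jurisdiction": "in", "company_name": "EXAMPLE PVT LTD",
     "incorporation_date": "2004-07-15",
     "jurisdiction_data": {"DATE_OF_REGISTRATION": "15-07-2004"}},
]

oldest_first = sorted(records, key=lambda r: date.fromisoformat(r["incorporation_date"]))
assert oldest_first[0]["jurisdiction"] == "gb"
```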

Where this matters in practice

Status semantics

UK CH has eight possible company_status values plus a free-text company_status_detail for the "in-flight" cases (e.g. struck-off notice published, liquidation, voluntary arrangement). Normalising these to active / dissolved would lose the signal a KYC team is looking for — a company that's active - proposal to strike off is materially riskier than a plain active one. Our unified status field tags it active for cross-country comparison; the upstream nuance is preserved verbatim in jurisdiction_data.
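A sketch of why both layers matter here — the record shapes and the watchlist are illustrative, though active-proposal-to-strike-off follows the UK CH company_status_detail vocabulary:

```python
# Illustrative KYC watchlist of "in-flight" detail values.
RISKY_DETAILS = {"active-proposal-to-strike-off"}

def needs_review(record):
    # Both companies below read "active" in the unified field; the raw
    # detail, preserved verbatim in jurisdiction_data, carries the signal.
    detail = record["jurisdiction_data"].get("company_status_detail")
    return record["status"] == "active" and detail in RISKY_DETAILS

plain = {"status": "active",
         "jurisdiction_data": {"company_status": "active"}}
striking = {"status": "active",
            "jurisdiction_data": {"company_status": "active",
                                  "company_status_detail": "active-proposal-to-strike-off"}}

assert not needs_review(plain)
assert needs_review(striking)
```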

Shareholders vs PSC

UK CH publishes a structured PSC register and a separate (filing-only) statement of capital. They disagree: a 10% shareholder appears in the statement of capital but not in PSC; a corporate trustee with appointment rights appears in PSC but not as a shareholder. Most aggregators conflate them. We don't. get_shareholders returns Statement-of-Capital filing references; get_persons_with_significant_control returns the structured PSC register. A KYC tool that treats PSC as "the shareholder list" will misread companies whose ownership doesn't pass through equity.
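A sketch of how the conflation misleads, using invented entity names (real responses carry much richer structures than bare names):

```python
# Invented example: the two registers genuinely disagree.
shareholders = {"Alice Ltd", "Bob Holdings", "Minority 10% Fund"}  # statement of capital
psc = {"Alice Ltd", "Bob Holdings", "Corporate Trustee plc"}       # PSC register

# Treating PSC as "the shareholder list" silently drops the 10% holder
# and invents an equity stake for the trustee.
only_equity = shareholders - psc    # holders with no PSC entry
only_control = psc - shareholders   # controllers with no equity

assert only_equity == {"Minority 10% Fund"}
assert only_control == {"Corporate Trustee plc"}
```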

Date encoding

India's MCA OGD feed has a documented Y2K bug: pre-1946 incorporation dates are offset by +100 years in the primary drop. We surface the bug rather than silently fixing it - jurisdiction_data._upstream_date_y2k_bug: true - and preserve the raw value. A consumer who depends on 100-year-old companies (rare but real - banks, exchanges, livery companies) needs to know the upstream is wrong and decide how to handle it. Silent correction would mask the issue from the agent.
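A consumer-side sketch of acting on the flag — the record shape is illustrative, and it assumes the unified incorporation_date mirrors the raw (wrong) upstream value; _upstream_date_y2k_bug is the flag described above:

```python
from datetime import date

def true_incorporation_date(record):
    raw = date.fromisoformat(record["incorporation_date"])
    # OpenRegistry preserves the wrong upstream value and flags it;
    # whether (and how) to correct is the consumer's decision.
    if record["jurisdiction_data"].get("_upstream_date_y2k_bug"):
        return raw.replace(year=raw.year - 100)
    return raw

record = {
    "incorporation_date": "2044-05-02",  # upstream says 2044: the +100-year offset
    "jurisdiction_data": {"_upstream_date_y2k_bug": True},
}

assert true_incorporation_date(record) == date(1944, 5, 2)
```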

The trade

Verbatim data is harder to skim. A human analyst opening a single response sees thirty-plus fields with their original upstream names and has to know what they mean. A normalised view would let her see "company type: private limited" instead of company_type: "ltd".

For agents, the trade flips. The agent's LLM understands "ltd" and "PLC" and "AG" and "GmbH" and "S.r.l." and "주식회사" perfectly. What the agent needs from us is reliability — when its tool returns a value, the value has to be exactly what the registry says, not a derivative two layers of opinion away. That's what verbatim buys.

What this means for you

If you're evaluating OpenRegistry against an aggregator and the aggregator's responses look "tidier", that's not us being negligent. It's the architectural choice. A well-built MCP client with a current-generation LLM (Claude 4.x, GPT-5.x, Gemini 3.x) will read raw jurisdiction_data as fluently as a human reads a pre-formatted dossier. The cost of not seeing the raw data — silent normalisation errors that break in production — is much higher.