OpenRegistry has one design rule that's rare in the AI-tools space: every byte of every tool response comes verbatim from the upstream registry. We do not summarise, normalise field names, infer industry classifications, derive risk scores, translate non-English fields, or pad missing values with sensible-looking defaults. If a registry has stale data, we return stale data. If a registry has a quirky field name in Polish, we return the Polish name.
This is a deliberate choice that costs us some marketing-friendly polish and makes the outputs harder to skim. It's also the reason agents, compliance teams, and third-party developers can rely on the result.
The opposite approach
Most company-data products — Bureau van Dijk Orbis, Refinitiv WorldCheck, Dun & Bradstreet, Crunchbase, OpenCorporates — sit between raw registry feeds and end users. They scrape, parse, normalise, classify, dedupe across sources, and serve their own cleaned-up version of the world. That's valuable when the consumer is a human analyst who wants a single tidy dossier, but it's lossy and opinionated:
- A normalised field is one tool's interpretation of an upstream value. The interpretation can be wrong, and is rarely auditable.
- A "company status: active" derived value collapses the upstream's actual finer-grained status (e.g. active - proposal to strike off, active - liquidation) into binary truth.
- A scrape-and-cache pipeline introduces a freshness gap. By the time an analyst sees the data, it could be hours, days, or quarters out of date.
- An aggregator's identifier (BvD ID, OpenCorporates ID) is not the registry's identifier, so you can't cleanly hand the result back to the government's own portal for verification.
For an AI agent, all four of these hurt more than they help. The agent already has the world's best natural-language post-processor wired in — its own LLM. Pre-cooking the data for it is double work and forces the agent's model to second-guess what was lost in normalisation.
What "verbatim" actually means
Concretely, every tool in OpenRegistry has a jurisdiction_data object on its response that contains the raw upstream payload, byte-for-byte:
- UK Companies House → all CH public data API fields, with original names (company_status_detail, has_been_liquidated, previous_company_names[].ceased_on, etc.)
- Norway Brreg → all Brønnøysund response fields with Norwegian keys (navn, organisasjonsform, postadresse)
- OpenDART (Korea) → all response fields with Korean keys (corp_name, corp_name_eng, jurir_no), plus the raw XBRL document bytes
- Spain BORME → all 35 sub-fields of an acto_inscrito entry, in Spanish, including the original "actos" XML structure
On top of that, every record carries a unified shape — jurisdiction, company_id, company_name, status, incorporation_date, registered_address — so a cross-jurisdictional agent can reason without knowing the upstream key names. Both layers are present. The agent picks whichever it needs.
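To make the two-layer design concrete, here is a minimal sketch of what a response might look like. The company values and exact payload shape are hypothetical; only the design — unified keys alongside a verbatim jurisdiction_data object — comes from the description above.

```python
# Hypothetical response for a Norwegian company. The field values are
# made up for illustration; the two-layer shape is the point.
response = {
    # Unified layer: the same keys in every jurisdiction.
    "jurisdiction": "no",
    "company_id": "000000000",
    "company_name": "EXAMPLE AS",
    "status": "active",
    "incorporation_date": "1998-05-20",
    "registered_address": "Examplegata 1, 0001 Oslo",
    # Verbatim layer: the raw Brønnøysund payload, Norwegian keys untouched.
    "jurisdiction_data": {
        "navn": "EXAMPLE AS",
        "organisasjonsform": {"kode": "AS"},
        "postadresse": {"adresse": ["Examplegata 1"], "poststed": "OSLO"},
    },
}

# A cross-jurisdictional agent reasons over the unified keys...
status = response["status"]
# ...and drops to the verbatim layer when the upstream nuance matters.
org_form = response["jurisdiction_data"]["organisasjonsform"]["kode"]
```

The unified layer answers "is this company active?" the same way everywhere; the verbatim layer answers "what exactly did Brønnøysund say?" without any translation in between.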
Where this matters in practice
Status semantics
UK CH has eight possible company_status values plus a free-text company_status_detail for the "in-flight" cases (e.g. struck-off notice published, liquidation, voluntary arrangement). Normalising these to active / dissolved would lose the signal a KYC team is looking for — a company that's active - proposal to strike off is materially riskier than a plain active one. Our unified status field tags it active for cross-country comparison; the upstream nuance is preserved verbatim in jurisdiction_data.
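Because both layers are present, a consumer can use the unified field for filtering and the verbatim field for risk. A minimal sketch, assuming the response shape described above (the helper name and example detail string are illustrative):

```python
def strike_off_risk(record: dict) -> bool:
    """Flag UK companies whose verbatim Companies House status detail
    signals an in-flight strike-off, even though the unified status
    still reads "active". Sketch only; field names follow the
    jurisdiction_data layout described in the text."""
    detail = record.get("jurisdiction_data", {}).get("company_status_detail", "")
    return record.get("status") == "active" and "strike" in detail.lower()

# The unified status alone cannot distinguish these two records:
risky = {
    "status": "active",
    "jurisdiction_data": {"company_status_detail": "active-proposal-to-strike-off"},
}
plain = {"status": "active", "jurisdiction_data": {}}
```

A normalised-only API would make `risky` and `plain` indistinguishable; the verbatim layer keeps the difference available to whoever needs it.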
Shareholders vs PSC
UK CH publishes a structured PSC register and a separate (filing-only) statement of capital. They disagree: a 10% shareholder appears in the statement of capital but not in PSC; a corporate trustee with appointment rights appears in PSC but not as a shareholder. Most aggregators conflate them. We don't. get_shareholders returns Statement-of-Capital filing references; get_persons_with_significant_control returns the structured PSC register. A KYC tool that treats PSC as "the shareholder list" will misread companies whose ownership doesn't pass through equity.
Date encoding
India's MCA OGD feed has a documented Y2K bug: pre-1946 incorporation dates are offset by +100 years in the primary drop. We surface the bug rather than silently fixing it — jurisdiction_data._upstream_date_y2k_bug: true — and preserve the raw value. A consumer who depends on 100-year-old companies (rare but real — banks, exchanges, livery companies) needs to know the upstream is wrong and decide how to handle it. Silent correction would mask the issue from the agent.
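Surfacing the flag moves the correction decision to the consumer, where it belongs. A sketch of what that consumer-side choice might look like — the flag name is taken from the text above, but the surrounding payload shape and helper are assumptions:

```python
from datetime import date

def corrected_incorporation_date(record: dict) -> date:
    """Consumer-side choice: undo the documented +100-year offset in
    MCA's primary drop when the verbatim flag says it is present.
    Sketch only; the payload shape is an assumption."""
    raw = date.fromisoformat(record["incorporation_date"])
    if record.get("jurisdiction_data", {}).get("_upstream_date_y2k_bug"):
        # A pre-1946 date such as 1935 arrives from upstream as 2035.
        return raw.replace(year=raw.year - 100)
    return raw

buggy = {
    "incorporation_date": "2035-03-01",  # verbatim upstream value
    "jurisdiction_data": {"_upstream_date_y2k_bug": True},
}
```

Another consumer might legitimately choose the opposite — keep the raw value and annotate it — which is exactly why we don't make the choice for them.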
The trade
Verbatim data is harder to skim. A human analyst opening a single response sees thirty-plus fields with their original upstream names and has to know what they mean. A normalised view would let her see "company type: private limited" instead of company_type: "ltd".
For agents, the trade flips. The agent's LLM understands "ltd" and "PLC" and "AG" and "GmbH" and "S.r.l." and "주식회사" perfectly, and it reads jurisdiction_data as fluently as a human reads a pre-formatted dossier. What the agent needs from us is reliability — when its tool returns a value, the value has to be exactly what the registry says, not a derivative two layers of opinion away. The cost of not seeing the raw data — silent normalisation errors that break in production — is much higher than the cost of seeing too much. That's what verbatim buys.