Architecture · 5 min read

Why OpenRegistry returns raw upstream data — and refuses to summarise

Published 2026-04-29

OpenRegistry has one design rule that's rare in the AI-tools space: every byte of every tool response comes verbatim from the upstream registry. We do not summarise, normalise field names, infer industry classifications, derive risk scores, translate non-English fields, or pad missing values with sensible-looking defaults. If a registry has stale data, we return stale data. If a registry has a quirky field name in Polish, we return the Polish name.

This is a deliberate choice that costs us some marketing-friendly polish and makes the outputs harder to skim. It's also the reason agents, compliance teams, and third-party developers can rely on the result.

The opposite approach

Most company-data products — Bureau van Dijk Orbis, Refinitiv World-Check, Dun & Bradstreet, Crunchbase, OpenCorporates — sit between raw registry feeds and end users. They scrape, parse, normalise, classify, dedupe across sources, and serve their own cleaned-up version of the world. That's valuable when the consumer is a human analyst who wants a single tidy dossier, but it's lossy and opinionated in four ways:

- Normalisation collapses registry-specific values into generic buckets, discarding nuance the registry deliberately published.
- Classification layers on opinions (industry codes, risk scores) that the source never asserted.
- Deduplication merges records and registers that genuinely disagree, hiding the disagreement.
- Silent correction of upstream errors means the consumer can never tell repaired data from wrong data.

For an AI agent, all four of these hurt more than they help. The agent already has the world's best natural-language post-processor wired in — its own LLM. Pre-cooking the data for it is double work and forces the agent's model to second-guess what was lost in normalisation.

What "verbatim" actually means

Concretely, every tool in OpenRegistry has a jurisdiction_data object on its response that contains the raw upstream payload, byte-for-byte.
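As an illustrative sketch — the company number and nested key names here mimic the UK Companies House API, and other registries differ — a response looks roughly like:

```python
# Illustrative tool response. The unified fields are elided; the point is
# that everything under jurisdiction_data is the upstream payload, untouched.
response = {
    # ...unified fields (jurisdiction, company_id, ...) live at this level...
    "jurisdiction_data": {
        "company_number": "00000006",
        "type": "ltd",                     # not expanded to "private limited"
        "date_of_creation": "1889-06-27",  # upstream key name, unchanged
        "jurisdiction": "england-wales",
    },
}

# No renaming, no normalisation, no dropped fields.
assert response["jurisdiction_data"]["type"] == "ltd"
```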

On top of that, every record carries a unified shape — jurisdiction, company_id, company_name, status, incorporation_date, registered_address — so a cross-jurisdictional agent can reason without knowing the upstream key names. Both layers are present. The agent picks whichever it needs.
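A minimal sketch of cross-jurisdictional reasoning over the unified layer alone — the records and the upstream key names inside jurisdiction_data are invented for illustration:

```python
from datetime import date

# Two hypothetical records from different registries. The agent sorts on the
# unified incorporation_date and never touches the upstream key names, which
# differ per registry (illustrated inside jurisdiction_data).
records = [
    {"jurisdiction": "gb", "company_name": "EXAMPLE LTD",
     "incorporation_date": "1999-03-01",
     "jurisdiction_data": {"date_of_creation": "1999-03-01"}},
    {"jurisdiction": "in", "company_name": "EXAMPLE PVT LTD",
     "incorporation_date": "2004-07-15",
     "jurisdiction_data": {"DATE_OF_REGISTRATION": "15-07-2004"}},
]

oldest_first = sorted(records, key=lambda r: date.fromisoformat(r["incorporation_date"]))
assert oldest_first[0]["jurisdiction"] == "gb"
```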

Where this matters in practice

Status semantics

UK CH has eight possible company_status values plus a free-text company_status_detail for the "in-flight" cases (e.g. struck-off notice published, liquidation, voluntary arrangement). Normalising these to active / dissolved would lose the signal a KYC team is looking for — a company that's active - proposal to strike off is materially riskier than a plain active one. Our unified status field tags it active for cross-country comparison; the upstream nuance is preserved verbatim in jurisdiction_data.
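A sketch of why both layers matter here — the record shapes and the watchlist are illustrative, though active-proposal-to-strike-off follows the UK CH company_status_detail vocabulary:

```python
# Illustrative KYC watchlist of "in-flight" detail values.
RISKY_DETAILS = {"active-proposal-to-strike-off"}

def needs_review(record):
    # Both companies below read "active" in the unified field; the raw
    # detail, preserved verbatim in jurisdiction_data, carries the signal.
    detail = record["jurisdiction_data"].get("company_status_detail")
    return record["status"] == "active" and detail in RISKY_DETAILS

plain = {"status": "active",
         "jurisdiction_data": {"company_status": "active"}}
striking = {"status": "active",
            "jurisdiction_data": {"company_status": "active",
                                  "company_status_detail": "active-proposal-to-strike-off"}}

assert not needs_review(plain)
assert needs_review(striking)
```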

Shareholders vs PSC

UK CH publishes a structured PSC register and a separate (filing-only) statement of capital. They disagree: a 10% shareholder appears in the statement of capital but not in PSC; a corporate trustee with appointment rights appears in PSC but not as a shareholder. Most aggregators conflate them. We don't. get_shareholders returns Statement-of-Capital filing references; get_persons_with_significant_control returns the structured PSC register. A KYC tool that treats PSC as "the shareholder list" will misread companies whose ownership doesn't pass through equity.
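A sketch of how the conflation misleads, using invented entity names (real responses carry much richer structures than bare names):

```python
# Invented example: the two registers genuinely disagree.
shareholders = {"Alice Ltd", "Bob Holdings", "Minority 10% Fund"}  # statement of capital
psc = {"Alice Ltd", "Bob Holdings", "Corporate Trustee plc"}       # PSC register

# Treating PSC as "the shareholder list" silently drops the 10% holder
# and invents an equity stake for the trustee.
only_equity = shareholders - psc    # holders with no PSC entry
only_control = psc - shareholders   # controllers with no equity

assert only_equity == {"Minority 10% Fund"}
assert only_control == {"Corporate Trustee plc"}
```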

Date encoding

India's MCA OGD feed has a documented Y2K bug: pre-1946 incorporation dates are offset by +100 years in the primary drop. We surface the bug rather than silently fixing it - jurisdiction_data._upstream_date_y2k_bug: true - and preserve the raw value. A consumer who depends on 100-year-old companies (rare but real - banks, exchanges, livery companies) needs to know the upstream is wrong and decide how to handle it. Silent correction would mask the issue from the agent.
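A consumer-side sketch of acting on the flag — the record shape is illustrative, and it assumes the unified incorporation_date mirrors the raw (wrong) upstream value; _upstream_date_y2k_bug is the flag described above:

```python
from datetime import date

def true_incorporation_date(record):
    raw = date.fromisoformat(record["incorporation_date"])
    # OpenRegistry preserves the wrong upstream value and flags it;
    # whether (and how) to correct is the consumer's decision.
    if record["jurisdiction_data"].get("_upstream_date_y2k_bug"):
        return raw.replace(year=raw.year - 100)
    return raw

record = {
    "incorporation_date": "2044-05-02",  # upstream says 2044: the +100-year offset
    "jurisdiction_data": {"_upstream_date_y2k_bug": True},
}

assert true_incorporation_date(record) == date(1944, 5, 2)
```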

The trade

Verbatim data is harder to skim. A human analyst opening a single response sees thirty-plus fields with their original upstream names and has to know what they mean. A normalised view would let her see "company type: private limited" instead of company_type: "ltd".

For agents, the trade flips. The agent's LLM understands "ltd" and "PLC" and "AG" and "GmbH" and "S.r.l." and "주식회사" perfectly. What the agent needs from us is reliability — when its tool returns a value, the value has to be exactly what the registry says, not a derivative two layers of opinion away. That's what verbatim buys.

What this means for you

If you're evaluating OpenRegistry against an aggregator and the aggregator's responses look "tidier", that's not us being negligent. It's the architectural choice. A well-built MCP client with a current-generation LLM (Claude 4.x, GPT-5.x, Gemini 3.x) will read raw jurisdiction_data as fluently as a human reads a pre-formatted dossier. The cost of not seeing the raw data — silent normalisation errors that break in production — is much higher.