Validating Metadata Against FAIR Criteria Automatically: A Python Implementation Guide

This guide implements the automated compliance gate that a normalized metadata record passes through before it is written to persistent storage. The task is narrow and deterministic: take one canonical record, evaluate it against a fixed set of checks derived from the FAIR Principles (Findable, Accessible, Interoperable, Reusable), and return a machine-readable verdict that classifies every finding as either a critical violation (block publication) or a non-critical gap (publish, then remediate asynchronously). It assumes Python 3.10+, familiarity with the asyncio and httpx primitives, and a record that has already been through the crosswalk stage. This step sits directly downstream of Metadata Schema Mapping: the mapper hands over a structurally canonical payload, and this validator decides whether that payload is compliant enough to publish. It is where two of the four principles from the FAIR Principle Breakdown are actually enforced rather than merely asserted.

When automated validation degrades at scale, the root cause is almost never the FAIR Principles themselves. It is schema version drift, unresolved persistent identifiers, missing license strings, or a controlled-vocabulary resolver returning HTTP 404/410 under load. The design goal is therefore not just to flag non-compliant records, but to isolate the failure boundary, keep throughput stable through registry outages, and leave a queryable audit trail an accreditor can trust.

The FAIR Validation Rule Set

Every check is deterministic: a named rule, the record field it inspects, the exact pass condition, and a severity that decides routing. Findings with critical severity block publication; warning findings are recorded in provenance and re-attempted on a later pass, so an upstream registry outage never becomes a publication outage. The controlled vocabularies this pipeline relies on (the SPDX License List for rights, ORCID identifiers for creators, and ROR identifiers for affiliations) are cited by their full names because an unresolved term is a measurable downgrade in machine-actionability, not a cosmetic issue.

Rule ID	FAIR principle	Field(s) inspected	Pass condition	Severity
`F1-pid-present`	Findable	`identifier`	Non-empty string matching a DOI/Handle/ARK shape	`critical`
`F1-pid-resolves`	Findable	`identifier`	Resolver returns HTTP 200 (or 3xx to a 200)	`warning`
`F2-rich-metadata`	Findable	`title`, `creators`	`title` non-empty and `creators` length ≥ 1	`critical`
`A1-access-protocol`	Accessible	`accessUrl`	Present and uses `https` scheme	`critical`
`A1-access-rights`	Accessible	`accessRights`	One of `open`, `embargoed`, `restricted`	`critical`
`I1-context`	Interoperable	`@context`	Declares a W3C JSON-LD 1.1 context URI	`critical`
`I2-resource-type`	Interoperable	`resourceTypeGeneral`	In DataCite Metadata Schema controlled vocabulary	`warning`
`R1-license-spdx`	Reusable	`rights`	Exact match against the SPDX License List	`critical`
`R1-creator-orcid`	Reusable	`creators[].orcid`	Every creator carries a resolvable ORCID URI	`warning`
`R1-provenance`	Reusable	`provenance`	Non-empty source + timestamp block present	`warning`

The split between critical and warning is the single most important design decision on this page. Structural conformance (a well-formed identifier, a declared context, an SPDX license) is a hard gate. Live resolution of external identifiers is inherently flaky, so it is scored as a warning and deferred — this is what keeps validation throughput decoupled from third-party uptime.

Production Python Implementation

The validator runs the cheap synchronous structural checks first and short-circuits on any critical structural failure, so a malformed record never triggers outbound HTTP. Only records that pass the structural gate spend network budget on identifier resolution. External calls run concurrently under a bounded asyncio.Semaphore and each is wrapped so a timeout downgrades to a warning rather than raising — mirroring the circuit-breaker posture described in API Routing & Fallbacks.

python

from __future__ import annotations

import asyncio
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Any

import httpx

# --- Controlled vocabularies (in production, load from a versioned registry) ---
SPDX_LICENSES: frozenset[str] = frozenset(
    {"CC-BY-4.0", "CC0-1.0", "CC-BY-SA-4.0", "MIT", "Apache-2.0"}
)
DATACITE_RESOURCE_TYPES: frozenset[str] = frozenset(
    {"Dataset", "Software", "Collection", "Text", "Image"}
)
ACCESS_RIGHTS: frozenset[str] = frozenset({"open", "embargoed", "restricted"})

PID_SHAPE = re.compile(r"^(10\.\d{4,9}/\S+|hdl:\S+|ark:/\d+/\S+)$")
ORCID_URI = re.compile(r"^https://orcid\.org/\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


class Severity(str, Enum):
    CRITICAL = "critical"
    WARNING = "warning"


@dataclass(slots=True)
class Finding:
    rule_id: str
    principle: str
    severity: Severity
    message: str


@dataclass(slots=True)
class FairReport:
    content_hash: str
    findings: list[Finding] = field(default_factory=list)

    @property
    def blocked(self) -> bool:
        """A record is blocked only by a CRITICAL finding."""
        return any(f.severity is Severity.CRITICAL for f in self.findings)

    @property
    def score(self) -> float:
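        """Fraction of the 10 rules with no finding; warnings still count as misses.

        A blocked record never runs the two resolution rules, so those two
        are not counted against it.
        """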
        return round(1.0 - len(self.findings) / 10, 3)

    @property
    def verdict(self) -> str:
        if self.blocked:
            return "blocked"
        return "deferred" if self.findings else "compliant"


def _structural_checks(record: dict[str, Any], report: FairReport) -> None:
    """Cheap, offline checks. Populate report.findings with structural gaps."""
    def fail(rule: str, principle: str, sev: Severity, msg: str) -> None:
        report.findings.append(Finding(rule, principle, sev, msg))

    pid = str(record.get("identifier", "")).strip()
    if not PID_SHAPE.match(pid):
        fail("F1-pid-present", "Findable", Severity.CRITICAL,
             f"identifier {pid!r} is not a DOI/Handle/ARK")
    if not record.get("title") or not record.get("creators"):
        fail("F2-rich-metadata", "Findable", Severity.CRITICAL,
             "title and at least one creator are required")
    access_url = str(record.get("accessUrl", ""))
    if not access_url.startswith("https://"):
        fail("A1-access-protocol", "Accessible", Severity.CRITICAL,
             "accessUrl must be an https URL")
    if record.get("accessRights") not in ACCESS_RIGHTS:
        fail("A1-access-rights", "Accessible", Severity.CRITICAL,
             f"accessRights must be one of {sorted(ACCESS_RIGHTS)}")
    if not str(record.get("@context", "")).startswith("http"):
        fail("I1-context", "Interoperable", Severity.CRITICAL,
             "a JSON-LD @context URI is required")
    if record.get("resourceTypeGeneral") not in DATACITE_RESOURCE_TYPES:
        fail("I2-resource-type", "Interoperable", Severity.WARNING,
             "resourceTypeGeneral is outside the DataCite vocabulary")
    if record.get("rights") not in SPDX_LICENSES:
        fail("R1-license-spdx", "Reusable", Severity.CRITICAL,
             f"rights {record.get('rights')!r} is not an SPDX identifier")
    if not record.get("provenance"):
        fail("R1-provenance", "Reusable", Severity.WARNING,
             "provenance block (source + timestamp) is missing")


async def _resolves(client: httpx.AsyncClient, url: str) -> bool:
    """A timeout or transport error is a soft miss, never an exception."""
    try:
        resp = await client.get(url, follow_redirects=True, timeout=5.0)
        return resp.status_code == 200
    except httpx.HTTPError:
        return False


async def _resolution_checks(
    record: dict[str, Any], report: FairReport, client: httpx.AsyncClient
) -> None:
    """Concurrent external checks; failures downgrade to WARNING (deferred)."""
    sem = asyncio.Semaphore(8)  # bound concurrent outbound calls

    async def guarded(url: str) -> bool:
        async with sem:
            return await _resolves(client, url)

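    pid = str(record.get("identifier", "")).strip()
    # PID_SHAPE accepts DOIs, Handles, and ARKs, so pick the matching resolver.
    if pid.startswith("hdl:"):
        pid_url = f"https://hdl.handle.net/{pid[4:]}"
    elif pid.startswith("ark:"):
        pid_url = f"https://n2t.net/{pid}"
    else:
        pid_url = f"https://doi.org/{pid}"
    creators = record.get("creators") or []
    # `or ""` guards against an explicit null orcid, which re.match would reject.
    candidates = [c.get("orcid") or "" for c in creators]
    orcids = [o for o in candidates if ORCID_URI.match(o)]

    pid_ok, *orcid_oks = await asyncio.gather(
        guarded(pid_url),
        *[guarded(o) for o in orcids],
    )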
    if not pid_ok:
        report.findings.append(Finding(
            "F1-pid-resolves", "Findable", Severity.WARNING,
            f"identifier {pid} did not resolve; deferred for retry"))
    if len(orcids) < len(creators) or not all(orcid_oks):
        report.findings.append(Finding(
            "R1-creator-orcid", "Reusable", Severity.WARNING,
            "one or more creator ORCIDs are missing or unresolved"))


async def validate_fair(
    record: dict[str, Any], content_hash: str
) -> FairReport:
    """Run the full FAIR gate: structural first, then bounded async resolution."""
    report = FairReport(content_hash=content_hash)
    _structural_checks(record, report)
    # Short-circuit: never spend network budget on a structurally broken record.
    if report.blocked:
        return report
    async with httpx.AsyncClient() as client:
        await _resolution_checks(record, report, client)
    return report
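
The bounded-concurrency and soft-failure pattern used in the resolution checks can be seen in isolation in the sketch below. It is an illustration under assumptions: `soft_check` and `run_bounded` are hypothetical names, and `asyncio.sleep` stands in for the outbound HTTP call so the example runs offline.

```python
import asyncio


async def soft_check(delay: float, timeout: float) -> bool:
    """Stand-in for a resolver call: True on success, False on timeout."""
    try:
        await asyncio.wait_for(asyncio.sleep(delay), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # A timeout is a soft miss, never an exception for the caller.
        return False


async def run_bounded(
    delays: list[float], limit: int, timeout: float
) -> tuple[list[bool], int]:
    """Run all checks with at most `limit` in flight; also report the peak."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(delay: float) -> bool:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await soft_check(delay, timeout)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(d) for d in delays))
    return list(results), peak


# The third "call" is far slower than the timeout, so it becomes a soft miss.
results, peak = asyncio.run(
    run_bounded([0.01, 0.01, 5.0, 0.01, 0.01], limit=2, timeout=0.2)
)
print(results, peak)
```

The slow call should come back as `False` without raising, and the observed peak should not exceed the limit. Note that `_resolution_checks` creates its semaphore inside the function, so the bound of 8 applies per record; limiting outbound calls across a whole batch would require sharing one semaphore between calls.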

The FairReport.verdict property collapses the routing decision into three states (compliant, deferred, blocked), which is exactly the surface a deferred-compliance scoring layer and the immutable audit trail consume downstream. Every validation event should emit one structured log line (content_hash, verdict, score, and the applied rule-set revision) so that a later replay is reproducible; the remediation state machine consumes those same verdicts.
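
A structured log line of this kind could be produced as follows. This is a sketch under assumptions: the function name `fair_gate_log_line` and the keys `ruleset_revision`, `findings`, and `logged_at` are hypothetical, as is the revision string in the usage line; only `stage`, `content_hash`, `verdict`, and `score` come from this guide.

```python
import json
from datetime import datetime, timezone


def fair_gate_log_line(
    content_hash: str,
    verdict: str,
    score: float,
    ruleset_revision: str,
    rule_ids: list[str],
) -> str:
    """Serialize one validation event as a single JSON line."""
    event = {
        "stage": "fair_gate",
        "content_hash": content_hash,
        "verdict": verdict,
        "score": score,
        # Hypothetical key: which revision of the rule set produced the verdict.
        "ruleset_revision": ruleset_revision,
        # Hypothetical key: IDs of the rules that produced findings.
        "findings": sorted(rule_ids),
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
    # sort_keys gives a stable key order, which keeps replays easy to diff.
    return json.dumps(event, sort_keys=True, separators=(",", ":"))


line = fair_gate_log_line("h3", "deferred", 0.9, "rules-v1", ["R1-provenance"])
print(line)
```

Recording the rule-set revision alongside the verdict is what makes a replay meaningful: the same record can legitimately change verdict when the rules change.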

Verification

Pin the verdict for three representative records — one clean, one with a hard structural violation, one with only a deferrable gap — so a rule regression is caught the moment the rule set is edited. The resolution checks are stubbed here so the test stays hermetic and offline.

python

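import asyncio

# "mymod" stands for whichever module defines validate_fair and _resolves.
from mymod import validate_fair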


def _record(**overrides: object) -> dict[str, object]:
    base = {
        "identifier": "10.5281/zenodo.123456",
        "title": "Soil Carbon Flux, 2026",
        "creators": [{"orcid": "https://orcid.org/0000-0002-1825-0097"}],
        "accessUrl": "https://example.org/ds/1",
        "accessRights": "open",
        "@context": "https://schema.org/",
        "resourceTypeGeneral": "Dataset",
        "rights": "CC-BY-4.0",
        "provenance": {"source": "eln", "at": "2026-05-01T09:30:00Z"},
    }
    return base | overrides


def test_verdicts(monkeypatch) -> None:
    # Force every external resolution to succeed for a deterministic test.
    async def always_ok(client, url):  # noqa: ANN001
        return True
    monkeypatch.setattr("mymod._resolves", always_ok)

    clean = asyncio.run(validate_fair(_record(), "h1"))
    assert clean.verdict == "compliant" and clean.score == 1.0

    bad = asyncio.run(validate_fair(_record(rights="Open Access"), "h2"))
    assert bad.verdict == "blocked"            # non-SPDX license is critical

    gap = asyncio.run(validate_fair(_record(provenance=None), "h3"))
    assert gap.verdict == "deferred"           # missing provenance is only a warning

Run it with pytest -q; a passing run prints 1 passed and, wired into the pipeline, each record emits one JSON line — {"stage": "fair_gate", "content_hash": "…", "verdict": "compliant", "score": 1.0} — that observability dashboards aggregate into a validation pass-rate.
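
The aggregation into a pass-rate might look like the sketch below. It rests on two assumptions: the helper name `pass_rate` is hypothetical, and the rate is defined here as compliant verdicts over all fair_gate events, which a dashboard could reasonably define differently (for example by counting deferred records as published).

```python
import json
from collections import Counter


def pass_rate(lines: list[str]) -> tuple[float, Counter[str]]:
    """Return (compliant / total, per-verdict counts) for fair_gate events."""
    verdicts: Counter[str] = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log stream
        event = json.loads(line)
        if event.get("stage") == "fair_gate":
            verdicts[event["verdict"]] += 1
    total = sum(verdicts.values())
    rate = verdicts["compliant"] / total if total else 0.0
    return rate, verdicts


sample = [
    '{"stage": "fair_gate", "content_hash": "h1", "verdict": "compliant", "score": 1.0}',
    '{"stage": "fair_gate", "content_hash": "h2", "verdict": "blocked", "score": 0.9}',
    '{"stage": "fair_gate", "content_hash": "h3", "verdict": "deferred", "score": 0.9}',
    '{"stage": "fair_gate", "content_hash": "h4", "verdict": "compliant", "score": 1.0}',
    '{"stage": "crosswalk", "content_hash": "h5"}',
]
rate, counts = pass_rate(sample)
print(rate, dict(counts))
```

Events from other stages are ignored, so the same log stream can feed several stage-level metrics.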

Gotchas

Blocking the whole batch on a resolver timeout. Treating F1-pid-resolves as critical turns a transient DOI-resolver outage into a stalled ingestion queue. Root cause: conflating structural validity with live reachability. Fix: keep resolution checks at warning severity and defer them, exactly as the rule table specifies.
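Case-folding an SPDX identifier. Normalizing rights to lowercase turns CC-BY-4.0 into cc-by-4.0, which fails the exact-match lookup and blocks every record. Root cause: over-eager string hygiene. Fix: only .strip() the license string, because R1-license-spdx compares against the canonical casing on the SPDX License List. The SPDX specification itself treats identifiers as case-insensitive for matching, so if lowercase input must be accepted, map it back to the canonical form upstream rather than lowercasing the vocabulary.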
Scoring a bare ORCID as compliant. A creator carrying 0000-0002-1825-0097 is not linked data; only the full https://orcid.org/0000-0002-1825-0097 URI resolves. Root cause: source systems export the short form. Fix: require the HTTPS URI in the crosswalk step upstream, or the R1-creator-orcid warning will fire on every record. See how ORCID normalization is handled during Pydantic schema validation.

Related

Metadata Schema Mapping: the crosswalk stage that produces the canonical record this gate validates.
FAIR Principle Breakdown: how each of the four principles maps to a concrete pipeline checkpoint.
API Routing & Fallbacks: circuit breakers and bounded retries for the external identifier resolution this validator depends on.

See the Core Architecture & FAIR Mapping overview for the full pipeline topology, and the field-by-field crosswalk rules in how to map Dublin Core to schema.org for research data.
