Data Governance Frameworks: Encoding Research Data Policy as an Executable Pipeline

Research data governance is no longer a binder of administrative rules consulted at audit time; in a working FAIR platform it is an executable specification that decides, at write time, how every dataset is validated, described, licensed, preserved, and exposed. This guide sits inside the Open Science Infrastructure Planning pipeline as its policy layer (the stage that runs before repository deposit and identifier minting) and shows Python automation engineers and research data managers how to compile high-level compliance obligations into a deterministic sequence of gates. It assumes you already operate an ingestion pipeline and want to know exactly where each governance rule is enforced, what happens when a record fails, and how every decision is recorded for funder reporting. Treating governance as code turns compliance from a retrospective review into a property the archive holds by construction: an asset cannot enter the repository unless it has already passed every policy checkpoint.

Governance-as-code pipeline — three ordered gates guard repository ingestion, and every decision lands on the audit log.

Concept & Specification: Governance as a Chain of Typed Contracts

A governance framework is enforceable only when each rule is expressed as a contract a machine can assert rather than a sentence a human must interpret. Four bodies of standard supply the vocabulary those contracts reference. The DataCite Metadata Schema defines the mandatory properties — identifier, creator, title, publisher, publication year, resource type — that a citable research dataset must carry; the field-by-field translation from internal records into that schema is the subject of Metadata Schema Mapping. Contributor identity is expressed as an ORCID iD in its full URI form, and organizational affiliation through the Research Organization Registry (ROR). Licensing is recorded as an SPDX License List identifier so that a legal permission set is reduced to a single canonical token; automating that choice is covered in Open License Configuration. Discovery metadata is serialized to the Schema.org Dataset type so search engines and harvesters can index the record.

Each governance rule binds one of these standards to a decision point and a failure action. The framework distinguishes three rule classes. Structural rules constrain shape — required fields, types, controlled-vocabulary membership — and are checked by Pydantic schema validation at the ingestion boundary. Referential rules assert that an identifier resolves to real, well-formed metadata in an external registry. Policy rules encode institutional and funder obligations — permitted licenses, maximum embargo windows, retention periods — that no upstream standard defines for you. The sections below implement one gate per rule class, in the order a record must clear them, with the compliance rationale annotated on each step.
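
To make the binding of standard, decision point, and failure action concrete, the rule classes can be sketched as a small registry. This is a minimal illustration, not the pipeline's actual data model: the names `RuleClass`, `GovernanceRule`, `RULES`, and `rules_for`, and the gate and action labels, are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class RuleClass(Enum):
    STRUCTURAL = "structural"    # shape: required fields, types, vocabularies
    REFERENTIAL = "referential"  # identifiers resolve in an external registry
    POLICY = "policy"            # institutional and funder obligations


@dataclass(frozen=True)
class GovernanceRule:
    """One rule: a standard bound to a decision point and a failure action."""
    name: str
    rule_class: RuleClass
    standard: str
    gate: str
    failure_action: str


# Hypothetical registry entries, paraphrasing rules described in this guide.
RULES = [
    GovernanceRule("required_core_fields", RuleClass.STRUCTURAL,
                   "DataCite Metadata Schema", "step_1_schema", "quarantine"),
    GovernanceRule("pid_resolvability", RuleClass.REFERENTIAL,
                   "DataCite Metadata Schema", "step_2_resolver", "dead_letter"),
    GovernanceRule("license_permitted", RuleClass.POLICY,
                   "SPDX License List", "step_3_policy", "reject"),
]


def rules_for(rule_class: RuleClass) -> list[GovernanceRule]:
    """Return the registered rules belonging to one rule class."""
    return [r for r in RULES if r.rule_class is rule_class]
```

Keeping rules as data makes it possible to audit which obligations exist without reading gate code, for example by listing `rules_for(RuleClass.POLICY)` in a compliance report.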

Step-by-Step Implementation

The pipeline advances a record through four ordered steps: three governance gates (structural, referential, policy) followed by the repository ingestion gate. A record that fails any gate is quarantined with a machine-readable reason rather than dropped, so the archive can only ever contain compliant assets and every rejection is reconstructable.
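
The ordered, fail-fast behavior described here can be sketched as a short gate runner. This is an illustrative sketch under stated assumptions: `Gate`, `GateOutcome`, `run_gates`, and the two toy gates are hypothetical stand-ins, not the functions defined in the steps that follow.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A gate takes the record and returns None on success or a reason string on failure.
Gate = Callable[[dict], Optional[str]]


@dataclass
class GateOutcome:
    status: str                 # "accepted" or "quarantined"
    failed_gate: Optional[str] = None
    reason: Optional[str] = None


def run_gates(record: dict, gates: list[tuple[str, Gate]]) -> GateOutcome:
    """Run gates in order and stop at the first failure."""
    for gate_name, gate in gates:
        reason = gate(record)
        if reason is not None:
            return GateOutcome("quarantined", failed_gate=gate_name, reason=reason)
    return GateOutcome("accepted")


# Toy gates standing in for the real structural and policy checks.
def has_title(record: dict) -> Optional[str]:
    return None if record.get("title") else "missing_title"


def license_allowed(record: dict) -> Optional[str]:
    return None if record.get("license") in {"CC-BY-4.0", "CC0-1.0"} else "license_not_allowed"


GATES = [("structural", has_title), ("policy", license_allowed)]
```

Because the outcome carries the failing gate and a reason code, a quarantined record can be traced back to the exact rule that stopped it.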

Step 1 — Define strict metadata schemas with programmatic validation

Governance-as-code requires deterministic validation boundaries. Using the Pydantic V2 API, we define a strict contract that enforces field presence, validates the ORCID URI form, constrains the license to an SPDX allowlist, and serializes the record to a Schema.org Dataset document. This gate enforces the structural rules that satisfy the Interoperable and Reusable principles before data enters any downstream stage.

python

import logging
from datetime import datetime, timezone
from typing import Optional, List
from pydantic import BaseModel, Field, field_validator, ValidationError

# Structured logging configuration for audit trails
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s | %(message)s",
    handlers=[logging.StreamHandler()]
)
logger = logging.getLogger("governance.schema_validator")

# Canonical license URLs keyed by SPDX identifier (string templating cannot
# derive these reliably, since SPDX IDs do not map 1:1 to URL paths).
LICENSE_URLS = {
    "CC-BY-4.0": "https://creativecommons.org/licenses/by/4.0/",
    "CC0-1.0": "https://creativecommons.org/publicdomain/zero/1.0/",
    "MIT": "https://opensource.org/licenses/MIT",
    "Apache-2.0": "https://www.apache.org/licenses/LICENSE-2.0",
}

class DatasetGovernanceSchema(BaseModel):
    """Strict schema mapping for FAIR-compliant dataset metadata."""
    dataset_id: str = Field(..., description="Internal project identifier (UUID v4)")
    title: str = Field(..., min_length=5, max_length=300)
    creators: List[str] = Field(..., min_length=1, description="ORCID iDs (full URI or bare form)")
    license: str = Field(..., pattern=r"^(CC-BY-4\.0|CC0-1\.0|MIT|Apache-2\.0)$")
    embargo_date: Optional[datetime] = None
    data_types: List[str] = Field(..., min_length=1)
    funding_grant: Optional[str] = None

    @field_validator("creators")
    @classmethod
    def validate_orcid_format(cls, v: List[str]) -> List[str]:
        for id_ in v:
            if not (id_.startswith("https://orcid.org/") or id_.startswith("0000-000")):
                raise ValueError(f"Invalid creator identifier format: {id_}")
        return v

    @field_validator("data_types")
    @classmethod
    def validate_data_types(cls, v: List[str]) -> List[str]:
        allowed = {"CSV", "JSON", "Parquet", "NetCDF", "TIFF", "FASTQ", "RDF"}
        for dtype in v:
            if dtype.upper() not in allowed:
                raise ValueError(f"Unsupported data type: {dtype}. Allowed: {allowed}")
        return [d.upper() for d in v]

    def to_json_ld(self) -> dict:
        """Serialize to Schema.org/Dataset JSON-LD for repository ingestion."""
        return {
            "@context": "https://schema.org",
            "@type": "Dataset",
            "identifier": self.dataset_id,
            "name": self.title,
            "author": [{"@type": "Person", "identifier": c} for c in self.creators],
            "license": LICENSE_URLS[self.license],
            "datePublished": (self.embargo_date or datetime.now(timezone.utc)).isoformat(),
            "encodingFormat": self.data_types
        }

def validate_and_serialize(raw_payload: dict) -> dict:
    """Execute validation, log FAIR checkpoints, and return JSON-LD."""
    try:
        validated = DatasetGovernanceSchema.model_validate(raw_payload)
        logger.info("FAIR Checkpoint PASSED: Metadata schema validation successful.")
        logger.info("FAIR Checkpoint PASSED: Controlled vocabulary & license constraints enforced.")
        return validated.to_json_ld()
    except ValidationError as e:
        logger.error("FAIR Checkpoint FAILED: Schema validation error. Details: %s", e.json())
        raise

Step 2 — Resolve persistent identifiers with resilient API integration

Automated governance must verify that referenced persistent identifiers actually resolve and return the metadata shape it expects. Registry APIs enforce strict rate limits, require specific headers, and return structured errors, so this referential gate implements exponential backoff, strict HTTP status validation, and payload schema checking to guarantee the Findable and Accessible principles. The broader resolver control plane — circuit breakers, secondary registries, and cache fallback — is detailed in API routing & fallbacks.

python

import logging
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry
from typing import Dict, Any

logger = logging.getLogger("governance.pid_resolver")

# Exact API constraints for DataCite REST API
DATACITE_BASE_URL = "https://api.datacite.org/dois"
MAX_RETRIES = 3
BACKOFF_FACTOR = 0.5
TIMEOUT = (5, 15)  # (connect, read)

def configure_session() -> requests.Session:
    """Configure session with exact retry policy and timeout constraints."""
    session = requests.Session()
    retry_strategy = Retry(
        total=MAX_RETRIES,
        backoff_factor=BACKOFF_FACTOR,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
        respect_retry_after_header=True
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("https://", adapter)
    session.headers.update({
        "Accept": "application/vnd.api+json",
        "User-Agent": "FAIR-Governance-Pipeline/1.0"
    })
    return session

def resolve_doi_metadata(doi: str) -> Dict[str, Any]:
    """Fetch and validate DOI metadata with resilient retry logic."""
    session = configure_session()
    url = f"{DATACITE_BASE_URL}/{doi}"

    logger.info("Attempting PID resolution: %s", doi)
    try:
        response = session.get(url, timeout=TIMEOUT)

        # Exact API constraint: enforce 200 OK and JSON-API content type
        if response.status_code != 200:
            logger.error("FAIR Checkpoint FAILED: DOI resolution returned HTTP %s", response.status_code)
            raise requests.HTTPError(f"DOI resolution failed: HTTP {response.status_code}")

        payload = response.json()
        data = payload.get("data", {})

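        # Validation: ensure required FAIR fields exist in registry response.
        # DataCite returns titles as a list of objects under `attributes.titles`.
        titles = data.get("attributes", {}).get("titles") or []
        if not any(t.get("title") for t in titles):
            raise ValueError("Registry response missing required 'titles' entry.")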

        logger.info("FAIR Checkpoint PASSED: PID resolved and metadata structure validated.")
        return data["attributes"]

    except requests.exceptions.RetryError as e:
        logger.error("FAIR Checkpoint FAILED: Max retries exceeded for %s. Error: %s", doi, e)
        raise
    except requests.exceptions.RequestException as e:
        logger.error("Network failure during PID resolution: %s", e)
        raise
    except ValueError as e:
        logger.error("Schema validation failed on registry payload: %s", e)
        raise
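
One detail the resolver leaves implicit is how the DOI string is cleaned before it is interpolated into the URL. The sketch below shows one possible pre-check, assuming inputs may arrive with a `doi:` or `https://doi.org/` prefix; `build_doi_url` and `_PREFIXES` are hypothetical names, and the shape check is deliberately minimal rather than a full DOI grammar.

```python
from urllib.parse import quote

DATACITE_BASE_URL = "https://api.datacite.org/dois"  # same base URL as Step 2
_PREFIXES = ("https://doi.org/", "http://doi.org/", "doi:")


def build_doi_url(raw: str) -> str:
    """Strip common DOI prefixes, sanity-check the shape, and percent-encode."""
    doi = raw.strip()
    for prefix in _PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Minimal shape check only: "10.<registrant>/<suffix>".
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"Not a DOI: {raw!r}")
    # Keep "/" literal; encode characters such as "#" or "?" that would
    # otherwise be parsed as URL fragment or query delimiters.
    return f"{DATACITE_BASE_URL}/{quote(doi, safe='/')}"
```

Rejecting a malformed DOI before the request is made keeps a permanent input error from being mistaken for a transient registry failure.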

Step 3 — Automate policy enforcement for licensing and retention

Funder mandates and institutional retention policies dictate precise access windows and license compatibility that no metadata standard encodes for you. This policy gate evaluates license strings against an approved allowlist, calculates embargo expiration, and blocks ingestion if any compliance threshold is breached. It is the layer that operationalizes Funder Mandate Alignment and keeps downstream Institutional Repository Strategy deposits inside their legal windows.

python

import logging
from datetime import datetime
from typing import Optional

logger = logging.getLogger("governance.policy_engine")

ALLOWED_LICENSES = {"CC-BY-4.0", "CC0-1.0", "MIT", "Apache-2.0"}
MAX_EMBARGO_DAYS = 730  # 2-year institutional maximum

def enforce_access_policy(license_str: str, embargo_date: Optional[datetime], publication_date: datetime) -> bool:
    """Validate license and embargo against institutional retention policies."""
    logger.info("Evaluating access policy for license: %s", license_str)

    # Validation: License allowlist check
    if license_str not in ALLOWED_LICENSES:
        logger.error("FAIR Checkpoint FAILED: License '%s' not in institutional allowlist.", license_str)
        return False

    # Validation: Embargo constraint check
    if embargo_date:
        if embargo_date < publication_date:
            logger.error("FAIR Checkpoint FAILED: Embargo date precedes publication date.")
            return False

        embargo_duration = (embargo_date - publication_date).days
        if embargo_duration > MAX_EMBARGO_DAYS:
            logger.error("FAIR Checkpoint FAILED: Embargo exceeds institutional maximum of %d days.", MAX_EMBARGO_DAYS)
            return False

        logger.info("FAIR Checkpoint PASSED: Embargo window validated (%d days).", embargo_duration)
    else:
        logger.info("FAIR Checkpoint PASSED: Immediate open access configured.")

    logger.info("FAIR Checkpoint PASSED: Policy enforcement successful. Dataset cleared for publication.")
    return True
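
The comparisons in the policy gate only work when both datetimes are either naive or timezone-aware. A minimal normalization helper might look like the following; `to_utc` is a hypothetical name, and treating a naive value as already being in UTC is an assumption that should be checked against how your submission forms record dates.

```python
from datetime import datetime, timedelta, timezone


def to_utc(value: datetime) -> datetime:
    """Return a timezone-aware UTC datetime.

    Assumption: a naive datetime is interpreted as already being in UTC.
    """
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


naive = datetime(2026, 6, 1, 12, 0)
aware = datetime(2026, 6, 1, 14, 0, tzinfo=timezone(timedelta(hours=2)))
# Both represent 12:00 UTC once normalized, so they compare safely.
same_instant = to_utc(naive) == to_utc(aware)
```

Applying a helper like this to `embargo_date` and `publication_date` before calling the policy gate avoids the `TypeError` raised when naive and aware values are compared.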

Step 4 — Execute repository ingestion and audit compliance checkpoints

The final gate pushes validated metadata to a repository API — InvenioRDM, Dataverse, or DSpace — enforcing exact API constraints, handling the bearer token securely, and logging every checkpoint for the institutional audit trail. This is where the plan requirements captured in building a data management plan template for researchers become machine-executable deposit rules.

python

import logging
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry
from typing import Dict, Any

logger = logging.getLogger("governance.repository_ingest")

REPO_API_URL = "https://repository.institution.edu/api/v1/datasets"
REPO_TIMEOUT = (5, 30)

def ingest_to_repository(json_ld: Dict[str, Any], api_token: str) -> Dict[str, Any]:
    """Push validated JSON-LD to institutional repository with audit logging."""
    session = requests.Session()
    # POST is excluded from urllib3's default retryable methods, so opt in
    # explicitly. Only do this for endpoints that handle deposits idempotently
    # (the 409 branch below guards against duplicate ingestion).
    retry_strategy = Retry(
        total=3,
        backoff_factor=1.0,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["POST"]
    )
    session.mount("https://", HTTPAdapter(max_retries=retry_strategy))
    session.headers.update({
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json",
        "X-FAIR-Compliance-Check": "PASSED"
    })

    logger.info("Initiating repository ingestion for dataset ID: %s", json_ld.get("identifier"))

    try:
        response = session.post(REPO_API_URL, json=json_ld, timeout=REPO_TIMEOUT)

        # Exact API constraint: enforce 201 Created and validate response schema
        if response.status_code == 201:
            resp_data = response.json()
            logger.info("FAIR Checkpoint PASSED: Dataset successfully ingested. Repository ID: %s", resp_data.get("id"))
            return resp_data
        elif response.status_code == 409:
            logger.warning("FAIR Checkpoint WARNING: Dataset already exists in repository. Skipping duplicate.")
            return {"status": "duplicate", "message": response.text}
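        else:
            logger.error("FAIR Checkpoint FAILED: Repository returned HTTP %s. Body: %s", response.status_code, response.text)
            # raise_for_status() is a no-op for non-error codes (e.g. 200, 202),
            # so raise explicitly: anything other than 201/409 is a failed deposit.
            raise requests.HTTPError(
                f"Unexpected repository response: HTTP {response.status_code}",
                response=response,
            )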

    except requests.exceptions.RetryError as e:
        logger.error("Repository ingestion failed after retries: %s", e)
        raise
    except requests.exceptions.RequestException as e:
        logger.error("Network or API failure during ingestion: %s", e)
        raise
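
The ingestion function takes `api_token` as an argument but does not show where it comes from. One common approach, sketched here as an assumption rather than the guide's prescribed method, is to read it from the environment and only ever log a redacted form. The variable name `REPO_API_TOKEN` and the helpers `load_api_token` and `redact` are hypothetical.

```python
import os


def load_api_token(env_var: str = "REPO_API_TOKEN") -> str:
    """Read the bearer token from the environment and fail fast if absent."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return token


def redact(token: str, visible: int = 4) -> str:
    """Return a log-safe form that shows only the last few characters."""
    if len(token) <= visible:
        return "*" * len(token)
    return "*" * (len(token) - visible) + token[-visible:]
```

Failing before the first request when the token is missing keeps a configuration error from surfacing as a string of HTTP 401 responses in the audit log.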

Governance Control Reference Matrix

Every rule in the framework maps a governing standard to an enforcement point, a decision function, and the action taken on failure. This matrix is the single source of truth for what the pipeline enforces and where; use it to reconcile a rejected record against the exact gate that stopped it.

Governance control	Governing standard	Enforcement gate	Decision rule	Failure action
Required core fields	DataCite Metadata Schema	Step 1 schema	`dataset_id`, `title`, `creators`, `license`, `data_types` present	Quarantine — `ValidationError`
Creator identity form	ORCID	Step 1 `validate_orcid_format`	Full `https://orcid.org/` URI or `0000-000…` iD	Quarantine — invalid identifier
Format allowlist	Controlled vocabulary	Step 1 `validate_data_types`	Member of `{CSV, JSON, Parquet, NetCDF, TIFF, FASTQ, RDF}`	Quarantine — unsupported type
PID resolvability	DataCite Metadata Schema	Step 2 `resolve_doi_metadata`	HTTP 200 and a non-empty title in the registry response (DataCite exposes it as `attributes.titles`)	Retry, then dead-letter
License permitted	SPDX License List	Step 3 `enforce_access_policy`	Member of institutional `ALLOWED_LICENSES`	Reject — non-compliant license
Embargo bound	Funder / institutional policy	Step 3 `enforce_access_policy`	`0 ≤ (embargo − publication) ≤ 730 days`	Reject — embargo out of range
Duplicate deposit	Repository API	Step 4 `ingest_to_repository`	HTTP 409 on POST	Skip — logged as duplicate
Discovery serialization	Schema.org `Dataset`	Step 1 `to_json_ld`	Valid JSON-LD `@context`/`@type` emitted	Block — ingestion payload rejected

Error Handling and Edge Cases

A governance pipeline is judged on how it fails, not only on how it passes. Each gate distinguishes transient failures, which are retried, from permanent failures, which are quarantined for human remediation, and never conflates the two.

Transient registry failures (HTTP 429/5xx during Step 2) are retried with exponential backoff that honors the Retry-After header. When retries are exhausted a RetryError surfaces; route the record to a dead-letter queue keyed by its dataset_id with a resolution_pending status rather than crashing the batch.
Malformed records (ValidationError at Step 1) are permanent. Serialize e.json() alongside the raw payload into a quarantine store so a curator sees exactly which field failed which constraint. Never silently coerce — a record that cannot be trusted must not advance.
Policy rejections (Step 3 returning False) are business-rule failures, not exceptions. Record the rejecting rule (license or embargo) on the audit ledger and notify the depositor; these need a policy decision, not a retry.
Duplicate deposits (HTTP 409 at Step 4) are treated as success-with-warning. Because POST retries are enabled, a network timeout after the server committed the deposit would otherwise double-ingest; the 409 branch makes the operation idempotent.
Partial batches must be transactional at the record level. One quarantined record must never abort the sibling records in the same run; accumulate per-record outcomes and emit a summary at the end.
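
The record-level isolation and the transient/permanent split described in this list can be sketched as a batch runner. The exception classes `TransientError` and `PermanentError`, the `process_batch` function, and the outcome labels are hypothetical; a real pipeline would map its own exceptions (for example `RetryError` and `ValidationError`) onto these two categories.

```python
from collections import Counter
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. registry outage)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that need human remediation."""


def process_batch(records: list[dict], handler: Callable[[dict], None]) -> dict:
    """Process each record independently and accumulate per-record outcomes."""
    outcomes: dict[str, str] = {}
    for record in records:
        key = record.get("dataset_id", "<unknown>")
        try:
            handler(record)
            outcomes[key] = "ingested"
        except TransientError:
            outcomes[key] = "dead_lettered"   # replay once the dependency recovers
        except PermanentError:
            outcomes[key] = "quarantined"     # curator must fix the payload
    return {"outcomes": outcomes, "summary": dict(Counter(outcomes.values()))}


def demo_handler(record: dict) -> None:
    """Toy handler that fails according to a flag on the record."""
    if record.get("fail") == "transient":
        raise TransientError("registry unavailable")
    if record.get("fail") == "permanent":
        raise PermanentError("missing title")


result = process_batch(
    [{"dataset_id": "a"}, {"dataset_id": "b", "fail": "transient"},
     {"dataset_id": "c", "fail": "permanent"}],
    demo_handler,
)
```

Only the two known failure categories are caught, so an unexpected exception still stops the run instead of being silently recorded as an outcome.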

Verification and Testing

Assert governance behavior the same way you assert application logic — with tests that exercise both the pass path and every rejection path. The following covers the Step 3 policy gate, the rule class most likely to drift as institutional windows change.

python

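from datetime import datetime, timezone

# Assumes the Step 3 function lives in a module named policy_engine (hypothetical name).
from policy_engine import enforce_access_policy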

def test_policy_gate_enforcement() -> None:
    pub = datetime(2026, 1, 1, tzinfo=timezone.utc)

    # Compliant: permitted license, no embargo -> immediate open access
    assert enforce_access_policy("CC-BY-4.0", None, pub) is True

    # Rejected: license outside institutional allowlist
    assert enforce_access_policy("GPL-3.0", None, pub) is False

    # Rejected: embargo exceeds the 730-day institutional maximum
    long_embargo = datetime(2029, 1, 1, tzinfo=timezone.utc)
    assert enforce_access_policy("CC-BY-4.0", long_embargo, pub) is False

    # Rejected: embargo date precedes publication date
    backdated = datetime(2025, 6, 1, tzinfo=timezone.utc)
    assert enforce_access_policy("CC-BY-4.0", backdated, pub) is False

A passing run of the full pipeline emits a contiguous ladder of checkpoint logs; the presence of every expected FAIR Checkpoint PASSED line is itself an assertion you can grep in CI:

code

2026-07-02 09:14:01 [INFO] governance.schema_validator | FAIR Checkpoint PASSED: Metadata schema validation successful.
2026-07-02 09:14:01 [INFO] governance.pid_resolver | FAIR Checkpoint PASSED: PID resolved and metadata structure validated.
2026-07-02 09:14:02 [INFO] governance.policy_engine | FAIR Checkpoint PASSED: Policy enforcement successful. Dataset cleared for publication.
2026-07-02 09:14:02 [INFO] governance.repository_ingest | FAIR Checkpoint PASSED: Dataset successfully ingested. Repository ID: 10.5281/zenodo.7654321
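
The grep-style assertion can also be written as a small Python check for use in CI. This sketch assumes the log format configured in Step 1 (`... [LEVEL] logger.name | message`); `EXPECTED_LOGGERS` and `missing_checkpoints` are hypothetical names. It verifies only that each stage logged at least one PASSED line, not the order of the lines or the absence of FAILED lines.

```python
EXPECTED_LOGGERS = [
    "governance.schema_validator",
    "governance.pid_resolver",
    "governance.policy_engine",
    "governance.repository_ingest",
]


def missing_checkpoints(log_text: str) -> list[str]:
    """Return the loggers that emitted no 'FAIR Checkpoint PASSED' line."""
    passed = set()
    for line in log_text.splitlines():
        if "FAIR Checkpoint PASSED" not in line:
            continue
        for name in EXPECTED_LOGGERS:
            if f" {name} | " in line:
                passed.add(name)
    return [name for name in EXPECTED_LOGGERS if name not in passed]
```

A CI step could read the captured pipeline log and fail the build when the returned list is non-empty.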

Gotchas and Known Pitfalls

Bare ORCID vs. full URI. A record carrying 0000-0002-1825-0097 is not interchangeable with https://orcid.org/0000-0002-1825-0097. The validator accepts both forms, but downstream JSON-LD consumers expect the URI. Normalize to the URI form during enrichment or you emit Dataset documents whose author.identifier will not dereference.
Naive vs. timezone-aware datetimes. Comparing an offset-naive embargo_date against a timezone-aware publication_date raises TypeError in Step 3. Normalize every datetime to UTC at the ingestion boundary; a gate that raises on a timezone mismatch aborts the run without recording a policy decision.
Silent Pydantic coercion. Pydantic V2 validates in lax mode by default, so an integer field will happily coerce the string "5" to 5. For governance contracts that is a data-integrity hole: set model_config = ConfigDict(strict=True) on the model, or Field(strict=True) on individual fields, wherever a type must be exact, so a malformed payload fails loudly at Step 1 instead of being quietly accepted.
Retrying non-idempotent POSTs. Enabling allowed_methods=["POST"] in Step 4 is safe only because the endpoint returns 409 on a duplicate. If your repository does not de-duplicate deposits, a retried POST after a server-side commit double-ingests; verify idempotency before opting POST into the retry policy.
License allowlist drift. The SPDX identifier set is duplicated between the Step 1 regex and the Step 3 ALLOWED_LICENSES set. When institutional policy adds a license, both must change together — hoist the allowlist into one shared constant, or the schema will accept a token the policy engine then rejects, quarantining valid records.
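
For the bare-versus-URI ORCID pitfall, a normalizer along the following lines could run during enrichment. It is a sketch, not the pipeline's validator: `normalize_orcid` and `orcid_check_digit` are hypothetical names. The check character is computed with the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers.

```python
import re

ORCID_URI_PREFIX = "https://orcid.org/"
_ORCID_RE = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def normalize_orcid(value: str) -> str:
    """Return the full-URI form, raising ValueError on a malformed iD."""
    bare = value.strip()
    if bare.startswith(ORCID_URI_PREFIX):
        bare = bare[len(ORCID_URI_PREFIX):]
    if not _ORCID_RE.match(bare):
        raise ValueError(f"Malformed ORCID iD: {value!r}")
    digits = bare.replace("-", "")
    if orcid_check_digit(digits[:15]) != digits[15]:
        raise ValueError(f"ORCID checksum mismatch: {value!r}")
    return ORCID_URI_PREFIX + bare
```

Verifying the checksum catches transposed or mistyped digits locally, before any registry lookup is attempted.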

Frequently Asked Questions

Why enforce governance in code instead of at repository review?

Because review-time enforcement is retroactive and non-deterministic: it inspects records after they are already in the archive and depends on a curator’s judgment. Encoding each rule as a typed contract means a non-compliant record never becomes durable in the first place, and every decision is timestamped on the audit ledger. Compliance becomes a property the archive holds by construction rather than a state you periodically try to restore.

Where does governance sit relative to the rest of the infrastructure pipeline?

It is the first executable stage after a submission event and before repository deposit — the policy layer described in the Open Science Infrastructure Planning overview. Structural validation runs first, referential PID resolution second, institutional policy third, and only a record that clears all three reaches the repository ingestion gate.

How do I add a new institutional license without breaking existing records?

Update the single shared SPDX allowlist that both the Step 1 schema pattern and the Step 3 ALLOWED_LICENSES set reference, add a canonical URL to LICENSE_URLS, and add a test asserting the new identifier passes the policy gate. Deploy the policy change as a reviewed pull request so the diff on the allowlist is itself the audit record of the obligation change.
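
A single shared allowlist could be sketched as one constant from which the schema pattern is derived. This is an illustration under assumptions: `LICENSE_PATTERN` and `is_allowed` are hypothetical names, and the derived string is built for Python's `re` module; it should also work as a `pattern=` argument in the Step 1 schema, though that is worth confirming against your Pydantic version.

```python
import re

# Single source of truth (the four identifiers used throughout this guide).
ALLOWED_LICENSES = ("CC-BY-4.0", "CC0-1.0", "MIT", "Apache-2.0")

# Derive the Step 1 regex from the same tuple; re.escape handles the dots.
LICENSE_PATTERN = "^(" + "|".join(re.escape(spdx) for spdx in ALLOWED_LICENSES) + ")$"


def is_allowed(license_id: str) -> bool:
    """Policy-side membership test against the shared allowlist."""
    return license_id in ALLOWED_LICENSES
```

With this arrangement, adding a license is a one-line change to `ALLOWED_LICENSES`, and the structural and policy gates cannot disagree about which tokens are valid.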

What is the difference between a quarantined record and a dead-lettered one?

A quarantined record failed a permanent rule — a missing field, an invalid identifier, a disallowed license — and needs a human to correct the payload. A dead-lettered record failed a transient dependency, typically registry resolution, and is expected to succeed on replay once the dependency recovers. Keeping the two queues separate stops a registry outage from generating remediation tickets no curator can action.

Related Guides

Open Science Infrastructure Planning: the parent overview showing how this policy layer composes with repository, funder, and license stages.
Funder Mandate Alignment: the checkpoint layer whose obligations the Step 3 policy engine enforces.
Institutional Repository Strategy: the deposit target that consumes the audited JSON-LD this pipeline produces.
Open License Configuration: how the SPDX license tokens validated here are selected and encoded.
Building a data management plan template for researchers: turning plan requirements into the machine-executable rules above.
