Open Science Infrastructure Planning: Governance, Repository, and Funder-Mandate Automation for FAIR Research Data

Open science infrastructure is the layer of engineered systems that turns institutional data policy into running software. When it is absent or improvised, the failure modes are predictable and expensive: datasets are deposited without persistent identifiers, licenses are missing or incompatible, funder data-sharing conditions are discovered only at audit time, and metadata drifts out of sync with the repositories that expose it. Research data managers absorb this as manual curation debt, academic IT teams absorb it as unplanned migrations, and Python automation engineers absorb it as brittle one-off scripts that break every grant cycle. This section covers the planning decisions and the reference architecture that prevent those outcomes — how to encode governance as executable policy, choose and integrate an institutional repository, align each funder mandate to concrete technical controls, and automate open-license selection so that every published dataset is Findable, Accessible, Interoperable, and Reusable by construction rather than by remediation.

The four operational domains below are engineered as one continuous pipeline, not four disconnected projects. Governance defines the rules, repository strategy defines where data lands and how it is discovered, funder alignment defines the non-negotiable checkpoints, and license configuration defines the legal metadata that makes reuse possible. Each domain has its own implementation guide; this overview shows how they compose and where the seams are.

Architecture Overview

The planning pipeline moves a dataset from a submission event through governance validation, repository deposit, identifier resolution, and public discovery, with funder and license checkpoints embedded as gates rather than after-the-fact reviews. Every transition writes to an append-only audit ledger so compliance state is reconstructable at any point.

Figure: Open science infrastructure pipeline, from policy intake to audited public discovery.

Each box maps to one of the four implementation domains. The governance node is specified in Data Governance Frameworks; the funder checkpoint in Funder Mandate Alignment; the repository deposit and PID nodes in Institutional Repository Strategy; and the license node in Open License Configuration. For the ingestion and validation internals that feed this pipeline, see the sibling Core Architecture & FAIR Mapping overview.

Layered Infrastructure and Service Boundaries

A durable open science platform separates concerns into four service layers, each with a stable contract so that any single layer can be replaced without rewriting the others. Collapsing these into a monolith is the most common planning mistake: it couples repository choice to metadata modeling, so a repository migration forces a schema rewrite, and it couples policy to code, so every governance change requires a deployment.

Policy layer. Governance rules live as version-controlled declarative artifacts — Rego bundles, YAML policy files, or typed validation classes — reviewed through pull requests alongside pipeline code. This is where mandatory-field lists, controlled vocabularies, retention windows, and embargo logic are defined. Treating policy as code is the central premise of Data Governance Frameworks, and it is what makes compliance auditable: a diff on the policy repository is a diff on institutional obligations.
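As a rough illustration of policy-as-data, the sketch below evaluates a record against a declarative policy. The policy is an inline dictionary standing in for a version-controlled YAML or JSON file. The keys (`mandatory_fields`, `controlled_vocabularies`, `min_retention_years`), the rule-identifier format, and the `evaluate_policy` helper are all assumptions made for this example, not part of any particular policy engine.

```python
from typing import Any

# Hypothetical policy artifact; in practice this would be parsed from a
# version-controlled YAML or JSON file rather than defined inline.
POLICY: dict[str, Any] = {
    "mandatory_fields": ["title", "creators", "license", "resource_type"],
    "controlled_vocabularies": {
        "resource_type": {"Dataset", "Software", "Workflow"},
    },
    "min_retention_years": 10,
}


def evaluate_policy(record: dict[str, Any], policy: dict[str, Any]) -> list[str]:
    """Return the identifier of every rule the record violates (empty means allow)."""
    failed: list[str] = []
    # Mandatory fields must be present and non-empty.
    for name in policy["mandatory_fields"]:
        if not record.get(name):
            failed.append(f"mandatory.{name}")
    # Fields with a controlled vocabulary may only hold a listed value.
    for name, allowed in policy["controlled_vocabularies"].items():
        value = record.get(name)
        if value is not None and value not in allowed:
            failed.append(f"vocabulary.{name}")
    # Retention must meet the institutional minimum.
    if record.get("retention_years", 0) < policy["min_retention_years"]:
        failed.append("retention.minimum")
    return failed
```

Because the rules live in data rather than in branches of code, a governance change in this sketch is an edit to `POLICY` that can be reviewed as a diff, while `evaluate_policy` stays untouched.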

Metadata layer. Validated records are normalized to a canonical internal model, then crosswalked to the exchange formats the repository and harvesters expect. First-class citizens here are the DataCite Metadata Schema for identifier registration, Dublin Core for OAI-PMH harvesting, and schema.org for search-engine discovery. Structural validation at this boundary is enforced through Pydantic schema validation, and the field-by-field crosswalk rules are specified in Metadata Schema Mapping.
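To make the crosswalk idea concrete, here is a minimal sketch that maps one canonical record to simple Dublin Core elements and to a schema.org `Dataset` description. The canonical field names and the sample values are invented for illustration. A production crosswalk would cover far more fields and would follow the rules in Metadata Schema Mapping.

```python
from typing import Any

# Hypothetical canonical record; field names and values are illustrative only.
canonical: dict[str, Any] = {
    "title": "Soil moisture observations",
    "creators": ["Ada Example", "Lin Sample"],
    "doi": "10.1234/soil.v1",
    "license_url": "https://creativecommons.org/licenses/by/4.0/",
    "published": "2024-05-01",
}


def to_dublin_core(rec: dict[str, Any]) -> dict[str, list[str]]:
    """Flatten to simple Dublin Core elements (repeatable, so values are lists)."""
    return {
        "dc:title": [rec["title"]],
        "dc:creator": list(rec["creators"]),
        "dc:identifier": [f"https://doi.org/{rec['doi']}"],
        "dc:rights": [rec["license_url"]],
        "dc:date": [rec["published"]],
    }


def to_schema_org(rec: dict[str, Any]) -> dict[str, Any]:
    """Build a schema.org Dataset description suitable for JSON-LD embedding."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": rec["title"],
        "creator": [{"@type": "Person", "name": n} for n in rec["creators"]],
        "identifier": f"https://doi.org/{rec['doi']}",
        "license": rec["license_url"],
        "datePublished": rec["published"],
    }
```

Both functions read from the same canonical record and never from each other, which is what keeps a change in one exchange format from leaking into the others.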

Repository layer. This is the persistence and exposure tier — a Dataverse installation, a Zenodo community, an institutional Figshare, or a self-hosted InvenioRDM instance — reached through its REST API rather than a web upload form. Repository choice drives deposit semantics, versioning behavior, and identifier scheme, which is why it is a planning decision and not an implementation detail. The trade-offs for grant-funded work are analyzed in choosing the right repository for grant-funded projects.

Resolution layer. Persistent identifiers and their landing pages are the durable public contract. A DOI minted through the DataCite REST API, a Handle, or an ARK must resolve for the full retention period regardless of repository migrations, so resolution is decoupled from storage and treated as its own service with its own uptime budget.

Layer	Responsibility	Reference technologies	Stable contract
Policy	Encode mandates, retention, embargo	Open Policy Agent (Rego), Pydantic validators, YAML bundles	`decision: allow | deny` + reason codes
Metadata	Normalize + crosswalk records	DataCite Metadata Schema, Dublin Core, schema.org, `rdflib`	Canonical JSON model → exchange serializations
Repository	Persist + expose datasets	Dataverse API, Zenodo API, InvenioRDM, Figshare	REST deposit + versioned record IDs
Resolution	Mint + resolve identifiers	DataCite REST, Handle System, ARK/N2T	Immutable PID → landing-page URL

Four service layers, each replaceable behind the stable contract it exposes.
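One way to pin these contracts down in Python is with structural types. The sketch below is an assumption about how the boundaries could be expressed. The protocol names, method names, and the `Decision` type are hypothetical, chosen to match the responsibilities in the table.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reasons: list[str] = field(default_factory=list)


@runtime_checkable
class PolicyLayer(Protocol):
    def evaluate(self, record: dict[str, Any]) -> Decision: ...


@runtime_checkable
class MetadataLayer(Protocol):
    def crosswalk(self, record: dict[str, Any], target: str) -> dict[str, Any]: ...


@runtime_checkable
class RepositoryLayer(Protocol):
    def deposit(self, record: dict[str, Any]) -> str: ...


@runtime_checkable
class ResolutionLayer(Protocol):
    def mint_or_resolve(self, slug: str, version: str) -> str: ...


class AllowAll:
    """Trivial stand-in: any object with a matching method satisfies the contract."""

    def evaluate(self, record: dict[str, Any]) -> Decision:
        return Decision(allowed=True)
```

Because the protocols are structural, a Dataverse client and an InvenioRDM client can both satisfy `RepositoryLayer` without sharing a base class, which is the property that lets one layer be swapped without touching the others. Note that `runtime_checkable` only verifies that the method exists, not its signature.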

Core Pipeline Patterns

The pipeline that connects these layers must be idempotent, observable, and resilient to partial failure. Deposits are triggered by events — a new submission, a metadata correction, an embargo expiry — so an event-driven model backed by a message broker is the default, with scheduled batch runs reserved for bulk backfills and nightly reconciliation. Whichever model is used, three properties are non-negotiable: reprocessing the same event must not create a duplicate deposit, every failure must land in a dead-letter queue with enough context to remediate, and every state transition must be recorded before it is acted on.

Idempotency is achieved by deriving a deterministic content key for each submission and checking it against the ledger before minting an identifier or calling the repository API. A retry, a duplicate webhook, or a replayed batch then resolves to the existing record instead of a second DOI. The pattern below shows a broker-agnostic deposit coordinator that enforces this contract.

python

import hashlib
import json
import logging
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import StrEnum

logger = logging.getLogger("osi.deposit_coordinator")


class DepositState(StrEnum):
    RECEIVED = "received"
    GOVERNED = "governed"
    DEPOSITED = "deposited"
    RESOLVED = "resolved"
    QUARANTINED = "quarantined"


@dataclass(slots=True)
class Submission:
    dataset_slug: str
    title: str
    version: str
    payload: dict


def content_key(sub: Submission) -> str:
    """Deterministic idempotency key: identical logical submissions collapse
    to the same key, so replays never mint a second identifier."""
    basis = f"{sub.dataset_slug}:{sub.version}:{sub.title}".encode("utf-8")
    return hashlib.sha256(basis).hexdigest()


class DepositCoordinator:
    def __init__(self, ledger, governance, repository, pid_service) -> None:
        self.ledger = ledger            # append-only audit store
        self.governance = governance    # policy-as-code evaluator
        self.repository = repository     # repository REST client
        self.pid = pid_service           # DOI/Handle/ARK minter

    def handle(self, sub: Submission) -> str:
        key = content_key(sub)
        existing = self.ledger.lookup(key)
        if existing and existing["state"] == DepositState.RESOLVED:
            logger.info("Idempotent hit for %s -> %s", key, existing["pid"])
            return existing["pid"]  # already published; return prior PID

        self._record(key, DepositState.RECEIVED, sub)

        decision = self.governance.evaluate(sub.payload)
        if not decision.allowed:
            self._record(key, DepositState.QUARANTINED, sub, reason=decision.reasons)
            self.ledger.dead_letter(key, decision.reasons)
            raise PermissionError(f"Governance denied {key}: {decision.reasons}")
        self._record(key, DepositState.GOVERNED, sub)

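        # Reuse the record from an earlier run that deposited but failed before
        # minting (assumes ledger.lookup exposes the stored record_id), so a
        # replay never deposits twice.
        record_id = existing.get("record_id") if existing else None
        if record_id is None:
            record_id = self.repository.deposit(sub.payload)  # repository API call
            self._record(key, DepositState.DEPOSITED, sub, record_id=record_id)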

        pid = self.pid.mint_or_resolve(sub.dataset_slug, sub.version)
        self.ledger.finalize(key, pid=pid, record_id=record_id)
        self._record(key, DepositState.RESOLVED, sub, pid=pid)
        return pid

    def _record(self, key: str, state: DepositState, sub: Submission, **ctx) -> None:
        self.ledger.append({
            "key": key,
            "state": str(state),
            "slug": sub.dataset_slug,
            "ts": datetime.now(timezone.utc).isoformat(),
            **ctx,
        })
        logger.info("state=%s key=%s %s", state, key, json.dumps(ctx, default=str))

For high-volume backfills — migrating an existing repository, or ingesting a decade of legacy datasets — the same coordinator is driven from bounded, concurrent batches rather than one event at a time. The rate-limit handling, checkpoint serialization, and retry semantics for that mode are covered in async batch processing, which is the correct place to tune concurrency without changing the deposit contract above.
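A bounded batch driver can be sketched with `asyncio.Semaphore` and a JSON checkpoint file. This is an illustration only. `run_batch`, its parameters, and the checkpoint format are assumptions, and `deposit` stands in for whatever coroutine wraps the coordinator.

```python
import asyncio
import json
from pathlib import Path
from typing import Awaitable, Callable


async def run_batch(
    keys: list[str],
    deposit: Callable[[str], Awaitable[str]],
    checkpoint: Path,
    max_concurrency: int = 4,
) -> dict[str, str]:
    """Process `keys` with bounded concurrency, skipping ones already checkpointed."""
    done: dict[str, str] = (
        json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    )
    gate = asyncio.Semaphore(max_concurrency)

    async def worker(key: str) -> None:
        async with gate:  # at most max_concurrency deposits in flight
            done[key] = await deposit(key)
            # Serialize progress after each success so a crash resumes from here.
            checkpoint.write_text(json.dumps(done))

    pending = [k for k in keys if k not in done]
    results = await asyncio.gather(
        *(worker(k) for k in pending), return_exceptions=True
    )
    failures = [r for r in results if isinstance(r, Exception)]
    if failures:
        raise RuntimeError(
            f"{len(failures)} deposits failed; rerun to resume"
        ) from failures[0]
    return done
```

Rerunning the same batch after a failure only touches keys missing from the checkpoint, so concurrency can be tuned through `max_concurrency` without changing what a single deposit does.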

Compliance Enforcement as Pipeline Checkpoints

Compliance is enforced where data flows, not in a quarterly review. Each mandate is compiled into a gate that a dataset must pass before advancing to the next state, and a denied gate routes the record to remediation with a specific, machine-readable reason rather than a vague failure. This is the difference between a pipeline that can prove compliance and one that merely hopes for it.

Three categories of rule are enforced. FAIR sub-principles are checked structurally — a record without a resolvable identifier cannot be Findable, so identifier reservation is a hard gate. Funder conditions are checked against the specific mandate attached to the grant; the mapping from policy language to concrete metadata fields is worked out in Funder Mandate Alignment, with a fully worked example in aligning NIH data-sharing policies with FAIR principles. Institutional conditions — retention minimums, PII handling, permitted licenses — are checked against the governance policy bundle. The evaluator returns a single allow/deny decision plus the failed rule identifiers so remediation is unambiguous.

python

from dataclasses import dataclass, field


@dataclass(slots=True)
class PolicyDecision:
    allowed: bool
    reasons: list[str] = field(default_factory=list)


class ComplianceEvaluator:
    """Compile FAIR, funder, and institutional rules into one gate. Each check
    is a (rule_id, ok) pair; any single failure denies the deposit."""

    OPEN_LICENSES = {"CC-BY-4.0", "CC0-1.0", "CC-BY-SA-4.0"}

    def evaluate(self, record: dict) -> PolicyDecision:
        checks = [
            ("fair.findable.pid", bool(record.get("identifier_reserved"))),
            ("fair.reusable.license", record.get("license") in self.OPEN_LICENSES),
            ("fair.interoperable.schema", record.get("metadata_schema") == "datacite-4.5"),
            ("funder.data_sharing", self._funder_satisfied(record)),
            # "or 0" keeps an explicit null from raising inside int().
            ("inst.retention_years", int(record.get("retention_years") or 0) >= 10),
        ]
        failed = [rule_id for rule_id, ok in checks if not ok]
        return PolicyDecision(allowed=not failed, reasons=failed)

    @staticmethod
    def _funder_satisfied(record: dict) -> bool:
        mandate = record.get("funder_mandate")
        if mandate is None:
            return True  # no attached mandate -> institutional defaults apply
        # A mandated dataset must be open or carry an approved embargo end-date.
        return record.get("is_open_access", False) or bool(record.get("embargo_end"))

Because the evaluator returns rule identifiers rather than free text, remediation tickets, compliance dashboards, and audit exports all key off the same stable vocabulary. When a funder updates its policy, the change is a new rule in the bundle and a diff in version control — never a silent behavioral shift.
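As a small illustration of keying reports off rule identifiers, the sketch below counts denials per rule and per rule family. The `denials` list is made-up sample data shaped like the evaluator's output, and both helper names are hypothetical.

```python
from collections import Counter

# Hypothetical denied decisions as they might be read back from the audit ledger.
denials = [
    {"key": "a1", "reasons": ["fair.reusable.license"]},
    {"key": "b2", "reasons": ["fair.reusable.license", "inst.retention_years"]},
    {"key": "c3", "reasons": ["funder.data_sharing"]},
]


def failures_by_rule(decisions: list[dict]) -> Counter[str]:
    """Count how many quarantined records each rule identifier is blocking."""
    return Counter(rule for d in decisions for rule in d["reasons"])


def failures_by_domain(decisions: list[dict]) -> Counter[str]:
    """Roll rule ids up to their prefix (fair / funder / inst) for a dashboard."""
    return Counter(
        rule.split(".", 1)[0] for d in decisions for rule in d["reasons"]
    )
```

The dotted prefix convention does the grouping work here, so a new rule such as a revised funder condition shows up in the right dashboard bucket without any reporting change.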

Identity and Resolution

Persistent identifiers are the promise that a dataset stays reachable long after the grant closes and possibly after the originating repository is decommissioned. Planning the resolution layer means choosing an identifier scheme, deciding where minting authority lives, and defining what happens when the registration API is unreachable.

DataCite DOIs, registered through the DataCite REST API and described with the DataCite Metadata Schema, are the default for research datasets because a DOI ties the identifier to rich, harvestable metadata and to a resolvable landing page. Handles offer a lighter-weight, institution-run alternative, and ARKs suit high-volume or ephemeral objects where a registration fee per identifier is prohibitive. The decision is durable, because changing scheme after publication is expensive, so it belongs in the planning phase.

Scheme	Registration authority	Metadata coupling	Best fit	Cost model
DOI (DataCite)	DataCite member, via REST API	Strong — schema required at mint	Citable published datasets	Annual membership + per-DOI
Handle	Local Handle service	None required	Institution-run object IDs	Prefix fee, self-hosted
ARK	Any organization (N2T or local)	Optional	High-volume / ephemeral objects	Free to low

Minting must be idempotent and must degrade gracefully. A DOI is reserved in the draft state before the repository deposit, promoted to findable only after the landing page is live, and — critically — reservation is keyed to the same content key used by the deposit coordinator, so a retry resolves the existing draft instead of burning a new identifier. When the registration API is slow or returning 5xx, the pipeline follows a deterministic fallback chain: reserve locally, queue the remote registration, and reconcile on the next scheduled run. The circuit-breaker and secondary-registry mechanics for these outbound calls are specified in API Routing & Fallbacks.

python

import httpx


class PIDService:
    """Idempotent DOI reservation with local-first fallback. The content key
    guarantees replays resolve to the existing draft rather than a new DOI."""

    def __init__(self, client: httpx.Client, ledger, prefix: str) -> None:
        self.client = client
        self.ledger = ledger
        self.prefix = prefix

    def mint_or_resolve(self, slug: str, version: str) -> str:
        cached = self.ledger.pid_for(slug, version)
        if cached:
            return cached  # idempotent: identifier already reserved

        suffix = f"{slug}.v{version}"
        doi = f"{self.prefix}/{suffix}"
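        try:
            # DataCite documents publish, register, and hide as events; a DOI
            # created without one stays in draft. The client is assumed to carry
            # the repository credentials (for example via httpx auth=).
            resp = self.client.post(
                "https://api.datacite.org/dois",
                json={"data": {"type": "dois", "attributes": {"doi": doi}}},
                headers={"Content-Type": "application/vnd.api+json"},
                timeout=5.0,
            )
            resp.raise_for_status()
        except httpx.HTTPStatusError as exc:
            if exc.response.status_code < 500:
                raise  # 4xx is a request problem, not an outage: do not reserve
            self.ledger.queue_registration(doi, error=str(exc))
        except httpx.RequestError as exc:
            # Degrade gracefully: reserve locally, queue remote registration.
            self.ledger.queue_registration(doi, error=str(exc))
        self.ledger.record_pid(slug, version, doi)
        return doi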
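The reconciliation half of the fallback chain is not shown above. A minimal sketch of draining the queued registrations might look like this, assuming the queue is a simple in-process `deque` and that `register` raises `ConnectionError` while the registry is still unreachable. Both are simplifications chosen for the example.

```python
from collections import deque
from typing import Callable


def drain_registrations(
    queue: deque[str],
    register: Callable[[str], None],
) -> list[str]:
    """Try each queued DOI once; keep failures queued for the next run.

    Returns the DOIs that registered successfully during this pass.
    """
    registered: list[str] = []
    # Snapshot the length so re-queued items are not retried in the same pass.
    for _ in range(len(queue)):
        doi = queue.popleft()
        try:
            register(doi)
        except ConnectionError:
            queue.append(doi)  # still unreachable: leave it for the next run
        else:
            registered.append(doi)
    return registered
```

Each pass attempts every queued DOI exactly once, so a prolonged outage produces a steady backlog rather than a retry storm.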

Failure Modes and Operational Runbook

Every long-lived data platform fails in a small number of recurring ways. Planning for them means defining, in advance, the detection signal and the remediation step for each — so that an on-call engineer follows a runbook instead of improvising against a production incident.

Failure mode	Detection signal	Immediate remediation
Schema drift	Validation-failure rate spikes on a specific field; DLQ fills with one rule id	Pin the schema version, add a coercion shim, re-run quarantined records
Resolver downtime	DOI registration returns `5xx` / timeouts; queued-registration backlog grows	Local-first reservation already holds; drain the reconciliation queue when the API recovers
Quota / rate-limit exhaustion	Repository API returns `429`; batch throughput collapses	Back off with jitter, lower concurrency, resume from the last serialized checkpoint
Metadata/repository divergence	Nightly reconciliation finds landing pages whose metadata differs from the ledger	Re-push the canonical record; flag for curator review if the repository is authoritative
Orphaned dataset	Deposit succeeded but PID mint did not; ledger shows `deposited` without `resolved`	Replay the content key through the coordinator; idempotency resolves the existing deposit

Two operational habits keep this list short. First, a nightly reconciliation job walks every resolved record and asserts that its PID resolves, its checksum is intact, and its exposed metadata matches the ledger; drift triggers a self-healing re-push rather than waiting for a user report. Second, the DLQ is monitored as a first-class signal — a rising dead-letter rate on a single rule identifier is the earliest warning of schema drift or an upstream policy change, and it points directly at the field to fix.
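The nightly reconciliation check could be sketched as follows. The three probes (`resolves`, `fetch_bytes`, `fetch_metadata`) are injected callables standing in for real HTTP lookups, and the `LedgerRecord` shape and problem labels are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LedgerRecord:
    pid: str
    sha256: str
    metadata: dict


def reconcile(
    records: list[LedgerRecord],
    resolves: Callable[[str], bool],
    fetch_bytes: Callable[[str], bytes],
    fetch_metadata: Callable[[str], dict],
) -> dict[str, list[str]]:
    """Return {pid: [problems]} for every record that drifted from the ledger."""
    drift: dict[str, list[str]] = {}
    for rec in records:
        problems: list[str] = []
        if not resolves(rec.pid):
            # Nothing else can be checked if the identifier does not resolve.
            problems.append("pid_unresolvable")
        else:
            if hashlib.sha256(fetch_bytes(rec.pid)).hexdigest() != rec.sha256:
                problems.append("checksum_mismatch")
            if fetch_metadata(rec.pid) != rec.metadata:
                problems.append("metadata_divergence")
        if problems:
            drift[rec.pid] = problems
    return drift
```

The returned mapping is the input to the self-healing step: records flagged with `metadata_divergence` are candidates for a re-push, while the other two labels usually need a human to look.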

Security and Access Control

FAIR does not mean unconditionally open. The Accessible principle requires that data and metadata are retrievable under clearly defined conditions, which for sensitive research often means metadata is public while the data itself sits behind authorization. The planning layer therefore specifies access control as attribute-based (ABAC) — decisions are computed from dataset attributes such as embargo state, sensitivity classification, and grant conditions, evaluated against the requester’s institutional identity — with role-based (RBAC) mappings for coarse curator and administrator functions.

Three controls are mandatory across every layer. Transport is TLS 1.3 for all API traffic, and data at rest is encrypted with keys held in a managed KMS with scheduled rotation. Every access decision, deposit, and policy evaluation is written to the same append-only audit ledger that records pipeline state, so the security trail and the compliance trail are one artifact. And PII is tokenized or stripped in the metadata layer before any record is crosswalked to a public schema or cross-referenced against external registries such as ORCID or ROR. The cryptographic and identity-provider integration details are specified in the sibling Security & Access Control guide and applied to campus networks in setting up secure data boundaries for academic IT.
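PII tokenization in the metadata layer can be sketched with a keyed hash from the standard library. The field names in `PII_FIELDS` and the `tok_` prefix are hypothetical. In a real deployment the secret would come from the managed KMS rather than being passed around as a literal.

```python
import hashlib
import hmac

PII_FIELDS = {"contact_email", "participant_id"}  # hypothetical field names


def tokenize_pii(record: dict[str, str], secret: bytes) -> dict[str, str]:
    """Replace PII values with keyed, deterministic tokens before public export."""
    out = dict(record)  # never mutate the internal record
    for name in PII_FIELDS.intersection(record):
        digest = hmac.new(
            secret, record[name].encode("utf-8"), hashlib.sha256
        ).hexdigest()
        out[name] = f"tok_{digest[:16]}"
    return out
```

A keyed HMAC gives the same token for the same value under the same secret, so records can still be joined on a tokenized field, while someone without the key cannot confirm a guessed value by hashing it.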

FAQ

How do I decide between a DOI, a Handle, and an ARK for a new repository?

Match the identifier to the object’s lifecycle and citation needs. Use a DataCite DOI when datasets are citable research outputs that funders expect to appear in the literature; the mandatory metadata coupling is a feature, not overhead. Use a Handle when your institution runs its own resolution service and does not need per-identifier registration metadata. Use an ARK for high-volume or short-lived objects where a per-identifier fee would be prohibitive. Decide before first publication; retrofitting a new scheme onto already-cited datasets is expensive and confusing to downstream users.

What happens when the DataCite registration API is unreachable during a deposit?

The pipeline should never block or fail the deposit on a transient registration outage. Reserve the identifier locally against the deterministic content key, mark the deposit as deposited (not yet resolved) in the ledger, and enqueue the remote registration for the next reconciliation run. Because the content key is deterministic, the retry resolves the existing draft rather than minting a duplicate. The circuit-breaker and secondary-registry fallback logic is covered in the API routing guide.

How do I keep funder mandates from silently going stale?

Encode every mandate as a named rule in the governance policy bundle and version-control it. When a funder revises its data-sharing policy, the update is a reviewable diff and a new rule identifier — never an undocumented change in pipeline behavior. Because the compliance evaluator returns rule identifiers, your dashboards and audit exports automatically reflect the new obligation the moment the rule ships.

Can metadata be public while the underlying data stays restricted?

Yes, and this is the common case for sensitive research. Model access as attribute-based: expose the DataCite/schema.org metadata and a resolvable landing page for discovery, while gating the data download behind an authorization check on embargo state, sensitivity class, and grant conditions. PII must be tokenized or removed in the metadata layer before the record is crosswalked to any public schema.

Which repository should we standardize on for grant-funded projects?

There is no single answer. The trade-off is between institutional control (a self-hosted Dataverse or InvenioRDM), low-maintenance hosted services (Zenodo, or an institutional Figshare), and discipline-specific reach (a domain repository). Because repository choice drives deposit semantics, versioning, and identifier scheme, treat it as a planning decision. The detailed comparison for grant-funded work is in the institutional repository strategy guide.

How do I prevent duplicate DOIs when a webhook or batch replays?

Derive a deterministic content key from the dataset slug, version, and title, and check it against the ledger before minting. A replayed webhook, a duplicated broker message, or a re-run batch all collapse to the same key, so the coordinator returns the existing identifier instead of creating a second one. Idempotency at the key boundary is what makes the whole pipeline safe to retry.

Related

Data Governance Frameworks: encode institutional policy as version-controlled, executable validation rules.
Funder Mandate Alignment: map grant conditions to concrete metadata fields and pipeline gates.
Institutional Repository Strategy: choose and integrate a repository via its deposit API.
Open License Configuration: automate SPDX-validated open-license selection and embedding.
Core Architecture & FAIR Mapping: the ingestion, enrichment, and validation internals that feed this pipeline.
Data Ingestion & Metadata Enrichment: lab-notebook parsing, Pydantic validation, and batch processing upstream of deposit.
