The NIH Data Management and Sharing (DMS) Policy, effective January 25, 2023, requires NIH-funded researchers who generate scientific data to submit a Data Management and Sharing Plan (DMSP) and to share that data — ideally through an established repository — with standardized metadata, persistent identifiers, and explicit access conditions. This page is the concrete, field-level worked example behind the broader Funder Mandate Alignment guide: where that guide compiles any funder’s policy into a typed object, here we take one federal mandate end to end and show exactly which DMS-plan field becomes which DataCite property, which SPDX identifier a license string must resolve to, and how a retention window becomes an infrastructure-level lock. It assumes you already run a deposit pipeline and are comfortable with Python 3.10+ and the Pydantic V2 API; the goal is to replace manual reconciliation of administrative DMS plans against repository ingestion with a set of deterministic, machine-actionable gates.

Manual reconciliation fails at institutional scale for a reason. Operational compliance requires translating policy prose into machine-actionable FAIR architecture, and the mapping is not obvious to a curator reading a PDF at reporting time. Each of the four operative DMS-plan requirements satisfies one or more of the FAIR principles the pipeline must ultimately enforce:

Findable: standardized metadata and persistent identifiers.
Accessible: an established repository and explicit access conditions.
Interoperable: standardized metadata.
Reusable: access terms plus metadata.

Compliance failures typically originate from three architectural misalignments: static PDF-based DMS plans lacking machine-readable metadata, repository configurations defaulting to ambiguous licensing strings, and disconnected retention workflows that fail to enforce the institution’s defined preservation window. When institutional systems operate in isolation, automated compliance checks fail at the ingestion stage. Resolving this means embedding the Open Science Infrastructure Planning pipeline directly into repository API paths and metadata harvesters so every gate runs before a dataset is ever exposed. The way each FAIR principle becomes an enforced checkpoint rather than an after-the-fact audit is worked through in the FAIR Principle Breakdown.

The Core Crosswalk: DMS Plan Fields to DataCite Properties

FAIR alignment demands that DMS metadata be exposed via OAI-PMH or REST APIs and shaped to the DataCite Metadata Schema (version 4.5 or later). A critical control point is mapping DMS-plan fields to DataCite properties: legacy institutional repositories often export Dublin Core-only records, which lack the fundingReferences and rightsList arrays needed for automated validation. The full field-level translation from internal records into DataCite is covered in Metadata Schema Mapping; the table below is the authoritative, no-ambiguity crosswalk for the five NIH DMS fields that gate ingestion.

DMS plan field	DataCite 4.5+ property	Required	Format / constraint	Example
`project_title`	`titles.title`	yes	Non-empty string	`"Murine Cortical Single-Cell Atlas"`
`principal_investigator`	`creators.nameIdentifier`	yes	ORCID URI `https://orcid.org/XXXX-XXXX-XXXX-XXXX`	`https://orcid.org/0000-0002-1825-0097`
`funding_agency`	`fundingReferences.funderIdentifier`	yes	Must resolve to the NIH ROR identifier	`https://ror.org/01cwqze88`
`data_format`	`resourceType.resourceTypeGeneral`	yes	DataCite controlled vocabulary	`Dataset`, `Software`, `Collection`
`access_conditions`	`rightsList.rights`	yes	SPDX identifier or controlled-access URI	`CC-BY-4.0` or `dbGaP:phs000001`

Identity fields reference two registries by their canonical form: contributor identity as an ORCID iD in full URI form, and the funding body as a Research Organization Registry (ROR) identifier, so funding_agency resolves to the stable NIH record https://ror.org/01cwqze88 rather than a free-text agency name that drifts between grants.
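
As a minimal sketch of the crosswalk above, the function below reshapes a flat DMS-plan record into DataCite-style property groups. The input keys are the DMS field names from the table; the helper name `crosswalk_dms_plan`, the `CROSSWALK` constant, and the exact output nesting are assumptions made for illustration, and a real deposit API may expect a different shape.

```python
from typing import Any, Dict

# DMS plan field -> (DataCite property group, key inside each entry),
# following the crosswalk table. The nesting is an illustrative assumption.
CROSSWALK = {
    "project_title": ("titles", "title"),
    "principal_investigator": ("creators", "nameIdentifier"),
    "funding_agency": ("fundingReferences", "funderIdentifier"),
    "data_format": ("resourceType", "resourceTypeGeneral"),
    "access_conditions": ("rightsList", "rights"),
}

def crosswalk_dms_plan(plan: Dict[str, str]) -> Dict[str, Any]:
    """Reshape a flat DMS-plan record into DataCite-style property groups."""
    missing = [field for field in CROSSWALK if not plan.get(field, "").strip()]
    if missing:
        # All five fields are marked required in the crosswalk table.
        raise ValueError(f"Missing required DMS plan fields: {missing}")

    record: Dict[str, Any] = {}
    for field, (group, key) in CROSSWALK.items():
        value = plan[field].strip()
        if group == "resourceType":
            # Single-valued property: store one object rather than a list.
            record[group] = {key: value}
        else:
            record[group] = [{key: value}]
    return record

example = crosswalk_dms_plan({
    "project_title": "Murine Cortical Single-Cell Atlas",
    "principal_investigator": "https://orcid.org/0000-0002-1825-0097",
    "funding_agency": "https://ror.org/01cwqze88",
    "data_format": "Dataset",
    "access_conditions": "CC-BY-4.0",
})
```

Keeping the mapping in a single table-shaped constant means a change to the crosswalk is a one-line data edit rather than a code change, which is the property the gates downstream rely on.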

Production Implementation

Use the Pydantic V2 API for strict schema validation before submission. The DataCite REST API v2 rejects malformed ORCID/ROR URIs, so apply to the DMS-plan fields the same Pydantic validation discipline the ingestion layer applies to datasets. The point is to fail loudly at the edge instead of deep inside the dispatch loop.

python

import logging
from typing import List
from pydantic import BaseModel, Field, field_validator, HttpUrl
import re

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

class Creator(BaseModel):
    name: str
    nameIdentifier: HttpUrl
    nameIdentifierScheme: str = Field(default="ORCID")

class FundingReference(BaseModel):
    funderName: str
    funderIdentifier: HttpUrl
    funderIdentifierType: str = Field(default="ROR")

class DMSMetadata(BaseModel):
    titles: List[dict]
    creators: List[Creator]
    fundingReferences: List[FundingReference]
    resourceTypeGeneral: str
    rightsList: List[dict]

    @field_validator("creators")
    @classmethod
    def validate_orcid(cls, v: List[Creator]) -> List[Creator]:
        for c in v:
            if not re.match(r"^https://orcid\.org/\d{4}-\d{4}-\d{4}-\d{3}[\dX]$", str(c.nameIdentifier)):
                raise ValueError(f"Invalid ORCID URI: {c.nameIdentifier}")
        return v

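    @field_validator("fundingReferences")
    @classmethod
    def validate_ror(cls, v: List[FundingReference]) -> List[FundingReference]:
        # Match the full ROR URI shape. A substring test for "ror.org" would
        # also accept unrelated URLs that merely contain that text.
        for f in v:
            if not re.match(r"^https://ror\.org/0[a-hj-km-np-tv-z0-9]{6}\d{2}$", str(f.funderIdentifier)):
                raise ValueError(f"Funder identifier must be a valid ROR URI: {f.funderIdentifier}")
        return v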

def validate_and_serialize(metadata_dict: dict) -> str:
    """Validates NIH DMS fields against DataCite 4.5+ constraints."""
    try:
        validated = DMSMetadata(**metadata_dict)
        return validated.model_dump_json(indent=2)
    except Exception as e:
        logging.error("Schema validation failed: %s", e)
        raise
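
The regex in validate_orcid checks only the shape of the URI. ORCID documents the final character of an iD as an ISO 7064 MOD 11-2 check digit, so a stricter gate can also catch typos in otherwise well-formed iDs. The following is a minimal sketch of that check; the function names are hypothetical and not part of the pipeline above.

```python
import re

ORCID_URI = re.compile(r"^https://orcid\.org/(\d{4})-(\d{4})-(\d{4})-(\d{3}[\dX])$")

def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def has_valid_orcid_checksum(uri: str) -> bool:
    """Return True only for a full ORCID URI whose check character is correct."""
    match = ORCID_URI.match(uri)
    if not match:
        return False
    digits = "".join(match.groups())  # 16 characters, hyphens removed
    return orcid_check_digit(digits[:15]) == digits[15]
```

With this in place, https://orcid.org/0000-0002-1825-0097 passes, while the same iD with its last digit changed fails even though it still matches the shape regex.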

Pre-Ingest License and Access Control Validation

The NIH DMS Policy expects researchers to maximize appropriate data sharing, limiting access only where privacy, legal, or ethical constraints justify it. Academic IT teams must therefore implement a validation gate that rejects ambiguous license strings and resolves them against the SPDX License List, whose canonical identifiers (CC-BY-4.0, CC0-1.0, Apache-2.0), matched exactly as published, make license resolution deterministic. Controlled-access datasets must route through dbGaP or an institutional IRB-approved gateway before repository ingestion. The registry-driven selection logic that keeps this whitelist maintainable without hardcoding exceptions lives in Open License Configuration.

The gate runs in four steps: parse the incoming license string; match it against the SPDX list and reject non-standard strings (e.g. "Open Access", "Permissive"); if access is restricted, require an IRB protocol number or dbGaP study ID; then return a structured compliance payload for downstream API routing.

python

import re
import logging
import requests
from typing import Optional
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

SPDX_API = "https://spdx.org/licenses/licenses.json"
session = requests.Session()
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

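def fetch_spdx_licenses() -> dict:
    """Fetches the SPDX license list and returns a licenseId -> name mapping."""
    resp = session.get(SPDX_API, timeout=10)
    resp.raise_for_status()
    return {lic["licenseId"]: lic["name"] for lic in resp.json()["licenses"]}

# Fetched once at import time (a network call); later validations reuse this
# in-memory copy and need no further requests.
SPDX_CACHE = fetch_spdx_licenses()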

def validate_license_and_access(license_str: str, access_type: str, irb_id: Optional[str] = None) -> dict:
    """Enforces NIH open-access expectations and controlled-access routing."""
    # SPDX identifiers are case-sensitive (e.g. "CC-BY-4.0", "Apache-2.0"),
    # so only trim surrounding whitespace; do not alter the case.
    license_str = license_str.strip()
    if license_str not in SPDX_CACHE:
        raise ValueError(f"License '{license_str}' is not a valid SPDX identifier. Use SPDX standard IDs only.")

    if access_type == "open":
        if license_str not in ("CC-BY-4.0", "CC0-1.0", "MIT", "Apache-2.0"):
            logging.warning("Non-preferred open license detected: %s", license_str)
        return {"status": "approved", "license_id": license_str, "routing": "public_ingest"}

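    elif access_type == "controlled":
        # Accept an institutional IRB number (the "IRB-<digits>" format is an
        # assumption) or a dbGaP study accession such as "dbGaP:phs000001",
        # matching the example in the crosswalk table.
        if not irb_id or not re.match(r"^(IRB-\d{4,10}|dbGaP:phs\d{6}(\.v\d+\.p\d+)?)$", irb_id):
            raise ValueError("Controlled access requires valid IRB or dbGaP protocol ID.")
        return {"status": "approved", "license_id": license_str, "routing": "restricted_gateway", "protocol": irb_id}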

    raise ValueError("Invalid access_type. Must be 'open' or 'controlled'.")
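
The gate above accepts only the canonical casing of an identifier. As a sketch of how that can be exercised without a network call, the helper below resolves against a small in-memory set and, when only the casing differs, still rejects the input but names the canonical form in the error. The set `SAMPLE_SPDX_IDS` and the function `resolve_spdx_id` are hypothetical stand-ins, not part of the SPDX tooling.

```python
from typing import AbstractSet

# Hypothetical offline subset of the SPDX License List, for illustration only.
SAMPLE_SPDX_IDS = frozenset({"CC-BY-4.0", "CC0-1.0", "MIT", "Apache-2.0"})

def resolve_spdx_id(raw: str, known_ids: AbstractSet[str] = SAMPLE_SPDX_IDS) -> str:
    """Return the identifier if it matches exactly; otherwise raise ValueError."""
    candidate = raw.strip()  # trim whitespace only, never change the case
    if candidate in known_ids:
        return candidate

    # Offer a hint when only the casing differs, but still reject the input.
    by_folded = {spdx_id.casefold(): spdx_id for spdx_id in known_ids}
    hint = by_folded.get(candidate.casefold())
    if hint:
        raise ValueError(f"'{candidate}' is not an SPDX identifier; did you mean '{hint}'?")
    raise ValueError(f"'{candidate}' is not an SPDX identifier.")
```

Passing the identifier set as a parameter also lets a test suite run the gate against a fixed fixture instead of the live list.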

Automated Retention and Preservation Manifests

The NIH DMS Policy expects shared scientific data to remain accessible for a meaningful period, but it does not fix a single federal retention duration; the applicable window is set by the chosen repository’s preservation commitment and the institution’s records-retention requirements. Disconnected retention workflows result in premature deletion or unverified storage states. Compliance requires automated lifecycle tagging, cryptographic checksum verification, and periodic format-migration triggers, all driven by a configurable retention policy:

Apply immutable retention_until and policy_version metadata at ingest.
Generate SHA-256 manifests at upload and re-verify them quarterly.
Enforce WORM (Write Once Read Many) locks on institutional object storage (AWS S3 Object Lock, Wasabi, or on-prem Ceph) for the configured window.
Trigger periodic format-validation checks across that window to ensure long-term readability.

The retention lifecycle as a state machine: an artifact is Tagged at ingest, Verified by SHA-256 manifest, then WORM-locked into Preserved, where a quarterly checksum re-verifies it. A format risk detours through Migrated and a checksum mismatch through Quarantined, both returning to Preserved, until the retention window is reached and the object is Expired.
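
One way to make that lifecycle enforceable is an explicit transition table, so an illegal jump (for example Tagged straight to Expired) raises instead of being silently recorded. The sketch below assumes the six states named above; the event names are hypothetical labels chosen for illustration.

```python
from enum import Enum

class RetentionState(Enum):
    TAGGED = "tagged"
    VERIFIED = "verified"
    PRESERVED = "preserved"
    MIGRATED = "migrated"
    QUARANTINED = "quarantined"
    EXPIRED = "expired"

S = RetentionState

# (current state, event) -> next state. Event names are assumptions.
TRANSITIONS = {
    (S.TAGGED, "manifest_verified"): S.VERIFIED,
    (S.VERIFIED, "worm_locked"): S.PRESERVED,
    (S.PRESERVED, "checksum_ok"): S.PRESERVED,        # quarterly re-verification
    (S.PRESERVED, "format_risk"): S.MIGRATED,
    (S.PRESERVED, "checksum_mismatch"): S.QUARANTINED,
    (S.MIGRATED, "migration_complete"): S.PRESERVED,
    (S.QUARANTINED, "restored"): S.PRESERVED,
    (S.PRESERVED, "retention_reached"): S.EXPIRED,
}

def advance(state: RetentionState, event: str) -> RetentionState:
    """Return the next state, or raise if the event is not allowed from here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"Event '{event}' is not allowed in state {state.name}") from None
```

Because Expired has no outgoing entries, it is terminal by construction.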

python

import hashlib
import datetime
from pathlib import Path
from typing import BinaryIO, Optional

# Retention window is institution/repository-defined, not a fixed NIH duration.
RETENTION_YEARS = 25

def compute_sha256(file_stream: BinaryIO, chunk_size: int = 8192) -> str:
    """Generates SHA-256 checksum for integrity verification."""
    sha256 = hashlib.sha256()
    while chunk := file_stream.read(chunk_size):
        sha256.update(chunk)
    file_stream.seek(0)
    return sha256.hexdigest()

def add_years(dt: datetime.datetime, years: int) -> datetime.datetime:
    """Adds calendar years, clamping Feb 29 to Feb 28 on non-leap years."""
    try:
        return dt.replace(year=dt.year + years)
    except ValueError:
        return dt.replace(year=dt.year + years, day=28)

def generate_retention_manifest(file_path: Path, ingest_date: Optional[datetime.datetime] = None) -> dict:
    """Creates machine-actionable retention metadata for the configured retention window."""
    ingest = ingest_date or datetime.datetime.now(datetime.timezone.utc)
    expiry = add_years(ingest, RETENTION_YEARS)

    with open(file_path, "rb") as f:
        checksum = compute_sha256(f)

    # Key names follow the retention_until / policy_version tags described above.
    return {
        "object_id": file_path.name,
        "ingest_timestamp": ingest.isoformat(),
        "retention_until": expiry.isoformat(),
        "policy_version": f"NIH_DMS_2023_WORM_{RETENTION_YEARS}YR",
        "integrity_sha256": checksum,
        "next_verification": (ingest + datetime.timedelta(days=90)).isoformat()
    }
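
The manifest schedules a next_verification date but the code above does not show the re-check itself. A minimal sketch of the quarterly re-verification follows, assuming the manifest carries the integrity_sha256 key produced above; the `last_verified` and `integrity_status` keys and the function names are hypothetical additions.

```python
import datetime
import hashlib
import tempfile
from pathlib import Path
from typing import Optional

def sha256_of(path: Path, chunk_size: int = 8192) -> str:
    """Stream the file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def reverify_manifest(file_path: Path, manifest: dict,
                      now: Optional[datetime.datetime] = None) -> dict:
    """Return an updated copy of the manifest after re-checking the checksum."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    matches = sha256_of(file_path) == manifest["integrity_sha256"]

    updated = dict(manifest)  # leave the original manifest untouched
    updated["last_verified"] = now.isoformat()
    updated["integrity_status"] = "verified" if matches else "quarantined"
    if matches:
        # Schedule the next check only for objects that passed this one.
        updated["next_verification"] = (now + datetime.timedelta(days=90)).isoformat()
    return updated

# Demonstration on a throwaway file: verify, then simulate corruption.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "atlas.h5"
    artifact.write_bytes(b"example payload")
    manifest = {"object_id": artifact.name, "integrity_sha256": sha256_of(artifact)}
    clean = reverify_manifest(artifact, manifest)
    artifact.write_bytes(b"corrupted payload")
    damaged = reverify_manifest(artifact, manifest)
```

Returning a new dictionary rather than mutating the input keeps the ingest-time manifest available for audit alongside each verification result.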

Wiring the Gates into a Continuous Compliance Pipeline

Automated validation must run continuously, not as a pre-submission checklist. Integrate the schema, license, and retention validators into a CI/CD pipeline that triggers on repository commits, API webhooks, or scheduled cron jobs. The pipeline pushes validated records to the institutional repository, registers DOIs via DataCite, and logs compliance states for audit trails. Three external constraints shape it: the DataCite API v2 requires Authorization: Bearer <TOKEN> and rate-limits at 100 requests/minute (use exponential backoff); OAI-PMH harvesting must propagate metadata updates within 24 hours and page large syncs with resumptionToken; and every validation failure must route to a dead-letter path with immutable timestamps. That dead-letter and replay discipline is the same one described in API Routing & Fallbacks.

python

import os
import json
import logging
import requests
from pathlib import Path
from typing import Dict, Any

DATACITE_API = "https://api.datacite.org/dois"
DATACITE_TOKEN = os.getenv("DATACITE_API_TOKEN")

def register_doi_and_push_metadata(doi: str, metadata_json: str) -> Dict[str, Any]:
    """Pushes validated metadata to DataCite and logs compliance status."""
    if not DATACITE_TOKEN:
        raise EnvironmentError("DATACITE_API_TOKEN environment variable is required.")

    headers = {
        "Authorization": f"Bearer {DATACITE_TOKEN}",
        "Content-Type": "application/vnd.api+json"
    }

    payload = {
        "data": {
            "id": doi,
            "type": "dois",
            "attributes": json.loads(metadata_json)
        }
    }

    try:
        resp = requests.post(DATACITE_API, json=payload, headers=headers, timeout=30)
        resp.raise_for_status()
        logging.info("DOI %s registered successfully. Compliance status: ACTIVE", doi)
        return resp.json()
    except requests.exceptions.HTTPError as e:
        logging.error("DataCite API rejection: %s | Response: %s", e, resp.text)
        raise

def run_compliance_pipeline(dms_plan: Dict[str, Any], artifact_path: Path, doi: str) -> None:
    """End-to-end validation and ingestion pipeline."""
    metadata_json = validate_and_serialize(dms_plan)
    access_payload = validate_license_and_access(
        dms_plan.get("license", ""),
        dms_plan.get("access_type", "open"),
        dms_plan.get("irb_id")
    )
    retention_manifest = generate_retention_manifest(artifact_path)

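    # DataCite attributes follow the DataCite schema, so keep the compliance
    # artifacts out of the registration payload and record them in the audit
    # log instead of merging them into the metadata.
    logging.info(
        "Compliance artifacts for %s: %s",
        doi,
        json.dumps({"access_control": access_payload, "retention_policy": retention_manifest}),
    )

    register_doi_and_push_metadata(doi, metadata_json)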
    logging.info("Pipeline execution complete. Artifact %s aligned with NIH DMS policy.", artifact_path.name)
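
The registration call above makes a single attempt. As a sketch of the exponential backoff and dead-letter routing described in this section, the helper below retries a send callable with doubling delays and, once attempts are exhausted, appends a timestamped record to an append-only JSON Lines file. The function name, the file layout, the record fields, and the example DOI are all assumptions for illustration.

```python
import datetime
import json
import tempfile
import time
from pathlib import Path
from typing import Any, Callable, Dict

def dispatch_with_backoff(send: Callable[[Dict[str, Any]], Any],
                          record: Dict[str, Any],
                          dead_letter_path: Path,
                          attempts: int = 4,
                          base_delay: float = 1.0,
                          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try send(record) up to `attempts` times; dead-letter it on final failure."""
    last_error = ""
    for attempt in range(attempts):
        try:
            send(record)
            return True
        except Exception as exc:  # broad for the sketch; narrow this in real code
            last_error = f"{type(exc).__name__}: {exc}"
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

    entry = {
        "failed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "attempts": attempts,
        "error": last_error,
        "record": record,
    }
    # Append-only write: earlier dead-letter entries are never rewritten.
    with open(dead_letter_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return False

# Demonstration with a sender that always fails and a recording stand-in for sleep.
delays: list = []

def always_fails(_record: Dict[str, Any]) -> None:
    raise ConnectionError("simulated outage")

with tempfile.TemporaryDirectory() as tmp:
    dlq = Path(tmp) / "dead_letter.jsonl"
    delivered = dispatch_with_backoff(always_fails, {"doi": "10.1234/example"}, dlq,
                                      sleep=delays.append)
    dead_letters = [json.loads(line) for line in dlq.read_text(encoding="utf-8").splitlines()]
```

Injecting sleep as a parameter lets tests assert the delay schedule without actually waiting.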

Verification

Assert the gates without touching a live DataCite endpoint by exercising the validators directly. The test below confirms that a well-formed DMS plan serializes cleanly, that a bare ORCID (missing the URI prefix) is rejected, and that an open deposit carrying a non-SPDX license string fails the access gate.

python

import pytest

def _valid_plan() -> dict:
    return {
        "titles": [{"title": "Murine Cortical Single-Cell Atlas"}],
        "creators": [{
            "name": "Vale, R.",
            "nameIdentifier": "https://orcid.org/0000-0002-1825-0097",
        }],
        "fundingReferences": [{
            "funderName": "National Institutes of Health",
            "funderIdentifier": "https://ror.org/01cwqze88",
        }],
        "resourceTypeGeneral": "Dataset",
        "rightsList": [{"rights": "CC-BY-4.0"}],
    }

def test_valid_plan_serializes() -> None:
    out = validate_and_serialize(_valid_plan())
    assert "0000-0002-1825-0097" in out

def test_bare_orcid_is_rejected() -> None:
    plan = _valid_plan()
    plan["creators"][0]["nameIdentifier"] = "0000-0002-1825-0097"
    with pytest.raises(ValueError):  # pydantic's ValidationError subclasses ValueError
        validate_and_serialize(plan)

def test_non_spdx_license_is_rejected() -> None:
    with pytest.raises(ValueError):
        validate_license_and_access("Open Access", "open")

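Run it with pytest -q test_nih_alignment.py. A passing run reports 3 passed. A rejected plan also emits an ERROR log line beginning Schema validation failed:, which is the same telemetry the audit log and compliance dashboards consume. Note that a bare ORCID fails Pydantic's HttpUrl parsing before validate_orcid runs, so its log line carries a URL-parsing error; the Invalid ORCID URI: ... message appears only when a well-formed URL fails the ORCID pattern.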

Gotchas

A bare ORCID is not an ORCID URI. DataCite’s nameIdentifier requires the full https://orcid.org/ prefix; a value like 0000-0002-1825-0097 passes a naive length check but is rejected at registration. Fix: validate against the full URI regex (as validate_orcid does) and never accept the 16-digit form alone.
The gate matches SPDX identifiers case-sensitively. Lowercasing or “normalizing” a license string turns CC-BY-4.0 into a value that matches no entry in the cached SPDX License List, so the gate rejects a perfectly valid license with a misleading "not a valid SPDX identifier" error. Fix: only strip() surrounding whitespace; never alter the case of an SPDX identifier.
“A meaningful period” is not a retention number. Hardcoding a federal duration invents a rule NIH never set. Fix: source RETENTION_YEARS from the repository’s preservation commitment and institutional records policy, and carry the origin in policy_version so an auditor can reconstruct why that window applied.
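
As a sketch of the last fix, the retention window can be loaded from configuration together with its provenance rather than living as a bare constant. The `RetentionPolicy` type, the `load_retention_policy` helper, and the config keys below are hypothetical; the values in the example are placeholders, not recommendations.

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass(frozen=True)
class RetentionPolicy:
    years: int
    source: str          # which commitment or records policy set the window
    policy_version: str  # carried into the manifest for auditors

def load_retention_policy(config: Dict[str, Any]) -> RetentionPolicy:
    """Validate a retention config and refuse windows with no stated origin."""
    years = config.get("retention_years")
    if isinstance(years, bool) or not isinstance(years, int) or years <= 0:
        raise ValueError("retention_years must be a positive integer")

    source = str(config.get("source", "")).strip()
    if not source:
        raise ValueError("source is required so the window can be traced to a policy")

    version = str(config.get("policy_version", "")).strip()
    if not version:
        raise ValueError("policy_version is required")

    return RetentionPolicy(years=years, source=source, policy_version=version)

# Placeholder values for illustration only.
policy = load_retention_policy({
    "retention_years": 10,
    "source": "institutional records schedule (placeholder)",
    "policy_version": "2024-01",
})
```

Making the dataclass frozen means the window cannot be altered after load, so the value written into each manifest is the one that was validated.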

Related Guides

Funder Mandate Alignment: the parent guide that compiles any funder’s policy into the typed object and gates this NIH example specializes.
Open License Configuration: the SPDX registry and resolution rules the license gate above depends on.
Institutional Repository Strategy: the deposit pipeline that consumes the validated metadata and retention manifests produced here.
Metadata Schema Mapping: the full field-by-field crosswalk from internal records into the DataCite Metadata Schema.
