Building a Data Management Plan Template for Researchers: FAIR Automation & Compliance Implementation

Static, manually authored Data Management Plans (DMPs) consistently fail runtime validation against evolving funder mandates and institutional retention policies. The structural deficiency stems from narrative formatting, absent machine-readable schemas, non-deterministic license mapping, and missing persistent identifier (PID) resolution pathways. This page is a build guide: it shows a Python automation engineer or research data manager how to replace a free-text DMP with a typed template that enforces strict field presence, controlled vocabularies, and cross-referenced compliance rules at write time. It assumes you can already run Pydantic schema validation and want the exact schema contract, funder-conditional logic, and routing dispatcher that a plan template needs. This template is the input artifact for the wider policy pipeline described in Data Governance Frameworks — the plan is where obligations are captured; that pipeline is where they are enforced against every deposit.

The Schema Contract

The template must operate as a strict contract between researcher input and automated compliance engines. Use the JSON Schema serialization standard as the on-disk representation, but enforce validation through a typed object model so every field carries a machine-checkable constraint rather than free text. Separate metadata capture from policy enforcement to prevent cross-contamination of descriptive and compliance-critical fields. Each schema node maps to exactly one FAIR principle, so a failing field names the principle it breaks — the same principle-to-component mapping formalized in the FAIR Principle Breakdown.

The table below is the core reference artifact: it is the field-by-field contract the template enforces before a plan is ever accepted. Avoid free-text fields for any compliance-critical node — use enumerated types, regex constraints, and integer bounds to eliminate ambiguity.

Template field	Type / constraint	FAIR principle	Failure action
`dataset_title`	`str`, 5–255 chars	Findable	Reject — required field
`creator_orcid`	ORCID iD, URI form `https://orcid.org/XXXX-XXXX-XXXX-XXXX`	Findable	Reject — malformed identifier
`pid_assigned`	`bool` (triggers minting when `False`)	Findable	Route to PID minting service
`funding_agency`	`Literal["NSF","NIH","Horizon_Europe","UKRI","Internal"]`	Reusable	Reject — unknown funder
`grant_type`	`str`, regex `^[A-Z0-9\-]+$`	Reusable	Reject — unstructured grant code
`sensitivity_classification`	`Literal["Public","Controlled","Restricted","Confidential"]`	Accessible	Route to matching access tier
`data_format`	Controlled vocabulary (e.g. `CSV`, `NetCDF`, `FASTQ`)	Interoperable	Reject — out-of-vocabulary format
`spdx_license_id`	SPDX License List identifier, allowlist-checked	Reusable	Reject — invalid or prohibited license
`retention_months`	`int ≥ 0`, `≥` funder minimum	Reusable	Reject — below funder floor
`repository_target`	`str`, regex `^https?://`	Accessible	Reject — non-resolvable endpoint
`embargo_end_date`	`date` or `None`; forbidden when access is `Public`	Accessible	Reject — embargo/sensitivity conflict
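
The `creator_orcid` and `data_format` rows are part of the contract but are not carried by the model shown later on this page, so the following is a minimal sketch of how those two checks could look. The ORCID check character follows ISO 7064 MOD 11-2, which is the scheme ORCID iDs use. The vocabulary set is an assumption that holds only the three formats named in the table, and the function names are hypothetical.

```python
import re

# URI form from the table: https://orcid.org/XXXX-XXXX-XXXX-XXXX
ORCID_URI = re.compile(r"https://orcid\.org/(\d{4}-\d{4}-\d{4}-\d{3}[\dX])")

# Illustrative vocabulary only: the three formats named in the table.
DATA_FORMATS: frozenset = frozenset({"CSV", "NetCDF", "FASTQ"})


def orcid_check_digit(base_digits: str) -> str:
    """Return the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def validate_creator_orcid(value: str) -> str:
    """Reject anything that is not a well-formed, checksum-valid ORCID URI."""
    match = ORCID_URI.fullmatch(value)
    if match is None:
        raise ValueError(f"Malformed ORCID iD: {value}")
    digits = match.group(1).replace("-", "")
    if orcid_check_digit(digits[:15]) != digits[15]:
        raise ValueError(f"ORCID checksum mismatch: {value}")
    return value


def validate_data_format(value: str) -> str:
    """Reject formats outside the controlled vocabulary."""
    if value not in DATA_FORMATS:
        raise ValueError(f"Out-of-vocabulary format: {value}")
    return value
```

Both functions raise `ValueError`, so they could be attached to a Pydantic model as field validators without changing how failures surface.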

Licensing is recorded as an SPDX License List identifier so a legal permission set reduces to a single canonical token; selecting and encoding that token is covered in Open License Configuration. Discovery metadata carries a Schema.org Dataset shape on export so harvesters can index the plan’s dataset once published.
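
As a rough illustration of the export step, the sketch below maps a validated plan onto a Schema.org Dataset record in JSON-LD form. It assumes the plan arrives as a plain dict keyed by the template field names; the `pid` key and the function name are hypothetical, and the license is expressed as the SPDX License List page URL for the identifier.

```python
import json


def to_schema_org_dataset(plan: dict) -> dict:
    """Build a minimal Schema.org Dataset record from validated plan fields."""
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": plan["dataset_title"],
        # SPDX publishes one page per identifier under this URL pattern.
        "license": f"https://spdx.org/licenses/{plan['spdx_license_id']}.html",
        "url": plan["repository_target"],
    }
    # Optional fields are emitted only when the plan actually carries them.
    if plan.get("creator_orcid"):
        record["creator"] = {"@type": "Person", "@id": plan["creator_orcid"]}
    if plan.get("pid"):  # hypothetical key holding the minted identifier
        record["identifier"] = plan["pid"]
    return record


example = to_schema_org_dataset(
    {
        "dataset_title": "Genomic Sequencing Cohort Alpha",
        "spdx_license_id": "CC-BY-4.0",
        "repository_target": "https://repository.institution.edu/ds/alpha",
    }
)
print(json.dumps(example, indent=2))
```

Keeping the mapping in one function means the harvester-facing shape can change without touching the validation model.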

Funder Mandate Decision Table

Funder mandate misalignment is the primary cause of validation failure. NSF, NIH, Horizon Europe, and UKRI enforce divergent retention windows and license allowances, so hardcoding a single retention value or accepting an unstructured license string will break automated checks. The template resolves funding_agency and grant_type at initialization and applies the deterministic rules below. These are the obligations formalized in Funder Mandate Alignment; the NIH-specific crosswalk is worked end-to-end in aligning NIH data sharing policies with FAIR principles.

Funding agency	Retention minimum (months)	License rule	Embargo allowance
`NSF`	36	SPDX open license; reject `-NC`/`-ND` variants	Discipline-dependent, bounded
`NIH`	120	SPDX open license; reject `-NC`/`-ND` variants	Justified access limits only
`Horizon_Europe`	60	SPDX open license; reject `-NC`/`-ND` variants	Bounded embargo permitted
`UKRI`	120	SPDX open license; reject `-NC`/`-ND` variants	Bounded embargo permitted
`Internal`	0	Any valid SPDX identifier	Unrestricted

The retention minimums are the institutional floors used throughout this page; the underlying preservation window is set by the chosen repository and your records-retention policy, so treat the table as the lower bound the template must never fall below.
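
One way to read the decision table is as data rather than branching logic. The sketch below encodes the retention and license columns in a small rule object and detects `-NC`/`-ND` variants by inspecting the identifier's components. The `FunderRule` class and both helper functions are hypothetical names, and the check assumes the identifier has already passed the SPDX allowlist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunderRule:
    retention_min_months: int
    allow_restricted_variants: bool  # True only where NC/ND licenses are acceptable


# Values copied from the decision table above.
FUNDER_RULES = {
    "NSF": FunderRule(36, False),
    "NIH": FunderRule(120, False),
    "Horizon_Europe": FunderRule(60, False),
    "UKRI": FunderRule(120, False),
    "Internal": FunderRule(0, True),
}


def is_restricted_variant(spdx_id: str) -> bool:
    """True when the identifier has an NC or ND component, e.g. CC-BY-NC-SA-4.0."""
    return bool({"NC", "ND"} & set(spdx_id.split("-")))


def check_plan(agency: str, spdx_id: str, retention_months: int) -> list:
    """Return every rule the plan breaks; an empty list means it passes."""
    rule = FUNDER_RULES[agency]
    problems = []
    if retention_months < rule.retention_min_months:
        problems.append(
            f"retention {retention_months} < {agency} floor {rule.retention_min_months}"
        )
    if not rule.allow_restricted_variants and is_restricted_variant(spdx_id):
        problems.append(f"{agency} rejects restricted variant {spdx_id}")
    return problems


print(check_plan("NIH", "CC-BY-NC-SA-4.0", 24))
```

Matching on components rather than on an explicit list catches variants such as `CC-BY-NC-SA-4.0` that a hand-maintained set may miss, at the cost of relying on the identifier's naming pattern.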

Production Python Implementation

The following implementation uses the Pydantic V2 API for strict type enforcement, field-level validation, and conditional mandate checking. The pipeline isolates validation from UI generation, enabling direct integration with CI/CD workflows and repository submission APIs. Both the SPDX allowlist and the funder-retention floors live in single shared constants so a policy change is one edit, not two drifting copies.

python

from pydantic import BaseModel, Field, field_validator, model_validator, ValidationError
from typing import Literal, Optional
from datetime import date

# SPDX License Registry (truncated for brevity; load from the full SPDX License
# List in production). NC/ND variants are valid SPDX identifiers, so they pass
# schema validation here but are blocked for public funders by the model-level
# policy check below.
VALID_SPDX_LICENSES: set[str] = {
    "CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "CC-BY-NC-4.0", "CC-BY-ND-4.0",
    "MIT", "Apache-2.0", "BSD-3-Clause",
}

# Funder mandate configuration: retention floor in months, keyed by agency.
FUNDER_RETENTION_MIN: dict[str, int] = {
    "NSF": 36,
    "NIH": 120,
    "Horizon_Europe": 60,
    "UKRI": 120,
    "Internal": 0,
}

# Public funders reject non-commercial / no-derivatives licenses.
PUBLIC_FUNDERS: frozenset[str] = frozenset({"NSF", "NIH", "Horizon_Europe", "UKRI"})
PROHIBITED_PUBLIC_LICENSES: frozenset[str] = frozenset({"CC-BY-NC-4.0", "CC-BY-ND-4.0"})


class DMPComplianceError(ValueError):
    """Raised for funder / policy violations; wrapped by Pydantic's ValidationError."""


class DataManagementPlan(BaseModel):
    dataset_title: str = Field(..., min_length=5, max_length=255)
    funding_agency: Literal["NSF", "NIH", "Horizon_Europe", "UKRI", "Internal"]
    grant_type: str = Field(..., pattern=r"^[A-Z0-9\-]+$")
    retention_months: int = Field(..., ge=0)
    spdx_license_id: str
    sensitivity_classification: Literal["Public", "Controlled", "Restricted", "Confidential"]
    repository_target: str = Field(..., pattern=r"^https?://")
    pid_assigned: bool = False
    embargo_end_date: Optional[date] = None

    @field_validator("spdx_license_id")
    @classmethod
    def validate_spdx(cls, v: str) -> str:
        if v not in VALID_SPDX_LICENSES:
            raise DMPComplianceError(f"Invalid SPDX license: {v}.")
        return v

    @field_validator("retention_months")
    @classmethod
    def validate_retention(cls, v: int, info) -> int:
        # info.data holds already-validated fields; funding_agency is declared
        # before retention_months, so it is available here.
        agency = info.data.get("funding_agency")
        if agency in FUNDER_RETENTION_MIN:
            minimum = FUNDER_RETENTION_MIN[agency]
            if v < minimum:
                raise DMPComplianceError(
                    f"Retention {v} months violates {agency} minimum of {minimum} months."
                )
        return v

    @model_validator(mode="after")
    def enforce_cross_field_policy(self) -> "DataManagementPlan":
        if self.funding_agency in PUBLIC_FUNDERS and self.spdx_license_id in PROHIBITED_PUBLIC_LICENSES:
            raise DMPComplianceError(
                f"Public funder {self.funding_agency} prohibits {self.spdx_license_id}; "
                "use CC-BY-4.0 or CC0-1.0."
            )
        if self.sensitivity_classification == "Public" and self.embargo_end_date is not None:
            raise DMPComplianceError("Public datasets cannot carry an embargo period.")
        return self

    def generate_compliance_report(self) -> dict:
        return {
            "status": "VALID",
            "funder": self.funding_agency,
            "retention_months": self.retention_months,
            "license": self.spdx_license_id,
            "pid_required": not self.pid_assigned,
            "routing_target": self.repository_target,
        }


# Production usage: any validator failure surfaces as a single ValidationError.
try:
    dmp = DataManagementPlan(
        dataset_title="Genomic Sequencing Cohort Alpha",
        funding_agency="NIH",
        grant_type="R01-CA-2024",
        retention_months=120,
        spdx_license_id="CC-BY-4.0",
        sensitivity_classification="Controlled",
        repository_target="https://repository.institution.edu/ds/alpha",
        pid_assigned=False,
        embargo_end_date=None,
    )
    print(dmp.generate_compliance_report())
except ValidationError as e:
    # Pydantic V2 wraps every validator failure — including the custom
    # DMPComplianceError raised above — in one ValidationError. Catch it at the
    # submission gateway, log e.json() for audit, and reject before the payload
    # reaches the repository API. The original message is preserved in the wrapper.
    print(f"Validation failed: {e}")

The template enforces SPDX compliance via a strict allowlist, validates retention against funder-specific minimums, and blocks contradictory license and embargo states at the model level. Integrate this validation layer directly into your submission gateway so a non-compliant plan never advances.

Routing & PID Assignment

Once validation passes, the template must trigger automated routing — do not rely on manual curator intervention. A dispatcher reads repository_target, sensitivity_classification, and pid_assigned and drives the plan’s dataset through the lifecycle below.

PID resolution. When pid_assigned is False, invoke the institutional Handle/DOI registration API. Pass the validated dataset_title, creator_orcid, and spdx_license_id to the minting service, then store the returned PID on the plan artifact.
Access-tier mapping. Route Public datasets to open-access endpoints; route Controlled and Restricted datasets to authenticated gateways with role-based access control; route Confidential datasets to secure institutional storage behind explicit data use agreements. The boundary enforcement for these tiers is detailed in security & access control.
Retention enforcement. Use retention_months to schedule expiration or review triggers, and align them with the deposit target chosen under Institutional Repository Strategy so records are neither deleted early nor stored indefinitely out of policy.
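
The three steps above can be sketched as a dispatcher that turns a validated plan into an ordered list of actions. This is an outline under stated assumptions: the route labels, the action names, and the `dispatch` function are hypothetical, no real minting or repository API is called, and the review date is taken to be the first day of the month after the retention window ends.

```python
from datetime import date

# Hypothetical route labels; substitute the endpoints your institution runs.
ACCESS_TIER_ROUTES = {
    "Public": "open_access_endpoint",
    "Controlled": "authenticated_gateway",
    "Restricted": "authenticated_gateway",
    "Confidential": "secure_institutional_storage",
}


def retention_review_date(deposited: date, retention_months: int) -> date:
    """First day of the month after the retention window has fully elapsed."""
    index = deposited.year * 12 + (deposited.month - 1) + retention_months + 1
    return date(index // 12, index % 12 + 1, 1)


def dispatch(plan: dict, deposited: date) -> list:
    """Return (action, argument) pairs in the order they should run."""
    actions = []
    if not plan["pid_assigned"]:
        # Minting comes first so later steps can reference the identifier.
        actions.append(("mint_pid", plan["dataset_title"]))
    actions.append(("route", ACCESS_TIER_ROUTES[plan["sensitivity_classification"]]))
    review = retention_review_date(deposited, plan["retention_months"])
    actions.append(("schedule_review", review.isoformat()))
    return actions


plan = {
    "dataset_title": "Genomic Sequencing Cohort Alpha",
    "sensitivity_classification": "Controlled",
    "pid_assigned": False,
    "retention_months": 120,
}
print(dispatch(plan, date(2025, 3, 15)))
```

Returning the actions as data instead of executing them keeps the dispatcher easy to test and leaves the side effects to whichever worker consumes the list.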

Verification

Assert the template’s rejection paths the same way you assert application logic. The test below exercises the compliant case plus the three rules most likely to drift — the funder retention floor, the public-funder license ban, and the embargo/sensitivity conflict.

python

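import pytest
from pydantic import ValidationError

# Assumption: the model from the implementation section is saved as
# dmp_template.py next to this test file; adjust the import to your layout.
from dmp_template import DataManagementPlan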

BASE = dict(
    dataset_title="Genomic Sequencing Cohort Alpha",
    funding_agency="NIH",
    grant_type="R01-CA-2024",
    retention_months=120,
    spdx_license_id="CC-BY-4.0",
    sensitivity_classification="Controlled",
    repository_target="https://repository.institution.edu/ds/alpha",
)


def test_compliant_plan_validates() -> None:
    dmp = DataManagementPlan(**BASE)
    assert dmp.generate_compliance_report()["status"] == "VALID"


def test_retention_below_funder_floor_rejected() -> None:
    with pytest.raises(ValidationError):
        DataManagementPlan(**{**BASE, "retention_months": 24})  # < NIH 120


def test_public_funder_rejects_nc_license() -> None:
    with pytest.raises(ValidationError):
        DataManagementPlan(**{**BASE, "spdx_license_id": "CC-BY-NC-4.0"})


def test_public_dataset_forbids_embargo() -> None:
    from datetime import date
    with pytest.raises(ValidationError):
        DataManagementPlan(
            **{**BASE, "sensitivity_classification": "Public", "embargo_end_date": date(2027, 1, 1)}
        )

Run it with pytest -q dmp_template_test.py. A clean run shows that the schema, the funder floor, and the cross-field policy each reject the payloads they are meant to reject.

Gotchas

Hardcoded retention overrides. Legacy intake scripts inject a static retention_months that silently undercuts the funder floor. Root cause: the value is set outside the model, so nothing compares it with the funder's minimum. Fix: derive any default from FUNDER_RETENTION_MIN[funding_agency] and always pass the value through the model, so a figure below the floor is rejected instead of stored.
Free-text license fields. Accepting an arbitrary string breaks SPDX mapping and downstream license-compatibility checks. Root cause: no allowlist at the field boundary. Fix: validate against VALID_SPDX_LICENSES and reject anything not in the official SPDX License List.
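Allowlist drift across the two constants. The prohibited-license set and the SPDX allowlist are separate constants. Add an NC or ND identifier to the allowlist alone and public-funder plans pass with a license the funder rejects; add it to the prohibited set alone and every plan using it, including Internal plans that may use any valid SPDX identifier, fails with a misleading "Invalid SPDX license" error. Fix: change VALID_SPDX_LICENSES and PROHIBITED_PUBLIC_LICENSES in the same reviewed commit so the diff is the audit record.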
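
The first and third gotchas lend themselves to small guards that can run in CI or at import time. The sketch below is one possible shape, not part of the template itself: `default_retention` and `license_constant_drift` are hypothetical helpers, and the constants are repeated here only so the block runs on its own.

```python
from typing import Optional

FUNDER_RETENTION_MIN = {"NSF": 36, "NIH": 120, "Horizon_Europe": 60, "UKRI": 120, "Internal": 0}
VALID_SPDX_LICENSES = {
    "CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "CC-BY-NC-4.0", "CC-BY-ND-4.0",
    "MIT", "Apache-2.0", "BSD-3-Clause",
}
PROHIBITED_PUBLIC_LICENSES = {"CC-BY-NC-4.0", "CC-BY-ND-4.0"}


def default_retention(agency: str, requested: Optional[int] = None) -> int:
    """Use the funder floor as the default; refuse a request below it."""
    floor = FUNDER_RETENTION_MIN[agency]
    if requested is None:
        return floor
    if requested < floor:
        raise ValueError(f"{requested} months is below the {agency} floor of {floor}")
    return requested


def license_constant_drift(valid: set, prohibited: set) -> list:
    """List every way the two license constants disagree with each other."""
    problems = []
    for lic in sorted(prohibited - valid):
        problems.append(f"{lic} is prohibited but missing from the allowlist")
    restricted = {lic for lic in valid if {"NC", "ND"} & set(lic.split("-"))}
    for lic in sorted(restricted - prohibited):
        problems.append(f"{lic} is an NC/ND variant but is not prohibited")
    return problems


# An empty list means the constants agree.
print(license_constant_drift(VALID_SPDX_LICENSES, PROHIBITED_PUBLIC_LICENSES))
```

Running the drift check as a test means a one-sided edit to either constant fails the build before it reaches a depositor.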

Related

Data Governance Frameworks: the parent guide that consumes this template and enforces its rules at every deposit.
Aligning NIH data sharing policies with FAIR principles: the funder-specific crosswalk behind the NIH row of the decision table.
Configuring CC-BY licenses for automated dataset publishing: how the SPDX token validated here is encoded on the published record.
See the Open Science Infrastructure Planning overview for how the plan template feeds the full policy-to-deposit pipeline.

Frequently Asked Questions

Should the DMP template be JSON Schema or a Pydantic model?

Both, at different layers. Serialize the on-disk plan as JSON Schema so it is tool-agnostic and reviewable, but enforce it through the Pydantic V2 model shown above, which adds cross-field policy rules — funder retention floors, license bans, embargo conflicts — that plain JSON Schema cannot express. The model is the runtime gate; the JSON Schema is the interchange format.
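
To make the split concrete, the sketch below exports a JSON Schema from a reduced stand-in model using Pydantic V2's `model_json_schema()`. The `PlanSketch` class is hypothetical and carries only three fields; the full plan model would be exported the same way.

```python
import json
from typing import Literal

from pydantic import BaseModel, Field


class PlanSketch(BaseModel):
    """Reduced stand-in for the full plan model; three fields show the export."""

    dataset_title: str = Field(..., min_length=5, max_length=255)
    funding_agency: Literal["NSF", "NIH", "Horizon_Europe", "UKRI", "Internal"]
    retention_months: int = Field(..., ge=0)


# Per-field constraints (length bounds, enum values, minimum) land in the schema.
schema = PlanSketch.model_json_schema()
print(json.dumps(schema, indent=2))
```

The exported schema records that `retention_months` has a minimum of 0, but nothing in it says the floor depends on `funding_agency`. That rule lives only in validator code, which is why the model stays the runtime gate.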

How do I add a new funder without touching the validators?

Add one row to FUNDER_RETENTION_MIN with the agency’s retention floor, extend the funding_agency Literal, and, if the funder is public, add it to PUBLIC_FUNDERS. The retention and license validators read those constants directly, so no validator body changes — and the diff on the constants is the audit record of the new obligation.

Where does the template sit relative to PID minting?

Before it. The template validates first; only a plan that clears schema, funder, and cross-field checks reaches the routing dispatcher, which then mints a Handle or DOI when pid_assigned is False. Minting an identifier for a plan that later fails validation would leave an orphaned, non-compliant PID in the registry.

What happens when a researcher submits a non-SPDX license string?

The validate_spdx field validator rejects it immediately, and Pydantic surfaces the failure as a single ValidationError at the gateway. Catch that error, log e.json() for the audit trail, and return the message to the depositor so they can pick a valid SPDX License List identifier before resubmitting.
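
A minimal sketch of that gateway step follows, assuming a one-field stand-in model and a shortened allowlist so the block runs on its own. `LicenseOnly` and `depositor_messages` are hypothetical names; `ValidationError.errors()` and `ValidationError.json()` are the Pydantic V2 calls used to read the failure.

```python
from pydantic import BaseModel, ValidationError, field_validator

ALLOWED = {"CC0-1.0", "CC-BY-4.0"}  # shortened stand-in for the SPDX allowlist


class LicenseOnly(BaseModel):
    spdx_license_id: str

    @field_validator("spdx_license_id")
    @classmethod
    def validate_spdx(cls, v: str) -> str:
        if v not in ALLOWED:
            raise ValueError(f"Invalid SPDX license: {v}.")
        return v


def depositor_messages(payload: dict) -> list:
    """Return one 'field: message' line per failure, or an empty list if valid."""
    try:
        LicenseOnly(**payload)
    except ValidationError as exc:
        audit_record = exc.json()  # full structured detail; write this to the audit log
        return [
            f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}"
            for err in exc.errors()
        ]
    return []


print(depositor_messages({"spdx_license_id": "Creative Commons"}))
```

Separating the audit payload from the depositor-facing lines lets the log keep the full structured error while the researcher sees only the field and the reason.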
