Building a Data Management Plan Template for Researchers: FAIR Automation & Compliance Implementation

Static, manually authored Data Management Plans (DMPs) consistently fail runtime validation against evolving funder mandates and institutional retention policies. The structural deficiency stems from narrative formatting, absent machine-readable schemas, non-deterministic license mapping, and missing persistent identifier (PID) resolution pathways. Transitioning to a programmatic DMP template requires enforcing strict typing, mandatory field presence, and cross-referenced compliance rules. This guide provides a step-by-step implementation architecture, production-ready validation pipelines, and deterministic routing logic for FAIR compliance automation.

Step 1: Define the Schema Contract

The template must operate as a strict contract between researcher input and automated compliance engines. Serialize plans as JSON or YAML, describe their structure with JSON Schema, and enforce validation through a typed object model. Separate metadata capture from policy enforcement so that descriptive fields and compliance-critical fields are validated independently.

Map each schema node to a specific FAIR metric:

  • Findable: Require dataset_title, creator_orcid, collection_methodology, and pid_assignment_status.
  • Accessible: Mandate access_protocol, authentication_tier, and repository_endpoint.
  • Interoperable: Enforce controlled vocabularies for data_format, domain_ontology, and sensitivity_classification.
  • Reusable: Require explicit spdx_license_id, provenance_chain, and retention_months.

Aligning these constraints with established Open Science Infrastructure Planning ensures that the template routes data deterministically upon project closure. Avoid free-text fields for compliance-critical nodes. Use enumerated types, regex constraints, and integer bounds to eliminate ambiguity.
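
As a rough illustration of the mapping above, the sketch below groups the listed field names by FAIR principle and reports which ones a draft record is missing. It uses only the standard library and does not depend on any validation framework. The names FAIR_REQUIRED_FIELDS, ORCID_PATTERN, missing_fair_fields, and has_valid_orcid_shape are hypothetical, and the ORCID check covers only the identifier's shape, not its checksum.

```python
import re

# Field names come from the FAIR mapping above; the grouping dict itself is an
# illustrative structure, not part of any standard.
FAIR_REQUIRED_FIELDS = {
    "findable": ["dataset_title", "creator_orcid", "collection_methodology", "pid_assignment_status"],
    "accessible": ["access_protocol", "authentication_tier", "repository_endpoint"],
    "interoperable": ["data_format", "domain_ontology", "sensitivity_classification"],
    "reusable": ["spdx_license_id", "provenance_chain", "retention_months"],
}

# Shape-only ORCID check: four groups of four characters, last one may be X.
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def missing_fair_fields(record: dict) -> dict:
    """Return {principle: [field names]} for fields that are absent or empty."""
    gaps = {}
    for principle, fields in FAIR_REQUIRED_FIELDS.items():
        missing = [name for name in fields if record.get(name) in (None, "")]
        if missing:
            gaps[principle] = missing
    return gaps


def has_valid_orcid_shape(value: str) -> bool:
    """True when the value is formatted like an ORCID iD (no checksum test)."""
    return bool(ORCID_PATTERN.match(value))


draft = {
    "dataset_title": "Genomic Sequencing Cohort Alpha",
    "creator_orcid": "0000-0002-1825-0097",
    "spdx_license_id": "CC-BY-4.0",
    "retention_months": 120,
}
gaps = missing_fair_fields(draft)
print(gaps)
```

Reporting gaps per principle, rather than as one flat list, keeps the feedback to researchers aligned with the FAIR metric each field supports.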

Step 2: Implement Conditional Funder Mandate Logic

Funder mandate misalignment is the primary cause of validation failure. NIH, NSF, Horizon Europe, and UKRI enforce divergent retention windows, metadata standards, and embargo allowances. Hardcoding retention values or accepting unstructured license strings will break automated compliance checks.

Implement a conditional routing layer that evaluates funding_agency and grant_type at initialization. The logic must:

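  1. Validate retention periods against agency-specific minimums. The examples in this guide assume 36 months post-project for NSF and 120 months for NIH; treat these figures as placeholders and confirm them against each funder's current published policy, since requirements vary by program and institute.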
  2. Restrict license selection to SPDX identifiers. Publicly funded datasets must reject non-commercial and no-derivatives variants (for example CC-BY-NC-4.0 and CC-BY-ND-4.0) as well as proprietary licenses, unless an explicit exemption is documented.
  3. Map sensitivity classifications to institutional access tiers. Classifications like Public, Controlled, Restricted, or Confidential must trigger corresponding repository routing rules.

Reference the Data Governance Frameworks specification when defining embargo expiration logic and access tier inheritance. Conditional rules must be evaluated before schema serialization to prevent downstream repository rejection.
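
One way to picture the conditional layer is a function that runs before serialization and collects every violation instead of stopping at the first. The sketch below is a minimal, stdlib-only example under assumptions: the retention figures repeat the placeholder values used elsewhere in this guide, and MandateInput, license_exemption_ref, and evaluate_mandates are hypothetical names. The exemption reference "EX-001" is invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Placeholder values mirroring this guide's examples; confirm against current
# funder policy before relying on them.
RETENTION_MIN_MONTHS = {"NSF": 36, "NIH": 120, "Horizon_Europe": 60, "UKRI": 120, "Internal": 0}
PUBLIC_FUNDERS = {"NSF", "NIH", "Horizon_Europe", "UKRI"}
RESTRICTED_LICENSES = {"CC-BY-NC-4.0", "CC-BY-ND-4.0"}


@dataclass
class MandateInput:
    funding_agency: str
    retention_months: int
    spdx_license_id: str
    # Hypothetical field: reference to a documented exemption, if one exists.
    license_exemption_ref: Optional[str] = None


def evaluate_mandates(plan: MandateInput) -> List[str]:
    """Collect all funder-rule violations for a plan (empty list means compliant)."""
    violations = []
    minimum = RETENTION_MIN_MONTHS.get(plan.funding_agency, 0)
    if plan.retention_months < minimum:
        violations.append(
            f"retention {plan.retention_months} months is below the "
            f"{plan.funding_agency} minimum of {minimum}"
        )
    restricted = plan.spdx_license_id in RESTRICTED_LICENSES
    if plan.funding_agency in PUBLIC_FUNDERS and restricted and not plan.license_exemption_ref:
        violations.append(
            f"{plan.spdx_license_id} requires a documented exemption for {plan.funding_agency}"
        )
    return violations


blocked = evaluate_mandates(MandateInput("NSF", 24, "CC-BY-NC-4.0"))
exempt = evaluate_mandates(MandateInput("NSF", 36, "CC-BY-NC-4.0", license_exemption_ref="EX-001"))
print(blocked)
print(exempt)
```

Returning a list lets the submission form show researchers every problem in one pass, which tends to shorten revision cycles compared with fail-fast validation.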

Step 3: Deploy the Validation Pipeline

The following implementation uses Pydantic v2 for strict type enforcement, field-level validation, and conditional mandate checking. The pipeline isolates validation from UI generation, enabling direct integration with CI/CD workflows and repository submission APIs.

```python
from pydantic import (
    BaseModel, Field, ValidationError, ValidationInfo,
    field_validator, model_validator,
)
from typing import Literal, Optional
from datetime import date

# SPDX License Registry (truncated for brevity; load from https://spdx.org/licenses/ in production).
# NC/ND variants are valid SPDX identifiers, so they pass schema validation here but are
# blocked for public funders by the model-level policy check below.
VALID_SPDX_LICENSES = {
    "CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "CC-BY-NC-4.0", "CC-BY-ND-4.0",
    "MIT", "Apache-2.0", "BSD-3-Clause",
}

# Funder Mandate Configuration (minimum retention in months)
FUNDER_RETENTION_MIN = {
    "NSF": 36,
    "NIH": 120,
    "Horizon_Europe": 60,
    "UKRI": 120,
    "Internal": 0
}

class DMPComplianceError(ValueError):
    """Raised inside validators; Pydantic wraps it in a ValidationError."""

class DataManagementPlan(BaseModel):
    dataset_title: str = Field(..., min_length=5, max_length=255)
    # funding_agency must be declared before retention_months so that it is
    # already validated when validate_retention reads info.data.
    funding_agency: Literal["NSF", "NIH", "Horizon_Europe", "UKRI", "Internal"]
    grant_type: str = Field(..., pattern=r"^[A-Z0-9\-]+$")
    retention_months: int = Field(..., ge=0)
    spdx_license_id: str
    sensitivity_classification: Literal["Public", "Controlled", "Restricted", "Confidential"]
    repository_target: str = Field(..., pattern=r"^https?://")
    pid_assigned: bool = False
    embargo_end_date: Optional[date] = None

    @field_validator("spdx_license_id")
    @classmethod
    def validate_spdx(cls, v: str) -> str:
        if v not in VALID_SPDX_LICENSES:
            raise DMPComplianceError(
                f"Invalid SPDX license: {v}. Must be one of {sorted(VALID_SPDX_LICENSES)}."
            )
        return v

    @field_validator("retention_months")
    @classmethod
    def validate_retention(cls, v: int, info: ValidationInfo) -> int:
        # info.data holds only the fields validated so far. If funding_agency
        # itself failed validation it is absent, and this check is skipped.
        agency = info.data.get("funding_agency")
        if agency and agency in FUNDER_RETENTION_MIN:
            min_retention = FUNDER_RETENTION_MIN[agency]
            if v < min_retention:
                raise DMPComplianceError(
                    f"Retention period {v} months violates {agency} minimum of {min_retention} months."
                )
        return v

    @model_validator(mode="after")
    def enforce_public_funding_license_rules(self) -> "DataManagementPlan":
        # Cross-field policy checks run after all field validators succeed.
        if self.funding_agency in ("NSF", "NIH", "Horizon_Europe", "UKRI"):
            if self.spdx_license_id in ("CC-BY-NC-4.0", "CC-BY-ND-4.0"):
                raise DMPComplianceError(
                    f"Public funding agency {self.funding_agency} prohibits {self.spdx_license_id}. "
                    "Use CC-BY-4.0 or CC0-1.0."
                )
        if self.sensitivity_classification == "Public" and self.embargo_end_date:
            raise DMPComplianceError("Public datasets cannot have an embargo period.")
        return self

    def generate_compliance_report(self) -> dict:
        # Only reachable for an instance that passed every validator above,
        # so the compliance flags are always True here.
        return {
            "status": "VALID",
            "funder": self.funding_agency,
            "retention_compliant": True,
            "license_compliant": True,
            "pid_required": not self.pid_assigned,
            "routing_target": self.repository_target
        }

# Production Usage Example
try:
    dmp = DataManagementPlan(
        dataset_title="Genomic Sequencing Cohort Alpha",
        funding_agency="NIH",
        grant_type="R01-CA-2024",
        retention_months=120,
        spdx_license_id="CC-BY-4.0",
        sensitivity_classification="Controlled",
        repository_target="https://repository.institution.edu/ds/alpha",
        pid_assigned=False,
        embargo_end_date=None
    )
    print(dmp.generate_compliance_report())
except ValidationError as e:
    # Pydantic v2 wraps every validator failure (including the custom
    # DMPComplianceError raised above) in a single ValidationError.
    print(f"Validation Failed: {e}")
```

The pipeline enforces SPDX compliance via a strict allowlist, validates retention against funder-specific minimums, and blocks non-compliant license combinations at the model level. Integrate this validation layer directly into your submission gateway. Pydantic v2 wraps any ValueError raised inside a validator, including the custom DMPComplianceError, in a single ValidationError, so catch ValidationError at the gateway and reject the payload before it reaches the repository API. The original DMPComplianceError message is preserved in the entries returned by ValidationError.errors(), which makes it available for audit logging.

Step 4: Integrate with Repository Routing & PID Assignment

Once validation passes, the DMP must trigger automated routing. Do not rely on manual curator intervention for dataset deposition. Implement a routing dispatcher that reads repository_target, sensitivity_classification, and pid_assigned.

Figure: Dataset publication lifecycle from draft to published. A Draft becomes Validated when it passes schema and funder checks, or Rejected on a compliance failure; a Rejected plan returns to Draft once the researcher revises it. A Validated dataset is Published immediately for open access, or Embargoed when an embargo date is set and then Published when the embargo expires.

  1. PID Resolution: If pid_assigned is False, invoke the institutional Handle/DOI registration API. Pass validated metadata (dataset_title, creator_orcid, spdx_license_id) to the PID minting service. Store the returned PID in the DMP artifact.
  2. Access Tier Mapping: Route Public datasets to open-access endpoints. Route Controlled or Restricted datasets to authenticated gateways with role-based access control (RBAC). Enforce Confidential routing to secure institutional storage with explicit data use agreements (DUAs).
  3. Retention Enforcement: Schedule automated lifecycle hooks. Use the retention_months field to set expiration or review triggers. Integrate with institutional artifact retention policies to prevent premature deletion or indefinite storage of non-compliant datasets.
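
The three routing steps above can be sketched as a single dispatcher. This is a minimal illustration under assumptions, not an institutional implementation: the tier labels in TIER_BY_SENSITIVITY, the RoutingDecision shape, the add_months helper, and the mint_pid callable (a stand-in for whatever Handle/DOI service is in use) are all hypothetical, and the plan is passed as a plain dict of already validated values.

```python
import calendar
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional

# Hypothetical tier labels; substitute your institution's endpoint names.
TIER_BY_SENSITIVITY = {
    "Public": "open_access",
    "Controlled": "authenticated_rbac",
    "Restricted": "authenticated_rbac",
    "Confidential": "secure_storage_dua",
}


@dataclass
class RoutingDecision:
    repository_target: str
    access_tier: str
    pid: Optional[str]
    review_due: date


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the target month's length."""
    year, month_index = divmod(start.year * 12 + (start.month - 1) + months, 12)
    month = month_index + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def dispatch(plan: dict, mint_pid: Callable[[dict], str], closed_on: date) -> RoutingDecision:
    """Route a validated plan: mint a PID if needed, pick a tier, schedule review."""
    pid = plan.get("pid") if plan["pid_assigned"] else mint_pid(plan)
    return RoutingDecision(
        repository_target=plan["repository_target"],
        access_tier=TIER_BY_SENSITIVITY[plan["sensitivity_classification"]],
        pid=pid,
        review_due=add_months(closed_on, plan["retention_months"]),
    )


decision = dispatch(
    {
        "repository_target": "https://repository.institution.edu/ds/alpha",
        "sensitivity_classification": "Controlled",
        "pid_assigned": False,
        "retention_months": 120,
    },
    mint_pid=lambda plan: "placeholder-pid",  # stub; a real service returns a Handle or DOI
    closed_on=date(2024, 12, 15),
)
print(decision)
```

Injecting the minting function keeps the dispatcher testable without network access; the review date is computed from the project closure date, which is an assumption about when the retention clock starts.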

Align routing logic with your broader Open Science Infrastructure Planning roadmap to ensure consistent metadata propagation across discovery indexes and institutional catalogs.

Step 5: Troubleshooting & Maintenance

Validation failures typically originate from three structural misconfigurations:

  • Hardcoded Retention Overrides: Researchers or legacy scripts may inject static retention values that bypass funder minimums. Resolve by locking the retention_months field to a computed minimum derived from funding_agency during initialization.
  • Free-Text License Fields: Accepting arbitrary strings breaks SPDX mapping and downstream license compatibility checks. Enforce strict enumeration and reject inputs that do not match the official SPDX registry.
  • Embargo vs. Sensitivity Conflicts: Publicly classified datasets with active embargo dates fail compliance routing. Implement the model_validator check shown in Step 3 to block contradictory states before submission.
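
For the first misconfiguration, one possible approach is to compute the effective retention value at initialization rather than trusting the submitted number. The sketch below assumes the same placeholder minimums used in Step 3; resolve_retention_months is a hypothetical helper name.

```python
from typing import Optional

# Placeholder minimums (months) matching the Step 3 configuration.
FUNDER_RETENTION_MIN = {"NSF": 36, "NIH": 120, "Horizon_Europe": 60, "UKRI": 120, "Internal": 0}


def resolve_retention_months(funding_agency: str, requested: Optional[int] = None) -> int:
    """Return the requested retention, raised to the funder minimum when it falls short."""
    if funding_agency not in FUNDER_RETENTION_MIN:
        raise ValueError(f"No retention rule configured for {funding_agency!r}")
    minimum = FUNDER_RETENTION_MIN[funding_agency]
    if requested is None:
        return minimum
    return max(requested, minimum)


print(resolve_retention_months("NSF", 12))
print(resolve_retention_months("NIH", 180))
```

Silently raising a value is a design choice; a gateway that must surface the discrepancy to the researcher may prefer to reject the payload instead.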

Maintain the validation pipeline by versioning the schema alongside funder policy updates. Automate SPDX registry synchronization and funder mandate refreshes via scheduled CI jobs. Test edge cases using synthetic payloads that trigger boundary conditions (e.g., retention_months exactly at minimum, expired embargo dates, mixed funding sources).
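
As an example of synthetic boundary payloads, the following sketch generates retention values just below, at, and just above a funder's minimum, paired with the expected validation outcome. It is a stdlib-only illustration: retention_boundary_payloads is a hypothetical helper, and the minimums are the placeholder values from Step 3. Each generated payload could then be fed to the validation model in a test suite.

```python
# Placeholder minimums (months) matching the Step 3 configuration.
FUNDER_RETENTION_MIN = {"NSF": 36, "NIH": 120, "Horizon_Europe": 60, "UKRI": 120, "Internal": 0}


def retention_boundary_payloads(base: dict) -> list:
    """Return (payload, expected_valid) pairs around the funder's retention minimum."""
    minimum = FUNDER_RETENTION_MIN[base["funding_agency"]]
    cases = []
    for months in (minimum - 1, minimum, minimum + 1):
        if months < 0:
            # Negative retention is a schema-bound failure, not a funder-rule case.
            continue
        payload = {**base, "retention_months": months}
        cases.append((payload, months >= minimum))
    return cases


nsf_cases = retention_boundary_payloads({"funding_agency": "NSF", "spdx_license_id": "CC-BY-4.0"})
for payload, expected_valid in nsf_cases:
    print(payload["retention_months"], expected_valid)
```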

Production deployment requires logging all validation attempts, capturing the DMPComplianceError messages surfaced through each ValidationError, and exposing a health endpoint that reports schema version, active funder rules, and license registry sync status. Monitor routing success rates and PID minting latency to confirm that the template scales across institutional research output.
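
A health payload along these lines might look like the sketch below. The field names, the 24-hour staleness threshold, and build_health_report itself are assumptions made for illustration; wiring the result into an HTTP endpoint depends on the web framework in use and is left out.

```python
import json
from datetime import datetime, timedelta, timezone


def build_health_report(
    schema_version: str,
    funder_rules: dict,
    last_registry_sync: datetime,
    now: datetime,
    max_sync_age: timedelta = timedelta(hours=24),
) -> dict:
    """Summarize pipeline state for a health endpoint (field names are illustrative)."""
    return {
        "schema_version": schema_version,
        "active_funder_rules": sorted(funder_rules),
        "license_registry_last_sync": last_registry_sync.isoformat(),
        # Flag the registry as stale when the last sync is older than the threshold.
        "license_registry_stale": (now - last_registry_sync) > max_sync_age,
    }


report = build_health_report(
    schema_version="1.0.0",  # placeholder version string
    funder_rules={"NSF": 36, "NIH": 120, "Horizon_Europe": 60, "UKRI": 120, "Internal": 0},
    last_registry_sync=datetime(2024, 1, 1, tzinfo=timezone.utc),
    now=datetime(2024, 1, 3, tzinfo=timezone.utc),
)
print(json.dumps(report, indent=2))
```

Passing the current time in as a parameter, rather than reading the clock inside the function, keeps the staleness check deterministic under test.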