Building a Data Management Plan Template for Researchers: FAIR Automation & Compliance Implementation
Static, manually authored Data Management Plans (DMPs) consistently fail runtime validation against evolving funder mandates and institutional retention policies. The structural deficiency stems from narrative formatting, absent machine-readable schemas, non-deterministic license mapping, and missing persistent identifier (PID) resolution pathways. Transitioning to a programmatic DMP template requires enforcing strict typing, mandatory field presence, and cross-referenced compliance rules. This guide provides a step-by-step implementation architecture, production-ready validation pipelines, and deterministic routing logic for FAIR compliance automation.
Step 1: Define the Schema Contract
The template must operate as a strict contract between researcher input and automated compliance engines. Use JSON or YAML as the serialization format, describe the structure with JSON Schema, and enforce validation through a typed object model. Separate metadata capture from policy enforcement so that descriptive fields and compliance-critical fields are validated independently.
Map each schema node to a specific FAIR metric:
- Findable: Require dataset_title, creator_orcid, collection_methodology, and pid_assignment_status.
- Accessible: Mandate access_protocol, authentication_tier, and repository_endpoint.
- Interoperable: Enforce controlled vocabularies for data_format, domain_ontology, and sensitivity_classification.
- Reusable: Require explicit spdx_license_id, provenance_chain, and retention_months.
Aligning these constraints with established Open Science Infrastructure Planning ensures that the template routes data deterministically upon project closure. Avoid free-text fields for compliance-critical nodes. Use enumerated types, regex constraints, and integer bounds to eliminate ambiguity.
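As one illustration of replacing free text with a regex constraint, the sketch below checks a creator_orcid value. It assumes the bare 16-character ORCID iD layout and the ISO 7064 MOD 11-2 check character that ORCID documents; the helper names orcid_check_digit and is_valid_orcid are hypothetical and not part of the template itself.

```python
import re

# Bare ORCID iD: four hyphen-separated groups; the final character may be X.
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(value: str) -> bool:
    """Return True when value has the ORCID layout and a matching check character."""
    if not ORCID_PATTERN.fullmatch(value):
        return False
    digits = value.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]
```

A check like this can back a field-level validator, so a mistyped identifier is rejected at input time rather than at PID registration.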
Step 2: Implement Conditional Funder Mandate Logic
Funder mandate misalignment is the primary cause of validation failure. NIH, NSF, Horizon Europe, and UKRI enforce divergent retention windows, metadata standards, and embargo allowances. Hardcoding retention values or accepting unstructured license strings will break automated compliance checks.
Implement a conditional routing layer that evaluates funding_agency and grant_type at initialization. The logic must:
- Validate retention periods against agency-specific minimums (e.g., the Step 3 configuration assumes 36 months for NSF and 120 months for NIH; confirm current values against each funder's policy, since requirements can vary by institute and program).
- Restrict license selection to SPDX-compliant identifiers. Publicly funded datasets must reject CC-BY-NC-4.0 or proprietary licenses unless an explicit exemption is documented.
- Map sensitivity classifications to institutional access tiers. Classifications like Public, Controlled, Restricted, or Confidential must trigger corresponding repository routing rules.
Reference the Data Governance Frameworks specification when defining embargo expiration logic and access tier inheritance. Conditional rules must be evaluated before schema serialization to prevent downstream repository rejection.
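The ordering described above (evaluate conditional rules first, serialize second) can be sketched with the standard library alone. The retention minimums mirror the illustrative Step 3 configuration and are not authoritative policy values; evaluate_mandates, serialize_if_compliant, and the license_exemption_documented field are hypothetical names introduced only for this example.

```python
import json

# Illustrative minimums in months, mirroring the Step 3 configuration.
FUNDER_RETENTION_MIN = {"NSF": 36, "NIH": 120, "Horizon_Europe": 60, "UKRI": 120, "Internal": 0}
PUBLIC_FUNDERS = {"NSF", "NIH", "Horizon_Europe", "UKRI"}
RESTRICTIVE_LICENSES = {"CC-BY-NC-4.0", "CC-BY-ND-4.0"}


def evaluate_mandates(record: dict) -> list:
    """Collect every mandate violation for a draft DMP record."""
    agency = record.get("funding_agency")
    if agency not in FUNDER_RETENTION_MIN:
        return [f"Unknown funding agency: {agency!r}"]
    violations = []
    minimum = FUNDER_RETENTION_MIN[agency]
    if record.get("retention_months", 0) < minimum:
        violations.append(f"retention_months below {agency} minimum of {minimum}")
    license_id = record.get("spdx_license_id")
    # Assumption: a documented exemption is recorded as a boolean flag on the record.
    exempt = record.get("license_exemption_documented", False)
    if agency in PUBLIC_FUNDERS and license_id in RESTRICTIVE_LICENSES and not exempt:
        violations.append(f"{license_id} not allowed for {agency} without an exemption")
    return violations


def serialize_if_compliant(record: dict) -> str:
    """Serialize to JSON only after the conditional rules pass."""
    violations = evaluate_mandates(record)
    if violations:
        raise ValueError("; ".join(violations))
    return json.dumps(record, sort_keys=True)
```

Collecting all violations before raising lets a researcher fix every problem in one pass, which may matter more in a form-driven workflow than failing on the first error.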
Step 3: Deploy the Validation Pipeline
The following implementation uses pydantic v2 for strict type enforcement, field-level validation, and conditional mandate checking. The pipeline isolates validation from UI generation, enabling direct integration with CI/CD workflows and repository submission APIs.
from datetime import date
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError, field_validator, model_validator

# SPDX License Registry (truncated for brevity; load from https://spdx.org/licenses/ in production).
# NC/ND variants are valid SPDX identifiers, so they pass schema validation here but are
# blocked for public funders by the model-level policy check below.
VALID_SPDX_LICENSES = {
    "CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "CC-BY-NC-4.0", "CC-BY-ND-4.0",
    "MIT", "Apache-2.0", "BSD-3-Clause",
}

# Funder Mandate Configuration (minimum retention in months)
FUNDER_RETENTION_MIN = {
    "NSF": 36,
    "NIH": 120,
    "Horizon_Europe": 60,
    "UKRI": 120,
    "Internal": 0,
}


class DMPComplianceError(ValueError):
    """Raised by validators; subclasses ValueError so Pydantic wraps it."""


class DataManagementPlan(BaseModel):
    dataset_title: str = Field(..., min_length=5, max_length=255)
    funding_agency: Literal["NSF", "NIH", "Horizon_Europe", "UKRI", "Internal"]
    grant_type: str = Field(..., pattern=r"^[A-Z0-9\-]+$")
    retention_months: int = Field(..., ge=0)
    spdx_license_id: str
    sensitivity_classification: Literal["Public", "Controlled", "Restricted", "Confidential"]
    repository_target: str = Field(..., pattern=r"^https?://")
    pid_assigned: bool = False
    embargo_end_date: Optional[date] = None

    @field_validator("spdx_license_id")
    @classmethod
    def validate_spdx(cls, v: str) -> str:
        if v not in VALID_SPDX_LICENSES:
            raise DMPComplianceError(
                f"Invalid SPDX license: {v}. Must be one of {sorted(VALID_SPDX_LICENSES)}."
            )
        return v

    @field_validator("retention_months")
    @classmethod
    def validate_retention(cls, v: int, info) -> int:
        # funding_agency is declared before retention_months, so it has already been
        # validated and is available in info.data (it is absent if its own validation failed).
        agency = info.data.get("funding_agency")
        if agency and agency in FUNDER_RETENTION_MIN:
            min_retention = FUNDER_RETENTION_MIN[agency]
            if v < min_retention:
                raise DMPComplianceError(
                    f"Retention period {v} months violates {agency} minimum of {min_retention} months."
                )
        return v

    @model_validator(mode="after")
    def enforce_public_funding_license_rules(self) -> "DataManagementPlan":
        if self.funding_agency in ("NSF", "NIH", "Horizon_Europe", "UKRI"):
            if self.spdx_license_id in ("CC-BY-NC-4.0", "CC-BY-ND-4.0"):
                raise DMPComplianceError(
                    f"Public funding agency {self.funding_agency} prohibits {self.spdx_license_id}. "
                    "Use CC-BY-4.0 or CC0-1.0."
                )
        if self.sensitivity_classification == "Public" and self.embargo_end_date:
            raise DMPComplianceError("Public datasets cannot have an embargo period.")
        return self

    def generate_compliance_report(self) -> dict:
        # Only reachable on a validated instance, so the compliance flags are always True here.
        return {
            "status": "VALID",
            "funder": self.funding_agency,
            "retention_compliant": True,
            "license_compliant": True,
            "pid_required": not self.pid_assigned,
            "routing_target": self.repository_target,
        }


# Production Usage Example
try:
    dmp = DataManagementPlan(
        dataset_title="Genomic Sequencing Cohort Alpha",
        funding_agency="NIH",
        grant_type="R01-CA-2024",
        retention_months=120,
        spdx_license_id="CC-BY-4.0",
        sensitivity_classification="Controlled",
        repository_target="https://repository.institution.edu/ds/alpha",
        pid_assigned=False,
        embargo_end_date=None,
    )
    print(dmp.generate_compliance_report())
except ValidationError as e:
    # Pydantic v2 wraps every validator failure (including the custom
    # DMPComplianceError raised above) in a single ValidationError.
    print(f"Validation Failed: {e}")
The pipeline enforces SPDX compliance via a strict allowlist, validates retention against funder-specific minimums, and blocks non-compliant license combinations at the model level. Integrate this validation layer directly into your submission gateway. Pydantic v2 wraps a ValueError or AssertionError raised inside a validator in a single ValidationError, and DMPComplianceError subclasses ValueError, so catch ValidationError at the gateway and reject the payload before it reaches the repository API. The original DMPComplianceError message is preserved in the wrapped error (for example, in the msg entries returned by e.errors()) and can be written to the audit log.
Step 4: Integrate with Repository Routing & PID Assignment
Once validation passes, the DMP must trigger automated routing. Do not rely on manual curator intervention for dataset deposition. Implement a routing dispatcher that reads repository_target, sensitivity_classification, and pid_assigned.
- PID Resolution: If pid_assigned is False, invoke the institutional Handle/DOI registration API. Pass validated metadata (dataset_title, creator_orcid, spdx_license_id) to the PID minting service. Store the returned PID in the DMP artifact.
- Access Tier Mapping: Route Public datasets to open-access endpoints. Route Controlled or Restricted datasets to authenticated gateways with role-based access control (RBAC). Enforce Confidential routing to secure institutional storage with explicit data use agreements (DUAs).
- Retention Enforcement: Schedule automated lifecycle hooks. Use the retention_months field to set expiration or review triggers. Integrate with institutional artifact retention policies to prevent premature deletion or indefinite storage of non-compliant datasets.
Align routing logic with your broader Open Science Infrastructure Planning roadmap to ensure consistent metadata propagation across discovery indexes and institutional catalogs.
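A minimal dispatcher sketch for the routing rules in this step follows. The tier names in ACCESS_TIER_BY_CLASSIFICATION, the RoutingDecision type, and the dispatch function are hypothetical. The PID minting service is represented by an injected callable because the actual registration API depends on the institution.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical tier names; substitute the endpoints your institution exposes.
ACCESS_TIER_BY_CLASSIFICATION = {
    "Public": "open_access",
    "Controlled": "authenticated_rbac",
    "Restricted": "authenticated_rbac",
    "Confidential": "secure_storage_dua",
}


@dataclass
class RoutingDecision:
    repository_target: str
    access_tier: str
    pid: Optional[str]


def dispatch(
    repository_target: str,
    sensitivity_classification: str,
    pid_assigned: bool,
    existing_pid: Optional[str],
    mint_pid: Callable[[], str],
) -> RoutingDecision:
    """Pick an access tier and mint a PID only when none is assigned yet."""
    try:
        tier = ACCESS_TIER_BY_CLASSIFICATION[sensitivity_classification]
    except KeyError:
        raise ValueError(f"Unknown classification: {sensitivity_classification}") from None
    # mint_pid stands in for the institutional Handle/DOI registration call.
    pid = existing_pid if pid_assigned else mint_pid()
    return RoutingDecision(repository_target, tier, pid)
```

Passing the minting function in as a parameter keeps the dispatcher testable with a stub and avoids registering real identifiers during CI runs.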
Step 5: Troubleshooting & Maintenance
Validation failures typically originate from three structural misconfigurations:
- Hardcoded Retention Overrides: Researchers or legacy scripts may inject static retention values that bypass funder minimums. Resolve by locking the retention_months field to a computed minimum derived from funding_agency during initialization.
- Free-Text License Fields: Accepting arbitrary strings breaks SPDX mapping and downstream license compatibility checks. Enforce strict enumeration and reject inputs that do not match the official SPDX registry.
- Embargo vs. Sensitivity Conflicts: Publicly classified datasets with active embargo dates fail compliance routing. Implement the model_validator check shown in Step 3 to block contradictory states before submission.
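For the first item in the list above, one possible reading of "locking" is to treat the funder minimum as a floor that a requested value can raise but never lower. The sketch below assumes that interpretation. locked_retention_months is a hypothetical helper, and the minimums are the illustrative values from the Step 3 configuration.

```python
from typing import Optional

# Illustrative minimums in months, copied from the Step 3 configuration.
FUNDER_RETENTION_MIN = {"NSF": 36, "NIH": 120, "Horizon_Europe": 60, "UKRI": 120, "Internal": 0}


def locked_retention_months(funding_agency: str, requested: Optional[int] = None) -> int:
    """Return the retention period, never lower than the funder floor."""
    floor = FUNDER_RETENTION_MIN[funding_agency]  # raises KeyError for unknown agencies
    if requested is None:
        return floor
    return max(requested, floor)
```

Silently raising a too-low value hides the original input from the researcher, so a stricter deployment may prefer to reject it outright, as the Step 3 validator does.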
Maintain the validation pipeline by versioning the schema alongside funder policy updates. Automate SPDX registry synchronization and funder mandate refreshes via scheduled CI jobs. Test edge cases using synthetic payloads that trigger boundary conditions (e.g., retention_months exactly at minimum, expired embargo dates, mixed funding sources).
Production deployment requires logging all validation attempts, capturing DMPComplianceError traces, and exposing a health endpoint that reports schema version, active funder rules, and license registry sync status. Monitor routing success rates and PID minting latency to ensure the template scales across institutional research output.
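The health endpoint payload might look like the sketch below, which is framework-agnostic and returns a plain dictionary. SCHEMA_VERSION, the status labels, the 24-hour staleness threshold, and the health_report function are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Optional

SCHEMA_VERSION = "1.0.0"  # hypothetical version string


def health_report(
    funder_rules: dict,
    last_spdx_sync: Optional[datetime],
    max_age_hours: int = 24,
    now: Optional[datetime] = None,
) -> dict:
    """Summarize schema version, active funder rules, and license registry freshness."""
    now = now or datetime.now(timezone.utc)
    if last_spdx_sync is None:
        sync_status = "never_synced"
    elif (now - last_spdx_sync).total_seconds() > max_age_hours * 3600:
        sync_status = "stale"
    else:
        sync_status = "fresh"
    return {
        "schema_version": SCHEMA_VERSION,
        "active_funder_rules": sorted(funder_rules),
        "license_registry_sync": sync_status,
    }
```

Accepting now as a parameter makes the staleness check deterministic in tests. A web framework handler would call health_report and serialize the result.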