Automating Data Governance Frameworks for FAIR Research Infrastructure

Research data governance frameworks are no longer purely administrative artifacts; they are executable specifications that dictate how datasets are ingested, described, preserved, and shared. Within the broader scope of Open Science Infrastructure Planning, governance translates directly into machine-readable policies, automated validation pipelines, and standardized metadata schemas. For research data managers, academic IT teams, and Python automation engineers, the operational challenge lies in bridging high-level compliance requirements with low-level API integrations and workflow orchestration. Treating governance as code enables reproducible, auditable, and scalable FAIR compliance across institutional research ecosystems.

A robust governance framework begins with precise schema mapping. FAIR compliance requires datasets to carry rich, structured metadata that aligns with community standards such as the DataCite Metadata Schema and with domain-specific ontologies. Rather than relying on manual curation, automation engineers must implement programmatic validation layers that enforce mandatory fields, controlled vocabularies, and persistent identifier resolution. The following implementation guide details how to operationalize these requirements step by step.

Figure: Governance-as-code pipeline from schema validation to audited ingestion. Flow: Raw Metadata Payload -> Schema Validation (Pydantic) -> PID Resolution (DataCite API) -> Policy Enforcement (License / Embargo) -> Compliant? If yes: Repository Ingestion; if no: Reject / Quarantine. Both branches end at the Audit Checkpoint Log.
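
The flow in the figure can be sketched as an ordered list of stage checks followed by a gate. This is a minimal illustration under stated assumptions, not the pipeline's actual orchestration: run_pipeline, the Stage alias, and the two stand-in checks are hypothetical names introduced here, and real stages would call the validators described in the steps that follow.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A stage is a (name, check) pair. The check returns None on success or a
# short reason string on failure. Both names are hypothetical.
Stage = Tuple[str, Callable[[dict], Optional[str]]]


def run_pipeline(record: dict, stages: List[Stage]) -> Dict[str, object]:
    """Run stages in order, stop at the first failure, and report the outcome."""
    audit_log: List[str] = []
    for name, check in stages:
        reason = check(record)
        if reason is not None:
            # Gate: a failed stage routes the record to quarantine
            audit_log.append(f"{name}: FAILED ({reason})")
            return {"decision": "quarantine", "audit_log": audit_log}
        audit_log.append(f"{name}: PASSED")
    return {"decision": "ingest", "audit_log": audit_log}


# Stand-in checks for illustration only
def has_title(record: dict) -> Optional[str]:
    return None if record.get("title") else "missing title"


def has_allowed_license(record: dict) -> Optional[str]:
    allowed = {"CC-BY-4.0", "CC0-1.0", "MIT", "Apache-2.0"}
    return None if record.get("license") in allowed else "license not in allowlist"


STAGES: List[Stage] = [("schema", has_title), ("policy", has_allowed_license)]
```

Both outcomes carry the audit log, which mirrors the figure: accepted and rejected records alike end at the audit checkpoint.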

Step 1: Define Strict Metadata Schemas with Programmatic Validation

Governance-as-code requires deterministic validation boundaries. Using Pydantic v2, we define a strict schema that enforces field constraints, checks ORCID formats, and maps internal records to Schema.org/Dataset JSON-LD. Validating at this boundary supports the Interoperable and Reusable FAIR principles before data enters any downstream pipeline.

python
import logging
from datetime import datetime, timezone
from typing import Optional, List
from pydantic import BaseModel, Field, field_validator, ValidationError

# Structured logging configuration for audit trails
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s | %(message)s",
    handlers=[logging.StreamHandler()]
)
logger = logging.getLogger("governance.schema_validator")

# Canonical license URLs keyed by SPDX identifier (string templating cannot
# derive these reliably, since SPDX IDs do not map 1:1 to URL paths).
LICENSE_URLS = {
    "CC-BY-4.0": "https://creativecommons.org/licenses/by/4.0/",
    "CC0-1.0": "https://creativecommons.org/publicdomain/zero/1.0/",
    "MIT": "https://opensource.org/licenses/MIT",
    "Apache-2.0": "https://www.apache.org/licenses/LICENSE-2.0",
}

class DatasetGovernanceSchema(BaseModel):
    """Strict schema mapping for FAIR-compliant dataset metadata."""
    dataset_id: str = Field(..., description="Internal project identifier (UUID v4)")
    title: str = Field(..., min_length=5, max_length=300)
    creators: List[str] = Field(..., min_length=1, description="ORCID iDs (bare or https://orcid.org/ URL form)")
    license: str = Field(..., pattern=r"^(CC-BY-4\.0|CC0-1\.0|MIT|Apache-2\.0)$")
    embargo_date: Optional[datetime] = None
    data_types: List[str] = Field(..., min_length=1)
    funding_grant: Optional[str] = None

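    @field_validator("creators")
    @classmethod
    def validate_orcid_format(cls, v: List[str]) -> List[str]:
        prefix = "https://orcid.org/"
        for id_ in v:
            # Accept a bare ORCID iD or its URL form. Check the NNNN-NNNN-NNNN-NNNX
            # shape only; the checksum is not verified here. A fixed "0000-000"
            # prefix test would reject newer iDs that begin with 0009.
            bare = id_[len(prefix):] if id_.startswith(prefix) else id_
            groups = bare.split("-")
            digits = "".join(groups)
            shape_ok = (
                [len(g) for g in groups] == [4, 4, 4, 4]
                and digits.isascii()
                and digits[:15].isdigit()
                and (digits[15].isdigit() or digits[15] == "X")
            )
            if not shape_ok:
                raise ValueError(f"Invalid creator identifier format: {id_}")
        return v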

    @field_validator("data_types")
    @classmethod
    def validate_data_types(cls, v: List[str]) -> List[str]:
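        # Inputs are upper-cased before the membership test, so the allowlist
        # must be stored upper-cased too (mixed-case entries would never match).
        allowed = {"CSV", "JSON", "PARQUET", "NETCDF", "TIFF", "FASTQ", "RDF"}
        for dtype in v:
            if dtype.upper() not in allowed:
                raise ValueError(f"Unsupported data type: {dtype}. Allowed: {sorted(allowed)}")
        return [d.upper() for d in v]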

    def to_json_ld(self) -> dict:
        """Serialize to Schema.org/Dataset JSON-LD for repository ingestion."""
        return {
            "@context": "https://schema.org",
            "@type": "Dataset",
            "identifier": self.dataset_id,
            "name": self.title,
            "author": [{"@type": "Person", "identifier": c} for c in self.creators],
            "license": LICENSE_URLS[self.license],
            "datePublished": (self.embargo_date or datetime.now(timezone.utc)).isoformat(),
            "encodingFormat": self.data_types
        }

def validate_and_serialize(raw_payload: dict) -> dict:
    """Execute validation, log FAIR checkpoints, and return JSON-LD."""
    try:
        validated = DatasetGovernanceSchema.model_validate(raw_payload)
        logger.info("FAIR Checkpoint PASSED: Metadata schema validation successful.")
        logger.info("FAIR Checkpoint PASSED: Controlled vocabulary & license constraints enforced.")
        return validated.to_json_ld()
    except ValidationError as e:
        logger.error("FAIR Checkpoint FAILED: Schema validation error. Details: %s", e.json())
        raise
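
A format check alone does not catch mistyped digits. ORCID iDs end in a check character computed with ISO 7064 MOD 11-2, so a stricter validator can recompute it. The following is a minimal, self-contained sketch of that check; the function and constant names are hypothetical and not part of the schema above.

```python
import re

# Bare ORCID iD: four hyphen-separated groups; the last character may be 'X'.
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")
ORCID_URL_PREFIX = "https://orcid.org/"


def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(identifier: str) -> bool:
    """Return True for a well-formed ORCID iD whose check character matches."""
    bare = identifier
    if bare.startswith(ORCID_URL_PREFIX):
        bare = bare[len(ORCID_URL_PREFIX):]
    if not ORCID_PATTERN.fullmatch(bare):
        return False
    digits = bare.replace("-", "")
    return orcid_check_character(digits[:15]) == digits[15]
```

A function like this could be called from the creators validator in place of the shape-only check, at the cost of rejecting placeholder identifiers that were never valid iDs.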

Step 2: Resolve Persistent Identifiers with Resilient API Integration

Automated governance requires verifying that referenced PIDs actually resolve and that the registry returns the expected metadata structure. External registry APIs enforce rate limits, expect specific headers, and return structured errors, so the pipeline implements retries with exponential backoff, explicit HTTP status checks, and payload checks. These safeguards support the Findable and Accessible principles.

python
import logging
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry
from typing import Dict, Any

logger = logging.getLogger("governance.pid_resolver")

# DataCite REST API endpoint plus client-side retry and timeout settings
DATACITE_BASE_URL = "https://api.datacite.org/dois"
MAX_RETRIES = 3
BACKOFF_FACTOR = 0.5
TIMEOUT = (5, 15)  # (connect, read)

def configure_session() -> requests.Session:
    """Configure session with exact retry policy and timeout constraints."""
    session = requests.Session()
    retry_strategy = Retry(
        total=MAX_RETRIES,
        backoff_factor=BACKOFF_FACTOR,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
        respect_retry_after_header=True
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("https://", adapter)
    session.headers.update({
        "Accept": "application/vnd.api+json",
        "User-Agent": "FAIR-Governance-Pipeline/1.0"
    })
    return session

def resolve_doi_metadata(doi: str) -> Dict[str, Any]:
    """Fetch and validate DOI metadata with resilient retry logic."""
    session = configure_session()
    url = f"{DATACITE_BASE_URL}/{doi}"
    
    logger.info("Attempting PID resolution: %s", doi)
    try:
        response = session.get(url, timeout=TIMEOUT)
        
        # Require 200 OK; any other final status counts as a failed resolution
        if response.status_code != 200:
            logger.error("FAIR Checkpoint FAILED: DOI resolution returned HTTP %s", response.status_code)
            raise requests.HTTPError(f"DOI resolution failed: HTTP {response.status_code}")

        payload = response.json()
        data = payload.get("data", {})

        # DataCite exposes titles as a list of {"title": ...} objects under
        # attributes.titles, so check that at least one non-empty title exists
        titles = data.get("attributes", {}).get("titles") or []
        if not any(isinstance(t, dict) and t.get("title") for t in titles):
            raise ValueError("Registry response missing required 'titles' entry.")
            
        logger.info("FAIR Checkpoint PASSED: PID resolved and metadata structure validated.")
        return data["attributes"]
        
    except requests.exceptions.RetryError as e:
        logger.error("FAIR Checkpoint FAILED: Max retries exceeded for %s. Error: %s", doi, e)
        raise
    except requests.exceptions.RequestException as e:
        logger.error("Network failure during PID resolution: %s", e)
        raise
    except ValueError as e:
        logger.error("Schema validation failed on registry payload: %s", e)
        raise
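
DOIs often arrive with a resolver prefix (for example https://doi.org/ or doi:) or in mixed case, which would produce a malformed lookup URL if interpolated directly. The sketch below shows one way to normalize the input before building the request URL. The helper names and the loose shape pattern are assumptions made for illustration; they are not part of the DataCite API.

```python
import re
from urllib.parse import quote

# Prefixes that commonly precede a pasted DOI
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)
# Loose shape check: "10.", a 4-9 digit registrant code, "/", a non-empty suffix
_DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw: str) -> str:
    """Strip resolver prefixes and lowercase the DOI (DOI names are case-insensitive)."""
    candidate = raw.strip()
    lowered = candidate.lower()
    for prefix in _DOI_PREFIXES:
        if lowered.startswith(prefix):
            candidate = candidate[len(prefix):]
            break
    candidate = candidate.lower()
    if not _DOI_SHAPE.match(candidate):
        raise ValueError(f"Not a recognizable DOI: {raw!r}")
    return candidate


def build_lookup_url(base_url: str, raw_doi: str) -> str:
    """Build a lookup URL, percent-encoding characters that are unsafe in a path."""
    doi = normalize_doi(raw_doi)
    # Keep "/" unescaped so the prefix/suffix separator stays intact
    return f"{base_url.rstrip('/')}/{quote(doi, safe='/')}"
```

Rejecting malformed input before any network call also keeps obviously invalid identifiers from consuming the registry's rate limit.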

Step 3: Automate Policy Enforcement for Licensing & Retention

Funder mandates and institutional retention policies dictate precise access windows and license compatibility. An automated pipeline must evaluate license strings against approved allowlists, calculate embargo expiration dates, and block ingestion if compliance thresholds are breached. This operational layer directly supports Funder Mandate Alignment workflows and ensures downstream Institutional Repository Strategy deployments do not violate compliance windows.

python
import logging
from datetime import datetime
from typing import Optional

logger = logging.getLogger("governance.policy_engine")

ALLOWED_LICENSES = {"CC-BY-4.0", "CC0-1.0", "MIT", "Apache-2.0"}
MAX_EMBARGO_DAYS = 730  # 2-year institutional maximum

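def enforce_access_policy(license_str: str, embargo_date: Optional[datetime], publication_date: datetime) -> bool:
    """Validate license and embargo against institutional retention policies.

    Both datetimes must be consistently timezone-aware or consistently naive;
    Python raises TypeError when an aware datetime is compared with a naive one.
    """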
    logger.info("Evaluating access policy for license: %s", license_str)
    
    # Validation: License allowlist check
    if license_str not in ALLOWED_LICENSES:
        logger.error("FAIR Checkpoint FAILED: License '%s' not in institutional allowlist.", license_str)
        return False
        
    # Validation: Embargo constraint check
    if embargo_date:
        if embargo_date < publication_date:
            logger.error("FAIR Checkpoint FAILED: Embargo date precedes publication date.")
            return False
            
        embargo_duration = (embargo_date - publication_date).days
        if embargo_duration > MAX_EMBARGO_DAYS:
            logger.error("FAIR Checkpoint FAILED: Embargo exceeds institutional maximum of %d days.", MAX_EMBARGO_DAYS)
            return False
            
        logger.info("FAIR Checkpoint PASSED: Embargo window validated (%d days).", embargo_duration)
    else:
        logger.info("FAIR Checkpoint PASSED: Immediate open access configured.")
        
    logger.info("FAIR Checkpoint PASSED: Policy enforcement successful. Dataset cleared for publication.")
    return True
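
One way to keep the embargo arithmetic robust is to normalize both dates to UTC before subtracting them. The sketch below is self-contained and assumes that naive datetimes represent UTC, which may not hold for every source system; as_utc and embargo_days are hypothetical helper names.

```python
from datetime import datetime, timezone
from typing import Optional


def as_utc(value: datetime) -> datetime:
    """Return a timezone-aware UTC datetime; naive values are ASSUMED to be UTC."""
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


def embargo_days(embargo_date: Optional[datetime], publication_date: datetime) -> int:
    """Whole days from publication to embargo end; 0 when no embargo is set.

    A negative result means the embargo date precedes the publication date.
    """
    if embargo_date is None:
        return 0
    return (as_utc(embargo_date) - as_utc(publication_date)).days
```

The returned day count can then be compared against an institutional maximum such as the 730-day limit used above.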

Step 4: Execute Repository Ingestion & Audit Compliance Checkpoints

The final stage pushes validated metadata to a repository API (e.g., Invenio, Dataverse, or DSpace). The pipeline must check response status codes explicitly, handle authentication tokens securely, and log every compliance checkpoint for institutional audit trails. This operationalizes the guidance found in "Building a data management plan template for researchers" by converting plan requirements into machine-executable ingestion rules.

python
import logging
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry
from typing import Dict, Any

logger = logging.getLogger("governance.repository_ingest")

REPO_API_URL = "https://repository.institution.edu/api/v1/datasets"
REPO_TIMEOUT = (5, 30)

def ingest_to_repository(json_ld: Dict[str, Any], api_token: str) -> Dict[str, Any]:
    """Push validated JSON-LD to institutional repository with audit logging."""
    session = requests.Session()
    # POST is excluded from urllib3's default retryable methods, so opt in
    # explicitly. Only do this for endpoints that handle deposits idempotently
    # (the 409 branch below guards against duplicate ingestion).
    retry_strategy = Retry(
        total=3,
        backoff_factor=1.0,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["POST"]
    )
    session.mount("https://", HTTPAdapter(max_retries=retry_strategy))
    session.headers.update({
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json",
        "X-FAIR-Compliance-Check": "PASSED"
    })
    
    logger.info("Initiating repository ingestion for dataset ID: %s", json_ld.get("identifier"))
    
    try:
        response = session.post(REPO_API_URL, json=json_ld, timeout=REPO_TIMEOUT)
        
        # Treat 201 Created as success and 409 Conflict as an existing deposit;
        # every other status is an error
        if response.status_code == 201:
            resp_data = response.json()
            logger.info("FAIR Checkpoint PASSED: Dataset successfully ingested. Repository ID: %s", resp_data.get("id"))
            return resp_data
        elif response.status_code == 409:
            logger.warning("FAIR Checkpoint WARNING: Dataset already exists in repository. Skipping duplicate.")
            return {"status": "duplicate", "message": response.text}
        else:
            logger.error("FAIR Checkpoint FAILED: Repository returned HTTP %s. Body: %s", response.status_code, response.text)
            # raise_for_status() does nothing for non-error codes such as 200 or 202,
            # which would let the function return None; raise explicitly instead.
            raise requests.HTTPError(
                f"Unexpected repository response: HTTP {response.status_code}",
                response=response,
            )
            
    except requests.exceptions.RetryError as e:
        logger.error("Repository ingestion failed after retries: %s", e)
        raise
    except requests.exceptions.RequestException as e:
        logger.error("Network or API failure during ingestion: %s", e)
        raise
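
Free-text log lines are hard to query during an audit. One option is to emit each checkpoint as a structured JSON line that includes a digest of the metadata that was evaluated. The sketch below uses only the standard library; the record fields and the build_audit_record name are assumptions chosen for illustration, not a repository or funder requirement.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_audit_record(
    checkpoint: str,
    passed: bool,
    payload: Dict[str, Any],
    detail: Optional[str] = None,
) -> str:
    """Return one JSON line describing the outcome of a compliance checkpoint."""
    # Canonical serialization (sorted keys, fixed separators) so that the same
    # metadata always produces the same digest regardless of key order
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "checkpoint": checkpoint,
        "status": "PASSED" if passed else "FAILED",
        "dataset_id": payload.get("identifier"),
        "payload_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "detail": detail,
    }
    return json.dumps(record, sort_keys=True)
```

Each returned line can be passed to the existing logger or appended to a file, and the digest lets an auditor confirm which version of the metadata a given checkpoint evaluated.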

Operationalizing Governance at Scale

Treating research data governance as executable code transforms compliance from a retrospective audit activity into a proactive, automated pipeline. By embedding strict schema validation, resilient PID resolution, policy-as-code enforcement, and auditable repository ingestion, institutions can verify that every dataset entering their ecosystem meets the machine-checkable requirements of their FAIR policy. This architecture reduces manual curation overhead, minimizes human error, and provides transparent, timestamped audit trails for funder reporting and institutional accreditation.

For academic IT teams and open science advocates, the next phase involves integrating these Python modules into CI/CD workflows, containerizing the validation services, and exposing governance APIs to researcher-facing submission portals. When governance is automated, FAIR compliance becomes a default state rather than an exceptional outcome.
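
As a starting point for CI/CD integration, a validation step can be a small script that checks metadata files and returns a non-zero exit code on failure, which most CI systems treat as a failed job. The sketch below is a minimal version under stated assumptions: check_metadata_files and main are hypothetical names, and require_title is a stand-in validator that a real setup would replace with the schema validation function from Step 1.

```python
import json
import sys
from pathlib import Path
from typing import Callable, Dict, List

Validator = Callable[[dict], object]


def require_title(payload: dict) -> dict:
    """Stand-in validator for illustration; raises ValueError on a missing title."""
    if not payload.get("title"):
        raise ValueError("missing title")
    return payload


def check_metadata_files(paths: List[Path], validate: Validator) -> Dict[str, str]:
    """Validate each JSON metadata file and return {path: reason} for failures."""
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            validate(json.loads(path.read_text(encoding="utf-8")))
        except Exception as exc:  # collect every failure instead of stopping early
            failures[str(path)] = f"{type(exc).__name__}: {exc}"
    return failures


def main(argv: List[str], validate: Validator) -> int:
    """Return a process exit code: 0 when all files pass, 1 otherwise."""
    failures = check_metadata_files([Path(p) for p in argv], validate)
    for path, reason in failures.items():
        print(f"FAILED {path}: {reason}", file=sys.stderr)
    return 1 if failures else 0
```

A CI job would pass the changed metadata files as arguments and call sys.exit(main(sys.argv[1:], require_title)), so that a single invalid record blocks the merge.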