Open License Configuration: Automating SPDX Assignment, Policy Enforcement, and Audited Deposit

Open license configuration is not a metadata afterthought; it is a deterministic control plane for research data dissemination. In a working FAIR platform, the license that governs reuse must be resolved, validated, and written at deposit time, never patched in afterwards when a funder audit surfaces a null rights field. This guide sits inside the Open Science Infrastructure Planning pipeline as the stage that decides which reuse terms a dataset carries before it becomes discoverable. It shows Python automation engineers and research data managers how to compile heterogeneous license declarations into a single canonical token, assign it through an idempotent repository API, and prove after the fact that every published record still matches institutional policy. It assumes you already run an ingestion pipeline and want to know exactly where license normalization, policy enforcement, and audit reconciliation take place, what happens when a declaration is ambiguous, and how each decision is recorded. Treating license configuration as code turns reuse rights from a manual curation task into a property the archive holds by construction.

Concept & Specification: Reducing Reuse Rights to a Canonical Token

A license is enforceable in an automated pipeline only when it is expressed as an identifier a machine can assert rather than a free-text string a human must interpret. Four standards supply that vocabulary. The SPDX License List provides the canonical short identifiers (CC-BY-4.0, CC0-1.0, MIT, Apache-2.0) that reduce a full legal permission set to one unambiguous token; this is the primary key the rest of the pipeline routes on. The Creative Commons license suite defines the actual permission semantics behind the CC-family identifiers, including whether attribution or share-alike is required. The DataCite Metadata Schema defines the rightsList property, with its rights, rightsURI, and rightsIdentifier sub-fields, that a citable dataset must carry for the license to be machine-readable in the registry record. Finally, the Schema.org Dataset type carries a license property that search engines and harvesters index, so the same canonical token must serialize consistently into both the deposit payload and the discovery document.
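
As a minimal sketch of that consistency requirement, the snippet below renders one resolved license into both a DataCite-style rights entry and a Schema.org Dataset fragment. The `LicenseRecord` dataclass and both helper functions are illustrative names, not part of any library, and the Schema.org `license` value is assumed to be the canonical URI.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseRecord:
    """Hypothetical in-memory form of one resolved license."""
    spdx_id: str
    uri: str


def to_datacite_rights(rec: LicenseRecord) -> dict:
    # rights, rightsURI and rightsIdentifier are the rightsList sub-fields.
    return {
        "rights": rec.spdx_id,
        "rightsURI": rec.uri,
        "rightsIdentifier": rec.spdx_id,
    }


def to_schema_org(rec: LicenseRecord, name: str) -> dict:
    # Schema.org Dataset exposes reuse terms through its `license` property.
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "license": rec.uri,
    }


rec = LicenseRecord("CC-BY-4.0", "https://creativecommons.org/licenses/by/4.0/")
datacite = to_datacite_rights(rec)
schema_org = to_schema_org(rec, name="Example dataset")

# Both documents must point at the same URI, or the record disagrees with itself.
assert datacite["rightsURI"] == schema_org["license"]
print(json.dumps(schema_org, indent=2))
```

Building both documents from the same object is the design point: neither serializer accepts a free-standing URI or token, so they cannot be fed different values.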

Each configuration rule binds one of these standards to a decision point and a failure action, and the rules fall into three classes. Normalization rules map a heterogeneous declaration — a bare URL, a vendor label, a legacy Creative Commons version — onto a canonical SPDX identifier, and are enforced by Pydantic schema validation at the ingestion boundary. Policy rules assert that the resolved identifier is actually permitted for this dataset under institutional and funder obligations — the same allowlist discipline the Data Governance Frameworks policy layer applies to embargo windows. Serialization rules guarantee that the token, its canonical URI, and its attribution flag are written identically into the DataCite rightsList and the Schema.org license field so the record never disagrees with itself. The crosswalk from an internal license field into the DataCite and Schema.org rights properties is the same field-mapping problem handled generally in Metadata Schema Mapping; this page is its license-specific specialization. The sections below implement one gate per rule class, in the order a record must clear them.

Step-by-Step Implementation

The pipeline advances a license declaration through four ordered stages: deterministic ingestion and SPDX normalization, pre-flight policy verification and idempotent assignment, funder-mandate override resolution, and continuous audit reconciliation. A declaration that fails any stage is quarantined with a machine-readable reason rather than published with an ambiguous or non-compliant license.
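
The ordering and quarantine behavior can be sketched as a small stage runner. Everything here is hypothetical scaffolding: `StageFailure`, `Outcome`, `run_stages`, and the stand-in stage functions are illustrative names, and the later stages are reduced to pass-throughs so that only the control flow is shown.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class StageFailure(Exception):
    """Raised by a stage to stop a record with a machine-readable reason."""

    def __init__(self, reason: str) -> None:
        super().__init__(reason)
        self.reason = reason


@dataclass
class Outcome:
    status: str                    # "cleared" or "quarantined"
    stage: Optional[str] = None    # stage that stopped the record, if any
    reason: Optional[str] = None   # machine-readable failure code


Stage = Callable[[dict], dict]


def run_stages(record: dict, stages: list[tuple[str, Stage]]) -> Outcome:
    """Run (name, stage) pairs in order; the first failure quarantines."""
    for name, stage in stages:
        try:
            record = stage(record)
        except StageFailure as exc:
            return Outcome("quarantined", stage=name, reason=exc.reason)
    return Outcome("cleared")


def normalize(record: dict) -> dict:
    # Stand-in for Step 1: accept only an exact approved token.
    if record.get("license") not in {"CC-BY-4.0", "CC0-1.0"}:
        raise StageFailure("unresolved_declaration")
    return {**record, "spdx_id": record["license"]}


def passthrough(record: dict) -> dict:
    # Stand-in for the later stages, which are not modelled here.
    return record


STAGES = [
    ("normalize", normalize),
    ("policy_and_assign", passthrough),
    ("mandate", passthrough),
    ("reconcile", passthrough),
]

cleared = run_stages({"license": "CC-BY-4.0"}, STAGES)
stopped = run_stages({"license": "All rights reserved"}, STAGES)
print(cleared)
print(stopped)
```

Returning the stage name together with the reason code is what lets a curator trace a quarantined record back to the gate that stopped it.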

Step 1 — Normalize declarations to a canonical SPDX identifier

License configuration begins at the point of ingestion. Rather than trusting manual form submissions or ad-hoc spreadsheet tracking, the pipeline parses license declarations from submission manifests, ORCID-linked author profiles, or institutional policy registries and normalizes them into canonical SPDX identifiers using a version-controlled mapping table. The SPDX License List short identifier — not the display name or URL — is the value every downstream stage keys on, which is why the deterministic mapping lives in a reviewed license_mappings.yaml rather than being derived by string manipulation. This stage enforces the normalization rules that satisfy the Reusable principle before any repository call is made: a record whose license cannot be reduced to an approved token never advances.

Using the Pydantic V2 API, the ingestion contract enforces type safety, required fields, and SPDX allowlist membership. If the license field is missing, malformed, or references a deprecated Creative Commons version, validation raises, the payload is written to a structured quarantine log, and the workflow halts before any downstream publishing system receives ambiguous terms.

python

import json
import logging
from datetime import datetime, timezone
from typing import Optional

import yaml
from pydantic import BaseModel, ValidationError, field_validator

logger = logging.getLogger("license.ingestion")

# Load the version-controlled SPDX mapping from license_mappings.yaml.
# Expected structure:
#   approved_licenses:
#     - CC-BY-4.0
#     - CC0-1.0
#     - MIT
#     - Apache-2.0
def load_spdx_registry(path: str = "license_mappings.yaml") -> dict:
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

SPDX_REGISTRY = load_spdx_registry()

# Canonical URIs keyed by SPDX identifier. SPDX IDs do not map 1:1 to URL
# paths, so this table cannot be derived by string templating.
CANONICAL_URI = {
    "CC-BY-4.0": "https://creativecommons.org/licenses/by/4.0/",
    "CC-BY-SA-4.0": "https://creativecommons.org/licenses/by-sa/4.0/",
    "CC0-1.0": "https://creativecommons.org/publicdomain/zero/1.0/",
    "MIT": "https://opensource.org/licenses/MIT",
    "Apache-2.0": "https://www.apache.org/licenses/LICENSE-2.0",
}

# Deprecated or ambiguous inbound labels mapped to their canonical SPDX token.
LEGACY_ALIASES = {
    "CC-BY": "CC-BY-4.0",
    "CC BY 4.0": "CC-BY-4.0",
    "https://creativecommons.org/licenses/by/4.0/": "CC-BY-4.0",
    "public domain": "CC0-1.0",
}

class LicensePayload(BaseModel):
    """Strict contract for a normalized, deposit-ready license record."""
    raw_declaration: str
    spdx_id: Optional[str] = None
    canonical_uri: Optional[str] = None
    attribution_required: bool = False

    @field_validator("spdx_id")
    @classmethod
    def validate_spdx(cls, v: Optional[str]) -> str:
        if v is None:
            raise ValueError("SPDX identifier must be resolved before assignment.")
        if v not in SPDX_REGISTRY.get("approved_licenses", []):
            raise ValueError(f"License {v} is not in the approved institutional registry.")
        return v

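# Case-folded view of the alias table, so "cc-by" and "CC-BY" resolve alike.
_ALIASES_FOLDED = {alias.casefold(): token for alias, token in LEGACY_ALIASES.items()}

def normalize_spdx(raw: str) -> Optional[str]:
    """Reduce a heterogeneous declaration to a canonical SPDX token."""
    candidate = raw.strip()
    if candidate in SPDX_REGISTRY.get("approved_licenses", []):
        return candidate
    # Alias matching is case-insensitive; exact tokens above stay case-sensitive.
    return _ALIASES_FOLDED.get(candidate.casefold())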

def ingest_license(raw_manifest: dict, quarantine_log_path: str = "quarantine.ndjson") -> LicensePayload:
    """Normalize and validate a raw license declaration.

    On failure, writes a structured error record to the quarantine log and re-raises.
    """
    declaration = raw_manifest.get("license_string", "")
    spdx_id = raw_manifest.get("spdx_id") or normalize_spdx(declaration)
    try:
        payload = LicensePayload(
            raw_declaration=declaration,
            spdx_id=spdx_id,
            canonical_uri=CANONICAL_URI.get(spdx_id or ""),
            attribution_required=(spdx_id or "").startswith("CC-BY"),
        )
        logger.info("License normalized: %r -> %s", declaration, payload.spdx_id)
        return payload
    except ValidationError as e:
        quarantine_entry = {
            "status": "quarantined",
            "error": e.errors(),
            "payload": raw_manifest,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
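        with open(quarantine_log_path, "a", encoding="utf-8") as qf:
            # default=str: Pydantic's error context can hold the original
            # exception object, which json.dumps cannot serialize on its own.
            qf.write(json.dumps(quarantine_entry, default=str) + "\n")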
        logger.error("License quarantined: %s", e.errors())
        raise
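
Because the quarantine log is newline-delimited JSON, a curator can triage it with a few lines of code. The sketch below assumes each entry has the shape written by the quarantine handler above, with an `error` list whose items carry a `msg` key; the helper name `summarize_quarantine` and the sample messages are illustrative.

```python
import io
import json
from collections import Counter
from typing import Iterable


def summarize_quarantine(lines: Iterable[str]) -> Counter:
    """Count quarantined records by the first validation message they carry."""
    reasons: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        entry = json.loads(line)
        errors = entry.get("error") or [{}]
        reasons[errors[0].get("msg", "unknown")] += 1
    return reasons


# Illustrative log content; real entries carry Pydantic's full error dicts.
sample_log = io.StringIO(
    "\n".join(
        json.dumps(entry)
        for entry in [
            {"status": "quarantined", "error": [{"msg": "not in approved registry"}]},
            {"status": "quarantined", "error": [{"msg": "identifier unresolved"}]},
            {"status": "quarantined", "error": [{"msg": "not in approved registry"}]},
        ]
    )
)

summary = summarize_quarantine(sample_log)
for reason, count in summary.most_common():
    print(f"{count:>3}  {reason}")
```

In practice the same function can be pointed at an open file handle for `quarantine.ndjson` instead of the in-memory sample.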

Step 2 — Pre-flight the policy check, then assign idempotently

Once normalized, the license token propagates to the repository metadata endpoint. Integration with institutional repository systems requires idempotent PUT or PATCH operations so that a retried request after a network timeout never double-writes or corrupts the rights record. The payload carries the SPDX identifier, the canonical license URI, the attribution flag, and a DataCite Metadata Schema rightsList block, and the client must respect rate limits, back off with jitter, and verify response codes before proceeding. This stage is where a validated token becomes a durable property of the deposited dataset described in Institutional Repository Strategy.

The critical production pattern is pre-flight validation: before attempting assignment, the client queries the repository’s license policy registry to confirm the requested identifier is permitted for this dataset. A 403 Forbidden or 409 Conflict is a policy mismatch, not a transient error — the pipeline logs it, opens a compliance-review workflow, and refuses the assignment rather than silently publishing terms the institution disallows.

python

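import logging
from urllib.parse import quote

import requests
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)

logger = logging.getLogger("license.repository_client")

class TransientRepositoryError(Exception):
    """Raised for responses worth retrying (HTTP 429 and 5xx)."""

class RepositoryLicenseClient:
    """Idempotent license assignment with a pre-flight policy gate."""

    def __init__(self, base_url: str, api_token: str) -> None:
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.headers.update({"Authorization": f"Bearer {api_token}"})

    def preflight_check(self, spdx_id: str, dataset_doi: str) -> bool:
        """Verify institutional policy allowance before assignment."""
        resp = self.session.get(
            f"{self.base_url}/api/v1/policies/licenses/{quote(spdx_id, safe='')}",
            params={"dataset": dataset_doi},
            timeout=(5, 15),
        )
        if resp.status_code == 200:
            return bool(resp.json().get("allowed", False))
        if resp.status_code in (403, 409):
            # Log the raw body: an error response is not guaranteed to be JSON.
            logger.warning(
                "Policy mismatch for %s / %s: %s", dataset_doi, spdx_id, resp.text
            )
            return False
        resp.raise_for_status()
        return False

    @retry(
        stop=stop_after_attempt(5),
        wait=wait_exponential_jitter(initial=2, max=30),
        # Retry only transient failures. Retrying every RequestException would
        # also retry the HTTPError that raise_for_status() raises for a 403.
        retry=retry_if_exception_type(
            (
                requests.exceptions.ConnectionError,
                requests.exceptions.Timeout,
                TransientRepositoryError,
            )
        ),
        reraise=True,
    )
    def assign_license(self, dataset_doi: str, payload: dict) -> requests.Response:
        """Idempotent PATCH with exponential backoff and jitter."""
        # A DOI contains "/", so it is percent-encoded before entering the path.
        # Adjust this if your repository expects the raw DOI in the URL.
        endpoint = f"{self.base_url}/api/v1/datasets/{quote(dataset_doi, safe='')}/metadata"
        response = self.session.patch(endpoint, json=payload, timeout=(5, 30))
        if response.status_code == 429 or response.status_code >= 500:
            raise TransientRepositoryError(f"HTTP {response.status_code}")
        response.raise_for_status()  # remaining 4xx responses fail without retry
        token = (payload.get("rightsList") or [{}])[0].get("rightsIdentifier")
        logger.info("Assigned %s to %s (HTTP %s)", token, dataset_doi, response.status_code)
        return response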

def build_rights_block(payload: "LicensePayload") -> dict:
    """Serialize a normalized license into a DataCite rightsList entry."""
    return {
        "rightsList": [
            {
                "rights": payload.spdx_id,
                "rightsURI": payload.canonical_uri,
                "rightsIdentifier": payload.spdx_id,
                "rightsIdentifierScheme": "SPDX",
            }
        ]
    }
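
The retry timing can be reasoned about without the library. The sketch below is a hand-rolled illustration, not tenacity's own code: it assumes each delay is `initial * 2**(attempt - 1)` plus a uniform random jitter, capped at a maximum, and the one-second jitter width is an assumption chosen for the example.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    initial: float = 2.0,
    cap: float = 30.0,
    jitter: float = 1.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Delay in seconds before the retry that follows attempt `attempt` (1-based)."""
    rng = rng or random.Random()
    raw = initial * 2 ** (attempt - 1) + rng.uniform(0.0, jitter)
    return min(raw, cap)


rng = random.Random(0)  # seeded so the printed schedule is repeatable
# Five attempts in total means at most four waits between them.
schedule = [backoff_delay(n, rng=rng) for n in range(1, 5)]
for n, delay in enumerate(schedule, start=1):
    print(f"wait after attempt {n}: {delay:.2f}s")
```

Under these assumptions the four waits fall in the ranges 2 to 3, 4 to 5, 8 to 9, and 16 to 17 seconds, so a record that exhausts all five attempts has waited at most about 34 seconds before the error is raised.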

Step 3 — Resolve funder mandates and record every override

License automation must intersect with policy enforcement. Research workflows operate within regulatory ecosystems where funder requirements, institutional defaults, and retention policies dictate permissible reuse terms, and those obligations frequently conflict. When a grant carries an open-access mandate — resolved against the centralized registry described in Funder Mandate Alignment — the automation layer must override the institutional default and enforce the mandated license, writing the override to the audit ledger so the decision is reconstructable. The concrete field-level mapping for one such mandate is worked through in Aligning NIH data sharing policies with FAIR principles.

The selection function is deterministic: given a grant context and an institutional default, it returns exactly one SPDX token plus a structured provenance record naming the rule that produced it. That provenance is what distinguishes a defensible license decision from an accidental one at audit time.

python

import logging
from typing import Optional, TypedDict

logger = logging.getLogger("license.mandate_resolver")

# Funder registry: grant-agency code -> required SPDX token (None = no mandate).
FUNDER_LICENSE_MANDATE = {
    "NIH": "CC-BY-4.0",
    "NSF": "CC-BY-4.0",
    "WELLCOME": "CC-BY-4.0",
    "GATES": "CC0-1.0",
}

class LicenseDecision(TypedDict):
    spdx_id: str
    source: str          # "funder_mandate" | "institutional_default"
    overridden: bool

def resolve_license(
    grant_agency: Optional[str],
    institutional_default: str = "CC-BY-4.0",
) -> LicenseDecision:
    """Return the governing SPDX token and the rule that selected it."""
    mandated = FUNDER_LICENSE_MANDATE.get((grant_agency or "").upper())
    if mandated and mandated != institutional_default:
        logger.info(
            "Funder %s mandate overrides default %s -> %s",
            grant_agency, institutional_default, mandated,
        )
        return {"spdx_id": mandated, "source": "funder_mandate", "overridden": True}
    if mandated:
        return {"spdx_id": mandated, "source": "funder_mandate", "overridden": False}
    return {
        "spdx_id": institutional_default,
        "source": "institutional_default",
        "overridden": False,
    }

Data governance frameworks further dictate how licenses interact with retention schedules. Attribution licenses may require longer metadata preservation windows, while a CC0-1.0 public-domain dedication often aligns with shorter ones, so the resolved token is attached to the artifact’s lifecycle policy object and honored by storage tiering, archival migration, and deletion workflows. The end-to-end injection of a CC-BY-4.0 record into a publishing pipeline is detailed in Configuring CC-BY licenses for automated dataset publishing.

Step 4 — Reconcile published licenses against current policy

Production license configuration requires continuous drift detection. Metadata schemas evolve, SPDX identifiers are deprecated, and institutional policies shift, so a license that was compliant at deposit can silently fall out of policy months later. A scheduled reconciliation job queries every published dataset for its rights record, compares the stored SPDX token against the current version-controlled mapping, and flags any record whose license no longer matches policy or has been superseded. This stage produces the compliance reports that satisfy institutional review boards and funder audits without a human re-reading each record.

python

import logging
from typing import Iterable, Iterator

logger = logging.getLogger("license.reconciler")

def reconcile_licenses(
    published: Iterable[dict],
    approved: set[str],
    superseded: dict[str, str],
) -> Iterator[dict]:
    """Yield a drift finding for every non-compliant published record.

    `superseded` maps a retired SPDX token to its replacement, e.g.
    {"CC-BY-3.0": "CC-BY-4.0"}.
    """
    for record in published:
        doi = record["doi"]
        token = record.get("spdx_id")
        if token is None:
            yield {"doi": doi, "issue": "missing_license", "current": None}
        elif token in superseded:
            yield {
                "doi": doi,
                "issue": "superseded",
                "current": token,
                "recommended": superseded[token],
            }
        elif token not in approved:
            yield {"doi": doi, "issue": "policy_violation", "current": token}
        else:
            logger.debug("Record %s compliant: %s", doi, token)
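
The drift findings can be rolled up into the compliance report mentioned above. This is a minimal sketch: `build_compliance_report` is a hypothetical helper, the findings mirror the dictionaries the reconciler yields, and the DOIs are placeholders.

```python
from collections import Counter
from typing import Iterable


def build_compliance_report(findings: Iterable[dict], total_published: int) -> dict:
    """Summarize drift findings against the number of published records."""
    findings = list(findings)
    flagged = {f["doi"] for f in findings}
    return {
        "total_published": total_published,
        "compliant": total_published - len(flagged),
        "flagged": len(flagged),
        "by_issue": dict(Counter(f["issue"] for f in findings)),
    }


# Placeholder findings in the shape yielded by the reconciler.
sample_findings = [
    {"doi": "10.1234/example-a", "issue": "superseded",
     "current": "CC-BY-3.0", "recommended": "CC-BY-4.0"},
    {"doi": "10.1234/example-b", "issue": "missing_license", "current": None},
    {"doi": "10.1234/example-c", "issue": "superseded",
     "current": "CC-BY-3.0", "recommended": "CC-BY-4.0"},
]

report = build_compliance_report(sample_findings, total_published=10)
print(report)
```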

Structured logging captures every resolution, pre-flight check, assignment, and mandate override, keyed by dataset DOI, SPDX identifier, and policy version. That log is the immutable audit trail; the access controls that keep it tamper-evident are covered in Security & Access Control.
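
One way to get that keyed, machine-readable trail from the standard `logging` module is a JSON formatter that lifts the audit keys out of each record. The formatter class, the logger name, and the `event` field below are illustrative assumptions; the DOI and policy version are placeholders.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonAuditFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    AUDIT_KEYS = ("doi", "spdx_id", "policy_version", "event")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Keys passed through `extra=` become attributes on the record.
        for key in self.AUDIT_KEYS:
            entry[key] = getattr(record, key, None)
        return json.dumps(entry, sort_keys=True)


buffer = io.StringIO()  # stands in for the real audit sink
handler = logging.StreamHandler(buffer)
handler.setFormatter(JsonAuditFormatter())

audit = logging.getLogger("license.audit.example")
audit.setLevel(logging.INFO)
audit.addHandler(handler)
audit.propagate = False

audit.info(
    "license assigned",
    extra={
        "doi": "10.1234/example",
        "spdx_id": "CC-BY-4.0",
        "policy_version": "2026-07",
        "event": "assignment",
    },
)

logged = json.loads(buffer.getvalue())
print(logged)
```

Emitting every key on every line, with `None` where a value is absent, keeps the log uniform enough to query by DOI or policy version without per-event parsing rules.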

License Configuration Reference Matrix

Every rule in the pipeline maps a governing standard to an enforcement stage, a decision function, and the action taken on failure. Use this matrix to reconcile a quarantined or flagged record against the exact stage that stopped it. The attribution_required and canonical-URI columns are the values written verbatim into the DataCite Metadata Schema rightsList and the Schema.org Dataset license field.

SPDX identifier	Canonical URI	Attribution	Share-alike	Typical trigger
`CC-BY-4.0`	`https://creativecommons.org/licenses/by/4.0/`	Required	No	NIH / NSF / Wellcome open-access mandate
`CC-BY-SA-4.0`	`https://creativecommons.org/licenses/by-sa/4.0/`	Required	Yes	Derivative-corpus share-alike requirement
`CC0-1.0`	`https://creativecommons.org/publicdomain/zero/1.0/`	No	No	Public-domain dedication; shortest retention
`MIT`	`https://opensource.org/licenses/MIT`	Required (notice)	No	Software artifact bundled with dataset
`Apache-2.0`	`https://www.apache.org/licenses/LICENSE-2.0`	Required (notice)	No	Patent-grant software component

Configuration control	Governing standard	Enforcement stage	Decision rule	Failure action
Token normalization	SPDX License List	Step 1 `normalize_spdx`	Declaration maps to an approved token	Quarantine — unresolved declaration
Allowlist membership	Institutional policy	Step 1 `validate_spdx`	`spdx_id` in `approved_licenses`	Quarantine — `ValidationError`
Policy allowance	Repository policy registry	Step 2 `preflight_check`	HTTP 200 with `allowed: true`	Block — compliance review (403/409)
Idempotent write	DataCite Metadata Schema	Step 2 `assign_license`	HTTP 2xx on `PATCH`	Retry with jittered backoff, then raise
Mandate override	Funder registry	Step 3 `resolve_license`	Funder token differs from default	Override default, log provenance
Drift detection	SPDX License List	Step 4 `reconcile_licenses`	Stored token in current allowlist	Flag — superseded / policy_violation

Error Handling and Edge Cases

A license pipeline is judged on how it fails. Each stage distinguishes transient failures, which are retried, from permanent failures, which are quarantined or blocked for human decision, and never conflates the two.

Unresolvable declarations at Step 1 are permanent. A manifest whose license_string matches neither an approved token nor a legacy alias is quarantined with the raw payload so a curator can extend LEGACY_ALIASES or reject the submission — never guessed at, because an incorrectly guessed license is a legal defect, not a data-quality one.
Policy mismatches (403/409 at Step 2) are business-rule failures, not exceptions. They open a compliance-review ticket rather than triggering the retry policy, because retrying a forbidden assignment just re-confirms it is forbidden.
Transient repository failures (HTTP 429/5xx during Step 2) are retried with jittered exponential backoff. Because assign_license is an idempotent PATCH, a retry after a server-side commit rewrites the same rights block rather than corrupting it.
Mandate conflicts at Step 3 — two co-funders requiring incompatible licenses, e.g. a CC0-1.0 public-domain dedication against a CC-BY-SA-4.0 share-alike requirement — cannot be resolved by the machine. Emit a mandate_conflict finding to the audit ledger and route to a data manager; silently picking one funder’s terms is the failure mode this stage exists to prevent.
Reconciliation of superseded tokens at Step 4 must recommend, not auto-migrate. A CC-BY-3.0 -> CC-BY-4.0 upgrade changes the legal terms a depositor agreed to, so the reconciler flags it for review and never rewrites a published record unattended.
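
The co-funder case can be sketched as a variant of the Step 3 resolver that accepts several agencies. This is an assumption-laden illustration: `resolve_cofunded` is a hypothetical function, the mandate table is copied from Step 3 so the block is self-contained, and any two differing mandated tokens are conservatively treated as a conflict rather than checked for legal compatibility.

```python
# Copied from the Step 3 funder registry so this sketch stands alone.
FUNDER_LICENSE_MANDATE = {
    "NIH": "CC-BY-4.0",
    "NSF": "CC-BY-4.0",
    "WELLCOME": "CC-BY-4.0",
    "GATES": "CC0-1.0",
}


def resolve_cofunded(agencies: list[str], institutional_default: str = "CC-BY-4.0") -> dict:
    """Resolve one token for several funders, or report a mandate_conflict."""
    mandates = {}
    for agency in agencies:
        token = FUNDER_LICENSE_MANDATE.get(agency.upper())
        if token:
            mandates[agency.upper()] = token

    distinct = sorted(set(mandates.values()))
    if len(distinct) > 1:
        # Never pick a winner: hand the decision to a data manager.
        return {"issue": "mandate_conflict", "spdx_id": None, "mandates": mandates}
    if distinct:
        return {"issue": None, "spdx_id": distinct[0], "source": "funder_mandate"}
    return {"issue": None, "spdx_id": institutional_default, "source": "institutional_default"}


agreeing = resolve_cofunded(["NIH", "Wellcome"])
conflicting = resolve_cofunded(["NIH", "Gates"])
unfunded = resolve_cofunded([])
print(agreeing, conflicting, unfunded, sep="\n")
```

Returning `spdx_id: None` on conflict means a caller that forgets to check `issue` fails at validation instead of depositing one funder's terms by accident.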

Verification and Testing

Assert license behavior the same way you assert application logic — with tests that exercise both the pass path and every rejection path. The mandate resolver is the rule most likely to drift as funder policies change, so test its override branches explicitly.

python

def test_mandate_resolution() -> None:
    # Funder mandate overrides a differing institutional default.
    d = resolve_license("NIH", institutional_default="MIT")
    assert d == {"spdx_id": "CC-BY-4.0", "source": "funder_mandate", "overridden": True}

    # Funder mandate that equals the default is not flagged as an override.
    d = resolve_license("NIH", institutional_default="CC-BY-4.0")
    assert d["overridden"] is False

    # No mandate -> institutional default is applied.
    d = resolve_license(None, institutional_default="CC0-1.0")
    assert d == {"spdx_id": "CC0-1.0", "source": "institutional_default", "overridden": False}

def test_normalization_rejects_unknown() -> None:
    # A legacy alias resolves; an unknown string does not.
    assert normalize_spdx("public domain") == "CC0-1.0"
    assert normalize_spdx("Proprietary-Internal") is None

A passing ingestion run emits a contiguous ladder of checkpoint logs; the presence of every expected line is itself an assertion you can grep in CI:

code

2026-07-02 09:14:01 [INFO] license.ingestion | License normalized: 'CC BY 4.0' -> CC-BY-4.0
2026-07-02 09:14:01 [INFO] license.mandate_resolver | Funder NIH mandate overrides default MIT -> CC-BY-4.0
2026-07-02 09:14:02 [INFO] license.repository_client | Assigned CC-BY-4.0 to 10.5281/zenodo.7654321 (HTTP 200)
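
A CI job can check those checkpoints in order instead of relying on a manual grep. The sketch below assumes the log format shown above; the patterns and the helper name `missing_checkpoints` are illustrative.

```python
import re

# One pattern per expected checkpoint, in the order they must appear.
EXPECTED = [
    r"license\.ingestion \| License normalized: .* -> CC-BY-4\.0",
    r"license\.mandate_resolver \| Funder NIH mandate overrides default MIT -> CC-BY-4\.0",
    r"license\.repository_client \| Assigned CC-BY-4\.0 to \S+ \(HTTP 200\)",
]

SAMPLE_LOG = """\
2026-07-02 09:14:01 [INFO] license.ingestion | License normalized: 'CC BY 4.0' -> CC-BY-4.0
2026-07-02 09:14:01 [INFO] license.mandate_resolver | Funder NIH mandate overrides default MIT -> CC-BY-4.0
2026-07-02 09:14:02 [INFO] license.repository_client | Assigned CC-BY-4.0 to 10.5281/zenodo.7654321 (HTTP 200)
"""


def missing_checkpoints(log_text: str, expected: list[str]) -> list[str]:
    """Return the expected patterns that do not appear, in order, in the log."""
    lines = log_text.splitlines()
    missing = []
    pos = 0  # search resumes after the previous match to enforce ordering
    for pattern in expected:
        for i in range(pos, len(lines)):
            if re.search(pattern, lines[i]):
                pos = i + 1
                break
        else:
            missing.append(pattern)
    return missing


print(missing_checkpoints(SAMPLE_LOG, EXPECTED))
```

An empty list means every checkpoint was reached; a non-empty list names the first stage whose log line never appeared, which is usually the stage that failed.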

Gotchas and Known Pitfalls

License URL versus SPDX token. A record whose license field holds https://creativecommons.org/licenses/by/4.0/ is not interchangeable with the token CC-BY-4.0. Normalize to the SPDX identifier at ingestion and keep the URI only in the rightsURI slot; keying policy logic on the URL means one trailing slash silently defeats the allowlist.
Deprecated Creative Commons versions. A manifest declaring CC-BY-2.0 or CC-BY-3.0 is a valid SPDX token but usually not an approved one. Reject it at Step 1 rather than letting it reach the repository, and let Step 4 reconciliation, not the ingestion path, decide how already-published legacy records are migrated.
Allowlist drift across stages. The approved-token set is referenced by Step 1 validation, Step 2 pre-flight, and Step 4 reconciliation. Hoist it into the single license_mappings.yaml source of truth; when the three copies drift apart, ingestion accepts a token that reconciliation then flags, or quarantines a declaration that policy already permits.
Attribution flag divergence. Deriving attribution_required from the token prefix (CC-BY*) in one place and hard-coding it elsewhere produces a Dataset document whose license implies attribution while the deposit payload does not. Compute it once, when ingest_license builds the LicensePayload, and pass the object; never recompute the flag downstream.
Retrying a forbidden assignment. Wrapping preflight_check inside the same retry policy as assign_license turns a deterministic 403 into five wasted calls and a delayed failure. Keep the policy gate outside the retry decorator so a policy mismatch fails fast.
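
The URL pitfall can be handled by canonicalizing an inbound license URL before any lookup. The sketch below is one possible approach using only the standard library; `spdx_from_url` and the `URL_TO_SPDX` table are hypothetical, stripping a leading `www.` is an assumption, and deed or legalcode suffixes are deliberately not handled.

```python
from typing import Optional
from urllib.parse import urlsplit

# Keys are host + path with no scheme and no trailing slash.
URL_TO_SPDX = {
    "creativecommons.org/licenses/by/4.0": "CC-BY-4.0",
    "creativecommons.org/licenses/by-sa/4.0": "CC-BY-SA-4.0",
    "creativecommons.org/publicdomain/zero/1.0": "CC0-1.0",
}


def spdx_from_url(url: str) -> Optional[str]:
    """Map a license URL to its SPDX token, ignoring scheme, case, and trailing slash."""
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https"):
        return None  # not a URL; leave it to the alias table
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/").lower()
    return URL_TO_SPDX.get(f"{host}{path}")


variants = [
    "https://creativecommons.org/licenses/by/4.0/",
    "http://creativecommons.org/licenses/by/4.0",
    "HTTPS://CreativeCommons.org/licenses/by/4.0/",
]
resolved = [spdx_from_url(u) for u in variants]
print(resolved)
```

All three spellings resolve to the same token, so the allowlist is only ever asked about `CC-BY-4.0` and never about a URL string.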

Frequently Asked Questions

Why normalize to SPDX identifiers instead of storing the license URL?

Because the SPDX short identifier is a single canonical token that every stage (allowlist check, funder mandate, drift reconciliation) can key on unambiguously, whereas a URL varies by trailing slash, protocol, and version path. Store the token as the primary value and keep the canonical URI only in the DataCite Metadata Schema rightsURI field, looked up from the token in the reviewed CANONICAL_URI table so the two can never disagree.

What happens when a funder mandate conflicts with the institutional default?

The mandate wins and the override is written to the audit ledger with its provenance. The resolve_license function returns the funder token plus overridden: true, so the decision is reconstructable at audit time. The only case the machine refuses to resolve is two co-funders requiring incompatible licenses — that emits a mandate_conflict finding for a data manager rather than silently choosing one.

How do I add a new approved license without breaking existing records?

Add the SPDX token to approved_licenses in license_mappings.yaml, add its canonical URI to the CANONICAL_URI map, add any inbound aliases to LEGACY_ALIASES, and add a test asserting it normalizes and passes the allowlist. Because ingestion, pre-flight, and reconciliation all read the same file, deploying that one reviewed change keeps every stage consistent and gives you a diff that is itself the audit record of the policy change.
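
A consistency check over the three tables catches a half-finished policy change before it ships. This is a sketch under assumptions: `config_problems` is a hypothetical helper, the tables are inlined copies of the ones in Step 1, and `ODbL-1.0` stands in for a newly added token.

```python
APPROVED = ["CC-BY-4.0", "CC0-1.0", "MIT", "Apache-2.0"]

CANONICAL_URI = {
    "CC-BY-4.0": "https://creativecommons.org/licenses/by/4.0/",
    "CC-BY-SA-4.0": "https://creativecommons.org/licenses/by-sa/4.0/",
    "CC0-1.0": "https://creativecommons.org/publicdomain/zero/1.0/",
    "MIT": "https://opensource.org/licenses/MIT",
    "Apache-2.0": "https://www.apache.org/licenses/LICENSE-2.0",
}

LEGACY_ALIASES = {
    "CC-BY": "CC-BY-4.0",
    "CC BY 4.0": "CC-BY-4.0",
    "public domain": "CC0-1.0",
}


def config_problems(approved: list[str], canonical_uri: dict, aliases: dict) -> list[str]:
    """List every inconsistency between the allowlist, URI map, and alias table."""
    problems = []
    for token in approved:
        if token not in canonical_uri:
            problems.append(f"{token}: no canonical URI")
    for alias, target in aliases.items():
        if target not in approved:
            problems.append(f"alias {alias!r} -> {target}: target not approved")
    return problems


# The current configuration is consistent.
print(config_problems(APPROVED, CANONICAL_URI, LEGACY_ALIASES))

# Approving a new token without adding its URI is reported.
print(config_problems(APPROVED + ["ODbL-1.0"], CANONICAL_URI, LEGACY_ALIASES))
```

Run as a unit test, this turns the checklist in the answer above into a failing build whenever one of its steps is skipped.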

How is a superseded license version handled after publication?

Step 4 reconciliation flags it — a stored CC-BY-3.0 maps to a superseded finding recommending CC-BY-4.0 — but never rewrites the published record automatically. Upgrading a license version changes the terms the depositor agreed to, so the migration is routed to a curator for an explicit decision, not applied unattended by the reconciler.

Related Guides

Open Science Infrastructure Planning — the parent overview showing how license configuration composes with the governance, funder, and repository stages.
Data Governance Frameworks — the policy layer whose allowlist and embargo discipline this stage extends to reuse rights.
Funder Mandate Alignment — the registry the Step 3 mandate resolver reads to override institutional defaults.
Institutional Repository Strategy — the deposit target that consumes the audited rightsList this pipeline produces.
Configuring CC-BY licenses for automated dataset publishing — the end-to-end injection of a CC-BY-4.0 record into a publishing pipeline.
