Configuring CC-BY 4.0 Licenses for Automated Dataset Publishing

Automated deposit pipelines fracture most often at the license injection stage: a Creative Commons Attribution 4.0 International (CC-BY 4.0) label is written as free text, mismatched against the repository’s schema, or silently overwritten during a metadata crosswalk, and the deposit is either rejected or defaulted to a restrictive placeholder. This page is the concrete, field-level worked example behind the broader Open License Configuration guide: where that guide defines the registry and resolution rules for any license, here we take CC-BY 4.0 end to end and show exactly which input labels resolve to which canonical URI, how that URI populates a DataCite rightsList entry, and how a Pydantic V2 model rejects a malformed license before serialization. It assumes you already operate a deposit pipeline and are comfortable with Python 3.10+ and the Pydantic V2 API; the goal is to replace ad-hoc string concatenation with a deterministic, validated, and drift-resistant license gate.

The failure is architectural, not cosmetic. When a repository API expects license: "https://creativecommons.org/licenses/by/4.0/" and receives "CC-BY-4.0", "Creative Commons Attribution", or "Open Access", the ingestion service coerces or drops the value, and the deposit lands with no enforceable license at all — a direct FAIR-reusability failure. The gate below runs the same resolve-then-validate discipline the ingestion layer applies to metadata, turned toward the license field: detect the incoming label, resolve it to a canonical CC-BY record, validate the record against a strict schema, and only then serialize the payload.

The Core Resolution Table: Labels to Canonical CC-BY Records

The single most effective control is a closed vocabulary that maps every accepted human-readable label to exactly one canonical record. Free-text acceptance is the root cause of silent coercion, so the resolver rejects anything not in this table. Identifiers follow the SPDX License List, which publishes each ID in one canonical case (CC-BY-4.0, CC-BY-SA-4.0, CC0-1.0). The SPDX specification asks consumers to match identifiers case-insensitively, but a downstream system may still compare them as exact strings, so the pipeline always emits the canonical form; the SPDX registry and its case rules are covered in the Open License Configuration guide. The table below is the authoritative resolution map and the core reference artifact for this operation.

| Accepted input label (matched case-insensitively) | SPDX identifier | Canonical license URI | Legal code URI |
| --- | --- | --- | --- |
| `CC-BY-4.0`, `CC BY 4.0`, `Attribution 4.0 International` | `CC-BY-4.0` | `https://creativecommons.org/licenses/by/4.0/` | `https://creativecommons.org/licenses/by/4.0/legalcode` |
| `CC-BY-3.0`, `CC BY 3.0`, `Attribution 3.0 Unported` | `CC-BY-3.0` | `https://creativecommons.org/licenses/by/3.0/` | `https://creativecommons.org/licenses/by/3.0/legalcode` |
| `CC-BY-SA-4.0`, `CC BY-SA 4.0`, `Attribution-ShareAlike 4.0` | `CC-BY-SA-4.0` | `https://creativecommons.org/licenses/by-sa/4.0/` | `https://creativecommons.org/licenses/by-sa/4.0/legalcode` |
| `CC0-1.0`, `CC0`, `Public Domain Dedication` | `CC0-1.0` | `https://creativecommons.org/publicdomain/zero/1.0/` | `https://creativecommons.org/publicdomain/zero/1.0/legalcode` |

Ambiguous strings — "Open Access", "Permissive", "Free to use", "CC-BY" with no version — are not rows in this table and must be rejected at the pipeline ingress, not guessed. A missing version is the single most common cause of a deposit publishing under the wrong CC-BY generation.
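
Rejecting a label is more useful when the submitter is told why it failed. The sketch below is a hypothetical helper (`diagnose_label`, its regular expressions, and its return codes are illustrative assumptions, not part of the pipeline) meant to run only on labels the resolution table has already rejected. It separates a CC-BY label that lacks a version from a string that names no recognisable license family, and it never guesses a version.

```python
import re

# Illustrative patterns (assumptions): a CC-BY family marker and an "N.N" version.
_CC_BY_FAMILY = re.compile(r"\bcc[ -]?by\b|\battribution\b", re.IGNORECASE)
_VERSION = re.compile(r"\b\d+\.\d+\b")


def diagnose_label(label: str) -> str:
    """Classify a label that the resolution table has ALREADY rejected.

    Returns a short reason code for the rejection message; it does not
    resolve anything and never infers a missing version.
    """
    text = " ".join(label.split())  # collapse runs of whitespace
    if _CC_BY_FAMILY.search(text) is None:
        return "unknown-family"      # e.g. "Open Access", "Permissive"
    if _VERSION.search(text) is None:
        return "missing-version"     # e.g. "CC-BY", "Creative Commons Attribution"
    return "unlisted-version"        # versioned, but not a row in the table


for rejected in ("CC-BY", "Open Access", "CC BY 2.5"):
    print(f"{rejected!r}: {diagnose_label(rejected)}")
```

The reason code can be attached to the rejection so that a submitter who sent "CC-BY" is asked for a version instead of receiving a generic error.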

DataCite rightsList field population

Once a label resolves, the canonical record populates one DataCite Metadata Schema (version 4.5+) rightsList entry field-for-field. The full internal-record-to-DataCite translation is covered in Metadata Schema Mapping; the mapping below is the exact shape a CC-BY 4.0 deposit must serialize to.

| DataCite rightsList field | Value for CC-BY 4.0 | Source |
| --- | --- | --- |
| `rights` | `Creative Commons Attribution 4.0 International` | human-readable name |
| `rightsUri` | `https://creativecommons.org/licenses/by/4.0/legalcode` | legal code URI |
| `rightsIdentifier` | `CC-BY-4.0` | SPDX identifier |
| `rightsIdentifierScheme` | `SPDX` | fixed |
| `schemeUri` | `https://spdx.org/licenses/` | fixed |
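
Serialized, the mapping above corresponds to a one-element rightsList. The sketch below builds that entry as a plain dictionary and prints it as JSON so the shape can be compared against what a pipeline emits. The variable names are illustrative, and the rest of the payload a given repository expects around rightsList is not shown.

```python
import json

# Field names and values are copied from the mapping table above.
cc_by_4_rights = {
    "rights": "Creative Commons Attribution 4.0 International",
    "rightsUri": "https://creativecommons.org/licenses/by/4.0/legalcode",
    "rightsIdentifier": "CC-BY-4.0",
    "rightsIdentifierScheme": "SPDX",
    "schemeUri": "https://spdx.org/licenses/",
}

# rightsList is a list, so a single license is still wrapped in an array.
rights_list_json = json.dumps({"rightsList": [cc_by_4_rights]}, indent=2)
print(rights_list_json)
```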

Production Implementation

The resolver is a frozen lookup, and validation runs through a Pydantic V2 model so a malformed license fails loudly at the edge instead of deep inside the deposit dispatch loop — the same Pydantic schema validation discipline the ingestion layer applies to dataset metadata. The model rejects any URI that is not one of the canonical CC-BY forms, so a coerced or partially-filled record can never serialize.

python

from __future__ import annotations

import logging
from typing import Final

from pydantic import BaseModel, Field, field_validator

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

# Closed vocabulary: every accepted label -> a single canonical record.
# Keys are lowercased at lookup time; SPDX identifiers themselves are NOT lowercased.
_LICENSE_TABLE: Final[dict[str, dict[str, str]]] = {
    "cc-by-4.0": {
        "spdx": "CC-BY-4.0",
        "name": "Creative Commons Attribution 4.0 International",
        "uri": "https://creativecommons.org/licenses/by/4.0/",
        "legal_code": "https://creativecommons.org/licenses/by/4.0/legalcode",
    },
    "cc-by-3.0": {
        "spdx": "CC-BY-3.0",
        "name": "Creative Commons Attribution 3.0 Unported",
        "uri": "https://creativecommons.org/licenses/by/3.0/",
        "legal_code": "https://creativecommons.org/licenses/by/3.0/legalcode",
    },
    "cc-by-sa-4.0": {
        "spdx": "CC-BY-SA-4.0",
        "name": "Creative Commons Attribution-ShareAlike 4.0 International",
        "uri": "https://creativecommons.org/licenses/by-sa/4.0/",
        "legal_code": "https://creativecommons.org/licenses/by-sa/4.0/legalcode",
    },
    "cc0-1.0": {
        "spdx": "CC0-1.0",
        "name": "Creative Commons Zero v1.0 Universal",
        "uri": "https://creativecommons.org/publicdomain/zero/1.0/",
        "legal_code": "https://creativecommons.org/publicdomain/zero/1.0/legalcode",
    },
}

# Common label aliases that must collapse onto a canonical table key.
_ALIASES: Final[dict[str, str]] = {
    "cc by 4.0": "cc-by-4.0",
    "attribution 4.0 international": "cc-by-4.0",
    "cc by 3.0": "cc-by-3.0",
    "attribution 3.0 unported": "cc-by-3.0",
    "cc by-sa 4.0": "cc-by-sa-4.0",
    "attribution-sharealike 4.0": "cc-by-sa-4.0",
    "cc0": "cc0-1.0",
    "public domain dedication": "cc0-1.0",
}

_CANONICAL_URIS: Final[frozenset[str]] = frozenset(r["uri"] for r in _LICENSE_TABLE.values())


class LicenseResolutionError(ValueError):
    """Raised when an input label cannot be resolved to a canonical license record."""


def resolve_license(label: str) -> dict[str, str]:
    """Resolve a free-text label to its canonical record, or reject it.

    Matching is case-insensitive on the label only; the returned SPDX
    identifier keeps its canonical case (e.g. "CC-BY-4.0").
    """
    key = " ".join(label.strip().lower().split())  # normalise whitespace + case
    key = _ALIASES.get(key, key)
    record = _LICENSE_TABLE.get(key)
    if record is None:
        raise LicenseResolutionError(
            f"Unresolvable license label {label!r}. "
            f"Supply an exact CC-BY version (e.g. 'CC-BY-4.0'); ambiguous strings are rejected."
        )
    return dict(record)  # return a copy so a caller cannot mutate the shared table


class RightsEntry(BaseModel):
    """A single DataCite 4.5+ rightsList entry, validated before serialization."""

    rights: str
    rightsUri: str
    rightsIdentifier: str
    rightsIdentifierScheme: str = Field(default="SPDX")
    schemeUri: str = Field(default="https://spdx.org/licenses/")

    @field_validator("rightsUri")
    @classmethod
    def uri_must_be_canonical(cls, v: str) -> str:
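        # Accept only an exact canonical URI or its exact "/legalcode" form.
        # Every canonical URI ends with "/", so the legal-code URI is uri + "legalcode".
        # A near-miss such as a missing trailing slash is rejected, not repaired.
        allowed = _CANONICAL_URIS | {uri + "legalcode" for uri in _CANONICAL_URIS}
        if v not in allowed:
            raise ValueError(f"Non-canonical license URI: {v!r}")
        return v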


def build_rights_entry(label: str) -> RightsEntry:
    """Resolve a label and produce a validated DataCite rightsList entry."""
    record = resolve_license(label)
    entry = RightsEntry(
        rights=record["name"],
        rightsUri=record["legal_code"],
        rightsIdentifier=record["spdx"],
    )
    logging.info("Resolved %r -> %s", label, entry.rightsIdentifier)
    return entry


def inject_license(metadata: dict[str, object], label: str) -> dict[str, object]:
    """Attach a validated rightsList to a DataCite metadata payload."""
    entry = build_rights_entry(label)
    metadata["rightsList"] = [entry.model_dump()]
    return metadata
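
A note on what "frozen" means here: `Final` only tells a type checker that a name must not be rebound, and the dictionaries themselves remain mutable at runtime. If the table should be read-only in practice, one option is to wrap it in `types.MappingProxyType`. The sketch below uses a one-row sample and the hypothetical names `_RAW` and `FROZEN_TABLE` to show the effect; it is an illustration, not part of the implementation above.

```python
from types import MappingProxyType
from typing import Mapping

# One-row sample standing in for the full resolution table (assumption).
_RAW = {
    "cc-by-4.0": {
        "spdx": "CC-BY-4.0",
        "uri": "https://creativecommons.org/licenses/by/4.0/",
    },
}

# Wrap both levels: the outer proxy blocks adding or replacing rows, and each
# inner proxy blocks editing a field of an existing record.
FROZEN_TABLE: Mapping[str, Mapping[str, str]] = MappingProxyType(
    {key: MappingProxyType(dict(record)) for key, record in _RAW.items()}
)

try:
    FROZEN_TABLE["cc-by-4.0"]["spdx"] = "cc-by-4.0"  # type: ignore[index]
except TypeError as exc:
    print(f"write rejected: {exc}")
```

Reads behave exactly like dictionary reads, so lookup code does not need to change; only accidental writes start failing.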

Runtime Drift and Fallback Handling

Serialization failures are compounded over time by metadata drift. As institutional policy or funder terms evolve, pipelines that read a license string from a cached deployment manifest keep publishing under a stale version — a fault that rarely raises an API error and instead surfaces during a compliance audit. Decouple resolution from static config by querying a policy registry at runtime with a short cache TTL, and on registry outage fall back to a pinned, known-good CC-BY 4.0 record rather than skipping the license entirely. The dead-letter and replay discipline that catches the failures this fallback cannot resolve is the same one described in API Routing & Fallbacks, and the batch orchestration around it is covered in async batch uploads for large datasets.

python

import time
from collections.abc import Callable
from typing import Final

# Immutable, pinned fallback used only when the policy registry is unreachable.
_FALLBACK_LABEL: Final[str] = "CC-BY-4.0"
_CACHE_TTL_SECONDS: Final[int] = 900  # 15 minutes

# dataset_id -> (monotonic timestamp of the lookup, validated label)
_cache: dict[str, tuple[float, str]] = {}


def resolve_policy_label(dataset_id: str, registry_lookup: Callable[[str], str]) -> str:
    """Return the governing license label for a dataset, with cache-aside + fallback.

    `registry_lookup(dataset_id) -> str` calls the central policy registry.
    On any failure the pinned fallback is used and the event is logged for audit.
    """
    now = time.monotonic()
    cached = _cache.get(dataset_id)
    if cached and (now - cached[0]) < _CACHE_TTL_SECONDS:
        return cached[1]

    try:
        label = registry_lookup(dataset_id)
        resolve_license(label)  # validate before caching; reject drifted junk
        _cache[dataset_id] = (now, label)
        return label
    except Exception as exc:  # registry down OR returned an unresolvable label
        logging.warning(
            "Policy registry lookup failed for %s (%s); using pinned fallback %s",
            dataset_id, exc, _FALLBACK_LABEL,
        )
        return _FALLBACK_LABEL
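
The module-level cache above is awkward to test because expiry depends on the real clock. A minimal sketch of one alternative follows, assuming a hypothetical `TTLCache` class whose clock is injectable; the class and its method names are illustrative and not part of the pipeline.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Tiny TTL cache; `clock` is injectable so tests can control time."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: drop it so the caller re-queries
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock(), value)


# Usage with a fake clock: the entry is served until the TTL has elapsed.
fake_now = [0.0]
cache = TTLCache(ttl_seconds=900, clock=lambda: fake_now[0])
cache.put("dataset-001", "CC-BY-4.0")
fake_now[0] = 899.0
print(cache.get("dataset-001"))
fake_now[0] = 900.0
print(cache.get("dataset-001"))
```

Passing the clock in keeps the production default (`time.monotonic`) while letting a test step time forward without sleeping.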

Verification

Assert the gate without touching a live repository by exercising the resolver and validator directly. The tests confirm that a well-formed label injects a canonical rightsList, that an unversioned "CC-BY" string is rejected, and that a hand-forged non-canonical URI cannot pass the Pydantic model.

python

import pytest
from pydantic import ValidationError

# inject_license, resolve_license, LicenseResolutionError and RightsEntry are
# imported from the module that holds the implementation above.


def test_canonical_label_injects_rightslist() -> None:
    meta = inject_license({"titles": [{"title": "Reef Temperature Series"}]}, "CC BY 4.0")
    entry = meta["rightsList"][0]
    assert entry["rightsIdentifier"] == "CC-BY-4.0"
    assert entry["rightsUri"].endswith("/by/4.0/legalcode")


def test_unversioned_label_is_rejected() -> None:
    with pytest.raises(LicenseResolutionError):
        resolve_license("CC-BY")


def test_forged_uri_fails_validation() -> None:
    # Pydantic V2 wraps the validator's ValueError in a ValidationError, so
    # assert on that type instead of the overly broad Exception.
    with pytest.raises(ValidationError):
        RightsEntry(
            rights="Creative Commons Attribution 4.0 International",
            rightsUri="https://example.org/licenses/by/4.0/",
            rightsIdentifier="CC-BY-4.0",
        )

Run it with pytest -q test_ccby_config.py. A passing run prints 3 passed; a rejected label emits LicenseResolutionError: Unresolvable license label 'CC-BY', the same structured signal the dead-letter path and compliance dashboard consume.
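
Beyond the behavioural tests, the table itself can be linted so that a typo in a newly added row is caught before it ships. The sketch below is a hypothetical `check_table` helper that takes a table and an alias map as arguments and returns a list of problems. The rules it enforces are assumptions drawn from the conventions on this page (keys are lowercased SPDX identifiers, the legal code URI is the canonical URI plus `legalcode`, aliases are pre-normalised), and the sample data is a one-row stand-in for the full table.

```python
def check_table(
    table: dict[str, dict[str, str]],
    aliases: dict[str, str],
) -> list[str]:
    """Return human-readable problems; an empty list means the data is consistent."""
    problems: list[str] = []
    for key, record in table.items():
        if key != record["spdx"].lower():
            problems.append(f"{key!r}: key is not the lowercased SPDX identifier")
        if not record["uri"].endswith("/"):
            problems.append(f"{key!r}: canonical URI lacks a trailing slash")
        if record["legal_code"] != record["uri"] + "legalcode":
            problems.append(f"{key!r}: legal code URI does not extend the canonical URI")
    for alias, target in aliases.items():
        # Lookups lowercase and collapse whitespace, so an alias stored in any
        # other form can never be matched.
        if alias != " ".join(alias.lower().split()):
            problems.append(f"{alias!r}: alias is not normalised")
        if target not in table:
            problems.append(f"{alias!r}: alias points at unknown key {target!r}")
    return problems


sample_table = {
    "cc-by-4.0": {
        "spdx": "CC-BY-4.0",
        "uri": "https://creativecommons.org/licenses/by/4.0/",
        "legal_code": "https://creativecommons.org/licenses/by/4.0/legalcode",
    },
}
sample_aliases = {"cc by 4.0": "cc-by-4.0", "CC BY": "cc-by"}

for problem in check_table(sample_table, sample_aliases):
    print(problem)
```

Run against the sample, the second alias is reported twice: once for not being normalised and once for pointing at a key that has no row.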

Gotchas

"CC-BY" with no version is not a license. A bare label passes a naive substring check but silently resolves to whichever generation your defaulting code assumes. Fix: keep the resolution table version-explicit and reject any label not present as a row.
SPDX identifiers have one canonical case; the input label does not. Lowercasing the whole record turns CC-BY-4.0 into cc-by-4.0, which no longer matches the SPDX License List entry exactly, so any consumer that compares identifiers as plain strings fails downstream. Fix: normalise case only on the lookup key, and emit the SPDX identifier exactly as stored in the table.
rightsUri must point at a canonical CC URL, not a local mirror. Repositories that rewrite the URI to an institutional proxy break machine-readable license detection by aggregators. Fix: validate rightsUri against the frozen canonical set (as uri_must_be_canonical does) before serialization.

Related Guides

Open License Configuration: the parent guide defining the SPDX registry and resolution rules this CC-BY example specializes.
Aligning NIH Data Sharing Policies with FAIR Principles: where the same license gate is one control in a funder-mandate compliance pipeline.
Metadata Schema Mapping: the full field-by-field crosswalk that carries this rightsList into a complete DataCite record.

See the Open Science Infrastructure Planning overview for how license configuration fits the wider governance and deposit pipeline.
