How to Map Dublin Core to schema.org for Research Data

Translating the Dublin Core Metadata Element Set (Dublin Core) into schema.org Dataset JSON-LD requires deterministic field mapping, strict type coercion, and an explicit context declaration. This is the exact operation the enrichment stage performs when it turns a flat, legacy metadata record into machine-readable structured data that Google Dataset Search, institutional harvesters, and cross-repository indexers can consume. This page is the field-level companion to the FAIR Principle Breakdown — it implements the Interoperability crosswalk (sub-principles I1–I3) that the parent guide describes at the architectural level. It assumes you can read Python 3.10+ and already have Dublin Core records in hand, whether harvested over OAI-PMH or emitted by the automated Dublin Core enrichment from raw CSV pipeline.

Direct one-to-one translation fails because the schema.org vocabulary enforces strict typing, nested object hierarchies, and URI normalization that Dublin Core does not. The mapping layer must therefore act as a stateless transformer: it accepts a Dublin Core dictionary, applies coercion rules in a fixed order, and emits validated JSON-LD with no side effects. That determinism is what makes the output reproducible across harvester polls and safe to cache.

Stateless transformation pipeline: a Dublin Core record is parsed, coerced, run through the field crosswalk, and wrapped in @context and @type Dataset before a validity gate either emits schema.org JSON-LD or routes rejected values into _validation_warnings while still returning the partial record.

Deterministic Field Crosswalk

The crosswalk is the core reference artifact for this operation. Apply the rules in the order shown: each row preserves the semantic intent of the Dublin Core element while satisfying the target Dataset type’s constraints. Structural adaptation — wrapping a bare string in a Person, Organization, or DefinedTerm object — is mandatory wherever schema.org expects a typed node rather than a literal.

Dublin Core Element	schema.org Property	Transformation Rule
`dc:title`	`name`	String literal. Strip trailing punctuation. Reject empty values.
`dc:creator` / `dc:contributor`	`author` / `contributor`	Wrap each value in a `Person` (or `Organization`) object. Map `dc:creator` to `author`. Use an array when multiple.
`dc:date` / `dc:dateSubmitted`	`datePublished` / `dateCreated`	Normalize to ISO 8601 `YYYY-MM-DD`. Reject partial dates. Drop the time of day and timezone offset after parsing.
`dc:description`	`description`	String. Preserve paragraph breaks as `\n\n`. Truncate to 5000 chars for harvester limits.
`dc:identifier`	`@id` / `url` / `identifier`	DOI or Handle → `@id`. HTTP(S) URL → `url`. Local ID → `PropertyValue` with `propertyID`.
`dc:subject` / `dc:coverage`	`keywords` / `spatialCoverage`	Array of strings; deduplicate. Map to `DefinedTerm` when a vocabulary URI is supplied.
`dc:rights` / `dc:license`	`license`	Resolve to an SPDX License List URL or a Creative Commons URI. Reject free-text licenses.
`dc:type`	`@type`	Map to `Dataset` explicitly. Fall back to `CreativeWork` only when the Dublin Core type is ambiguous.
`dc:publisher`	`publisher`	`Organization` object with `name` and optional `url`.
`dc:relation`	`isPartOf` / `citation`	Resolve to a `URL` or a `CreativeWork` reference.

Two rules are non-negotiable: the root object must declare "@context": "https://schema.org" and "@type": "Dataset". Without @context, a JSON-LD processor cannot expand the keys to schema.org terms; without @type, harvesters and search engines have no way to recognize the node as a dataset. In both cases the record is typically dropped without an error. For the broader mapping discipline that governs this table (controlled-vocabulary alignment, cardinality, and cross-standard equivalence), see the Metadata Schema Mapping reference.
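The root-object rule can be checked mechanically before a record leaves the pipeline. The following is a minimal sketch, not part of the transformer itself: the helper name `missing_root_keys` is hypothetical, and the strict equality check assumes the context is always emitted as the plain string shown above.

```python
from typing import Any

# Assumption: the context is always the plain string form used in this guide.
REQUIRED_ROOT = {"@context": "https://schema.org", "@type": "Dataset"}


def missing_root_keys(doc: dict[str, Any]) -> list[str]:
    """Return the root keys that are absent or carry an unexpected value."""
    return [key for key, expected in REQUIRED_ROOT.items()
            if doc.get(key) != expected]


ok = {"@context": "https://schema.org", "@type": "Dataset", "name": "Example"}
broken = {"@type": "Dataset", "name": "Example"}

print(missing_root_keys(ok))      # []
print(missing_root_keys(broken))  # ['@context']
```

An empty list means both declarations are present; anything else names the key to repair before publishing.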

Production Python Implementation

The transformer below uses only the standard library, so it runs unchanged in restricted institutional environments. It implements the core rows of the crosswalk above (dc:contributor, dc:coverage, and dc:type are left unmapped, and only bare DOIs are promoted to @id), coerces types explicitly, tolerates missing fields, and collects every rejected value in a _validation_warnings array rather than raising, so a partial record still yields usable output plus a diagnosis of what failed.

python

import json
import re
from datetime import datetime
from typing import Any
from urllib.parse import urlparse

SCHEMA_CONTEXT = "https://schema.org"
SPDX_BASE = "https://spdx.org/licenses/"
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)


class DCtoSchemaOrgMapper:
    """Deterministic transformer for Dublin Core to schema.org JSON-LD."""

    def __init__(self) -> None:
        self._errors: list[str] = []

    def _normalize_date(self, raw: str | None) -> str | None:
        if not raw:
            return None
        try:
            cleaned = raw.replace("Z", "+00:00").strip()
            dt = datetime.fromisoformat(cleaned)  # rejects partial dates
            return dt.strftime("%Y-%m-%d")
        except ValueError:
            self._errors.append(f"Invalid date format: {raw}")
            return None

    def _resolve_license(self, raw: str | None) -> str | None:
        if not raw:
            return None
        if raw.startswith(("http://", "https://")):
            return raw
        # Map common free-text tokens to Creative Commons / SPDX URIs
        mapping = {
            "cc0": "https://creativecommons.org/publicdomain/zero/1.0/",
            "cc-by": "https://creativecommons.org/licenses/by/4.0/",
            "cc-by-sa": "https://creativecommons.org/licenses/by-sa/4.0/",
            "mit": f"{SPDX_BASE}MIT",
            "apache-2.0": f"{SPDX_BASE}Apache-2.0",
        }
        resolved = mapping.get(raw.lower().strip())
        if resolved is None:
            self._errors.append(f"Unresolvable license: {raw}")
        return resolved

    def _parse_identifier(self, raw: str | None) -> dict[str, Any]:
        if not raw:
            return {}
        if DOI_PATTERN.match(raw):
            return {"@id": f"https://doi.org/{raw}"}
        if urlparse(raw).scheme in ("http", "https"):
            return {"url": raw}
        # Local/opaque identifier: keep it typed, not a bare string
        return {"identifier": {"@type": "PropertyValue",
                               "propertyID": "local", "value": raw}}

    def _build_agent(self, name: str, affiliation: str | None = None) -> dict[str, Any]:
        if not name.strip():
            return {}
        agent: dict[str, Any] = {"@type": "Person", "name": name.strip()}
        if affiliation:
            agent["affiliation"] = {"@type": "Organization", "name": affiliation.strip()}
        return agent

    def transform(self, dc_record: dict[str, Any]) -> dict[str, Any]:
        self._errors.clear()
        output: dict[str, Any] = {"@context": SCHEMA_CONTEXT, "@type": "Dataset"}

        # "or" guards against keys that are present but explicitly None
        title = (dc_record.get("dc:title") or "").strip().rstrip(".,;:")
        if title:
            output["name"] = title

        pub_date = self._normalize_date(
            dc_record.get("dc:date") or dc_record.get("dc:dateSubmitted"))
        if pub_date:
            output["datePublished"] = pub_date

        desc = (dc_record.get("dc:description") or "").strip()
        if desc:  # normalize line endings; \n\n paragraph breaks survive
            desc = desc.replace("\r\n", "\n").replace("\r", "\n")
            output["description"] = desc[:5000]  # crosswalk cap for harvesters

        creators = dc_record.get("dc:creator") or []  # tolerate explicit None
        if isinstance(creators, str):
            creators = [creators]
        agents = [a for c in creators if (a := self._build_agent(c))]
        if agents:
            output["author"] = agents

        pub = (dc_record.get("dc:publisher") or "").strip()
        if pub:
            output["publisher"] = {"@type": "Organization", "name": pub}

        lic = self._resolve_license(
            dc_record.get("dc:rights") or dc_record.get("dc:license"))
        if lic:
            output["license"] = lic

        output.update(self._parse_identifier(dc_record.get("dc:identifier", "")))

        subjects = dc_record.get("dc:subject") or []
        if isinstance(subjects, str):
            subjects = subjects.split(";")
        # Strip before deduplicating so " sea ice" and "sea ice" collapse
        keywords = list(dict.fromkeys(s.strip() for s in subjects if s.strip()))
        if keywords:
            output["keywords"] = keywords

        rel = dc_record.get("dc:relation", "")
        if rel and urlparse(rel).scheme in ("http", "https"):
            output["citation"] = {"@type": "CreativeWork", "url": rel}
        elif rel:
            output["citation"] = rel

        if self._errors:
            # Copy: the next transform() call clears self._errors in place
            output["_validation_warnings"] = list(self._errors)
        return output

    def to_jsonld(self, dc_record: dict[str, Any]) -> str:
        return json.dumps(self.transform(dc_record), indent=2,
                          ensure_ascii=False, sort_keys=True)

When the source records arrive as typed Python objects rather than loose dictionaries, front the transformer with Pydantic schema validation so structural errors are caught before coercion even begins.

Verification

Prove the crosswalk with a minimal pytest case that exercises the happy path and the two failure modes the transformer is designed to survive — an unparseable date and an unresolvable license both land in _validation_warnings without aborting the run.

python

from dc_mapper import DCtoSchemaOrgMapper  # module holding the class above


def test_transform_maps_core_fields_and_collects_warnings() -> None:
    record = {
        "dc:title": "Arctic Sea Ice Thickness 2019-2023.",
        "dc:creator": ["Ada Lovelace", "Grace Hopper"],
        "dc:date": "2024-03-01",
        "dc:identifier": "10.5281/zenodo.1234567",
        "dc:rights": "cc-by",
        "dc:subject": "sea ice; cryosphere; sea ice",  # note the duplicate
        "dc:date_bad": "spring 2024",                    # unmapped key: the transformer ignores it
        "dc:license": None,
    }
    out = DCtoSchemaOrgMapper().transform(record)

    assert out["@context"] == "https://schema.org"
    assert out["@type"] == "Dataset"
    assert out["name"] == "Arctic Sea Ice Thickness 2019-2023"   # period stripped
    assert out["@id"] == "https://doi.org/10.5281/zenodo.1234567"
    assert out["license"] == "https://creativecommons.org/licenses/by/4.0/"
    assert out["keywords"] == ["sea ice", "cryosphere"]           # deduplicated
    assert [a["name"] for a in out["author"]] == ["Ada Lovelace", "Grace Hopper"]
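    assert "_validation_warnings" not in out                      # nothing failed

    # Failure modes: a bad date and a free-text license are reported, not raised
    bad = DCtoSchemaOrgMapper().transform(
        {"dc:title": "Broken record", "dc:date": "spring 2024", "dc:rights": "open access"})
    assert "datePublished" not in bad and "license" not in bad
    assert len(bad["_validation_warnings"]) == 2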

Run it directly with pytest -q test_dc_mapper.py; a green result confirms punctuation stripping, DOI-to-@id resolution, license normalization, and keyword deduplication all fire in one pass. To serve the output, place the transformer behind a stateless endpoint that answers Accept: application/ld+json, cache responses with Cache-Control: public, max-age=86400, and route registry-dependent lookups through the resilient control plane described in API routing & fallbacks. Before any record is published, scrub restricted author attribution and validate every outbound URI per the Security & Access Control rules.
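The serving rules above can be sketched as a pure function that a web framework handler would call. This is an illustration under stated assumptions, not a production endpoint: the name `build_response` is hypothetical, the Accept parsing is deliberately naive (it ignores q-values), and a missing Accept header is treated as accepting anything.

```python
import json
from typing import Any

JSONLD = "application/ld+json"


def build_response(accept: str, doc: dict[str, Any]) -> tuple[int, dict[str, str], bytes]:
    """Return (status, headers, body) for one stateless JSON-LD request."""
    # Naive parsing: strip parameters such as ";q=0.8" and ignore their ordering.
    offered = {part.split(";")[0].strip().lower()
               for part in (accept or "*/*").split(",")}
    if not offered & {JSONLD, "application/*", "*/*"}:
        return 406, {"Content-Type": "text/plain; charset=utf-8"}, b"Not Acceptable"
    body = json.dumps(doc, ensure_ascii=False, sort_keys=True).encode("utf-8")
    headers = {
        "Content-Type": JSONLD,
        "Cache-Control": "public, max-age=86400",  # one day, as described above
    }
    return 200, headers, body


status, headers, body = build_response(
    "application/ld+json", {"@context": "https://schema.org", "@type": "Dataset"})
print(status, headers["Cache-Control"])  # 200 public, max-age=86400
```

Because the function holds no state and the JSON is serialized with sorted keys, identical records produce byte-identical bodies, which is what makes the day-long cache lifetime safe.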

Gotchas

Bare strings where schema.org wants a node. Emitting "author": "Ada Lovelace" validates loosely but loses the Person/Organization distinction and blocks ORCID/ROR enrichment downstream. Fix: always wrap agents, publishers, and identifiers in a typed object with an explicit @type.
Silent partial-date acceptance. datetime.fromisoformat("2024") raises, but strptime with a loose format would coerce 2024 to 2024-01-01 and fabricate a day. Fix: keep fromisoformat, which rejects partial dates, and route the failure to _validation_warnings instead of guessing.
Free-text licenses that pass ingestion but fail Dataset Search. A dc:rights value of “open access” is not a machine-actionable license. Fix: reject anything that does not resolve to an SPDX License List URL or a Creative Commons URI, exactly as _resolve_license does, and quarantine the record for manual license assignment.

Related

FAIR Principle Breakdown: the parent guide; this crosswalk is its Interoperability (I1–I3) checkpoint.
Metadata Schema Mapping: the deterministic mapping layer this table plugs into.
Validating metadata against FAIR criteria automatically: asserts that the JSON-LD emitted here satisfies each sub-principle.
Core Architecture & FAIR Mapping: the full pipeline topology this transformer sits inside.
