Choosing the Right Repository for Grant-Funded Projects

Choosing a repository for a grant-funded dataset is a routing decision, not a storage-shopping exercise: the same five inputs — funder mandate, discipline, largest file size, access sensitivity, and available budget — deterministically pick the correct archive, and the only way to make that decision auditable is to encode it. This page is the routing companion to the Institutional Repository Strategy deposit pipeline: it implements the Step 3 branch where classification tags decide whether a record stays in the institutional archive or is redirected to a certified disciplinary one, while institutional metadata stays in sync. It assumes you can read Python 3.10+, and that you already hold each dataset’s grant identifier, primary discipline, and a rough size and sensitivity profile before deposit.

A hand-made choice fails at scale because the constraints interact. A 200 GB imaging dataset rules out platforms with a 50 GB per-record ceiling regardless of discipline; an NIH-generated sequence dataset is expected in a designated archive no matter how convenient a generalist repository would be; a dataset carrying human-subjects data cannot land in an open, world-readable archive at all. The selector below resolves those interactions with an ordered chain of hard rules in which the first matching rule wins, so every dataset resolves to exactly one repository plus a reproducible reason string.

Deterministic Selection Rules

The selector applies the criteria in the fixed order below. Rules are evaluated top to bottom, and the first rule whose condition holds decides the archive; later rules are never consulted. This ordering encodes policy precedence: sensitivity outranks an explicit funder mandate, which outranks discipline, which outranks capacity, which outranks cost, so a cheaper option can never override a compliance obligation.

Order	Criterion	Rule	Resolves to
1	Access sensitivity	Dataset carries human-subjects, PII, or export-controlled data	Controlled-access gateway (dbGaP / institutional restricted repository); never an open archive
2	Funder mandate	Grant’s funder designates a named archive for this data type	The designated archive (e.g. NIH sequence data → GEO/SRA)
3	Discipline standard	A certified disciplinary repository is the community norm	Domain repository (e.g. Dryad, PANGAEA, ProteomeXchange), with institutional metadata mirrored
4	File capacity	Largest single file exceeds the generalist per-record ceiling (50 GB)	Institutional Dataverse (supports large files + catalog sync)
5	Budget	Curation budget exists and managed onboarding is required	Figshare (managed, human-curated)
6	Default	None of the above constrain the choice	Zenodo (free, DOI-minting, CERN-backed)

The capability table below fixes the exact platform facts the rules depend on. Every value is a hard specification the selector reads — not a preference — so a change in a platform’s ceiling or API is a one-line data edit, not a logic rewrite.

Repository	Per-record size ceiling	DOI minting	Automated deposit API	Cost model	Typical fit
Zenodo	50 GB / record	DataCite DOI, automatic	REST + OAI-PMH	Free	Generalist default, software + data
Figshare	20 GB / file (higher on plan)	DataCite DOI, automatic	REST (`/account/articles`)	Institutional subscription	Managed curation, public engagement
Dryad	~300 GB / submission	DataCite DOI, automatic	REST + Dryad API	Per-deposit fee (often waived)	Data underlying publications
Dataverse (institutional)	Configurable (TB-scale)	DataCite or Handle	Native REST + OAI-PMH	Institution-hosted	Large files, catalog sync, governance
Domain archive (GEO, SRA, PDB, PANGAEA)	Format-specific	Accession + optional DOI	Program-specific submission API	Free	Funder-designated / discipline standard
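
One way to keep those platform facts as data is sketched below. It is a minimal illustration, not part of the selector itself: PlatformCapability, CAPABILITIES, and fits() are hypothetical names, the ceilings are copied from the table, and the Dataverse entry is left without a number because that limit is configured per installation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformCapability:
    """One row of the capability table (hypothetical structure)."""
    name: str
    ceiling_gb: float | None   # None = no fixed ceiling recorded here
    ceiling_scope: str         # "record", "file", or "submission"
    cost_model: str


# Values mirror the table; a platform change is an edit to this mapping only.
CAPABILITIES: dict[str, PlatformCapability] = {
    "zenodo": PlatformCapability("Zenodo", 50.0, "record", "free"),
    "figshare": PlatformCapability("Figshare", 20.0, "file", "institutional subscription"),
    "dryad": PlatformCapability("Dryad", 300.0, "submission", "per-deposit fee"),
    # Dataverse limits are set per installation; None marks "configured locally".
    "dataverse": PlatformCapability("Dataverse (institutional)", None, "record",
                                    "institution-hosted"),
}


def fits(platform: str, largest_file_gb: float) -> bool:
    """Return True if a single file of this size is within the recorded ceiling.

    A file larger than a per-record or per-submission ceiling cannot fit either,
    so comparing the largest file against the ceiling is a conservative check.
    """
    ceiling = CAPABILITIES[platform].ceiling_gb
    return ceiling is None or largest_file_gb <= ceiling
```

With these figures, fits("zenodo", 30.0) holds while fits("figshare", 30.0) does not, a distinction that a single shared 50 GB constant cannot express.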

Whichever archive is selected, the record must still carry the same citation core defined by the DataCite Metadata Schema — identifier, creators, titles, publisher, publication year, resource type, and a rights entry — and a single SPDX License List identifier; the field-level translation and license encoding are owned by Open License Configuration, not by the routing step.
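
As a rough illustration of that invariant, the sketch below checks a record for the citation core before it leaves the routing step. The dictionary keys and the missing_core_fields helper are assumptions chosen for this example; they are not the DataCite schema's own property names, and the real field mapping belongs to Open License Configuration.

```python
# Citation core named in the text; the key spellings are illustrative only.
CITATION_CORE = (
    "identifier", "creators", "titles", "publisher",
    "publication_year", "resource_type", "rights", "license_spdx",
)


def missing_core_fields(record: dict) -> list[str]:
    """Return the core fields that are absent or empty, in a stable order."""
    return [field for field in CITATION_CORE if not record.get(field)]


# Example record with placeholder values (the DOI and names are made up).
example_record = {
    "identifier": "10.1234/example",
    "creators": ["Example, Researcher"],
    "titles": ["Example imaging dataset"],
    "publisher": "Example University",
    "publication_year": 2026,
    "resource_type": "Dataset",
    "rights": "Creative Commons Attribution 4.0 International",
    "license_spdx": "CC-BY-4.0",
}

gaps = missing_core_fields(example_record)
```

An empty list means the record carries the full core; any returned names can be reported back to the depositor before routing proceeds.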

Production Python Implementation

The selector is a stateless, deterministic function: given a dataset profile it returns exactly one RepositoryChoice carrying the repository class, the concrete target archive, and an audit reason string. It raises only on an incoherent profile (for example, a negative file size), and it never silently falls through: the final Zenodo default is explicit. Wire it in front of the deposit pipeline so the routing decision is logged with the record’s correlation ID.

python

from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Repository(str, Enum):
    CONTROLLED_ACCESS = "controlled_access_gateway"
    DOMAIN = "domain_repository"
    DATAVERSE = "institutional_dataverse"
    FIGSHARE = "figshare"
    ZENODO = "zenodo"


# Generalist ceiling in gigabytes, taken from Zenodo's 50 GB per-record limit.
# Figshare's base per-file limit in the capability table is lower (20 GB).
GENERALIST_SIZE_CEILING_GB: float = 50.0

# Funders that designate a named archive for a given data type.
# key = (funder_ror, data_type) -> designated archive label.
FUNDER_DESIGNATED_ARCHIVES: dict[tuple[str, str], str] = {
    ("https://ror.org/01cwqze88", "sequence"): "NCBI GEO / SRA",   # NIH
    ("https://ror.org/01cwqze88", "structure"): "wwPDB",           # NIH
    ("https://ror.org/021nxhr62", "earth_observation"): "PANGAEA", # NSF (example)
}

# Disciplines with a community-certified repository.
DISCIPLINE_ARCHIVES: dict[str, str] = {
    "ecology": "Dryad",
    "earth_science": "PANGAEA",
    "proteomics": "ProteomeXchange",
    "genomics": "NCBI GEO / SRA",
}


@dataclass(frozen=True)
class DatasetProfile:
    """Everything the routing decision needs, known before deposit."""
    dataset_id: str
    funder_ror: str
    data_type: str          # e.g. "sequence", "structure", "tabular"
    discipline: str         # e.g. "genomics", "ecology"
    largest_file_gb: float
    is_restricted: bool     # human-subjects, PII, export-controlled
    has_curation_budget: bool


@dataclass(frozen=True)
class RepositoryChoice:
    repository: Repository
    target_archive: str     # concrete archive name for the operator
    reason: str             # audit string, logged with the correlation ID


def select_repository(profile: DatasetProfile) -> RepositoryChoice:
    """Route a grant-funded dataset to exactly one repository, deterministically."""
    if profile.largest_file_gb < 0:
        raise ValueError(f"largest_file_gb must be non-negative: {profile.largest_file_gb}")

    # Rule 1 — sensitivity is an absolute filter: never an open archive.
    if profile.is_restricted:
        return RepositoryChoice(
            Repository.CONTROLLED_ACCESS,
            "dbGaP / institutional restricted gateway",
            "restricted data: routed to a controlled-access gateway",
        )

    # Rule 2 — an explicit funder mandate outranks discipline and convenience.
    designated = FUNDER_DESIGNATED_ARCHIVES.get((profile.funder_ror, profile.data_type))
    if designated is not None:
        return RepositoryChoice(
            Repository.DOMAIN, designated,
            f"funder mandate: {profile.data_type} data must go to {designated}",
        )

    # Rule 3 — a certified disciplinary repository is the community norm.
    discipline_archive = DISCIPLINE_ARCHIVES.get(profile.discipline)
    if discipline_archive is not None:
        return RepositoryChoice(
            Repository.DOMAIN, discipline_archive,
            f"discipline standard: {profile.discipline} deposits to {discipline_archive}",
        )

    # Rule 4 — capacity forces the institutional large-file platform.
    if profile.largest_file_gb > GENERALIST_SIZE_CEILING_GB:
        return RepositoryChoice(
            Repository.DATAVERSE, "Institutional Dataverse",
            f"file {profile.largest_file_gb} GB exceeds the {GENERALIST_SIZE_CEILING_GB} GB "
            "generalist ceiling",
        )

    # Rule 5 — budgeted projects get managed curation.
    if profile.has_curation_budget:
        return RepositoryChoice(
            Repository.FIGSHARE, "Figshare",
            "curation budget available: managed onboarding via Figshare",
        )

    # Rule 6 — explicit default; free, DOI-minting, no capacity conflict.
    return RepositoryChoice(
        Repository.ZENODO, "Zenodo",
        "no constraint binds: default to free DOI-minting Zenodo",
    )

Once the archive is chosen, the automated deposit itself — batching, the repository REST call, and DOI registration — runs through the Institutional Repository Strategy pipeline, and high-volume grant-deadline uploads are throttled via async batch processing so a bulk submission cannot saturate the target API.
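
The throttling idea can be sketched with a semaphore that bounds how many deposits are in flight at once. This is an illustration under assumptions rather than the pipeline's actual code: deposit_one is a hypothetical stand-in for the repository REST call, and the concurrency limit of 4 is an arbitrary example value.

```python
import asyncio


async def deposit_one(dataset_id: str) -> str:
    """Hypothetical stand-in for the real deposit call; only simulates latency."""
    await asyncio.sleep(0.01)
    return f"deposited:{dataset_id}"


async def deposit_batch(dataset_ids: list[str], max_concurrent: int = 4) -> list[str]:
    """Deposit every dataset while keeping at most max_concurrent calls in flight."""
    gate = asyncio.Semaphore(max_concurrent)

    async def guarded(dataset_id: str) -> str:
        # Each task waits here until one of the max_concurrent slots is free.
        async with gate:
            return await deposit_one(dataset_id)

    # gather returns results in the same order as the input IDs.
    return await asyncio.gather(*(guarded(d) for d in dataset_ids))


results = asyncio.run(deposit_batch([f"DS-{n}" for n in range(10)]))
```

The limit is a parameter so that it can be tuned per target archive instead of being baked into the batching logic.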

Verification

Prove the precedence ordering with a pytest case that exercises each binding rule plus the default. The critical assertions are that sensitivity beats a funder mandate and that a funder mandate beats a mere discipline match — a regression in the ordering is exactly the failure that silently sends restricted data to an open archive.

python

from repository_selector import (  # module holding the code above
    DatasetProfile, Repository, select_repository,
)


def _profile(**overrides) -> DatasetProfile:
    base = dict(
        dataset_id="DS-20260701", funder_ror="https://ror.org/00000000x",
        data_type="tabular", discipline="chemistry",
        largest_file_gb=2.0, is_restricted=False, has_curation_budget=False,
    )
    return DatasetProfile(**{**base, **overrides})


def test_precedence_and_default() -> None:
    # Rule 1 wins even when a funder mandate would otherwise apply.
    restricted = _profile(is_restricted=True, funder_ror="https://ror.org/01cwqze88",
                          data_type="sequence")
    assert select_repository(restricted).repository is Repository.CONTROLLED_ACCESS

    # Rule 2: the NIH mandate wins even though the discipline has its own archive.
    nih = _profile(funder_ror="https://ror.org/01cwqze88", data_type="sequence",
                   discipline="proteomics")
    assert select_repository(nih).target_archive == "NCBI GEO / SRA"

    # Rule 3: discipline standard, no funder mandate.
    eco = _profile(discipline="ecology")
    assert select_repository(eco).target_archive == "Dryad"

    # Rule 4: oversized file forces institutional Dataverse.
    big = _profile(largest_file_gb=120.0)
    assert select_repository(big).repository is Repository.DATAVERSE

    # Rule 5 vs Rule 6: budget flips managed curation on.
    assert select_repository(_profile(has_curation_budget=True)).repository is Repository.FIGSHARE
    assert select_repository(_profile()).repository is Repository.ZENODO


def test_incoherent_profile_raises() -> None:
    import pytest
    with pytest.raises(ValueError, match="non-negative"):
        select_repository(_profile(largest_file_gb=-1.0))

Run it with pytest -q test_repository_selector.py; a green result confirms that each rule fires when it is the first to match, that sensitivity outranks a funder mandate, and that an incoherent profile fails loudly instead of defaulting to Zenodo.
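
Beyond hand-picked cases, the ordering can be checked exhaustively over a small grid of profiles. The sketch below is self-contained and therefore uses route(), a condensed stand-in that mirrors the six rules with plain arguments; in a real test suite the same loop would presumably call select_repository instead. The grid values and the result labels are arbitrary choices for this example.

```python
from itertools import product

CEILING_GB = 50.0
GENERALISTS = {"figshare", "zenodo"}


def route(restricted: bool, mandated: bool, has_discipline_archive: bool,
          largest_file_gb: float, has_budget: bool) -> str:
    """Condensed stand-in for the six ordered rules (first match wins)."""
    if restricted:
        return "controlled_access"
    if mandated:
        return "funder_designated"
    if has_discipline_archive:
        return "discipline"
    if largest_file_gb > CEILING_GB:
        return "dataverse"
    if has_budget:
        return "figshare"
    return "zenodo"


def check_precedence() -> int:
    """Assert ordering invariants on every combination; return how many were checked."""
    flags = (False, True)
    sizes = (2.0, 50.0, 120.0)
    checked = 0
    for restricted, mandated, has_archive, size, budget in product(
        flags, flags, flags, sizes, flags
    ):
        result = route(restricted, mandated, has_archive, size, budget)
        # Restricted data always, and only, reaches the controlled-access gateway.
        assert (result == "controlled_access") == restricted
        # A funder mandate beats everything except sensitivity.
        if mandated and not restricted:
            assert result == "funder_designated"
        # No oversized file ever lands on a generalist platform.
        if result in GENERALISTS:
            assert size <= CEILING_GB
        # Figshare is chosen only when a curation budget exists.
        if result == "figshare":
            assert budget
        checked += 1
    return checked
```

The invariants are stated as properties of the outcome instead of restating the rule chain, so a reordering of the rules would trip at least one assertion.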

Gotchas

Convenience overriding a funder mandate. Teams reach for Zenodo because its API is easiest, even when the grant designates GEO/SRA — a compliance failure that surfaces only at the final report. Fix: keep the funder-mandate filter (Rule 2) ahead of every capacity and cost rule, exactly as the ordering above enforces.
Sensitivity treated as a metadata flag instead of a hard gate. Marking a human-subjects dataset “restricted” in the catalog but still depositing it to an open archive exposes protected data irreversibly. Fix: make is_restricted the first, non-overridable filter and route to a controlled-access gateway; the classification rules that set this flag come from Funder Mandate Alignment.
Stale capacity ceilings. A platform raises its per-record limit and the hard-coded 50 GB constant silently misroutes datasets that would now fit. Fix: keep ceilings in the capability table as data and review them on a schedule, so a platform change is a one-line edit rather than a logic bug.
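
The sensitivity gotcha can also be guarded against at deposit time, as a second check behind the routing rule. The sketch below is an illustration under assumptions: RestrictedDataError, CONTROLLED_TARGET, and assert_safe_target are hypothetical names, and the repository labels reuse the string values of the Repository enum.

```python
CONTROLLED_TARGET = "controlled_access_gateway"


class RestrictedDataError(RuntimeError):
    """Raised when restricted data is about to be sent to a non-controlled archive."""


def assert_safe_target(is_restricted: bool, repository: str) -> None:
    """Refuse any deposit of restricted data outside the controlled-access gateway."""
    if is_restricted and repository != CONTROLLED_TARGET:
        raise RestrictedDataError(
            f"restricted dataset must not be deposited to {repository!r}"
        )


# Unrestricted data may go anywhere; restricted data only to the gateway.
assert_safe_target(False, "zenodo")
assert_safe_target(True, CONTROLLED_TARGET)
```

Calling this immediately before the deposit request means that a routing regression surfaces as an exception instead of an irreversible publication.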

Related

Institutional Repository Strategy: the parent deposit pipeline this routing decision feeds; it runs the actual ingestion, enrichment, and FAIR gate.
Funder Mandate Alignment: the compliance baselines that populate the funder-mandate and sensitivity rules used here.
Open License Configuration: how the SPDX license token every chosen archive requires is selected and encoded.
Open Science Infrastructure Planning: the full infrastructure topology this decision sits inside.
