API Routing & Fallbacks: Resilient Endpoint Selection for FAIR Data Pipelines

Scientific research data pipelines rarely talk to a single, always-on service. A single dataset deposit can fan out to an institutional repository, a DataCite or Crossref registry for identifier minting, a vocabulary resolver for controlled terms, and a funder reporting portal — each with its own API contract, rate limit, authentication model, and maintenance window. When any one of those hops fails silently, the visible symptom surfaces days later as an unresolvable DOI, a half-enriched record, or a funder audit that can’t reconcile a deposit. This guide covers the routing layer that sits between the ingestion boundary and those external systems: how to select an endpoint deterministically, how to cascade to fallbacks without dropping data or weakening compliance, and how to prove the whole chain still enforces FAIR guarantees under degradation. It sits inside the broader Core Architecture & FAIR Mapping pipeline, immediately downstream of validation and immediately upstream of persistence, and it is written for the Python automation engineers and academic IT teams who own that control plane in production.

Unlike a generic network load balancer, a FAIR routing layer cannot treat every request as interchangeable traffic. Each routing decision is also a compliance decision: it must confirm the payload still validates against the target schema, that the chosen endpoint’s compliance tier is adequate for the data’s sensitivity, and that a provenance record captures which path the payload actually took. The sub-pipeline below shows the decision surface — how a dispatched payload is classified, retried, mirrored, cached, or dead-lettered.

Concepts, Standards, and the Compliance Contract

Three properties separate a research-grade routing layer from a naive retry loop. It must be deterministic — given the same routing table and health state, the same payload always takes the same path, so behaviour is reproducible during an incident review. It must be schema-preserving — a fallback endpoint may never accept a payload the primary would have rejected, or the archive fills with records that pass on the mirror but fail on the canonical store. And it must be auditable — every hop emits a provenance record so a curator can later reconstruct exactly why a given dataset landed in the cache instead of the primary repository.

Those guarantees rest on external metadata standards, each cited here by its full name and covered in depth elsewhere in this section. Payloads bound for a repository are validated against the DCAT Application Profile (DCAT-AP) or the DataCite Metadata Schema before dispatch; the field-level crosswalks that produce those payloads live in Metadata Schema Mapping. Semantic serialization uses JSON-LD (JSON for Linking Data), so a routing decision can carry a self-describing, machine-actionable body rather than an opaque blob. Provenance records follow the W3C PROV Data Model, expressing each dispatch as an activity linking a source entity to a delivered entity. The way these standards become enforced checkpoints rather than after-the-fact audits is the subject of the FAIR Principle Breakdown, which maps the Accessible and Reusable principles directly onto the routing and provenance stages described here.

The routing layer evaluates three orthogonal dimensions on every request. Protocol and content negotiation decides whether a REST, GraphQL, OAI-PMH, or SWORD endpoint is invoked, based on payload structure and the metadata enrichment the target expects. Compliance-tier mapping routes datasets needing strict FAIR alignment to endpoints with hard JSON-LD or Schema.org validation, while legacy or provisional records traverse relaxed transformation gateways. Health and latency state determines whether the request proceeds to the primary endpoint, a regional mirror, or a read-only compliance cache. Every decision emits structured telemetry — endpoint latency, HTTP status, and schema-validation outcome — before the payload is allowed to advance.
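
One way the tier and health dimensions might combine into a single deterministic choice is sketched below. The `Route` dataclass, the `TIER_RANK` ordering, and the `healthy` set are illustrative assumptions rather than part of the pipeline's actual API; protocol negotiation is left out for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set

# Hypothetical ordering: a higher rank means stricter validation.
TIER_RANK = {"relaxed": 0, "strict": 1}


@dataclass(frozen=True)
class Route:
    id: str
    priority: int
    compliance_tier: str


def select_route(
    routes: Iterable[Route], required_tier: str, healthy: Set[str]
) -> Optional[Route]:
    """Return the best healthy route whose tier is adequate, else None."""
    candidates = [
        r
        for r in routes
        if r.id in healthy
        and TIER_RANK[r.compliance_tier] >= TIER_RANK[required_tier]
    ]
    # Sorting on (priority, id) breaks ties the same way on every run,
    # which keeps the decision reproducible for incident review.
    return min(candidates, key=lambda r: (r.priority, r.id), default=None)


routes = [
    Route("primary_repo", 1, "strict"),
    Route("secondary_mirror", 2, "strict"),
    Route("local_cache", 3, "relaxed"),
]
# Primary is down: a strict payload goes to the mirror, never the relaxed cache.
choice = select_route(routes, "strict", healthy={"secondary_mirror", "local_cache"})
print(choice.id if choice else None)  # secondary_mirror
```

Returning None when no adequate route is healthy, instead of quietly picking a weaker tier, keeps the compliance decision explicit for the caller.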

Step-by-Step Implementation

The following four steps build a production dispatcher: a declarative table, a typed loader, a schema-gated fallback dispatcher, and a deterministic retry boundary. Each code block uses Python 3.10+ with full type hints and the Pydantic V2 API.

Step 1 — Declare the routing table as version-controlled configuration

Endpoints, priorities, and fallback links belong in declarative configuration, never in hardcoded URLs or DNS tricks, so that a change to the fallback order is a reviewable diff rather than a redeploy. Each route names its successor through fallback_id, forming an explicit chain terminating in a null link. The compliance rationale: expressing the chain as data makes the entire degradation path auditable — a reviewer can see, without reading code, that the terminal hop is a relaxed-tier local cache and not an unvalidated public endpoint.

yaml

routing_table:
  - id: "primary_repo"
    url: "https://api.repository.edu/v2/ingest"
    protocol: "REST"
    priority: 1
    compliance_tier: "strict"
    schema: "dcat-ap-3.0"
    timeout_ms: 3000
    fallback_id: "secondary_mirror"
  - id: "secondary_mirror"
    url: "https://mirror.repository.edu/v2/ingest"
    protocol: "REST"
    priority: 2
    compliance_tier: "strict"
    schema: "dcat-ap-3.0"
    timeout_ms: 5000
    fallback_id: "local_cache"
  - id: "local_cache"
    url: "internal://compliance-proxy/v1/store"
    protocol: "HTTP"
    priority: 3
    compliance_tier: "relaxed"
    schema: "internal-fair-v1"
    timeout_ms: 1000
    fallback_id: null
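
Because the chain is plain data, it can be checked before deployment. The sketch below, with a hypothetical `walk_chain` helper operating on an id-to-fallback mapping, shows one way to confirm that every `fallback_id` resolves and that the chain really terminates in a null link rather than looping.

```python
from typing import Dict, List, Optional


def walk_chain(fallbacks: Dict[str, Optional[str]], entry: str) -> List[str]:
    """Follow fallback_id links from entry; raise on a dangling link or a cycle."""
    path: List[str] = []
    current: Optional[str] = entry
    while current is not None:
        if current not in fallbacks:
            raise ValueError(f"fallback_id points at unknown route: {current!r}")
        if current in path:
            raise ValueError(f"fallback cycle detected at route: {current!r}")
        path.append(current)
        current = fallbacks[current]
    return path


# id -> fallback_id, mirroring the table above.
chain = {
    "primary_repo": "secondary_mirror",
    "secondary_mirror": "local_cache",
    "local_cache": None,
}
print(walk_chain(chain, "primary_repo"))
# ['primary_repo', 'secondary_mirror', 'local_cache']
```

A check like this could run in continuous integration so that a misconfigured chain is rejected at review time.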

Step 2 — Model and load routes with strict typing

Loading raw YAML into dictionaries invites KeyError surprises deep inside the dispatch loop. Modelling each route with a Pydantic V2 BaseModel moves those failures to load time, where they belong. This is the same Pydantic schema validation discipline the ingestion stage applies to datasets, turned inward on the pipeline’s own configuration. Note the one non-obvious detail: a field named schema would shadow the deprecated BaseModel.schema() method, and Pydantic V2 warns about that shadowing when the class is defined, so the YAML key is aliased to schema_name.

python

import yaml
from pydantic import BaseModel, ConfigDict, Field
from typing import Optional, Dict


class RouteConfig(BaseModel):
    # Allow population by field name as well as by the "schema" YAML key.
    model_config = ConfigDict(populate_by_name=True)

    id: str
    url: str
    protocol: str
    priority: int
    compliance_tier: str
    # A field named "schema" would shadow the deprecated BaseModel.schema()
    # method, so store it as schema_name and alias the YAML key.
    schema_name: str = Field(alias="schema")
    timeout_ms: int
    fallback_id: Optional[str] = None


class RoutingTable:
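    def __init__(self, config_path: str) -> None:
        with open(config_path, "r", encoding="utf-8") as f:
            raw = yaml.safe_load(f)
        # Validate each entry first so a missing "id" surfaces as a
        # ValidationError at load time rather than a bare KeyError.
        parsed = [RouteConfig.model_validate(r) for r in raw["routing_table"]]
        self.routes: Dict[str, RouteConfig] = {route.id: route for route in parsed}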

    def get_next_route(self, current_id: Optional[str] = None) -> Optional[RouteConfig]:
        if current_id is None:
            # Entry point: pick the highest-priority (lowest number) route.
            return min(self.routes.values(), key=lambda x: x.priority)
        current = self.routes.get(current_id)
        if current and current.fallback_id:
            return self.routes.get(current.fallback_id)
        return None

Step 3 — Dispatch through a schema-gated fallback loop

The dispatcher validates the payload against the target route’s schema before spending a network round trip, then walks the fallback chain on failure. The compliance rationale is decisive here: validating per-route guarantees the schema-preservation property — a payload that would fail on the primary is never quietly accepted by a mirror, because it is re-checked against each endpoint’s declared schema. Asynchronous I/O via httpx keeps the dispatcher usable inside high-throughput async batch processing workers.

python

import httpx
import logging
from pydantic import ValidationError
from typing import Any, Dict

logger = logging.getLogger(__name__)


class FAIRRouter:
    def __init__(self, table: RoutingTable, validator: Any) -> None:
        self.table = table
        self.validator = validator
        self.client = httpx.AsyncClient(timeout=httpx.Timeout(10.0))

    async def dispatch(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        route = self.table.get_next_route()
        while route:
            try:
                # Compliance gate: validate against THIS route's schema first.
                self.validator.validate(payload, route.schema_name)
                response = await self.client.post(
                    route.url, json=payload, timeout=route.timeout_ms / 1000
                )
                response.raise_for_status()
                return {
                    "status": "success",
                    "route_id": route.id,
                    "payload": response.json(),
                }
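            except ValidationError as exc:
                # Permanent and payload-specific: stop here so a relaxed-tier
                # route cannot accept what this route's schema rejected.
                logger.warning("route %s rejected payload: %s", route.id, exc)
                return {"status": "quarantined", "route_id": route.id, "error": str(exc)}
            except (httpx.HTTPStatusError, httpx.TransportError) as exc:
                # TransportError covers timeouts and connection failures alike,
                # so a refused connection cascades instead of escaping the loop.
                logger.warning("route %s failed: %s", route.id, exc)
                route = self.table.get_next_route(route.id)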
        return {"status": "exhausted", "error": "all fallback routes failed"}

Step 4 — Wrap network calls in a deterministic retry boundary

Cascading to a mirror on the first timeout is wasteful when the primary is merely rate-limited. A per-request retry boundary absorbs transient failures before the fallback chain is consulted. tenacity supplies exponential backoff with jitter and retries only the exception types listed. As written, those are transport-level errors; a 429 or 503 comes back as an ordinary response, so status-based retries additionally need a raise_for_status() call and a matching retry predicate. The stable X-Idempotency-Key, derived from the dataset identifier rather than the attempt count, ensures a retried deposit is deduplicated server-side instead of double-registered, provided the endpoint honours the header. It is the single most important guardrail for retrying a non-idempotent POST.

python

from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
)
import httpx
from typing import Any, Dict


@retry(
    stop=stop_after_attempt(3),
    # Exponential backoff with built-in jitter, capped at 10 seconds.
    wait=wait_random_exponential(multiplier=1, max=10),
    retry=retry_if_exception_type(
        (httpx.ConnectTimeout, httpx.ReadTimeout, httpx.RemoteProtocolError)
    ),
)
async def resilient_request(
    client: httpx.AsyncClient, route: RouteConfig, payload: Dict[str, Any]
) -> httpx.Response:
    dataset_id = payload.get("dataset_id")
    if not dataset_id:
        # A shared placeholder key would make unrelated deposits look like duplicates.
        raise ValueError("payload needs a dataset_id to build a stable idempotency key")
    headers = {
        # Stable across retries so the server can deduplicate the deposit.
        "X-Idempotency-Key": f"fair-{dataset_id}-{route.id}",
        "Accept": "application/ld+json",
    }
    return await client.post(
        route.url, json=payload, headers=headers, timeout=route.timeout_ms / 1000
    )
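
To make the backoff schedule concrete, here is a standard-library sketch of the general "full jitter" idea: each wait is drawn uniformly between zero and an exponentially growing ceiling. The `full_jitter_delay` helper is hypothetical and approximates the behaviour configured above; it is not a copy of tenacity's internals.

```python
import random


def full_jitter_delay(attempt: int, multiplier: float = 1.0, cap: float = 10.0) -> float:
    """Pick a random delay in [0, min(cap, multiplier * 2**(attempt - 1))]."""
    ceiling = min(cap, multiplier * 2 ** (attempt - 1))
    return random.uniform(0.0, ceiling)


for attempt in range(1, 6):
    ceiling = min(10.0, 2 ** (attempt - 1))
    drawn = full_jitter_delay(attempt)
    print(f"attempt {attempt}: ceiling {ceiling:.0f}s, drew {drawn:.2f}s")
```

Randomising the whole interval, rather than adding a small offset to a fixed delay, spreads simultaneous retries from many workers so they do not hit a recovering endpoint in lockstep.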

Reference: Route Fields and Error Classification

Every field in the routing table has an exact meaning; misconfiguring one silently changes the degradation path. The table below is the authoritative field specification.

Field	Type	Required	Purpose	Example
`id`	string	yes	Unique key referenced by `fallback_id` in other routes	`primary_repo`
`url`	string (URI)	yes	Endpoint address; `internal://` scheme marks a local sink	`https://api.repository.edu/v2/ingest`
`protocol`	enum	yes	Dispatch adapter: `REST`, `GraphQL`, `OAI-PMH`, `SWORD`, `HTTP`	`REST`
`priority`	int	yes	Lower wins; the entry route is `min(priority)`	`1`
`compliance_tier`	enum	yes	`strict` enforces full schema validation; `relaxed` permits provisional records	`strict`
`schema`	string	yes	Validation contract applied before dispatch (aliased to `schema_name`)	`dcat-ap-3.0`
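`timeout_ms`	int	yes	Per-request timeout in milliseconds, converted to seconds for `httpx`, which applies it to each phase (connect, read, write, pool) rather than to the request as a whole	`3000`
`fallback_id`	string | null	no (defaults to `null`)	`id` of the next route, or `null` to terminate the chain	`secondary_mirror`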

Routing decisions hinge on classifying each failure correctly. Transient failures are retried in place; permanent failures cascade immediately so no retry budget is wasted on a request that can never succeed. The decision matrix below drives that split.

Condition	Classification	Retry in place?	Action
HTTP 429 Too Many Requests	transient	yes	Backoff with jitter; honour `Retry-After` if present
HTTP 503 / 502 / 504	transient	yes	Backoff, then re-probe the same endpoint
Connect / read timeout	transient	yes	Backoff up to the attempt cap, then cascade
HTTP 400 Bad Request	permanent	no	Cascade; the payload is structurally wrong for this endpoint
HTTP 404 Not Found	permanent	no	Cascade; endpoint or resource path is invalid
Local schema validation error	permanent	no	Quarantine for curator remediation; never retry or downgrade to a relaxed tier
HTTP 401 / 403	permanent	no	Refresh credentials once, else quarantine for operator review
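
The matrix above can be expressed as a small classifier. This is a sketch under assumptions: the `Action` enum and both helper names are hypothetical, statuses not listed in the matrix default to cascading, and the 401/403 branch assumes the single credential refresh has already been attempted. The second helper parses `Retry-After`, which HTTP allows as either a number of seconds or an HTTP-date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY = "retry_in_place"
    CASCADE = "cascade"
    QUARANTINE = "quarantine"


TRANSIENT_STATUS = {429, 502, 503, 504}
AUTH_STATUS = {401, 403}


def classify_failure_status(status: int) -> Action:
    """Map a failing HTTP status onto the decision matrix."""
    if status in TRANSIENT_STATUS:
        return Action.RETRY
    if status in AUTH_STATUS:
        return Action.QUARANTINE  # assumes one credential refresh already failed
    return Action.CASCADE  # 400, 404, and anything unlisted (assumption)


def retry_after_seconds(
    value: Optional[str], now: Optional[datetime] = None
) -> Optional[float]:
    """Parse a Retry-After header value into seconds, or None if unusable."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(classify_failure_status(503).value)  # retry_in_place
print(retry_after_seconds("120"))  # 120.0
```

A retry loop could take the larger of the parsed `Retry-After` value and its own jittered delay, so the client never retries sooner than the server asked.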

Error Handling and Edge Cases

When the retry boundary is exhausted and every route in the chain has failed, the payload must not vanish. It is serialized to a dead-letter queue with its full provenance graph attached — the originating system, every route attempted, each failure classification, and the last-known-good validation result. This guarantees zero data loss during an extended registry outage: once the upstream recovers, a reconciliation worker replays the dead-letter records through the same dispatcher. The read-only local compliance cache serves a complementary role. It holds the last validated payload for each dataset, so a downstream consumer requesting metadata during an outage receives a stale but valid record explicitly flagged as such, rather than a hard error.
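
A dead-letter sink can be as simple as an append-only file of JSON lines. The sketch below is an illustration only: the `dead_letter` and `replay` helpers, the record field names, and the file-based queue are assumptions, and a production system might use a message broker instead.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List


def dead_letter(
    queue_path: Path,
    payload: Dict[str, Any],
    attempts: List[Dict[str, str]],
    origin: str,
) -> Dict[str, Any]:
    """Append one exhausted deposit, with its attempt history, to a JSONL file."""
    record = {
        "dead_lettered_at": datetime.now(timezone.utc).isoformat(),
        "origin": origin,
        "attempts": attempts,  # one entry per route tried, in order
        "payload": payload,
    }
    with queue_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def replay(queue_path: Path) -> List[Dict[str, Any]]:
    """Read queued records back so a reconciliation worker can re-dispatch them."""
    with queue_path.open("r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    queue = Path(tmp) / "dead_letter.jsonl"
    dead_letter(
        queue,
        {"dataset_id": "ds-001"},
        [
            {"route_id": "primary_repo", "classification": "transient"},
            {"route_id": "secondary_mirror", "classification": "transient"},
        ],
        origin="ingest-worker",
    )
    replayed = replay(queue)

print(len(replayed), replayed[0]["payload"]["dataset_id"])  # 1 ds-001
```

Keeping the attempt history beside the payload means the reconciliation worker does not need a second lookup to explain why the record was parked.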

Schema-validation failures deserve separate handling from network failures. A record that fails strict-tier validation should not be blindly retried against a relaxed mirror, because that would defeat the compliance gate; instead it is routed to a quarantine sink where a curator resolves the structural defect before re-submission. Provenance is what keeps these paths honest: when a fallback route activates, the W3C PROV record explicitly marks the degradation event, preserving the distinction between a primary ingestion and a cached reconciliation so audits never mistake one for the other.
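
As an illustration of marking degradation in provenance, the sketch below builds a JSON-LD document using terms from the W3C PROV namespace (`prov:Activity`, `prov:Entity`, `prov:used`, `prov:wasGeneratedBy`, `prov:wasDerivedFrom`). The `ex:` namespace, the `ex:route` and `ex:degraded` properties, the `urn:example:` identifiers, and the helper name are all hypothetical placeholders, not an established vocabulary.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict


def prov_dispatch_record(
    dataset_id: str, route_id: str, entry_route_id: str
) -> Dict[str, Any]:
    """Describe one dispatch as a PROV activity linking source to delivered entity."""
    source_id = f"urn:example:dataset:{dataset_id}"
    activity_id = f"urn:example:dispatch:{dataset_id}:{route_id}"
    return {
        "@context": {
            "prov": "http://www.w3.org/ns/prov#",
            "ex": "https://example.org/fair-routing#",  # hypothetical namespace
        },
        "@graph": [
            {
                "@id": activity_id,
                "@type": "prov:Activity",
                "prov:used": {"@id": source_id},
                "prov:endedAtTime": datetime.now(timezone.utc).isoformat(),
                "ex:route": route_id,
                # True whenever the payload landed anywhere but the entry route.
                "ex:degraded": route_id != entry_route_id,
            },
            {
                "@id": f"urn:example:delivered:{dataset_id}:{route_id}",
                "@type": "prov:Entity",
                "prov:wasGeneratedBy": {"@id": activity_id},
                "prov:wasDerivedFrom": {"@id": source_id},
            },
        ],
    }


record = prov_dispatch_record("ds-001", "local_cache", "primary_repo")
print(json.dumps(record["@graph"][0]["ex:degraded"]))  # true
```

An auditor can then filter on the degradation flag to separate primary ingestions from cached reconciliations without parsing log text.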

Verification and Testing

Assert the fallback cascade without touching live registries by stubbing the transport. The test below drives a primary that returns 503, and confirms the dispatcher lands on the mirror and reports the correct route_id.

python

import httpx
import pytest
from typing import Any, Dict


class _AllowValidator:
    def validate(self, payload: Dict[str, Any], schema_name: str) -> None:
        return None  # accept everything; we are testing routing, not schema


@pytest.mark.asyncio
async def test_cascades_to_mirror_on_primary_503(tmp_path) -> None:
    cfg = tmp_path / "routes.yaml"
    cfg.write_text(
        "routing_table:\n"
        "  - {id: primary_repo, url: 'https://primary/ingest', protocol: REST,\n"
        "     priority: 1, compliance_tier: strict, schema: dcat-ap-3.0,\n"
        "     timeout_ms: 3000, fallback_id: secondary_mirror}\n"
        "  - {id: secondary_mirror, url: 'https://mirror/ingest', protocol: REST,\n"
        "     priority: 2, compliance_tier: strict, schema: dcat-ap-3.0,\n"
        "     timeout_ms: 5000, fallback_id: null}\n"
    )

    def handler(request: httpx.Request) -> httpx.Response:
        if request.url.host == "primary":
            return httpx.Response(503)
        return httpx.Response(200, json={"accepted": True})

    router = FAIRRouter(RoutingTable(str(cfg)), _AllowValidator())
    router.client = httpx.AsyncClient(transport=httpx.MockTransport(handler))

    result = await router.dispatch({"dataset_id": "ds-001"})
    assert result["status"] == "success"
    assert result["route_id"] == "secondary_mirror"

Run it with pytest -q test_routing.py; the asyncio marker requires the pytest-asyncio plugin, and FAIRRouter and RoutingTable must be importable from the modules built in Steps 2 and 3. The expected log line, emitted at WARNING, confirms the cascade fired exactly once: route primary_repo failed: Server error '503 Service Unavailable' .... In production the same failure signal feeds a per-endpoint circuit breaker, which tracks error rate and latency percentiles and opens to divert traffic before every request pays the timeout cost.
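
A deliberately simplified circuit breaker is sketched below. It counts consecutive failures only, whereas the production breaker described here also tracks error rate and latency percentiles; the `CircuitBreaker` class, its thresholds, and its method names are assumptions for illustration.

```python
import asyncio
import time
from typing import Optional


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; probes again after `cooldown` s."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._lock = asyncio.Lock()

    async def allow(self) -> bool:
        """Return True when a request may be sent to the endpoint."""
        async with self._lock:
            if self._opened_at is None:
                return True
            # Half-open: let traffic probe again once the cooldown has elapsed.
            return time.monotonic() - self._opened_at >= self.cooldown

    async def record(self, success: bool) -> None:
        """Update the breaker with the outcome of one request."""
        async with self._lock:
            if success:
                self._failures = 0
                self._opened_at = None
                return
            self._failures += 1
            if self._failures >= self.threshold:
                self._opened_at = time.monotonic()


async def demo() -> bool:
    breaker = CircuitBreaker(threshold=3, cooldown=30.0)
    # Three concurrent failures, as if three dispatch tasks hit a dead endpoint.
    await asyncio.gather(*(breaker.record(False) for _ in range(3)))
    return await breaker.allow()


print(asyncio.run(demo()))  # False: the breaker is open
```

A dispatcher would call `allow()` before each attempt and skip straight to the fallback route when it returns False.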

Structured logs should emit JSON telemetry compatible with OpenTelemetry collectors so circuit transitions, retry counts, and dead-letter events feed the same dashboards that watch the rest of the pipeline.

Gotchas and Known Pitfalls

schema shadows an inherited model attribute. Naming a field schema on a Pydantic V2 model shadows the deprecated BaseModel.schema() method and triggers a UserWarning at class-definition time (Pydantic V1 raised a NameError instead). Fix: alias it, as in schema_name: str = Field(alias="schema"), and set populate_by_name=True so both the YAML key and the Python name work.
Attempt-scoped idempotency keys defeat deduplication. If the X-Idempotency-Key embeds a retry counter or timestamp, each retry looks like a new deposit and the registry mints duplicate identifiers. Fix: derive the key solely from stable payload identity (dataset_id plus route.id) so every retry of the same deposit collides server-side.
Client-level timeout applies whenever a per-route deadline is not passed. An httpx.AsyncClient(timeout=Timeout(10.0)) sets a ten-second default for every request, so a route declaring timeout_ms: 3000 still waits up to ten seconds unless its own value is passed on the call. httpx also applies the value to each phase (connect, read, write, pool) separately, so it is not a ceiling on total request time. Fix: always pass timeout=route.timeout_ms / 1000 on the individual post call, as the dispatcher does.
Retrying a timed-out POST can double-write. A request that times out client-side may still have succeeded on the server; a naive retry then creates a second record. Fix: only retry non-idempotent writes when the endpoint honours the idempotency key, and treat ambiguous timeouts as needing reconciliation rather than a blind resend.
Circuit-breaker state can race under async concurrency. When the read, threshold check, and update of a shared breaker span an await point, multiple asyncio tasks can interleave and miss the threshold, keeping a dead endpoint marked healthy. Fix: guard the failure counter with an asyncio.Lock (or an atomic per-endpoint structure) so concurrent dispatches update it consistently.
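
Where a route's `timeout_ms` is meant as a hard ceiling on the whole call, one option is to wrap the awaitable in `asyncio.wait_for`. The sketch below uses a sleeping coroutine as a stand-in for the HTTP request; `with_deadline` and `slow_endpoint` are hypothetical names.

```python
import asyncio
from typing import Awaitable, TypeVar

T = TypeVar("T")


async def with_deadline(call: Awaitable[T], timeout_ms: int) -> T:
    """Bound the whole awaitable, not each network phase, to timeout_ms."""
    return await asyncio.wait_for(call, timeout=timeout_ms / 1000)


async def slow_endpoint() -> str:
    await asyncio.sleep(0.2)  # stands in for a slow HTTP round trip
    return "accepted"


async def demo() -> str:
    try:
        return await with_deadline(slow_endpoint(), timeout_ms=50)
    except asyncio.TimeoutError:
        return "deadline exceeded"


print(asyncio.run(demo()))  # deadline exceeded
```

The wrapped call is cancelled when the deadline passes, so for a write it should still be treated as an ambiguous outcome that needs reconciliation.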

Frequently Asked Questions

When should a request retry in place versus cascade to a fallback?

Classify by failure permanence. Transient signals (HTTP 429, 502, 503, 504, and connection or read timeouts) mean the endpoint may recover, so retry in place with jittered exponential backoff up to a small attempt cap. Permanent signals (HTTP 400, 404, and any local schema-validation error) mean the request will never succeed on that endpoint, so spend no retry budget: cascade immediately on a 400 or 404, and send a validation failure to quarantine. The decision matrix in the reference section encodes exactly this split.

How do I stop a fallback endpoint from accepting records the primary would reject?

Validate against the target route’s own schema before every dispatch, not once at the pipeline’s edge. Because the dispatcher re-checks the payload against each endpoint’s declared schema as it walks the chain, a record that fails strict-tier validation cannot be quietly accepted by a mirror. Records that fail validation are routed to a quarantine sink for curator remediation rather than downgraded to a relaxed tier.

What happens to a deposit when every route in the chain fails?

It is written to a dead-letter queue with its full W3C PROV Data Model provenance graph — origin, every route attempted, each failure classification, and the last validated payload. Nothing is discarded. A reconciliation worker replays dead-letter records through the same dispatcher once the upstream registry recovers, and the read-only compliance cache can serve the last valid metadata, flagged as stale, to any consumer that queries during the outage.

Does the routing layer manage credentials for each endpoint?

It injects them but does not own them. Tokens are scoped per endpoint and rotated through a centralized secrets manager; when a fallback route activates, the dispatcher attaches that route’s own credential bundle so a mirror or cache never inherits the primary’s elevated permissions. The full least-privilege and TLS enforcement model for these boundaries is covered in Security & Access Control.
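
The "never inherits" rule can be enforced by failing closed when a route has no bundle of its own. The sketch below assumes bearer-token authentication and a plain mapping standing in for the secrets manager; `credential_headers` and `MissingCredential` are hypothetical names.

```python
from typing import Dict, Mapping


class MissingCredential(LookupError):
    """Raised when a route has no credential bundle of its own."""


def credential_headers(route_id: str, bundles: Mapping[str, str]) -> Dict[str, str]:
    """Return auth headers for exactly this route; never borrow another route's token."""
    token = bundles.get(route_id)
    if token is None:
        # Failing closed is deliberate: falling back to the primary's token
        # would hand a mirror or cache the primary's permissions.
        raise MissingCredential(f"no credential bundle for route {route_id!r}")
    return {"Authorization": f"Bearer {token}"}


# Placeholder values; a real deployment would fetch these from the secrets manager.
bundles = {"primary_repo": "token-primary", "secondary_mirror": "token-mirror"}
print(credential_headers("secondary_mirror", bundles))
# {'Authorization': 'Bearer token-mirror'}
```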

Related Guides

FAIR Principle Breakdown: how the Accessible and Reusable principles become enforced checkpoints, including the registry-resolution fallback pattern.
Metadata Schema Mapping: the DCAT-AP and DataCite crosswalks that produce the payloads this layer validates and routes.
Security & Access Control: per-endpoint credential scoping, TLS 1.3 enforcement, and audit logging for routed traffic.
Core Architecture & FAIR Mapping: the parent overview showing where routing sits in the ingestion-to-exposure pipeline topology.
