Parsing ELN Exports with Python and pandas: A Field-Mapped Ingestion Guide

Turning an Electronic Lab Notebook (ELN) export into a validated research record is a narrow, deterministic operation: read the vendor’s CSV or Excel dump, coerce each column to a declared type, normalize the identifier and license fields, and hand a clean frame downstream — without letting pandas guess. This page implements the extraction-and-coercion step of the Lab Notebook Parsing workflow: the point where a nested JSON export from a commercial ELN, a flattened CSV from a legacy LIMS, and an Excel sheet from a benchtop notebook all have to converge on one typed contract before normalization and validation run. It assumes you can read Python 3.10+, have an export file in hand, and want a repeatable loader rather than a one-off script. Schema enforcement with pydantic and provenance logging are adjacent concerns handled after this step, so this guide links to them rather than duplicating them.

The core failure mode is silent: pd.read_csv and pd.read_excel infer dtypes per column, so a sample_id that is numeric in the first 10,000 rows and alphanumeric afterward lands as object, a blank concentration cell becomes NaN that later reads as a valid zero, and a timezone-naive created string quietly breaks embargo math. The fix is to never infer — declare an explicit schema, stream the file in bounded batches, and route anything that does not conform to a quarantine sink.
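
To make the failure concrete, here is a minimal sketch that contrasts inferred and declared dtypes on a tiny in-memory CSV. The two-row sample and its values are invented for illustration; only `pd.read_csv` and its `dtype` parameter come from pandas.

```python
import io

import pandas as pd

# Illustrative two-row export; the values are made up for this sketch.
RAW = "sample_id,concentration_mm\n007,1.5\n008,\n"

# Inferred: the zero-padded IDs collapse to integers and the blank becomes NaN.
inferred = pd.read_csv(io.StringIO(RAW))

# Declared: identifiers stay text and the blank stays an explicit missing value.
declared = pd.read_csv(
    io.StringIO(RAW),
    dtype={"sample_id": "string", "concentration_mm": "Float32"},
)

print(inferred.dtypes.to_dict())
print(declared["sample_id"].tolist())
```

The declared read keeps the leading zeros that the inferred read discards, which is the property the loader in this guide relies on.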

The field crosswalk: raw ELN headers to a typed contract

ELN vendors export the same concepts under different column names and formats. The mapping below is the canonical crosswalk this loader enforces: every raw header variant is aliased to one canonical field, coerced to a fixed Arrow type, and passed through one normalization rule. There are no aspirational rows — each maps to code in the loader that follows. Canonical fields are named to align with the downstream Dublin Core Metadata Element Set (Dublin Core) crosswalk so enrichment needs no second rename pass.

Raw ELN header (observed variants)	Canonical field	Arrow type	Normalization rule
`Experiment ID`, `expt_id`, `EXPERIMENT`	`experiment_id`	`string`	Trim, uppercase, assert `^EXP-\d{4}-[A-Z]{2}$`
`Sample ID`, `sample`, `SampleName`	`sample_id`	`string`	Trim; never coerce to numeric
`Investigator`, `Author`, `Owner`	`investigator_name`	`string`	Collapse internal whitespace
`ORCID`, `orcid_id`, `Researcher ORCID`	`investigator_orcid`	`string`	Prefix bare IDs with `https://orcid.org/`
`Created`, `Timestamp`, `date_run`	`timestamp_utc`	`timestamp[ns, tz=UTC]`	Parse then convert to UTC; reject naive
`Concentration (mM)`, `conc_mM`	`concentration_mm`	`float32`	Blank → null, never `0.0`
`License`, `Rights`, `Reuse`	`license_spdx`	`string`	Map free text to an SPDX License List identifier
`Attachments`, `Files`, `linked_files`	`attachment_refs`	`string`	Keep pipe-delimited; split downstream

Two rules on this table cause most real-world corruption if skipped. First, sample_id must stay a string even when a batch happens to be all-numeric: integer inference strips leading zeros (007 becomes 7), so the keys no longer match and joins against experiment records silently drop rows. Second, a blank concentration_mm must become a genuine null, not 0.0, because a fabricated zero is indistinguishable from a real measurement and poisons every downstream statistic.
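
The ORCID rule in the table only adds the URI prefix. As a hedged sketch of a stricter gate, the helper below also checks the 16-character shape and the ISO 7064 MOD 11-2 check character that ORCID iDs carry. `normalize_orcid` and `orcid_check_digit` are hypothetical names and are not part of the loader in this guide.

```python
from __future__ import annotations

import re

ORCID_PREFIX = "https://orcid.org/"
# Four hyphen-separated groups of four; the final character may be the check digit X.
ORCID_RE = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def normalize_orcid(value: str) -> str | None:
    """Return the full https URI for a valid iD, or None so the caller can quarantine."""
    bare = value.strip().upper()
    for prefix in ("HTTPS://ORCID.ORG/", "HTTP://ORCID.ORG/"):
        if bare.startswith(prefix):
            bare = bare[len(prefix):]
    if not ORCID_RE.match(bare):
        return None
    digits = bare.replace("-", "")
    if orcid_check_digit(digits[:15]) != digits[15]:
        return None
    return ORCID_PREFIX + bare
```

A value that fails either the shape or the checksum returns None, so a caller can route the row to quarantine instead of prefixing arbitrary text into something that merely looks like an ORCID URI.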

Production implementation

The loader streams the export as Arrow record batches so memory stays flat on multi-gigabyte files, applies the declared schema at read time, aliases headers to canonical names, normalizes the identifier and license fields, and splits each batch into a conforming frame and a quarantine list. External httpx/registry lookups are deliberately absent — identifier resolution and rate-limit handling belong to async batch processing, and typed-model enforcement belongs to Pydantic schema validation; this stage only produces a clean, typed frame.

python

from __future__ import annotations

import hashlib
import json
import re
from collections.abc import Iterator
from pathlib import Path

import pandas as pd
import pyarrow as pa
from pyarrow import csv

# --- Declared contract: Arrow types, header aliases, SPDX map, experiment ID pattern ---

ARROW_SCHEMA: dict[str, pa.DataType] = {
    "experiment_id": pa.string(),
    "sample_id": pa.string(),
    "investigator_name": pa.string(),
    "investigator_orcid": pa.string(),
    "timestamp_utc": pa.timestamp("ns", tz="UTC"),
    "concentration_mm": pa.float32(),
    "license_spdx": pa.string(),
    "attachment_refs": pa.string(),
}

# Every known raw header variant maps to exactly one canonical field.
HEADER_ALIASES: dict[str, str] = {
    "experiment id": "experiment_id", "expt_id": "experiment_id", "experiment": "experiment_id",
    "sample id": "sample_id", "sample": "sample_id", "samplename": "sample_id",
    "investigator": "investigator_name", "author": "investigator_name", "owner": "investigator_name",
    "orcid": "investigator_orcid", "orcid_id": "investigator_orcid", "researcher orcid": "investigator_orcid",
    "created": "timestamp_utc", "timestamp": "timestamp_utc", "date_run": "timestamp_utc",
    "concentration (mm)": "concentration_mm", "conc_mm": "concentration_mm",
    "license": "license_spdx", "rights": "license_spdx", "reuse": "license_spdx",
    "attachments": "attachment_refs", "files": "attachment_refs", "linked_files": "attachment_refs",
}

# Free-text rights strings mapped to SPDX License List identifiers.
SPDX_ALIASES: dict[str, str] = {
    "cc-by": "CC-BY-4.0", "cc by 4.0": "CC-BY-4.0", "attribution": "CC-BY-4.0",
    "cc0": "CC0-1.0", "public domain": "CC0-1.0", "mit": "MIT",
}

EXPERIMENT_ID_RE = re.compile(r"^EXP-\d{4}-[A-Z]{2}$")


def canonicalize_headers(path: Path) -> list[str]:
    """Read only the header row and rename each column to its canonical field."""
    header = pd.read_csv(path, nrows=0)
    resolved: list[str] = []
    for raw in header.columns:
        key = raw.strip().lower()
        if key not in HEADER_ALIASES:
            raise KeyError(f"Unmapped ELN column {raw!r}; add it to HEADER_ALIASES before ingesting")
        resolved.append(HEADER_ALIASES[key])
    return resolved


def stream_batches(path: Path, column_names: list[str]) -> Iterator[pd.DataFrame]:
    """Stream the export as Arrow batches so peak memory is one batch, not the whole file."""
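    # Read timestamps as text: Arrow raises on an offset-less value when the declared
    # type carries a timezone, which would abort the stream instead of quarantining the row.
    read_types = {**ARROW_SCHEMA, "timestamp_utc": pa.string()}
    reader = csv.open_csv(
        path,
        read_options=csv.ReadOptions(column_names=column_names, skip_rows=1),
        # Declared types stop dtype inference; blanks/NA tokens become real nulls.
        convert_options=csv.ConvertOptions(
            column_types=read_types,
            strings_can_be_null=True,
            null_values=["", "N/A", "null", "NaN"],
        ),
    )
    for batch in reader:
        yield pa.Table.from_batches([batch]).to_pandas(types_mapper=pd.ArrowDtype)


def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the crosswalk's per-field normalization rules in place."""
    df["experiment_id"] = df["experiment_id"].str.strip().str.upper()
    df["sample_id"] = df["sample_id"].astype("string").str.strip()
    df["investigator_name"] = df["investigator_name"].str.split().str.join(" ")
    # Prefix bare ORCIDs so F3/I3 resolvable-identifier rules hold downstream.
    bare = df["investigator_orcid"].notna() & ~df["investigator_orcid"].str.startswith("https://")
    df.loc[bare, "investigator_orcid"] = "https://orcid.org/" + df.loc[bare, "investigator_orcid"]
    # Reject naive timestamps: only values ending in Z or a UTC offset are parsed;
    # everything else becomes NaT so split_conforming quarantines the row.
    raw_ts = df["timestamp_utc"].astype("string").str.strip()
    has_offset = raw_ts.str.contains(r"(?:Z|[+-]\d{2}:?\d{2})$").fillna(False)
    # format="ISO8601" requires pandas 2.0 or later.
    df["timestamp_utc"] = pd.to_datetime(
        raw_ts.where(has_offset), utc=True, errors="coerce", format="ISO8601"
    )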
    df["license_spdx"] = df["license_spdx"].str.strip().str.lower().map(SPDX_ALIASES)
    return df


def split_conforming(df: pd.DataFrame, source: Path) -> tuple[pd.DataFrame, list[dict[str, object]]]:
    """Partition a batch into a conforming frame and quarantine entries (never drop rows)."""
    # Pass the pattern text: Arrow-backed .str.match expects a str, not a compiled regex.
    ok_id = df["experiment_id"].str.match(EXPERIMENT_ID_RE.pattern).fillna(False)
    ok_orcid = df["investigator_orcid"].str.startswith("https://orcid.org/").fillna(False)
    ok_license = df["license_spdx"].notna()
    ok_time = df["timestamp_utc"].notna()
    conforms = ok_id & ok_orcid & ok_license & ok_time

    quarantine: list[dict[str, object]] = []
    for idx, row in df.loc[~conforms].iterrows():
        payload = row.to_dict()
        reasons = [
            name for name, ok in (
                ("experiment_id", ok_id[idx]), ("investigator_orcid", ok_orcid[idx]),
                ("license_spdx", ok_license[idx]), ("timestamp_utc", ok_time[idx]),
            ) if not ok
        ]
        quarantine.append({
            "source_file": source.name,
            "row_hash": hashlib.sha256(json.dumps(payload, default=str).encode()).hexdigest(),
            "failed_fields": reasons,
            "raw_payload": {k: str(v) for k, v in payload.items()},
        })
    return df.loc[conforms].reset_index(drop=True), quarantine


def parse_eln_export(path: Path) -> tuple[pd.DataFrame, list[dict[str, object]]]:
    """End-to-end: declared-schema stream → normalize → split conforming vs quarantine."""
    columns = canonicalize_headers(path)
    frames: list[pd.DataFrame] = []
    quarantine: list[dict[str, object]] = []
    for batch in stream_batches(path, columns):
        clean, rejected = split_conforming(normalize(batch), path)
        frames.append(clean)
        quarantine.extend(rejected)
    result = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return result, quarantine
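
Once parse_eln_export returns, the quarantine entries are plain dictionaries and can be persisted as they are. As one possible sketch, assuming a JSON Lines file is an acceptable sink (this guide leaves the sink itself to the parent workflow), a hypothetical `write_quarantine_jsonl` helper could look like this:

```python
from __future__ import annotations

import json
from pathlib import Path


def write_quarantine_jsonl(entries: list[dict[str, object]], sink: Path) -> int:
    """Append quarantine entries to a JSON Lines file and return the count written."""
    with sink.open("a", encoding="utf-8") as fh:
        for entry in entries:
            # sort_keys keeps each line byte-stable, which makes diffs and audits easier.
            fh.write(json.dumps(entry, sort_keys=True, default=str) + "\n")
    return len(entries)
```

Appending rather than overwriting means repeated runs accumulate rejects in one file; whether that suits a given audit process is a design choice this sketch does not settle.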

Verification

The test below proves the two rules that break silently in the wild: a bare ORCID is repaired to its resolvable URI form, and a malformed experiment_id is routed to quarantine with a precise reason instead of contaminating the clean frame.

python

from __future__ import annotations

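from pathlib import Path

# Assumes the loader above is saved as eln_loader.py (hypothetical module name).
from eln_loader import parse_eln_export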


def test_parse_repairs_orcid_and_quarantines_bad_id(tmp_path: Path) -> None:
    export = tmp_path / "eln_export.csv"
    export.write_text(
        "Experiment ID,Sample ID,Investigator,ORCID,Created,Concentration (mM),License,Attachments\n"
        "EXP-2024-AB,S-1,Ada Lovelace,0000-0002-1825-0097,2024-03-01T09:00:00+00:00,1.5,cc-by,a.png\n"
        "bad-id,S-2,Grace Hopper,https://orcid.org/0000-0001-5109-3700,2024-03-01T10:00:00Z,,mit,b.png\n"
    )

    clean, quarantine = parse_eln_export(export)

    # Row 1 conforms; the bare ORCID was normalized to a resolvable URI.
    assert len(clean) == 1
    assert clean.loc[0, "investigator_orcid"] == "https://orcid.org/0000-0002-1825-0097"
    assert clean.loc[0, "license_spdx"] == "CC-BY-4.0"

    # Row 2 is quarantined for the malformed experiment_id, with the reason recorded.
    assert len(quarantine) == 1
    assert "experiment_id" in quarantine[0]["failed_fields"]
    assert quarantine[0]["row_hash"]  # a stable SHA-256 digest for audit reconciliation

Run it with pytest -q; a green result confirms header aliasing, ORCID repair, SPDX License List normalization, and quarantine routing all fire in a single streamed pass. Wire this test into CI so a vendor changing an export header (Owner → Principal Investigator) fails the build at canonicalize_headers instead of silently ingesting an unmapped column.
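
A drift check can also report every unmapped header at once rather than stopping at the first. The sketch below uses only the standard library; `find_unmapped_headers` and the shortened alias map are hypothetical stand-ins for the loader's HEADER_ALIASES.

```python
from __future__ import annotations

import csv
import io

# Shortened, illustrative alias map; the loader's HEADER_ALIASES would be used in practice.
KNOWN_ALIASES = {"experiment id": "experiment_id", "owner": "investigator_name"}


def find_unmapped_headers(header_line: str, aliases: dict[str, str]) -> list[str]:
    """Return raw headers with no alias, preserving their order in the export."""
    raw_headers = next(csv.reader(io.StringIO(header_line)), [])
    return [h for h in raw_headers if h.strip().lower() not in aliases]
```

Called on the header row of a fresh vendor sample, a non-empty result lists exactly which aliases need adding before the next ingest.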

Gotchas

pandas infers sample_id as an integer. When every value in a batch is numeric, heuristic inference makes sample_id an int64 and strips leading zeros (007 becomes 7), so later joins against string keys fail or, once the column is cast back to text, drop rows with no error. Fix: declare it pa.string() in ARROW_SCHEMA and never call astype(int) on identifier columns.
Blank measurements become 0.0. A fillna(0) or a downstream aggregation treats an empty concentration_mm cell as a real zero, fabricating data. Fix: set strings_can_be_null=True with an explicit null_values list so blanks stay null, and use nullable float32 end to end.
Timezone-naive timestamps corrupt embargo windows. An ELN that exports local time without an offset makes to_datetime(..., utc=True) assume UTC and shift the real instant. Fix: reject naive timestamps at the gate. errors="coerce" only nulls unparseable strings and still accepts an offset-less value as UTC, so require an explicit Z or ±HH:MM offset before parsing and let ok_time quarantine the rest for manual timezone assignment rather than guessing.
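
The naive-timestamp rule can be checked in isolation with the standard library. This is a minimal sketch, assuming ISO 8601 input; `parse_aware_utc` is a hypothetical helper and not part of the loader.

```python
from __future__ import annotations

from datetime import datetime, timezone


def parse_aware_utc(text: str) -> datetime | None:
    """Parse an ISO 8601 string to UTC, returning None when it carries no offset."""
    candidate = text.strip()
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 onward.
    if candidate.endswith(("Z", "z")):
        candidate = candidate[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(candidate)
    except ValueError:
        return None
    if parsed.tzinfo is None or parsed.utcoffset() is None:
        return None  # naive: quarantine instead of guessing a zone
    return parsed.astimezone(timezone.utc)
```

Returning None for both unparseable and offset-less input keeps the caller's decision simple: anything that is not a known instant goes to quarantine.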

Related

Lab Notebook Parsing: the parent workflow this loader feeds; covers schema-drift detection and the quarantine sink.
Pydantic schema validation: enforces the typed contract on each row after this frame is produced.
Automating Dublin Core enrichment from raw CSV: the enrichment step that consumes these canonical fields.
Data Ingestion & Metadata Enrichment: the full ingestion pipeline this parsing step sits inside.
