Data & downloads — for reviewers and reusers

Download, audit, and reproduce the resource

Every causal claim in WORD-DRIFT is an auditable row with source URLs, a typed evidence tier, and a confidence score. This page links the full dataset, the claims ledger, the quality gates, and the provenance reports so other researchers can review and reuse the graph.

Start here

The claims ledger

The single most useful artifact for a reviewer. It contains one row per drift:CausalHypothesis, recording the word, the senses it shifted from and to, the drift type, the proposed trigger and its Wikidata QID, the typed evidence tier, the confidence, and the source URLs backing the claim. Open it in any spreadsheet, pick a row, follow its source_urls, and check whether the evidence_types and confidence are justified. See docs/REVIEW.md for the per-claim audit recipe.

claims-ledger.csv

One auditable row per causal hypothesis. Columns: hypothesis_id, word, language, sense_from, sense_to, drift_type, drift_year, proposed_trigger, trigger_date, trigger_category, wikidata_qid, evidence_types, confidence, source_urls, source_titles, attributed_to, date.
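For reviewers who prefer a script to a spreadsheet, the following is a minimal sketch of loading the ledger and listing the sources behind one claim. The column names come from the list above. The "|" separator for multi-valued cells such as source_urls is an assumption and should be checked against the real file; load_ledger and sources_for are hypothetical helper names, not part of the repository.

```python
import csv
from pathlib import Path

# Column names as listed on this page.
EXPECTED_COLUMNS = [
    "hypothesis_id", "word", "language", "sense_from", "sense_to",
    "drift_type", "drift_year", "proposed_trigger", "trigger_date",
    "trigger_category", "wikidata_qid", "evidence_types", "confidence",
    "source_urls", "source_titles", "attributed_to", "date",
]

# ASSUMPTION: multi-valued cells are separated by "|". Verify in the real file.
MULTI_VALUE_SEP = "|"


def load_ledger(path):
    """Read the ledger and fail loudly if a documented column is missing."""
    with Path(path).open(newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        header = reader.fieldnames or []
        missing = [c for c in EXPECTED_COLUMNS if c not in header]
        if missing:
            raise ValueError(f"ledger is missing columns: {missing}")
        return list(reader)


def sources_for(row):
    """Split the source_urls cell of one row into individual URLs."""
    cell = row["source_urls"]
    return [u.strip() for u in cell.split(MULTI_VALUE_SEP) if u.strip()]
```

Calling load_ledger("claims-ledger.csv") and then sources_for(rows[0]) would give the URLs to open for the first claim. The header check is there so that a schema change in a later release surfaces as an error rather than as silently empty fields.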

Format: CSV (one row per hypothesis). License: CC-BY 4.0.

Full dataset and tables

The complete graph in three RDF serializations plus the tabular extracts. All curated RDF instances are released under CC-BY 4.0; the schema, shapes, queries, and code are MIT (see License). Source datasets retain their own licenses (see Data sources).

word-drift.ttl

The full knowledge graph in Turtle. The canonical, human-readable serialization.

Format: RDF / Turtle. License: CC-BY 4.0.

word-drift.nt

The full graph as N-Triples. One triple per line; convenient for streaming, diffing, and line-oriented tooling.
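Because each triple sits on its own line, two releases of the N-Triples file can be compared with plain set operations. The sketch below illustrates that idea; read_triples and diff_triples are hypothetical helper names. It compares lines as text, so it assumes both files were written by the same serializer. Blank-node labels (terms starting with "_:") are not stable across serializations and may show up as spurious changes.

```python
from pathlib import Path


def read_triples(path):
    """Return the set of non-empty, non-comment lines in an N-Triples file."""
    triples = set()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                triples.add(line)
    return triples


def diff_triples(old_path, new_path):
    """Return (removed, added) triples between two N-Triples files, sorted."""
    old = read_triples(old_path)
    new = read_triples(new_path)
    return sorted(old - new), sorted(new - old)
```

For example, diff_triples("old/word-drift.nt", "word-drift.nt") would return the triples dropped and the triples introduced between two builds (the old/ path is illustrative).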

Format: RDF / N-Triples. License: CC-BY 4.0.

word-drift.jsonld

The full graph as JSON-LD. Drop-in for JSON tooling that understands linked data contexts.

Format: RDF / JSON-LD. License: CC-BY 4.0.

words.csv

One row per word: word_id, written_form, language, n_senses. A lightweight index of the lexical layer.

Format: CSV. License: CC-BY 4.0.

triggers.csv

One row per trigger event: trigger_id, label, event_date, category, wikidata_qid. The real-world events that reframe words.

Format: CSV. License: CC-BY 4.0.

word-drift-nanopubs.trig

The curated causal hypotheses as nanopublications (TriG): each claim packaged with its own assertion, provenance, and publication-info graphs. The most self-contained, citable unit of provenance.

Format: RDF / TriG (nanopublications). License: CC-BY 4.0.

Reproducibility

Rebuild it with one command

The downloadable artifacts on this page are not hand-edited. They are regenerated by the build, which runs a sequence of quality gates first. Clone the repository and run:

# Regenerate the claims ledger, CSVs, and RDF serializations:
make export

# Full quality gate + frozen-release checklist (does NOT publish anything):
make release
#   = validate (SHACL) + test (pytest) + lint-data + check-qids + stats
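
The gate sequence listed in the comment above can also be driven from a script, for instance in CI. This is a rough sketch under assumptions, not the Makefile's actual implementation: the target names and their order are taken from that comment, the fail-fast behaviour is assumed, and run_gates is a hypothetical helper.

```python
import subprocess

# Make targets named on this page, in the order the release comment lists them.
GATES = ("validate", "test", "lint-data", "check-qids", "stats")


def run_gates(gates=GATES, runner=subprocess.run):
    """Run each make target in order.

    Returns the name of the first gate that exits non-zero,
    or None if every gate passes.
    """
    for gate in gates:
        result = runner(["make", gate])
        if result.returncode != 0:
            return gate  # ASSUMPTION: stop at the first failure
    return None
```

Calling run_gates() from the repository root returns None when all gates pass. The runner parameter exists only so the function can be exercised without invoking make.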

SHACL validation gate

make validate runs validate.py, which loads the ontology modules and the SHACL shapes (causal-hypothesis-shape.ttl, drift-event-shape.ttl, word-sense-shape.ttl). Every drift claim must structurally cite a source; every hypothesis must carry a confidence and a typed evidence tier.

QID gate

make check-qids runs audit-trigger-qids.py: every trigger owl:sameAs must resolve to a verified-OK Wikidata entity (no categories, disambiguation pages, deleted items, or label mismatches).
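Before running the full gate, a reuser can do a cheap offline sanity check on triggers.csv. The sketch below only tests that each wikidata_qid cell looks like a Wikidata item ID ("Q" followed by a positive integer). It does not contact Wikidata and is not a substitute for audit-trigger-qids.py. It assumes the cell holds a bare ID such as Q42 rather than a full URI; malformed_qids and check_file are hypothetical helper names.

```python
import csv
import re

# Wikidata item IDs are "Q" followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def malformed_qids(rows, column="wikidata_qid"):
    """Return (row_number, value) pairs whose QID cell is empty or malformed."""
    bad = []
    for number, row in enumerate(rows, start=2):  # row 1 is the CSV header
        value = (row.get(column) or "").strip()
        if not QID_PATTERN.fullmatch(value):
            bad.append((number, value))
    return bad


def check_file(path):
    """Run the format check over a CSV file such as triggers.csv."""
    with open(path, newline="", encoding="utf-8") as fh:
        return malformed_qids(csv.DictReader(fh))
```

Empty cells are reported as malformed here. If some triggers legitimately have no QID, that rule would need relaxing.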

Lint gate

make lint-data runs lint-data.py for data-shape checks beyond SHACL (year ranges, dangling references, orphaned senses).
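To make the kinds of checks concrete, here is a hedged sketch of two of them over the CSV extracts: a year-range check on drift_year and a dangling-reference check from the ledger's wikidata_qid to triggers.csv. It is not the logic of lint-data.py. The year bounds and the choice of wikidata_qid as the join key are assumptions, and lint_rows is a hypothetical helper.

```python
def parse_year(text):
    """Return the year as an int, or None if the cell is not an integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def lint_rows(ledger_rows, trigger_rows, min_year=1000, max_year=2100):
    """Return a list of human-readable problems found in the ledger rows.

    ASSUMPTIONS: the year bounds are illustrative, and ledger rows are
    joined to triggers through the wikidata_qid column.
    """
    known_qids = {
        (t.get("wikidata_qid") or "").strip() for t in trigger_rows
    } - {""}
    problems = []
    for row in ledger_rows:
        hid = row.get("hypothesis_id", "?")
        # Year-range check.
        year = parse_year(row.get("drift_year"))
        if year is None or not min_year <= year <= max_year:
            problems.append(f"{hid}: bad drift_year {row.get('drift_year')!r}")
        # Dangling-reference check.
        qid = (row.get("wikidata_qid") or "").strip()
        if qid and qid not in known_qids:
            problems.append(f"{hid}: wikidata_qid {qid} not in triggers")
    return problems
```

Both arguments are lists of dicts as produced by csv.DictReader over claims-ledger.csv and triggers.csv. An empty result means neither check found a problem.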

Competency SPARQL

The 12 competency questions (queries/competency/*.rq) encode what the graph should be able to answer. Run them all with run-competency-questions.py. The causal joins (cq01, cq02, cq09) are the questions that detection-only data cannot answer.

How the claims were checked

The reports below document what was verified, what is still weak, and how. They are the honest record behind the confidence scores. The curated causal layer is deliberately high-precision rather than representative; the limitations are spelled out in the data card.

Reviewer guide

docs/REVIEW.md — how to audit a single claim, run the queries, check the gates, and where per-claim provenance lives.

Data card

docs/data-card.md — composition, intended use, known biases and limitations, FAIR distribution.

Wikidata audit

wikidata-audit.md — resolution status of every trigger owl:sameAs link.

Verification reports

data/reports/verify-chunk*.md — per-chunk manual verification of curated claims against their sources.

Inter-annotator agreement

eval/iaa/ (annotation sheet, key, kappa script) and iaa-pilot.md. This is an LLM reliability baseline, not human ground truth; a human annotation round is the documented next step.
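For readers unfamiliar with the statistic, the sketch below computes Cohen's kappa for two annotators over the same items. The page does not say which kappa variant the repository's script uses, so treating it as Cohen's kappa is an assumption; cohens_kappa is a hypothetical function, not the script in eval/iaa/.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over a plain percentage when label distributions are skewed.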

Design decisions

docs/plans/ (research-grade plan + waves) and docs/research-log.md — the running record of what was built and why.

Citation

How to cite WORD-DRIFT

If you use the schema, data, or visualiser in your research, please cite the repository. Machine-readable metadata is in CITATION.cff. A companion paper describing the causal ontology layer is in preparation; this entry will be updated when it is published.

BibTeX

@misc{worddrift2026,
  title  = {{WORD-DRIFT}: A Knowledge Graph for Evidenced Causal
            Hypotheses in Lexical Semantic Change},
  author = {Nennemann, Christian},
  year   = {2026},
  url    = {https://github.com/XORwell/word-drift},
  note   = {Version 0.4. Data: CC-BY 4.0; Code: MIT.}
}

Plain text

Nennemann, C. (2026). WORD-DRIFT: A Knowledge Graph for Evidenced Causal Hypotheses in Lexical Semantic Change (Version 0.4). https://github.com/XORwell/word-drift

Repository: github.com/XORwell/word-drift