Data & downloads — for reviewers and reusers
Every causal claim in WORD-DRIFT is an auditable row with source URLs, a typed evidence tier, and a confidence. This page links the full dataset, the claims ledger, the quality gates, and the provenance reports so other researchers can review and reuse the graph.
Start here
The claims ledger is the single most useful artifact for a reviewer. It holds one row per drift:CausalHypothesis, giving the word, the sense it shifted from and to, the drift type, the proposed trigger and its Wikidata QID, the typed evidence tier, the confidence, and the source URLs backing the claim. Open it in any spreadsheet, pick a row, follow its source_urls, and check whether the evidence_types and confidence are justified.
See docs/REVIEW.md for the per-claim audit recipe.
One auditable row per causal hypothesis. Columns:
hypothesis_id, word, language, sense_from, sense_to, drift_type,
drift_year, proposed_trigger, trigger_date, trigger_category,
wikidata_qid, evidence_types, confidence, source_urls, source_titles,
attributed_to, date.
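For reviewers who would rather script the audit than use a spreadsheet, the following is a minimal sketch of loading the ledger and turning it into a list of sources to open. It uses only the column names listed above. The "|" separator inside source_urls is an assumption; check how the real file joins multiple URLs before relying on it.

```python
import csv
import io

# Column names as listed for the claims ledger above.
LEDGER_COLUMNS = [
    "hypothesis_id", "word", "language", "sense_from", "sense_to",
    "drift_type", "drift_year", "proposed_trigger", "trigger_date",
    "trigger_category", "wikidata_qid", "evidence_types", "confidence",
    "source_urls", "source_titles", "attributed_to", "date",
]


def load_ledger(csv_text, url_separator="|"):
    """Parse ledger CSV text into dicts, splitting source_urls into a list.

    url_separator is an ASSUMPTION, not something the documentation states.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in LEDGER_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"ledger is missing columns: {missing}")
    rows = []
    for row in reader:
        urls = [u.strip() for u in row["source_urls"].split(url_separator)]
        row["source_urls"] = [u for u in urls if u]
        rows.append(row)
    return rows


def audit_worklist(rows):
    """Yield (hypothesis_id, url) pairs: one entry per source to open and check."""
    for row in rows:
        for url in row["source_urls"]:
            yield row["hypothesis_id"], url
```

To use it, read the downloaded CSV into a string (for example with pathlib.Path(...).read_text(encoding="utf-8")) and pass it to load_ledger. The per-claim audit steps themselves remain those in docs/REVIEW.md.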
Dataset
The complete graph in three RDF serializations plus the tabular extracts. All curated RDF instances are released under CC-BY 4.0; the schema, shapes, queries, and code are MIT (see License). Source datasets retain their own licenses (see Data sources).
The full knowledge graph in Turtle. The canonical, human-readable serialization.
The full graph as N-Triples. One triple per line; convenient for streaming, diffing, and line-oriented tooling.
The full graph as JSON-LD. Drop-in for JSON tooling that understands linked data contexts.
One row per word: word_id, written_form, language, n_senses. A lightweight index of the lexical layer.
One row per trigger event: trigger_id, label, event_date, category, wikidata_qid. The real-world events that reframe words.
The curated causal hypotheses as nanopublications (TriG): each claim packaged with its own assertion, provenance, and publication-info graphs. The most self-contained, citable unit of provenance.
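As an illustration of the line-oriented tooling that the N-Triples file allows, here is a small sketch that counts how often each predicate occurs without loading an RDF library. It relies on the fact that N-Triples subjects and predicates contain no whitespace. It is a rough profiling aid, not a validating parser.

```python
from collections import Counter


def count_predicates(lines):
    """Count predicate IRIs in an iterable of N-Triples lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate never contain whitespace in N-Triples,
        # so the first two whitespace-separated tokens can be split off safely.
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # not a complete triple; ignored in this sketch
        counts[parts[1]] += 1
    return counts
```

Because the function accepts any iterable of lines, an open file handle on the downloaded .nt file can be passed directly and is streamed rather than read into memory.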
Reproducibility
The downloadable artifacts on this page are not hand-edited. They are regenerated by the build, which runs a sequence of quality gates first. Clone the repository and run:
# Regenerate the claims ledger, CSVs, and RDF serializations:
make export
# Full quality gate + frozen-release checklist (does NOT publish anything):
make release   # = validate (SHACL) + test (pytest) + lint-data + check-qids + stats
make validate runs validate.py, loading the ontology modules and the SHACL shapes (causal-hypothesis-shape.ttl, drift-event-shape.ttl, word-sense-shape.ttl). Every drift claim must structurally cite a source; every hypothesis must carry a confidence and a typed evidence tier.
make check-qids runs audit-trigger-qids.py: every trigger owl:sameAs must resolve to a verified-OK Wikidata entity (no categories, disambiguation pages, deleted items, or label mismatches).
make lint-data runs lint-data.py for data-shape checks beyond SHACL (year ranges, dangling references, orphaned senses).
The 12 competency questions (queries/competency/*.rq) encode what the graph should be able to answer. Run them all with run-competency-questions.py. The causal joins (cq01, cq02, cq09) are the ones that detection data alone cannot answer.
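The gates above run against the RDF graph inside the repository. A reuser working only from the ledger CSV could approximate two of the stated requirements with a quick pre-check like the sketch below: that each row carries a source, an evidence tier, and a confidence, and that any wikidata_qid at least has the form of a Wikidata item ID. This is not the project's validate.py or audit-trigger-qids.py. In particular, it does not resolve QIDs against Wikidata, and treating an empty wikidata_qid as acceptable is an assumption.

```python
import re

# Wikidata item identifiers are "Q" followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")

# Fields the gate description says every hypothesis must carry.
REQUIRED_FIELDS = ("source_urls", "evidence_types", "confidence")


def precheck_row(row):
    """Return a list of problems found in one ledger row (a dict of strings)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not (row.get(field) or "").strip():
            problems.append(f"missing {field}")
    qid = (row.get("wikidata_qid") or "").strip()
    # ASSUMPTION: an empty QID is allowed; only a malformed one is reported.
    if qid and not QID_PATTERN.fullmatch(qid):
        problems.append(f"malformed wikidata_qid: {qid!r}")
    return problems
```

Rows as produced by csv.DictReader can be passed in unchanged. An empty result means only that the row passed this shallow check; the authoritative results are those of make release.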
Provenance & quality
The reports below document what was verified, what is still weak, and how. They are the honest record behind the confidence scores. The curated causal layer is deliberately high-precision rather than representative; the limitations are spelled out in the data card.
docs/REVIEW.md — how to audit a single claim, run the queries, check the gates, and where per-claim provenance lives.
docs/data-card.md — composition, intended use, known biases and limitations, FAIR distribution.
wikidata-audit.md: resolution status of every trigger owl:sameAs link.
data/reports/verify-chunk*.md — per-chunk manual verification of curated claims against their sources.
eval/iaa/ (annotation sheet, key, kappa script) and iaa-pilot.md. An LLM reliability baseline, not human ground truth; a human round is the documented next step.
docs/plans/ (research-grade plan + waves) and docs/research-log.md — the running record of what was built and why.
Citation
If you use the schema, data, or visualiser in your research, please cite the repository. Machine-readable metadata is in CITATION.cff. A companion paper describing the causal ontology layer is in preparation; this entry will be updated when it is published.
@misc{worddrift2026,
title = {{WORD-DRIFT}: A Knowledge Graph for Evidenced Causal
Hypotheses in Lexical Semantic Change},
author = {Nennemann, Christian},
year = {2026},
url = {https://github.com/XORwell/word-drift},
note = {Version 0.4. Data: CC-BY 4.0; Code: MIT.}
}
Repository: github.com/XORwell/word-drift