Hanzi Timeline

Implementation Notes

This app displays the modern, bronze, seal, and oracle-bone forms of each character, using stable glyph IDs and asset-backed rendering. It does not depend on Unicode coverage of historical scripts.

Data Sources

  • Modern form: Unicode CJK character.
  • Oracle forms: JiaGuWen SQLite + oracle image assets.
  • Bronze + seal forms: EVOBC metadata + local EVOBC image corpus.
  • Origin summary text: English Wiktionary Chinese “Glyph origin” section.
  • Origin references per character: Academia Sinica Xiaoxue + Academia Sinica CharDB + Wiktionary links.
  • EVOBC download source: figshare.com/s/ce2cf55b35a2f8ecc4c6

Current Coverage

  • Generated runtime records: 793
  • Oracle: 1602 variants across 793 records
  • Bronze: 19647 variants across 527 records
  • Seal: 1497 variants across 728 records
  • The UI provides a variants drawer for each stage (Bronze, Seal, Oracle Bone).

Canonical Record Model

Each record is keyed by modern character and codepoint, and carries one row per stage:

{
  id,
  modernChar,
  modernCodepoint,
  dataset,
  stages: [{ stageName, glyphId, assetType, assetRef }],
  variants: { bronze: [...], seal: [...], oracle: [...] },
  origin: { summary, source, sourceUrl, license, confidence },
  originReferences: [{ id, label, url }]
}
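
The shape above could be written down as TypeScript types. This is only a sketch: the field types, the set of stage names, the nullable `summary`, the "U+XXXX" codepoint notation, and every sample value (IDs, asset paths, dataset label) are assumptions made for illustration, not taken from the dataset.

```typescript
// Assumed types for the record shape; inferred from field names only.
type StageName = "modern" | "bronze" | "seal" | "oracle";

interface StageRow {
  stageName: StageName;
  glyphId: string;
  assetType: string; // assumed values such as "text" or "svg"
  assetRef: string;
}

interface EvolutionRecord {
  id: string;
  modernChar: string;
  modernCodepoint: string; // assumed "U+XXXX" notation
  dataset: string;
  stages: StageRow[];
  variants: { bronze: StageRow[]; seal: StageRow[]; oracle: StageRow[] };
  origin: {
    summary: string | null; // assumed nullable: not every record has a summary
    source: string;
    sourceUrl: string;
    license: string;
    confidence: number;
  };
  originReferences: { id: string; label: string; url: string }[];
}

// Derive a "U+XXXX" label from the first code point of a string.
function toCodepointLabel(ch: string): string {
  const cp = ch.codePointAt(0);
  if (cp === undefined) throw new Error("empty character");
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

// Placeholder record; all IDs and paths here are hypothetical.
const sample: EvolutionRecord = {
  id: "rec-99ac",
  modernChar: "馬",
  modernCodepoint: toCodepointLabel("馬"),
  dataset: "example",
  stages: [
    { stageName: "modern", glyphId: "modern-99ac", assetType: "text", assetRef: "馬" },
    { stageName: "oracle", glyphId: "oracle-99ac-01", assetType: "svg", assetRef: "/assets/oracle/99ac-01.svg" },
  ],
  variants: { bronze: [], seal: [], oracle: [] },
  origin: { summary: null, source: "", sourceUrl: "", license: "", confidence: 0 },
  originReferences: [],
};
```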

`glyphId` is the canonical identifier for a glyph. Private Use Area (PUA) codepoints are never stored as the database source of truth.
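
One way to guard that rule at build time is to reject any stored string that contains a Private Use Area character. The sketch below is an assumption about how such a check might look, not the project's actual validation; the function names are hypothetical. The ranges are the three PUA blocks defined by Unicode.

```typescript
// Unicode Private Use Area ranges: BMP PUA, Supplementary PUA-A, Supplementary PUA-B.
const PUA_RANGES: ReadonlyArray<[number, number]> = [
  [0xe000, 0xf8ff],
  [0xf0000, 0xffffd],
  [0x100000, 0x10fffd],
];

// True if the code point lies inside any Private Use Area block.
function isPrivateUse(codepoint: number): boolean {
  return PUA_RANGES.some(([lo, hi]) => codepoint >= lo && codepoint <= hi);
}

// True if any character of the string is a PUA code point.
// Iterating with for...of walks code points, so astral characters are handled.
function containsPrivateUse(text: string): boolean {
  for (const ch of text) {
    if (isPrivateUse(ch.codePointAt(0)!)) return true;
  }
  return false;
}
```

A build step could call `containsPrivateUse(record.modernChar)` and fail fast, so that a PUA fallback glyph never becomes a record key.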

Ingest and Build Pipeline

  1. Download and extract the EVOBC image corpus (use the extracted `Data-EN` directory as the image root).
  2. Extract rows from JiaGuWen DB and group by modern character.
  3. Select a subset or the full JiaGuWen source (`--target-records=500` by default, or `--target-records=all`).
  4. Append EVOBC bronze/seal rows for matching modern characters.
  5. Write normalized NDJSON rows to `data/raw/evolution-rows.ndjson`.
  6. Vectorize oracle JPGs into SVG in one batch.
  7. Vectorize EVOBC bronze/seal rasters into SVG in one batch.
  8. Enrich lexical fields from Unihan (`meaning`, `pinyin`, radical/strokes).
  9. Enrich origin summaries from Wiktionary (`originSummary` + citations).
  10. Build generated records to `data/evolution-records.generated.json`.
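
Steps 2 through 5 can be sketched as a few pure functions. This is an illustration under stated assumptions, not the project's importer: the row shape (`modernChar`, `stage`, `assetRef`), all function names, and the policy of keeping the first N groups in first-seen order when subsetting are hypothetical.

```typescript
// Hypothetical row shape; the real NDJSON fields may differ.
interface RawRow {
  modernChar: string;
  stage: "oracle" | "bronze" | "seal";
  assetRef: string;
}

// Parse a --target-records value: a positive integer, "all", or the default 500.
function parseTargetRecords(value: string | undefined): number | "all" {
  if (value === undefined) return 500;
  if (value === "all") return "all";
  const n = Number(value);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error("invalid --target-records value: " + value);
  }
  return n;
}

// Step 2: group rows by modern character, preserving first-seen order.
function groupByModernChar(rows: RawRow[]): Map<string, RawRow[]> {
  const groups = new Map<string, RawRow[]>();
  for (const row of rows) {
    const bucket = groups.get(row.modernChar);
    if (bucket) bucket.push(row);
    else groups.set(row.modernChar, [row]);
  }
  return groups;
}

// Step 3: keep every group, or only the first `target` groups (assumed policy).
function selectGroups(
  groups: Map<string, RawRow[]>,
  target: number | "all",
): Map<string, RawRow[]> {
  const entries = Array.from(groups.entries());
  return new Map(target === "all" ? entries : entries.slice(0, target));
}

// Step 4: append bronze/seal rows only for characters that were selected.
function appendMatching(groups: Map<string, RawRow[]>, extra: RawRow[]): void {
  for (const row of extra) {
    const bucket = groups.get(row.modernChar);
    if (bucket) bucket.push(row);
  }
}

// Step 5: serialize all rows as NDJSON, one JSON object per line.
function toNdjson(groups: Map<string, RawRow[]>): string {
  const lines: string[] = [];
  groups.forEach((rows) => rows.forEach((row) => lines.push(JSON.stringify(row))));
  return lines.join("\n") + "\n";
}
```

Keeping these steps free of file and database access makes them easy to test; the SQLite read and the write to `data/raw/evolution-rows.ndjson` would wrap around them.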

Lexical Metadata Status

  • `meaning` records populated: 758
  • `pinyin` records populated: 793
  • Source: Unihan `kDefinition` + `kMandarin`.
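
Unihan data files are tab-separated lines of the form `U+XXXX<TAB>field<TAB>value`, with `#` marking comment lines. A minimal sketch of extracting the two fields named above follows. The function name and output shape are hypothetical, and the lines used in the usage note are abbreviated illustrations rather than verbatim Unihan entries.

```typescript
// Hypothetical output shape for the lexical enrichment step.
interface LexicalFields {
  meaning?: string;
  pinyin?: string;
}

// Collect kDefinition and kMandarin values, keyed by "U+XXXX" codepoint.
function parseUnihanLines(lines: string[]): Map<string, LexicalFields> {
  const out = new Map<string, LexicalFields>();
  for (const line of lines) {
    // Skip blank lines and comments.
    if (line === "" || line.startsWith("#")) continue;
    const parts = line.split("\t");
    if (parts.length < 3) continue;
    const [codepoint, field, value] = parts;
    if (field !== "kDefinition" && field !== "kMandarin") continue;
    const entry = out.get(codepoint) ?? {};
    if (field === "kDefinition") entry.meaning = value;
    else entry.pinyin = value;
    out.set(codepoint, entry);
  }
  return out;
}
```

For example, feeding it `"U+99AC\tkMandarin\tmǎ"` and `"U+99AC\tkDefinition\thorse"` yields one entry for `U+99AC` with both fields set. A character with no `kDefinition` line ends up with no `meaning`.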

Historical Origin Metadata

  • `origin.summary` records populated: 483
  • Source extractor: English Wiktionary Chinese Glyph-origin section.
  • Stored with source URL, license label, and confidence score.
  • Every record also carries direct Xiaoxue + CharDB + Wiktionary lookup links.

Asset Strategy

  • Bronze, seal, and oracle glyphs are rendered from static SVG assets.
  • Asset URLs are cache-busted at data-build time with a file-mtime token (`?v=...`), so browsers do not serve stale assets.
  • Modern stage uses Unicode text rendering.

Search Behavior

  • Primary matching: modern character and modern Unicode codepoint.
  • Secondary scoring: record IDs and glyph IDs.
  • Pinyin query mode: exact full-pinyin match, tone-insensitive.
  • English meaning is not used as a search key.
  • Search runs on IDs and metadata, not on rendered oracle glyph strings.
  • Origin summary text is currently display-only (not used as a search key).
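
The matching rules above can be sketched as a single scoring function. This is an illustration, not the app's actual search code: the record shape, the score weights, the accepted `U+XXXX` query form, and the handling of trailing tone digits are all assumptions. Tone-insensitivity is implemented here by decomposing to NFD and dropping only the four tone diacritics, so that `ü` stays distinct from `u`.

```typescript
// Hypothetical minimal record shape used for search.
interface SearchRecord {
  id: string;
  modernChar: string;
  modernCodepoint: string; // assumed "U+XXXX" notation
  pinyin: string[];
  glyphIds: string[];
}

// Combining grave, acute, macron, and caron: the four pinyin tone marks.
const TONE_MARKS = /[\u0300\u0301\u0304\u030c]/g;

// Remove tone marks (and an assumed trailing tone digit), keeping the diaeresis.
function stripTones(pinyin: string): string {
  return pinyin
    .normalize("NFD")
    .replace(TONE_MARKS, "")
    .normalize("NFC")
    .replace(/[1-5]$/, "")
    .toLowerCase();
}

// Accept queries such as "U+99AC" or "u+99ac"; return null for anything else.
function normalizeCodepoint(query: string): string | null {
  const match = /^u\+([0-9a-f]{4,6})$/i.exec(query.trim());
  return match ? "U+" + match[1].toUpperCase() : null;
}

// Score one record against a query; 0 means no match. Weights are arbitrary.
function scoreRecord(record: SearchRecord, query: string): number {
  const q = query.trim();
  if (q === "") return 0;
  // Primary: modern character or its codepoint.
  if (q === record.modernChar) return 100;
  const cp = normalizeCodepoint(q);
  if (cp !== null && cp === record.modernCodepoint.toUpperCase()) return 100;
  // Secondary: record ID or glyph ID.
  const lower = q.toLowerCase();
  if (record.id.toLowerCase() === lower) return 50;
  if (record.glyphIds.some((g) => g.toLowerCase() === lower)) return 50;
  // Pinyin: exact match on the whole reading, ignoring tones.
  const plain = stripTones(q);
  if (record.pinyin.some((p) => stripTones(p) === plain)) return 25;
  return 0;
}
```

Note that English meanings and origin summaries are deliberately absent from the record shape, matching the rule that neither is a search key.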

Deep Links

  • URL state is shareable via query params: `q`, `char`, `theme`, `lang`, `variants`.
  • `variants` supports a comma list of open variant drawers, e.g. `bronze,oracle`.
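
Reading and writing that URL state could be sketched with the standard `URLSearchParams` API. The state shape, function names, empty-string defaults, and the policy of silently dropping unknown `variants` entries are assumptions; `seal` is assumed to be a valid drawer name alongside the `bronze` and `oracle` values shown above.

```typescript
type Stage = "bronze" | "seal" | "oracle";
const STAGES: readonly Stage[] = ["bronze", "seal", "oracle"];

// Hypothetical in-memory view state mirrored in the query string.
interface ViewState {
  q: string;
  char: string;
  theme: string;
  lang: string;
  variants: Stage[];
}

// Parse a query string such as "?q=ma&variants=bronze,oracle".
function parseViewState(search: string): ViewState {
  const params = new URLSearchParams(search);
  const requested = (params.get("variants") ?? "")
    .split(",")
    .map((s) => s.trim().toLowerCase());
  return {
    q: params.get("q") ?? "",
    char: params.get("char") ?? "",
    theme: params.get("theme") ?? "",
    lang: params.get("lang") ?? "",
    // Unknown names are dropped; the result follows STAGES order.
    variants: STAGES.filter((s) => requested.includes(s)),
  };
}

// Serialize state back to a query string, omitting empty values.
function serializeViewState(state: ViewState): string {
  const params = new URLSearchParams();
  if (state.q) params.set("q", state.q);
  if (state.char) params.set("char", state.char);
  if (state.theme) params.set("theme", state.theme);
  if (state.lang) params.set("lang", state.lang);
  if (state.variants.length > 0) params.set("variants", state.variants.join(","));
  return params.toString();
}
```

`URLSearchParams` percent-encodes the comma and non-ASCII characters on output and decodes them on input, so a serialized state parses back to the same values.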

Commands

npm run data:import:multistage:full -- --skip-download --evobc-image-root=/path/to/Data-EN
npm run data:import:multistage:full
npm run data:import:multistage
npm run data:import:full
npm run data:import:e2e
npm run data:enrich:origins
npm run data:build
npm run dev
