Hanzi Timeline
Implementation Notes
This app shows modern, bronze, seal, and oracle forms with stable IDs and asset-backed rendering. It does not depend on historical-script Unicode coverage.
Data Sources
- Modern form: Unicode CJK character.
- Oracle forms: JiaGuWen SQLite + oracle image assets.
- Bronze + seal forms: EVOBC metadata + local EVOBC image corpus.
- Origin summary text: English Wiktionary Chinese “Glyph origin” section.
- Origin references per character: Academia Sinica Xiaoxue + Academia Sinica CharDB + Wiktionary links.
- EVOBC download source: figshare.com/s/ce2cf55b35a2f8ecc4c6
Current Coverage
- Generated runtime records: 793
- Oracle: 1602 variants across 793 records
- Bronze: 19647 variants across 527 records
- Seal: 1497 variants across 728 records
- The UI provides a variants drawer for each historical stage (Bronze, Seal, Oracle Bone).
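The per-stage counts above could be recomputed from the generated records with a sketch like the one below. It assumes each record carries the `variants` object shown in the record model, and it assumes "records" in the coverage lines means records with at least one variant for that stage. The function name `summarizeCoverage` is hypothetical.

```javascript
// Stage keys as they appear in the record model's `variants` object.
const STAGES = ["oracle", "bronze", "seal"];

// Count variants per stage and the number of records that have at least
// one variant for that stage (assumed meaning of "across N records").
function summarizeCoverage(records) {
  const summary = {};
  for (const stage of STAGES) {
    summary[stage] = { variants: 0, records: 0 };
  }
  for (const record of records) {
    for (const stage of STAGES) {
      const list = (record.variants && record.variants[stage]) || [];
      if (list.length > 0) {
        summary[stage].variants += list.length;
        summary[stage].records += 1;
      }
    }
  }
  return summary;
}
```

Running it over the parsed contents of `data/evolution-records.generated.json` would give one `{ variants, records }` pair per stage.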
Canonical Record Model
Each record is keyed by modern character and codepoint, and carries one row per stage:
{
  id,
  modernChar,
  modernCodepoint,
  dataset,
  stages: [{ stageName, glyphId, assetType, assetRef }],
  variants: { bronze: [...], seal: [...], oracle: [...] },
  origin: { summary, source, sourceUrl, license, confidence },
  originReferences: [{ id, label, url }]
}
`glyphId` is the canonical identifier for a glyph. Private Use Area (PUA) codepoints are never used as database truth.
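As an illustration of the record shape, here is a minimal sketch with one hypothetical record and two small helpers. Every value in `sampleRecord` (IDs, paths, dataset name) is invented for the example, and `findStage` and `missingFields` are hypothetical helper names, not part of the app.

```javascript
// Hypothetical sample record; all values are illustrative only.
const sampleRecord = {
  id: "rec-99ac",
  modernChar: "馬",
  modernCodepoint: "U+99AC",
  dataset: "example",
  stages: [
    { stageName: "oracle", glyphId: "oracle-99ac-001", assetType: "svg", assetRef: "assets/oracle/99ac-001.svg" },
    { stageName: "modern", glyphId: "modern-99ac", assetType: "text", assetRef: "馬" },
  ],
  variants: { bronze: [], seal: [], oracle: [] },
  origin: { summary: null, source: null, sourceUrl: null, license: null, confidence: null },
  originReferences: [],
};

// Top-level fields listed in the record model.
const REQUIRED_FIELDS = [
  "id", "modernChar", "modernCodepoint", "dataset",
  "stages", "variants", "origin", "originReferences",
];

// Return the stage row with the given name, or undefined if absent.
function findStage(record, stageName) {
  return record.stages.find((stage) => stage.stageName === stageName);
}

// Return the names of required top-level fields that the record lacks.
function missingFields(record) {
  return REQUIRED_FIELDS.filter((field) => !(field in record));
}
```

Lookups go through `glyphId` and `stageName` rather than through any rendered glyph string, which matches the rule that `glyphId` is the source of truth.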
Ingest and Build Pipeline
- Download/extract EVOBC image corpus (use extracted `Data-EN` root).
- Extract rows from JiaGuWen DB and group by modern character.
- Select a subset of the JiaGuWen source or the full set (`--target-records=500` by default, or `--target-records=all`).
- Append EVOBC bronze/seal rows for matching modern characters.
- Write normalized NDJSON rows to `data/raw/evolution-rows.ndjson`.
- Vectorize oracle JPGs into SVG in one batch.
- Vectorize EVOBC bronze/seal rasters into SVG in one batch.
- Enrich lexical fields from Unihan (`meaning`, `pinyin`, radical/strokes).
- Enrich origin summaries from Wiktionary (`originSummary` + citations).
- Build generated records to `data/evolution-records.generated.json`.
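The selection, grouping, and NDJSON steps above can be sketched roughly as follows. This is not the project's actual importer: the row field `modernChar`, the "first N groups" selection rule, and all function names are assumptions made for illustration. Only the `--target-records` flag and its `500` / `all` values come from the notes above.

```javascript
// Parse --target-records from an argv-style array; 500 is the documented default.
function parseTargetRecords(argv) {
  const prefix = "--target-records=";
  const flag = argv.find((arg) => arg.startsWith(prefix));
  if (!flag) return 500;
  const value = flag.slice(prefix.length);
  if (value === "all") return Infinity;
  if (!/^[1-9]\d*$/.test(value)) {
    throw new Error(`Invalid --target-records value: ${value}`);
  }
  return Number(value);
}

// Group rows by modern character, preserving first-seen order.
function groupByModernChar(rows) {
  const groups = new Map();
  for (const row of rows) {
    if (!groups.has(row.modernChar)) groups.set(row.modernChar, []);
    groups.get(row.modernChar).push(row);
  }
  return groups;
}

// Keep the first `limit` oracle groups (assumed selection rule), then append
// EVOBC rows only for characters that are already present.
function buildRows(oracleRows, evobcRows, limit) {
  const selected = new Map([...groupByModernChar(oracleRows)].slice(0, limit));
  for (const row of evobcRows) {
    const bucket = selected.get(row.modernChar);
    if (bucket) bucket.push(row);
  }
  return selected;
}

// Serialize grouped rows as NDJSON: one JSON object per line.
function toNdjson(groups) {
  const lines = [...groups.values()].flat().map((row) => JSON.stringify(row));
  return lines.join("\n") + "\n";
}
```

The output of `toNdjson` is what a step like this would write to `data/raw/evolution-rows.ndjson`; EVOBC rows for characters outside the selected set are dropped rather than creating new records.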
Lexical Metadata Status
- `meaning` records populated: 758
- `pinyin` records populated: 793
- Source: Unihan `kDefinition` + `kMandarin`.
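A minimal sketch of the Unihan enrichment step is shown below. It relies on the Unihan data files being tab-separated lines of the form `U+XXXX<TAB>field<TAB>value` with `#` comment lines; the function name `parseUnihan` and the shape of the returned entries are assumptions.

```javascript
// Parse Unihan text and collect kDefinition / kMandarin per codepoint.
// Returns a Map from "U+XXXX" to { meaning?, pinyin? }.
function parseUnihan(text) {
  const byCodepoint = new Map();
  for (const line of text.split(/\r?\n/)) {
    if (line === "" || line.startsWith("#")) continue; // skip blanks and comments
    const [codepoint, field, value] = line.split("\t");
    if (field !== "kDefinition" && field !== "kMandarin") continue;
    const entry = byCodepoint.get(codepoint) || {};
    if (field === "kDefinition") {
      entry.meaning = value;
    } else {
      entry.pinyin = value; // may hold more than one reading in some entries
    }
    byCodepoint.set(codepoint, entry);
  }
  return byCodepoint;
}
```

A character with a `kMandarin` line but no `kDefinition` line ends up with `pinyin` only, which is one way the two populated counts above can differ.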
Historical Origin Metadata
- `origin.summary` records populated: 483
- Source extractor: English Wiktionary Chinese Glyph-origin section.
- Stored with source URL, license label, and confidence score.
- Every record also carries direct Xiaoxue + CharDB + Wiktionary lookup links.
Asset Strategy
- Bronze, seal, and oracle glyphs are rendered from static SVG assets.
- Asset URLs are cache-busted during the data build with a token derived from the file's mtime (`?v=...`), so browsers do not serve stale assets.
- Modern stage uses Unicode text rendering.
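The mtime-based cache busting could look roughly like this. The base-36 token format and the function name `withCacheToken` are assumptions; only the `?v=...` query parameter comes from the notes above.

```javascript
// Append a `v` token derived from the asset file's modification time.
// `mtimeMs` is a millisecond timestamp; in Node it could come from
// fs.statSync(filePath).mtimeMs during the data build.
function withCacheToken(assetUrl, mtimeMs) {
  const token = Math.floor(mtimeMs).toString(36); // assumed token format
  const separator = assetUrl.includes("?") ? "&" : "?";
  return `${assetUrl}${separator}v=${token}`;
}
```

Because the token changes only when the file is rewritten, unchanged SVGs keep the same URL between builds and stay cached.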
Search Behavior
- Primary matching: modern character and modern Unicode codepoint.
- Secondary scoring: record IDs and glyph IDs.
- Pinyin query mode: exact full-pinyin match, tone-insensitive.
- English meaning is not used as a search key.
- Search runs on IDs/metadata, not rendered oracle glyph strings.
- Origin summary text is currently display-only (not used as search key).
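The tone-insensitive, exact full-pinyin match can be sketched as follows. This is one possible normalization, not necessarily the app's: it strips the four tone diacritics and tone digits but keeps `ü` distinct from `u`. The function names are hypothetical.

```javascript
// Combining marks used for Mandarin tones: grave, acute, macron, caron.
const TONE_MARKS = /[\u0300\u0301\u0304\u030c]/g;

// Lower-case, remove tone marks and tone digits, and keep the diaeresis
// so that "lü" and "lu" remain different syllables.
function normalizePinyin(input) {
  return input
    .trim()
    .toLowerCase()
    .normalize("NFD")       // split base letters from combining marks
    .replace(TONE_MARKS, "")
    .normalize("NFC")       // recompose u + diaeresis into ü
    .replace(/[1-5]/g, ""); // drop numeric tone suffixes such as "ma3"
}

// Exact full-syllable comparison after normalization; no prefix matching.
function pinyinMatches(query, recordPinyin) {
  return normalizePinyin(query) === normalizePinyin(recordPinyin);
}
```

With this sketch a query of `ma` or `ma3` matches a record whose pinyin is `mǎ`, while the partial query `m` does not.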
Deep Links
- URL state is shareable via query params: `q`, `char`, `theme`, `lang`, `variants`.
- `variants` supports a comma list of open variant drawers, e.g. `bronze,oracle`.
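Reading and writing this URL state might look like the sketch below, using the standard `URLSearchParams` API. The function names, the empty-string defaults, and the filtering of `variants` down to `bronze`, `seal`, and `oracle` are assumptions.

```javascript
// Drawer names accepted in the `variants` param (assumed to mirror the stage keys).
const KNOWN_STAGES = ["bronze", "seal", "oracle"];

// Parse a query string such as "?char=%E9%A6%AC&variants=bronze,oracle".
function parseDeepLink(search) {
  const params = new URLSearchParams(search);
  const variants = (params.get("variants") || "")
    .split(",")
    .map((name) => name.trim())
    .filter((name) => KNOWN_STAGES.includes(name));
  return {
    q: params.get("q") || "",
    char: params.get("char") || "",
    theme: params.get("theme") || "",
    lang: params.get("lang") || "",
    variants,
  };
}

// Build a shareable query string, omitting empty values.
function buildDeepLink(state) {
  const params = new URLSearchParams();
  for (const key of ["q", "char", "theme", "lang"]) {
    if (state[key]) params.set(key, state[key]);
  }
  if (state.variants && state.variants.length > 0) {
    params.set("variants", state.variants.join(","));
  }
  return params.toString();
}
```

Unknown drawer names are dropped on parse, so a hand-edited link cannot put the UI into an undefined state.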
Commands
npm run data:import:multistage:full -- --skip-download --evobc-image-root=/path/to/Data-EN
npm run data:import:multistage:full
npm run data:import:multistage
npm run data:import:full
npm run data:import:e2e
npm run data:enrich:origins
npm run data:build
npm run dev