Commit Graph

3 Commits

Author SHA1 Message Date
H1K0 4aead2ea20 feat: remove BAR token; bump spec to v2.3; fix max_seq_len
Bar boundaries are now implicit — the detokenizer counts positions per bar
using TIME × SUB, and the generator gates EOS to bar boundaries only.
Removing the deterministic BAR token reduces vocab size from 85 to 84 and
lets the model focus on meaningful predictions.

- src/tokenizer.py: drop BAR from VOCAB (85→84); replace BAR-based
  detokenize_to_period with position-counting logic; add write_chord_file;
  fix _tokens_to_symbol for add9/m(add9) qualities
- tests/test_tokenizer.py: update vocab-size assertions to 84, structural
  token test, remove bar-count test, add test_no_bar_token_in_vocab
- docs/chord_format_spec.md: bump to v2.3; document BAR removal in §5.2,
  §5.3, §5.4, §5.5, §5.6, §6.2, and changelog
- CLAUDE.md: remove stale BAR reference, update vocab size to 84
- scripts/pretrain.py: raise max_seq_len 256→320 to cover regenerated
  McGill data (mean=83, max=283 tokens with BAR-free tokenizer)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 13:56:34 +03:00
H1K0 75fa07bf6c docs: add README, architecture, glossary, requirements; update CLAUDE.md
Add four Russian-language project documents:
- README.md: user-facing guide (install, quick start, data prep, training,
  evaluation, limitations)
- docs/architecture.md v1.0: system architecture, data flow diagrams,
  module interfaces, 7 architectural decision records, extension points
- docs/glossary.md v1.0: musical, ML, and project-specific term definitions
- docs/requirements.md v1.0: functional/non-functional requirements,
  acceptance criteria, four use-case scenarios

Update CLAUDE.md with project name etymology (hamori / ハモリ) and rename
repo root reference from chord-gen to hamori. Refine chord_format_spec.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 11:00:21 +03:00
H1K0 8672c10f78 chore: initialize project scaffold
Add .gitignore (excludes .claude/, venv, checkpoints, processed data,
external corpora), .gitattributes (LF normalization, binary markers),
full directory tree with .gitkeep placeholders, and src __init__ stubs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 10:28:17 +03:00