Commit Graph

5 Commits

Author SHA1 Message Date
H1K0 1a63b8e4d8 fix: raise ChordFormatError when detokenize produces empty bars
A sequence of only metadata tokens followed by EOS would silently return a
ChordPeriod with bars=[], which would later crash or produce an empty .chord
file. Now raises immediately with a descriptive message. Added a failing-then-
passing test to cover this path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 15:00:45 +03:00
H1K0 4aead2ea20 feat: remove BAR token; bump spec to v2.3; fix max_seq_len
Bar boundaries are now implicit — the detokenizer counts positions per bar
using TIME × SUB, and the generator gates EOS to bar boundaries only.
Removing the deterministic BAR token reduces vocab size from 85 to 84 and
lets the model focus on meaningful predictions.

- src/tokenizer.py: drop BAR from VOCAB (85→84); replace BAR-based
  detokenize_to_period with position-counting logic; add write_chord_file;
  fix _tokens_to_symbol for add9/m(add9) qualities
- tests/test_tokenizer.py: update vocab-size assertions to 84, structural
  token test, remove bar-count test, add test_no_bar_token_in_vocab
- docs/chord_format_spec.md: bump to v2.3; document BAR removal in §5.2,
  §5.3, §5.4, §5.5, §5.6, §6.2, and changelog
- CLAUDE.md: remove stale BAR reference, update vocab size to 84
- scripts/pretrain.py: raise max_seq_len 256→320 to cover regenerated
  McGill data (mean=83, max=283 tokens with BAR-free tokenizer)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 13:56:34 +03:00
H1K0 3cd9c29d9f feat: extend time signature support to 9 metres (5/4, 7/4, 7/8, 9/8)
Add 5/4, 7/4, 7/8, 9/8 to _VALID_TIMES and VOCAB (TIME_* tokens).
Vocab size grows from 81 to 85 tokens. _parse_metre in the McGill
converter assigns subdivision=8 to 7/8 and 9/8. Spec bumped to v2.2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 00:37:05 +03:00
H1K0 4fd8ece170 refactor: replace fixed STYLE_user with open-ended style tag system
- STYLE_user renamed to STYLE_H1K0 in VOCAB (author's personal tag)
- Style field now accepts any [A-Za-z][A-Za-z0-9_]* identifier in .chord files
- Unknown styles fall back to STYLE_other at tokenization time with a log warning
- Test fixtures updated to style: other; drop closed _VALID_STYLES frozenset
- Spec bumped to v2.1: documents open style field, fallback behaviour, and §5.7
  guide on registering a new style token

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 00:29:52 +03:00
H1K0 868af4ac42 feat: add vocabulary constants and tokenize/detokenize to tokenizer.py
Adds VOCAB (81 tokens), TOKEN_TO_ID, and ID_TO_TOKEN per spec §5.2.
tokenize_period() transposes to C/Am then emits BOS + metadata tokens +
per-bar chord/HOLD/NC tokens + BAR + EOS.  detokenize_to_period() is the
exact inverse, returning a ChordPeriod in canonical key.  The m(add9)
quality maps to QUAL_m_add9 in the vocab (parentheses not valid in token
names) via _qual_token/_token_qual helpers.

36 new tests cover vocabulary integrity, token sequence structure,
and full round-trip fidelity for all four valid fixture files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 15:47:28 +03:00