feat: remove BAR token; bump spec to v2.3; fix max_seq_len
Bar boundaries are now implicit: the detokenizer counts positions per bar
using TIME × SUB, and the generator gates EOS to bar boundaries only.
Removing the deterministic BAR token reduces vocab size from 85 to 84 and
lets the model focus on meaningful predictions.

- src/tokenizer.py: drop BAR from VOCAB (85→84); replace BAR-based
  detokenize_to_period with position-counting logic; add write_chord_file;
  fix _tokens_to_symbol for add9/m(add9) qualities
- tests/test_tokenizer.py: update vocab-size assertions to 84, structural
  token test, remove bar-count test, add test_no_bar_token_in_vocab
- docs/chord_format_spec.md: bump to v2.3; document BAR removal in §5.2,
  §5.3, §5.4, §5.5, §5.6, §6.2, and changelog
- CLAUDE.md: remove stale BAR reference, update vocab size to 84
- scripts/pretrain.py: raise max_seq_len 256→320 to cover regenerated
  McGill data (mean=83, max=283 tokens with BAR-free tokenizer)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
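The position-counting idea in the message above can be sketched roughly as follows. This is an illustration only, not the code in src/tokenizer.py: it assumes TIME is the number of beats per bar, SUB is the number of slots per beat, and every slot is one item in a flat sequence. The helper names `positions_per_bar`, `split_into_bars`, and `eos_allowed` are hypothetical.

```python
from typing import List, Sequence


def positions_per_bar(time: int, sub: int) -> int:
    """Slots in one bar. ASSUMPTION: TIME = beats per bar, SUB = slots per beat."""
    if time <= 0 or sub <= 0:
        raise ValueError("time and sub must be positive")
    return time * sub


def split_into_bars(slots: Sequence[str], time: int, sub: int) -> List[List[str]]:
    """Group a flat slot sequence into bars by counting, with no BAR token."""
    n = positions_per_bar(time, sub)
    return [list(slots[i:i + n]) for i in range(0, len(slots), n)]


def eos_allowed(num_slots_emitted: int, time: int, sub: int) -> bool:
    """Hypothetical EOS gate: allow EOS only when the emitted slots fill whole bars."""
    return num_slots_emitted % positions_per_bar(time, sub) == 0


# Two bars of 2 beats x 2 slots each, recovered without any separator token.
bars = split_into_bars(["C", "C", "F", "F", "G", "G", "C", "C"], time=2, sub=2)
```

Under these assumptions a separator token carries no information, since its position is fully determined by TIME × SUB, which is the motivation the commit message gives for dropping BAR.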
+3
-3
@@ -56,9 +56,9 @@ TRAIN_CFG = TrainConfig(
     warmup_steps=200,
     seed=42,
     device="auto",
-    # Real McGill sequences are ≤ 195 tokens (p95 = 146, mean = 92).
-    # Using 256 instead of the 512 default cuts attention cost ~4x.
-    max_seq_len=256,
+    # Regenerated McGill sequences: mean=83, max=283 (BAR-free tokenizer).
+    # 320 covers the full distribution with headroom; still ~2.5x cheaper than 512.
+    max_seq_len=320,
 )
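The "~4x" and "~2.5x" figures in the comments are consistent with self-attention cost growing with the square of sequence length. A quick check, assuming that quadratic model and ignoring the terms that scale linearly (projections, MLP); the helper name `attention_cost_ratio` is hypothetical and not part of scripts/pretrain.py:

```python
def attention_cost_ratio(baseline_len: int, new_len: int) -> float:
    """How many times cheaper attention is at new_len than at baseline_len.

    ASSUMPTION: cost is proportional to seq_len ** 2 (standard full attention).
    """
    if baseline_len <= 0 or new_len <= 0:
        raise ValueError("sequence lengths must be positive")
    return (baseline_len / new_len) ** 2


old_saving = attention_cost_ratio(512, 256)  # the previous setting
new_saving = attention_cost_ratio(512, 320)  # the setting in this commit
headroom = 320 - 283                         # tokens above the reported max
```

With these numbers the old setting is 4x cheaper than 512 and the new one about 2.56x, while leaving 37 tokens of headroom above the longest regenerated sequence.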