Bar boundaries are now implicit: the detokenizer counts positions per bar
using TIME × SUB, and the generator gates EOS to bar boundaries only.
Removing the deterministic BAR token reduces vocab size from 85 to 84 and
spares the model from predicting a token whose position is already fixed.
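
A minimal sketch of the position-counting idea, not the code in
src/tokenizer.py. It assumes TIME is beats per bar, SUB is subdivisions per
beat, and that each list element is one already-decoded grid position; the
function names are hypothetical.

```python
from typing import List


def positions_per_bar(time: int, sub: int) -> int:
    """Grid positions in one bar (assumed: beats per bar x subdivisions per beat)."""
    return time * sub


def split_into_bars(positions: List[str], time: int, sub: int) -> List[List[str]]:
    """Group a flat list of positions into bars without any explicit BAR marker."""
    n = positions_per_bar(time, sub)
    return [positions[i:i + n] for i in range(0, len(positions), n)]


def eos_allowed(num_positions: int, time: int, sub: int) -> bool:
    """EOS is permitted only when the emitted positions fill whole bars.

    In this sketch an empty sequence (0 positions) also counts as a boundary.
    """
    return num_positions % positions_per_bar(time, sub) == 0
```

With TIME=4 and SUB=2 each bar holds 8 positions, so a generator using this
rule could end a sequence after 8 or 16 positions but not after 12.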
- src/tokenizer.py: drop BAR from VOCAB (85→84); replace BAR-based
detokenize_to_period with position-counting logic; add write_chord_file;
fix _tokens_to_symbol for add9/m(add9) qualities
- tests/test_tokenizer.py: update vocab-size assertions to 84 and the
structural token test; remove bar-count test; add
test_no_bar_token_in_vocab
- docs/chord_format_spec.md: bump to v2.3; document BAR removal in §5.2,
§5.3, §5.4, §5.5, §5.6, §6.2, and changelog
- CLAUDE.md: remove stale BAR reference, update vocab size to 84
- scripts/pretrain.py: raise max_seq_len 256→320 to cover regenerated
McGill data (mean=83, max=283 tokens with BAR-free tokenizer)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pretrain.py -> checkpoints/pretrained.report.txt
train.py -> checkpoints/finetuned.report.txt
A single-line "[report] saved -> <path>" message is printed to stdout instead.
Also fix an arrow character that is incompatible with the Windows cp1251 console.
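
A rough sketch of this behavior, assuming the report path is derived from
the checkpoint path by swapping the suffix. save_report and console_safe
are hypothetical names, not functions from the scripts.

```python
from pathlib import Path


def console_safe(text: str, encoding: str = "cp1251") -> bool:
    """Return True if the text can be encoded for the given console code page."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def save_report(report_text: str, checkpoint_path: str) -> Path:
    """Write the full report next to the checkpoint and print one short line."""
    # e.g. checkpoints/pretrained.pt -> checkpoints/pretrained.report.txt
    report_path = Path(checkpoint_path).with_suffix(".report.txt")
    report_path.write_text(report_text, encoding="utf-8")
    # ASCII "->" rather than U+2192, which cp1251 cannot encode.
    print(f"[report] saved -> {report_path}")
    return report_path
```

Writing the file as UTF-8 keeps the report contents independent of the
console code page; only the one-line stdout message has to be console-safe.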
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- scripts/run_pretrain.py -> scripts/pretrain.py: pre-trains on McGill
corpus (data/processed/mcgill/), saves checkpoints/pretrained.pt.
- scripts/train.py: rewritten as high-level fine-tune wrapper; loads
pretrained.pt, trains on data/processed/user/, saves finetuned.pt.
Both scripts include a timing estimate, a loss-curve plot, a per-epoch
report, and a --skip-training flag.
- README: update section 7 to reflect new script names and separate
data directories.
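
For orientation, a sketch of how the fine-tune wrapper's command line might
be declared. Only --skip-training and the default paths come from this
commit; the other option names (--pretrained, --data-dir, --output) are
assumptions for illustration, and what the scripts do when --skip-training
is set is not shown here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical argument parser for a fine-tune wrapper like scripts/train.py."""
    parser = argparse.ArgumentParser(
        description="Fine-tune from a pretrained checkpoint (sketch)."
    )
    # Assumed option names; the defaults are the paths named in the commit.
    parser.add_argument("--pretrained", default="checkpoints/pretrained.pt")
    parser.add_argument("--data-dir", default="data/processed/user/")
    parser.add_argument("--output", default="checkpoints/finetuned.pt")
    # Flag named in the commit; stored as a boolean that defaults to False.
    parser.add_argument("--skip-training", action="store_true")
    return parser
```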
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>