Bar boundaries are now implicit — the detokenizer counts positions per bar
using TIME × SUB, and the generator gates EOS to bar boundaries only.
Removing the deterministic BAR token reduces vocab size from 85 to 84 and
lets the model focus on meaningful predictions.
- src/tokenizer.py: drop BAR from VOCAB (85→84); replace BAR-based
detokenize_to_period with position-counting logic; add write_chord_file;
fix _tokens_to_symbol for add9/m(add9) qualities
- tests/test_tokenizer.py: update vocab-size assertions to 84, structural
token test, remove bar-count test, add test_no_bar_token_in_vocab
- docs/chord_format_spec.md: bump to v2.3; document BAR removal in §5.2,
§5.3, §5.4, §5.5, §5.6, §6.2, and changelog
- CLAUDE.md: remove stale BAR reference, update vocab size to 84
- scripts/pretrain.py: raise max_seq_len 256→320 to cover regenerated
McGill data (mean=83, max=283 tokens with BAR-free tokenizer)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
README:
- processed/ tree now shows mcgill/ and user/ subdirs
- --style user -> --style H1K0 in quick-start prefix example
- pretrained.report.txt and finetuned.report.txt added to artifact tables
architecture.md (-> v1.1):
- remove stale music21 fallback mention from chord_parser section
- fix ChordDataset: on-demand loading, not eager; remove non-existent
make_dataloader from public interface
- fix train function name: train_model -> train
- update logging description: report goes to .report.txt, not stdout
- note that scripts use max_seq_len=256 (sequences top out at 195 tokens)
requirements.md (-> v1.1):
- FT-12: update from unified script to pretrain.py + train.py pair
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the YYYY_NNN_kebab-case scheme with title_in_snake_case-function.chord.
Snake_case makes the title double-click-selectable; dash unambiguously
separates the title from the optional function suffix.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add 5/4, 7/4, 7/8, 9/8 to _VALID_TIMES and VOCAB (TIME_* tokens).
Vocab size grows from 81 to 85 tokens. _parse_metre in the McGill
converter assigns subdivision=8 to 7/8 and 9/8. Spec bumped to v2.2.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- STYLE_user renamed to STYLE_H1K0 in VOCAB (author's personal tag)
- Style field now accepts any [A-Za-z][A-Za-z0-9_]* identifier in .chord files
- Unknown styles fall back to STYLE_other at tokenization time with a log warning
- Test fixtures updated to style: other; drop closed _VALID_STYLES frozenset
- Spec bumped to v2.1: documents open style field, fallback behaviour, and §5.7
guide on registering a new style token
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Six factual corrections: slash chord definition (inversions vs on-chords),
logits described as unnormalized, #11 = eleventh not fourth, duplicate
Polukadentsiya sentence removed, pre-training LR corrected to 1e-4,
unverified PAD index claim removed. All ML term explanations rewritten
with analogies accessible to a near-beginner.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Revise and expand the authoritative .chord format specification:
version bump 1.0 → 2.0, clarified period scope, updated tokenization
rules, grammar tables, and key normalization details.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Authoritative .chord file format spec covering header fields, body
syntax, chord symbol grammar, tokenization rules, and key normalization.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>