Add 5/4, 7/4, 7/8, 9/8 to _VALID_TIMES and VOCAB (TIME_* tokens).
Vocab size grows from 81 to 85 tokens. _parse_metre in the McGill
converter assigns subdivision=8 to 7/8 and 9/8. Spec bumped to v2.2.
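
For illustration only, a minimal sketch of what this change amounts to. The
names NEW_TIMES, VALID_TIMES, time_token and parse_metre, the TIME_7_8 token
spelling, and the fallback subdivision of 4 are assumptions made for this
example; they are not copied from the project's code.

```python
# Illustrative sketch: names and the token spelling are assumptions.
NEW_TIMES = ("5/4", "7/4", "7/8", "9/8")

# Pre-existing entries are omitted; only the four additions are shown.
VALID_TIMES = set(NEW_TIMES)


def time_token(metre: str) -> str:
    """Build a TIME_* token name, e.g. '7/8' -> 'TIME_7_8' (assumed spelling)."""
    numerator, denominator = metre.split("/")
    return f"TIME_{numerator}_{denominator}"


def parse_metre(metre: str) -> tuple[str, int]:
    """Return (metre, subdivision); 7/8 and 9/8 get subdivision 8.

    The fallback of 4 for every other metre is an assumption of this sketch.
    """
    if metre not in VALID_TIMES:
        raise ValueError(f"unsupported metre: {metre!r}")
    subdivision = 8 if metre in ("7/8", "9/8") else 4
    return metre, subdivision


# Four new tokens, which matches the vocab growth from 81 to 85.
new_tokens = [time_token(m) for m in NEW_TIMES]
```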

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- src/dataset.py: ChordDataset wrapping .pt files with pad/truncate
- scripts/prepare_data.py: tokenize .chord files to .pt with a
  train/val/holdout split; log token length stats and style/function
  distributions
- src/external_converters/mcgill_to_chord.py: rewrite parser for real
  McGill v2 format (2-column annotation, each bar in its own pipe group,
  interval bass notation, e.g. /5 and /b3)
- .gitignore: exclude the train, val, and holdout subdirectories of
  data/processed
- tests: 37 new tests for ChordDataset and the converter (260 total, all
  pass)
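
As a rough illustration of the interval bass notation mentioned above, the
sketch below resolves a bass degree such as /5 or /b3 to a pitch class,
reading the degree against the major scale on the chord root. The helper
names (note_to_pitch_class, degree_to_semitones, bass_pitch_class) are
hypothetical, and the converter may represent bass notes differently.

```python
import re

# Semitone offsets of major-scale degrees 1..7 above the root.
MAJOR_SCALE_SEMITONES = {1: 0, 2: 2, 3: 4, 4: 5, 5: 7, 6: 9, 7: 11}
NATURAL_PITCH_CLASSES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
_DEGREE_RE = re.compile(r"^([b#]*)([1-9]\d*)$")


def note_to_pitch_class(note: str) -> int:
    """Map a note name such as 'Bb' or 'F#' to a pitch class in 0..11."""
    pitch_class = NATURAL_PITCH_CLASSES[note[0]]
    for accidental in note[1:]:
        if accidental == "#":
            pitch_class += 1
        elif accidental == "b":
            pitch_class -= 1
        else:
            raise ValueError(f"bad accidental in note: {note!r}")
    return pitch_class % 12


def degree_to_semitones(degree: str) -> int:
    """Map a degree such as '5' or 'b3' to semitones above the root."""
    match = _DEGREE_RE.match(degree)
    if match is None:
        raise ValueError(f"bad interval degree: {degree!r}")
    accidentals, number = match.group(1), int(match.group(2))
    octaves, step = divmod(number - 1, 7)  # degrees above 7 wrap by octave
    semitones = MAJOR_SCALE_SEMITONES[step + 1] + 12 * octaves
    return semitones + accidentals.count("#") - accidentals.count("b")


def bass_pitch_class(root: str, degree: str) -> int:
    """Pitch class of the bass note for a root and an interval degree."""
    return (note_to_pitch_class(root) + degree_to_semitones(degree)) % 12
```

Under these assumptions, /5 on a C root gives pitch class 7 (G) and /b3 on an
A root gives pitch class 0 (C).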

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds src/external_converters/mcgill_to_chord.py with two public functions:
- convert_song(song_dir, output_dir): converts one salami_chords.txt into
  per-section .chord files (4–16 bars each, style=other)
- convert_dataset(dataset_dir, output_dir): batch converts all songs

Key decisions:
- Harte qualities mapped to our 18-quality vocabulary; hdim7 → m7b5,
  parenthetical alterations (e.g. 7(b9)) handled via regex
- Bar duration estimated from median non-trivial chord duration
- Mode (major/minor) inferred from tonic chord quality distribution
- Sections with <4 or >16 bars are skipped with a logged reason
- Unrecognized Harte chords skip the whole section (no silent corruption)
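
A minimal sketch of the quality mapping and the skip-on-unknown rule, for
illustration. Only the hdim7 → m7b5 entry comes from this commit; the other
QUALITY_MAP entries, the target quality spellings, the regex, and the
function names convert_chord and convert_section are assumptions, and the
real converter covers the full 18-quality vocabulary.

```python
import re

# Only hdim7 -> m7b5 is stated in the commit; the other entries are assumed.
QUALITY_MAP = {"maj": "maj", "min": "m", "7": "7", "hdim7": "m7b5"}

# root[:quality][(alterations)][/bass], e.g. "G:7(b9)" or "C:maj/5".
_HARTE_RE = re.compile(
    r"^(?P<root>[A-G][b#]*)"
    r"(?::(?P<quality>[A-Za-z0-9]+))?"
    r"(?:\((?P<alterations>[^)]*)\))?"
    r"(?:/(?P<bass>[b#]*\d+))?$"
)


def convert_chord(label):
    """Return (root, quality, alterations), or None if unrecognized."""
    match = _HARTE_RE.match(label)
    if match is None:
        return None
    # A bare root with no quality is treated as major in this sketch.
    quality = QUALITY_MAP.get(match["quality"] or "maj")
    if quality is None:
        return None
    raw = match["alterations"] or ""
    alterations = tuple(part.strip() for part in raw.split(",") if part.strip())
    return match["root"], quality, alterations


def convert_section(labels):
    """Convert every chord, or return None so the caller skips the section."""
    converted = []
    for label in labels:
        chord = convert_chord(label)
        if chord is None:
            return None  # one unknown chord rejects the whole section
        converted.append(chord)
    return converted
```

Returning None for the whole section, rather than dropping the single
unknown chord, is what keeps a partially converted section from being
written out.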
48 new tests in tests/test_mcgill_converter.py; the full suite now has
223 passing tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>