2 Commits

Author SHA1 Message Date
H1K0 84ba7b4743 feat: add dataset, prepare_data pipeline and fix McGill converter
- src/dataset.py: ChordDataset wrapping .pt files with pad/truncate
- scripts/prepare_data.py: tokenizes .chord files to .pt with a
  train/val/holdout split; logs token length stats and style/function
  distributions
- src/external_converters/mcgill_to_chord.py: rewrite parser for real
  McGill v2 format (2-column annotation, each bar in its own pipe group,
  interval bass notation e.g. /5 and /b3)
- .gitignore: exclude data/processed/train, val, holdout subdirectories
- tests: 37 new tests for ChordDataset and converter (260 total, all pass)
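
For readers unfamiliar with interval bass notation: in Harte syntax the part after the slash is a scale degree measured from the chord root, not a note name, so C:maj/5 has G in the bass and A:min/b3 has C. The sketch below shows one way to resolve such an interval to a note name. It is an illustration only, not the code in mcgill_to_chord.py; the helper names (note_to_pitch_class, resolve_bass) are hypothetical, and spelling every result with flats is a simplifying assumption.

```python
# Hypothetical helpers, not taken from mcgill_to_chord.py.
# Semitone offsets of major-scale degrees 1..7 above the root.
MAJOR_SCALE_SEMITONES = {1: 0, 2: 2, 3: 4, 4: 5, 5: 7, 6: 9, 7: 11}
NOTE_TO_PC = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
# Assumption: results are always spelled with flats, never sharps.
PC_TO_NOTE = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]


def note_to_pitch_class(note: str) -> int:
    """Convert a note name such as 'C', 'F#' or 'Bb' to a pitch class 0..11."""
    pc = NOTE_TO_PC[note[0]]
    for accidental in note[1:]:
        if accidental == "#":
            pc += 1
        elif accidental == "b":
            pc -= 1
        else:
            raise ValueError(f"bad accidental in note: {note!r}")
    return pc % 12


def resolve_bass(root: str, interval: str) -> str:
    """Resolve a bass interval such as '5' or 'b3' relative to the chord root."""
    # Leading accidentals raise or lower the major-scale degree by a semitone.
    shift = interval.count("#") - interval.count("b")
    degree = int(interval.lstrip("#b"))
    if degree not in MAJOR_SCALE_SEMITONES:
        raise ValueError(f"unsupported bass degree: {interval!r}")
    pc = (note_to_pitch_class(root) + MAJOR_SCALE_SEMITONES[degree] + shift) % 12
    return PC_TO_NOTE[pc]


if __name__ == "__main__":
    print(resolve_bass("C", "5"))   # bass of C:maj/5
    print(resolve_bass("A", "b3"))  # bass of A:min/b3
```

Only degrees 1 to 7 are handled here; compound intervals such as 9 would need an extra octave reduction.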

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 18:09:46 +03:00
H1K0 ea32bf43b2 feat: implement McGill Billboard converter (Harte → .chord)
Adds src/external_converters/mcgill_to_chord.py with two public functions:
  - convert_song(song_dir, output_dir) — converts one salami_chords.txt to
    per-section .chord files (4–16 bars each, style=other)
  - convert_dataset(dataset_dir, output_dir) — batch converts all songs

Key decisions:
  - Harte qualities mapped to our 18-quality vocabulary; hdim7 → m7b5,
    parenthetical alterations (e.g. 7(b9)) handled via regex
  - Bar duration estimated from median non-trivial chord duration
  - Mode (major/minor) inferred from tonic chord quality distribution
  - Sections with <4 or >16 bars are skipped with a logged reason
  - Unrecognized Harte chords skip the whole section (no silent corruption)
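
A minimal sketch of how the quality mapping and the skip-on-unrecognized rule above could fit together. This is not the converter's actual code: the regex, the function names, and every QUALITY_MAP entry except hdim7 → m7b5 are assumptions made for illustration, and the real 18-quality vocabulary is not reproduced here.

```python
import re

# Assumption: only hdim7 -> m7b5 is taken from the commit message.
# The other entries are illustrative placeholders.
QUALITY_MAP = {
    "maj": "maj",
    "min": "m",
    "7": "7",
    "hdim7": "m7b5",
}

# root, optional ":quality", optional "(alterations)", optional "/bass".
HARTE_RE = re.compile(
    r"^(?P<root>[A-G][#b]*)"
    r"(?::(?P<quality>[A-Za-z0-9]+))?"
    r"(?:\((?P<alts>[^)]*)\))?"
    r"(?:/(?P<bass>[#b]*\d+))?$"
)


class UnrecognizedChord(ValueError):
    """Raised when a label cannot be mapped to a known quality."""


def parse_harte(label: str) -> dict:
    """Split one Harte label into root, mapped quality, alterations, bass."""
    m = HARTE_RE.match(label)
    if m is None:
        raise UnrecognizedChord(label)
    quality = m.group("quality") or "maj"  # a bare root means major in Harte
    if quality not in QUALITY_MAP:
        raise UnrecognizedChord(label)
    alts = tuple(a.strip() for a in (m.group("alts") or "").split(",") if a.strip())
    return {
        "root": m.group("root"),
        "quality": QUALITY_MAP[quality],
        "alterations": alts,
        "bass": m.group("bass"),
    }


def convert_section(labels):
    """Return parsed chords, or None so the caller can skip the whole section."""
    try:
        return [parse_harte(label) for label in labels]
    except UnrecognizedChord:
        return None


if __name__ == "__main__":
    print(parse_harte("C:7(b9)"))
    print(convert_section(["C:maj", "C:weird"]))
```

Returning None for the whole section, rather than dropping only the offending chord, mirrors the "no silent corruption" decision: a section is either converted completely or not at all. Labels this sketch does not cover, such as the no-chord symbol N, are treated as unrecognized.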

48 new tests in tests/test_mcgill_converter.py; full suite: 223 passed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 17:04:02 +03:00