feat: add dataset, prepare_data pipeline and fix McGill converter

- src/dataset.py: ChordDataset wrapping .pt files with pad/truncate
- scripts/prepare_data.py: tokenize .chord to .pt with train/val/holdout
  split, logs token length stats and style/function distributions
- src/external_converters/mcgill_to_chord.py: rewrite parser for real
  McGill v2 format (2-column annotation, each bar in its own pipe group,
  interval bass notation e.g. /5 and /b3)
- .gitignore: exclude the train, val and holdout subdirectories of
  data/processed
- tests: 37 new tests for ChordDataset and converter (260 total, all pass)
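
The pad/truncate behaviour mentioned for ChordDataset could look roughly like the sketch below. It is not the code from src/dataset.py: the function name `pad_or_truncate`, the `max_len` parameter and the `pad_id=0` default are assumptions made for illustration, and plain lists stand in for the tensors stored in the .pt files.

```python
from typing import List


def pad_or_truncate(token_ids: List[int], max_len: int, pad_id: int = 0) -> List[int]:
    """Return a sequence of exactly max_len ids.

    Hypothetical helper: sequences longer than max_len are cut on the right,
    shorter ones are right-padded with pad_id (assumed to be 0 here).
    """
    if max_len < 0:
        raise ValueError("max_len must be non-negative")
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


if __name__ == "__main__":
    print(pad_or_truncate([5, 6, 7], 5))        # short input gets padded
    print(pad_or_truncate([5, 6, 7, 8, 9], 2))  # long input gets truncated
```

Fixing the length per item keeps every sample the same shape, so a default batch collation can stack them without a custom collate function.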

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 18:09:46 +03:00
parent ea32bf43b2
commit 84ba7b4743
7 changed files with 876 additions and 314 deletions
+4 -10
@@ -3,13 +3,7 @@
 # metre: 4/4
 # tonic: C
-0.000000 Z
-4.000000 A,verse C:maj
-8.000000 . F:maj
-12.000000 . G:7
-16.000000 . C:maj
-20.000000 B,chorus F:maj
-24.000000 . C:maj
-28.000000 . G:7
-32.000000 . C:maj
-36.000000 Z
+0.000000 silence
+4.000000 A, verse, | C:maj | F:maj | G:7 | C:maj |
+20.000000 B, chorus, | F:maj | C:maj | G:7 | C:maj |
+36.000000 silence
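
For readers unfamiliar with the two-column layout in the updated fixture, the following is a minimal sketch of how one data line might be split into a timestamp, section labels and per-bar chord groups. It is not the parser in src/external_converters/mcgill_to_chord.py: the names `AnnotationLine` and `parse_annotation_line` are hypothetical, and the sketch assumes the timestamp is separated from the annotation by whitespace and that header lines starting with `#` are filtered out beforehand.

```python
from typing import List, NamedTuple


class AnnotationLine(NamedTuple):
    time: float
    labels: List[str]      # e.g. ["A", "verse"] or ["silence"]
    bars: List[List[str]]  # one inner list of chord symbols per pipe group


def parse_annotation_line(line: str) -> AnnotationLine:
    """Split '<time> <annotation>' into its parts (illustrative only)."""
    parts = line.strip().split(None, 1)
    time = float(parts[0])
    annotation = parts[1] if len(parts) > 1 else ""

    # Everything before the first pipe is the comma-separated label column.
    head, _, rest = annotation.partition("|")
    labels = [token.strip() for token in head.split(",") if token.strip()]

    # Each non-empty pipe group is treated as one bar of chord symbols.
    bars = [group.split() for group in rest.split("|") if group.strip()]
    return AnnotationLine(time, labels, bars)


if __name__ == "__main__":
    print(parse_annotation_line("4.000000 A, verse, | C:maj | F:maj | G:7 | C:maj |"))
    print(parse_annotation_line("0.000000 silence"))
```

Lines without any pipe group, such as `silence`, yield an empty `bars` list, so a caller can tell them apart from sections that carry chords.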