feat: add dataset, prepare_data pipeline and fix McGill converter
- src/dataset.py: ChordDataset wrapping .pt files with pad/truncate
- scripts/prepare_data.py: tokenize .chord to .pt with train/val/holdout split, logs token length stats and style/function distributions
- src/external_converters/mcgill_to_chord.py: rewrite parser for real McGill v2 format (2-column annotation, each bar in its own pipe group, interval bass notation e.g. /5 and /b3)
- .gitignore: exclude data/processed/train, val, holdout subdirectories
- tests: 37 new tests for ChordDataset and converter (260 total, all pass)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
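
For readers unfamiliar with the pad/truncate step mentioned for ChordDataset, here is a minimal sketch of the idea. It is not the code from src/dataset.py, which is not shown in this commit view: the function name `pad_or_truncate`, the `PAD_ID` value, and the choice to pad on the right are all assumptions made for illustration.

```python
from typing import List

# Hypothetical padding token id; the real value used by the tokenizer
# is not visible in this commit.
PAD_ID = 0


def pad_or_truncate(tokens: List[int], max_len: int, pad_id: int = PAD_ID) -> List[int]:
    """Return a list of exactly max_len token ids.

    Sequences longer than max_len lose their tail; shorter ones are
    right-padded with pad_id. (Right-padding is an assumption.)
    """
    if max_len < 0:
        raise ValueError("max_len must be non-negative")
    clipped = tokens[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))
```

A fixed length per item lets a default batch collator stack examples without a custom collate function, at the cost of some wasted padding on short sequences.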
@@ -35,6 +35,9 @@ checkpoints/*.ckpt
 # Processed data (reproducible from source)
 data/processed/*.pt
 data/processed/*.pkl
+data/processed/train/
+data/processed/val/
+data/processed/holdout/

 # External corpora (download separately; too large for git)
 data/raw_external/