scripts/make_colab_zip.py packages src/, scripts/pretrain.py,
requirements.txt, and processed .pt files into hamori_colab.zip,
remapping data/processed/{train,val}/ -> data/processed/mcgill/{train,val}/
so pretrain.py finds the data without modification.
notebooks/colab_pretrain.ipynb walks the user through upload, extraction,
dependency install, the training run, report display, and results download.
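
The path remapping described above could look roughly like the sketch
below. The helper names remap_arcname and build_zip, and the use of an
in-memory archive, are illustrative assumptions and are not taken from
scripts/make_colab_zip.py.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Assumption: local splits live under data/processed/{train,val}/ and
# pretrain.py expects them under data/processed/mcgill/{train,val}/.
SPLITS = ("train", "val")


def remap_arcname(rel_path: str) -> str:
    """Return the in-archive path for a repo-relative file path."""
    parts = PurePosixPath(rel_path).parts
    is_split = (
        len(parts) >= 3
        and parts[:2] == ("data", "processed")
        and parts[2] in SPLITS
    )
    if is_split:
        # Insert the "mcgill" directory between processed/ and the split name.
        return str(PurePosixPath("data", "processed", "mcgill", *parts[2:]))
    return rel_path


def build_zip(files: dict[str, bytes]) -> bytes:
    """Write {repo-relative path: content} into a zip, remapping data paths."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel_path, content in files.items():
            zf.writestr(remap_arcname(rel_path), content)
    return buf.getvalue()
```

Files outside the two split directories (src/, requirements.txt, and so
on) keep their original paths in this sketch.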
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pretrain.py -> checkpoints/pretrained.report.txt
train.py -> checkpoints/finetuned.report.txt
A single line, [report] saved -> <path>, is printed to stdout instead.
Also fix an arrow character that is incompatible with the Windows cp1251 console.
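
For context on the cp1251 issue: that code page has no arrow glyphs, so
encoding U+2192 raises UnicodeEncodeError. A minimal sketch of an ASCII
fallback follows; the helper name console_safe is hypothetical, and the
commit does not say that the fix was implemented this way.

```python
def console_safe(text: str, encoding: str = "cp1251") -> str:
    """Return text unchanged if the codec can encode it, else use an ASCII arrow."""
    try:
        text.encode(encoding)
        return text
    except UnicodeEncodeError:
        # Only the right arrow (U+2192) is handled in this sketch.
        return text.replace("\u2192", "->")
```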
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- scripts/run_pretrain.py -> scripts/pretrain.py: pre-trains on McGill
corpus (data/processed/mcgill/), saves checkpoints/pretrained.pt.
- scripts/train.py: rewritten as high-level fine-tune wrapper; loads
pretrained.pt, trains on data/processed/user/, saves finetuned.pt.
Both scripts include timing estimate, loss-curve plot, per-epoch report,
and --skip-training flag.
- README: updated section 7 to reflect new script names and separate
data directories.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- scripts/run_pretrain.py: single-command pre-training runner with
timing estimate, loss-curve plot (matplotlib), and per-epoch report.
Sets max_seq_len=256: McGill sequences max out at 195 tokens, and
attention is ~4x faster at 256 than at the 512 default.
- src/train.py: normalise --output so pretrained.pt and pretrained both
produce pretrained.pt + pretrained.log.csv (not pretrained.pt.log.csv).
Serialize Path fields as strings in checkpoint to satisfy weights_only.
- requirements.txt: drop unused pandas/music21, add mido (pretty_midi dep).
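
The --output normalisation described above can be sketched as follows.
The function name normalise_output is an assumption for illustration;
the actual code in src/train.py may differ.

```python
from pathlib import Path


def normalise_output(output: str) -> tuple[Path, Path]:
    """Map 'pretrained' or 'pretrained.pt' to (pretrained.pt, pretrained.log.csv)."""
    base = Path(output)
    if base.suffix == ".pt":
        # Strip the extension so both spellings share the same stem.
        base = base.with_suffix("")
    ckpt = base.with_name(base.name + ".pt")
    log = base.with_name(base.name + ".log.csv")
    return ckpt, log
```

Deriving both file names from the stripped stem is what avoids the
pretrained.pt.log.csv name.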
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AdamW + cosine-with-warmup schedule, PAD-ignoring cross-entropy, per-epoch
CSV logging, best-val-loss checkpointing, early stopping (patience=5).
Same script handles both pre-training and fine-tuning via --init-from.
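
A cosine-with-warmup schedule of the kind named above is commonly
written as a multiplier on the base learning rate. The sketch below
assumes linear warmup and decay to zero; the function name lr_scale and
its parameters are illustrative and not taken from src/train.py.

```python
import math


def lr_scale(step: int, warmup_steps: int, total_steps: int) -> float:
    """Learning-rate multiplier: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to 1 over the warmup phase.
        return step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

With PyTorch, a function like this could be passed to
torch.optim.lr_scheduler.LambdaLR as the lr_lambda argument.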
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- src/dataset.py: ChordDataset wrapping .pt files with pad/truncate
- scripts/prepare_data.py: tokenize .chord files to .pt with a train/val/holdout
split; logs token-length stats and style/function distributions
- src/external_converters/mcgill_to_chord.py: rewrite parser for real
McGill v2 format (2-column annotation, each bar in its own pipe group,
interval bass notation e.g. /5 and /b3)
- .gitignore: exclude data/processed/train, val, holdout subdirectories
- tests: 37 new tests for ChordDataset and converter (260 total, all pass)
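
The pad/truncate step mentioned for ChordDataset can be sketched as
below. PAD_ID = 0 and the function name pad_or_truncate are assumptions
for illustration; the real token id and implementation in
src/dataset.py may differ.

```python
PAD_ID = 0  # assumed padding token id, not confirmed by the commit


def pad_or_truncate(tokens: list[int], max_seq_len: int, pad_id: int = PAD_ID) -> list[int]:
    """Return exactly max_seq_len tokens: clip long inputs, right-pad short ones."""
    clipped = tokens[:max_seq_len]
    return clipped + [pad_id] * (max_seq_len - len(clipped))
```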
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>