generate_period() now accepts n_bars=N to stop after exactly N complete
bars. bars_completed is seeded from the prefix length so --bars counts
the full output, not just the generated tail.
scripts/generate.py exposes this as --bars (default: None = model decides).
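
A minimal sketch of the bar-counting idea, not the actual generate_period()
code. The names generate_bars, prefix_positions, positions_per_bar and
next_position are hypothetical stand-ins; only n_bars and bars_completed
come from the change described above.

```python
from typing import Callable, List


def generate_bars(prefix_positions: int, positions_per_bar: int, n_bars: int,
                  next_position: Callable[[], str]) -> List[str]:
    """Generate positions until the full output holds n_bars complete bars."""
    # Seed the counters from the prefix so n_bars counts prefix + generated tail.
    bars_completed, pos_in_bar = divmod(prefix_positions, positions_per_bar)
    generated: List[str] = []
    while bars_completed < n_bars:
        generated.append(next_position())  # stand-in for one model step
        pos_in_bar += 1
        if pos_in_bar == positions_per_bar:
            bars_completed += 1
            pos_in_bar = 0
    return generated


# A two-bar prefix (8 positions at 4 per bar) leaves two bars to generate.
tail = generate_bars(prefix_positions=8, positions_per_bar=4, n_bars=4,
                     next_position=lambda: ".")
print(len(tail))  # 8
```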
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_encode_prefix now handles hold ('.') and no-chord ('NC') tokens
alongside chord symbols, and returns (ids, n_positions) so that
pos_in_bar is tracked correctly regardless of token type.
Fixes ChordParseError when dots were passed in --prefix.
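
A rough illustration of the (ids, n_positions) return shape, assuming that a
chord symbol expands to several tokens while still occupying one position.
The token ids and the toy chord_to_tokens helper are invented for this
sketch; the real vocabulary and parser live in the repository.

```python
from typing import Dict, List, Tuple

# Hypothetical token ids, for illustration only.
TOKEN_IDS: Dict[str, int] = {
    "HOLD": 1, "NC": 2,
    "ROOT_C": 10, "ROOT_G": 11,
    "QUAL_maj": 20, "QUAL_7": 21,
}


def chord_to_tokens(symbol: str) -> List[str]:
    """Toy chord parser: a root letter plus an optional '7' suffix."""
    root, rest = symbol[0], symbol[1:]
    return [f"ROOT_{root}", "QUAL_7" if rest == "7" else "QUAL_maj"]


def encode_prefix(symbols: List[str]) -> Tuple[List[int], int]:
    """Encode prefix symbols; every symbol counts as exactly one position."""
    ids: List[int] = []
    n_positions = 0
    for symbol in symbols:
        if symbol == ".":
            names = ["HOLD"]        # hold: a single token
        elif symbol == "NC":
            names = ["NC"]          # no-chord: a single token
        else:
            names = chord_to_tokens(symbol)  # chord: several tokens
        ids.extend(TOKEN_IDS[name] for name in names)
        n_positions += 1
    return ids, n_positions


ids, n_positions = encode_prefix(["C", ".", "G7", "NC"])
print(ids, n_positions)  # [10, 20, 1, 11, 21, 2] 4
```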
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
val loss 1.24 → 0.80, val perplexity 3.47 → 2.22.
Best epoch 50 (no early stop); convergence epoch 30.
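
For reference, the two metrics are related if perplexity is taken as the
exponential of the mean cross-entropy in nats (an assumption about how the
report computes it):

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity as exp of the mean cross-entropy loss (natural log base)."""
    return math.exp(mean_cross_entropy)


for loss in (1.24, 0.80):
    print(f"val loss {loss:.2f} -> perplexity {perplexity(loss):.2f}")
```

The rounded losses give about 3.46 and 2.23; the small gap to the reported
3.47 and 2.22 is what one would expect if the report exponentiates the
unrounded losses.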
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Moved data/processed/{train,val,holdout}/ → data/processed/mcgill/{train,val,holdout}/
so both corpora have their own namespace under data/processed/.
Updated PRETRAIN_DATA paths in make_colab_zip.py accordingly
(path remap workaround no longer needed).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
make_colab_zip.py now accepts --mode pretrain|finetune (default: pretrain).
Finetune mode bundles scripts/train.py + data/processed/user/{train,val}/*.pt
plus an optional --include-checkpoint flag for pretrained.pt.
notebooks/colab_finetune.ipynb covers the full Colab fine-tuning workflow:
upload zip → upload pretrained.pt → verify data → train → inspect → download.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace five narrow patterns (*.pt, *.pkl, train/, val/, holdout/) with
a single data/processed/ rule that also covers data/processed/user/.
All processed tensors are reproducible from committed .chord files.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
scripts/train.py: fix max_seq_len 256→320 (must match pretrained checkpoint);
increase epochs 15→50 and patience 5→10 to give the small corpus enough
gradient steps; reduce warmup 20→10 (was 22% of total steps).
scripts/generate.py: default to prepending the tonic chord when --prefix is
not given; add --no-tonic-anchor to opt out.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
src/generate.py: autoregressive generation with top-p sampling, grammar
masking (ROOT→QUAL→EXT→BASS; EOS only at bar boundary), key transposition,
and optional chord prefix. Partial bars on context truncation are padded
with HOLDs rather than discarded.
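
A standard-library sketch of masked top-p (nucleus) sampling, to show how a
grammar mask and top-p can combine in one step. It is not the code in
src/generate.py: sample_top_p and the boolean `allowed` mask are
hypothetical, and the real implementation presumably works on tensors.

```python
import math
import random
from typing import List, Sequence


def sample_top_p(logits: Sequence[float], allowed: Sequence[bool],
                 top_p: float, rng: random.Random) -> int:
    """Sample a token index from the top-p nucleus of the allowed tokens."""
    if not any(allowed):
        raise ValueError("grammar mask rejects every token")
    # Grammar mask: disallowed tokens get -inf, i.e. zero probability.
    masked = [l if ok else -math.inf for l, ok in zip(logits, allowed)]
    peak = max(masked)
    weights = [math.exp(l - peak) for l in masked]  # stable softmax numerator
    total = sum(weights)
    probs = [w / total for w in weights]
    # Nucleus: the most probable tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus: List[int] = []
    cumulative = 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]


rng = random.Random(0)
# Only index 2 is grammatical here, so it is chosen despite its low logit.
print(sample_top_p([5.0, 4.0, 0.1], [False, False, True], 0.9, rng))  # 2
```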
scripts/generate.py: CLI wrapping generate_period — accepts mode, key,
time, subdivision, style, function, prefix, temperature, top-p, seed,
tempo; writes .chord and optional MIDI.
src/tokenizer.py: fix docstring vocab size (81→84); in _tokens_to_symbol,
drop the redundant slash when BASS_<note> equals the root.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pre-training results from a re-run with the corrected 84-token vocabulary
and max_seq_len=320. The previous checkpoint was trained on stale data with
BAR tokens and a corrupted tokenizer.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bar boundaries are now implicit — the detokenizer counts positions per bar
using TIME × SUB, and the generator gates EOS to bar boundaries only.
Removing the deterministic BAR token reduces vocab size from 85 to 84 and
lets the model focus on meaningful predictions.
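
The position counting can be pictured with the sketch below. It takes
positions_per_bar as a given number rather than deriving it from TIME and
SUB, and both helper names are hypothetical; requiring at least one position
before EOS is also an assumption of this sketch.

```python
from typing import List


def split_into_bars(positions: List[str],
                    positions_per_bar: int) -> List[List[str]]:
    """Group a flat list of positions into bars by counting, with no BAR token."""
    return [positions[i:i + positions_per_bar]
            for i in range(0, len(positions), positions_per_bar)]


def eos_allowed(n_positions: int, positions_per_bar: int) -> bool:
    """EOS is only permitted when the last bar is complete."""
    return n_positions > 0 and n_positions % positions_per_bar == 0


positions = ["C", ".", ".", ".", "G7", ".", "C", "."]
print(split_into_bars(positions, 4))
# [['C', '.', '.', '.'], ['G7', '.', 'C', '.']]
print(eos_allowed(8, 4), eos_allowed(6, 4))  # True False
```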
- src/tokenizer.py: drop BAR from VOCAB (85→84); replace BAR-based
detokenize_to_period with position-counting logic; add write_chord_file;
fix _tokens_to_symbol for add9/m(add9) qualities
- tests/test_tokenizer.py: update vocab-size assertions to 84; update the
structural token test; remove bar-count test; add test_no_bar_token_in_vocab
- docs/chord_format_spec.md: bump to v2.3; document BAR removal in §5.2,
§5.3, §5.4, §5.5, §5.6, §6.2, and changelog
- CLAUDE.md: remove stale BAR reference, update vocab size to 84
- scripts/pretrain.py: raise max_seq_len 256→320 to cover regenerated
McGill data (mean=83, max=283 tokens with BAR-free tokenizer)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
scripts/make_colab_zip.py packages src/, scripts/pretrain.py,
requirements.txt, and processed .pt files into hamori_colab.zip,
remapping data/processed/{train,val}/ -> data/processed/mcgill/{train,val}/
so pretrain.py finds the data without modification.
notebooks/colab_pretrain.ipynb guides through upload, extraction,
dependency install, training run, report display, and results download.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
README:
- processed/ tree now shows mcgill/ and user/ subdirs
- --style user -> --style H1K0 in quick-start prefix example
- pretrained.report.txt and finetuned.report.txt added to artifact tables
architecture.md (-> v1.1):
- remove stale music21 fallback mention from chord_parser section
- fix ChordDataset: on-demand loading, not eager; remove non-existent
make_dataloader from public interface
- fix train function name: train_model -> train
- update logging description: report goes to .report.txt, not stdout
- note that scripts use max_seq_len=256 (sequences top out at 195 tokens)
requirements.md (-> v1.1):
- FT-12: update from unified script to pretrain.py + train.py pair
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pretrain.py -> checkpoints/pretrained.report.txt
train.py -> checkpoints/finetuned.report.txt
Only a single line, "[report] saved -> <path>", is printed to stdout instead.
Also fix arrow character incompatible with Windows cp1251 console.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- scripts/run_pretrain.py -> scripts/pretrain.py: pre-trains on McGill
corpus (data/processed/mcgill/), saves checkpoints/pretrained.pt.
- scripts/train.py: rewritten as high-level fine-tune wrapper; loads
pretrained.pt, trains on data/processed/user/, saves finetuned.pt.
Both scripts include timing estimate, loss-curve plot, per-epoch report,
and --skip-training flag.
- README: updated section 7 to reflect new script names and separate
data directories.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- scripts/run_pretrain.py: single-command pre-training runner with
timing estimate, loss-curve plot (matplotlib), and per-epoch report.
Sets max_seq_len=256 (McGill sequences max out at 195 tokens, ~4x
faster attention than the 512 default).
- src/train.py: normalise --output so pretrained.pt and pretrained both
produce pretrained.pt + pretrained.log.csv (not pretrained.pt.log.csv).
Serialize Path fields as strings in checkpoint to satisfy weights_only.
- requirements.txt: drop unused pandas/music21, add mido (pretty_midi dep).
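
One way the --output normalisation in src/train.py could look, using
pathlib. normalise_output is a hypothetical helper written for this note,
not the function in the repository.

```python
from pathlib import Path
from typing import Tuple


def normalise_output(output: str) -> Tuple[Path, Path]:
    """Return (checkpoint path, CSV log path) for an --output value."""
    path = Path(output)
    # Strip a trailing .pt so "pretrained" and "pretrained.pt" share one stem.
    stem = path.with_suffix("") if path.suffix == ".pt" else path
    checkpoint = stem.with_name(stem.name + ".pt")
    log_csv = stem.with_name(stem.name + ".log.csv")
    return checkpoint, log_csv


print(normalise_output("pretrained.pt"))
print(normalise_output("pretrained"))
```

Both calls yield pretrained.pt and pretrained.log.csv, never
pretrained.pt.log.csv.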
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AdamW + cosine-with-warmup schedule, PAD-ignoring cross-entropy, per-epoch
CSV logging, best-val-loss checkpointing, early stopping (patience=5).
Same script handles both pre-training and fine-tuning via --init-from.
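
For readers unfamiliar with the schedule, a common form of cosine-with-warmup
is sketched below. Linear warmup and decay all the way to zero are
assumptions here; the actual schedule in src/train.py may differ in detail,
and lr_at is a hypothetical name.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int,
          base_lr: float) -> float:
    """Learning rate at a 0-based step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


for step in (0, 9, 10, 55, 100):
    print(step, f"{lr_at(step, total_steps=100, warmup_steps=10, base_lr=1e-3):.6f}")
```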
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds src/model.py with a weight-tied autoregressive transformer and
tests/test_model.py with shape, weight-tying, and causal-masking checks.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the YYYY_NNN_kebab-case scheme with title_in_snake_case-function.chord.
Snake_case makes the title double-click-selectable; dash unambiguously
separates the title from the optional function suffix.
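
Because the dash only ever separates the title from the function, a filename
can be split with a single partition. The helper and the example filenames
below are illustrative, not taken from the corpus, and the sketch assumes
titles never contain a dash.

```python
from pathlib import Path
from typing import Optional, Tuple


def parse_chord_filename(filename: str) -> Tuple[str, Optional[str]]:
    """Split 'title_in_snake_case-function.chord' into (title, function)."""
    stem = Path(filename).stem
    title, dash, function = stem.partition("-")
    return title, (function if dash else None)


print(parse_chord_filename("autumn_leaves-verse.chord"))  # ('autumn_leaves', 'verse')
print(parse_chord_filename("autumn_leaves.chord"))        # ('autumn_leaves', None)
```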
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add 5/4, 7/4, 7/8, 9/8 to _VALID_TIMES and VOCAB (TIME_* tokens).
Vocab size grows from 81 to 85 tokens. _parse_metre in the McGill
converter assigns subdivision=8 to 7/8 and 9/8. Spec bumped to v2.2.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- STYLE_user renamed to STYLE_H1K0 in VOCAB (author's personal tag)
- Style field now accepts any [A-Za-z][A-Za-z0-9_]* identifier in .chord files
- Unknown styles fall back to STYLE_other at tokenization time with a log warning
- Test fixtures updated to style: other; drop closed _VALID_STYLES frozenset
- Spec bumped to v2.1: documents open style field, fallback behaviour, and §5.7
guide on registering a new style token
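
The open style field and its fallback could be sketched as follows.
KNOWN_STYLES is a hypothetical subset of the registered styles, style_token
is an invented name, and rejecting malformed identifiers with ValueError is
an assumption rather than documented behaviour.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Hypothetical subset of styles that have a STYLE_* token in VOCAB.
KNOWN_STYLES = {"H1K0", "other"}
_STYLE_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def style_token(style: str) -> str:
    """Map a .chord style field to a vocabulary token."""
    if not _STYLE_RE.fullmatch(style):
        raise ValueError(f"invalid style identifier: {style!r}")
    if style not in KNOWN_STYLES:
        # Valid but unregistered: fall back instead of failing.
        logger.warning("unknown style %r, using STYLE_other", style)
        return "STYLE_other"
    return f"STYLE_{style}"


print(style_token("H1K0"))        # STYLE_H1K0
print(style_token("bossa_nova"))  # STYLE_other
```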
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- src/dataset.py: ChordDataset wrapping .pt files with pad/truncate
- scripts/prepare_data.py: tokenize .chord to .pt with train/val/holdout
split, logs token length stats and style/function distributions
- src/external_converters/mcgill_to_chord.py: rewrite parser for real
McGill v2 format (2-column annotation, each bar in its own pipe group,
interval bass notation e.g. /5 and /b3)
- .gitignore: exclude data/processed/train, val, holdout subdirectories
- tests: 37 new tests for ChordDataset and converter (260 total, all pass)
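
The pad/truncate step in ChordDataset amounts to something like the sketch
below, shown on plain lists. pad_or_truncate and pad_id=0 are assumptions;
the real dataset operates on the tensors stored in the .pt files.

```python
from typing import List


def pad_or_truncate(ids: List[int], max_seq_len: int, pad_id: int = 0) -> List[int]:
    """Return exactly max_seq_len ids: cut long sequences, pad short ones."""
    clipped = ids[:max_seq_len]
    return clipped + [pad_id] * (max_seq_len - len(clipped))


print(pad_or_truncate([5, 6, 7], 5))           # [5, 6, 7, 0, 0]
print(pad_or_truncate([5, 6, 7, 8, 9, 10], 4))  # [5, 6, 7, 8]
```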
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds src/external_converters/mcgill_to_chord.py with two public functions:
- convert_song(song_dir, output_dir) — converts one salami_chords.txt to
per-section .chord files (4–16 bars each, style=other)
- convert_dataset(dataset_dir, output_dir) — batch converts all songs
Key decisions:
- Harte qualities mapped to our 18-quality vocabulary; hdim7 → m7b5,
parenthetical alterations (e.g. 7(b9)) handled via regex
- Bar duration estimated from median non-trivial chord duration
- Mode (major/minor) inferred from tonic chord quality distribution
- Sections with <4 or >16 bars are skipped with a logged reason
- Unrecognized Harte chords skip the whole section (no silent corruption)
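
To make the Harte handling concrete, here is a small sketch of splitting a
label with a regex and rejecting unknown qualities. Only the hdim7 → m7b5
entry comes from the notes above; the other QUALITY_MAP entries, the regex,
and split_harte itself are illustrative assumptions, not the converter's
actual code.

```python
import re
from typing import List, Optional, Tuple

_HARTE_RE = re.compile(
    r"^(?P<root>[A-G][b#]*):(?P<quality>[A-Za-z0-9]+)"
    r"(?:\((?P<alts>[^)]*)\))?"      # parenthetical alterations, e.g. (b9)
    r"(?:/(?P<bass>[b#]*\d+))?$"     # interval bass, e.g. /5 or /b3
)

# Only hdim7 -> m7b5 is taken from the source; the rest is hypothetical.
QUALITY_MAP = {"hdim7": "m7b5", "maj": "", "min": "m", "7": "7"}


def split_harte(label: str) -> Tuple[str, str, List[str], Optional[str]]:
    """Return (root, mapped quality, alterations, interval bass)."""
    match = _HARTE_RE.match(label)
    if match is None or match["quality"] not in QUALITY_MAP:
        # The caller would skip the whole section on this error.
        raise ValueError(f"unrecognized Harte chord: {label!r}")
    alts = match["alts"].split(",") if match["alts"] else []
    return match["root"], QUALITY_MAP[match["quality"]], alts, match["bass"]


print(split_harte("A:hdim7"))   # ('A', 'm7b5', [], None)
print(split_harte("G:7(b9)"))   # ('G', '7', ['b9'], None)
print(split_harte("E:min/b3"))  # ('E', 'm', [], 'b3')
```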
48 new tests in tests/test_mcgill_converter.py; total suite 223 passed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>