Commit Graph

74 Commits

Author SHA1 Message Date
H1K0 dc037b0895 fix: clone grammar bias per step in generate_period
_grammar_bias returned a shared module-level singleton that the loop
mutated in place (EOS block + repetition penalty). The penalty thus
accumulated across positions within a call and persisted across calls,
collapsing output to HOLD/NC until process restart. Clone the bias each
step so edits stay local. Add regression tests guarding the invariant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 21:14:04 +03:00
H1K0 d1af7bceb8 feat(app): show generation progress and rename app heading
- Disable the run button and show a "Генерация…" ("Generating…") status while a request is
  processing, then re-enable it. Repeated generations previously gave no
  visible feedback once the model was cached, so it was unclear whether a
  request was running.
- Append elapsed time to the final status so identical re-runs differ visibly.
- Rename the H1 heading and browser/tab title to
  "генератор аккордовых последовательностей" ("chord progression generator").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 21:00:29 +03:00
H1K0 c56397df54 docs: add Russian project report notebook (notebooks/report.ipynb)
Runnable end-to-end report combining narrative, code, and inline figures:
data and .chord format, transformer working principle, two-stage training
curves, perplexity (3.58 -> 2.15), distribution-shift plot with a reading
legend, qualitative examples, and a generation demo. Written in a
first-person student voice.

- CLAUDE.md: report is now a Jupyter notebook; GOST formatting dropped
- requirements.txt: add nbconvert + ipykernel (optional, for the notebook)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 17:07:00 +03:00
H1K0 c147c47acb feat: add minimal Gradio web UI (app.py)
Single-page form wrapping src.generate.generate_period: pick model, mode,
key, style, function, time, sampling params and optional prefix; returns
the chord grid plus downloadable .chord and .mid files. Russian usage
instructions are embedded on the same page.

Auto-length output is capped at 16 bars (the period maximum) so a model
that never emits EOS can't run away into dozens of NC/hold bars.

Added per the author's explicit request — web UI was previously out of
scope; updated CLAUDE.md and README accordingly. Choices for style/
function/time are derived from VOCAB so the form can't drift from the
tokenizer.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 15:38:01 +03:00
H1K0 f00a6c1b3a feat: add bigram repetition penalty to generate_period
Tracks ROOT-level bigrams (prev_root → curr_root) across chord-change events.
At each FREE position, subtracts penalty * count(prev→root) from ROOT logits,
capped at 3.0 to prevent NC/HOLD flooding at extreme values.

Practical range: 0.5 (mild, breaks loops after 2 occurrences) to 1.0
(aggressive). Default 0.0 keeps backward compatibility.
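
The penalty rule above can be sketched roughly as follows. This is an illustration, not the code in src/generate.py: the function name `penalize_roots`, the dict-based logit layout, and the reading that the 3.0 cap applies to the subtracted amount are all assumptions.

```python
from collections import Counter

MAX_PENALTY = 3.0  # cap quoted in the commit message (assumed to bound the subtraction)


def penalize_roots(root_logits, prev_root, bigram_counts, penalty):
    """Return a new dict of ROOT logits with the bigram penalty applied.

    root_logits: dict mapping root name -> logit (hypothetical layout).
    bigram_counts: Counter of (prev_root, root) pairs seen so far.
    """
    adjusted = {}
    for root, logit in root_logits.items():
        count = bigram_counts[(prev_root, root)]
        adjusted[root] = logit - min(penalty * count, MAX_PENALTY)
    return adjusted


# Count root-level bigrams over the chord changes C -> G -> C -> G.
counts = Counter()
for prev, cur in [("C", "G"), ("G", "C"), ("C", "G")]:
    counts[(prev, cur)] += 1

logits = {"C": 1.0, "F": 1.0, "G": 1.0}
adjusted = penalize_roots(logits, "C", counts, penalty=0.5)
```

Here C→G has already occurred twice, so with penalty 0.5 the G logit drops by 1.0 while C and F are untouched.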

Added --repetition-penalty flag to scripts/generate.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 15:19:24 +03:00
H1K0 1a63b8e4d8 fix: raise ChordFormatError when detokenize produces empty bars
A sequence of only metadata tokens followed by EOS would silently return a
ChordPeriod with bars=[], which would later crash or produce an empty .chord
file. Now raises immediately with a descriptive message. Added a failing-then-
passing test to cover this path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 15:00:45 +03:00
H1K0 d09a08d553 feat: add src/evaluate.py and scripts/evaluate.py
Implements perplexity computation, chord distribution extraction (qualities,
extensions, inversions, root-motion intervals), 4-panel comparison plot, and
paired qualitative example generation for pretrained vs finetuned model.

Results on user val set: pretrained PPL 3.58 → finetuned PPL 2.15 (−40 %).
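
As a quick check of the figure above, assuming the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood (the actual code in src/evaluate.py is not shown here, and both function names below are hypothetical):

```python
import math


def perplexity(mean_nll):
    """Perplexity from a mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


def relative_change(before, after):
    """Signed relative change from `before` to `after`."""
    return (after - before) / before


# Pretrained vs finetuned perplexity, as quoted in the commit message.
change = relative_change(3.58, 2.15)
```

The result rounds to the −40 % reported in the message.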

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 14:57:49 +03:00
H1K0 b30f4c188b chore: track checkpoints via Git LFS
Removed checkpoints/*.pt from .gitignore; files are now stored as LFS
objects (pretrained.pt 17 MB, finetuned.pt 17 MB). LFS attributes were
already in place in .gitattributes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 14:37:24 +03:00
H1K0 3b87fb9e33 docs: add McGill Billboard Project hyperlink
Added the official DDMAL link at the first prose mention in README.md,
docs/architecture.md, and docs/glossary.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 21:38:37 +03:00
H1K0 555023532f scripts: update fine-tune defaults to lr=3e-5, epochs=30
Matches the configuration that produced finetuned.pt (val ppl 2.15,
best epoch 20, early stopped at 30).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 21:18:37 +03:00
H1K0 8a60a8ada9 docs: replace sampling_guide with full generation_cheatsheet
Renamed and expanded to cover the full generation workflow:
- §1 quick start examples
- §2 complete CLI reference table (all 16 parameters)
- §3 corpus coverage by key and function (from actual training data)
- §4 prefix design patterns with templates and recommendations
- §5 temperature/top-p (content from sampling_guide, condensed)
- §6 technical implementation details

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 21:10:12 +03:00
H1K0 4771498860 docs: add sampling_guide.md — temperature and top-p cheatsheet
Explains temperature and nucleus sampling mechanics, their interaction,
and provides a task-based cheatsheet with symptom/remedy table and
example CLI invocations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 20:52:39 +03:00
H1K0 d9585ec008 data: add fine-tuning run results (lr=3e-5, 30 epochs)
val loss 1.19 → 0.77, val perplexity 3.29 → 2.15.
Best epoch 20, early stop at epoch 30 (patience=10).
Improvement over previous lr=1e-5 run (best val ppl 2.22).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 20:52:39 +03:00
H1K0 7c0d147956 fix: --bars now suppresses early EOS until target bar count is reached
Previously the model could emit EOS before reaching n_bars because the
EOS-suppression was only applied via the n_bars break, not the grammar
bias. Fixed by masking EOS to -inf in the logit bias while
bars_completed < n_bars.
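
A minimal sketch of the masking rule, using a plain list of logits. The function name `mask_eos` and its signature are hypothetical; only the condition (`bars_completed < n_bars`) and the `-inf` mask come from the message.

```python
import math


def mask_eos(logits, eos_id, bars_completed, n_bars):
    """Return a copy of logits with EOS set to -inf until n_bars is reached.

    n_bars=None means the model decides the length, so nothing is masked.
    """
    masked = list(logits)
    if n_bars is not None and bars_completed < n_bars:
        masked[eos_id] = -math.inf
    return masked
```

Because the mask is applied to the logits before sampling, EOS has zero probability while the target is unmet, regardless of how strongly the model prefers it.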

Added _EosHungryModel fixture and test_generate_bars_overrides_early_eos
to catch this regression class.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 20:34:42 +03:00
H1K0 9e73fa5d32 feat: add --bars arg to control output length
generate_period() now accepts n_bars=N to stop after exactly N complete
bars. bars_completed is seeded from the prefix length so --bars counts
the full output, not just the generated tail.

scripts/generate.py exposes this as --bars (default: None = model decides).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 20:29:44 +03:00
H1K0 f6ce2a41d3 fix: support '.' and 'NC' in --prefix argument
_encode_prefix now handles hold ('.') and no-chord ('NC') tokens
alongside chord symbols, and returns (ids, n_positions) so that
pos_in_bar is tracked correctly regardless of token type.

Fixes ChordParseError when dots were passed in --prefix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 20:25:41 +03:00
H1K0 2e6e934564 data: add fine-tuning run results (lr=1e-5, 50 epochs)
val loss 1.24 → 0.80, val perplexity 3.47 → 2.22.
Best epoch 50 (no early stop); convergence epoch 30.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 20:17:25 +03:00
H1K0 c98a12c4e9 data: move raw_user chord files into H1K0/ style subdir 2026-05-21 19:52:17 +03:00
H1K0 c4dd2fb690 refactor: reorganize data/processed/ into mcgill/ and user/ subdirs
Moved data/processed/{train,val,holdout}/ → data/processed/mcgill/{train,val,holdout}/
so both corpora have their own namespace under data/processed/.
Updated PRETRAIN_DATA paths in make_colab_zip.py accordingly
(path remap workaround no longer needed).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 19:47:32 +03:00
H1K0 8f657ca916 scripts: add --mode finetune to make_colab_zip, add colab_finetune notebook
make_colab_zip.py now accepts --mode pretrain|finetune (default: pretrain).
Finetune mode bundles scripts/train.py + data/processed/user/{train,val}/*.pt
plus an optional --include-checkpoint flag for pretrained.pt.

notebooks/colab_finetune.ipynb covers the full Colab fine-tuning workflow:
upload zip → upload pretrained.pt → verify data → train → inspect → download.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 19:47:10 +03:00
H1K0 6bce48ddf4 chore: simplify and fix processed data gitignore rule
Replace five narrow patterns (*.pt, *.pkl, train/, val/, holdout/) with
a single data/processed/ rule that also covers data/processed/user/.
All processed tensors are reproducible from committed .chord files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 19:36:41 +03:00
H1K0 fefd3b3805 data: add "doremi" chord files (intro, verse, prechorus, chorus) 2026-05-21 19:18:27 +03:00
H1K0 d8499cb841 data: add "kolybelnaya_dlya_yeli" chord files (verse, prechorus) 2026-05-21 18:41:13 +03:00
H1K0 3ef9d5cc95 data: add "happy" chord files (verse, chorus, bridge) 2026-05-21 10:59:29 +03:00
H1K0 c379d827bd data: add "escape" chord files (intro, verse, prechorus, chorus, interlude) 2026-05-21 10:37:44 +03:00
H1K0 c7f00ea1b5 data: update user corpus — normalize CRLF, fix transcription errors
CRLF → LF normalization per .gitattributes for 24 files covering
celestial_sphere, clear_sky, looking_at_the_sky, mysterious_planet,
neon_day_beat, pastoral, reindeer_team, sparkle, summer_rain, toki,
wake_up.

Fixes:
  - reindeer_team-chorus: H7 → B7 (German/Russian H = B natural)
  - toki-verse: Fmsus2 → Fsus2 (sus chords have no 3rd)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 10:16:31 +03:00
H1K0 2a3eb1783a fix: fine-tune config and generator improvements
scripts/train.py: fix max_seq_len 256→320 (must match pretrained checkpoint);
increase epochs 15→50 and patience 5→10 to give the small corpus enough
gradient steps; reduce warmup 20→10 (was 22% of total steps).

scripts/generate.py: default to prepending the tonic chord when --prefix is
not given; add --no-tonic-anchor to opt out.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 10:15:48 +03:00
H1K0 5307d49a9e data: add "toki" chord files (intro, verse, prechorus, chorus, bridge, interlude) 2026-05-20 19:49:34 +03:00
H1K0 f3f4f097b2 data: add "mysterious_planet" chord files (chorus) 2026-05-20 19:22:51 +03:00
H1K0 f38a090565 data: add "neon_day_beat" chord files (chorus) 2026-05-20 19:16:16 +03:00
H1K0 249b125c32 data: add "celestial_sphere" chord files (verse, chorus, outro) 2026-05-20 19:13:35 +03:00
H1K0 bdff33c19f data: add "looking_at_the_sky" chord files (chorus) 2026-05-20 19:05:03 +03:00
H1K0 45650e9ab9 data: add "clear_sky" chord files (verse, chorus, bridge) 2026-05-20 18:59:25 +03:00
H1K0 e3983ca1bd data: add "pastoral" chord files (verse, bridge) 2026-05-20 18:53:48 +03:00
H1K0 c62a05012c data: add "summer_rain" chord files (verse, chorus) 2026-05-20 18:49:11 +03:00
H1K0 bcc03b04fd data: add "wake_up" chord files (verse, prechorus, chorus, interlude) 2026-05-20 18:41:28 +03:00
H1K0 8ada4a92ed data: add "sparkle" chord files (verse, chorus, interlude) 2026-05-20 18:30:42 +03:00
H1K0 e33f715d2d data: add "reindeer_team" chord files (verse, chorus, bridge) 2026-05-20 18:22:33 +03:00
H1K0 248a6f14b7 chore: add output/ to .gitignore
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 14:33:38 +03:00
H1K0 e657d9edb5 feat: add generate module and CLI; fix tokenizer minor issues
src/generate.py: autoregressive generation with top-p sampling, grammar
masking (ROOT→QUAL→EXT→BASS; EOS only at bar boundary), key transposition,
and optional chord prefix.  Partial bars on context truncation are padded
with HOLDs rather than discarded.
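
For reference, top-p (nucleus) filtering can be sketched in pure Python as below. This shows the general technique, not the implementation in src/generate.py; `top_p_filter` is a hypothetical name. Grammar-masked tokens are assumed to arrive with a logit of `-inf`, which gives them zero probability.

```python
import math


def top_p_filter(logits, top_p):
    """Return the token indices kept by nucleus (top-p) filtering."""
    # Softmax with max-subtraction for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least probable and keep the smallest
    # prefix whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept
```

Sampling then draws from the kept indices after renormalising their probabilities.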

scripts/generate.py: CLI wrapping generate_period — accepts mode, key,
time, subdivision, style, function, prefix, temperature, top-p, seed,
tempo; writes .chord and optional MIDI.

src/tokenizer.py: fix docstring vocab size (81→84); normalize redundant
BASS_<note>==root to no slash in _tokens_to_symbol.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 14:28:44 +03:00
H1K0 8a73394df9 data: update pretrained checkpoint results (BAR-free tokenizer)
Re-run pre-training results with the corrected 84-token vocabulary and
max_seq_len=320.  Previous checkpoint was trained on stale data with BAR
tokens and a corrupted tokenizer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 14:28:00 +03:00
H1K0 4aead2ea20 feat: remove BAR token; bump spec to v2.3; fix max_seq_len
Bar boundaries are now implicit — the detokenizer counts positions per bar
using TIME × SUB, and the generator gates EOS to bar boundaries only.
Removing the deterministic BAR token reduces vocab size from 85 to 84 and
lets the model focus on meaningful predictions.
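
A rough sketch of position counting, assuming `time` is the number of beats per bar and `sub` the number of subdivisions per beat (the exact TIME and SUB token semantics are defined in docs/chord_format_spec.md and not reproduced here). The function name is hypothetical, and raising on a partial bar is a choice made for this sketch only.

```python
def split_into_bars(positions, time, sub):
    """Group a flat list of per-position symbols into bars of time * sub slots."""
    per_bar = time * sub
    if per_bar <= 0:
        raise ValueError("time and sub must be positive")
    if len(positions) % per_bar != 0:
        raise ValueError("sequence does not end on a bar boundary")
    return [positions[i:i + per_bar] for i in range(0, len(positions), per_bar)]
```

For example, eight positions with `time=4, sub=1` yield two bars of four positions each, with no explicit bar token needed.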

- src/tokenizer.py: drop BAR from VOCAB (85→84); replace BAR-based
  detokenize_to_period with position-counting logic; add write_chord_file;
  fix _tokens_to_symbol for add9/m(add9) qualities
- tests/test_tokenizer.py: update vocab-size assertions to 84, structural
  token test, remove bar-count test, add test_no_bar_token_in_vocab
- docs/chord_format_spec.md: bump to v2.3; document BAR removal in §5.2,
  §5.3, §5.4, §5.5, §5.6, §6.2, and changelog
- CLAUDE.md: remove stale BAR reference, update vocab size to 84
- scripts/pretrain.py: raise max_seq_len 256→320 to cover regenerated
  McGill data (mean=83, max=283 tokens with BAR-free tokenizer)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 13:56:34 +03:00
H1K0 329952b02e data: add pre-training results from Google Colab run
Includes log CSV (50 epochs), loss-curve plot, and report.
Training ran on Colab GPU (T4).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 13:10:34 +03:00
H1K0 89770dd009 feat: add Colab bundle script and pre-training notebook
scripts/make_colab_zip.py packages src/, scripts/pretrain.py,
requirements.txt, and processed .pt files into hamori_colab.zip,
remapping data/processed/{train,val}/ -> data/processed/mcgill/{train,val}/
so pretrain.py finds the data without modification.

notebooks/colab_pretrain.ipynb guides through upload, extraction,
dependency install, training run, report display, and results download.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 13:00:03 +03:00
H1K0 0682ccc140 docs: update README, architecture, requirements (v1.1)
README:
- processed/ tree now shows mcgill/ and user/ subdirs
- --style user -> --style H1K0 in quick-start prefix example
- pretrained.report.txt and finetuned.report.txt added to artifact tables

architecture.md (-> v1.1):
- remove stale music21 fallback mention from chord_parser section
- fix ChordDataset: on-demand loading, not eager; remove non-existent
  make_dataloader from public interface
- fix train function name: train_model -> train
- update logging description: report goes to .report.txt, not stdout
- note that scripts use max_seq_len=256 (sequences top out at 195 tokens)

requirements.md (-> v1.1):
- FT-12: update from unified script to pretrain.py + train.py pair

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 12:46:09 +03:00
H1K0 03b464973a feat: write training report to file instead of stdout
pretrain.py -> checkpoints/pretrained.report.txt
train.py    -> checkpoints/finetuned.report.txt

A single "[report] saved -> <path>" line is printed to stdout instead.
Also replace an arrow character that the Windows cp1251 console cannot encode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 12:40:44 +03:00
H1K0 632407ebef refactor: split training scripts into pretrain.py and train.py
- scripts/run_pretrain.py -> scripts/pretrain.py: pre-trains on McGill
  corpus (data/processed/mcgill/), saves checkpoints/pretrained.pt.
- scripts/train.py: rewritten as high-level fine-tune wrapper; loads
  pretrained.pt, trains on data/processed/user/, saves finetuned.pt.
  Both scripts include timing estimate, loss-curve plot, per-epoch report,
  and --skip-training flag.
- README: updated section 7 to reflect new script names and separate
  data directories.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 12:35:23 +03:00
H1K0 65c3f6bf7c data: add 2-epoch smoke pre-training log (_smoke_pretrain.log.csv)
Sanity run: McGill corpus, max_seq_len=256, batch_size=32, lr=3e-4, seed=42.
Epoch 1: train=1.2603 val=0.6403 ppl=1.90
Epoch 2: train=0.5979 val=0.5809 ppl=1.80

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 12:16:28 +03:00
H1K0 dd4f21f17f feat: add run_pretrain.py; fix output-path naming and max_seq_len
- scripts/run_pretrain.py: single-command pre-training runner with
  timing estimate, loss-curve plot (matplotlib), and per-epoch report.
  Sets max_seq_len=256 (McGill sequences max out at 195 tokens, ~4x
  faster attention than the 512 default).
- src/train.py: normalise --output so pretrained.pt and pretrained both
  produce pretrained.pt + pretrained.log.csv (not pretrained.pt.log.csv).
  Serialize Path fields as strings in checkpoint to satisfy weights_only.
- requirements.txt: drop unused pandas/music21, add mido (pretty_midi dep).
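
The --output normalisation described above could look roughly like this with `pathlib`. The helper name `output_paths` is hypothetical and the real logic in src/train.py may differ.

```python
from pathlib import Path


def output_paths(output):
    """Map an --output value to (checkpoint path, CSV log path)."""
    path = Path(output)
    # Strip a trailing ".pt" so both spellings share the same stem.
    stem = path.with_suffix("") if path.suffix == ".pt" else path
    checkpoint = stem.with_name(stem.name + ".pt")
    log = stem.with_name(stem.name + ".log.csv")
    return checkpoint, log
```

Both `checkpoints/pretrained` and `checkpoints/pretrained.pt` then map to `pretrained.pt` plus `pretrained.log.csv`.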

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 12:13:38 +03:00
H1K0 733e1fde1f feat: implement training loop and CLI (src/train.py, scripts/train.py)
AdamW + cosine-with-warmup schedule, PAD-ignoring cross-entropy, per-epoch
CSV logging, best-val-loss checkpointing, early stopping (patience=5).
Same script handles both pre-training and fine-tuning via --init-from.
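
A cosine-with-warmup schedule can be sketched as a pure function of the step index. The linear warmup shape, the `min_lr` floor, and the name `lr_at` are assumptions; the commit message only names the schedule type.

```python
import math


def lr_at(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In PyTorch, a function like this is typically wrapped in `torch.optim.lr_scheduler.LambdaLR` as a multiplier on the optimizer's base learning rate.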

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 11:15:39 +03:00