hamori

Author	SHA1	Message	Date
H1K0	9e73fa5d32	feat: add --bars arg to control output length generate_period() now accepts n_bars=N to stop after exactly N complete bars. bars_completed is seeded from the prefix length so --bars counts the full output, not just the generated tail. scripts/generate.py exposes this as --bars (default: None = model decides). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 20:29:44 +03:00
H1K0	f6ce2a41d3	fix: support '.' and 'NC' in --prefix argument _encode_prefix now handles hold ('.') and no-chord ('NC') tokens alongside chord symbols, and returns (ids, n_positions) so that pos_in_bar is tracked correctly regardless of token type. Fixes ChordParseError when dots were passed in --prefix. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 20:25:41 +03:00
H1K0	2e6e934564	data: add fine-tuning run results (lr=1e-5, 50 epochs) val loss 1.24 → 0.80, val perplexity 3.47 → 2.22. Best epoch 50 (no early stop); convergence epoch 30. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 20:17:25 +03:00
H1K0	c98a12c4e9	data: move raw_user chord files into H1K0/ style subdir	2026-05-21 19:52:17 +03:00
H1K0	c4dd2fb690	refactor: reorganize data/processed/ into mcgill/ and user/ subdirs Moved data/processed/{train,val,holdout}/ → data/processed/mcgill/{train,val,holdout}/ so both corpora have their own namespace under data/processed/. Updated PRETRAIN_DATA paths in make_colab_zip.py accordingly (path remap workaround no longer needed). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 19:47:32 +03:00
H1K0	8f657ca916	scripts: add --mode finetune to make_colab_zip, add colab_finetune notebook make_colab_zip.py now accepts --mode pretrain|finetune (default: pretrain). Finetune mode bundles scripts/train.py + data/processed/user/{train,val}/*.pt plus an optional --include-checkpoint flag for pretrained.pt. notebooks/colab_finetune.ipynb covers the full Colab fine-tuning workflow: upload zip → upload pretrained.pt → verify data → train → inspect → download. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 19:47:10 +03:00
H1K0	6bce48ddf4	chore: simplify and fix processed data gitignore rule Replace five narrow patterns (.pt, .pkl, train/, val/, holdout/) with a single data/processed/ rule that also covers data/processed/user/. All processed tensors are reproducible from committed .chord files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 19:36:41 +03:00
H1K0	fefd3b3805	data: add "doremi" chord files (intro, verse, prechorus, chorus)	2026-05-21 19:18:27 +03:00
H1K0	d8499cb841	data: add "kolybelnaya_dlya_yeli" chord files (verse, prechorus)	2026-05-21 18:41:13 +03:00
H1K0	3ef9d5cc95	data: add "happy" chord files (verse, chorus, bridge)	2026-05-21 10:59:29 +03:00
H1K0	c379d827bd	data: add "escape" chord files (intro, verse, prechorus, chorus, interlude)	2026-05-21 10:37:44 +03:00
H1K0	c7f00ea1b5	data: update user corpus — normalize CRLF, fix transcription errors CRLF → LF normalization per .gitattributes for 24 files covering celestial_sphere, clear_sky, looking_at_the_sky, mysterious_planet, neon_day_beat, pastoral, reindeer_team, sparkle, summer_rain, toki, wake_up. Fixes: - reindeer_team-chorus: H7 → B7 (German/Russian H = B natural) - toki-verse: Fmsus2 → Fsus2 (sus chords have no 3rd) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 10:16:31 +03:00
H1K0	2a3eb1783a	fix: fine-tune config and generator improvements scripts/train.py: fix max_seq_len 256→320 (must match pretrained checkpoint); increase epochs 15→50 and patience 5→10 to give the small corpus enough gradient steps; reduce warmup 20→10 (was 22% of total steps). scripts/generate.py: default to prepending the tonic chord when --prefix is not given; add --no-tonic-anchor to opt out. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 10:15:48 +03:00
H1K0	5307d49a9e	data: add "toki" chord files (intro, verse, prechorus, chorus, bridge, interlude)	2026-05-20 19:49:34 +03:00
H1K0	f3f4f097b2	data: add "mysterious_planet" chord files (chorus)	2026-05-20 19:22:51 +03:00
H1K0	f38a090565	data: add "neon_day_beat" chord files (chorus)	2026-05-20 19:16:16 +03:00
H1K0	249b125c32	data: add "celestial_sphere" chord files (verse, chorus, outro)	2026-05-20 19:13:35 +03:00
H1K0	bdff33c19f	data: add "looking_at_the_sky" chord files (chorus)	2026-05-20 19:05:03 +03:00
H1K0	45650e9ab9	data: add "clear_sky" chord files (verse, chorus, bridge)	2026-05-20 18:59:25 +03:00
H1K0	e3983ca1bd	data: add "pastoral" chord files (verse, bridge)	2026-05-20 18:53:48 +03:00
H1K0	c62a05012c	data: add "summer_rain" chord files (verse, chorus)	2026-05-20 18:49:11 +03:00
H1K0	bcc03b04fd	data: add "wake_up" chord files (verse, prechorus, chorus, interlude)	2026-05-20 18:41:28 +03:00
H1K0	8ada4a92ed	data: add "sparkle" chord files (verse, chorus, interlude)	2026-05-20 18:30:42 +03:00
H1K0	e33f715d2d	data: add "reindeer_team" chord files (verse, chorus, bridge)	2026-05-20 18:22:33 +03:00
H1K0	248a6f14b7	chore: add output/ to .gitignore Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 14:33:38 +03:00
H1K0	e657d9edb5	feat: add generate module and CLI; fix tokenizer minor issues src/generate.py: autoregressive generation with top-p sampling, grammar masking (ROOT→QUAL→EXT→BASS; EOS only at bar boundary), key transposition, and optional chord prefix. Partial bars on context truncation are padded with HOLDs rather than discarded. scripts/generate.py: CLI wrapping generate_period — accepts mode, key, time, subdivision, style, function, prefix, temperature, top-p, seed, tempo; writes .chord and optional MIDI. src/tokenizer.py: fix docstring vocab size (81→84); normalize redundant BASS_<note>==root to no slash in _tokens_to_symbol. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 14:28:44 +03:00
H1K0	8a73394df9	data: update pretrained checkpoint results (BAR-free tokenizer) Re-run pre-training results with the corrected 84-token vocabulary and max_seq_len=320. Previous checkpoint was trained on stale data with BAR tokens and a corrupted tokenizer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 14:28:00 +03:00
H1K0	4aead2ea20	feat: remove BAR token; bump spec to v2.3; fix max_seq_len Bar boundaries are now implicit — the detokenizer counts positions per bar using TIME × SUB, and the generator gates EOS to bar boundaries only. Removing the deterministic BAR token reduces vocab size from 85 to 84 and lets the model focus on meaningful predictions. - src/tokenizer.py: drop BAR from VOCAB (85→84); replace BAR-based detokenize_to_period with position-counting logic; add write_chord_file; fix _tokens_to_symbol for add9/m(add9) qualities - tests/test_tokenizer.py: update vocab-size assertions to 84, structural token test, remove bar-count test, add test_no_bar_token_in_vocab - docs/chord_format_spec.md: bump to v2.3; document BAR removal in §5.2, §5.3, §5.4, §5.5, §5.6, §6.2, and changelog - CLAUDE.md: remove stale BAR reference, update vocab size to 84 - scripts/pretrain.py: raise max_seq_len 256→320 to cover regenerated McGill data (mean=83, max=283 tokens with BAR-free tokenizer) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 13:56:34 +03:00
H1K0	329952b02e	data: add pre-training results from Google Colab run Includes log CSV (50 epochs), loss-curve plot, and report. Training ran on Colab GPU (T4). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 13:10:34 +03:00
H1K0	89770dd009	feat: add Colab bundle script and pre-training notebook scripts/make_colab_zip.py packages src/, scripts/pretrain.py, requirements.txt, and processed .pt files into hamori_colab.zip, remapping data/processed/{train,val}/ -> data/processed/mcgill/{train,val}/ so pretrain.py finds the data without modification. notebooks/colab_pretrain.ipynb guides through upload, extraction, dependency install, training run, report display, and results download. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 13:00:03 +03:00
H1K0	0682ccc140	docs: actualize README, architecture, requirements (v1.1) README: - processed/ tree now shows mcgill/ and user/ subdirs - --style user -> --style H1K0 in quick-start prefix example - pretrained.report.txt and finetuned.report.txt added to artifact tables architecture.md (-> v1.1): - remove stale music21 fallback mention from chord_parser section - fix ChordDataset: on-demand loading, not eager; remove non-existent make_dataloader from public interface - fix train function name: train_model -> train - update logging description: report goes to .report.txt, not stdout - note that scripts use max_seq_len=256 (sequences top out at 195 tokens) requirements.md (-> v1.1): - FT-12: update from unified script to pretrain.py + train.py pair Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 12:46:09 +03:00
H1K0	03b464973a	feat: write training report to file instead of stdout pretrain.py -> checkpoints/pretrained.report.txt train.py -> checkpoints/finetuned.report.txt Single-line [report] saved -> <path> printed to stdout instead. Also fix arrow character incompatible with Windows cp1251 console. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 12:40:44 +03:00
H1K0	632407ebef	refactor: split training scripts into pretrain.py and train.py - scripts/run_pretrain.py -> scripts/pretrain.py: pre-trains on McGill corpus (data/processed/mcgill/), saves checkpoints/pretrained.pt. - scripts/train.py: rewritten as high-level fine-tune wrapper; loads pretrained.pt, trains on data/processed/user/, saves finetuned.pt. Both scripts include timing estimate, loss-curve plot, per-epoch report, and --skip-training flag. - README: updated section 7 to reflect new script names and separate data directories. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 12:35:23 +03:00
H1K0	65c3f6bf7c	data: add 2-epoch smoke pre-training log (_smoke_pretrain.log.csv) Sanity run: McGill corpus, max_seq_len=256, batch_size=32, lr=3e-4, seed=42. Epoch 1: train=1.2603 val=0.6403 ppl=1.90 Epoch 2: train=0.5979 val=0.5809 ppl=1.80 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 12:16:28 +03:00
H1K0	dd4f21f17f	feat: add run_pretrain.py; fix output-path naming and max_seq_len - scripts/run_pretrain.py: single-command pre-training runner with timing estimate, loss-curve plot (matplotlib), and per-epoch report. Sets max_seq_len=256 (McGill sequences max out at 195 tokens, ~4x faster attention than the 512 default). - src/train.py: normalise --output so pretrained.pt and pretrained both produce pretrained.pt + pretrained.log.csv (not pretrained.pt.log.csv). Serialize Path fields as strings in checkpoint to satisfy weights_only. - requirements.txt: drop unused pandas/music21, add mido (pretty_midi dep). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 12:13:38 +03:00
H1K0	733e1fde1f	feat: implement training loop and CLI (src/train.py, scripts/train.py) AdamW + cosine-with-warmup schedule, PAD-ignoring cross-entropy, per-epoch CSV logging, best-val-loss checkpointing, early stopping (patience=5). Same script handles both pre-training and fine-tuning via --init-from. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 11:15:39 +03:00
H1K0	10229be042	feat: implement ChordTransformer (pre-norm decoder-only transformer) Adds src/model.py with a weight-tied autoregressive transformer and tests/test_model.py with shape, weight-tying, and causal-masking checks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 11:09:11 +03:00
H1K0	0712eec578	data: add "la_veille_de_noel" chord files (intro, verse1, chorus1, verse2, chorus2, outro)	2026-05-20 11:01:06 +03:00
H1K0	555205b7d2	docs: actualize vocab size (81→85), spec version (2.0→2.2), style tag (user→H1K0) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 03:23:55 +03:00
H1K0	0a1dcc0ec2	data: use Japanese titles in chord file headers	2026-05-20 03:07:59 +03:00
H1K0	7c9b6c3749	data: add "golos" chord files (verse, prechorus, chorus, bridge, interlude)	2026-05-20 03:02:28 +03:00
H1K0	f910b066bb	data: extend some chord files	2026-05-20 02:33:02 +03:00
H1K0	31bc332c5c	data: add "irozuku_sekai" chord files (intro, verse, chorus, bridge)	2026-05-20 02:26:11 +03:00
H1K0	895f1df54f	data: add "ciel_dhiver" chord files (intro, verse, chorus, bridge)	2026-05-20 02:10:52 +03:00
H1K0	eee5e97194	data: add first hand-transcribed user corpus chord files Three songs (electricity, hikari_no_shizuku, okazalos) covering verse/chorus/bridge sections, all tagged style: H1K0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 01:46:47 +03:00
H1K0	f0352015cf	docs: simplify §8 filename convention to snake_case_title-function.chord Replace the YYYY_NNN_kebab-case scheme with title_in_snake_case-function.chord. Snake_case makes the title double-click-selectable; dash unambiguously separates the title from the optional function suffix. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 00:51:22 +03:00
H1K0	3cd9c29d9f	feat: extend time signature support to 9 metres (5/4, 7/4, 7/8, 9/8) Add 5/4, 7/4, 7/8, 9/8 to _VALID_TIMES and VOCAB (TIME_* tokens). Vocab size grows from 81 to 85 tokens. _parse_metre in the McGill converter assigns subdivision=8 to 7/8 and 9/8. Spec bumped to v2.2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 00:37:05 +03:00
H1K0	4fd8ece170	refactor: replace fixed STYLE_user with open-ended style tag system - STYLE_user renamed to STYLE_H1K0 in VOCAB (author's personal tag) - Style field now accepts any [A-Za-z][A-Za-z0-9_]* identifier in .chord files - Unknown styles fall back to STYLE_other at tokenization time with a log warning - Test fixtures updated to style: other; drop closed _VALID_STYLES frozenset - Spec bumped to v2.1: documents open style field, fallback behaviour, and §5.7 guide on registering a new style token Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 00:29:52 +03:00
H1K0	84ba7b4743	feat: add dataset, prepare_data pipeline and fix McGill converter - src/dataset.py: ChordDataset wrapping .pt files with pad/truncate - scripts/prepare_data.py: tokenize .chord to .pt with train/val/holdout split, logs token length stats and style/function distributions - src/external_converters/mcgill_to_chord.py: rewrite parser for real McGill v2 format (2-column annotation, each bar in its own pipe group, interval bass notation e.g. /5 and /b3) - .gitignore: exclude data/processed/train, val, holdout subdirectories - tests: 37 new tests for ChordDataset and converter (260 total, all pass) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 18:09:46 +03:00
H1K0	ea32bf43b2	feat: implement McGill Billboard converter (Harte → .chord) Adds src/external_converters/mcgill_to_chord.py with two public functions: - convert_song(song_dir, output_dir) — converts one salami_chords.txt to per-section .chord files (4–16 bars each, style=other) - convert_dataset(dataset_dir, output_dir) — batch converts all songs Key decisions: - Harte qualities mapped to our 18-quality vocabulary; hdim7 → m7b5, parenthetical alterations (e.g. 7(b9)) handled via regex - Bar duration estimated from median non-trivial chord duration - Mode (major/minor) inferred from tonic chord quality distribution - Sections with <4 or >16 bars are skipped with a logged reason - Unrecognized Harte chords skip the whole section (no silent corruption) 48 new tests in tests/test_mcgill_converter.py; total suite 223 passed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 17:04:02 +03:00
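
Several commit messages above quote a validation loss next to a validation perplexity (1.24 → 0.80 against 3.47 → 2.22 for the fine-tuning run, and 0.6403 / 0.5809 against 1.90 / 1.80 for the smoke run). The sketch below shows how the two are usually related, assuming the reported loss is a mean per-token cross-entropy in nats and perplexity is its exponential. The log does not show how the repository computes these figures, so the helper name `perplexity` and the formula are illustrative assumptions, not the project's code. Under that assumption the recomputed values agree with the logged ones to within about 0.02, which is consistent with the losses having been rounded before being written into the messages.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats.

    Hypothetical helper: assumes ppl = exp(loss), which the commit log
    does not confirm.
    """
    return math.exp(mean_cross_entropy)


# (label, validation loss, perplexity as quoted in the commit messages)
quoted = [
    ("fine-tune, start", 1.24, 3.47),
    ("fine-tune, end", 0.80, 2.22),
    ("smoke run, epoch 1", 0.6403, 1.90),
    ("smoke run, epoch 2", 0.5809, 1.80),
]

for label, loss, logged_ppl in quoted:
    recomputed = perplexity(loss)
    print(f"{label}: loss={loss:.4f} "
          f"recomputed_ppl={recomputed:.3f} logged_ppl={logged_ppl:.2f}")
```
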


60 Commits