hamori

Author	SHA1	Message	Date
H1K0	dc037b0895	fix: clone grammar bias per step in generate_period. _grammar_bias returned a shared module-level singleton that the loop mutated in place (EOS block + repetition penalty). The penalty thus accumulated across positions within a call and persisted across calls, collapsing output to HOLD/NC until process restart. Clone the bias each step so edits stay local. Add regression tests guarding the invariant. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 21:14:04 +03:00
H1K0	f00a6c1b3a	feat: add bigram repetition penalty to generate_period. Tracks ROOT-level bigrams (prev_root → curr_root) across chord-change events. At each FREE position, subtracts penalty * count(prev→root) from ROOT logits, capped at 3.0 to prevent NC/HOLD flooding at extreme values. Practical range: 0.5 (mild, breaks loops after 2 occurrences) to 1.0 (aggressive). Default 0.0 keeps backward compatibility. Added --repetition-penalty flag to scripts/generate.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-04 15:19:24 +03:00
H1K0	1a63b8e4d8	fix: raise ChordFormatError when detokenize produces empty bars. A sequence of only metadata tokens followed by EOS would silently return a ChordPeriod with bars=[], which would later crash or produce an empty .chord file. Now raises immediately with a descriptive message. Added a failing-then-passing test to cover this path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-04 15:00:45 +03:00
H1K0	d09a08d553	feat: add src/evaluate.py and scripts/evaluate.py. Implements perplexity computation, chord distribution extraction (qualities, extensions, inversions, root-motion intervals), 4-panel comparison plot, and paired qualitative example generation for pretrained vs finetuned model. Results on user val set: pretrained PPL 3.58 → finetuned PPL 2.15 (−40%). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-04 14:57:49 +03:00
H1K0	7c0d147956	fix: --bars now suppresses early EOS until target bar count is reached. Previously the model could emit EOS before reaching n_bars because the EOS-suppression was only applied via the n_bars break, not the grammar bias. Fixed by masking EOS to -inf in the logit bias while bars_completed < n_bars. Added _EosHungryModel fixture and test_generate_bars_overrides_early_eos to catch this regression class. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 20:34:42 +03:00
H1K0	9e73fa5d32	feat: add --bars arg to control output length. generate_period() now accepts n_bars=N to stop after exactly N complete bars. bars_completed is seeded from the prefix length so --bars counts the full output, not just the generated tail. scripts/generate.py exposes this as --bars (default: None = model decides). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 20:29:44 +03:00
H1K0	f6ce2a41d3	fix: support '.' and 'NC' in --prefix argument. _encode_prefix now handles hold ('.') and no-chord ('NC') tokens alongside chord symbols, and returns (ids, n_positions) so that pos_in_bar is tracked correctly regardless of token type. Fixes ChordParseError when dots were passed in --prefix. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 20:25:41 +03:00
H1K0	e657d9edb5	feat: add generate module and CLI; fix tokenizer minor issues. src/generate.py: autoregressive generation with top-p sampling, grammar masking (ROOT→QUAL→EXT→BASS; EOS only at bar boundary), key transposition, and optional chord prefix. Partial bars on context truncation are padded with HOLDs rather than discarded. scripts/generate.py: CLI wrapping generate_period; accepts mode, key, time, subdivision, style, function, prefix, temperature, top-p, seed, tempo; writes .chord and optional MIDI. src/tokenizer.py: fix docstring vocab size (81→84); normalize redundant BASS_<note>==root to no slash in _tokens_to_symbol. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 14:28:44 +03:00
H1K0	4aead2ea20	feat: remove BAR token; bump spec to v2.3; fix max_seq_len. Bar boundaries are now implicit: the detokenizer counts positions per bar using TIME × SUB, and the generator gates EOS to bar boundaries only. Removing the deterministic BAR token reduces vocab size from 85 to 84 and lets the model focus on meaningful predictions. src/tokenizer.py: drop BAR from VOCAB (85→84); replace BAR-based detokenize_to_period with position-counting logic; add write_chord_file; fix _tokens_to_symbol for add9/m(add9) qualities. tests/test_tokenizer.py: update vocab-size assertions to 84, update structural token test, remove bar-count test, add test_no_bar_token_in_vocab. docs/chord_format_spec.md: bump to v2.3; document BAR removal in §5.2, §5.3, §5.4, §5.5, §5.6, §6.2, and changelog. CLAUDE.md: remove stale BAR reference, update vocab size to 84. scripts/pretrain.py: raise max_seq_len 256→320 to cover regenerated McGill data (mean=83, max=283 tokens with BAR-free tokenizer). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 13:56:34 +03:00
H1K0	dd4f21f17f	feat: add run_pretrain.py; fix output-path naming and max_seq_len. scripts/run_pretrain.py: single-command pre-training runner with timing estimate, loss-curve plot (matplotlib), and per-epoch report. Sets max_seq_len=256 (McGill sequences max out at 195 tokens, ~4x faster attention than the 512 default). src/train.py: normalise --output so pretrained.pt and pretrained both produce pretrained.pt + pretrained.log.csv (not pretrained.pt.log.csv). Serialize Path fields as strings in checkpoint to satisfy weights_only. requirements.txt: drop unused pandas/music21, add mido (pretty_midi dep). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 12:13:38 +03:00
H1K0	733e1fde1f	feat: implement training loop and CLI (src/train.py, scripts/train.py). AdamW + cosine-with-warmup schedule, PAD-ignoring cross-entropy, per-epoch CSV logging, best-val-loss checkpointing, early stopping (patience=5). Same script handles both pre-training and fine-tuning via --init-from. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 11:15:39 +03:00
H1K0	10229be042	feat: implement ChordTransformer (pre-norm decoder-only transformer). Adds src/model.py with a weight-tied autoregressive transformer and tests/test_model.py with shape, weight-tying, and causal-masking checks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 11:09:11 +03:00
H1K0	3cd9c29d9f	feat: extend time signature support to 9 metres (adds 5/4, 7/4, 7/8, 9/8). Add 5/4, 7/4, 7/8, 9/8 to _VALID_TIMES and VOCAB (TIME_* tokens). Vocab size grows from 81 to 85 tokens. _parse_metre in the McGill converter assigns subdivision=8 to 7/8 and 9/8. Spec bumped to v2.2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 00:37:05 +03:00
H1K0	4fd8ece170	refactor: replace fixed STYLE_user with open-ended style tag system. STYLE_user renamed to STYLE_H1K0 in VOCAB (author's personal tag). Style field now accepts any [A-Za-z][A-Za-z0-9_]* identifier in .chord files. Unknown styles fall back to STYLE_other at tokenization time with a log warning. Test fixtures updated to style: other; closed _VALID_STYLES frozenset dropped. Spec bumped to v2.1: documents open style field, fallback behaviour, and §5.7 guide on registering a new style token. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 00:29:52 +03:00
H1K0	84ba7b4743	feat: add dataset, prepare_data pipeline and fix McGill converter. src/dataset.py: ChordDataset wrapping .pt files with pad/truncate. scripts/prepare_data.py: tokenize .chord to .pt with train/val/holdout split, logs token length stats and style/function distributions. src/external_converters/mcgill_to_chord.py: rewrite parser for real McGill v2 format (2-column annotation, each bar in its own pipe group, interval bass notation e.g. /5 and /b3). .gitignore: exclude data/processed/train, val, holdout subdirectories. tests: 37 new tests for ChordDataset and converter (260 total, all pass). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 18:09:46 +03:00
H1K0	ea32bf43b2	feat: implement McGill Billboard converter (Harte → .chord). Adds src/external_converters/mcgill_to_chord.py with two public functions: convert_song(song_dir, output_dir), which converts one salami_chords.txt to per-section .chord files (4–16 bars each, style=other), and convert_dataset(dataset_dir, output_dir), which batch converts all songs. Key decisions: Harte qualities mapped to our 18-quality vocabulary, with hdim7 → m7b5 and parenthetical alterations (e.g. 7(b9)) handled via regex; bar duration estimated from median non-trivial chord duration; mode (major/minor) inferred from tonic chord quality distribution; sections with <4 or >16 bars are skipped with a logged reason; unrecognized Harte chords skip the whole section (no silent corruption). 48 new tests in tests/test_mcgill_converter.py; total suite 223 passed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 17:04:02 +03:00
H1K0	54be1be9ce	feat: implement src/midi_export.py: .chord → two-track MIDI. chord_file_to_midi() parses the period in the user's original key (no transposition), accumulates held-chord segments, then writes two pretty_midi tracks: chords with root anchored at octave 4 (MIDI 60–71 + intervals) and bass at octave 2 (MIDI 36–47). Extension notes are added as a fifth voice at their standard interval above the root. Tempo is parameterised; the CLI wrapper (python -m src.midi_export) supports --tempo BPM. 10 tests cover: file creation, parseability, instrument count and names, chord/bass note counts for a 4-chord C-major fixture (14 chord + 4 bass), octave placement assertions, and tempo affecting total duration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 15:56:44 +03:00
H1K0	868af4ac42	feat: add vocabulary constants and tokenize/detokenize to tokenizer.py. Adds VOCAB (81 tokens), TOKEN_TO_ID, and ID_TO_TOKEN per spec §5.2. tokenize_period() transposes to C/Am then emits BOS + metadata tokens + per-bar chord/HOLD/NC tokens + BAR + EOS. detokenize_to_period() is the exact inverse, returning a ChordPeriod in canonical key. The m(add9) quality maps to QUAL_m_add9 in the vocab (parentheses not valid in token names) via _qual_token/_token_qual helpers. 36 new tests cover vocabulary integrity, token sequence structure, and full round-trip fidelity for all four valid fixture files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 15:47:28 +03:00
H1K0	a473499fac	feat: implement .chord file parser and canonical transposer; freeze requirements. src/tokenizer.py: parse_chord_file(Path) → ChordPeriod reads header + bar body, strips // comments, validates bar position counts and chord symbols, and raises ChordFormatError with filename and bar number on any violation; transpose_to_canonical(ChordPeriod) → ChordPeriod shifts all chord roots and bass notes by the semitone offset to C major / A minor, with a fast path that returns the original object when shift == 0. tests/test_chord_file_parser.py: 39 tests covering parsing of 4 valid fixtures (C major, F# major, B minor, G# minor), error messages for 2 invalid fixtures, and transposition correctness including slash chord root+bass. tests/fixtures/: 6 .chord fixture files (4 valid, 2 invalid). requirements.txt: pinned to current latest stable versions (torch 2.12.0, music21 10.1.0, pretty_midi 0.2.11, matplotlib 3.10.9, numpy 2.4.6, pandas 3.0.3, pytest 9.0.3); Python >= 3.11 noted. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 15:27:57 +03:00
H1K0	dd77de00d0	feat: implement chord symbol parser with full test suite. Adds src/chord_parser.py with parse_chord_symbol() → ChordTokens. Handles all 18 qualities (including Unicode °/Δ/ø variants and alternative spellings), shorthand expansion (maj9 → maj7+ext9, C9 → 7+ext9, etc.), slash chords, and flat→sharp root normalization. Raises ChordParseError with a descriptive message on bad input. Adds tests/test_chord_parser.py: 90 tests covering all qualities, all 7 extension values (including shorthands), slash chords, root normalization, all §4.6 spec examples, and 10 invalid-input cases. Adds requirements.txt with project dependencies. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 11:17:10 +03:00
H1K0	8672c10f78	chore: initialize project scaffold. Add .gitignore (excludes .claude/, venv, checkpoints, processed data, external corpora), .gitattributes (LF normalization, binary markers), full directory tree with .gitkeep placeholders, and src __init__ stubs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 10:28:17 +03:00
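
Note on 7c0d147956 and e657d9edb5: those commits describe masking EOS to -inf while bars_completed < n_bars and sampling with top-p. The sketch below illustrates that combination in plain Python under stated assumptions. It is not the code in src/generate.py: the function names mask_eos and top_p_sample, the list-based logits, and the argument order are all hypothetical, and the real module presumably works on tensors.

```python
import math
import random


def mask_eos(logits, eos_id, bars_completed, n_bars):
    """Return a copy of logits with EOS forced to -inf until n_bars is reached.

    Hypothetical helper: n_bars=None means "model decides", so nothing is masked.
    """
    out = list(logits)  # copy, so the caller's logits are never mutated
    if n_bars is not None and bars_completed < n_bars:
        out[eos_id] = -math.inf
    return out


def top_p_sample(logits, top_p, rng):
    """Nucleus sampling: keep the smallest set of tokens whose mass >= top_p."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]  # exp(-inf) is 0.0
    total = sum(weights)
    probs = [w / total for w in weights]

    # Walk tokens from most to least probable until the nucleus is full.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices renormalises the kept weights internally.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

With this arrangement a masked EOS has probability zero, so it cannot be drawn even when the unmasked model strongly prefers it, which is the situation the _EosHungryModel fixture in the commit is described as covering.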

21 Commits