Commit Graph

41 Commits

Author SHA1 Message Date
H1K0 e3983ca1bd data: add "pastoral" chord files (verse, bridge) 2026-05-20 18:53:48 +03:00
H1K0 c62a05012c data: add "summer_rain" chord files (verse, chorus) 2026-05-20 18:49:11 +03:00
H1K0 bcc03b04fd data: add "wake_up" chord files (verse, prechorus, chorus, interlude) 2026-05-20 18:41:28 +03:00
H1K0 8ada4a92ed data: add "sparkle" chord files (verse, chorus, interlude) 2026-05-20 18:30:42 +03:00
H1K0 e33f715d2d data: add "reindeer_team" chord files (verse, chorus, bridge) 2026-05-20 18:22:33 +03:00
H1K0 248a6f14b7 chore: add output/ to .gitignore
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 14:33:38 +03:00
H1K0 e657d9edb5 feat: add generate module and CLI; fix tokenizer minor issues
src/generate.py: autoregressive generation with top-p sampling, grammar
masking (ROOT→QUAL→EXT→BASS; EOS only at bar boundary), key transposition,
and optional chord prefix.  Partial bars on context truncation are padded
with HOLDs rather than discarded.

scripts/generate.py: CLI wrapping generate_period — accepts mode, key,
time, subdivision, style, function, prefix, temperature, top-p, seed,
tempo; writes .chord and optional MIDI.

src/tokenizer.py: fix docstring vocab size (81→84); normalize redundant
BASS_<note>==root to no slash in _tokens_to_symbol.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 14:28:44 +03:00
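The top-p sampling with grammar masking described in e657d9edb5 can be illustrated with a minimal, dependency-free sketch. This is not the code in src/generate.py: the function name `top_p_sample`, its parameters, and the `allowed` set standing in for the ROOT→QUAL→EXT→BASS grammar are assumptions made for illustration only.

```python
import math
import random


def top_p_sample(logits, allowed, top_p=0.9, temperature=1.0, rng=None):
    """Sample one token id from `logits`, restricted to the ids in `allowed`.

    Hypothetical helper: `allowed` stands in for whatever the grammar permits
    at the current step; how that set is computed is not shown here.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    # Grammar masking: ids outside `allowed` can never be sampled.
    ids = [i for i in range(len(logits)) if i in allowed]
    if not ids:
        raise ValueError("grammar mask allows no tokens")
    # Temperature-scaled softmax over the allowed ids (max-subtracted for stability).
    scaled = [logits[i] / temperature for i in ids]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = sorted(((w / total, i) for w, i in zip(weights, ids)), reverse=True)
    # Nucleus: the smallest high-probability prefix whose mass reaches top_p.
    nucleus, mass = [], 0.0
    for p, i in probs:
        nucleus.append((p, i))
        mass += p
        if mass >= top_p:
            break
    # Draw from the nucleus, renormalised by its total mass.
    r = rng.random() * mass
    for p, i in nucleus:
        r -= p
        if r <= 0:
            return i
    return nucleus[-1][1]
```

Masking before the softmax means the nucleus is computed only over grammatically legal tokens, so a high-probability illegal token cannot crowd legal ones out of the top-p set.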
H1K0 8a73394df9 data: update pretrained checkpoint results (BAR-free tokenizer)
Re-run pre-training results with the corrected 84-token vocabulary and
max_seq_len=320.  Previous checkpoint was trained on stale data with BAR
tokens and a corrupted tokenizer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 14:28:00 +03:00
H1K0 4aead2ea20 feat: remove BAR token; bump spec to v2.3; fix max_seq_len
Bar boundaries are now implicit — the detokenizer counts positions per bar
using TIME × SUB, and the generator gates EOS to bar boundaries only.
Removing the deterministic BAR token reduces vocab size from 85 to 84 and
lets the model focus on meaningful predictions.

- src/tokenizer.py: drop BAR from VOCAB (85→84); replace BAR-based
  detokenize_to_period with position-counting logic; add write_chord_file;
  fix _tokens_to_symbol for add9/m(add9) qualities
- tests/test_tokenizer.py: update the vocab-size assertions (now 84) and the
  structural token test; remove the bar-count test; add test_no_bar_token_in_vocab
- docs/chord_format_spec.md: bump to v2.3; document BAR removal in §5.2,
  §5.3, §5.4, §5.5, §5.6, §6.2, and changelog
- CLAUDE.md: remove stale BAR reference, update vocab size to 84
- scripts/pretrain.py: raise max_seq_len 256→320 to cover regenerated
  McGill data (mean=83, max=283 tokens with BAR-free tokenizer)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 13:56:34 +03:00
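The implicit bar boundaries introduced in 4aead2ea20 amount to counting positions. The sketch below shows the idea under stated assumptions: `split_into_bars` and `eos_allowed` are hypothetical names, and `positions_per_bar` is passed in directly because the commit message does not spell out how it is derived from TIME and SUB.

```python
def split_into_bars(positions, positions_per_bar):
    """Group a flat list of per-position tokens into bars by counting.

    `positions_per_bar` is taken as given here (an assumption); the real
    detokenizer derives it from the TIME and SUB metadata.
    """
    if positions_per_bar <= 0:
        raise ValueError("positions_per_bar must be positive")
    if len(positions) % positions_per_bar != 0:
        raise ValueError("sequence ends in the middle of a bar")
    return [
        positions[i:i + positions_per_bar]
        for i in range(0, len(positions), positions_per_bar)
    ]


def eos_allowed(positions_emitted, positions_per_bar):
    """EOS is legal only when the positions emitted so far fill whole bars."""
    return positions_emitted % positions_per_bar == 0
```

With two positions per bar, `split_into_bars(["C", "HOLD", "G", "HOLD"], 2)` yields two bars, and `eos_allowed(3, 2)` is false because the second bar is still incomplete.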
H1K0 329952b02e data: add pre-training results from Google Colab run
Includes log CSV (50 epochs), loss-curve plot, and report.
Training ran on Colab GPU (T4).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 13:10:34 +03:00
H1K0 89770dd009 feat: add Colab bundle script and pre-training notebook
scripts/make_colab_zip.py packages src/, scripts/pretrain.py,
requirements.txt, and processed .pt files into hamori_colab.zip,
remapping data/processed/{train,val}/ -> data/processed/mcgill/{train,val}/
so pretrain.py finds the data without modification.

notebooks/colab_pretrain.ipynb guides through upload, extraction,
dependency install, training run, report display, and results download.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 13:00:03 +03:00
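The path remapping performed when building the Colab bundle (89770dd009) could look roughly like the sketch below. It is an illustration only: `remap` and `bundle` are hypothetical names, not the functions in scripts/make_colab_zip.py.

```python
import zipfile
from pathlib import Path


def remap(arcname):
    """Map data/processed/{train,val}/... to data/processed/mcgill/{train,val}/...

    All other paths are returned unchanged (in POSIX form).
    """
    parts = Path(arcname).parts
    if (
        len(parts) > 2
        and parts[:2] == ("data", "processed")
        and parts[2] in ("train", "val")
    ):
        return Path("data", "processed", "mcgill", *parts[2:]).as_posix()
    return Path(arcname).as_posix()


def bundle(root, files, zip_path):
    """Write `files` (paths relative to `root`) into a zip under remapped names."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel in files:
            zf.write(Path(root) / rel, arcname=remap(rel))
```

Renaming entries at archive time, rather than moving files on disk, is what lets pretrain.py find the data without modification after extraction.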
H1K0 0682ccc140 docs: update README, architecture, requirements (v1.1)
README:
- processed/ tree now shows mcgill/ and user/ subdirs
- --style user -> --style H1K0 in quick-start prefix example
- pretrained.report.txt and finetuned.report.txt added to artifact tables

architecture.md (-> v1.1):
- remove stale music21 fallback mention from chord_parser section
- fix ChordDataset: on-demand loading, not eager; remove non-existent
  make_dataloader from public interface
- fix train function name: train_model -> train
- update logging description: report goes to .report.txt, not stdout
- note that scripts use max_seq_len=256 (sequences top out at 195 tokens)

requirements.md (-> v1.1):
- FT-12: update from unified script to pretrain.py + train.py pair

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 12:46:09 +03:00
H1K0 03b464973a feat: write training report to file instead of stdout
pretrain.py -> checkpoints/pretrained.report.txt
train.py    -> checkpoints/finetuned.report.txt

Only a single line, `[report] saved -> <path>`, is printed to stdout instead.
Also fix an arrow character that is incompatible with the Windows cp1251 console.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 12:40:44 +03:00
H1K0 632407ebef refactor: split training scripts into pretrain.py and train.py
- scripts/run_pretrain.py -> scripts/pretrain.py: pre-trains on McGill
  corpus (data/processed/mcgill/), saves checkpoints/pretrained.pt.
- scripts/train.py: rewritten as high-level fine-tune wrapper; loads
  pretrained.pt, trains on data/processed/user/, saves finetuned.pt.
  Both scripts include timing estimate, loss-curve plot, per-epoch report,
  and --skip-training flag.
- README: updated section 7 to reflect new script names and separate
  data directories.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 12:35:23 +03:00
H1K0 65c3f6bf7c data: add 2-epoch smoke pre-training log (_smoke_pretrain.log.csv)
Sanity run: McGill corpus, max_seq_len=256, batch_size=32, lr=3e-4, seed=42.
Epoch 1: train=1.2603 val=0.6403 ppl=1.90
Epoch 2: train=0.5979 val=0.5809 ppl=1.80

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 12:16:28 +03:00
H1K0 dd4f21f17f feat: add run_pretrain.py; fix output-path naming and max_seq_len
- scripts/run_pretrain.py: single-command pre-training runner with
  timing estimate, loss-curve plot (matplotlib), and per-epoch report.
  Sets max_seq_len=256 (McGill sequences max out at 195 tokens, ~4x
  faster attention than the 512 default).
- src/train.py: normalise --output so pretrained.pt and pretrained both
  produce pretrained.pt + pretrained.log.csv (not pretrained.pt.log.csv).
  Serialize Path fields as strings in checkpoint to satisfy weights_only.
- requirements.txt: drop unused pandas/music21, add mido (pretty_midi dep).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 12:13:38 +03:00
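The --output normalisation described in dd4f21f17f can be sketched with pathlib. The helper name `output_paths` is hypothetical; the sketch only demonstrates the stated behaviour that `pretrained` and `pretrained.pt` resolve to the same pair of files.

```python
from pathlib import Path


def output_paths(output):
    """Return (checkpoint_path, log_path) for an --output value.

    A trailing ".pt" is stripped first, so "pretrained" and "pretrained.pt"
    both give pretrained.pt and pretrained.log.csv.
    """
    path = Path(output)
    stem = path.with_suffix("") if path.suffix == ".pt" else path
    checkpoint = stem.with_name(stem.name + ".pt")
    log = stem.with_name(stem.name + ".log.csv")
    return checkpoint, log
```

Appending to the name, rather than calling `with_suffix` on the result, avoids clobbering dots that are part of the base name.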
H1K0 733e1fde1f feat: implement training loop and CLI (src/train.py, scripts/train.py)
AdamW + cosine-with-warmup schedule, PAD-ignoring cross-entropy, per-epoch
CSV logging, best-val-loss checkpointing, early stopping (patience=5).
Same script handles both pre-training and fine-tuning via --init-from.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 11:15:39 +03:00
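The cosine-with-warmup schedule and patience-based early stopping named in 733e1fde1f follow a common pattern, sketched here in plain Python. The function names, the linear warmup shape, and the `min_lr` parameter are assumptions; src/train.py may differ in each of these details.

```python
import math


def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def should_stop(val_losses, patience=5):
    """True once the best validation loss is at least `patience` epochs old."""
    if not val_losses:
        return False
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience
```

The rate rises to `peak_lr` over the warmup steps and then falls along half a cosine period, while `should_stop` ends training after `patience` consecutive epochs without a new best validation loss.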
H1K0 10229be042 feat: implement ChordTransformer (pre-norm decoder-only transformer)
Adds src/model.py with a weight-tied autoregressive transformer and
tests/test_model.py with shape, weight-tying, and causal-masking checks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 11:09:11 +03:00
H1K0 0712eec578 data: add "la_veille_de_noel" chord files (intro, verse1, chorus1, verse2, chorus2, outro) 2026-05-20 11:01:06 +03:00
H1K0 555205b7d2 docs: update vocab size (81→85), spec version (2.0→2.2), style tag (user→H1K0)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 03:23:55 +03:00
H1K0 0a1dcc0ec2 data: use Japanese titles in chord file headers 2026-05-20 03:07:59 +03:00
H1K0 7c9b6c3749 data: add "golos" chord files (verse, prechorus, chorus, bridge, interlude) 2026-05-20 03:02:28 +03:00
H1K0 f910b066bb data: extend some chord files 2026-05-20 02:33:02 +03:00
H1K0 31bc332c5c data: add "irozuku_sekai" chord files (intro, verse, chorus, bridge) 2026-05-20 02:26:11 +03:00
H1K0 895f1df54f data: add "ciel_dhiver" chord files (intro, verse, chorus, bridge) 2026-05-20 02:10:52 +03:00
H1K0 eee5e97194 data: add first hand-transcribed user corpus chord files
Three songs (electricity, hikari_no_shizuku, okazalos) covering verse/chorus/bridge sections, all tagged style: H1K0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 01:46:47 +03:00
H1K0 f0352015cf docs: simplify §8 filename convention to snake_case_title-function.chord
Replace the YYYY_NNN_kebab-case scheme with title_in_snake_case-function.chord.
Snake_case makes the title double-click-selectable; dash unambiguously
separates the title from the optional function suffix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 00:51:22 +03:00
H1K0 3cd9c29d9f feat: extend time signature support to 9 metres (add 5/4, 7/4, 7/8, 9/8)
Add 5/4, 7/4, 7/8, 9/8 to _VALID_TIMES and VOCAB (TIME_* tokens).
Vocab size grows from 81 to 85 tokens. _parse_metre in the McGill
converter assigns subdivision=8 to 7/8 and 9/8. Spec bumped to v2.2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 00:37:05 +03:00
H1K0 4fd8ece170 refactor: replace fixed STYLE_user with open-ended style tag system
- STYLE_user renamed to STYLE_H1K0 in VOCAB (author's personal tag)
- Style field now accepts any [A-Za-z][A-Za-z0-9_]* identifier in .chord files
- Unknown styles fall back to STYLE_other at tokenization time with a log warning
- Test fixtures updated to style: other; drop closed _VALID_STYLES frozenset
- Spec bumped to v2.1: documents open style field, fallback behaviour, and §5.7
  guide on registering a new style token

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 00:29:52 +03:00
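The open-ended style tag with fallback from 4fd8ece170 can be sketched as follows. The identifier pattern is taken from the commit message. The `KNOWN_STYLES` set, the `style_token` name, and raising `ValueError` on a malformed identifier are assumptions for illustration.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Identifier pattern quoted in the commit message.
_STYLE_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

# Assumed: the set of styles that have their own STYLE_* token in VOCAB.
KNOWN_STYLES = {"H1K0", "other"}


def style_token(style):
    """Map a .chord style field to a vocabulary token name."""
    if not _STYLE_RE.fullmatch(style):
        raise ValueError(f"invalid style identifier: {style!r}")
    if style not in KNOWN_STYLES:
        # Unknown but well-formed styles degrade gracefully instead of failing.
        logger.warning("unknown style %r, falling back to STYLE_other", style)
        return "STYLE_other"
    return f"STYLE_{style}"
```

Under this scheme, registering a new style would only require adding it to the known set (and to the vocabulary); files already tagged with that style would need no changes.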
H1K0 84ba7b4743 feat: add dataset, prepare_data pipeline and fix McGill converter
- src/dataset.py: ChordDataset wrapping .pt files with pad/truncate
- scripts/prepare_data.py: tokenize .chord to .pt with train/val/holdout
  split, logs token length stats and style/function distributions
- src/external_converters/mcgill_to_chord.py: rewrite parser for real
  McGill v2 format (2-column annotation, each bar in its own pipe group,
  interval bass notation e.g. /5 and /b3)
- .gitignore: exclude data/processed/train, val, holdout subdirectories
- tests: 37 new tests for ChordDataset and converter (260 total, all pass)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 18:09:46 +03:00
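The pad/truncate step mentioned for ChordDataset in 84ba7b4743 is a standard fixed-length transform, sketched here on plain lists rather than tensors. The name `pad_or_truncate` and the default `pad_id=0` are assumptions; the log does not state the actual PAD index.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return a copy of `ids` that is exactly `max_len` items long.

    `pad_id=0` is a placeholder value, not the project's confirmed PAD index.
    """
    if max_len < 0:
        raise ValueError("max_len must be non-negative")
    if len(ids) >= max_len:
        return list(ids[:max_len])
    return list(ids) + [pad_id] * (max_len - len(ids))
```

For example, `pad_or_truncate([5, 6, 7], 5)` appends two pad ids, while `pad_or_truncate([5, 6, 7], 2)` keeps only the first two tokens.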
H1K0 ea32bf43b2 feat: implement McGill Billboard converter (Harte → .chord)
Adds src/external_converters/mcgill_to_chord.py with two public functions:
  - convert_song(song_dir, output_dir) — converts one salami_chords.txt to
    per-section .chord files (4–16 bars each, style=other)
  - convert_dataset(dataset_dir, output_dir) — batch converts all songs

Key decisions:
  - Harte qualities mapped to our 18-quality vocabulary; hdim7 → m7b5,
    parenthetical alterations (e.g. 7(b9)) handled via regex
  - Bar duration estimated from median non-trivial chord duration
  - Mode (major/minor) inferred from tonic chord quality distribution
  - Sections with <4 or >16 bars are skipped with a logged reason
  - Unrecognized Harte chords skip the whole section (no silent corruption)

48 new tests in tests/test_mcgill_converter.py; total suite 223 passed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 17:04:02 +03:00
H1K0 54be1be9ce feat: implement src/midi_export.py — .chord → two-track MIDI
chord_file_to_midi() parses the period in the user's original key (no
transposition), accumulates held-chord segments, then writes two pretty_midi
tracks: chords with root anchored at octave 4 (MIDI 60–71 + intervals) and
bass at octave 2 (MIDI 36–47).  Extension notes are added as a fifth voice
at their standard interval above the root.  Tempo is parameterised; the CLI
wrapper (python -m src.midi_export) supports --tempo BPM.

10 tests cover: file creation, parseability, instrument count and names,
chord/bass note counts for a 4-chord C-major fixture (14 chord + 4 bass),
octave placement assertions, and tempo affecting total duration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 15:56:44 +03:00
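The octave placement described in 54be1be9ce (chord roots in MIDI 60–71, bass in MIDI 36–47) reduces to simple pitch arithmetic. The sketch below is not src/midi_export.py: the function names, the quality labels `maj` and `m`, and the two-entry interval table are hypothetical and cover only plain triads.

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Hypothetical interval table (semitones above the root) for two triad types.
INTERVALS = {"maj": (0, 4, 7), "m": (0, 3, 7)}


def chord_pitches(root, quality):
    """MIDI pitches for a chord whose root is anchored in octave 4 (60-71)."""
    base = 60 + NOTES.index(root)
    return [base + interval for interval in INTERVALS[quality]]


def bass_pitch(note):
    """MIDI pitch for a bass note anchored in octave 2 (36-47)."""
    return 36 + NOTES.index(note)
```

Only the root is confined to 60–71; the upper chord tones sit at their intervals above it, so an A minor triad reaches MIDI 76.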
H1K0 868af4ac42 feat: add vocabulary constants and tokenize/detokenize to tokenizer.py
Adds VOCAB (81 tokens), TOKEN_TO_ID, and ID_TO_TOKEN per spec §5.2.
tokenize_period() transposes to C/Am then emits BOS + metadata tokens +
per-bar chord/HOLD/NC tokens + BAR + EOS.  detokenize_to_period() is the
exact inverse, returning a ChordPeriod in canonical key.  The m(add9)
quality maps to QUAL_m_add9 in the vocab (parentheses not valid in token
names) via _qual_token/_token_qual helpers.

36 new tests cover vocabulary integrity, token sequence structure,
and full round-trip fidelity for all four valid fixture files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 15:47:28 +03:00
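The quality-to-token naming in 868af4ac42 needs a special case because parentheses are not valid in token names. A minimal sketch of such a pair of helpers is shown below; the names mirror `_qual_token` and `_token_qual` from the commit message, but the bodies and the lookup table are assumptions.

```python
# Qualities whose spelling cannot be used verbatim in a token name.
_SPECIAL = {"m(add9)": "m_add9"}
_SPECIAL_INV = {name: quality for quality, name in _SPECIAL.items()}


def qual_token(quality):
    """Quality string -> vocabulary token name, e.g. "m(add9)" -> "QUAL_m_add9"."""
    return "QUAL_" + _SPECIAL.get(quality, quality)


def token_qual(token):
    """Inverse of qual_token."""
    prefix = "QUAL_"
    if not token.startswith(prefix):
        raise ValueError(f"not a quality token: {token!r}")
    name = token[len(prefix):]
    return _SPECIAL_INV.get(name, name)
```

An explicit table keeps the mapping invertible, which is what the round-trip tests mentioned in the commit depend on.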
H1K0 355341dab9 docs: revise glossary to v1.1 — fix factual errors, improve ML explanations for beginners
Six factual corrections: slash chord definition (inversions vs on-chords),
logits described as unnormalized, #11 = eleventh not fourth, duplicate
Polukadentsiya sentence removed, pre-training LR corrected to 1e-4,
unverified PAD index claim removed. All ML term explanations rewritten
with analogies accessible to a near-beginner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 15:39:25 +03:00
H1K0 0444f049c0 chore: switch binary assets to Git LFS tracking
Replace plain `binary` markers with `filter=lfs diff=lfs merge=lfs -text`
for all large binary extensions (.pt, .pth, .ckpt, .pkl, .mid, .midi,
.png, .jpg, .jpeg, .pdf, .zip, .gz, .tar).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 15:28:45 +03:00
H1K0 a473499fac feat: implement .chord file parser and canonical transposer; freeze requirements
src/tokenizer.py:
  - parse_chord_file(Path) → ChordPeriod: reads header + bar body, strips //
    comments, validates bar position counts and chord symbols, raises
    ChordFormatError with filename and bar number on any violation.
  - transpose_to_canonical(ChordPeriod) → ChordPeriod: shifts all chord roots
    and bass notes by the semitone offset to C major / A minor; fast-path
    returns the original object when shift == 0.

tests/test_chord_file_parser.py: 39 tests covering parsing of 4 valid fixtures
  (C major, F# major, B minor, G# minor), error messages for 2 invalid
  fixtures, and transposition correctness including slash chord root+bass.

tests/fixtures/: 6 .chord fixture files (4 valid, 2 invalid).

requirements.txt: pinned to current latest stable versions
  (torch 2.12.0, music21 10.1.0, pretty_midi 0.2.11, matplotlib 3.10.9,
  numpy 2.4.6, pandas 3.0.3, pytest 9.0.3); Python >= 3.11 noted.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 15:27:57 +03:00
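The canonical transposition in a473499fac shifts every root and bass note by one semitone offset so that the tonic lands on C (major) or A (minor). The pitch-class arithmetic can be sketched as below. `canonical_shift` and `transpose_note` are hypothetical names, and sharp-only note spelling is assumed, in line with the flat→sharp normalization of the chord symbol parser.

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def canonical_shift(tonic, mode):
    """Semitones to add so that `tonic` lands on C (major) or A (minor)."""
    target = "C" if mode == "major" else "A"
    return (NOTES.index(target) - NOTES.index(tonic)) % 12


def transpose_note(note, shift):
    """Shift a note name by `shift` semitones, wrapping around the octave."""
    return NOTES[(NOTES.index(note) + shift) % 12]
```

For a slash chord, the same shift is applied to the root and the bass independently, so their interval is preserved. A shift of 0 corresponds to the fast path in which the period is already in the canonical key.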
H1K0 dd77de00d0 feat: implement chord symbol parser with full test suite
Adds src/chord_parser.py with parse_chord_symbol() → ChordTokens.
Handles all 18 qualities (including Unicode °/Δ/ø variants and
alternative spellings), shorthand expansion (maj9 → maj7+ext9,
C9 → 7+ext9, etc.), slash chords, and flat→sharp root normalization.
Raises ChordParseError with a descriptive message on bad input.

Adds tests/test_chord_parser.py: 90 tests covering all qualities,
all 7 extension values (including shorthands), slash chords, root
normalization, all §4.6 spec examples, and 10 invalid-input cases.

Adds requirements.txt with project dependencies.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 11:17:10 +03:00
H1K0 75fa07bf6c docs: add README, architecture, glossary, requirements; update CLAUDE.md
Add four Russian-language project documents:
- README.md: user-facing guide (install, quick start, data prep, training,
  evaluation, limitations)
- docs/architecture.md v1.0: system architecture, data flow diagrams,
  module interfaces, 7 architectural decision records, extension points
- docs/glossary.md v1.0: musical, ML, and project-specific term definitions
- docs/requirements.md v1.0: functional/non-functional requirements,
  acceptance criteria, four use-case scenarios

Update CLAUDE.md with project name etymology (hamori / ハモリ) and rename
repo root reference from chord-gen to hamori. Refine chord_format_spec.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 11:00:21 +03:00
H1K0 9929209bcf docs: update chord format spec to v2.0
Revise and expand the authoritative .chord format specification:
version bump 1.0 → 2.0, clarified period scope, updated tokenization
rules, grammar tables, and key normalization details.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 10:41:57 +03:00
H1K0 967257a555 docs: add chord format specification
Authoritative .chord file format spec covering header fields, body
syntax, chord symbol grammar, tokenization rules, and key normalization.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 10:40:39 +03:00
H1K0 8672c10f78 chore: initialize project scaffold
Add .gitignore (excludes .claude/, venv, checkpoints, processed data,
external corpora), .gitattributes (LF normalization, binary markers),
full directory tree with .gitkeep placeholders, and src __init__ stubs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 10:28:17 +03:00