H1K0 4aead2ea20 feat: remove BAR token; bump spec to v2.3; fix max_seq_len
Bar boundaries are now implicit — the detokenizer counts positions per bar
using TIME × SUB, and the generator gates EOS to bar boundaries only.
Removing the deterministic BAR token reduces vocab size from 85 to 84 and
lets the model focus on meaningful predictions.

- src/tokenizer.py: drop BAR from VOCAB (85→84); replace BAR-based
  detokenize_to_period with position-counting logic; add write_chord_file;
  fix _tokens_to_symbol for add9/m(add9) qualities
- tests/test_tokenizer.py: update vocab-size assertions to 84, structural
  token test, remove bar-count test, add test_no_bar_token_in_vocab
- docs/chord_format_spec.md: bump to v2.3; document BAR removal in §5.2,
  §5.3, §5.4, §5.5, §5.6, §6.2, and changelog
- CLAUDE.md: remove stale BAR reference, update vocab size to 84
- scripts/pretrain.py: raise max_seq_len 256→320 to cover regenerated
  McGill data (mean=83, max=283 tokens with BAR-free tokenizer)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 13:56:34 +03:00


CLAUDE.md

This file gives Claude Code persistent context for the project. Read it before any non-trivial task.

Project overview

Name. hamori (Japanese ハモリ, "harmonization" in the sense of vocal harmony — adding a second voice to a melodic line). The name reflects the project's core idea: the model proposes harmonic ideas to complement a composer's existing intent, rather than writing music from scratch.

Goal. Train a small autoregressive transformer to generate harmonic periods (4–16 bar chord progressions) in the author's compositional style. Coursework deliverable for an ML class at RTU MIREA; also intended as a working creative tool.

Unit of generation. A single closed harmonic phrase (a "period"), not a full song.

Pipeline.

  1. Hand-transcribe own compositions from REAPER DAW projects into .chord text files.
  2. Parse .chord → factorized token sequences.
  3. Pre-train on a public corpus (McGill Billboard or similar).
  4. Fine-tune on the author's own corpus.
  5. Sample new periods conditioned on mode / time / style / function / optional chord prefix.
  6. Detokenize back to .chord + export to MIDI for use in REAPER.

Hard deadline. Less than one month, ~50 hours of work budget.

Tech stack

  • Python 3.11+
  • PyTorch (no Lightning unless complexity demands it — keep training loops readable)
  • music21 for chord symbol parsing (music21.harmony.ChordSymbol)
  • pretty_midi for MIDI generation
  • pytest for unit tests
  • matplotlib for plots in the report
  • NumPy, pandas as standard
  • Optional: Google Colab for training if local hardware is insufficient. Model is small enough that CPU is viable.

Avoid heavy abstractions. This is coursework, not a production system. Prefer simple imperative scripts over framework-style code.

Repository layout

hamori/
├── CLAUDE.md                          ← this file
├── README.md
├── requirements.txt
├── docs/
│   └── chord_format_spec.md           ← authoritative format specification
├── data/
│   ├── raw_user/                      ← hand-transcribed .chord files (own corpus)
│   ├── raw_external/                  ← public corpora (McGill Billboard etc.)
│   ├── processed/                     ← tokenized .pt files ready for training
│   └── holdout/                       ← held-out periods for evaluation
├── src/
│   ├── __init__.py
│   ├── tokenizer.py                   ← .chord ↔ token sequences
│   ├── chord_parser.py                ← chord symbol → (root, qual, ext, bass)
│   ├── midi_export.py                 ← .chord → MIDI for sanity check & user output
│   ├── dataset.py                     ← PyTorch Dataset over tokenized files
│   ├── model.py                       ← small transformer
│   ├── train.py                       ← pre-train and fine-tune entry points
│   ├── generate.py                    ← inference / sampling
│   ├── evaluate.py                    ← perplexity + distribution metrics
│   └── external_converters/
│       └── mcgill_to_chord.py         ← convert McGill Billboard to .chord
├── tests/
│   ├── test_chord_parser.py
│   ├── test_tokenizer.py
│   ├── test_midi_export.py
│   └── fixtures/
│       └── *.chord
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_training.ipynb
│   └── 03_evaluation.ipynb
└── checkpoints/
    ├── pretrained.pt
    └── finetuned.pt

The .chord format

The authoritative specification is in docs/chord_format_spec.md. Always read it before modifying anything that touches the format or the tokenizer. Critical points summarized here for context only — if anything conflicts, the spec wins.

  • One file = one harmonic period (4–16 bars).
  • Header lines start with #, list title, key, time, subdivision, style, optional function.
  • Body: bars separated by |, exactly subdivision positions per bar (for 4/4), positions separated by single spaces.
  • A position holds: chord symbol, . (hold previous), NC (no chord), or ? (unknown).
  • Chord symbols: <root><quality?><extension?>(/<bass>)?. 18 qualities, 7 extensions, slash inversions are mandatory and meaningful.
  • Tokenization: each new chord becomes exactly 4 tokens (ROOT_x, QUAL_x, EXT_x, BASS_x). Hold = HOLD. Bar boundaries are not tokens — the detokenizer reconstructs them by counting positions (TIME × SUB). Plus metadata tokens at the start.
  • Keys are normalized. Before tokenization, the entire period is transposed: majors → C major, minors → A minor. The model never sees absolute keys. The vocabulary contains MODE_major/MODE_minor but no KEY_x tokens.
  • Vocabulary size: 84 tokens.
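The position-counting rule for bar boundaries can be sketched as follows. This is a minimal illustration, not the project's detokenizer: `group_into_bars`, `format_body`, and `positions_per_bar` are hypothetical names, the per-bar count is passed in directly rather than derived from the TIME and SUB metadata (the spec defines that arithmetic), and the spacing around `|` is an assumption.

```python
from typing import List


def group_into_bars(positions: List[str], positions_per_bar: int) -> List[List[str]]:
    """Split a flat list of position symbols into bars of equal length."""
    if positions_per_bar <= 0:
        raise ValueError("positions_per_bar must be positive")
    if len(positions) % positions_per_bar != 0:
        # A period must end on a bar boundary, so a partial last bar is an error.
        raise ValueError(
            f"{len(positions)} positions do not fill whole bars "
            f"of {positions_per_bar} positions"
        )
    return [
        positions[start:start + positions_per_bar]
        for start in range(0, len(positions), positions_per_bar)
    ]


def format_body(bars: List[List[str]]) -> str:
    """Join bars with '|' and positions with single spaces (assumed layout)."""
    return " | ".join(" ".join(bar) for bar in bars)


# Hypothetical flat position stream: two bars of four positions each.
flat = ["C", ".", "Am7", ".", "F", ".", "G7", "."]
bars = group_into_bars(flat, positions_per_bar=4)
body = format_body(bars)
```

Rejecting a partial final bar, rather than padding it, keeps the sketch in line with the rule that unparseable or malformed data should fail loudly.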

Model

A small autoregressive transformer:

  • Layers: 2–4
  • d_model: 128–256
  • Heads: 4–8
  • FFN dim: 4 × d_model
  • Context length: 512 tokens (more than enough for any single period)
  • Tied input/output embeddings
  • Standard causal mask, next-token prediction with cross-entropy
  • AdamW, cosine schedule, warmup ~5% of steps
  • Dropout 0.1–0.2
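The warmup-plus-cosine learning-rate shape listed above can be written as a plain function of the step index. This is a sketch under stated assumptions: `lr_at_step` and its parameters are hypothetical names, the warmup is linear, and the decay ends at `min_lr = 0`; none of these details are fixed by this file.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    peak_lr: float = 1e-4,
    warmup_frac: float = 0.05,
    min_lr: float = 0.0,
) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example: 1000 steps gives 50 warmup steps at warmup_frac = 0.05.
schedule = [lr_at_step(s, total_steps=1000) for s in range(1000)]
```

In a PyTorch training loop the same shape could be expressed as a multiplicative factor passed to `torch.optim.lr_scheduler.LambdaLR`; whether the project does it that way is not stated here.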

Pre-training uses the full public corpus. Fine-tuning uses the own corpus with a smaller learning rate (e.g. 1e-5 vs 1e-4 for pre-training) and few epochs (5–15) to avoid catastrophic forgetting of harmonic regularities learned during pre-training.

Inference

  • Top-p sampling (nucleus, p ≈ 0.9) with temperature ≈ 1.0 as defaults. Tunable.
  • No beam search: it tends to produce repetitive, low-diversity output on open-ended generative tasks like this.
  • Generation is conditioned by feeding the BOS + metadata tokens explicitly, then optionally a chord prefix from the user.
  • After generation, transpose from C/Am to the user's requested key.
  • Output: both a .chord file and a MIDI file.
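Nucleus (top-p) sampling with temperature, as named in the defaults above, can be illustrated in pure Python. This is a dependency-free sketch, not the project's `generate.py`: `sample_top_p` is a hypothetical name, and a real implementation would more likely operate on PyTorch logits tensors.

```python
import math
import random
from typing import List, Optional


def sample_top_p(
    logits: List[float],
    p: float = 0.9,
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token index from the smallest set whose probability mass >= p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if rng is None:
        rng = random.Random()

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely until the nucleus covers mass p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus: List[int] = []
    cumulative = 0.0
    for index in order:
        nucleus.append(index)
        cumulative += probs[index]
        if cumulative >= p:
            break

    # random.choices renormalizes the weights within the nucleus.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


# Seeded generator so repeated runs draw the same samples.
rng = random.Random(0)
peaked = [sample_top_p([10.0, 0.0, 0.0, 0.0], rng=rng) for _ in range(50)]
split = [sample_top_p([0.0, 0.0, -100.0, -100.0], rng=rng) for _ in range(50)]
```

Passing an explicit seeded `random.Random` keeps sampling reproducible, which matches the project's preference for determinism.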

Evaluation

For the report:

  1. Perplexity on the holdout set, comparing pre-trained baseline vs fine-tuned.
  2. Distribution shift plots — histograms over chord qualities, extension presence, inversion frequency, root motion intervals — showing how fine-tuning moves the distribution toward the author's corpus.
  3. Qualitative cherry-picked generations — 3 examples with the same seed/prefix, generated by baseline vs fine-tuned, rendered to MIDI.
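For the first item, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below assumes natural-log probabilities of the correct next token have already been collected from the model; `perplexity` is a hypothetical helper name, not necessarily what `evaluate.py` exposes.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp(mean negative log-likelihood) over natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Reference point: a model guessing uniformly over the 84-token vocabulary.
uniform_baseline = perplexity([math.log(1.0 / 84)] * 10)
```

A uniform guess over the 84-token vocabulary scores a perplexity of 84, which gives a useful ceiling when reading the baseline and fine-tuned numbers.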

No formal blind listening test (out of scope for the deadline).

Working language

  • Code, identifiers, code comments, log messages, commit messages: English.
  • User-facing output, the academic report, and the README user guide: Russian (per university requirements; the report must comply with GOST).
  • Conversations with the developer (the author): Russian.

When generating commit messages or code comments, write in English. When generating the report or any user-facing text, write in Russian.

Code style & conventions

  • Type hints on all public functions.
  • Docstrings: one-line summary + Args/Returns. Keep them concise.
  • No print() in library code — use the logging module. CLI scripts may use print.
  • Constants in UPPER_SNAKE_CASE at module top.
  • Vocabulary, token IDs, and label maps live in src/tokenizer.py as module-level constants.
  • Random seeds: every training / generation script accepts a --seed flag and sets torch.manual_seed, numpy.random.seed, random.seed.
  • Reproducibility is more important than performance. If a choice is between "fast" and "deterministic", choose deterministic.
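The `--seed` convention could look roughly like this. `set_seed` and `build_parser` are hypothetical names, the default seed of 0 is an assumption, and the guarded imports exist only so the sketch runs without NumPy or PyTorch installed; in the project both are regular dependencies.

```python
import argparse
import random


def set_seed(seed: int) -> None:
    """Seed every random number generator the scripts rely on."""
    random.seed(seed)
    try:
        import numpy as np  # guarded only so this sketch runs standalone

        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch  # guarded only so this sketch runs standalone

        torch.manual_seed(seed)
    except ImportError:
        pass


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the shared --seed flag."""
    parser = argparse.ArgumentParser(description="Example script with a --seed flag.")
    parser.add_argument("--seed", type=int, default=0, help="random seed (assumed default: 0)")
    return parser


# Simulate `python script.py --seed 123` and check that reseeding repeats the draw.
args = build_parser().parse_args(["--seed", "123"])
set_seed(args.seed)
first_draw = random.random()
set_seed(args.seed)
second_draw = random.random()
```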

Testing policy

  • Every parser/tokenizer/MIDI module has unit tests.
  • Tests use small .chord fixtures in tests/fixtures/.
  • Round-trip property: tokenize(parse(file)) followed by detokenize(...) must reproduce the chord sequence (up to canonical normalization).
  • Don't bother unit-testing the training loop or the model. Test the data path.
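The round-trip property can be expressed as a pytest-style test. The toy functions below stand in for the real tokenizer, whose signatures are not shown in this file: `toy_tokenize`, `toy_detokenize`, and the quality and extension labels such as `maj` and `none` are all hypothetical. Only the 4-tokens-per-chord and `HOLD` conventions come from the format summary.

```python
from typing import List, Tuple, Union

Chord = Tuple[str, str, str, str]  # (root, quality, extension, bass)
Position = Union[Chord, str]       # a chord, or "." for hold

HOLD = "HOLD"
PREFIXES = ("ROOT_", "QUAL_", "EXT_", "BASS_")


def toy_tokenize(positions: List[Position]) -> List[str]:
    """Emit 4 tokens per chord and one HOLD token per held position."""
    tokens: List[str] = []
    for pos in positions:
        if pos == ".":
            tokens.append(HOLD)
        else:
            tokens.extend(prefix + value for prefix, value in zip(PREFIXES, pos))
    return tokens


def toy_detokenize(tokens: List[str]) -> List[Position]:
    """Invert toy_tokenize, raising on any malformed chord group."""
    positions: List[Position] = []
    i = 0
    while i < len(tokens):
        if tokens[i] == HOLD:
            positions.append(".")
            i += 1
            continue
        group = tokens[i:i + 4]
        well_formed = len(group) == 4 and all(
            tok.startswith(prefix) for tok, prefix in zip(group, PREFIXES)
        )
        if not well_formed:
            raise ValueError(f"malformed chord group at token {i}: {group}")
        positions.append(tuple(tok[len(prefix):] for tok, prefix in zip(group, PREFIXES)))
        i += 4
    return positions


def test_round_trip() -> None:
    positions: List[Position] = [("C", "maj", "none", "C"), ".", ("A", "min", "7", "A"), "."]
    assert toy_detokenize(toy_tokenize(positions)) == positions
```

Against the real modules, the same test would load a fixture from tests/fixtures/ and compare after canonical normalization.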

Things to never do

  • Do not change the .chord format without first updating docs/chord_format_spec.md and bumping its version number. The format is the contract between the human-readable data and the model; changing one side silently breaks everything.
  • Do not modify files in data/holdout/ or use them during training. Holdout is held out.
  • Do not add new model architectures "to compare" unless explicitly asked. One model, done well, beats four half-done.
  • Do not implement bells and whistles (web UI, real-time audio synthesis, beam search, voicing models). They are explicitly out of scope.
  • Do not silently round or coerce unrecognized chord symbols. If a chord can't be parsed, raise an error with the file name, bar number, and position. Silent corruption of training data is the worst failure mode here.
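The last rule, failing loudly with file name, bar number, and position, might be sketched like this. `ChordParseError` and `parse_body_strict` are hypothetical names, the set of known symbols is a stand-in for real chord parsing, and 1-based bar and position numbering is an assumption.

```python
from typing import List, Set


class ChordParseError(ValueError):
    """Raised when a position holds a symbol the parser does not recognize."""

    def __init__(self, symbol: str, file_name: str, bar: int, position: int) -> None:
        self.symbol = symbol
        self.file_name = file_name
        self.bar = bar
        self.position = position
        super().__init__(
            f"{file_name}: bar {bar}, position {position}: "
            f"cannot parse chord symbol {symbol!r}"
        )


def parse_body_strict(body: str, known_symbols: Set[str], file_name: str) -> List[str]:
    """Return the flat symbol list, raising on the first unknown symbol."""
    parsed: List[str] = []
    for bar_no, bar in enumerate(body.split("|"), start=1):
        for pos_no, symbol in enumerate(bar.split(), start=1):
            if symbol not in known_symbols:
                # Never guess or coerce: report exactly where the bad symbol is.
                raise ChordParseError(symbol, file_name, bar_no, pos_no)
            parsed.append(symbol)
    return parsed


# Stand-in vocabulary; the real parser decomposes symbols instead of looking them up.
KNOWN = {"C", "Am7", ".", "NC", "?"}
ok = parse_body_strict("C . | Am7 .", KNOWN, "demo.chord")
```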

Things to always do

  • When asked to add a feature, first identify which module it belongs in (src/...) and whether it requires a spec change. State this before writing code.
  • When the user describes a bug, write a failing test first, then fix.
  • When dependencies change, update requirements.txt.
  • When adding a CLI script, include --help output and a usage example in the script's docstring.
  • When generating a long answer, ask whether a CLI flag or a config file is preferred for new parameters. Default to CLI flags for simplicity.

Out-of-scope (explicit non-goals for this deliverable)

  • Melody generation
  • Voice leading / voicing inside chords above the bass
  • Rhythmic patterns inside a held chord
  • Arrangement, timbre, dynamics
  • Web interface / GUI
  • Real-time MIDI integration with REAPER
  • Modulation handling inside a single period
  • J-Pop fine-tuning experiment (future work after coursework deadline)

If the user asks for any of these, remind them it's out of scope and ask whether to proceed anyway or defer.