
H1K0 4aead2ea20 feat: remove BAR token; bump spec to v2.3; fix max_seq_len

Bar boundaries are now implicit — the detokenizer counts positions per bar
using TIME × SUB, and the generator gates EOS to bar boundaries only.
Removing the deterministic BAR token reduces vocab size from 85 to 84 and
lets the model focus on meaningful predictions.

- src/tokenizer.py: drop BAR from VOCAB (85→84); replace BAR-based
  detokenize_to_period with position-counting logic; add write_chord_file;
  fix _tokens_to_symbol for add9/m(add9) qualities
- tests/test_tokenizer.py: update vocab-size assertions to 84, structural
  token test, remove bar-count test, add test_no_bar_token_in_vocab
- docs/chord_format_spec.md: bump to v2.3; document BAR removal in §5.2,
  §5.3, §5.4, §5.5, §5.6, §6.2, and changelog
- CLAUDE.md: remove stale BAR reference, update vocab size to 84
- scripts/pretrain.py: raise max_seq_len 256→320 to cover regenerated
  McGill data (mean=83, max=283 tokens with BAR-free tokenizer)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-20 13:56:34 +03:00


CLAUDE.md

This file gives Claude Code persistent context for the project. Read it before any non-trivial task.

Project overview

Name. hamori (Japanese ハモリ, "harmonization" in the sense of vocal harmony — adding a second voice to a melodic line). The name reflects the project's core idea: the model proposes harmonic ideas to complement a composer's existing intent, rather than writing music from scratch.

Goal. Train a small autoregressive transformer to generate harmonic periods (4–16 bar chord progressions) in the author's compositional style. Coursework deliverable for an ML class at RTU MIREA; also intended as a working creative tool.

Unit of generation. A single closed harmonic phrase (a "period"), not a full song.

Pipeline.

1. Hand-transcribe the author's own compositions from REAPER DAW projects into .chord text files.
2. Parse .chord → factorized token sequences.
3. Pre-train on a public corpus (McGill Billboard or similar).
4. Fine-tune on the author's own corpus.
5. Sample new periods conditioned on mode / time / style / function / optional chord prefix.
6. Detokenize back to .chord and export to MIDI for use in REAPER.

Hard deadline. Less than one month, ~50 hours of work budget.

Tech stack

Python 3.11+
PyTorch (no Lightning unless complexity demands it — keep training loops readable)
music21 for chord symbol parsing (music21.harmony.ChordSymbol)
pretty_midi for MIDI generation
pytest for unit tests
matplotlib for plots in the report
NumPy, pandas as standard
Optional: Google Colab for training if local hardware is insufficient. The model is small enough that CPU training is viable.

Avoid heavy abstractions. This is coursework, not a production system. Prefer simple imperative scripts over framework-style code.

Repository layout

hamori/
├── CLAUDE.md                          ← this file
├── README.md
├── requirements.txt
├── docs/
│   └── chord_format_spec.md           ← authoritative format specification
├── data/
│   ├── raw_user/                      ← hand-transcribed .chord files (own corpus)
│   ├── raw_external/                  ← public corpora (McGill Billboard etc.)
│   ├── processed/                     ← tokenized .pt files ready for training
│   └── holdout/                       ← held-out periods for evaluation
├── src/
│   ├── __init__.py
│   ├── tokenizer.py                   ← .chord ↔ token sequences
│   ├── chord_parser.py                ← chord symbol → (root, qual, ext, bass)
│   ├── midi_export.py                 ← .chord → MIDI for sanity check & user output
│   ├── dataset.py                     ← PyTorch Dataset over tokenized files
│   ├── model.py                       ← small transformer
│   ├── train.py                       ← pre-train and fine-tune entry points
│   ├── generate.py                    ← inference / sampling
│   ├── evaluate.py                    ← perplexity + distribution metrics
│   └── external_converters/
│       └── mcgill_to_chord.py         ← convert McGill Billboard to .chord
├── tests/
│   ├── test_chord_parser.py
│   ├── test_tokenizer.py
│   ├── test_midi_export.py
│   └── fixtures/
│       └── *.chord
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_training.ipynb
│   └── 03_evaluation.ipynb
└── checkpoints/
    ├── pretrained.pt
    └── finetuned.pt

The `.chord` format

The authoritative specification is in docs/chord_format_spec.md. Always read it before modifying anything that touches the format or the tokenizer. Critical points summarized here for context only — if anything conflicts, the spec wins.

One file = one harmonic period (4–16 bars).
Header lines start with #, list title, key, time, subdivision, style, optional function.
Body: bars separated by |; each bar holds exactly as many positions as the header's subdivision value (for 4/4); positions are separated by single spaces.
A position holds: chord symbol, . (hold previous), NC (no chord), or ? (unknown).
Chord symbols: <root><quality?><extension?>(/<bass>)?. There are 18 qualities and 7 extensions. Slash inversions are mandatory and meaningful: when a chord is inverted, the bass must be written.
Tokenization: each new chord becomes exactly 4 tokens (ROOT_x, QUAL_x, EXT_x, BASS_x). Hold = HOLD. Bar boundaries are not tokens — the detokenizer reconstructs them by counting positions (TIME × SUB). Plus metadata tokens at the start.
Keys are normalized. Before tokenization, the entire period is transposed: majors → C major, minors → A minor. The model never sees absolute keys. The vocabulary contains MODE_major/MODE_minor but no KEY_x tokens.
Vocabulary size: 84 tokens.
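
As a rough illustration of the tokenization rule above, the following sketch emits four tokens per new chord and one HOLD per held position, then recovers bars purely by counting positions. The `Chord` tuple, both function names, and value labels such as `maj` or `none` are hypothetical; the real vocabulary lives in src/tokenizer.py and the spec. NC, ?, and metadata tokens are left out, and `positions_per_bar` is passed in rather than derived from the header.

```python
from typing import NamedTuple, Optional

HOLD = "HOLD"


class Chord(NamedTuple):
    """Already-parsed chord; field values here are illustrative labels."""
    root: str
    qual: str
    ext: str
    bass: str


def positions_to_tokens(positions: list[Optional[Chord]]) -> list[str]:
    """Emit 4 tokens per new chord and one HOLD per held position (None)."""
    tokens: list[str] = []
    for chord in positions:
        if chord is None:
            tokens.append(HOLD)
        else:
            tokens += [f"ROOT_{chord.root}", f"QUAL_{chord.qual}",
                       f"EXT_{chord.ext}", f"BASS_{chord.bass}"]
    return tokens


def split_into_bars(tokens: list[str], positions_per_bar: int) -> list[list[list[str]]]:
    """Group tokens into positions, then positions into bars, by counting."""
    bars: list[list[list[str]]] = []
    current: list[list[str]] = []
    i = 0
    while i < len(tokens):
        if tokens[i] == HOLD:
            width = 1  # a held position is a single token
        elif tokens[i].startswith("ROOT_"):
            width = 4  # ROOT, QUAL, EXT, BASS
        else:
            raise ValueError(f"unexpected token {tokens[i]!r} at index {i}")
        group = tokens[i:i + width]
        if len(group) != width:
            raise ValueError("truncated chord at end of sequence")
        current.append(group)
        i += width
        if len(current) == positions_per_bar:
            bars.append(current)
            current = []
    if current:
        raise ValueError("sequence ends in the middle of a bar")
    return bars


# Two bars of four positions each: "C . . . | G/B . Am7 ."
positions = [Chord("C", "maj", "none", "C"), None, None, None,
             Chord("G", "maj", "none", "B"), None,
             Chord("A", "min", "7", "A"), None]
tokens = positions_to_tokens(positions)
bars = split_into_bars(tokens, positions_per_bar=4)
```

Because every position is either one HOLD or one four-token chord group, the bar structure is fully determined by the position count, which is why no explicit bar token is needed.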

Model

A small autoregressive transformer:

Layers: 2–4
d_model: 128–256
Heads: 4–8
FFN dim: 4 × d_model
Context length: 512 tokens (more than enough for any single period)
Tied input/output embeddings
Standard causal mask, next-token prediction with cross-entropy
AdamW, cosine schedule, warmup ~5% of steps
Dropout 0.1–0.2
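
One common reading of "cosine schedule, warmup ~5% of steps" is a linear warmup followed by cosine decay. The sketch below assumes that shape; it is not the project's actual scheduler, and `lr_at_step` with its parameters is an illustrative name.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float,
               warmup_frac: float = 0.05, min_lr: float = 0.0) -> float:
    """Learning rate for a 0-indexed step: linear warmup, then cosine decay."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine
```

The same function can serve both phases by changing `base_lr`, for example 1e-4 for pre-training and 1e-5 for fine-tuning.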

Pre-training uses the full public corpus. Fine-tuning uses the author's own corpus with a smaller learning rate (e.g. 1e-5 vs 1e-4 for pre-training) and few epochs (5–15) to avoid catastrophic forgetting of the harmonic regularities learned during pre-training.

Inference

Top-p sampling (nucleus, p ≈ 0.9) with temperature ≈ 1.0 as defaults. Tunable.
No beam search — it generally hurts on generative tasks like this.
Generation is conditioned by feeding the BOS + metadata tokens explicitly, then optionally a chord prefix from the user.
After generation, transpose from C/Am to the user's requested key.
Output: both a .chord file and a MIDI file.
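
The sampling defaults above can be sketched without any framework. This is a minimal pure-Python version of temperature scaling plus nucleus filtering over one step's logits, written under the assumption that logits arrive as a plain list of floats; the function names are hypothetical and the real src/generate.py may well work on tensors instead.

```python
import math
import random
from typing import Optional


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], p: float = 0.9) -> list[float]:
    """Keep the smallest set of top tokens whose mass reaches p; renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: list[int] = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_top_p(logits: list[float], p: float = 0.9, temperature: float = 1.0,
                 rng: Optional[random.Random] = None) -> int:
    """Draw one token index from the nucleus of the distribution."""
    rng = rng or random.Random()
    probs = top_p_filter(softmax(logits, temperature), p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Passing an explicitly seeded `random.Random` keeps generation reproducible, in line with the project's preference for determinism.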

Evaluation

For the report:

Perplexity on the holdout set, comparing pre-trained baseline vs fine-tuned.
Distribution shift plots — histograms over chord qualities, extension presence, inversion frequency, root motion intervals — showing how fine-tuning moves the distribution toward the author's corpus.
Qualitative cherry-picked generations — 3 examples with the same seed/prefix, generated by baseline vs fine-tuned, rendered to MIDI.
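
For the perplexity comparison, the quantity is the exponential of the mean per-token negative log-likelihood. A minimal sketch, assuming the natural-log probabilities of the correct tokens have already been collected from the model on the holdout set; `perplexity` is a hypothetical helper, not an existing function in src/evaluate.py.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp(mean negative log-likelihood) over all evaluated tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

As a sanity reference, a model that spreads probability uniformly over the 84-token vocabulary scores a perplexity of 84, so both the baseline and the fine-tuned model should land well below that.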

No formal blind listening test (out of scope for the deadline).

Working language

Code, identifiers, code comments, log messages, commit messages: English.
User-facing output, the academic report, and the README user guide: Russian (per university requirements; the report must comply with GOST).
Conversations with the developer (the author): Russian.

When generating commit messages or code comments, write in English. When generating the report or any user-facing text, write in Russian.

Code style & conventions

Type hints on all public functions.
Docstrings: one-line summary + Args/Returns. Keep them concise.
No print() in library code — use the logging module. CLI scripts may use print.
Constants in UPPER_SNAKE_CASE at module top.
Vocabulary, token IDs, and label maps live in src/tokenizer.py as module-level constants.
Random seeds: every training / generation script accepts a --seed flag and sets torch.manual_seed, numpy.random.seed, random.seed.
Reproducibility is more important than performance. If a choice is between "fast" and "deterministic", choose deterministic.
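
A possible shape for the seeding convention is sketched below. `build_parser` and `seed_everything` are hypothetical names; the calls themselves (`random.seed`, `numpy.random.seed`, `torch.manual_seed`) are the ones listed above. The guarded imports exist only so the sketch runs where NumPy or PyTorch is missing; project scripts would import both directly.

```python
import argparse
import random

try:  # guarded only for this sketch; real scripts import these directly
    import numpy as np
except ImportError:
    np = None
try:
    import torch
except ImportError:
    torch = None


def build_parser() -> argparse.ArgumentParser:
    """CLI parser exposing the --seed flag every script is expected to accept."""
    parser = argparse.ArgumentParser(description="Example script with a --seed flag.")
    parser.add_argument("--seed", type=int, default=0, help="random seed")
    return parser


def seed_everything(seed: int) -> None:
    """Seed every random number generator the project relies on."""
    random.seed(seed)
    if np is not None:
        np.random.seed(seed)
    if torch is not None:
        torch.manual_seed(seed)
```

A script would call `seed_everything(build_parser().parse_args().seed)` once, before building the dataset or the model.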

Testing policy

Every parser/tokenizer/MIDI module has unit tests.
Tests use small .chord fixtures in tests/fixtures/.
Round-trip property: tokenize(parse(file)) followed by detokenize(...) must reproduce the chord sequence (up to canonical normalization).
Don't bother unit-testing the training loop or the model. Test the data path.

Things to never do

Do not change the .chord format without first updating docs/chord_format_spec.md and bumping its version number. The format is the contract between the human-readable data and the model; changing one side silently breaks everything.
Do not modify files in data/holdout/ or use them during training. Holdout is held out.
Do not add new model architectures "to compare" unless explicitly asked. One model, done well, beats four half-done.
Do not implement bells and whistles (web UI, real-time audio synthesis, beam search, voicing models). They are explicitly out of scope.
Do not silently round or coerce unrecognized chord symbols. If a chord can't be parsed, raise an error with the file name, bar number, and position. Silent corruption of training data is the worst failure mode here.
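
The last rule can be illustrated with an error type that carries the location of the bad symbol. This is a sketch under stated assumptions: `ChordParseError`, `parse_body`, and the caller-supplied `parse_symbol` callback are hypothetical names, bar and position numbers are assumed to be 1-indexed, and the real chord grammar lives in src/chord_parser.py.

```python
from typing import Callable

# Position markers that are not chord symbols (hold, no chord, unknown).
SPECIAL = {".", "NC", "?"}


class ChordParseError(ValueError):
    """Raised when a chord symbol cannot be parsed; records where it occurred."""

    def __init__(self, file_name: str, bar: int, position: int, symbol: str):
        self.file_name = file_name
        self.bar = bar
        self.position = position
        self.symbol = symbol
        super().__init__(
            f"{file_name}: bar {bar}, position {position}: "
            f"cannot parse chord symbol {symbol!r}"
        )


def parse_body(body: str, file_name: str,
               parse_symbol: Callable[[str], object]) -> list[list[object]]:
    """Parse a .chord body, failing loudly on the first unrecognized symbol."""
    bar_texts = [b.strip() for b in body.split("|")]
    bar_texts = [b for b in bar_texts if b]  # tolerate leading/trailing "|"
    bars: list[list[object]] = []
    for bar_no, bar_text in enumerate(bar_texts, start=1):
        parsed: list[object] = []
        for pos_no, symbol in enumerate(bar_text.split(), start=1):
            if symbol in SPECIAL:
                parsed.append(symbol)
                continue
            try:
                parsed.append(parse_symbol(symbol))
            except ValueError as exc:
                # Never coerce or skip: re-raise with the full location.
                raise ChordParseError(file_name, bar_no, pos_no, symbol) from exc
        bars.append(parsed)
    return bars
```

Subclassing `ValueError` lets existing `except ValueError` handlers keep working while tests can still assert on the specific file, bar, and position.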

Things to always do

When asked to add a feature, first identify which module it belongs in (src/...) and whether it requires a spec change. State this before writing code.
When the user describes a bug, write a failing test first, then fix.
When dependencies change, update requirements.txt.
When adding a CLI script, include --help output and a usage example in the script's docstring.
When generating a long answer, ask whether a CLI flag or a config file is preferred for new parameters. Default to CLI flags for simplicity.

Out-of-scope (explicit non-goals for this deliverable)

Melody generation
Voice leading / voicing inside chords above the bass
Rhythmic patterns inside a held chord
Arrangement, timbre, dynamics
Web interface / GUI
Real-time MIDI integration with REAPER
Modulation handling inside a single period
J-Pop fine-tuning experiment (future work after coursework deadline)

If the user asks for any of these, remind them it's out of scope and ask whether to proceed anyway or defer.
