# CLAUDE.md

This file gives Claude Code persistent context for the project. Read it before any non-trivial task.

## Project overview

**Name.** _hamori_ (Japanese ハモリ, "harmonization" in the sense of vocal
harmony: adding a second voice to a melodic line). The name reflects the
project's core idea: the model proposes harmonic ideas to complement a
composer's existing intent, rather than writing music from scratch.

**Goal.** Train a small autoregressive transformer to generate harmonic
periods (4–16 bar chord progressions) in the author's compositional style.
Coursework deliverable for an ML class at RTU MIREA; also intended as a
working creative tool.

**Unit of generation.** A single closed harmonic phrase (a "period"), not a full song.

**Pipeline.**

1. Hand-transcribe the author's own compositions from REAPER DAW projects into `.chord` text files.
2. Parse `.chord` → factorized token sequences.
3. Pre-train on a public corpus (McGill Billboard or similar).
4. Fine-tune on the author's own corpus.
5. Sample new periods conditioned on mode / time / style / function / optional chord prefix.
6. Detokenize back to `.chord` + export to MIDI for use in REAPER.

**Hard deadline.** Less than one month, ~50 hours of work budget.

## Tech stack

- **Python 3.11+**
- **PyTorch** (no Lightning unless complexity demands it; keep training loops readable)
- **music21** for chord symbol parsing (`music21.harmony.ChordSymbol`)
- **pretty_midi** for MIDI generation
- **pytest** for unit tests
- **matplotlib** for plots in the report
- **NumPy, pandas** as standard
- Optional: **Google Colab** for training if local hardware is insufficient. The model is small enough that CPU training is viable.

Avoid heavy abstractions. This is coursework, not a production system. Prefer simple imperative scripts over framework-style code.

## Repository layout

```
hamori/
├── CLAUDE.md                    ← this file
├── README.md
├── requirements.txt
├── docs/
│   └── chord_format_spec.md     ← authoritative format specification
├── data/
│   ├── raw_user/                ← hand-transcribed .chord files (own corpus)
│   ├── raw_external/            ← public corpora (McGill Billboard etc.)
│   ├── processed/               ← tokenized .pt files ready for training
│   └── holdout/                 ← held-out periods for evaluation
├── src/
│   ├── __init__.py
│   ├── tokenizer.py             ← .chord ↔ token sequences
│   ├── chord_parser.py          ← chord symbol → (root, qual, ext, bass)
│   ├── midi_export.py           ← .chord → MIDI for sanity check & user output
│   ├── dataset.py               ← PyTorch Dataset over tokenized files
│   ├── model.py                 ← small transformer
│   ├── train.py                 ← pre-train and fine-tune entry points
│   ├── generate.py              ← inference / sampling
│   ├── evaluate.py              ← perplexity + distribution metrics
│   └── external_converters/
│       └── mcgill_to_chord.py   ← convert McGill Billboard to .chord
├── tests/
│   ├── test_chord_parser.py
│   ├── test_tokenizer.py
│   ├── test_midi_export.py
│   └── fixtures/
│       └── *.chord
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_training.ipynb
│   └── 03_evaluation.ipynb
└── checkpoints/
    ├── pretrained.pt
    └── finetuned.pt
```

## The `.chord` format

The authoritative specification is in `docs/chord_format_spec.md`. **Always read it before modifying anything that touches the format or the tokenizer.** The critical points are summarized here for context only; if anything conflicts, the spec wins.

- One file = one harmonic period (4–16 bars).
- Header lines start with `#` and list `title`, `key`, `time`, `subdivision`, `style`, and optionally `function`.
- Body: bars separated by `|`, exactly `subdivision` positions per bar (for 4/4), positions separated by single spaces.
- A position holds: chord symbol, `.` (hold previous), `NC` (no chord), or `?` (unknown).
- Chord symbols: `<root><quality?><extension?>(/<bass>)?`. 18 qualities, 7 extensions, slash inversions are mandatory and meaningful.
- Tokenization: each new chord becomes exactly 4 tokens (`ROOT_x`, `QUAL_x`, `EXT_x`, `BASS_x`). Hold = `HOLD`. Bar boundaries are **not tokens**: the detokenizer reconstructs them by counting positions (`TIME` × `SUB`). Metadata tokens are placed at the start of the sequence.
- **Keys are normalized.** Before tokenization, the entire period is transposed: majors → C major, minors → A minor. The model never sees absolute keys. The vocabulary contains `MODE_major`/`MODE_minor` but no `KEY_x` tokens.
- Vocabulary size: 84 tokens.

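The position-counting rule for bar boundaries can be sketched as follows. This is a minimal illustration, not the code in `src/tokenizer.py`: the name `group_into_bars` and its `positions_per_bar` parameter are hypothetical, the input is assumed to be an already-decoded flat list of position symbols, and the ` | ` spacing between bars is an assumption (the spec defines the exact layout).

```python
def group_into_bars(positions: list[str], positions_per_bar: int) -> str:
    """Rebuild a .chord body line from a flat list of position symbols.

    Hypothetical helper: bar boundaries are not stored as tokens, so they
    are recovered purely by counting positions.
    """
    if positions_per_bar <= 0:
        raise ValueError("positions_per_bar must be positive")
    if len(positions) % positions_per_bar != 0:
        # An incomplete final bar means the sequence did not stop on a boundary.
        raise ValueError(
            f"{len(positions)} positions do not fill whole bars of {positions_per_bar}"
        )
    bars = [
        " ".join(positions[i : i + positions_per_bar])
        for i in range(0, len(positions), positions_per_bar)
    ]
    # Assumed layout: bars joined by " | " with no leading or trailing bar line.
    return " | ".join(bars)


flat = ["C", ".", "Am", ".", "F", ".", "G", "."]
print(group_into_bars(flat, positions_per_bar=4))  # C . Am . | F . G .
```

In the real detokenizer the value of `positions_per_bar` would come from the `TIME` and `SUB` metadata as the spec defines them. Rejecting a partial last bar is one way to surface a generation that did not end on a bar boundary.
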
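Key normalization amounts to shifting every root and bass note by a fixed number of semitones. The sketch below uses plain pitch-class arithmetic; `semitone_shift` and `transpose_note` are hypothetical names, and spelling every result with sharps is a simplification made here (enharmonic spelling in the real tokenizer is governed by the spec).

```python
# Pitch classes for note names; a flat shares its class with the matching sharp.
PITCH_CLASS = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10, "Bb": 10, "B": 11,
}
SHARP_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def semitone_shift(tonic: str, mode: str) -> int:
    """Semitones to add so that `tonic` lands on C (major) or A (minor)."""
    if mode not in ("major", "minor"):
        raise ValueError(f"unknown mode: {mode!r}")
    target = PITCH_CLASS["C"] if mode == "major" else PITCH_CLASS["A"]
    return (target - PITCH_CLASS[tonic]) % 12


def transpose_note(note: str, shift: int) -> str:
    """Shift a note name by `shift` semitones (sharps-only spelling, assumed)."""
    return SHARP_NAMES[(PITCH_CLASS[note] + shift) % 12]


shift = semitone_shift("E", "minor")                      # E minor -> A minor
print([transpose_note(n, shift) for n in ["E", "G", "B"]])  # ['A', 'C', 'E']
```

Passing the negated shift to `transpose_note` undoes the normalization, which is the direction needed when moving generated output back to the user's key.
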
## Model

A small autoregressive transformer:

- Layers: 2–4
- d_model: 128–256
- Heads: 4–8
- FFN dim: 4 × d_model
- Context length: 512 tokens (more than enough for any single period)
- Tied input/output embeddings
- Standard causal mask, next-token prediction with cross-entropy
- AdamW, cosine schedule, warmup ~5% of steps
- Dropout 0.1–0.2

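The ranges above can be pinned down in a small config object. The following is an illustration under assumptions, not the settings in `src/model.py`: `ModelConfig` is a hypothetical name, its defaults are arbitrary picks from inside the listed ranges, and the parameter estimate assumes learned positional embeddings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    # Hypothetical defaults chosen from within the documented ranges.
    vocab_size: int = 84
    n_layers: int = 3
    d_model: int = 128
    n_heads: int = 4
    max_seq_len: int = 512
    dropout: float = 0.1

    def __post_init__(self) -> None:
        # Each attention head needs an equal slice of d_model.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")

    @property
    def ffn_dim(self) -> int:
        return 4 * self.d_model

    def approx_params(self) -> int:
        """Rough weight count, ignoring biases and layer norms."""
        # Tied input/output embeddings are counted once; learned positions assumed.
        embeddings = (self.vocab_size + self.max_seq_len) * self.d_model
        attention = 4 * self.d_model * self.d_model  # Q, K, V and output projections
        ffn = 2 * self.d_model * self.ffn_dim        # up- and down-projection
        return embeddings + self.n_layers * (attention + ffn)


print(ModelConfig().approx_params())
```

With these hypothetical defaults the estimate is about 0.67 M weights, which is in line with the expectation that CPU training is viable.
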
Pre-training uses the full public corpus. Fine-tuning uses the author's own corpus with a **smaller learning rate** (e.g. 1e-5 vs 1e-4 for pre-training) and **few epochs** (5–15) to avoid catastrophic forgetting of harmonic regularities learned during pre-training.

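The warmup-plus-cosine schedule and the two peak learning rates can be expressed as one function of the step index. This is a sketch under assumptions: `lr_at_step` is a hypothetical helper, and both the linear warmup shape and the decay all the way to zero are choices made here rather than project decisions.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float, warmup_frac: float = 0.05) -> float:
    """Linear warmup to peak_lr, then cosine decay to 0 (assumed shape)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up so that the last warmup step reaches peak_lr exactly.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Same schedule shape, different peak: pre-training vs fine-tuning.
for name, peak in [("pretrain", 1e-4), ("finetune", 1e-5)]:
    print(name, [lr_at_step(s, 1000, peak) for s in (0, 49, 500, 999)])
```

For use with PyTorch, the same function could serve as the `lr_lambda` of `torch.optim.lr_scheduler.LambdaLR` by calling it with `peak_lr=1.0`, so that it returns a multiplier on the optimizer's base learning rate.
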
## Inference

- Top-p sampling (nucleus, p ≈ 0.9) with temperature ≈ 1.0 as defaults. Tunable.
- No beam search: it generally hurts on generative tasks like this.
- Generation is conditioned by feeding the BOS + metadata tokens explicitly, then optionally a chord prefix from the user.
- After generation, transpose from C/Am to the user's requested key.
- Output: both a `.chord` file and a MIDI file.

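Nucleus sampling with temperature can be sketched in plain Python as below. This is an illustration, not the code in `src/generate.py`: `sample_top_p` is a hypothetical name, and it works on a list of logits for a single step rather than on a PyTorch tensor.

```python
import math
import random
from typing import Optional


def sample_top_p(
    logits: list[float],
    p: float = 0.9,
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample an index from the smallest set of tokens whose mass reaches p."""
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the most probable tokens until their cumulative mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break

    # random.choices renormalizes the weights of the surviving tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


print(sample_top_p([2.0, 1.0, 0.5, -1.0], rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps sampling reproducible, in line with the project's preference for determinism.
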
## Evaluation

For the report:

1. **Perplexity** on the holdout set, comparing pre-trained baseline vs fine-tuned.
2. **Distribution shift plots**: histograms over chord qualities, extension presence, inversion frequency, and root motion intervals, showing how fine-tuning moves the distribution toward the author's corpus.
3. **Qualitative cherry-picked generations**: 3 examples with the same seed/prefix, generated by baseline vs fine-tuned, rendered to MIDI.

No formal blind listening test (out of scope for the deadline).

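Perplexity is the exponential of the mean per-token negative log-likelihood. A minimal sketch follows, assuming the per-token losses have already been collected as natural-log values; `perplexity` is a hypothetical helper, not the function in `src/evaluate.py`.

```python
import math


def perplexity(token_nlls: list[float]) -> float:
    """exp of the mean per-token negative log-likelihood (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Sanity anchor: guessing uniformly over the 84-token vocabulary costs
# ln(84) nats per token, so the perplexity is the vocabulary size.
uniform = [math.log(84)] * 10
print(round(perplexity(uniform), 6))  # 84.0
```

Any trained model should land well below 84 on the holdout set; a value near it suggests the model has learned little beyond the uniform guess.
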
## Working language

- **Code, identifiers, code comments, log messages, commit messages: English.**
- **User-facing output, the academic report, and the README user guide: Russian** (per university requirements; the report must comply with GOST).
- **Conversations with the developer (the author): Russian.**

When generating commit messages or code comments, write in English. When generating the report or any user-facing text, write in Russian.

## Code style & conventions

- Type hints on all public functions.
- Docstrings: one-line summary + Args/Returns. Keep them concise.
- No `print()` in library code; use the `logging` module. CLI scripts may use `print`.
- Constants in `UPPER_SNAKE_CASE` at module top.
- Vocabulary, token IDs, and label maps live in `src/tokenizer.py` as module-level constants.
- Random seeds: every training / generation script accepts a `--seed` flag and sets `torch.manual_seed`, `numpy.random.seed`, `random.seed`.
- Reproducibility is more important than performance. If a choice is between "fast" and "deterministic", choose deterministic.

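The seeding convention could look roughly like this. `seed_everything` and `build_parser` are hypothetical names. The project depends on both NumPy and PyTorch, so the import guards below exist only to keep the sketch runnable on its own.

```python
import argparse
import random


def seed_everything(seed: int) -> None:
    """Seed every RNG the project touches."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # guard for standalone use of this sketch only
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # guard for standalone use of this sketch only


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Example entry point with a --seed flag.")
    parser.add_argument("--seed", type=int, default=0, help="random seed for reproducible runs")
    return parser


# Parse an explicit argument list here; a real script would call parse_args().
args = build_parser().parse_args(["--seed", "42"])
seed_everything(args.seed)
```
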
## Testing policy

- Every parser/tokenizer/MIDI module has unit tests.
- Tests use small `.chord` fixtures in `tests/fixtures/`.
- Round-trip property: `tokenize(parse(file))` followed by `detokenize(...)` must reproduce the chord sequence (up to canonical normalization).
- Don't bother unit-testing the training loop or the model. Test the data path.

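A round-trip test in the spirit of the policy above might be shaped like this. Everything here is a toy stand-in: `tokenize_positions` and `detokenize_positions` are hypothetical and work on pre-parsed `(root, qual, ext, bass)` tuples, the values `"maj"` and `"none"` are placeholders rather than real vocabulary entries, and `NC`, `?` and metadata tokens are left out.

```python
HOLD = "HOLD"
CHORD_PREFIXES = ["ROOT", "QUAL", "EXT", "BASS"]


def tokenize_positions(positions: list) -> list[str]:
    """Turn "." or (root, qual, ext, bass) positions into a flat token list."""
    tokens: list[str] = []
    for pos in positions:
        if pos == ".":
            tokens.append(HOLD)
        else:
            # Every new chord becomes exactly four tokens, in a fixed order.
            tokens += [f"{prefix}_{value}" for prefix, value in zip(CHORD_PREFIXES, pos)]
    return tokens


def detokenize_positions(tokens: list[str]) -> list:
    """Inverse of tokenize_positions; rejects incomplete chord groups."""
    positions, i = [], 0
    while i < len(tokens):
        if tokens[i] == HOLD:
            positions.append(".")
            i += 1
            continue
        group = tokens[i : i + 4]
        if [t.split("_", 1)[0] for t in group] != CHORD_PREFIXES:
            raise ValueError(f"malformed chord group at token {i}: {group}")
        positions.append(tuple(t.split("_", 1)[1] for t in group))
        i += 4
    return positions


def test_round_trip() -> None:
    original = [("C", "maj", "none", "none"), ".", ("A", "m", "7", "C"), "."]
    assert detokenize_positions(tokenize_positions(original)) == original
```

The real test would start from a fixture in `tests/fixtures/` and compare after canonical normalization rather than on raw tuples.
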
## Things to never do

- **Do not change the `.chord` format** without first updating `docs/chord_format_spec.md` and bumping its version number. The format is the contract between the human-readable data and the model; changing one side silently breaks everything.
- **Do not modify files in `data/holdout/` or use them during training.** Holdout is held out.
- **Do not add new model architectures "to compare"** unless explicitly asked. One model, done well, beats four half-done.
- **Do not implement bells and whistles** (web UI, real-time audio synthesis, beam search, voicing models). They are explicitly out of scope.
- **Do not silently round or coerce unrecognized chord symbols.** If a chord can't be parsed, raise an error with the file name, bar number, and position. Silent corruption of training data is the worst failure mode here.

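The "fail loudly with a location" rule could be carried by a dedicated exception type. The sketch below is illustrative only: `ChordParseError`, `check_symbol` and the tiny `KNOWN_SYMBOLS` set are hypothetical stand-ins for the real parser, and 1-based bar and position numbers are an assumption.

```python
class ChordParseError(ValueError):
    """Raised when a chord symbol cannot be parsed; carries its exact location."""

    def __init__(self, symbol: str, file_name: str, bar: int, position: int) -> None:
        self.symbol = symbol
        self.file_name = file_name
        self.bar = bar
        self.position = position
        super().__init__(
            f"{file_name}: bar {bar}, position {position}: "
            f"cannot parse chord symbol {symbol!r}"
        )


# Toy stand-in for the real parser's notion of a valid symbol.
KNOWN_SYMBOLS = {"C", "Am", "F", "G", ".", "NC", "?"}


def check_symbol(symbol: str, file_name: str, bar: int, position: int) -> str:
    """Return the symbol unchanged, or raise instead of guessing a nearby chord."""
    if symbol not in KNOWN_SYMBOLS:
        raise ChordParseError(symbol, file_name, bar, position)
    return symbol


try:
    check_symbol("Hmaj7", "example.chord", bar=3, position=2)
except ChordParseError as err:
    print(err)
```

Subclassing `ValueError` lets existing `except ValueError` handlers keep working while still exposing the location fields to callers that want them.
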
## Things to always do

- When asked to add a feature, first identify which module it belongs in (`src/...`) and whether it requires a spec change. State this before writing code.
- When the user describes a bug, write a failing test first, then fix.
- When dependencies change, update `requirements.txt`.
- When adding a CLI script, include `--help` output and a usage example in the script's docstring.
- When generating a long answer, ask whether a CLI flag or a config file is preferred for new parameters. Default to CLI flags for simplicity.

## Out-of-scope (explicit non-goals for this deliverable)

- Melody generation
- Voice leading / voicing inside chords above the bass
- Rhythmic patterns inside a held chord
- Arrangement, timbre, dynamics
- Web interface / GUI
- Real-time MIDI integration with REAPER
- Modulation handling inside a single period
- J-Pop fine-tuning experiment (future work after coursework deadline)

If the user asks for any of these, remind them it's out of scope and ask whether to proceed anyway or defer.