chore: initialize project scaffold

Add .gitignore (excludes .claude/, venv, checkpoints, processed data,
external corpora), .gitattributes (LF normalization, binary markers),
full directory tree with .gitkeep placeholders, and src __init__ stubs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 10:28:17 +03:00
commit 8672c10f78
10 changed files with 265 additions and 0 deletions
+32
@@ -0,0 +1,32 @@
# Normalize line endings to LF on commit (cross-platform safety)
* text=auto eol=lf
# Python source
*.py text eol=lf
# Custom text formats
*.chord text eol=lf
*.md text eol=lf
*.txt text eol=lf
*.csv text eol=lf
*.json text eol=lf
*.yaml text eol=lf
*.yml text eol=lf
*.toml text eol=lf
*.cfg text eol=lf
*.ini text eol=lf
# Binary assets — never diff/merge
*.pt binary
*.pth binary
*.ckpt binary
*.pkl binary
*.mid binary
*.midi binary
*.png binary
*.jpg binary
*.jpeg binary
*.pdf binary
*.zip binary
*.gz binary
*.tar binary
+61
@@ -0,0 +1,61 @@
# Python
__pycache__/
*.py[cod]
*.pyo
*.pyd
.Python
*.egg-info/
dist/
build/
*.egg
.eggs/
# Virtual environments
.venv/
venv/
env/
ENV/
# pytest
.pytest_cache/
.cache/
htmlcov/
.coverage
coverage.xml
# Jupyter
.ipynb_checkpoints/
*.ipynb_checkpoints
# Model checkpoints (large binaries — commit only intentionally)
checkpoints/*.pt
checkpoints/*.pth
checkpoints/*.ckpt
# Processed data (reproducible from source)
data/processed/*.pt
data/processed/*.pkl
# External corpora (download separately; too large for git)
data/raw_external/
# OS
.DS_Store
Thumbs.db
desktop.ini
# IDEs
.idea/
*.swp
*.swo
# Claude Code
.claude/
# Logs
*.log
logs/
# Misc
*.tmp
*.bak
+172
@@ -0,0 +1,172 @@
# CLAUDE.md
This file gives Claude Code persistent context for the project. Read it before any non-trivial task.
## Project overview
**Goal.** Train a small autoregressive transformer to generate harmonic periods (4–16 bar chord progressions) in the author's compositional style. Coursework deliverable for an ML class at RTU MIREA; also intended as a working creative tool.
**Unit of generation.** A single closed harmonic phrase (a "period"), not a full song.
**Pipeline.**
1. Hand-transcribe own compositions from REAPER DAW projects into `.chord` text files.
2. Parse `.chord` → factorized token sequences.
3. Pre-train on a public corpus (McGill Billboard or similar).
4. Fine-tune on the author's own corpus.
5. Sample new periods conditioned on mode / time / style / function / optional chord prefix.
6. Detokenize back to `.chord` + export to MIDI for use in REAPER.
**Hard deadline.** Less than one month, ~50 hours of work budget.
## Tech stack
- **Python 3.11+**
- **PyTorch** (no Lightning unless complexity demands it — keep training loops readable)
- **music21** for chord symbol parsing (`music21.harmony.ChordSymbol`)
- **pretty_midi** for MIDI generation
- **pytest** for unit tests
- **matplotlib** for plots in the report
- **NumPy, pandas** as standard
- Optional: **Google Colab** for training if local hardware is insufficient. Model is small enough that CPU is viable.
Avoid heavy abstractions. This is coursework, not a production system. Prefer simple imperative scripts over framework-style code.
## Repository layout
```
chord-gen/
├── CLAUDE.md ← this file
├── README.md
├── requirements.txt
├── docs/
│ └── chord_format_spec.md ← authoritative format specification
├── data/
│ ├── raw_user/ ← hand-transcribed .chord files (own corpus)
│ ├── raw_external/ ← public corpora (McGill Billboard etc.)
│ ├── processed/ ← tokenized .pt files ready for training
│ └── holdout/ ← held-out periods for evaluation
├── src/
│ ├── __init__.py
│ ├── tokenizer.py ← .chord ↔ token sequences
│ ├── chord_parser.py ← chord symbol → (root, qual, ext, bass)
│ ├── midi_export.py ← .chord → MIDI for sanity check & user output
│ ├── dataset.py ← PyTorch Dataset over tokenized files
│ ├── model.py ← small transformer
│ ├── train.py ← pre-train and fine-tune entry points
│ ├── generate.py ← inference / sampling
│ ├── evaluate.py ← perplexity + distribution metrics
│ └── external_converters/
│ └── mcgill_to_chord.py ← convert McGill Billboard to .chord
├── tests/
│ ├── test_chord_parser.py
│ ├── test_tokenizer.py
│ ├── test_midi_export.py
│ └── fixtures/
│ └── *.chord
├── notebooks/
│ ├── 01_data_exploration.ipynb
│ ├── 02_training.ipynb
│ └── 03_evaluation.ipynb
└── checkpoints/
├── pretrained.pt
└── finetuned.pt
```
## The `.chord` format
The authoritative specification is in `docs/chord_format_spec.md`. **Always read it before modifying anything that touches the format or the tokenizer.** Critical points summarized here for context only — if anything conflicts, the spec wins.
- One file = one harmonic period (4–16 bars).
- Header lines start with `#`, list `title`, `key`, `time`, `subdivision`, `style`, optional `function`.
- Body: bars separated by `|`, exactly `subdivision` positions per bar (for 4/4), positions separated by single spaces.
- A position holds: chord symbol, `.` (hold previous), `NC` (no chord), or `?` (unknown).
- Chord symbols: `<root><quality?><extension?>(/<bass>)?`. 18 qualities, 7 extensions, slash inversions are mandatory and meaningful.
- Tokenization: each new chord becomes exactly 4 tokens (`ROOT_x`, `QUAL_x`, `EXT_x`, `BASS_x`). Hold = `HOLD`. Bar end = `BAR`. Plus metadata tokens at the start.
- **Keys are normalized.** Before tokenization, the entire period is transposed: majors → C major, minors → A minor. The model never sees absolute keys. The vocabulary contains `MODE_major`/`MODE_minor` but no `KEY_x` tokens.
- Vocabulary size: ~81 tokens.
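
As a rough illustration of the tokenization and key-normalization points above (the spec remains authoritative), the sketch below shifts a period to C major / A minor and emits the factorized tokens. It assumes chords are already parsed into `(root, quality, extension, bass)` tuples. The names `PITCH_CLASSES`, `shift_to_normalized`, `transpose`, and `tokenize_bars`, the sharp-only spelling, and the `maj` / `none` labels are all hypothetical. Metadata tokens, `NC`, and `?` are deliberately left out.

```python
from typing import Union

# Assumption: sharp-only spelling; the real spelling rules live in the spec.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

Chord = tuple[str, str, str, str]  # (root, quality, extension, bass), hypothetical layout
Position = Union[Chord, str]       # a parsed chord, or "." for "hold previous"


def shift_to_normalized(tonic: str, mode: str) -> int:
    """Return the upward semitone shift that maps the key to C major or A minor."""
    if mode not in ("major", "minor"):
        raise ValueError(f"unknown mode {mode!r}")
    target = 0 if mode == "major" else 9  # pitch class of C or of A
    return (target - PITCH_CLASSES.index(tonic)) % 12


def transpose(note: str, shift: int) -> str:
    """Transpose a note name by `shift` semitones, wrapping around the octave."""
    return PITCH_CLASSES[(PITCH_CLASSES.index(note) + shift) % 12]


def tokenize_bars(bars: list[list[Position]], tonic: str, mode: str) -> list[str]:
    """Emit 4 tokens per new chord, HOLD for ".", and BAR at the end of each bar."""
    shift = shift_to_normalized(tonic, mode)
    tokens: list[str] = []
    for bar_no, bar in enumerate(bars, start=1):
        for pos_no, pos in enumerate(bar, start=1):
            if pos == ".":
                tokens.append("HOLD")
            elif isinstance(pos, tuple) and len(pos) == 4:
                root, qual, ext, bass = pos
                tokens += [
                    f"ROOT_{transpose(root, shift)}",
                    f"QUAL_{qual}",
                    f"EXT_{ext}",
                    f"BASS_{transpose(bass, shift)}",
                ]
            else:
                # NC and ? are left to the spec; fail loudly rather than guess.
                raise ValueError(
                    f"unsupported position {pos!r} at bar {bar_no}, position {pos_no}"
                )
        tokens.append("BAR")
    return tokens


# A one-bar period in D major: the D chord becomes a C chord after normalization.
print(tokenize_bars([[("D", "maj", "none", "D"), "."]], "D", "major"))
# ['ROOT_C', 'QUAL_maj', 'EXT_none', 'BASS_C', 'HOLD', 'BAR']
```

Working on pitch classes keeps transposition a single modular addition. Choosing between sharp and flat spellings on the way back is a separate decision that this sketch does not attempt.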
## Model
A small autoregressive transformer:
- Layers: 2–4
- d_model: 128–256
- Heads: 4–8
- FFN dim: 4 × d_model
- Context length: 512 tokens (more than enough for any single period)
- Tied input/output embeddings
- Standard causal mask, next-token prediction with cross-entropy
- AdamW, cosine schedule, warmup ~5% of steps
- Dropout 0.1–0.2
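
For a sense of scale, the hyperparameters listed above can be turned into a rough parameter count. This is back-of-the-envelope arithmetic, not a description of `src/model.py`. It assumes a GPT-style block with biased linear layers, two LayerNorms per layer, a final LayerNorm, and learned positional embeddings, none of which this file fixes. `count_params` is a hypothetical helper.

```python
def count_params(
    n_layers: int,
    d_model: int,
    vocab_size: int = 81,
    context_len: int = 512,
    ffn_mult: int = 4,
) -> int:
    """Rough parameter count for a GPT-style decoder with tied embeddings."""
    d_ff = ffn_mult * d_model
    token_emb = vocab_size * d_model               # shared with the output head (tied)
    pos_emb = context_len * d_model                # assumption: learned positions
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V, and output projections
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                # weight and bias, two norms per layer
    per_layer = attention + ffn + layer_norms
    final_norm = 2 * d_model
    return token_emb + pos_emb + n_layers * per_layer + final_norm


for n_layers, d_model in [(2, 128), (4, 256)]:
    total = count_params(n_layers, d_model)
    print(f"{n_layers} layers, d_model={d_model}: {total:,} parameters")
# 2 layers, d_model=128: 472,704 parameters
# 4 layers, d_model=256: 3,311,360 parameters
```

Under these assumptions the model sits between roughly 0.5M and 3.3M parameters. The head count does not change the total, since the heads only partition `d_model`.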
Pre-training uses the full public corpus. Fine-tuning uses the author's own corpus with a **smaller learning rate** (e.g. 1e-5 vs 1e-4 for pre-training) and **few epochs** (5–15) to avoid catastrophic forgetting of harmonic regularities learned during pre-training.
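
The warmup-plus-cosine schedule and the two learning rates can be sketched as a plain function of the step index. This is a minimal illustration, assuming linear warmup over the first 5% of steps and decay to zero. `lr_at` and its parameters are hypothetical names, and the step counts in the usage lines are made up for the example.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float, warmup_frac: float = 0.05) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Same schedule shape for both phases; only the peak differs.
pretrain_lrs = [lr_at(s, total_steps=1000, peak_lr=1e-4) for s in range(1000)]
finetune_lrs = [lr_at(s, total_steps=200, peak_lr=1e-5) for s in range(200)]
print(max(pretrain_lrs), max(finetune_lrs))
```

In PyTorch the same function could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the ratio to `peak_lr` instead of the absolute value.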
## Inference
- Top-p sampling (nucleus, p ≈ 0.9) with temperature ≈ 1.0 as defaults. Tunable.
- No beam search: on open-ended generative tasks like this it tends to collapse into repetitive, low-diversity output.
- Generation is conditioned by feeding the BOS + metadata tokens explicitly, then optionally a chord prefix from the user.
- After generation, transpose from C/Am to the user's requested key.
- Output: both a `.chord` file and a MIDI file.
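
The default sampling rule above (temperature, then nucleus filtering) can be sketched without any framework. This is a minimal pure-Python illustration operating on a plain list of logits. `sample_top_p` is a hypothetical name, and the real `src/generate.py` would presumably work on tensors instead.

```python
import math
import random
from typing import Optional


def sample_top_p(
    logits: list[float],
    p: float = 0.9,
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token index from the smallest set of tokens whose mass reaches p."""
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely until the cumulative mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus: list[int] = []
    cumulative = 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break

    # random.choices renormalizes the weights within the nucleus.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
print([sample_top_p([2.0, 1.0, 0.1], p=0.9, rng=rng) for _ in range(10)])
```

With these logits the two most likely tokens already cover just over 90% of the mass, so index 2 is never drawn at `p=0.9`.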
## Evaluation
For the report:
1. **Perplexity** on the holdout set, comparing pre-trained baseline vs fine-tuned.
2. **Distribution shift plots** — histograms over chord qualities, extension presence, inversion frequency, root motion intervals — showing how fine-tuning moves the distribution toward the author's corpus.
3. **Qualitative cherry-picked generations** — 3 examples with the same seed/prefix, generated by baseline vs fine-tuned, rendered to MIDI.
No formal blind listening test (out of scope for the deadline).
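
The perplexity in item 1 is the exponential of the mean negative log-likelihood per token. The sketch below assumes the per-token natural-log probabilities of the holdout tokens have already been collected from the model. `perplexity` is a hypothetical helper name.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Return exp(mean negative log-likelihood) over the given tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Sanity check: a model guessing uniformly over 81 tokens scores about 81.
uniform = [math.log(1 / 81)] * 100
print(round(perplexity(uniform), 6))
```

A uniform guess over the roughly 81-token vocabulary gives a perplexity of about 81, which makes a useful upper reference point when comparing the pre-trained and fine-tuned models.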
## Working language
- **Code, identifiers, code comments, log messages, commit messages: English.**
- **User-facing output, the academic report, and the README user guide: Russian** (per university requirements; the report must comply with GOST).
- **Conversations with the developer (the author): Russian.**
When generating commit messages or code comments, write in English. When generating the report or any user-facing text, write in Russian.
## Code style & conventions
- Type hints on all public functions.
- Docstrings: one-line summary + Args/Returns. Keep them concise.
- No `print()` in library code — use the `logging` module. CLI scripts may use `print`.
- Constants in `UPPER_SNAKE_CASE` at module top.
- Vocabulary, token IDs, and label maps live in `src/tokenizer.py` as module-level constants.
- Random seeds: every training / generation script accepts a `--seed` flag and sets `torch.manual_seed`, `numpy.random.seed`, `random.seed`.
- Reproducibility is more important than performance. If a choice is between "fast" and "deterministic", choose deterministic.
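
The seeding convention above could look roughly like this. It is a sketch under assumptions: `set_seed` and `build_parser` are hypothetical names, the default seed of 0 is arbitrary, and the guarded imports exist only so the snippet runs where NumPy or PyTorch is not installed (the project itself depends on both).

```python
import argparse
import random


def set_seed(seed: int) -> None:
    """Seed every random number generator the scripts rely on."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # assumption: only tolerated in this standalone sketch
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # assumption: only tolerated in this standalone sketch


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser that exposes the shared --seed flag."""
    parser = argparse.ArgumentParser(description="Example entry point with a --seed flag.")
    parser.add_argument("--seed", type=int, default=0, help="random seed (default: 0)")
    return parser


args = build_parser().parse_args(["--seed", "42"])
set_seed(args.seed)
first = random.random()
set_seed(args.seed)
print(first == random.random())  # True: reseeding reproduces the same draw
```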
## Testing policy
- Every parser/tokenizer/MIDI module has unit tests.
- Tests use small `.chord` fixtures in `tests/fixtures/`.
- Round-trip property: `tokenize(parse(file))` followed by `detokenize(...)` must reproduce the chord sequence (up to canonical normalization).
- Don't bother unit-testing the training loop or the model. Test the data path.
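
The round-trip property can be expressed as a pytest-style test. The sketch below uses toy stand-ins: `tokenize`, `detokenize`, and the `(root, quality, extension, bass)` tuples are hypothetical simplifications that skip headers, key normalization, and file parsing. A real test would call the functions in `src/tokenizer.py` on fixtures from `tests/fixtures/`.

```python
from typing import Union

Chord = tuple[str, str, str, str]  # (root, quality, extension, bass), hypothetical layout
Position = Union[Chord, str]       # a parsed chord, or "." for "hold previous"
PREFIXES = ("ROOT_", "QUAL_", "EXT_", "BASS_")


def tokenize(bars: list[list[Position]]) -> list[str]:
    """Toy tokenizer: 4 tokens per chord, HOLD for ".", BAR after each bar."""
    tokens: list[str] = []
    for bar in bars:
        for pos in bar:
            if pos == ".":
                tokens.append("HOLD")
            else:
                tokens += [prefix + part for prefix, part in zip(PREFIXES, pos)]
        tokens.append("BAR")
    return tokens


def detokenize(tokens: list[str]) -> list[list[Position]]:
    """Toy inverse of tokenize; raises on any malformed token group."""
    bars: list[list[Position]] = []
    current: list[Position] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "HOLD":
            current.append(".")
            i += 1
        elif tok == "BAR":
            bars.append(current)
            current = []
            i += 1
        else:
            group = tokens[i:i + 4]
            if len(group) != 4 or not all(t.startswith(p) for t, p in zip(group, PREFIXES)):
                raise ValueError(f"malformed chord group at token {i}: {group}")
            current.append(tuple(t.split("_", 1)[1] for t in group))
            i += 4
    if current:
        raise ValueError("token sequence does not end with BAR")
    return bars


def test_round_trip() -> None:
    bars = [[("C", "maj", "none", "C"), ".", ("G", "maj", "7", "B"), "."]]
    assert detokenize(tokenize(bars)) == bars
```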
## Things to never do
- **Do not change the `.chord` format** without first updating `docs/chord_format_spec.md` and bumping its version number. The format is the contract between the human-readable data and the model; changing one side silently breaks everything.
- **Do not modify files in `data/holdout/` or use them during training.** Holdout is held out.
- **Do not add new model architectures "to compare"** unless explicitly asked. One model, done well, beats four half-done.
- **Do not implement bells and whistles** (web UI, real-time audio synthesis, beam search, voicing models). They are explicitly out of scope.
- **Do not silently round or coerce unrecognized chord symbols.** If a chord can't be parsed, raise an error with the file name, bar number, and position. Silent corruption of training data is the worst failure mode here.
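
The last rule above could be enforced with a dedicated exception that always carries the location. This is a sketch under assumptions: `ChordParseError` and `parse_root` are hypothetical names, and the regular expression only checks the root letter and accidental, whereas the real grammar is defined in the spec.

```python
import re


class ChordParseError(ValueError):
    """Raised when a chord symbol cannot be parsed; records where it occurred."""

    def __init__(self, file_name: str, bar: int, position: int, symbol: str) -> None:
        self.file_name = file_name
        self.bar = bar
        self.position = position
        self.symbol = symbol
        super().__init__(
            f"{file_name}: bar {bar}, position {position}: "
            f"cannot parse chord symbol {symbol!r}"
        )


# Assumption: a root is a letter A-G with an optional # or b accidental.
_ROOT_RE = re.compile(r"[A-G][#b]?")


def parse_root(symbol: str, file_name: str, bar: int, position: int) -> str:
    """Return the root of `symbol`, or raise ChordParseError with its location."""
    match = _ROOT_RE.match(symbol)
    if match is None:
        raise ChordParseError(file_name, bar, position, symbol)
    return match.group(0)


try:
    parse_root("H7", "example.chord", bar=3, position=2)
except ChordParseError as err:
    print(err)
# example.chord: bar 3, position 2: cannot parse chord symbol 'H7'
```

Subclassing `ValueError` lets callers catch the error generically, while the extra attributes keep the file name, bar, and position available to tests and log messages.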
## Things to always do
- When asked to add a feature, first identify which module it belongs in (`src/...`) and whether it requires a spec change. State this before writing code.
- When the user describes a bug, write a failing test first, then fix.
- When dependencies change, update `requirements.txt`.
- When adding a CLI script, include `--help` output and a usage example in the script's docstring.
- When a feature needs new parameters, ask whether a CLI flag or a config file is preferred. Default to CLI flags for simplicity.
## Out-of-scope (explicit non-goals for this deliverable)
- Melody generation
- Voice leading / voicing inside chords above the bass
- Rhythmic patterns inside a held chord
- Arrangement, timbre, dynamics
- Web interface / GUI
- Real-time MIDI integration with REAPER
- Modulation handling inside a single period
- J-Pop fine-tuning experiment (future work after coursework deadline)
If the user asks for any of these, remind them it's out of scope and ask whether to proceed anyway or defer.