chore: initialize project scaffold

Add .gitignore (excludes .claude/, venv, checkpoints, processed data, external corpora), .gitattributes (LF normalization, binary markers), full directory tree with .gitkeep placeholders, and src __init__ stubs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 10:28:17 +03:00
commit 8672c10f78
10 changed files with 265 additions and 0 deletions
@@ -0,0 +1,32 @@
 # Normalize line endings to LF on commit (cross-platform safety)
 * text=auto eol=lf
 # Python source
 *.py text eol=lf
 # Custom text formats
 *.chord text eol=lf
 *.md text eol=lf
 *.txt text eol=lf
 *.csv text eol=lf
 *.json text eol=lf
 *.yaml text eol=lf
 *.yml text eol=lf
 *.toml text eol=lf
 *.cfg text eol=lf
 *.ini text eol=lf
 # Binary assets — never diff/merge
 *.pt binary
 *.pth binary
 *.ckpt binary
 *.pkl binary
 *.mid binary
 *.midi binary
 *.png binary
 *.jpg binary
 *.jpeg binary
 *.pdf binary
 *.zip binary
 *.gz binary
 *.tar binary
@@ -0,0 +1,61 @@
 # Python
 __pycache__/
 *.py[cod]
 *.pyo
 *.pyd
 .Python
 *.egg-info/
 dist/
 build/
 *.egg
 .eggs/
 # Virtual environments
 .venv/
 venv/
 env/
 ENV/
 # pytest
 .pytest_cache/
 .cache/
 htmlcov/
 .coverage
 coverage.xml
 # Jupyter
 .ipynb_checkpoints/
 *.ipynb_checkpoints
 # Model checkpoints (large binaries — commit only intentionally)
 checkpoints/*.pt
 checkpoints/*.pth
 checkpoints/*.ckpt
 # Processed data (reproducible from source)
 data/processed/*.pt
 data/processed/*.pkl
 # External corpora (download separately; too large for git)
 data/raw_external/
 # OS
 .DS_Store
 Thumbs.db
 desktop.ini
 # IDEs
 .idea/
 *.swp
 *.swo
 # Claude Code
 .claude/
 # Logs
 *.log
 logs/
 # Misc
 *.tmp
 *.bak
@@ -0,0 +1,172 @@
 # CLAUDE.md
 This file gives Claude Code persistent context for the project. Read it before any non-trivial task.
 ## Project overview
 **Goal.** Train a small autoregressive transformer to generate harmonic periods (4–16 bar chord progressions) in the author's compositional style. Coursework deliverable for an ML class at RTU MIREA; also intended as a working creative tool.
 **Unit of generation.** A single closed harmonic phrase (a "period"), not a full song.
 **Pipeline.**
 1. Hand-transcribe own compositions from REAPER DAW projects into `.chord` text files.
 2. Parse `.chord` → factorized token sequences.
 3. Pre-train on a public corpus (McGill Billboard or similar).
 4. Fine-tune on the author's own corpus.
 5. Sample new periods conditioned on mode / time / style / function / optional chord prefix.
 6. Detokenize back to `.chord` + export to MIDI for use in REAPER.
 **Hard deadline.** Less than one month, ~50 hours of work budget.
 ## Tech stack
 - **Python 3.11+**
 - **PyTorch** (no Lightning unless complexity demands it — keep training loops readable)
 - **music21** for chord symbol parsing (`music21.harmony.ChordSymbol`)
 - **pretty_midi** for MIDI generation
 - **pytest** for unit tests
 - **matplotlib** for plots in the report
 - **NumPy, pandas** as standard
 - Optional: **Google Colab** for training if local hardware is insufficient. Model is small enough that CPU is viable.
 Avoid heavy abstractions. This is coursework, not a production system. Prefer simple imperative scripts over framework-style code.
 ## Repository layout
 ```
 chord-gen/
 ├── CLAUDE.md                          ← this file
 ├── README.md
 ├── requirements.txt
 ├── docs/
 │   └── chord_format_spec.md           ← authoritative format specification
 ├── data/
 │   ├── raw_user/                      ← hand-transcribed .chord files (own corpus)
 │   ├── raw_external/                  ← public corpora (McGill Billboard etc.)
 │   ├── processed/                     ← tokenized .pt files ready for training
 │   └── holdout/                       ← held-out periods for evaluation
 ├── src/
 │   ├── __init__.py
 │   ├── tokenizer.py                   ← .chord ↔ token sequences
 │   ├── chord_parser.py                ← chord symbol → (root, qual, ext, bass)
 │   ├── midi_export.py                 ← .chord → MIDI for sanity check & user output
 │   ├── dataset.py                     ← PyTorch Dataset over tokenized files
 │   ├── model.py                       ← small transformer
 │   ├── train.py                       ← pre-train and fine-tune entry points
 │   ├── generate.py                    ← inference / sampling
 │   ├── evaluate.py                    ← perplexity + distribution metrics
 │   └── external_converters/
 │       └── mcgill_to_chord.py         ← convert McGill Billboard to .chord
 ├── tests/
 │   ├── test_chord_parser.py
 │   ├── test_tokenizer.py
 │   ├── test_midi_export.py
 │   └── fixtures/
 │       └── *.chord
 ├── notebooks/
 │   ├── 01_data_exploration.ipynb
 │   ├── 02_training.ipynb
 │   └── 03_evaluation.ipynb
 └── checkpoints/
    ├── pretrained.pt
    └── finetuned.pt
 ```
 ## The `.chord` format
 The authoritative specification is in `docs/chord_format_spec.md`. **Always read it before modifying anything that touches the format or the tokenizer.** Critical points summarized here for context only — if anything conflicts, the spec wins.
 - One file = one harmonic period (4–16 bars).
 - Header lines start with `#`, list `title`, `key`, `time`, `subdivision`, `style`, optional `function`.
 - Body: bars separated by `|`, exactly `subdivision` positions per bar (for 4/4), positions separated by single spaces.
 - A position holds: chord symbol, `.` (hold previous), `NC` (no chord), or `?` (unknown).
 - Chord symbols: `<root><quality?><extension?>(/<bass>)?`. 18 qualities, 7 extensions, slash inversions are mandatory and meaningful.
 - Tokenization: each new chord becomes exactly 4 tokens (`ROOT_x`, `QUAL_x`, `EXT_x`, `BASS_x`). Hold = `HOLD`. Bar end = `BAR`. Plus metadata tokens at the start.
 - **Keys are normalized.** Before tokenization, the entire period is transposed: majors → C major, minors → A minor. The model never sees absolute keys. The vocabulary contains `MODE_major`/`MODE_minor` but no `KEY_x` tokens.
 - Vocabulary size: ~81 tokens.
 ## Model
 A small autoregressive transformer:
 - Layers: 2–4
 - d_model: 128–256
 - Heads: 4–8
 - FFN dim: 4 × d_model
 - Context length: 512 tokens (more than enough for any single period)
 - Tied input/output embeddings
 - Standard causal mask, next-token prediction with cross-entropy
 - AdamW, cosine schedule, warmup ~5% of steps
 - Dropout 0.1–0.2
 Pre-training uses the full public corpus. Fine-tuning uses the own corpus with a **smaller learning rate** (e.g. 1e-5 vs 1e-4 for pre-training) and **few epochs** (5–15) to avoid catastrophic forgetting of harmonic regularities learned during pre-training.
 ## Inference
 - Top-p sampling (nucleus, p ≈ 0.9) with temperature ≈ 1.0 as defaults. Tunable.
 - No beam search — it generally hurts on generative tasks like this.
 - Generation is conditioned by feeding the BOS + metadata tokens explicitly, then optionally a chord prefix from the user.
 - After generation, transpose from C/Am to the user's requested key.
 - Output: both a `.chord` file and a MIDI file.
 ## Evaluation
 For the report:
 1. **Perplexity** on the holdout set, comparing pre-trained baseline vs fine-tuned.
 2. **Distribution shift plots** — histograms over chord qualities, extension presence, inversion frequency, root motion intervals — showing how fine-tuning moves the distribution toward the author's corpus.
 3. **Qualitative cherry-picked generations** — 3 examples with the same seed/prefix, generated by baseline vs fine-tuned, rendered to MIDI.
 No formal blind listening test (out of scope for the deadline).
 ## Working language
 - **Code, identifiers, code comments, log messages, commit messages: English.**
 - **User-facing output, the academic report, and the README user guide: Russian** (per university requirements; the report must comply with GOST).
 - **Conversations with the developer (the author): Russian.**
 When generating commit messages or code comments, write in English. When generating the report or any user-facing text, write in Russian.
 ## Code style & conventions
 - Type hints on all public functions.
 - Docstrings: one-line summary + Args/Returns. Keep them concise.
 - No `print()` in library code — use the `logging` module. CLI scripts may use `print`.
 - Constants in `UPPER_SNAKE_CASE` at module top.
 - Vocabulary, token IDs, and label maps live in `src/tokenizer.py` as module-level constants.
 - Random seeds: every training / generation script accepts a `--seed` flag and sets `torch.manual_seed`, `numpy.random.seed`, `random.seed`.
 - Reproducibility is more important than performance. If a choice is between "fast" and "deterministic", choose deterministic.
 ## Testing policy
 - Every parser/tokenizer/MIDI module has unit tests.
 - Tests use small `.chord` fixtures in `tests/fixtures/`.
 - Round-trip property: `tokenize(parse(file))` followed by `detokenize(...)` must reproduce the chord sequence (up to canonical normalization).
 - Don't bother unit-testing the training loop or the model. Test the data path.
 ## Things to never do
 - **Do not change the `.chord` format** without first updating `docs/chord_format_spec.md` and bumping its version number. The format is the contract between the human-readable data and the model; changing one side silently breaks everything.
 - **Do not modify files in `data/holdout/` or use them during training.** Holdout is held out.
 - **Do not add new model architectures "to compare"** unless explicitly asked. One model, done well, beats four half-done.
 - **Do not implement bells and whistles** (web UI, real-time audio synthesis, beam search, voicing models). They are explicitly out of scope.
 - **Do not silently round or coerce unrecognized chord symbols.** If a chord can't be parsed, raise an error with the file name, bar number, and position. Silent corruption of training data is the worst failure mode here.
 ## Things to always do
 - When asked to add a feature, first identify which module it belongs in (`src/...`) and whether it requires a spec change. State this before writing code.
 - When the user describes a bug, write a failing test first, then fix.
 - When dependencies change, update `requirements.txt`.
 - When adding a CLI script, include `--help` output and a usage example in the script's docstring.
 - When generating a long answer, ask whether a CLI flag or a config file is preferred for new parameters. Default to CLI flags for simplicity.
 ## Out-of-scope (explicit non-goals for this deliverable)
 - Melody generation
 - Voice leading / voicing inside chords above the bass
 - Rhythmic patterns inside a held chord
 - Arrangement, timbre, dynamics
 - Web interface / GUI
 - Real-time MIDI integration with REAPER
 - Modulation handling inside a single period
 - J-Pop fine-tuning experiment (future work after coursework deadline)
 If the user asks for any of these, remind them it's out of scope and ask whether to proceed anyway or defer.