Commit Graph

7 Commits

Author SHA1 Message Date
H1K0 b30f4c188b chore: track checkpoints via Git LFS
Removed checkpoints/*.pt from .gitignore; files are now stored as LFS
objects (pretrained.pt 17 MB, finetuned.pt 17 MB). LFS attributes were
already in place in .gitattributes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 14:37:24 +03:00
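
Note on the LFS commit above: once checkpoints/*.pt is tracked via Git LFS, the repository itself holds small text pointer files rather than the 17 MB binaries. The sketch below shows one way to check whether a checked-out file is still a pointer (for example, when LFS objects were not pulled). The pointer layout follows the public Git LFS v1 pointer format; the function name `parse_lfs_pointer` and the sample oid and size are hypothetical and not taken from this repository.

```python
from typing import Optional

# First line of every Git LFS v1 pointer file.
LFS_SPEC_LINE = "version https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(text: str) -> Optional[dict]:
    """Return {'oid': ..., 'size': ...} if text looks like an LFS pointer, else None."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != LFS_SPEC_LINE:
        return None
    # Remaining lines are "key value" pairs such as "oid sha256:..." and "size 123".
    fields = dict(line.split(" ", 1) for line in lines[1:] if " " in line)
    oid = fields.get("oid", "")
    size = fields.get("size", "").strip()
    if not oid.startswith("sha256:") or not size.isdigit():
        return None
    return {"oid": oid[len("sha256:"):].strip(), "size": int(size)}


# Hypothetical pointer contents; the oid and size are placeholders.
sample_pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:" + "0" * 64 + "\n"
    "size 17000000\n"
)
pointer_info = parse_lfs_pointer(sample_pointer)
not_a_pointer = parse_lfs_pointer("PK\x03\x04 raw binary data")
```

If a checkpoint loader fails on a file of roughly a hundred bytes, a check like this can help distinguish an unfetched pointer from a corrupted checkpoint.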
H1K0 d9585ec008 data: add fine-tuning run results (lr=3e-5, 30 epochs)
val loss 1.19 → 0.77, val perplexity 3.29 → 2.15.
Best epoch 20, early stop at epoch 30 (patience=10).
Improvement over previous lr=1e-5 run (best val ppl 2.22).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 20:52:39 +03:00
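
Note on the lr=3e-5 run above: "best epoch 20, early stop at epoch 30 (patience=10)" is consistent with a patience counter that stops training once validation loss has gone 10 epochs without improving. The following is a minimal sketch of that rule, not the project's training code; the function name `early_stop_epoch` and the loss curve are synthetic and chosen only to reproduce the 20/30 pattern.

```python
from typing import List, Tuple


def early_stop_epoch(val_losses: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_epoch, last_epoch_run) for a patience-based early stop.

    Epochs are 1-indexed. Training stops at the first epoch where
    `patience` epochs have passed since the best validation loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Ran out of epochs without triggering the stop.
    return best_epoch, len(val_losses)


# Synthetic curve (an assumption, not logged data): improves until epoch 20,
# then plateaus at a slightly worse value.
synthetic_losses = [1.19 - 0.021 * e for e in range(1, 21)] + [0.80] * 15
best_epoch, stopped_at = early_stop_epoch(synthetic_losses, patience=10)
```

Under the same rule, a run whose best epoch is also its final epoch never triggers the stop, which would match the "no early stop" note on the lr=1e-5 run.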
H1K0 2e6e934564 data: add fine-tuning run results (lr=1e-5, 50 epochs)
val loss 1.24 → 0.80, val perplexity 3.47 → 2.22.
Best epoch 50 (no early stop); convergence epoch 30.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 20:17:25 +03:00
H1K0 8a73394df9 data: update pretrained checkpoint results (BAR-free tokenizer)
Re-run pre-training results with the corrected 84-token vocabulary and
max_seq_len=320. Previous checkpoint was trained on stale data with BAR
tokens and a corrupted tokenizer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 14:28:00 +03:00
H1K0 329952b02e data: add pre-training results from Google Colab run
Includes log CSV (50 epochs), loss-curve plot, and report.
Training ran on Colab GPU (T4).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 13:10:34 +03:00
H1K0 65c3f6bf7c data: add 2-epoch smoke pre-training log (_smoke_pretrain.log.csv)
Sanity run: McGill corpus, max_seq_len=256, batch_size=32, lr=3e-4, seed=42.
Epoch 1: train=1.2603 val=0.6403 ppl=1.90
Epoch 2: train=0.5979 val=0.5809 ppl=1.80

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 12:16:28 +03:00
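
Note on the loss and perplexity figures in these logs: they are consistent with perplexity being the exponential of the validation cross-entropy loss, assuming the loss is a mean per-token value in nats. This is a sketch of that relation, not the project's evaluation code, and the helper name `perplexity` is hypothetical. Logged values can differ from it in the last digit, for instance if perplexity was averaged differently during evaluation.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(mean_cross_entropy)


# Validation losses quoted in the smoke-run commit message above.
smoke_val_losses = {1: 0.6403, 2: 0.5809}
smoke_ppl = {epoch: perplexity(loss) for epoch, loss in smoke_val_losses.items()}

for epoch, ppl in smoke_ppl.items():
    print(f"epoch {epoch}: val={smoke_val_losses[epoch]:.4f} ppl={ppl:.2f}")
```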
H1K0 8672c10f78 chore: initialize project scaffold
Add .gitignore (excludes .claude/, venv, checkpoints, processed data,
external corpora), .gitattributes (LF normalization, binary markers),
full directory tree with .gitkeep placeholders, and src __init__ stubs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 10:28:17 +03:00