hamori

Author	SHA1	Message	Date
H1K0	8a73394df9	data: update pretrained checkpoint results (BAR-free tokenizer). Re-run pre-training results with the corrected 84-token vocabulary and max_seq_len=320. Previous checkpoint was trained on stale data with BAR tokens and a corrupted tokenizer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 14:28:00 +03:00
H1K0	329952b02e	data: add pre-training results from Google Colab run. Includes log CSV (50 epochs), loss-curve plot, and report. Training ran on Colab GPU (T4). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 13:10:34 +03:00
H1K0	65c3f6bf7c	data: add 2-epoch smoke pre-training log (_smoke_pretrain.log.csv). Sanity run: McGill corpus, max_seq_len=256, batch_size=32, lr=3e-4, seed=42. Epoch 1: train=1.2603, val=0.6403, ppl=1.90. Epoch 2: train=0.5979, val=0.5809, ppl=1.80. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-20 12:16:28 +03:00
H1K0	8672c10f78	chore: initialize project scaffold. Add .gitignore (excludes .claude/, venv, checkpoints, processed data, external corpora), .gitattributes (LF normalization, binary markers), full directory tree with .gitkeep placeholders, and src __init__ stubs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 10:28:17 +03:00