chore(scripts): add legacy data migration

One-time migration from the old Python/Flask Tanabata DB into the new
core/data/acl/activity schema.

- transform.sql: reads a `legacy` schema and writes the new one in a single,
  idempotent transaction. Remaps user/mime ids (uuid -> smallint by name),
  inverts is_private -> is_public, lifts EXIF out of files.metadata into the
  exif column, preserves pool hierarchy/created under metadata, synthesises
  file_pool ordering, derives acl object types, sanitises colors/notes.
- migrate.sh: links the new DB to the old one via postgres_fdw, imports the
  old public schema as `legacy`, runs the transform, tears the link down.
- README.md: mapping table, decisions/lossy points, and the separate
  physical-blob copy step.
- docs/reference/schema.sql: the old DB schema the migration is built from
  (referenced by the README).

Verified end-to-end on PostgreSQL 16 (synthetic legacy data, all
transformations and idempotency checked).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 12:42:06 +03:00
parent 437b66e73a
commit 76942721ad
4 changed files with 2701 additions and 0 deletions
@@ -0,0 +1,103 @@
 # Legacy data migration
 Moves data from the **old** Tanabata database (the Python/Flask version, schema
 in [`docs/reference/schema.sql`](../../docs/reference/schema.sql)) into the
 **new** `core` / `data` / `acl` / `activity` schema.
 - [`transform.sql`](transform.sql) — the actual data transformation. Reads a
  `legacy` schema (the old tables) and writes the new schema, in one
  transaction. Idempotent.
 - [`migrate.sh`](migrate.sh) — links the new DB to the old one via
  `postgres_fdw`, imports the old `public` schema as `legacy`, runs
  `transform.sql`, then removes the link. The old DB is only **read**.
 Tested end-to-end against PostgreSQL 16 (schema applied, synthetic legacy data,
 all transformations + idempotency verified).
 ## Prerequisites
 1. The **new** schema exists and is seeded: start the app once (it runs the
   goose migrations, including `007_seed_data`), or run goose manually.
 2. `NEW_DSN` connects as a role allowed to `CREATE EXTENSION postgres_fdw`
   (a superuser — the compose Postgres' `POSTGRES_USER` is one).
 3. The new Postgres server can reach the old DB host over the network.
 4. `psql` on PATH.
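
The prerequisites that can be checked locally (required variables set, `psql` on PATH) could be verified before anything touches a database. The following is a minimal sketch, not part of this commit: the function names `missing_vars` and `preflight` are hypothetical, and the check cannot tell whether the schema is seeded or whether the old host is reachable.

```bash
# Hypothetical pre-flight helpers (not part of migrate.sh).

# Print the name of every listed variable that is unset or empty.
missing_vars() {
    local name value
    for name in "$@"; do
        # Indirect lookup of the variable called $name. The names are fixed
        # by the caller, so eval never sees untrusted input.
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

# Return non-zero (with messages on stderr) if a local prerequisite is missing.
preflight() {
    local missing
    missing="$(missing_vars NEW_DSN OLD_HOST OLD_DB OLD_USER OLD_PASSWORD)"
    if [ -n "$missing" ]; then
        # Unquoted on purpose: one message line per missing variable name.
        printf 'unset or empty: %s\n' $missing >&2
        return 1
    fi
    command -v psql >/dev/null 2>&1 || { echo "psql not found on PATH" >&2; return 1; }
}
```

With these sourced into the shell, `preflight && ./migrate.sh` would stop early instead of failing halfway through the link setup. `OLD_PORT` is left out because `migrate.sh` defaults it to 5432.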
 ## Run
 ```bash
 cd scripts/migrate-legacy
 NEW_DSN='postgres://tanabata:PASS@localhost:42777/tanabata' \
 OLD_HOST=192.168.1.10 OLD_PORT=5432 OLD_DB=tfm \
 OLD_USER=hiko OLD_PASSWORD=SECRET \
 ./migrate.sh
 ```
 It prints the source (legacy) row counts, then the resulting new-schema counts.
 Re-running is safe — `ON CONFLICT DO NOTHING` everywhere means a second run only
 fills in what is missing.
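
One way to confirm that a re-run changed nothing is to capture the output of two runs and compare the final count tables. This is a sketch under assumptions: both runs were saved to files (for example `./migrate.sh > run1.log`), and the helper names `new_counts` and `same_counts` are hypothetical. The marker line is the text that `transform.sql` prints before its summary.

```bash
# Hypothetical helpers for comparing two captured migrate.sh runs.

# Print the log from the "new row counts" marker to the end of the file.
new_counts() {
    sed -n '/^Migration committed\. New row counts:/,$p' "$1"
}

# Succeed only if both logs contain the marker and report identical counts.
same_counts() {
    local a b
    a="$(new_counts "$1")"
    b="$(new_counts "$2")"
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

`same_counts run1.log run2.log && echo "re-run was a no-op"` would then report the result. Equal counts are a necessary sign of idempotency rather than proof that every row is unchanged.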
 ### Without postgres_fdw
 `transform.sql` only needs the old tables to be visible as a `legacy` schema. If
 you'd rather not use fdw, load the old dump into a schema named `legacy` in the
 new database by whatever means, then run just the transform:
 ```bash
 psql "$NEW_DSN" -v ON_ERROR_STOP=1 -f transform.sql
 ```
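
One possible way to get the old tables into a `legacy` schema without fdw is to go through a throwaway database. The sketch below is not part of this commit and rests on several assumptions: the old dump is a plain-SQL `pg_dump` file, the old tables live in `public`, the client tools reach the target server through the usual `PG*` environment variables, and the scratch database name is free. The function name `load_legacy_from_dump` is hypothetical. If the old schema depends on extensions, the copy step may need adjusting.

```bash
# Hypothetical helper: load an old plain-SQL dump as schema `legacy`.
load_legacy_from_dump() {
    local dump_file new_dsn scratch status
    dump_file="$1"                            # path to the old plain-SQL dump
    new_dsn="$2"                              # connection string of the NEW database
    scratch="${3:-tanabata_legacy_scratch}"   # throwaway database name (assumed free)

    createdb "$scratch" || return 1

    # 1. Restore the old dump into the scratch database (tables land in `public`).
    # 2. Rename the schema so every old table becomes legacy.<table>.
    # 3. Copy only that schema into the new database.
    # Note: without `set -o pipefail`, a pg_dump failure in step 3 is only
    # visible through psql's own errors.
    psql -d "$scratch" -v ON_ERROR_STOP=1 -q -f "$dump_file" &&
    psql -d "$scratch" -v ON_ERROR_STOP=1 -q -c 'ALTER SCHEMA public RENAME TO legacy' &&
    pg_dump --schema=legacy --no-owner --no-privileges "$scratch" |
        psql "$new_dsn" -v ON_ERROR_STOP=1 -q
    status=$?

    dropdb "$scratch"                         # always remove the scratch database
    return "$status"
}
```

After `load_legacy_from_dump old.sql "$NEW_DSN"`, the transform can be run on its own. Unlike `migrate.sh`, nothing here removes the `legacy` schema afterwards, so a manual `DROP SCHEMA legacy CASCADE` would be needed once the migration is verified.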
 ## What gets migrated, and how
 | Old (`public`)        | New                     | Notes |
 |-----------------------|-------------------------|-------|
 | `users`               | `core.users`            | id **uuid → smallint** (remapped by unique `name`); `can_edit` → `can_create`; `is_blocked` = false |
 | `mime`                | `core.mime_types`       | id **uuid → smallint** (remapped by `name`); types not already seeded are added |
 | `categories`          | `data.categories`       | id kept; `is_private` → **`is_public`** (inverted) |
 | `tags`                | `data.tags`             | id + `category_id` kept; inverted privacy |
 | `autotags`            | `data.tag_rules`        | `parent_id` → `when_tag_id`, `child_id` → `then_tag_id` |
 | `files`               | `data.files`            | id kept; `datetime` → `content_datetime`; `orig_name` → `original_name`; **EXIF** lifted from `metadata->'exif'` into the `exif` column, the rest stays as user `metadata` |
 | `file_tag`            | `data.file_tag`         | orphan rows skipped |
 | `pools`               | `data.pools`            | id kept; `parent_id` + `created` preserved under `metadata` (see below) |
 | `file_pool`           | `data.file_pool`        | `position` synthesised (gapped 1000s, ordered by file id) |
 | `acl`                 | `acl.permissions`       | object type **derived** by locating the object; `read`/`write` → `can_view`/`can_edit` |
 | `file_views`          | `activity.file_views`   | `datetime` → `viewed_at` |
 Throughout: empty `notes` (`''`) → `NULL`; colours that are not exactly six
 hexadecimal digits are set to `NULL` (the old `CHECK` was `NOT VALID`, so bad
 values could exist).
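
To preview which legacy colour values would survive, the colour rule can be mirrored outside the database. This is only an illustrative sketch: the helper name `sanitise_color` is hypothetical, and an empty result stands in for the `NULL` that the transform stores.

```bash
# Hypothetical helper mirroring the colour rule from transform.sql.
# Echo the colour if it is exactly six hex digits; echo nothing otherwise.
sanitise_color() {
    case "$1" in
        *[!0-9A-Fa-f]*) return 0 ;;   # contains a non-hex character: drop it
    esac
    [ "${#1}" -eq 6 ] && printf '%s\n' "$1"
    return 0
}
```

For example, `sanitise_color 1a2B3c` prints `1a2B3c`, while `sanitise_color '#1a2b3c'` and `sanitise_color 12345` print nothing.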
 ### Decisions / lossy points
 - **Passwords** are copied verbatim. If the old hashes are bcrypt (as the new
  app expects) logins keep working; otherwise affected users need a reset.
 - **`created` timestamps** on categories/tags/files are dropped because their
  UUIDv7 ids already encode creation time. Pools use random v4 ids, so their
  `created` timestamp is preserved under `data.pools.metadata` as
  `legacy_created`. The new schema also has no **pool hierarchy**, so the old
  `parent_id` is kept alongside it as `legacy_parent_id`.
 - **`file_pool` ordering**: the old schema stored none, so position is generated
  from file-id order (≈ chronological) with gaps of 1000.
 - **Not migrated**: `sessions` / `user_agents` — the new app uses JWTs, so users
  simply log in again. There were no audit-log / pool-view / tag-use tables in
  the old schema, so those start empty. `phash` and `is_deleted` are new
  (`NULL` / `false`).
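
Whether any users will need a password reset can be estimated by checking the shape of the copied hashes. The sketch below assumes the common bcrypt string form (a `$2a$`, `$2b$`, `$2x$` or `$2y$` prefix, a two-digit cost, then 53 characters of salt and digest). Which prefixes the new app actually accepts is not stated here, and the helper names are hypothetical.

```bash
# Hypothetical helpers for spotting password hashes that are not bcrypt.

# Succeed if the argument has the usual 60-character bcrypt shape.
looks_like_bcrypt() {
    printf '%s\n' "$1" |
        grep -Eq '^\$2[abxy]?\$[0-9]{2}\$[./A-Za-z0-9]{53}$'
}

# Read "name<TAB>hash" lines on stdin; print the names whose hash is not bcrypt.
non_bcrypt_users() {
    local name hash
    while IFS="$(printf '\t')" read -r name hash; do
        looks_like_bcrypt "$hash" || printf '%s\n' "$name"
    done
}
```

A possible invocation after the migration, using psql's unaligned tuples-only output with a tab separator:
`psql "$NEW_DSN" -At -F "$(printf '\t')" -c 'SELECT name, password FROM core.users' | non_bcrypt_users`.
Every name it prints is a candidate for a reset.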
 ## Physical files (separate, manual)
 The script migrates the **database only**. File blobs must be copied too. The
 new layout stores originals at `FILES_PATH/{uuid}` with **no extension**;
 thumbnails/previews are regenerated on demand, so don't copy those. Because ids
 are preserved, the old `{uuid}.{ext}` files map 1:1 — just strip the extension:
 ```bash
 OLD_FILES=/srv/old-tanabata/files     # old originals ({uuid}.{ext})
 NEW_FILES=/var/lib/tanabata/files     # new FILES_PATH
 for src in "$OLD_FILES"/*; do
  [ -f "$src" ] || continue                 # skip sub-directories and an unmatched glob
  id="$(basename "$src")"; id="${id%%.*}"   # strip from the first dot; uuids contain no dots
  cp -n "$src" "$NEW_FILES/$id"
 done
 # Hand ownership to the container user (uid/gid 42776) so the app can read them:
 chown -R 42776:42776 "$NEW_FILES"
 ```
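
Before copying anything, it may be worth listing what the loop would do and catching stray files whose names are not uuids. The following dry-run sketch is not part of this commit: the helper names are hypothetical, and it assumes the old blobs use the canonical 8-4-4-4-12 hex uuid form.

```bash
# Hypothetical dry-run helpers for the blob copy step.

# Print the target name (the bare uuid) for an old blob path, or fail if the
# file name does not start with a canonical 8-4-4-4-12 hex uuid.
blob_target_name() {
    local base id
    base="$(basename "$1")"
    id="${base%%.*}"                      # drop everything from the first dot
    printf '%s\n' "$id" | grep -Eq \
        '^[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}$' || return 1
    printf '%s\n' "$id"
}

# List what a copy would do for every regular file in directory $1.
plan_blob_copy() {
    local src id
    for src in "$1"/*; do
        [ -f "$src" ] || continue
        if id="$(blob_target_name "$src")"; then
            printf 'copy %s -> %s\n' "$src" "$id"
        else
            printf 'skip %s (not a uuid file name)\n' "$src"
        fi
    done
}
```

Running `plan_blob_copy "$OLD_FILES"` prints one `copy` or `skip` line per file without touching the destination. Only names that pass the check map 1:1 onto `FILES_PATH/{uuid}`.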
@@ -0,0 +1,92 @@
 #!/usr/bin/env bash
 # =============================================================================
 # Tanabata legacy -> new schema migration (orchestrator)
 #
 # Connects the NEW database to the OLD one via postgres_fdw, imports the old
 # `public` schema as `legacy`, runs transform.sql (the actual data move, in one
 # transaction), then tears the foreign link down again. The OLD database is
 # only read.
 #
 # Prerequisites:
 #   - The NEW schema already exists and is seeded (start the app once, or run
 #     goose, so all migrations incl. 007_seed_data have applied).
 #   - NEW_DSN connects as a role allowed to CREATE EXTENSION postgres_fdw
 #     (a superuser; the compose Postgres' POSTGRES_USER is one).
 #   - The NEW Postgres server can reach OLD_HOST:OLD_PORT over the network.
 #   - `psql` is on PATH.
 #
 # Usage:
 #   NEW_DSN='postgres://tanabata:pass@localhost:42777/tanabata' \
 #   OLD_HOST=192.168.1.10 OLD_DB=tfm OLD_USER=hiko OLD_PASSWORD=secret \
 #   ./migrate.sh
 # =============================================================================
 set -euo pipefail
 # --- Config from the environment --------------------------------------------
 NEW_DSN="${NEW_DSN:?set NEW_DSN to the new database connection string}"
 OLD_HOST="${OLD_HOST:?set OLD_HOST}"
 OLD_PORT="${OLD_PORT:-5432}"
 OLD_DB="${OLD_DB:?set OLD_DB (old database name)}"
 OLD_USER="${OLD_USER:?set OLD_USER}"
 OLD_PASSWORD="${OLD_PASSWORD:?set OLD_PASSWORD}"
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 TRANSFORM_SQL="$SCRIPT_DIR/transform.sql"
 psql_new() { psql "$NEW_DSN" -v ON_ERROR_STOP=1 "$@"; }
 # --- Always remove the foreign link on exit, success or failure -------------
 teardown() {
    psql "$NEW_DSN" -q >/dev/null 2>&1 <<'SQL' || true
 DROP SCHEMA IF EXISTS legacy CASCADE;
 DROP SERVER IF EXISTS legacy_src CASCADE;
 SQL
 }
 trap teardown EXIT
 echo ">> Linking NEW database to OLD ($OLD_USER@$OLD_HOST:$OLD_PORT/$OLD_DB) via postgres_fdw ..."
 psql_new \
    -v old_host="$OLD_HOST" \
    -v old_port="$OLD_PORT" \
    -v old_db="$OLD_DB" \
    -v old_user="$OLD_USER" \
    -v old_pw="$OLD_PASSWORD" <<'SQL'
 CREATE EXTENSION IF NOT EXISTS postgres_fdw;
 -- Start clean in case a previous run was interrupted.
 DROP SCHEMA IF EXISTS legacy CASCADE;
 DROP SERVER IF EXISTS legacy_src CASCADE;
 CREATE SERVER legacy_src FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host :'old_host', port :'old_port', dbname :'old_db');
 -- :'old_user' / :'old_pw' are quoted+escaped by psql, so passwords with
 -- special characters are safe.
 CREATE USER MAPPING FOR CURRENT_USER SERVER legacy_src
    OPTIONS (user :'old_user', password :'old_pw');
 CREATE SCHEMA legacy;
 IMPORT FOREIGN SCHEMA public LIMIT TO (
    users, mime, categories, tags, autotags, files, file_tag, pools, file_pool, acl, file_views
 ) FROM SERVER legacy_src INTO legacy;
 SQL
 echo ">> Source (legacy) row counts:"
 psql_new -P pager=off -c "
 SELECT 'users' AS table, count(*) FROM legacy.users
 UNION ALL SELECT 'mime',       count(*) FROM legacy.mime
 UNION ALL SELECT 'categories', count(*) FROM legacy.categories
 UNION ALL SELECT 'tags',       count(*) FROM legacy.tags
 UNION ALL SELECT 'autotags',   count(*) FROM legacy.autotags
 UNION ALL SELECT 'files',      count(*) FROM legacy.files
 UNION ALL SELECT 'file_tag',   count(*) FROM legacy.file_tag
 UNION ALL SELECT 'pools',      count(*) FROM legacy.pools
 UNION ALL SELECT 'file_pool',  count(*) FROM legacy.file_pool
 UNION ALL SELECT 'acl',        count(*) FROM legacy.acl
 UNION ALL SELECT 'file_views', count(*) FROM legacy.file_views
 ORDER BY 1;"
 echo ">> Running transform (single transaction) ..."
 psql_new -P pager=off -f "$TRANSFORM_SQL"
 echo ">> Done. The foreign link will be removed now."
@@ -0,0 +1,220 @@
 -- =============================================================================
 -- Tanabata legacy -> new schema data migration (transform step)
 --
 -- Reads the OLD database (exposed as the `legacy` schema — see migrate.sh, which
 -- imports it via postgres_fdw) and inserts the transformed rows into the new
 -- core / data / acl / activity schemas.
 --
 -- Assumes the new schema already exists (goose migrations applied) and is seeded
 -- (core.mime_types, core.object_types from 007_seed_data.sql).
 --
 -- Idempotent: ON CONFLICT DO NOTHING everywhere + preserved UUID PKs, so a
 -- re-run inserts only what is missing. Runs as one transaction — all or nothing.
 --
 -- Run with:  psql "<new-dsn>" -v ON_ERROR_STOP=1 -f transform.sql
 -- (migrate.sh does this for you after setting up the `legacy` schema.)
 -- =============================================================================
 \set ON_ERROR_STOP on
 -- Fail early and clearly if the legacy data hasn't been made available.
 DO $$
 BEGIN
    IF to_regclass('legacy.users') IS NULL THEN
        RAISE EXCEPTION
            'legacy.* tables not found. Populate the "legacy" schema first '
            '(run migrate.sh, or load the old dump into a schema named legacy).';
    END IF;
 END $$;
 BEGIN;
 -- ---------------------------------------------------------------------------
 -- 1. Users.  Old PK is uuid; the new table uses a smallint identity. Insert by
 --    the unique `name`, then build a uuid -> smallint map used by every FK below.
 --    Old `can_edit` becomes the new `can_create`; nobody is blocked on import.
 -- ---------------------------------------------------------------------------
 INSERT INTO core.users (name, password, is_admin, can_create, is_blocked)
 SELECT name, password, is_admin, can_edit, false
 FROM legacy.users
 ON CONFLICT (name) DO NOTHING;
 CREATE TEMP TABLE user_id_map ON COMMIT DROP AS
 SELECT lu.id AS old_id, nu.id AS new_id
 FROM legacy.users lu
 JOIN core.users nu ON nu.name = lu.name;
 -- ---------------------------------------------------------------------------
 -- 2. MIME types.  Same uuid -> smallint remap, keyed by the MIME name. The new
 --    DB is pre-seeded with the common types; add any legacy ones not seeded.
 -- ---------------------------------------------------------------------------
 INSERT INTO core.mime_types (name, extension)
 SELECT name, extension
 FROM legacy.mime
 ON CONFLICT (name) DO NOTHING;
 CREATE TEMP TABLE mime_id_map ON COMMIT DROP AS
 SELECT lm.id AS old_id, nm.id AS new_id
 FROM legacy.mime lm
 JOIN core.mime_types nm ON nm.name = lm.name;
 -- ---------------------------------------------------------------------------
 -- 3. Categories.  UUID PK preserved. is_private -> is_public (inverted),
 --    '' notes -> NULL, non-hex colors -> NULL (to satisfy the hex CHECK that the
 --    old NOT VALID constraint may not have enforced on existing rows).
 -- ---------------------------------------------------------------------------
 INSERT INTO data.categories (id, name, notes, color, metadata, creator_id, is_public)
 SELECT c.id,
       c.name,
       NULLIF(c.notes, ''),
       CASE WHEN c.color ~ '^[A-Fa-f0-9]{6}$' THEN c.color END,
       NULL,
       um.new_id,
       NOT c.is_private
 FROM legacy.categories c
 JOIN user_id_map um ON um.old_id = c.creator_id
 ON CONFLICT (id) DO NOTHING;
 -- ---------------------------------------------------------------------------
 -- 4. Tags.  UUID PK + category_id preserved.
 -- ---------------------------------------------------------------------------
 INSERT INTO data.tags (id, name, notes, color, category_id, metadata, creator_id, is_public)
 SELECT t.id,
       t.name,
       NULLIF(t.notes, ''),
       CASE WHEN t.color ~ '^[A-Fa-f0-9]{6}$' THEN t.color END,
       t.category_id,
       NULL,
       um.new_id,
       NOT t.is_private
 FROM legacy.tags t
 JOIN user_id_map um ON um.old_id = t.creator_id
 ON CONFLICT (id) DO NOTHING;
 -- ---------------------------------------------------------------------------
 -- 5. Tag rules (old `autotags`): parent -> when_tag, child -> then_tag.
 --    Skip rules whose tags didn't migrate.
 -- ---------------------------------------------------------------------------
 INSERT INTO data.tag_rules (when_tag_id, then_tag_id, is_active)
 SELECT a.parent_id, a.child_id, a.is_active
 FROM legacy.autotags a
 WHERE EXISTS (SELECT 1 FROM data.tags t WHERE t.id = a.parent_id)
  AND EXISTS (SELECT 1 FROM data.tags t WHERE t.id = a.child_id)
 ON CONFLICT (when_tag_id, then_tag_id) DO NOTHING;
 -- ---------------------------------------------------------------------------
 -- 6. Files.  UUID PK preserved. old `datetime` -> content_datetime,
 --    `orig_name` -> original_name. EXIF is lifted out of the old metadata blob
 --    into its own column; whatever else was in metadata stays as user metadata
 --    (NULL if nothing remains). No phash / soft-delete existed before.
 -- ---------------------------------------------------------------------------
 INSERT INTO data.files (id, original_name, mime_id, content_datetime, notes,
                        metadata, exif, phash, creator_id, is_public, is_deleted)
 SELECT f.id,
       f.orig_name,
       mm.new_id,
       f.datetime,
       NULLIF(f.notes, ''),
       NULLIF(f.metadata - 'exif', '{}'::jsonb),
       f.metadata -> 'exif',
       NULL,
       um.new_id,
       NOT f.is_private,
       false
 FROM legacy.files f
 JOIN user_id_map um ON um.old_id = f.creator_id
 JOIN mime_id_map mm ON mm.old_id = f.mime_id
 ON CONFLICT (id) DO NOTHING;
 -- ---------------------------------------------------------------------------
 -- 7. File <-> tag.  Skip orphan junction rows.
 -- ---------------------------------------------------------------------------
 INSERT INTO data.file_tag (file_id, tag_id)
 SELECT ft.file_id, ft.tag_id
 FROM legacy.file_tag ft
 WHERE EXISTS (SELECT 1 FROM data.files f WHERE f.id = ft.file_id)
  AND EXISTS (SELECT 1 FROM data.tags  t WHERE t.id = ft.tag_id)
 ON CONFLICT DO NOTHING;
 -- ---------------------------------------------------------------------------
 -- 8. Pools.  UUID PK preserved. The new schema has neither pool hierarchy nor a
 --    `created` column, so the legacy parent_id and created timestamp are kept
 --    under metadata (pool ids are random v4, so created isn't otherwise
 --    recoverable). is_private -> is_public.
 -- ---------------------------------------------------------------------------
 INSERT INTO data.pools (id, name, notes, metadata, creator_id, is_public)
 SELECT p.id,
       p.name,
       NULLIF(p.notes, ''),
       jsonb_strip_nulls(jsonb_build_object(
           'legacy_parent_id', p.parent_id,
           'legacy_created',   p.created)),
       um.new_id,
       NOT p.is_private
 FROM legacy.pools p
 JOIN user_id_map um ON um.old_id = p.creator_id
 ON CONFLICT (id) DO NOTHING;
 -- ---------------------------------------------------------------------------
 -- 9. File <-> pool.  The old table has no ordering column; synthesise a stable
 --    gapped position per pool, ordered by file id (UUID v7 ≈ chronological), so
 --    the app's gap-based reordering keeps working.
 -- ---------------------------------------------------------------------------
 INSERT INTO data.file_pool (file_id, pool_id, position)
 SELECT fp.file_id,
       fp.pool_id,
       (row_number() OVER (PARTITION BY fp.pool_id ORDER BY fp.file_id))::int * 1000
 FROM legacy.file_pool fp
 WHERE EXISTS (SELECT 1 FROM data.files f WHERE f.id = fp.file_id)
  AND EXISTS (SELECT 1 FROM data.pools p WHERE p.id = fp.pool_id)
 ON CONFLICT DO NOTHING;
 -- ---------------------------------------------------------------------------
 -- 10. ACL.  The old table stored no object type; derive it by locating the
 --     object among files/tags/categories/pools. read/write -> can_view/can_edit.
 --     Rows whose object no longer exists are skipped.
 -- ---------------------------------------------------------------------------
 INSERT INTO acl.permissions (user_id, object_type_id, object_id, can_view, can_edit)
 SELECT um.new_id, ot.id, a.object_id, a.read, a.write
 FROM legacy.acl a
 JOIN user_id_map um ON um.old_id = a.user_id
 JOIN LATERAL (
    SELECT CASE
        WHEN EXISTS (SELECT 1 FROM data.files      f WHERE f.id = a.object_id) THEN 'file'
        WHEN EXISTS (SELECT 1 FROM data.tags       t WHERE t.id = a.object_id) THEN 'tag'
        WHEN EXISTS (SELECT 1 FROM data.categories c WHERE c.id = a.object_id) THEN 'category'
        WHEN EXISTS (SELECT 1 FROM data.pools      p WHERE p.id = a.object_id) THEN 'pool'
    END AS type_name
 ) k ON true
 JOIN core.object_types ot ON ot.name = k.type_name
 ON CONFLICT (user_id, object_type_id, object_id) DO NOTHING;
 -- ---------------------------------------------------------------------------
 -- 11. File view history.  old `datetime` -> viewed_at.
 -- ---------------------------------------------------------------------------
 INSERT INTO activity.file_views (file_id, user_id, viewed_at)
 SELECT fv.file_id, um.new_id, fv.datetime
 FROM legacy.file_views fv
 JOIN user_id_map um ON um.old_id = fv.user_id
 WHERE EXISTS (SELECT 1 FROM data.files f WHERE f.id = fv.file_id)
 ON CONFLICT DO NOTHING;
 COMMIT;
 -- ---------------------------------------------------------------------------
 -- Summary of what now lives in the new schema.
 -- ---------------------------------------------------------------------------
 \echo ''
 \echo 'Migration committed. New row counts:'
 SELECT 'core.users'            AS table, count(*) FROM core.users
 UNION ALL SELECT 'core.mime_types',        count(*) FROM core.mime_types
 UNION ALL SELECT 'data.categories',        count(*) FROM data.categories
 UNION ALL SELECT 'data.tags',              count(*) FROM data.tags
 UNION ALL SELECT 'data.tag_rules',         count(*) FROM data.tag_rules
 UNION ALL SELECT 'data.files',             count(*) FROM data.files
 UNION ALL SELECT 'data.file_tag',          count(*) FROM data.file_tag
 UNION ALL SELECT 'data.pools',             count(*) FROM data.pools
 UNION ALL SELECT 'data.file_pool',         count(*) FROM data.file_pool
 UNION ALL SELECT 'acl.permissions',        count(*) FROM acl.permissions
 UNION ALL SELECT 'activity.file_views',    count(*) FROM activity.file_views
 ORDER BY 1;