feat(backend): duplicate pairs, dismissals, and merge resolution

Adds the duplicate-detection backend on top of perceptual hashing:

- Two tables (edited into the original migrations): data.duplicate_pairs holds
  precomputed near-duplicate candidates (rebuilt wholesale by the rescan), and
  data.duplicate_dismissals is a global "not a duplicate" overlay that survives
  rescans. New audit actions file_merge / duplicate_dismiss.
- DuplicateService:
  - Rescan builds every pair within DUPLICATE_HASH_THRESHOLD via a BK-tree over
    the perceptual hashes and replaces the pairs table. This is the only thing
    that populates pairs, so GET never compares all-vs-all (scales to 110k+).
  - Clusters reads the precomputed pairs (ACL-filtered, non-trashed, non-
    dismissed), groups them into connected components via union-find, and
    paginates whole clusters.
  - Resolve merges a pair field-by-field: each scalar from keep or discard,
    metadata keep/discard/shallow-merge, tags/pools keep or union; then trashes
    the discarded file. Enforces edit ACL on both.
  - Dismiss records a canonical pair (view ACL on both).
- Endpoints under /files: GET /files/duplicates, POST /files/duplicates/dismiss,
  POST /files/duplicates/resolve (registered before /:id to avoid collision).
  Plain delete reuses /files/bulk/delete.
- Repo support: ListMissingPHash, ListAllPHashes, CopyPoolMemberships, plus the
  DuplicatePairRepo (ReplaceAll via COPY, ListVisible) and DismissalRepo.

Unit tests cover the BK-tree pairing, union-find clustering, metadata merge and
field validation; an integration test covers rescan -> list -> merge -> dismiss
(including that a dismissal survives a re-rescan).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-16 12:42:37 +03:00
parent 88849cc16b
commit 9216a8687f
15 changed files with 1214 additions and 4 deletions
+6 -1
View File
@@ -26,6 +26,7 @@ func NewRouter(
auth *AuthMiddleware,
authHandler *AuthHandler,
fileHandler *FileHandler,
duplicateHandler *DuplicateHandler,
tagHandler *TagHandler,
categoryHandler *CategoryHandler,
poolHandler *PoolHandler,
@@ -80,7 +81,11 @@ func NewRouter(
files.GET("", fileHandler.List)
files.POST("", fileHandler.Upload)
// Bulk + import routes registered before /:id to prevent param collision.
// Bulk + import + duplicates routes registered before /:id to prevent
// param collision (e.g. "duplicates" being captured as :id).
files.GET("/duplicates", duplicateHandler.List)
files.POST("/duplicates/dismiss", duplicateHandler.Dismiss)
files.POST("/duplicates/resolve", duplicateHandler.Resolve)
files.POST("/bulk/tags", fileHandler.BulkSetTags)
files.POST("/bulk/delete", fileHandler.BulkDelete)
files.POST("/bulk/review", fileHandler.BulkReview)