commit 87307a6d9aa6a4bd4045e8a0ed01523a77a2113c
Author: Tomas
Date:   Tue Nov 4 18:51:05 2025 +0100

    Add README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..a791c97
--- /dev/null
+++ b/README.md
@@ -0,0 +1,133 @@
+# BooruDex
+
+BooruDex (short for "Booru Index") is a booru scraper that supports finding similar images.
+It can optionally download images and serve them locally.
+
+## Initial thoughts
+
+### Hub (BooruDex)
+
+The main server that has access to the database and file storage. It has no access to the internet,
+but workers can connect to the hub in order to receive tasks.
+
+The hub is responsible for the following tasks:
+- Image hashing
+- Media storage
+
+#### File Storage
+
+##### Media
+
+Media files will have UUIDs as their filenames with the MIME subtype as their extension.
+The first two pairs of hex digits of each UUID are split off into directories; the remaining digits (without hyphens) form the filename.
+
+For example, a JPEG image with UUID f81d4fae-7dec-11d0-a765-00a0c91e6bf6 would be stored as:
+`media/f8/1d/4fae7dec11d0a76500a0c91e6bf6.jpeg`
+
+##### Thumbnails
+
+Thumbnails are stored exactly like media files, except that they are always encoded as JPEG
+and are placed in a separate directory from media files.
+
+E.g.: `thumbnails/f8/1d/4fae7dec11d0a76500a0c91e6bf6.jpeg`
+
+#### Database
+
+##### Tasks
+
+A table containing tasks that the hub wants executed.
+
+- id - Task ID
+- domain - Booru domain of the task
+- type - Type of the task (scraping, download, etc.)
+- data - Task data (some URL, ID range, etc.)
+- pending - Is it pending? If so, since when?
+- assignee - Is it assigned? If so, to whom?
+
+##### Tags
+
+A table containing known tags and optionally their category. The combination of label and category must be unique.
+
+- id - Tag ID
+- label - Label of the tag
+- category - Optional tag category
+
+##### Boorus
+
+A table containing a list of boorus being handled by BooruDex.
+
+- id - Booru ID
+- domain - The domain of the booru
+- posts - The name of the table that contains booru posts
+- tags - The name of the table that contains tag relations
+- categories - The name of the table that contains tag categories
+- latest - Known latest post in the booru
+
+##### Booru_[id]_posts
+
+A table containing post data for its booru.
+
+- id - Post ID
+- image - Media ID (referencing media table)
+- thumb - Thumbnail ID (referencing thumb table)
+- update - Last time the post entry was updated/tagged
+
+##### Booru_[id]_tags
+
+A table containing tag relations for its booru.
+
+- tag - Tag ID (referencing tags table)
+- post - Post ID (referencing booru_[id]_posts table)
+
+##### Booru_[id]_categories
+
+A table containing tag categories as they are represented by the booru.
+
+- label - Tag label (unique)
+- category - Tag category
+
+##### Media
+
+A table containing data about media.
+
+- id - Media ID
+- uuid - Unique v4 UUID for referencing the actual media file
+- size - The size of the media file
+- width - Media width
+- height - Media height
+- mime - Media MIME type
+- dhash - Difference hash of the media
+- phash - Perceptual hash of the media
+
+##### Thumb
+
+A table containing data about thumbnails.
+
+- id - Thumbnail ID
+- uuid - Unique v4 UUID for referencing the actual thumbnail file
+- size - The size of the thumbnail
+- width - Thumbnail width
+- height - Thumbnail height
+- media - Media ID (referencing media table)
+
+##### Workers
+
+A table containing a list of known workers and their statistics.
+
+- id - Worker ID
+- uuid - Unique v4 UUID for referencing the actual worker
+- ip - Latest IP address the worker has connected with
+- seen - Latest date the worker has connected at
+- scraped - The number of posts the worker has scraped
+- thumbs - The number of thumbnails the worker has downloaded
+- media - The number of media files the worker has downloaded
+
+### Workers
+
+Workers request a number of tasks from the hub, providing supported types of tasks and supported
+booru domains.
+
+Current ideas for worker types:
+
+- Scraper - Scrapes a range of posts given their IDs and returns their tags/metadata, media URL, and optionally a thumbnail.
+- Downloader - Downloads media given their URLs and returns them with their MIME types.
\ No newline at end of file
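
The file storage layout described in the README above (UUID as filename, first two pairs of hex digits as directories, MIME subtype as extension) could be sketched roughly as follows. This is only an illustration, not the project's implementation: the function names `media_path` and `thumbnail_path` and the helper `_sharded` are hypothetical, and the sketch assumes the MIME type carries no parameters (such as `; charset=...`).

```python
import uuid
from pathlib import PurePosixPath


def _sharded(root: str, file_uuid: uuid.UUID, extension: str) -> PurePosixPath:
    """Hypothetical helper: build root/aa/bb/<remaining digits>.<extension>."""
    digits = file_uuid.hex  # 32 lowercase hex digits, hyphens removed
    return PurePosixPath(root, digits[:2], digits[2:4], f"{digits[4:]}.{extension}")


def media_path(file_uuid: uuid.UUID, mime: str) -> PurePosixPath:
    """Media files use the MIME subtype as their extension (assumes no parameters)."""
    subtype = mime.split("/", 1)[1]
    return _sharded("media", file_uuid, subtype)


def thumbnail_path(file_uuid: uuid.UUID) -> PurePosixPath:
    """Thumbnails use the same layout but are always JPEG, in their own directory."""
    return _sharded("thumbnails", file_uuid, "jpeg")


# The UUID from the README example.
example = uuid.UUID("f81d4fae-7dec-11d0-a765-00a0c91e6bf6")
print(media_path(example, "image/jpeg"))  # media/f8/1d/4fae7dec11d0a76500a0c91e6bf6.jpeg
print(thumbnail_path(example))            # thumbnails/f8/1d/4fae7dec11d0a76500a0c91e6bf6.jpeg
```

Splitting on the first two byte pairs keeps any single directory from accumulating too many entries; with random v4 UUIDs the files should spread roughly evenly across the 256 x 256 subdirectories.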