BooruDex
BooruDex, short for "Booru Index", is a booru scraper that supports finding similar images. It can optionally download images and serve them locally.
Initial thought
Hub (BooruDex)
The main server that has access to the database and file storage. It has no access to the internet, but workers can connect to the hub in order to receive tasks.
The hub is responsible for the following tasks:
- Image hashing
- Media storage
File Storage
Media
Media files use their UUID as the filename and their MIME subtype as the extension. The first two pairs of hex digits of the UUID are split off as nested directories, and the remaining digits (with hyphens removed) form the filename.
For example, a JPEG image with UUID f81d4fae-7dec-11d0-a765-00a0c91e6bf6 would be stored as:
media/f8/1d/4fae7dec11d0a76500a0c91e6bf6.jpeg
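The layout above can be sketched as a small helper. This is a minimal illustration rather than BooruDex's actual code: the function name `media_path`, the `root` parameter, and the handling of MIME parameters are assumptions.

```python
import uuid
from pathlib import PurePosixPath


def media_path(media_uuid: uuid.UUID, mime: str, root: str = "media") -> PurePosixPath:
    """Build the storage path for a file from its UUID and MIME type (hypothetical helper)."""
    hex_digits = media_uuid.hex  # 32 hex characters, hyphens already removed
    # Drop any MIME parameters (e.g. "; charset=...") and keep only the subtype.
    subtype = mime.split(";", 1)[0].strip().split("/", 1)[1]
    return PurePosixPath(
        root,
        hex_digits[0:2],                    # first directory level
        hex_digits[2:4],                    # second directory level
        f"{hex_digits[4:]}.{subtype}",      # remaining digits plus extension
    )


example = media_path(uuid.UUID("f81d4fae-7dec-11d0-a765-00a0c91e6bf6"), "image/jpeg")
print(example)
```

Calling the same helper with `root="thumbnails"` and `"image/jpeg"` would give the corresponding thumbnail location. Subtypes such as `svg+xml` would be used verbatim as the extension in this sketch, which may or may not be what is wanted.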
Thumbnails
Thumbnails are stored exactly like media files, except that thumbnails are always encoded as JPEG and are placed in a separate directory from media files.
E.g.: thumbnails/f8/1d/4fae7dec11d0a76500a0c91e6bf6.jpeg
Database
Tasks
A table containing tasks that the hub wants executed.
- id - Task ID
- domain - Booru domain of the task
- type - Type of the task (scraping, download, etc.)
- data - Task data (some URL, ID range, etc.)
- pending - Is it pending? If so, since when?
- assignee - Is it assigned? If so, to whom?
Tags
A table containing known tags and optionally their category. Combination of label and category must be unique.
- id - Tag ID
- label - Label on the tag
- category - Optional tag category
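The uniqueness rule on label and category could be expressed as a table constraint. The sketch below assumes SQLite (the document does not name a database engine), and the helper `add_tag` is hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the hub's real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tags (
        id       INTEGER PRIMARY KEY,
        label    TEXT NOT NULL,
        category TEXT,               -- optional, so NULL is allowed
        UNIQUE (label, category)     -- the pair must be unique
    )
    """
)


def add_tag(conn: sqlite3.Connection, label: str, category=None) -> bool:
    """Insert a tag; return False if the (label, category) pair already exists."""
    try:
        conn.execute(
            "INSERT INTO tags (label, category) VALUES (?, ?)", (label, category)
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

One thing to watch for with this design: SQLite treats NULL values as distinct in UNIQUE constraints, so two uncategorised tags with the same label would both be accepted. Storing an empty string instead of NULL for "no category" is one possible way around that.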
Boorus
A table containing a list of boorus being handled by BooruDex.
- id - Booru ID
- domain - The domain of the booru
- posts - The name of the table that contains booru posts
- tags - The name of the table that contains tag relations
- categories - The name of the table that contains tag categories
- latest - Known latest post in the booru
Booru_[id]_posts
A table containing post data for its booru.
- id - Post ID
- image - Media ID (referencing media table)
- thumb - Thumbnail ID (referencing thumb table)
- purity - A single character describing the purity of the post
- update - Last time the post entry was updated/tagged
Booru_[id]_tags
A table containing tag relations for its booru.
- tag - Tag ID (referencing tags table)
- post - Post ID (referencing booru_[id]_posts table)
Booru_[id]_categories
A table containing tag categories as they are represented by the booru.
- label - Tag label (unique)
- category - Tag category
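The per-booru table naming scheme might be set up along these lines. This is only a sketch: SQLite, the column types, and the helper names `booru_table_names` and `create_booru_tables` are all assumptions, and foreign keys to the media and thumb tables are left out to keep the example self-contained.

```python
import sqlite3


def booru_table_names(booru_id: int) -> dict:
    """Derive the three per-booru table names from the booru ID (hypothetical helper)."""
    # Table names cannot be bound as SQL parameters, so only plain integers are accepted.
    if not isinstance(booru_id, int) or booru_id < 0:
        raise ValueError("booru_id must be a non-negative integer")
    prefix = f"booru_{booru_id}"
    return {
        "posts": f"{prefix}_posts",
        "tags": f"{prefix}_tags",
        "categories": f"{prefix}_categories",
    }


def create_booru_tables(conn: sqlite3.Connection, booru_id: int) -> dict:
    """Create the posts, tags and categories tables for one booru."""
    names = booru_table_names(booru_id)
    conn.execute(
        f"""CREATE TABLE {names['posts']} (
            id       INTEGER PRIMARY KEY,
            image    INTEGER,           -- media ID
            thumb    INTEGER,           -- thumbnail ID
            purity   TEXT,              -- single character
            "update" TEXT               -- quoted because UPDATE is an SQL keyword
        )"""
    )
    conn.execute(
        f"""CREATE TABLE {names['tags']} (
            tag  INTEGER NOT NULL,      -- tag ID
            post INTEGER NOT NULL REFERENCES {names['posts']} (id),
            PRIMARY KEY (tag, post)
        )"""
    )
    conn.execute(
        f"""CREATE TABLE {names['categories']} (
            label    TEXT UNIQUE,
            category TEXT
        )"""
    )
    return names
```

The returned names are what would be stored in the posts, tags and categories columns of the boorus table.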
Media
A table containing data about media.
- id - Media ID
- uuid - Unique v4 UUID for referencing the actual media file
- size - The size of media file
- width - Media width
- height - Media height
- mime - Media mime type
- dhash - Difference hash of the media
- phash - Perceptual hash of the media
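To illustrate how the dhash column could support similar-image lookup, here is a common formulation of the difference hash. It is a sketch under assumptions: the image is taken to be already reduced to an 8-row by 9-column grayscale grid (resizing would be done with an image library, not shown), and the 64-bit size and the function names are choices made for this example.

```python
from typing import Sequence


def dhash(pixels: Sequence[Sequence[int]]) -> int:
    """Compute a 64-bit difference hash from an 8x9 grid of grayscale values.

    Each bit records whether a pixel is brighter than its right-hand neighbour.
    """
    if len(pixels) != 8 or any(len(row) != 9 for row in pixels):
        raise ValueError("expected 8 rows of 9 grayscale values")
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):  # 8 comparisons per row
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")
```

Two images would then be treated as similar when the Hamming distance between their hashes falls below some threshold. The right threshold is a tuning decision and is not specified here.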
Thumb
A table containing data about thumbnails.
- id - Thumbnail ID
- uuid - Unique v4 UUID for referencing the actual thumbnail file
- size - The size of thumbnail
- width - Thumbnail width
- height - Thumbnail height
- media - Media ID (referencing media table)
Workers
A table containing a list of known workers and their statistics.
- id - Worker ID
- uuid - Unique v4 UUID for referencing the actual worker
- ip - Latest IP address the worker has connected from
- seen - Latest date the worker connected
- scraped - The number of posts the worker has scraped
- thumbs - The number of thumbnails the worker has downloaded
- media - The number of media files the worker has downloaded
Workers
Workers request a number of tasks from the hub, providing supported types of tasks and supported booru domains.
Worker types currently under consideration:
- Scraper - Scrapes a range of posts given their IDs and returns their tags/metadata, media URL, and optionally a thumbnail.
- Downloader - Downloads media from the given URLs and reports their MIME types.
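The request flow described in this section could look roughly like the following on the hub side. It is an in-memory sketch, not a protocol definition: the `Task` dataclass, the `assign_tasks` function, and the string values used for task types are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Task:
    id: int
    domain: str                      # booru domain of the task
    type: str                        # e.g. "scrape" or "download" (assumed names)
    data: str                        # URL, ID range, etc.
    assignee: Optional[str] = None   # worker UUID once assigned


def assign_tasks(
    tasks: Iterable[Task],
    worker_uuid: str,
    supported_types: Set[str],
    supported_domains: Set[str],
    limit: int,
) -> List[Task]:
    """Hand out up to `limit` unassigned tasks that the worker is able to handle."""
    assigned: List[Task] = []
    for task in tasks:
        if len(assigned) >= limit:
            break
        if (
            task.assignee is None
            and task.type in supported_types
            and task.domain in supported_domains
        ):
            task.assignee = worker_uuid  # mark as taken so no other worker gets it
            assigned.append(task)
    return assigned
```

A real hub would perform the selection and the assignment in a single database transaction so that two workers asking at the same time cannot receive the same task.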