AniSearchModel¶
AniSearchModel leverages Sentence-BERT (SBERT) models to generate embeddings for anime and manga synopses, enabling the calculation of semantic similarities between descriptions. This project facilitates the preprocessing, merging, and analysis of various anime and manga datasets to identify the most similar synopses.
Table of Contents¶
- Overview
- Datasets Used
- Setup
- Usage
- Merging Datasets
- Generating Embeddings
- Testing Embeddings
- Running the Flask Application
- Project Structure
- Dependencies
- Contributing
- License
Overview¶
AniSearchModel performs the following operations:
- Data Loading and Preprocessing: Loads multiple anime and manga datasets, cleans synopses, consolidates titles, and removes duplicates.
- Data Merging: Merges datasets based on common identifiers to create unified anime and manga datasets.
- Embedding Generation: Utilizes SBERT models to generate embeddings for synopses, facilitating semantic similarity calculations.
- Similarity Analysis: Calculates cosine similarities between embeddings to identify the most similar synopses or descriptions.
- API Integration: Provides a Flask-based API to interact with the model and retrieve similarity results.
- Testing: Implements a comprehensive test suite using
pytest
to ensure the reliability and correctness of all components.
Datasets Used¶
Anime Datasets¶
- MyAnimeList Dataset (
Anime.csv
): Kaggle - Anime Dataset 2023 (
anime-dataset-2023.csv
): Kaggle - Anime Database 2022 (
Anime-2022.csv
): Kaggle - Anime Dataset (
animes.csv
): Kaggle - Anime DataSet (
anime4500.csv
): Kaggle - Anime Data (
anime_data.csv
): Kaggle - Anime2 (
anime2.csv
): Kaggle - MAL Anime (
mal_anime.csv
): Kaggle - Anime 270: Hugging Face
- Wykonos Anime: Hugging Face
Manga Datasets¶
- MyAnimeList Manga Dataset (
Manga.csv
): Kaggle - MyAnimeList Jikan Database (
jikan.csv
): Kaggle - Manga, Manhwa and Manhua Dataset (
data.csv
): Kaggle
Setup¶
- Clone the repository:
- Create and activate a virtual environment:
- Ensure
setuptools
is installed:
Before running the setup script, make sure setuptools
is installed in your virtual environment. This is typically included with Python, but you can update it with:
- Install the package and dependencies:
Use the setup.py
script to install the package along with its dependencies. This will also handle the installation of PyTorch with CUDA support:
This command will:
- Install all required Python packages listed in install_requires
.
- Execute the PostInstallCommand
to install PyTorch with CUDA support.
- Verify the installation:
After installation, you can verify that PyTorch is using CUDA by running:
This should print True
if CUDA is available and correctly configured.
Usage¶
Merging Datasets¶
The repository already contains the merged datasets, but if you want to merge additional datasets, edit the merge_datasets.py
file and run:
Generating Embeddings¶
To generate SBERT embeddings for the anime and manga datasets, you can use the provided scripts.
For a Specific Model¶
Replace <model_name>
with the desired SBERT model, e.g., all-mpnet-base-v1
. Replace <dataset_type>
with anime
or manga
.
Generating Embeddings for All Models¶
You can use the provided scripts to generate embeddings for all models listed in models.txt
.
Linux¶
The generate_models.sh
script is available for Linux users. To run the script, follow these steps:
- Make the script executable:
- Run the script:
- Optionally, specify a starting model:
Windows (Batch Script)¶
- Open Command Prompt and navigate to the directory containing the script.
- Run the script:
- Optionally, specify a starting model:
Windows (PowerShell Script)¶
- Open PowerShell and navigate to the directory containing the script.
- Run the script:
- Optionally, specify a starting model:
Notes¶
- The starting model parameter is optional. If not provided, the script will process all models from the beginning of the list.
- For PowerShell, you may need to adjust the execution policy to allow script execution. You can do this by running
Set-ExecutionPolicy RemoteSigned
in an elevated PowerShell session.
Testing Embeddings¶
Testing¶
To ensure the reliability and correctness of the project, a comprehensive suite of tests has been implemented using pytest
. The tests cover various components of the project, including:
Unit Tests¶
tests/test_model.py
:- Purpose: Tests the functionality of model loading, similarity calculations, and evaluation result saving.
-
Key Functions Tested:
test_anime_model
: Verifies that the anime model loads correctly, calculates similarities, and saves evaluation results as expected.test_manga_model
: Similar totest_anime_model
but for the manga dataset.
-
tests/test_merge_datasets.py
: - Purpose: Validates the data preprocessing and merging functions, ensuring that names are correctly processed, synopses are cleaned, titles are consolidated, and duplicates are removed or handled appropriately.
-
Key Functions Tested:
test_preprocess_name
: Ensures that names are preprocessed correctly by converting them to lowercase and stripping whitespace.test_clean_synopsis
: Checks that unwanted phrases are removed from synopses.test_consolidate_titles
: Verifies that multiple title columns are consolidated into a single 'title' column.test_remove_duplicate_infos
: Confirms that duplicate synopses are handled correctly.test_add_additional_info
: Tests the addition of additional synopsis information to the merged DataFrame.
-
tests/test_sbert.py
: - Purpose: Checks the SBERT embedding generation process, verifying that embeddings are correctly created and saved for both anime and manga datasets.
- Key Functions Tested:
run_sbert_command_and_verify
: Runs the SBERT command-line script and verifies that embeddings and evaluation results are generated as expected.- Parameterized tests for different dataset types (
anime
,manga
) and their corresponding expected embedding files.
API Tests¶
tests/test_api.py
:- Purpose: Tests the Flask API endpoints, ensuring that the
/anisearchmodel/manga
endpoint behaves as expected with valid inputs, handles missing fields gracefully, and correctly responds to internal server errors. - Key Functions Tested:
test_get_manga_similarities_success
: Verifies successful retrieval of similarities with valid inputs.test_get_manga_similarities_missing_model
: Checks the API's response when the model name is missing.test_get_manga_similarities_missing_description
: Ensures appropriate handling when the description is missing.- Tests for internal server errors by simulating exceptions during processing.
Test Configuration¶
tests/conftest.py
:- Purpose: Configures
pytest
options and fixtures, including command-line options for specifying the model name during tests. - Key Features:
- Adds a command-line option
--model-name
to specify the model used in tests. - Provides a fixture
model_name
that retrieves the model name from the command-line options.
- Adds a command-line option
Running the Tests¶
To run all the tests, navigate to the project's root directory and execute:
Running Specific Tests¶
You can run specific tests or test modules. For example, to run only the API tests:
To run tests for a specific model, use:
Replace <model_name>
with the name of the model you want to test.
Note¶
--model-name
can be used when running all tests or specific tests.
Running the Flask Application¶
To run the Flask application, use the run_server.py
script. This script automatically determines the operating system and uses the appropriate server. You can also specify whether to use CUDA or CPU for processing:
- On Linux, it uses Gunicorn.
- On Windows, it uses Waitress.
Run the script with:
Replace [cuda|cpu]
with your desired device. If no device is specified, it defaults to cpu
.
The application will be accessible at http://0.0.0.0:5000/anisearchmodel
.
Project Structure¶
This includes files and directories generated by the project which are not part of the source code.
AniSearchModel
├── .github
│ └── workflows
│ ├── codeql.yml
│ └── ruff.yml
├── data
│ ├── anime
│ │ ├── Anime_data.csv
│ │ ├── Anime-2022.csv
│ │ ├── anime-dataset-2023.csv
│ │ ├── anime.csv
│ │ ├── Anime2.csv
│ │ ├── anime4500.csv
│ │ ├── animes.csv
│ │ └── mal_anime.csv
│ └── manga
│ ├── data.csv
│ ├── jikan.csv
│ └── manga.csv
├── logs
│ └── <filename>.log.<#>
├── models
│ ├── anime
│ │ └── <model_name>
│ │ ├── embeddings_Synopsis_anime_270_Dataset.npy
│ │ ├── embeddings_Synopsis_Anime_data_Dataset.npy
│ │ ├── embeddings_Synopsis_anime_dataset_2023.npy
│ │ ├── embeddings_Synopsis_Anime-2022_Dataset.npy
│ │ ├── embeddings_Synopsis_anime2_Dataset.npy
│ │ ├── embeddings_Synopsis_anime4500_Dataset.npy
│ │ ├── embeddings_Synopsis_animes_dataset.npy
│ │ ├── embeddings_Synopsis_mal_anime_Dataset.npy
│ │ ├── embeddings_Synopsis_wykonos_Dataset.npy
│ │ └── embeddings_synopsis.npy
│ ├── manga
│ │ └── <model_name>
│ │ ├── embeddings_Synopsis_data_Dataset.npy
│ │ ├── embeddings_Synopsis_jikan_Dataset.npy
│ │ └── embeddings_synopsis.npy
│ ├── evaluation_results_anime.json
│ ├── evaluation_results_manga.json
│ ├── evaluation_results.json
│ ├── merged_anime_dataset.csv
│ └── merged_manga_dataset.csv
├── scripts
│ ├── generate_models.bat
│ ├── generate_models.ps1
│ └── generate_models.sh
├── src
│ ├── __init__.py
│ ├── api.py
│ ├── common.py
│ ├── merge_datasets.py
│ ├── run_server.py
│ ├── sbert.py
│ └── test.py
├── tests
│ ├── __init__.py
│ ├── conftest.py
│ ├── test_api.py
│ ├── test_merge_datasets.py
│ ├── test_model.py
│ └── test_sbert.py
├── .gitignore
├── architecture.txt
├── datasets.txt
├── LICENSE
├── models.txt
├── pytest.ini
├── README.md
├── requirements.txt
└── setup.py
Dependencies¶
- Python 3.6+
- Python Packages:
- pandas
- numpy
- torch
- transformers
- sentence-transformers
- tqdm
- datasets
- flask
- flask-limiter
- waitress
- gunicorn
- pytest
- pytest-order
Install all dependencies using:
Contributing¶
Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.
License¶
This project is licensed under the MIT License. See the LICENSE file for details.