AniSearchModel

AniSearchModel uses Sentence-BERT (SBERT) models to generate embeddings for anime and manga synopses and compares them by semantic similarity. The project preprocesses, merges, and analyzes several anime and manga datasets to identify the most similar synopses.

Table of Contents

Overview
Datasets Used
Setup
Usage
Merging Datasets
Generating Embeddings
- For a Specific Model
- Generating Embeddings for All Models
Testing Embeddings
Running the Flask Application
Project Structure
Dependencies
Contributing
License

Overview

AniSearchModel performs the following operations:

Data Loading and Preprocessing: Loads multiple anime and manga datasets, cleans synopses, consolidates titles, and removes duplicates.
Data Merging: Merges datasets based on common identifiers to create unified anime and manga datasets.
Embedding Generation: Utilizes SBERT models to generate embeddings for synopses, facilitating semantic similarity calculations.
Similarity Analysis: Calculates cosine similarities between embeddings to identify the most similar synopses or descriptions.
API Integration: Provides a Flask-based API to interact with the model and retrieve similarity results.
Testing: Implements a comprehensive test suite using pytest to ensure the reliability and correctness of all components.
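
The similarity analysis step can be illustrated with a minimal sketch. It uses short plain-Python lists in place of real SBERT embeddings, and the function names `cosine_similarity` and `top_k_similar` are illustrative assumptions rather than the project's own code.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_similar(query, embeddings, k=3):
    """Rank stored embeddings by similarity to the query, highest first."""
    scored = [(idx, cosine_similarity(query, emb)) for idx, emb in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors; real SBERT embeddings have hundreds of dimensions.
stored = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
ranking = top_k_similar([1.0, 0.0, 0.0], stored, k=2)
print(ranking)
```

Each entry in `ranking` pairs the index of a stored synopsis with its similarity score, so the indices can be mapped back to rows of the merged dataset.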

Datasets Used

Anime Datasets

MyAnimeList Dataset (Anime.csv): Kaggle
Anime Dataset 2023 (anime-dataset-2023.csv): Kaggle
Anime Database 2022 (Anime-2022.csv): Kaggle
Anime Dataset (animes.csv): Kaggle
Anime DataSet (anime4500.csv): Kaggle
Anime Data (anime_data.csv): Kaggle
Anime2 (anime2.csv): Kaggle
MAL Anime (mal_anime.csv): Kaggle
Anime 270: Hugging Face
Wykonos Anime: Hugging Face

Manga Datasets

MyAnimeList Manga Dataset (Manga.csv): Kaggle
MyAnimeList Jikan Database (jikan.csv): Kaggle
Manga, Manhwa and Manhua Dataset (data.csv): Kaggle

Setup

Clone the repository:

git clone https://github.com/RLAlpha49/AniSearchModel.git
cd AniSearchModel

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Linux/Mac
venv\Scripts\activate     # On Windows

Ensure setuptools is installed:

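Before running the setup script, make sure setuptools is installed in your virtual environment. Older Python versions add it to new virtual environments automatically, but environments created with Python 3.12 or later do not include it by default. Install or update it with: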

pip install --upgrade setuptools

Install the package and dependencies:

Use the setup.py script to install the package along with its dependencies. This will also handle the installation of PyTorch with CUDA support:

python setup.py install

This command will:
- Install all required Python packages listed in install_requires.
- Execute the PostInstallCommand to install PyTorch with CUDA support.

Verify the installation:

After installation, you can verify that PyTorch is using CUDA by running:

python -c "import torch; print(torch.cuda.is_available())"

This should print True if CUDA is available and correctly configured.
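
The same check can be wrapped in a small helper that falls back to the CPU when CUDA is unavailable. This is a sketch only: `pick_device` is a hypothetical name, not a function the project is known to provide.

```python
def pick_device(requested="cuda"):
    """Return "cuda" only when it was requested and PyTorch can use it; otherwise "cpu"."""
    if requested != "cuda":
        return "cpu"
    try:
        # Imported lazily so the check also works before PyTorch is installed.
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```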

Usage

Merging Datasets

The repository already contains the merged datasets. To merge additional datasets, edit merge_datasets.py (listed under src/ in the Project Structure section) and run:

python merge_datasets.py --type anime
python merge_datasets.py --type manga
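
As a rough illustration of merging on a common identifier, the sketch below combines rows from two sources that share an `id`. The column names (`id`, `title`, `synopsis`), the placeholder rows, and the function `merge_by_id` are assumptions made for the example; the actual logic in merge_datasets.py may differ.

```python
def merge_by_id(*datasets):
    """Combine rows that share an "id", keeping one title and every distinct synopsis."""
    merged = {}
    for rows in datasets:
        for row in rows:
            entry = merged.setdefault(row["id"], {"id": row["id"], "title": "", "synopses": []})
            if not entry["title"] and row.get("title"):
                entry["title"] = row["title"].strip()
            synopsis = (row.get("synopsis") or "").strip()
            # Skip empty synopses and exact duplicates from different sources.
            if synopsis and synopsis not in entry["synopses"]:
                entry["synopses"].append(synopsis)
    return list(merged.values())


# Placeholder rows standing in for two source datasets.
source_a = [{"id": 1, "title": "Example Title", "synopsis": "First description."}]
source_b = [
    {"id": 1, "title": "", "synopsis": "Second description."},
    {"id": 2, "title": "Another Title", "synopsis": "Unrelated description."},
]
merged_rows = merge_by_id(source_a, source_b)
print(merged_rows)
```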

Generating Embeddings

To generate SBERT embeddings for the anime and manga datasets, you can use the provided scripts.
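
For orientation, embedding generation with the sentence-transformers package can be sketched as follows. The helper names `load_sbert` and `generate_embeddings` are hypothetical, and skipping empty synopses is an assumption, not a confirmed detail of sbert.py.

```python
def load_sbert(model_name, device="cpu"):
    """Load an SBERT model by name (requires the sentence-transformers package)."""
    # Imported lazily so the helper below can be exercised without the package installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name, device=device)


def generate_embeddings(synopses, encoder, batch_size=32):
    """Encode the non-empty synopses; return their original indices and the embeddings."""
    kept = [i for i, text in enumerate(synopses) if isinstance(text, str) and text.strip()]
    texts = [synopses[i] for i in kept]
    embeddings = encoder.encode(texts, batch_size=batch_size)
    return kept, embeddings
```

A call such as `generate_embeddings(synopses, load_sbert("all-mpnet-base-v1"))` would return the kept indices alongside the embeddings, which could then be written to a .npy file with `numpy.save`.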

For a Specific Model

python sbert.py --model <model_name> --type <dataset_type>

Replace <model_name> with the desired SBERT model, e.g., all-mpnet-base-v1. Replace <dataset_type> with anime or manga.
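
A command-line interface with these two options could be declared with argparse as in the sketch below. Whether the real script marks the options as required or supplies defaults is not stated here, so treat those details as assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the two options described above: --model and --type."""
    parser = argparse.ArgumentParser(description="Generate SBERT embeddings (sketch).")
    parser.add_argument("--model", required=True, help="SBERT model name, e.g. all-mpnet-base-v1")
    parser.add_argument("--type", required=True, choices=["anime", "manga"], help="Dataset to embed")
    return parser.parse_args(argv)


args = parse_args(["--model", "all-mpnet-base-v1", "--type", "anime"])
print(args.model, args.type)
```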

Generating Embeddings for All Models

You can use the provided scripts to generate embeddings for all models listed in models.txt.

Linux

The generate_models.sh script is available for Linux users. To run the script, follow these steps:

Make the script executable:

chmod +x scripts/generate_models.sh

Run the script:

./scripts/generate_models.sh

Optionally, specify a starting model:

./scripts/generate_models.sh sentence-transformers/all-MiniLM-L6-v1

Windows (Batch Script)

Open Command Prompt and navigate to the project's root directory.
Run the script:

scripts\generate_models.bat

Optionally, specify a starting model:

scripts\generate_models.bat sentence-transformers/all-MiniLM-L6-v1

Windows (PowerShell Script)

Open PowerShell and navigate to the project's root directory.
Run the script:

.\scripts\generate_models.ps1

Optionally, specify a starting model:

.\scripts\generate_models.ps1 -StartModel "sentence-transformers/all-MiniLM-L6-v1"

Notes

The starting model parameter is optional. If not provided, the script will process all models from the beginning of the list.
For PowerShell, you may need to adjust the execution policy to allow script execution. You can do this by running Set-ExecutionPolicy RemoteSigned in an elevated PowerShell session.
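
The effect of the optional starting model can be sketched in Python. This assumes models.txt holds one model name per line and that an unknown starting model is an error; the shell scripts may behave differently, and both function names are hypothetical.

```python
def read_model_list(path="models.txt"):
    """Read one model name per line, ignoring blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def models_from(start_model, model_names):
    """Return the models to process, beginning at start_model when one is given."""
    if not start_model:
        return list(model_names)
    if start_model not in model_names:
        raise ValueError(f"Unknown starting model: {start_model}")
    return list(model_names[model_names.index(start_model):])


names = ["model-a", "model-b", "model-c"]  # placeholder names
print(models_from("model-b", names))
```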

Testing Embeddings

Testing

The project includes a pytest test suite that checks the reliability and correctness of its components. The tests cover the following:

Unit Tests

tests/test_model.py:
Purpose: Tests the functionality of model loading, similarity calculations, and evaluation result saving.
Key Functions Tested:
- test_anime_model: Verifies that the anime model loads correctly, calculates similarities, and saves evaluation results as expected.
- test_manga_model: Similar to test_anime_model but for the manga dataset.
tests/test_merge_datasets.py:
Purpose: Validates the data preprocessing and merging functions, ensuring that names are correctly processed, synopses are cleaned, titles are consolidated, and duplicates are removed or handled appropriately.
Key Functions Tested:
- test_preprocess_name: Ensures that names are preprocessed correctly by converting them to lowercase and stripping whitespace.
- test_clean_synopsis: Checks that unwanted phrases are removed from synopses.
- test_consolidate_titles: Verifies that multiple title columns are consolidated into a single 'title' column.
- test_remove_duplicate_infos: Confirms that duplicate synopses are handled correctly.
- test_add_additional_info: Tests the addition of additional synopsis information to the merged DataFrame.
tests/test_sbert.py:
Purpose: Checks the SBERT embedding generation process, verifying that embeddings are correctly created and saved for both anime and manga datasets.
Key Functions Tested:
- run_sbert_command_and_verify: Runs the SBERT command-line script and verifies that embeddings and evaluation results are generated as expected.
- Parameterized tests for different dataset types (anime, manga) and their corresponding expected embedding files.
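
To make the preprocessing behaviour covered by tests/test_merge_datasets.py more concrete, here is a minimal sketch of three such helpers. The bodies, the phrase list, and the column names are assumptions written for illustration; only the behaviours (lowercasing and stripping names, removing unwanted phrases, consolidating title columns) come from the descriptions above.

```python
import re

# Hypothetical boilerplate phrases; the project's real list is not shown here.
UNWANTED_PHRASES = ["(Source: Example)", "[Written by Example]"]


def preprocess_name(name):
    """Lowercase a name and strip surrounding whitespace."""
    return str(name).strip().lower()


def clean_synopsis(synopsis, phrases=UNWANTED_PHRASES):
    """Remove unwanted phrases, then collapse leftover whitespace."""
    for phrase in phrases:
        synopsis = synopsis.replace(phrase, "")
    return re.sub(r"\s+", " ", synopsis).strip()


def consolidate_titles(row, title_columns):
    """Return the first non-empty value among several title columns."""
    for column in title_columns:
        value = row.get(column)
        if value and str(value).strip():
            return str(value).strip()
    return ""


print(clean_synopsis("A story.  (Source: Example)"))
```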

API Tests

tests/test_api.py:
Purpose: Tests the Flask API endpoints, ensuring that the /anisearchmodel/manga endpoint behaves as expected with valid inputs, handles missing fields gracefully, and correctly responds to internal server errors.
Key Functions Tested:
- test_get_manga_similarities_success: Verifies successful retrieval of similarities with valid inputs.
- test_get_manga_similarities_missing_model: Checks the API's response when the model name is missing.
- test_get_manga_similarities_missing_description: Ensures appropriate handling when the description is missing.
- Tests for internal server errors by simulating exceptions during processing.
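
The missing-field cases can be pictured with a small validation sketch. The field names `model` and `description` follow the test names above, but the status codes, error messages, and the function `validate_payload` are assumptions, not the API's documented behaviour.

```python
def validate_payload(payload):
    """Return (status_code, body) for a similarity request payload."""
    if not isinstance(payload, dict):
        return 400, {"error": "Request body must be a JSON object"}
    missing = [field for field in ("model", "description") if not payload.get(field)]
    if missing:
        return 400, {"error": f"Missing required field(s): {', '.join(missing)}"}
    return 200, {"model": payload["model"], "description": payload["description"]}


print(validate_payload({"description": "A story about a detective."}))
```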

Test Configuration

tests/conftest.py:
Purpose: Configures pytest options and fixtures, including command-line options for specifying the model name during tests.
Key Features:
- Adds a command-line option --model-name to specify the model used in tests.
- Provides a fixture model_name that retrieves the model name from the command-line options.
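
A conftest.py with these two features typically looks like the sketch below, which uses pytest's standard `pytest_addoption` hook and `request.config.getoption`. The default model name is an assumption chosen for illustration.

```python
import pytest

# Assumed default; the project's actual default is not stated here.
DEFAULT_MODEL = "sentence-transformers/all-MiniLM-L6-v1"


def pytest_addoption(parser):
    """Register the --model-name command-line option."""
    parser.addoption(
        "--model-name",
        action="store",
        default=DEFAULT_MODEL,
        help="Name of the SBERT model to use in tests",
    )


@pytest.fixture
def model_name(request):
    """Expose the value of --model-name to tests."""
    return request.config.getoption("--model-name")
```

Any test that lists `model_name` as a parameter then receives the value passed on the command line.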

Running the Tests

To run all the tests, navigate to the project's root directory and execute:

pytest

Running Specific Tests

You can run specific tests or test modules. For example, to run only the API tests:

pytest tests/test_api.py

To run tests for a specific model, use:

pytest tests/test_sbert.py --model-name <model_name>

Replace <model_name> with the name of the model you want to test.

Note

--model-name can be used when running all tests or specific tests.

Running the Flask Application

To run the Flask application, use the run_server.py script. The script detects the operating system and starts the appropriate server:

On Linux, it uses Gunicorn.
On Windows, it uses Waitress.

You can also specify whether to use CUDA or CPU for processing.

Run the script with:

python src/run_server.py [cuda|cpu]

Replace [cuda|cpu] with your desired device. If no device is specified, it defaults to cpu.
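
The selection logic described in this section can be sketched as two small functions. The names are hypothetical, and raising an error on other operating systems or unknown devices is an assumption about behaviour that is not documented here.

```python
def choose_server(system_name):
    """Map an OS name (as returned by platform.system()) to a WSGI server."""
    if system_name == "Windows":
        return "waitress"
    if system_name == "Linux":
        return "gunicorn"
    raise RuntimeError(f"Unsupported operating system: {system_name}")


def choose_device(argv):
    """Return the device from the first CLI argument, defaulting to cpu."""
    device = argv[1].lower() if len(argv) > 1 else "cpu"
    if device not in ("cuda", "cpu"):
        raise ValueError(f"Device must be 'cuda' or 'cpu', got {device!r}")
    return device


print(choose_server("Linux"), choose_device(["run_server.py"]))
```

A launcher built this way would pass `platform.system()` and `sys.argv` to the two functions before starting the server.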

The server binds to 0.0.0.0 on port 5000, so on the same machine the application is reachable at http://localhost:5000/anisearchmodel.

Project Structure

The tree below also includes files and directories generated by the project, such as logs and embeddings, which are not part of the source code.

AniSearchModel
├── .github
│   └── workflows
│       ├── codeql.yml
│       └── ruff.yml
├── data
│   ├── anime
│   │   ├── Anime_data.csv
│   │   ├── Anime-2022.csv
│   │   ├── anime-dataset-2023.csv
│   │   ├── anime.csv
│   │   ├── Anime2.csv
│   │   ├── anime4500.csv
│   │   ├── animes.csv
│   │   └── mal_anime.csv
│   └── manga
│       ├── data.csv
│       ├── jikan.csv
│       └── manga.csv
├── logs
│   └── <filename>.log.<#>
├── models
│   ├── anime
│   │   └── <model_name>
│   │       ├── embeddings_Synopsis_anime_270_Dataset.npy
│   │       ├── embeddings_Synopsis_Anime_data_Dataset.npy
│   │       ├── embeddings_Synopsis_anime_dataset_2023.npy
│   │       ├── embeddings_Synopsis_Anime-2022_Dataset.npy
│   │       ├── embeddings_Synopsis_anime2_Dataset.npy
│   │       ├── embeddings_Synopsis_anime4500_Dataset.npy
│   │       ├── embeddings_Synopsis_animes_dataset.npy
│   │       ├── embeddings_Synopsis_mal_anime_Dataset.npy
│   │       ├── embeddings_Synopsis_wykonos_Dataset.npy
│   │       └── embeddings_synopsis.npy
│   ├── manga
│   │   └── <model_name>
│   │       ├── embeddings_Synopsis_data_Dataset.npy
│   │       ├── embeddings_Synopsis_jikan_Dataset.npy
│   │       └── embeddings_synopsis.npy
│   ├── evaluation_results_anime.json
│   ├── evaluation_results_manga.json
│   ├── evaluation_results.json
│   ├── merged_anime_dataset.csv
│   └── merged_manga_dataset.csv
├── scripts
│   ├── generate_models.bat
│   ├── generate_models.ps1
│   └── generate_models.sh
├── src
│   ├── __init__.py
│   ├── api.py
│   ├── common.py
│   ├── merge_datasets.py
│   ├── run_server.py
│   ├── sbert.py
│   └── test.py
├── tests
│   ├── __init__.py
│   ├── conftest.py
│   ├── test_api.py
│   ├── test_merge_datasets.py
│   ├── test_model.py
│   └── test_sbert.py
├── .gitignore
├── architecture.txt
├── datasets.txt
├── LICENSE
├── models.txt
├── pytest.ini
├── README.md
├── requirements.txt
└── setup.py

Dependencies

Python 3.6+
Python Packages:
pandas
numpy
torch
transformers
sentence-transformers
tqdm
datasets
flask
flask-limiter
waitress
gunicorn
pytest
pytest-order

Install all dependencies using:

python setup.py install

Contributing

Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.

License

This project is licensed under the MIT License. See the LICENSE file for details.