
API

This module implements a Flask application that provides API endpoints for finding anime or manga whose descriptions are similar to a given query description.

The application uses Sentence Transformers and custom models to encode descriptions and calculate cosine similarities. It supports multiple synopsis columns from different datasets and returns paginated results of the most similar items.
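
The core mechanism is to encode the query text into an embedding and score it against stored embeddings with cosine similarity. The snippet below is a minimal, standalone sketch of that idea using the sentence-transformers library; the API itself loads pre-computed per-column embeddings (see load_embeddings) rather than encoding a corpus on the fly, and the model and example strings here are illustrative choices only.

from sentence_transformers import SentenceTransformer, util

# Encode a query and a tiny corpus, then compare them with cosine similarity.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query_embedding = model.encode(["A crew of pirates searches for a legendary treasure."])
corpus_embeddings = model.encode([
    "Pirates sail the seas in search of the ultimate treasure.",
    "A detective solves crimes in a quiet seaside town.",
])
# Higher score for the pirate synopsis; result shape is (1, 2).
print(util.pytorch_cos_sim(query_embedding, corpus_embeddings))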

Key Features
  • Supports multiple pre-trained and custom Sentence Transformer models
  • Handles both anime and manga similarity searches
  • Implements rate limiting and CORS
  • Provides memory management for GPU resources
  • Includes comprehensive logging
  • Returns paginated results with similarity scores
The API endpoints are (a minimal request sketch follows the list):
  • POST /anisearchmodel/anime: Find similar anime based on description
  • POST /anisearchmodel/manga: Find similar manga based on description
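
A minimal request sketch for these endpoints; the host and port are assumptions for a local Flask development server and are not specified by this documentation:

import requests

# Hypothetical local server address; adjust to wherever the API is deployed.
response = requests.post(
    "http://localhost:5000/anisearchmodel/anime",
    json={
        "model": "sentence-transformers/all-mpnet-base-v2",
        "description": "A young pilot is drawn into a war fought with giant robots.",
        "page": 1,
        "resultsPerPage": 10,
    },
    timeout=120,
)
print(response.status_code)
print(response.json())  # list of similar anime with metadata and similarity scores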

CONSOLE_LOGGING_LEVEL module-attribute

CONSOLE_LOGGING_LEVEL = INFO

FILE_LOGGING_LEVEL module-attribute

FILE_LOGGING_LEVEL = DEBUG

allowed_models module-attribute

allowed_models = ['sentence-transformers/all-distilroberta-v1', 'sentence-transformers/all-MiniLM-L6-v1', 'sentence-transformers/all-MiniLM-L12-v1', 'sentence-transformers/all-MiniLM-L6-v2', 'sentence-transformers/all-MiniLM-L12-v2', 'sentence-transformers/all-mpnet-base-v1', 'sentence-transformers/all-mpnet-base-v2', 'sentence-transformers/all-roberta-large-v1', 'sentence-transformers/gtr-t5-base', 'sentence-transformers/gtr-t5-large', 'sentence-transformers/gtr-t5-xl', 'sentence-transformers/multi-qa-distilbert-dot-v1', 'sentence-transformers/multi-qa-mpnet-base-cos-v1', 'sentence-transformers/multi-qa-mpnet-base-dot-v1', 'sentence-transformers/paraphrase-distilroberta-base-v2', 'sentence-transformers/paraphrase-mpnet-base-v2', 'sentence-transformers/sentence-t5-base', 'sentence-transformers/sentence-t5-large', 'sentence-transformers/sentence-t5-xl', 'sentence-transformers/sentence-t5-xxl', 'toobi/anime', 'sentence-transformers/fine_tuned_sbert_anime_model', 'fine_tuned_sbert_anime_model', 'fine_tuned_sbert_model_anime']

anime_df module-attribute

anime_df = read_csv('model/merged_anime_dataset.csv')

anime_synopsis_columns module-attribute

anime_synopsis_columns = ['synopsis', 'Synopsis anime_dataset_2023', 'Synopsis animes dataset', 'Synopsis anime_270 Dataset', 'Synopsis Anime-2022 Dataset', 'Synopsis anime4500 Dataset', 'Synopsis wykonos Dataset', 'Synopsis Anime_data Dataset', 'Synopsis anime2 Dataset', 'Synopsis mal_anime Dataset']

app module-attribute

app = Flask(__name__)

debug_mode module-attribute

debug_mode = lower() in ['true', '1']

device module-attribute

device = 'cuda' if getenv('DEVICE', 'cpu') == 'cuda' and is_available() else 'cpu'

file_formatter module-attribute

file_formatter = Formatter('%(asctime)s - %(levelname)s - %(message)s')

file_handler module-attribute

file_handler = ConcurrentRotatingFileHandler('./logs/api.log', maxBytes=10 * 1024 * 1024, backupCount=10, encoding='utf-8')

last_request_time module-attribute

last_request_time = time()

last_request_time_lock module-attribute

last_request_time_lock = Lock()

limiter module-attribute

limiter = Limiter(get_remote_address, app=app, default_limits=['1 per second'])

manga_df module-attribute

manga_df = read_csv('model/merged_manga_dataset.csv')

manga_synopsis_columns module-attribute

manga_synopsis_columns = ['synopsis', 'Synopsis jikan Dataset', 'Synopsis data Dataset']

stream_formatter module-attribute

stream_formatter = Formatter('%(asctime)s - %(levelname)s - %(message)s')

stream_handler module-attribute

stream_handler = StreamHandler(stdout)

calculate_cosine_similarities

calculate_cosine_similarities(model: SentenceTransformer | CustomT5EncoderModel, model_name: str, new_embedding: ndarray, col: str, dataset_type: str) -> ndarray

Calculates cosine similarities between a new embedding and existing embeddings.

This function:

  1. Loads pre-computed embeddings for the specified column

  2. Verifies embedding dimensions match

  3. Computes cosine similarity scores using GPU if available

PARAMETERS
  • model (SentenceTransformer | CustomT5EncoderModel): The transformer model used for encoding
  • model_name (str): Name of the model
  • new_embedding (ndarray): Embedding vector of the input description
  • col (str): Name of the synopsis column
  • dataset_type (str): Type of dataset ('anime' or 'manga')

RETURNS
  • ndarray: Array of cosine similarity scores between the new embedding and all existing embeddings

RAISES
  • ValueError: If embedding dimensions don't match

Source code in src/api.py
def calculate_cosine_similarities(
    model: SentenceTransformer | CustomT5EncoderModel,
    model_name: str,
    new_embedding: np.ndarray,
    col: str,
    dataset_type: str,
) -> np.ndarray:
    """
    Calculates cosine similarities between a new embedding and existing embeddings.

    This function:

    1. Loads pre-computed embeddings for the specified column

    2. Verifies embedding dimensions match

    3. Computes cosine similarity scores using GPU if available

    Args:
        model: The transformer model used for encoding
        model_name: Name of the model
        new_embedding: Embedding vector of the input description
        col: Name of the synopsis column
        dataset_type: Type of dataset ('anime' or 'manga')

    Returns:
        Array of cosine similarity scores between the new embedding and all existing embeddings

    Raises:
        ValueError: If embedding dimensions don't match
    """
    model_name = model_name.replace("sentence-transformers/", "")
    model_name = model_name.replace("toobi/", "")
    existing_embeddings = load_embeddings(model_name, col, dataset_type)
    if existing_embeddings.shape[1] != model.get_sentence_embedding_dimension():
        raise ValueError(f"Incompatible dimension for embeddings in {col}")
    new_embedding_tensor = torch.tensor(new_embedding).to(device)
    existing_embeddings_tensor = torch.tensor(existing_embeddings).to(device)
    return (
        util.pytorch_cos_sim(new_embedding_tensor, existing_embeddings_tensor)
        .flatten()
        .cpu()
        .numpy()
    )

clear_memory

clear_memory() -> None

Frees up system memory and GPU cache.

This function performs two cleanup operations:

  1. Empties the GPU cache if CUDA is being used

  2. Runs Python's garbage collector to free memory

Source code in src/api.py
def clear_memory() -> None:
    """
    Frees up system memory and GPU cache.

    This function performs two cleanup operations:

    1. Empties the GPU cache if CUDA is being used

    2. Runs Python's garbage collector to free memory
    """
    torch.cuda.empty_cache()
    gc.collect()

find_top_similarities

find_top_similarities(cosine_similarities_dict: Dict[str, ndarray], num_similarities: int = 10) -> List[Tuple[int, str]]

Finds the top N most similar descriptions across all synopsis columns.

This function:

  1. Processes similarity scores from all columns

  2. Sorts them in descending order

  3. Returns indices and column names for the top matches

PARAMETERS
  • cosine_similarities_dict (Dict[str, ndarray]): Dictionary mapping column names to arrays of similarity scores
  • num_similarities (int, default 10): Number of top similarities to return

RETURNS
  • List[Tuple[int, str]]: List of (index, column_name) tuples for the top similar descriptions, sorted by similarity score in descending order

Source code in src/api.py
def find_top_similarities(
    cosine_similarities_dict: Dict[str, np.ndarray], num_similarities: int = 10
) -> List[Tuple[int, str]]:
    """
    Finds the top N most similar descriptions across all synopsis columns.

    This function:

    1. Processes similarity scores from all columns

    2. Sorts them in descending order

    3. Returns indices and column names for the top matches

    Args:
        cosine_similarities_dict: Dictionary mapping column names to arrays of similarity scores
        num_similarities: Number of top similarities to return (default: 10)

    Returns:
        List of tuples containing (index, column_name) for the top similar descriptions,
        sorted by similarity score in descending order
    """
    all_top_indices = []
    for col, cosine_similarities in cosine_similarities_dict.items():
        top_indices_unsorted = np.argsort(cosine_similarities)[-num_similarities:]
        top_indices = top_indices_unsorted[
            np.argsort(cosine_similarities[top_indices_unsorted])[::-1]
        ]
        all_top_indices.extend([(idx, col) for idx in top_indices])
    all_top_indices.sort(
        key=lambda x: cosine_similarities_dict[x[1]][x[0]],  # type: ignore
        reverse=True,
    )  # type: ignore
    return all_top_indices
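
As a standalone illustration of the behaviour, the toy snippet below mirrors the selection logic above rather than importing src/api.py (which loads the datasets at import time); the column names and scores are made up:

import numpy as np

cosine_similarities_dict = {
    "synopsis": np.array([0.10, 0.95, 0.40]),
    "Synopsis jikan Dataset": np.array([0.80, 0.05, 0.30]),
}

# Take the top 2 indices per column, then order all (index, column) pairs by score.
pairs = [
    (idx, col)
    for col, sims in cosine_similarities_dict.items()
    for idx in np.argsort(sims)[-2:].tolist()
]
pairs.sort(key=lambda x: cosine_similarities_dict[x[1]][x[0]], reverse=True)
print(pairs[:2])  # [(1, 'synopsis'), (0, 'Synopsis jikan Dataset')]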

get_anime_similarities

get_anime_similarities() -> Response

API endpoint for finding similar anime based on a description.

This endpoint:

  1. Validates the request payload

  2. Processes the description using the specified model

  3. Returns paginated results of similar anime

Expected JSON payload:

{
    "model": str,          # Name of the model to use
    "description": str,    # Input description to find similarities for
    "page": int,           # Optional: Page number (default: 1)
    "resultsPerPage": int  # Optional: Results per page (default: 10)
}

RETURNS
  • Response: JSON response containing the list of similar anime with metadata, similarity scores, and pagination information

RAISES
  • 400: If request validation fails
  • 500: If internal processing error occurs

Source code in src/api.py
@app.route("/anisearchmodel/anime", methods=["POST"])
@limiter.limit("1 per second")
def get_anime_similarities() -> Response:
    """
    API endpoint for finding similar anime based on a description.

    This endpoint:

    1. Validates the request payload

    2. Processes the description using the specified model

    3. Returns paginated results of similar anime

    Expected JSON payload:
    ```
    {
        "model": str,          # Name of the model to use
        "description": str,    # Input description to find similarities for
        "page": int,           # Optional: Page number (default: 1)
        "resultsPerPage": int  # Optional: Results per page (default: 10)
    }
    ```

    Returns:
        JSON response containing:
        - List of similar anime with metadata
        - Similarity scores
        - Pagination information

    Raises:
        400: If request validation fails
        500: If internal processing error occurs
    """
    try:
        clear_memory()
        data = request.json
        if data is None:
            raise ValueError("Request payload is missing or not in JSON format")
        validate_input(data)
        model_name = data.get("model")
        if model_name == "sentence-transformers/fine_tuned_sbert_anime_model":
            model_name = "fine_tuned_sbert_model_anime"
        description = data.get("description")
        page = data.get("page", 1)
        results_per_page = data.get("resultsPerPage", 10)

        # Get the client's IP address
        client_ip = request.headers.get("X-Forwarded-For", request.remote_addr)
        logging.info(
            "Received anime request from IP: %s with model: %s, "
            "description: %s, page: %d, resultsPerPage: %d",
            client_ip,
            model_name,
            description,
            page,
            results_per_page,
        )

        results = get_similarities(
            model_name, description, "anime", page, results_per_page
        )
        logging.info("Returning %d anime results", len(results))
        clear_memory()
        return jsonify(results)

    except ValueError as e:
        logging.error("Validation error: %s", e)
        return make_response(jsonify({"error": "Bad Request"}), 400)
    except Exception as e:  # pylint: disable=broad-exception-caught
        logging.error("Internal server error: %s", e)
        return make_response(jsonify({"error": "Internal server error"}), 500)

get_manga_similarities

get_manga_similarities() -> Response

API endpoint for finding similar manga based on a description.

This endpoint:

  1. Validates the request payload

  2. Processes the description using the specified model

  3. Returns paginated results of similar manga

Expected JSON payload:

{
    "model": str,          # Name of the model to use
    "description": str,    # Input description to find similarities for
    "page": int,           # Optional: Page number (default: 1)
    "resultsPerPage": int  # Optional: Results per page (default: 10)
}

RETURNS
  • Response: JSON response containing the list of similar manga with metadata, similarity scores, and pagination information

RAISES
  • 400: If request validation fails
  • 500: If internal processing error occurs

Source code in src/api.py
@app.route("/anisearchmodel/manga", methods=["POST"])  # type: ignore
@limiter.limit("1 per second")
def get_manga_similarities() -> Response:
    """
    API endpoint for finding similar manga based on a description.

    This endpoint:

    1. Validates the request payload

    2. Processes the description using the specified model

    3. Returns paginated results of similar manga

    Expected JSON payload:
    ```
    {
        "model": str,          # Name of the model to use
        "description": str,    # Input description to find similarities for
        "page": int,           # Optional: Page number (default: 1)
        "resultsPerPage": int  # Optional: Results per page (default: 10)
    }
    ```

    Returns:
        JSON response containing:
        - List of similar manga with metadata
        - Similarity scores
        - Pagination information

    Raises:
        400: If request validation fails
        500: If internal processing error occurs
    """
    try:
        clear_memory()
        data = request.json
        if data is None:
            raise ValueError("Request payload is missing or not in JSON format")
        validate_input(data)
        model_name = data.get("model")
        if model_name == "sentence-transformers/fine_tuned_sbert_anime_model":
            model_name = "fine_tuned_sbert_model_anime"
        description = data.get("description")
        page = data.get("page", 1)
        results_per_page = data.get("resultsPerPage", 10)

        # Get the client's IP address
        client_ip = request.headers.get("X-Forwarded-For", request.remote_addr)

        logging.info(
            "Manga request - IP: %s, model: %s, desc: %s, "
            "page: %d, results/page: %d",
            client_ip,
            model_name,
            description,
            page,
            results_per_page,
        )

        results = get_similarities(
            model_name, description, "manga", page, results_per_page
        )
        logging.info("Returning %d manga results", len(results))
        clear_memory()
        return jsonify(results)

    except HTTPException as e:
        logging.error("HTTP error: %s", e)
        return make_response(jsonify({"error": e.description}), e.code)
    except Exception as e:  # pylint: disable=broad-exception-caught
        logging.error("Internal server error: %s", e)
        return make_response(jsonify({"error": "Internal server error"}), 500)

get_similarities

get_similarities(model_name: str, description: str, dataset_type: str, page: int = 1, results_per_page: int = 10) -> List[Dict[str, Any]]

Finds the most similar descriptions in the specified dataset.

This function:

  1. Loads and validates the appropriate model

  2. Encodes the input description

  3. Calculates similarities with all stored descriptions

  4. Returns paginated results with metadata

PARAMETERS
  • model_name (str): Name of the model to use
  • description (str): Input description to find similarities for
  • dataset_type (str): Type of dataset ('anime' or 'manga')
  • page (int, default 1): Page number for pagination
  • results_per_page (int, default 10): Number of results per page

RETURNS
  • List[Dict[str, Any]]: List of dictionaries containing similar items with metadata and similarity scores

RAISES
  • ValueError: If model name is invalid or model loading fails

Source code in src/api.py
def get_similarities(
    model_name: str,
    description: str,
    dataset_type: str,
    page: int = 1,
    results_per_page: int = 10,
) -> List[Dict[str, Any]]:
    """
    Finds the most similar descriptions in the specified dataset.

    This function:

    1. Loads and validates the appropriate model

    2. Encodes the input description

    3. Calculates similarities with all stored descriptions

    4. Returns paginated results with metadata

    Args:
        model_name: Name of the model to use
        description: Input description to find similarities for
        dataset_type: Type of dataset ('anime' or 'manga')
        page: Page number for pagination (default: 1)
        results_per_page: Number of results per page (default: 10)

    Returns:
        List of dictionaries containing similar items with metadata and similarity scores

    Raises:
        ValueError: If model name is invalid or model loading fails
    """
    update_last_request_time()

    # Validate model name
    if model_name not in allowed_models:
        raise ValueError("Invalid model name")

    # Select the appropriate dataset and synopsis columns
    if dataset_type == "anime":
        df = anime_df
        synopsis_columns = anime_synopsis_columns
    else:
        df = manga_df
        synopsis_columns = manga_synopsis_columns

    if (
        model_name == "fine_tuned_sbert_anime_model"
        or model_name == "fine_tuned_sbert_model_anime"
    ):
        load_model_name = f"model/{model_name}"
    else:
        load_model_name = model_name

    # Load the complete SentenceTransformer model
    try:
        model = SentenceTransformer(load_model_name, device=device)
    except Exception as e:
        raise ValueError(f"Failed to load model '{load_model_name}': {e}") from e

    processed_description = description.strip()
    new_pooled_embedding = model.encode([processed_description])

    cosine_similarities_dict = {
        col: calculate_cosine_similarities(
            model, model_name, new_pooled_embedding, col, dataset_type
        )
        for col in synopsis_columns
    }

    all_top_indices = find_top_similarities(
        cosine_similarities_dict, num_similarities=page * results_per_page
    )

    seen_names = set()
    results: List[Dict[str, Any]] = []

    for idx, col in all_top_indices:
        name = df.iloc[idx]["title"]
        relevant_synopsis = df.iloc[idx][col]

        # Check if the relevant synopsis is valid
        if pd.isna(relevant_synopsis) or relevant_synopsis.strip() == "":
            continue

        if name not in seen_names:
            row_data = df.iloc[idx].to_dict()  # Convert the entire row to a dictionary
            # Keep only the relevant synopsis column
            row_data = {
                k: v
                for k, v in row_data.items()
                if k not in synopsis_columns or k == col
            }
            row_data.update(
                {
                    "rank": len(results) + 1,
                    "similarity": float(cosine_similarities_dict[col][idx]),
                    "synopsis": relevant_synopsis,  # Ensure the correct synopsis is included
                }
            )
            results.append(row_data)
            seen_names.add(name)
            if len(results) >= page * results_per_page:
                break

    # Clear memory
    del model, new_pooled_embedding, cosine_similarities_dict
    clear_memory()

    # Calculate start and end indices for pagination
    start_index = (page - 1) * results_per_page
    end_index = start_index + results_per_page

    return results[start_index:end_index]
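
Results are collected until up to page * results_per_page unique titles are found, and only the requested slice is returned. A short worked example of the pagination arithmetic, with assumed values page=2 and resultsPerPage=10:

# With page=2 and results_per_page=10, up to 20 unique results are gathered
# and the slice below keeps the second page of ten items.
page, results_per_page = 2, 10
start_index = (page - 1) * results_per_page  # 10
end_index = start_index + results_per_page   # 20
# results[10:20] -> zero-based indices 10..19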

load_embeddings

load_embeddings(model_name: str, col: str, dataset_type: str) -> ndarray

Loads pre-computed embeddings for a specific model and dataset column.

PARAMETERS
  • model_name (str): Name of the model used to generate the embeddings
  • col (str): Name of the synopsis column
  • dataset_type (str): Type of dataset ('anime' or 'manga')

RETURNS
  • ndarray: NumPy array containing the pre-computed embeddings

RAISES
  • FileNotFoundError: If the embeddings file doesn't exist

Source code in src/api.py
def load_embeddings(model_name: str, col: str, dataset_type: str) -> np.ndarray:
    """
    Loads pre-computed embeddings for a specific model and dataset column.

    Args:
        model_name: Name of the model used to generate the embeddings
        col: Name of the synopsis column
        dataset_type: Type of dataset ('anime' or 'manga')

    Returns:
        NumPy array containing the pre-computed embeddings

    Raises:
        FileNotFoundError: If the embeddings file doesn't exist
    """
    embeddings_file = (
        f"model/{dataset_type}/{model_name}/embeddings_{col.replace(' ', '_')}.npy"
    )
    return np.load(embeddings_file)
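
For example, for the anime dataset, the all-mpnet-base-v2 model, and the 'Synopsis anime_270 Dataset' column, the constructed path looks like this (whether the .npy file actually exists depends on the embeddings having been pre-computed):

# Path construction only; loading would still require the file to be present.
col = "Synopsis anime_270 Dataset"
path = f"model/anime/all-mpnet-base-v2/embeddings_{col.replace(' ', '_')}.npy"
print(path)  # model/anime/all-mpnet-base-v2/embeddings_Synopsis_anime_270_Dataset.npy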

periodic_memory_clear

periodic_memory_clear() -> None

Runs a background thread that periodically cleans up memory.

The thread monitors the time since the last API request. If no requests have been made for over 300 seconds (5 minutes), it triggers memory cleanup to free resources.

The function runs indefinitely until the application is shut down.

Source code in src/api.py
def periodic_memory_clear() -> None:
    """
    Runs a background thread that periodically cleans up memory.

    The thread monitors the time since the last API request. If no requests have been
    made for over 300 seconds (5 minutes), it triggers memory cleanup to free resources.

    The function runs indefinitely until the application is shut down.
    """
    logging.info("Starting the periodic memory clear thread.")
    while True:
        with last_request_time_lock:
            current_time = time.time()
            if current_time - last_request_time > 300:
                logging.debug("Clearing memory due to inactivity.")
                clear_memory()
        time.sleep(300)
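
This section does not show how the cleanup loop is started; one conventional way would be a daemon thread, sketched here as an assumption rather than the module's actual wiring:

import threading

# Hypothetical startup; periodic_memory_clear is the function defined above.
threading.Thread(target=periodic_memory_clear, daemon=True).start()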

update_last_request_time

update_last_request_time() -> None

Updates the last request time to the current time in a thread-safe manner.

This function is used to track when the last API request was made, which helps with memory management and cleanup of unused resources.

Source code in src/api.py
def update_last_request_time() -> None:
    """
    Updates the last request time to the current time in a thread-safe manner.

    This function is used to track when the last API request was made, which helps
    with memory management and cleanup of unused resources.
    """
    with last_request_time_lock:
        global last_request_time
        last_request_time = time.time()

validate_input

validate_input(data: Dict[str, Any]) -> None

Validates the input data for API requests.

This function checks that:

  1. Both model name and description are provided

  2. The description length is within acceptable limits

  3. The specified model is in the list of allowed models

PARAMETERS
  • data (Dict[str, Any]): Dictionary containing the request data with 'model' and 'description' keys

RAISES
  • HTTPException: If any validation check fails, with appropriate error message and status code

Source code in src/api.py
def validate_input(data: Dict[str, Any]) -> None:
    """
    Validates the input data for API requests.

    This function checks that:

    1. Both model name and description are provided

    2. The description length is within acceptable limits

    3. The specified model is in the list of allowed models

    Args:
        data: Dictionary containing the request data with 'model' and 'description' keys

    Raises:
        HTTPException: If any validation check fails, with appropriate error message and status code
    """
    model_name = data.get("model")
    description = data.get("description")

    if not model_name or not description:
        logging.error("Model name or description missing in the request.")
        abort(400, description="Model name and description are required")

    if len(description) > 2000:
        logging.error("Description too long.")
        abort(400, description="Description is too long")

    if model_name not in allowed_models:
        logging.error("Invalid model name.")
        abort(400, description="Invalid model name")