Test

Provides functionality for semantic similarity search in anime and manga datasets.

This module handles loading pre-trained models and embeddings, calculating semantic similarities between descriptions, and saving evaluation results. It supports both anime and manga datasets and uses sentence transformers for embedding generation.

Key Features
  • Model and embedding loading with automatic device selection
  • Batched similarity calculation using cosine similarity
  • Deduplication of results based on titles
  • Comprehensive evaluation result logging
  • Support for multiple synopsis/description columns

The module is designed to work with pre-computed embeddings stored in numpy arrays and uses efficient tensor operations for similarity calculations.
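A minimal end-to-end sketch of the intended flow, assuming the merged dataset CSVs and pre-computed embeddings already exist on disk; the import path, query text, and output file path below are illustrative, not defined by the module:

from src.test import (
    calculate_similarities,
    load_model_and_embeddings,
    save_evaluation_results,
)

# Load the model, dataset, synopsis columns, and embeddings directory.
model, df, synopsis_columns, embeddings_dir = load_model_and_embeddings(
    "all-MiniLM-L6-v2", "anime"
)

# Rank the dataset against a new description (query text is illustrative).
query = "A young alchemist searches for a way to restore his brother's body."
results = calculate_similarities(
    model, df, synopsis_columns, embeddings_dir, query, top_n=5
)

# Append the results, with metadata, to an evaluation log (path is illustrative).
save_evaluation_results(
    "model/evaluation_results.json", "all-MiniLM-L6-v2", "anime", query, results
)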

FUNCTIONS
  • load_model_and_embeddings: Loads the model, dataset, and pre-computed embeddings for similarity search
  • calculate_similarities: Computes semantic similarities between descriptions
  • save_evaluation_results: Logs evaluation results with timestamps and metadata

calculate_similarities

calculate_similarities(model: SentenceTransformer, df: DataFrame, synopsis_columns: List[str], embeddings_save_dir: str, new_description: str, top_n: int = 10) -> List[Dict[str, Any]]

Find semantically similar titles by comparing embeddings.

Calculates cosine similarities between a new description's embedding and pre-computed embeddings from the dataset. Returns the top-N most similar titles, removing duplicates across different synopsis columns.

PARAMETER DESCRIPTION
model

Model to encode the new description

TYPE: SentenceTransformer

df

Dataset containing titles and synopses

TYPE: DataFrame

synopsis_columns

Columns containing synopsis text

TYPE: List[str]

embeddings_save_dir

Directory containing pre-computed embeddings

TYPE: str

new_description

Description to find similar titles for

TYPE: str

top_n

Number of similar titles to return. Defaults to 10.

TYPE: int DEFAULT: 10

RETURNS DESCRIPTION
List[Dict[str, Any]]

Top similar titles, each containing:
  • rank: Position in results (1-based)
  • title: Title of the anime/manga
  • synopsis: Plot description/synopsis
  • similarity: Cosine similarity score
  • source_column: Column the synopsis came from

RAISES DESCRIPTION
ValueError

If no valid embeddings are found in embeddings_save_dir

Source code in src/test.py
def calculate_similarities(
    model: SentenceTransformer,
    df: pd.DataFrame,
    synopsis_columns: List[str],
    embeddings_save_dir: str,
    new_description: str,
    top_n: int = 10,
) -> List[Dict[str, Any]]:
    """
    Find semantically similar titles by comparing embeddings.

    Calculates cosine similarities between a new description's embedding and
    pre-computed embeddings from the dataset. Returns the top-N most similar
    titles, removing duplicates across different synopsis columns.

    Args:
        model (SentenceTransformer): Model to encode the new description
        df (pd.DataFrame): Dataset containing titles and synopses
        synopsis_columns (List[str]): Columns containing synopsis text
        embeddings_save_dir (str): Directory containing pre-computed embeddings
        new_description (str): Description to find similar titles for
        top_n (int, optional): Number of similar titles to return. Defaults to 10.

    Returns:
        List[Dict[str, Any]]: Top similar titles, each containing:
            - rank: Position in results (1-based)
            - title: Title of the anime/manga
            - synopsis: Plot description/synopsis
            - similarity: Cosine similarity score
            - source_column: Column the synopsis came from

    Raises:
        ValueError: If no valid embeddings are found in embeddings_save_dir
    """
    processed_description = common.preprocess_text(new_description)
    new_pooled_embedding = model.encode(
        [processed_description], convert_to_tensor=True, device="cpu"
    )

    cosine_similarities_dict = {}
    for col in synopsis_columns:
        embeddings_file = os.path.join(
            embeddings_save_dir, f"embeddings_{col.replace(' ', '_')}.npy"
        )
        if not os.path.exists(embeddings_file):
            print(f"Embeddings file not found for column '{col}': {embeddings_file}")
            continue

        existing_embeddings = np.load(embeddings_file)
        existing_embeddings_tensor = torch.tensor(existing_embeddings).to("cpu")

        with torch.no_grad():
            cosine_similarities = (
                util.pytorch_cos_sim(new_pooled_embedding, existing_embeddings_tensor)
                .squeeze(0)
                .cpu()
                .numpy()
            )

        cosine_similarities_dict[col] = cosine_similarities

    if not cosine_similarities_dict:
        raise ValueError(
            "No valid embeddings were loaded. Please check your embeddings directory and files."
        )

    all_top_indices = []
    for col, cosine_similarities in cosine_similarities_dict.items():
        top_indices_unsorted = np.argsort(cosine_similarities)[-top_n:]
        top_indices = top_indices_unsorted[
            np.argsort(cosine_similarities[top_indices_unsorted])[::-1]
        ]
        all_top_indices.extend([(idx, col) for idx in top_indices])

    all_top_indices.sort(
        key=lambda x: cosine_similarities_dict[x[1]][x[0]], reverse=True
    )

    seen_names = set()
    top_results: List[Dict[str, Any]] = []
    for idx, col in all_top_indices:
        if len(top_results) >= top_n:
            break
        name = df.iloc[idx]["title"]
        if name in seen_names:
            continue
        synopsis = df.iloc[idx][col]
        similarity = cosine_similarities_dict[col][idx]
        top_results.append(
            {
                "rank": len(top_results) + 1,
                "title": name,
                "synopsis": synopsis,
                "similarity": float(similarity),
                "source_column": col,
            }
        )
        seen_names.add(name)

    return top_results
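
A brief usage sketch, assuming model, df, synopsis_columns, and embeddings_dir were obtained from load_model_and_embeddings and the corresponding .npy files exist; the query text is illustrative:

results = calculate_similarities(
    model=model,
    df=df,
    synopsis_columns=synopsis_columns,
    embeddings_save_dir=embeddings_dir,
    new_description="Two siblings travel the world in search of a lost artifact.",
    top_n=3,
)
for entry in results:
    # Each entry carries rank, title, synopsis, similarity, and source_column.
    print(f"{entry['rank']}. {entry['title']} "
          f"({entry['similarity']:.3f}, from {entry['source_column']})")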

load_model_and_embeddings

load_model_and_embeddings(model_name: str, dataset_type: str) -> Tuple[SentenceTransformer, DataFrame, List[str], str]

Load the model, dataset and pre-computed embeddings for similarity search.

Handles loading of the appropriate sentence transformer model, dataset and pre-computed embeddings based on the specified dataset type. Supports both anime and manga datasets with their respective synopsis columns.

PARAMETER DESCRIPTION
model_name

Name of the sentence transformer model to load. Will prepend 'sentence-transformers/' if not already present.

TYPE: str

dataset_type

Type of dataset to load ('anime' or 'manga'). Determines which dataset and embeddings to load.

TYPE: str

RETURNS DESCRIPTION
tuple
  • SentenceTransformer: Loaded model instance
  • pd.DataFrame: Dataset containing titles and synopses
  • List[str]: Names of synopsis columns in the dataset
  • str: Directory path containing pre-computed embeddings

TYPE: Tuple[SentenceTransformer, DataFrame, List[str], str]

RAISES DESCRIPTION
ValueError

If dataset_type is not 'anime' or 'manga'

Source code in src/test.py
def load_model_and_embeddings(
    model_name: str, dataset_type: str
) -> Tuple[SentenceTransformer, pd.DataFrame, List[str], str]:
    """
    Load the model, dataset and pre-computed embeddings for similarity search.

    Handles loading of the appropriate sentence transformer model, dataset and
    pre-computed embeddings based on the specified dataset type. Supports both
    anime and manga datasets with their respective synopsis columns.

    Args:
        model_name (str): Name of the sentence transformer model to load.
            Will prepend 'sentence-transformers/' if not already present.
        dataset_type (str): Type of dataset to load ('anime' or 'manga').
            Determines which dataset and embeddings to load.

    Returns:
        tuple:
            - SentenceTransformer: Loaded model instance
            - pd.DataFrame: Dataset containing titles and synopses
            - List[str]: Names of synopsis columns in the dataset
            - str: Directory path containing pre-computed embeddings

    Raises:
        ValueError: If dataset_type is not 'anime' or 'manga'
    """
    if not model_name.startswith("sentence-transformers/"):
        model_name = f"sentence-transformers/{model_name}"

    if dataset_type == "anime":
        dataset_path = "model/merged_anime_dataset.csv"
        synopsis_columns = [
            "synopsis",
            "Synopsis anime_dataset_2023",
            "Synopsis animes dataset",
            "Synopsis anime_270 Dataset",
            "Synopsis Anime-2022 Dataset",
            "Synopsis anime4500 Dataset",
            "Synopsis wykonos Dataset",
            "Synopsis Anime_data Dataset",
            "Synopsis anime2 Dataset",
            "Synopsis mal_anime Dataset",
        ]
        embeddings_save_dir = f"model/anime/{model_name.split('/')[-1]}"
    elif dataset_type == "manga":
        dataset_path = "model/merged_manga_dataset.csv"
        synopsis_columns = [
            "synopsis",
            "Synopsis jikan Dataset",
            "Synopsis data Dataset",
        ]
        embeddings_save_dir = f"model/manga/{model_name.split('/')[-1]}"
    else:
        raise ValueError("Invalid dataset type specified. Use 'anime' or 'manga'.")

    df = common.load_dataset(dataset_path)
    model = SentenceTransformer(model_name, device="cpu")
    return model, df, synopsis_columns, embeddings_save_dir
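
A short usage sketch, assuming model/merged_anime_dataset.csv and the matching embeddings directory are present; "all-MiniLM-L6-v2" stands in for any sentence-transformers model name:

model, df, synopsis_columns, embeddings_dir = load_model_and_embeddings(
    model_name="all-MiniLM-L6-v2",  # 'sentence-transformers/' is prepended automatically
    dataset_type="anime",
)
print(embeddings_dir)          # model/anime/all-MiniLM-L6-v2
print(len(synopsis_columns))   # 10 synopsis columns for the anime dataset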

save_evaluation_results

save_evaluation_results(evaluation_file: str, model_name: str, dataset_type: str, new_description: str, top_results: List[Dict[str, Any]]) -> str

Save similarity search results with metadata for evaluation.

Appends the search results and metadata to a JSON file for later analysis. Creates a new file if it doesn't exist. Each entry includes a timestamp, model information, dataset type, query description, and similarity results.

PARAMETER DESCRIPTION
evaluation_file

Path to save/append results

TYPE: str

model_name

Name of model used for embeddings

TYPE: str

dataset_type

Type of dataset searched ('anime' or 'manga')

TYPE: str

new_description

Query description used for search

TYPE: str

top_results

Similarity search results

TYPE: List[Dict[str, Any]]

RETURNS DESCRIPTION
str

Path to the evaluation file

TYPE: str

The saved JSON structure includes
  • timestamp: When the search was performed
  • model_name: Model used for embeddings
  • dataset_type: Type of dataset searched
  • new_description: Query description
  • top_similarities: List of similar titles and their scores
Source code in src/test.py
def save_evaluation_results(
    evaluation_file: str,
    model_name: str,
    dataset_type: str,
    new_description: str,
    top_results: List[Dict[str, Any]],
) -> str:
    """
    Save similarity search results with metadata for evaluation.

    Appends the search results and metadata to a JSON file for later analysis.
    Creates a new file if it doesn't exist. Each entry includes a timestamp,
    model information, dataset type, query description, and similarity results.

    Args:
        evaluation_file (str): Path to save/append results
        model_name (str): Name of model used for embeddings
        dataset_type (str): Type of dataset searched ('anime' or 'manga')
        new_description (str): Query description used for search
        top_results (List[Dict[str, Any]]): Similarity search results

    Returns:
        str: Path to the evaluation file

    The saved JSON structure includes:
        - timestamp: When the search was performed
        - model_name: Model used for embeddings
        - dataset_type: Type of dataset searched
        - new_description: Query description
        - top_similarities: List of similar titles and their scores
    """
    if os.path.exists(evaluation_file):
        with open(evaluation_file, "r", encoding="utf-8") as f:
            try:
                evaluation_data = json.load(f)
            except json.JSONDecodeError:
                evaluation_data = []
    else:
        evaluation_data = []

    test_result = {
        "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "model_name": model_name,
        "dataset_type": dataset_type,
        "new_description": new_description,
        "top_similarities": top_results,
    }

    evaluation_data.append(test_result)

    with open(evaluation_file, "w", encoding="utf-8") as f:
        json.dump(evaluation_data, f, indent=4)

    return evaluation_file
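
A brief usage sketch, assuming top_results came from calculate_similarities; the file path and query text are illustrative, and each call appends one record to the JSON list:

import json

path = save_evaluation_results(
    evaluation_file="model/evaluation_results.json",  # illustrative path
    model_name="all-MiniLM-L6-v2",
    dataset_type="anime",
    new_description="A young alchemist searches for a way to restore his brother's body.",
    top_results=top_results,
)

with open(path, "r", encoding="utf-8") as f:
    entries = json.load(f)
print(f"{len(entries)} evaluation entries logged")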