Test
Provides functionality for semantic similarity search in anime and manga datasets.
This module handles loading pre-trained models and embeddings, calculating semantic similarities between descriptions, and saving evaluation results. It supports both anime and manga datasets and uses sentence transformers for embedding generation.
Key Features
- Model and embedding loading with automatic device selection
- Batched similarity calculation using cosine similarity
- Deduplication of results based on titles
- Comprehensive evaluation result logging
- Support for multiple synopsis/description columns
The module is designed to work with pre-computed embeddings stored in numpy arrays and uses efficient tensor operations for similarity calculations.
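As a rough illustration of the batched cosine similarity described above, the sketch below scores one query vector against a matrix of pre-computed embeddings held in a NumPy array. The helper name `batched_cosine_similarity` and the `batch_size` parameter are assumptions made for this example and are not part of the module's API; the module itself may use tensor operations from a different library.

```python
import numpy as np


def batched_cosine_similarity(query, embeddings, batch_size=1024):
    """Return the cosine similarity between `query` and every row of `embeddings`.

    Hypothetical helper for illustration only; not taken from the module.
    """
    query = np.asarray(query, dtype=np.float32)
    embeddings = np.asarray(embeddings, dtype=np.float32)
    query_norm = np.linalg.norm(query)
    scores = np.empty(embeddings.shape[0], dtype=np.float32)

    # Process the embedding matrix in slices to bound peak memory use.
    for start in range(0, embeddings.shape[0], batch_size):
        batch = embeddings[start:start + batch_size]
        batch_norms = np.linalg.norm(batch, axis=1)
        # Avoid division by zero for all-zero vectors.
        denom = np.maximum(batch_norms * query_norm, 1e-12)
        scores[start:start + batch_size] = (batch @ query) / denom

    return scores


# Toy data: three 2-dimensional "embeddings" and one query vector.
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = batched_cosine_similarity([1.0, 0.0], embeddings, batch_size=2)
```

Normalizing per batch keeps memory use proportional to `batch_size` rather than to the full dataset, which is the usual reason for batching this computation.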
FUNCTION | DESCRIPTION |
---|---|
load_model_and_embeddings | Loads model, dataset and embeddings for similarity search |
calculate_similarities | Computes semantic similarities between descriptions |
save_evaluation_results | Logs evaluation results with timestamps and metadata |
calculate_similarities
calculate_similarities(model: SentenceTransformer, df: DataFrame, synopsis_columns: List[str], embeddings_save_dir: str, new_description: str, top_n: int = 10) -> List[Dict[str, Any]]
Find semantically similar titles by comparing embeddings.
Calculates cosine similarities between a new description's embedding and pre-computed embeddings from the dataset. Returns the top-N most similar titles, removing duplicates across different synopsis columns.
PARAMETER | DESCRIPTION |
---|---|
model | Model to encode the new description. TYPE: `SentenceTransformer` |
df | Dataset containing titles and synopses. TYPE: `DataFrame` |
synopsis_columns | Columns containing synopsis text. TYPE: `List[str]` |
embeddings_save_dir | Directory containing pre-computed embeddings. TYPE: `str` |
new_description | Description to find similar titles for. TYPE: `str` |
top_n | Number of similar titles to return. Defaults to 10. TYPE: `int` |
RETURNS | DESCRIPTION |
---|---|
`List[Dict[str, Any]]` | Top similar titles, one dictionary per title |

Each returned dictionary contains:
- rank: Position in results (1-based)
- title: Title of the anime/manga
- synopsis: Plot description/synopsis
- similarity: Cosine similarity score
- source_column: Column the synopsis came from
RAISES | DESCRIPTION |
---|---|
ValueError | If no valid embeddings are found in embeddings_save_dir |
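The ranking and deduplication step described for this function can be sketched as follows. This is a minimal illustration, assuming the scored candidates are available as `(title, synopsis, source_column, similarity)` tuples; the helper name `top_n_unique` and the column names in the sample data are hypothetical and are not taken from `src/test.py`.

```python
from typing import Any, Dict, List, Tuple

# (title, synopsis, source_column, similarity); an assumed intermediate format.
Candidate = Tuple[str, str, str, float]


def top_n_unique(candidates: List[Candidate], top_n: int = 10) -> List[Dict[str, Any]]:
    """Keep the best-scoring entry per title and return the top_n of them."""
    # Sort by similarity, highest first, so the first hit per title is its best.
    ranked = sorted(candidates, key=lambda c: c[3], reverse=True)

    seen_titles = set()
    results: List[Dict[str, Any]] = []
    for title, synopsis, source_column, similarity in ranked:
        if title in seen_titles:
            continue  # A better-scoring synopsis for this title was already kept.
        seen_titles.add(title)
        results.append(
            {
                "rank": len(results) + 1,  # 1-based rank
                "title": title,
                "synopsis": synopsis,
                "similarity": similarity,
                "source_column": source_column,
            }
        )
        if len(results) == top_n:
            break
    return results


# Sample data with one title appearing under two synopsis columns.
sample = [
    ("Title A", "First synopsis of A", "synopsis", 0.90),
    ("Title A", "Second synopsis of A", "synopsis_alt", 0.95),
    ("Title B", "Synopsis of B", "synopsis", 0.80),
    ("Title C", "Synopsis of C", "synopsis", 0.50),
]
top_two = top_n_unique(sample, top_n=2)
```

Sorting before deduplicating means each title is represented by its highest-scoring synopsis, whichever column that synopsis came from.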
Source code in src/test.py
load_model_and_embeddings
load_model_and_embeddings(model_name: str, dataset_type: str) -> Tuple[SentenceTransformer, DataFrame, List[str], str]
Load the model, dataset and pre-computed embeddings for similarity search.
Handles loading of the appropriate sentence transformer model, dataset and pre-computed embeddings based on the specified dataset type. Supports both anime and manga datasets with their respective synopsis columns.
PARAMETER | DESCRIPTION |
---|---|
model_name | Name of the sentence transformer model to load. 'sentence-transformers/' is prepended if not already present. TYPE: `str` |
dataset_type | Type of dataset to load ('anime' or 'manga'). Determines which dataset and embeddings to load. TYPE: `str` |
RETURNS | DESCRIPTION |
---|---|
tuple | TYPE: `Tuple[SentenceTransformer, DataFrame, List[str], str]` |
RAISES | DESCRIPTION |
---|---|
ValueError | If dataset_type is not 'anime' or 'manga' |
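The two input checks documented for this function, prefixing the model name and rejecting unknown dataset types, might look roughly like the sketch below. The helper names `normalize_model_name` and `validate_dataset_type` are assumptions for illustration; the actual function may perform these steps inline, and it also loads the model, dataset and embeddings, which this sketch leaves out.

```python
MODEL_PREFIX = "sentence-transformers/"
VALID_DATASET_TYPES = ("anime", "manga")


def normalize_model_name(model_name: str) -> str:
    """Prepend the 'sentence-transformers/' prefix unless it is already there."""
    if model_name.startswith(MODEL_PREFIX):
        return model_name
    return MODEL_PREFIX + model_name


def validate_dataset_type(dataset_type: str) -> str:
    """Return dataset_type unchanged, or raise ValueError if it is unsupported."""
    if dataset_type not in VALID_DATASET_TYPES:
        raise ValueError(
            f"dataset_type must be one of {VALID_DATASET_TYPES}, got {dataset_type!r}"
        )
    return dataset_type
```

Validating `dataset_type` before any model or file loading keeps a typo from triggering an expensive download first.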
Source code in src/test.py
save_evaluation_results
save_evaluation_results(evaluation_file: str, model_name: str, dataset_type: str, new_description: str, top_results: List[Dict[str, Any]]) -> str
Save similarity search results with metadata for evaluation.
Appends the search results and metadata to a JSON file for later analysis. Creates a new file if it doesn't exist. Each entry includes a timestamp, model information, dataset type, query description, and similarity results.
PARAMETER | DESCRIPTION |
---|---|
evaluation_file | Path to save/append results. TYPE: `str` |
model_name | Name of model used for embeddings. TYPE: `str` |
dataset_type | Type of dataset searched ('anime' or 'manga'). TYPE: `str` |
new_description | Query description used for search. TYPE: `str` |
top_results | Similarity search results. TYPE: `List[Dict[str, Any]]` |
RETURNS | DESCRIPTION |
---|---|
str | Path to the evaluation file. TYPE: `str` |
The saved JSON structure includes:
- timestamp: When the search was performed
- model_name: Model used for embeddings
- dataset_type: Type of dataset searched
- new_description: Query description
- top_similarities: List of similar titles and their scores
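One way to implement the append-or-create behavior with the fields listed above is sketched here. It assumes the evaluation file holds a single JSON list of entries and that timestamps are ISO 8601 strings; neither detail is confirmed by this page, and the name `append_evaluation_entry` is hypothetical.

```python
import json
import os
from datetime import datetime
from typing import Any, Dict, List


def append_evaluation_entry(
    evaluation_file: str,
    model_name: str,
    dataset_type: str,
    new_description: str,
    top_results: List[Dict[str, Any]],
) -> str:
    """Append one search record to a JSON list on disk and return the file path."""
    entry = {
        "timestamp": datetime.now().isoformat(),  # assumed timestamp format
        "model_name": model_name,
        "dataset_type": dataset_type,
        "new_description": new_description,
        "top_similarities": top_results,
    }

    # Load existing entries if the file is already there, else start a new list.
    if os.path.exists(evaluation_file):
        with open(evaluation_file, "r", encoding="utf-8") as f:
            entries = json.load(f)
    else:
        entries = []

    entries.append(entry)
    with open(evaluation_file, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=2, ensure_ascii=False)
    return evaluation_file
```

Rewriting the whole list on each call keeps the file valid JSON at all times, at the cost of reading it back in full; for very large logs a line-delimited format would be a possible alternative.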