Models API¶
This page documents the search model components of AniSearch Model.
Overview¶
The models package contains the classes responsible for loading datasets, initializing cross-encoder models, and performing semantic search operations.
The package is structured as follows:
BaseSearchModel
: Abstract base class providing common functionality

AnimeSearchModel
: Implementation for anime search

MangaSearchModel
: Implementation for manga search (with optional light novel support)
Model Search Workflow¶
The search process works by comparing a user query against all entries in the dataset:
```mermaid
flowchart TD
    A[User Query] --> B[Search Model]
    B --> C[Generate Query Variations]
    C --> D[Batch Process Queries]
    E[(Dataset)] --> D
    D --> F[Calculate Relevance Scores]
    F --> G[Sort Results]
    G --> H[Return Top-K Results]

    style A fill:#e1f5fe,stroke:#0288d1
    style E fill:#fff3e0,stroke:#ff9800
    style H fill:#e8f5e9,stroke:#4caf50
```
This process ensures efficient and accurate retrieval of relevant content based on semantic understanding rather than simple keyword matching.
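The flowchart above can be sketched end to end in plain Python. The scorer below is a toy word-overlap stand-in for the cross-encoder, and names like `run_search_pipeline` and `score_pair` are illustrative, not part of the package:

```python
from typing import Callable, Dict, List

def run_search_pipeline(
    query: str,
    dataset: List[Dict[str, str]],
    score_pair: Callable[[str, str], float],
    top_k: int = 5,
) -> List[Dict[str, object]]:
    """Mirror the flowchart: score the query against every synopsis,
    sort by relevance, and return the top-k entries."""
    scored = []
    for entry in dataset:
        score = score_pair(query, entry["synopsis"])
        scored.append({"title": entry["title"], "score": score})
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]

# Toy scorer: word overlap as a stand-in for cross-encoder relevance.
def score_pair(query: str, synopsis: str) -> float:
    q, s = set(query.lower().split()), set(synopsis.lower().split())
    return len(q & s) / max(len(q), 1)

dataset = [
    {"title": "A", "synopsis": "time travel changes history"},
    {"title": "B", "synopsis": "a cooking contest in space"},
]
results = run_search_pipeline("time travel story", dataset, score_pair, top_k=1)
```

A real deployment swaps the toy scorer for batched cross-encoder inference; the surrounding pipeline shape stays the same.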
BaseSearchModel¶
The foundation class with core functionality:
src.models.base_search_model.BaseSearchModel ¶
BaseSearchModel(dataset_path: str, id_column: str, model_name: str = MODEL_NAME, device: Optional[str] = None, dataset_type: str = 'base')
Base class for cross-encoder powered semantic search models.
This class provides the foundation for building specialized search models that use cross-encoder architectures to compute semantic similarity between user queries and content descriptions (synopses). It handles the common functionality such as dataset loading, model initialization, and search computation.
The class is designed to be extended by specialized search models for different content types (e.g., anime, manga) that can implement additional domain-specific functionality while reusing the core search capabilities.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| model_name | Name or path of the cross-encoder model being used |
| dataset_path | Path to the dataset file |
| id_col | Name of the ID column in the dataset |
| dataset_type | Type of dataset ("anime" or "manga") |
| device | Device being used for computation ('cpu', 'cuda', etc.) |
| model | The loaded cross-encoder model |
| df | The loaded and preprocessed dataset |
| synopsis_cols | List of column names containing synopsis text |
| normalize_scores | Whether model scores need normalization |
This constructor sets up the search model by:
- Initializing configuration parameters
- Detecting or setting the compute device (CPU/CUDA)
- Loading the cross-encoder model
- Loading and preprocessing the dataset
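Device auto-detection (the second step above) can be sketched as follows; this is an illustrative helper, not the class's actual implementation:

```python
from typing import Optional

def select_device(device: Optional[str] = None) -> str:
    """Honour an explicit device request; otherwise prefer CUDA when
    torch reports it as available, falling back to CPU."""
    if device is not None:
        return device
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"
```

Passing `device=None` to the constructor triggers this kind of auto-selection; an explicit string such as `"cuda:0"` is used as-is.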
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| dataset_path | str | Path to the dataset CSV file containing entries to search. The file should contain at minimum ID, title, and synopsis columns. |
| id_column | str | Name of the column containing unique identifiers in the dataset. This will be used to reference specific entries in search results. |
| model_name | str | Name or path of the cross-encoder model to use. Can be a Hugging Face model name or local path to a fine-tuned model. Defaults to the value specified in constants.MODEL_NAME. |
| device | Optional[str] | Device to run the model on ('cpu', 'cuda', 'cuda:0', etc.). If None, automatically selects the best available device. |
| dataset_type | str | Type of dataset being loaded, used for logging. Common values are "anime" or "manga". |
| RAISES | DESCRIPTION |
|---|---|
| FileNotFoundError | If the dataset file cannot be found |
| ValueError | If the model_name is invalid or the model cannot be loaded |
Example

```python
# Create a basic search model with default settings
basic_search = BaseSearchModel(
    dataset_path="data/merged_anime_dataset.csv",
    id_column="anime_id",
    dataset_type="anime",
)

# Create a search model with custom model and device
custom_search = BaseSearchModel(
    dataset_path="data/merged_manga_dataset.csv",
    id_column="manga_id",
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
    device="cuda:0",
    dataset_type="manga",
)
```
Source code in src/models/base_search_model.py
list_available_models staticmethod ¶
list_available_models() -> Mapping[str, Dict[str, str]]
List available pre-trained cross-encoder models categorized by type.
This static method returns a dictionary of model categories and their
corresponding model recommendations that can be used with the search system.
These models are defined in the ALTERNATIVE_MODELS constant.
| RETURNS | DESCRIPTION |
|---|---|
| Mapping[str, Dict[str, str]] | A dictionary where keys are model categories (e.g., "Semantic Search", "Question Answering") and values are dictionaries mapping model names to descriptions |
Example

```python
# Get a dictionary of available models by category
available_models = BaseSearchModel.list_available_models()

# Print model categories and models
for category, models in available_models.items():
    print(f"\n{category}:")
    for model_name, description in models.items():
        print(f"  - {model_name}: {description}")
```
Source code in src/models/base_search_model.py
list_fine_tuned_models staticmethod ¶
list_fine_tuned_models() -> Dict[str, str]
List locally available fine-tuned models that can be used for search.
This static method scans the fine-tuned model directory to find models that have been fine-tuned specifically for anime/manga search. It identifies valid models by checking for the presence of a config.json file.
| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, str] | A dictionary mapping model directory names (keys) to full paths of the model directories (values) |
Notes
- Searches in the "model/fine-tuned" directory by default
- Only directories containing a config.json file are included
- Returns an empty dictionary if no fine-tuned models are found
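The scanning behaviour described in the notes can be sketched as follows. This is an illustrative reimplementation, assuming the `model/fine-tuned` layout stated above, not the package's actual code:

```python
import os
from typing import Dict

def list_fine_tuned_models(base_dir: str = "model/fine-tuned") -> Dict[str, str]:
    """Return {directory name: full path} for every subdirectory that
    contains a config.json file; empty dict if none are found."""
    models: Dict[str, str] = {}
    if not os.path.isdir(base_dir):
        return models
    for name in sorted(os.listdir(base_dir)):
        path = os.path.join(base_dir, name)
        if os.path.isdir(path) and os.path.isfile(os.path.join(path, "config.json")):
            models[name] = path
    return models
```

Directories without a `config.json` are silently skipped, which matches the "valid models" check in the notes.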
Source code in src/models/base_search_model.py
search ¶
search(query: str, num_results: int = NUM_RESULTS, batch_size: int = DEFAULT_BATCH_SIZE) -> List[Dict[str, Any]]
Search for entries matching the provided description or query.
This method performs semantic search across the dataset by computing similarity scores between the user query and all synopses in the dataset. It returns the top matches sorted by relevance score.
The search process includes:
- Creating sentence pairs between the query and all synopses
- Computing relevance scores using the cross-encoder model in batches
- Sorting results by score and returning the top matches
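The batching step above can be sketched as follows. The stub `predict` stands in for a cross-encoder's batched scoring call, and all names here are illustrative rather than taken from the package:

```python
from typing import Callable, List, Sequence, Tuple

def batched_scores(
    pairs: Sequence[Tuple[str, str]],
    predict: Callable[[Sequence[Tuple[str, str]]], List[float]],
    batch_size: int = 32,
) -> List[float]:
    """Score (query, synopsis) pairs in fixed-size batches, mirroring how
    chunked model calls bound memory use on large datasets."""
    scores: List[float] = []
    for start in range(0, len(pairs), batch_size):
        scores.extend(predict(pairs[start:start + batch_size]))
    return scores

query = "time travel"
synopses = ["time travel epic", "cooking in space", "travel diary"]
pairs = [(query, s) for s in synopses]

# Stub predict: word overlap; a real model would return cross-encoder scores.
def predict(batch):
    return [float(len(set(q.split()) & set(s.split()))) for q, s in batch]

scores = batched_scores(pairs, predict, batch_size=2)
top = sorted(zip(synopses, scores), key=lambda t: t[1], reverse=True)
```

Sorting the scored pairs and slicing the first `num_results` entries completes the search flow.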
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| query | str | The search query or description to match against synopses. This should be a descriptive text that captures the content the user is looking for. |
| num_results | int | Number of top matches to return, sorted by relevance score. Defaults to the value specified in constants.NUM_RESULTS. |
| batch_size | int | Number of sentence pairs to process at once with the model. Using batches helps manage memory usage with large datasets. Defaults to the value specified in constants.DEFAULT_BATCH_SIZE. |
| RETURNS | DESCRIPTION |
|---|---|
| List[Dict[str, Any]] | A list of dictionaries sorted by score in descending order, each containing: `id` (the entry ID from the id_column specified during initialization), `title` (the entry title), `score` (the relevance score; higher is better), and `synopsis` (a preview of the entry synopsis, truncated to 500 chars) |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If the query is empty or consists only of whitespace |
Example

```python
# Initialize a search model
search_model = BaseSearchModel(
    dataset_path="data/merged_anime_dataset.csv",
    id_column="anime_id",
)

# Search for content about time travel
results = search_model.search(
    query="A story where characters travel through time and change history",
    num_results=5,
    batch_size=64,
)

# Process the top results
for result in results:
    print(f"{result['title']} (Score: {result['score']:.2f})")
    print(f"Synopsis: {result['synopsis'][:100]}...")
```
Source code in src/models/base_search_model.py
AnimeSearchModel¶
Specialized model for anime search:
src.models.anime_search_model.AnimeSearchModel ¶
AnimeSearchModel(model_name: str = MODEL_NAME, device: Optional[str] = None)
Bases: BaseSearchModel
A specialized search model for finding anime based on textual descriptions.
This class extends BaseSearchModel to provide anime-specific search functionality. It loads a comprehensive dataset of anime entries and uses a cross-encoder model to compute semantic similarity between user queries and anime synopses, returning the most relevant matches.
The model uses the merged anime dataset to provide search capabilities across a wide range of anime titles with rich metadata and synopses information.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| df | The loaded anime dataset |
| id_col | Column name for the anime ID in the dataset |
| model | The cross-encoder model used for scoring |
| device | The device being used ('cpu', 'cuda', etc.) |
This constructor sets up the anime search model by loading the anime dataset and initializing the cross-encoder model.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| model_name | str | Name or path of the cross-encoder model to use. Defaults to the value specified in constants.MODEL_NAME. |
| device | Optional[str] | Device to run the model on ('cpu', 'cuda', 'cuda:0', etc.). If None, automatically selects the best available device. |
| RAISES | DESCRIPTION |
|---|---|
| FileNotFoundError | If the anime dataset cannot be found |
| ValueError | If the model_name is invalid or the model cannot be loaded |
Source code in src/models/anime_search_model.py
MangaSearchModel¶
Specialized model for manga search:
src.models.manga_search_model.MangaSearchModel ¶
MangaSearchModel(model_name: str = MODEL_NAME, device: Optional[str] = None, include_light_novels: bool = False)
Bases: BaseSearchModel
A specialized search model for finding manga based on textual descriptions.
This class extends BaseSearchModel to provide manga-specific search functionality. It loads a comprehensive dataset of manga entries and uses a cross-encoder model to compute semantic similarity between user queries and manga synopses, returning the most relevant matches.
The model provides additional functionality over the base class:
- Optional filtering of light novels
- Customized search parameters for manga content
- Batch processing for efficient memory usage
- Progress tracking during search operations
| ATTRIBUTE | DESCRIPTION |
|---|---|
| include_light_novels | Whether to include light novels in search results |
| df | The loaded manga dataset |
| id_col | Column name for the manga ID in the dataset |
| model | The cross-encoder model used for scoring |
| device | The device being used ('cpu', 'cuda', etc.) |
This constructor sets up the manga search model by loading the manga dataset and initializing the cross-encoder model. It also configures whether light novels should be included in search results.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| model_name | str | Name or path of the cross-encoder model to use. Defaults to the value specified in constants.MODEL_NAME. |
| device | Optional[str] | Device to run the model on ('cpu', 'cuda', 'cuda:0', etc.). If None, automatically selects the best available device. |
| include_light_novels | bool | Whether to include light novels in search results. When False, entries with type 'light_novel' will be filtered out. Defaults to False. |
| RAISES | DESCRIPTION |
|---|---|
| FileNotFoundError | If the manga dataset cannot be found |
| ValueError | If the model_name is invalid or the model cannot be loaded |
Source code in src/models/manga_search_model.py
search ¶
search(query: str, num_results: int = NUM_RESULTS, batch_size: int = DEFAULT_BATCH_SIZE) -> List[Dict[str, Any]]
Search for manga entries matching the provided description or query.
This method computes semantic similarity scores between the provided query and all manga synopses in the dataset (after optional filtering), returning the top matches sorted by relevance.
The search process includes:
- Optional filtering of the dataset (e.g., removing light novels)
- Creating sentence pairs between the query and all manga synopses
- Computing relevance scores using the cross-encoder model in batches
- Sorting results by score and returning the top matches
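The optional filtering step can be sketched with pandas. The `'light_novel'` type value follows the parameter description for `include_light_novels`; the column name `type` is an assumption about the dataset layout:

```python
import pandas as pd

def filter_light_novels(df: pd.DataFrame, include_light_novels: bool) -> pd.DataFrame:
    """Drop rows whose type is 'light_novel' unless they were requested."""
    if include_light_novels:
        return df
    return df[df["type"] != "light_novel"].reset_index(drop=True)

df = pd.DataFrame({
    "manga_id": [1, 2, 3],
    "type": ["manga", "light_novel", "manga"],
})
filtered = filter_light_novels(df, include_light_novels=False)
```

Filtering before scoring keeps excluded entries out of the sentence pairs entirely, so they cost no model inference time.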
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| query | str | The search query or description to match against manga synopses. This should be a descriptive text that captures the manga content the user is looking for. |
| num_results | int | Number of top matches to return, sorted by relevance score. Defaults to the value specified in constants.NUM_RESULTS. |
| batch_size | int | Number of sentence pairs to process at once with the model. Using batches helps manage memory usage with large datasets. Defaults to the value specified in constants.DEFAULT_BATCH_SIZE. |
| RETURNS | DESCRIPTION |
|---|---|
| List[Dict[str, Any]] | A list of dictionaries sorted by score in descending order, each containing: `id` (int, the manga ID), `title` (str, the manga title), `score` (float, the relevance score; higher is better), and `synopsis` (str, a preview of the manga synopsis, truncated to 500 chars) |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If the query is empty or consists only of whitespace |
Source code in src/models/manga_search_model.py