Models API¶
This page documents the search model components of AniSearch Model.
Overview¶
The models package contains the classes responsible for loading datasets, initializing cross-encoder models, and performing semantic search operations.
The package is structured as follows:
BaseSearchModel
: Abstract base class providing common functionality

AnimeSearchModel
: Implementation for anime search

MangaSearchModel
: Implementation for manga search (with optional light novel support)
Model Search Workflow¶
The search process works by comparing a user query against all entries in the dataset:
```mermaid
flowchart TD
    A[User Query] --> B[Search Model]
    B --> C[Generate Query Variations]
    C --> D[Batch Process Queries]
    E[(Dataset)] --> D
    D --> F[Calculate Relevance Scores]
    F --> G[Sort Results]
    G --> H[Return Top-K Results]

    style A fill:#e1f5fe,stroke:#0288d1
    style E fill:#fff3e0,stroke:#ff9800
    style H fill:#e8f5e9,stroke:#4caf50
```
This process ensures efficient and accurate retrieval of relevant content based on semantic understanding rather than simple keyword matching.
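The flowchart above can be sketched end to end in plain Python. The scorer below is a toy word-overlap stand-in for the cross-encoder, and names like `run_search_pipeline` and `score_pair` are illustrative, not part of the package:

```python
from typing import Callable, Dict, List

def run_search_pipeline(
    query: str,
    dataset: List[Dict[str, str]],
    score_pair: Callable[[str, str], float],
    top_k: int = 5,
) -> List[Dict[str, object]]:
    """Mirror the flowchart: score the query against every synopsis,
    sort by relevance, and return the top-k entries."""
    scored = []
    for entry in dataset:
        score = score_pair(query, entry["synopsis"])
        scored.append({"title": entry["title"], "score": score})
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]

# Toy scorer: word overlap as a stand-in for cross-encoder relevance.
def score_pair(query: str, synopsis: str) -> float:
    q, s = set(query.lower().split()), set(synopsis.lower().split())
    return len(q & s) / max(len(q), 1)

dataset = [
    {"title": "A", "synopsis": "time travel changes history"},
    {"title": "B", "synopsis": "a cooking contest in space"},
]
results = run_search_pipeline("time travel story", dataset, score_pair, top_k=1)
```

A real deployment swaps the toy scorer for batched cross-encoder inference; the surrounding pipeline shape stays the same.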
BaseSearchModel¶
The foundation class with core functionality:
src.models.base_search_model.BaseSearchModel ¶
BaseSearchModel(dataset_path: str, id_column: str, model_name: str = MODEL_NAME, device: Optional[str] = None, dataset_type: str = 'base')
Base class for cross-encoder powered semantic search models.
This class provides the foundation for building specialized search models that use cross-encoder architectures to compute semantic similarity between user queries and content descriptions (synopses). It handles the common functionality such as dataset loading, model initialization, and search computation.
The class is designed to be extended by specialized search models for different content types (e.g., anime, manga) that can implement additional domain-specific functionality while reusing the core search capabilities.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| model_name | Name or path of the cross-encoder model being used |
| dataset_path | Path to the dataset file |
| id_col | Name of the ID column in the dataset |
| dataset_type | Type of dataset ("anime" or "manga") |
| device | Device being used for computation ('cpu', 'cuda', etc.) |
| model | The loaded cross-encoder model |
| df | The loaded and preprocessed dataset |
| synopsis_cols | List of column names containing synopsis text |
| normalize_scores | Whether model scores need normalization |
This constructor sets up the search model by:
- Initializing configuration parameters
- Detecting or setting the compute device (CPU/CUDA)
- Loading the cross-encoder model
- Loading and preprocessing the dataset
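Device auto-detection (the second step above) can be sketched as follows; this is an illustrative helper, not the class's actual implementation:

```python
from typing import Optional

def select_device(device: Optional[str] = None) -> str:
    """Honour an explicit device request; otherwise prefer CUDA when
    torch reports it as available, falling back to CPU."""
    if device is not None:
        return device
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"
```

Passing `device=None` to the constructor triggers this kind of auto-selection; an explicit string such as `"cuda:0"` is used as-is.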
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| dataset_path | str | Path to the dataset CSV file containing entries to search. The file should contain at minimum ID, title, and synopsis columns. |
| id_column | str | Name of the column containing unique identifiers in the dataset. This will be used to reference specific entries in search results. |
| model_name | str | Name or path of the cross-encoder model to use. Can be a Hugging Face model name or local path to a fine-tuned model. Defaults to the value specified in constants.MODEL_NAME. |
| device | Optional[str] | Device to run the model on ('cpu', 'cuda', 'cuda:0', etc.). If None, automatically selects the best available device. |
| dataset_type | str | Type of dataset being loaded, used for logging. Common values are "anime" or "manga". |
| RAISES | DESCRIPTION |
|---|---|
| FileNotFoundError | If the dataset file cannot be found |
| ValueError | If the model_name is invalid or the model cannot be loaded |
Example

```python
# Create a basic search model with default settings
basic_search = BaseSearchModel(
    dataset_path="data/merged_anime_dataset.csv",
    id_column="anime_id",
    dataset_type="anime",
)

# Create a search model with custom model and device
custom_search = BaseSearchModel(
    dataset_path="data/merged_manga_dataset.csv",
    id_column="manga_id",
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
    device="cuda:0",
    dataset_type="manga",
)
```
Source code in src/models/base_search_model.py
list_available_models staticmethod ¶
list_available_models() -> Mapping[str, Dict[str, str]]
List available pre-trained cross-encoder models categorized by type.
This static method returns a dictionary of model categories and their
corresponding model recommendations that can be used with the search system.
These models are defined in the ALTERNATIVE_MODELS constant.
| RETURNS | DESCRIPTION |
|---|---|
| Mapping[str, Dict[str, str]] | A dictionary where keys are model categories (e.g., "Semantic Search", "Question Answering") and values are dictionaries mapping model names to descriptions |
Example

```python
# Get a dictionary of available models by category
available_models = BaseSearchModel.list_available_models()

# Print model categories and models
for category, models in available_models.items():
    print(f"\n{category}:")
    for model_name, description in models.items():
        print(f"  - {model_name}: {description}")
```
Source code in src/models/base_search_model.py
list_fine_tuned_models staticmethod ¶
list_fine_tuned_models() -> Dict[str, str]
List locally available fine-tuned models that can be used for search.
This static method scans the fine-tuned model directory to find models that have been fine-tuned specifically for anime/manga search. It identifies valid models by checking for the presence of a config.json file.
| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, str] | A dictionary mapping model directory names (keys) to full paths of the model directories (values) |
Notes
- Searches in the "model/fine-tuned" directory by default
- Only directories containing a config.json file are included
- Returns an empty dictionary if no fine-tuned models are found
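The scanning behaviour described in the notes can be sketched as follows. This is an illustrative reimplementation, assuming the `model/fine-tuned` layout stated above, not the package's actual code:

```python
import os
from typing import Dict

def list_fine_tuned_models(base_dir: str = "model/fine-tuned") -> Dict[str, str]:
    """Return {directory name: full path} for every subdirectory that
    contains a config.json file; empty dict if none are found."""
    models: Dict[str, str] = {}
    if not os.path.isdir(base_dir):
        return models
    for name in sorted(os.listdir(base_dir)):
        path = os.path.join(base_dir, name)
        if os.path.isdir(path) and os.path.isfile(os.path.join(path, "config.json")):
            models[name] = path
    return models
```

Directories without a `config.json` are silently skipped, which matches the "valid models" check in the notes.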
Source code in src/models/base_search_model.py
search ¶
search(query: str, num_results: int = NUM_RESULTS, batch_size: int = DEFAULT_BATCH_SIZE) -> List[Dict[str, Any]]
Search for entries matching the provided description or query.
This method performs semantic search across the dataset by computing similarity scores between the user query and all synopses in the dataset. It returns the top matches sorted by relevance score.
The search process includes:
- Creating sentence pairs between the query and all synopses
- Computing relevance scores using the cross-encoder model in batches
- Sorting results by score and returning the top matches
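The batching step above can be sketched as follows. The stub `predict` stands in for a cross-encoder's batched scoring call, and all names here are illustrative rather than taken from the package:

```python
from typing import Callable, List, Sequence, Tuple

def batched_scores(
    pairs: Sequence[Tuple[str, str]],
    predict: Callable[[Sequence[Tuple[str, str]]], List[float]],
    batch_size: int = 32,
) -> List[float]:
    """Score (query, synopsis) pairs in fixed-size batches, mirroring how
    chunked model calls bound memory use on large datasets."""
    scores: List[float] = []
    for start in range(0, len(pairs), batch_size):
        scores.extend(predict(pairs[start:start + batch_size]))
    return scores

query = "time travel"
synopses = ["time travel epic", "cooking in space", "travel diary"]
pairs = [(query, s) for s in synopses]

# Stub predict: word overlap; a real model would return cross-encoder scores.
def predict(batch):
    return [float(len(set(q.split()) & set(s.split()))) for q, s in batch]

scores = batched_scores(pairs, predict, batch_size=2)
top = sorted(zip(synopses, scores), key=lambda t: t[1], reverse=True)
```

Sorting the scored pairs and slicing the first `num_results` entries completes the search flow.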
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| query | str | The search query or description to match against synopses. This should be a descriptive text that captures the content the user is looking for. |
| num_results | int | Number of top matches to return, sorted by relevance score. Defaults to the value specified in constants.NUM_RESULTS. |
| batch_size | int | Number of sentence pairs to process at once with the model. Using batches helps manage memory usage with large datasets. Defaults to the value specified in constants.DEFAULT_BATCH_SIZE. |
| RETURNS | DESCRIPTION |
|---|---|
| List[Dict[str, Any]] | A list of dictionaries sorted by score in descending order, each containing: `id` (the entry ID from the id_column specified during initialization), `title` (the entry title), `score` (the relevance score; higher is better), and `synopsis` (a preview of the entry synopsis, truncated to 500 chars) |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If the query is empty or consists only of whitespace |
Example

```python
# Initialize a search model
search_model = BaseSearchModel(
    dataset_path="data/merged_anime_dataset.csv",
    id_column="anime_id",
)

# Search for content about time travel
results = search_model.search(
    query="A story where characters travel through time and change history",
    num_results=5,
    batch_size=64,
)

# Process the top results
for result in results:
    print(f"{result['title']} (Score: {result['score']:.2f})")
    print(f"Synopsis: {result['synopsis'][:100]}...")
```
Source code in src/models/base_search_model.py
AnimeSearchModel¶
Specialized model for anime search:
src.models.anime_search_model.AnimeSearchModel ¶
AnimeSearchModel(model_name: str = MODEL_NAME, device: Optional[str] = None)
Bases: BaseSearchModel
A specialized search model for finding anime based on textual descriptions.
This class extends BaseSearchModel to provide anime-specific search functionality. It loads a comprehensive dataset of anime entries and uses a cross-encoder model to compute semantic similarity between user queries and anime synopses, returning the most relevant matches.
The model uses the merged anime dataset to provide search capabilities across a wide range of anime titles with rich metadata and synopses information.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| df | The loaded anime dataset |
| id_col | Column name for the anime ID in the dataset |
| model | The cross-encoder model used for scoring |
| device | The device being used ('cpu', 'cuda', etc.) |
This constructor sets up the anime search model by loading the anime dataset and initializing the cross-encoder model.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| model_name | str | Name or path of the cross-encoder model to use. Defaults to the value specified in constants.MODEL_NAME. |
| device | Optional[str] | Device to run the model on ('cpu', 'cuda', 'cuda:0', etc.). If None, automatically selects the best available device. |
| RAISES | DESCRIPTION |
|---|---|
| FileNotFoundError | If the anime dataset cannot be found |
| ValueError | If the model_name is invalid or the model cannot be loaded |
Source code in src/models/anime_search_model.py
MangaSearchModel¶
Specialized model for manga search:
src.models.manga_search_model.MangaSearchModel ¶
MangaSearchModel(model_name: str = MODEL_NAME, device: Optional[str] = None, include_light_novels: bool = False)
Bases: BaseSearchModel
A specialized search model for finding manga based on textual descriptions.
This class extends BaseSearchModel to provide manga-specific search functionality. It loads a comprehensive dataset of manga entries and uses a cross-encoder model to compute semantic similarity between user queries and manga synopses, returning the most relevant matches.
The model provides additional functionality over the base class:
- Optional filtering of light novels
- Customized search parameters for manga content
- Batch processing for efficient memory usage
- Progress tracking during search operations
| ATTRIBUTE | DESCRIPTION |
|---|---|
| include_light_novels | Whether to include light novels in search results |
| df | The loaded manga dataset |
| id_col | Column name for the manga ID in the dataset |
| model | The cross-encoder model used for scoring |
| device | The device being used ('cpu', 'cuda', etc.) |
This constructor sets up the manga search model by loading the manga dataset and initializing the cross-encoder model. It also configures whether light novels should be included in search results.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| model_name | str | Name or path of the cross-encoder model to use. Defaults to the value specified in constants.MODEL_NAME. |
| device | Optional[str] | Device to run the model on ('cpu', 'cuda', 'cuda:0', etc.). If None, automatically selects the best available device. |
| include_light_novels | bool | Whether to include light novels in search results. When False, entries with type 'light_novel' will be filtered out. Defaults to False. |
| RAISES | DESCRIPTION |
|---|---|
| FileNotFoundError | If the manga dataset cannot be found |
| ValueError | If the model_name is invalid or the model cannot be loaded |
Source code in src/models/manga_search_model.py
search ¶
search(query: str, num_results: int = NUM_RESULTS, batch_size: int = DEFAULT_BATCH_SIZE) -> List[Dict[str, Any]]
Search for manga entries matching the provided description or query.
This method computes semantic similarity scores between the provided query and all manga synopses in the dataset (after optional filtering), returning the top matches sorted by relevance.
The search process includes:
- Optional filtering of the dataset (e.g., removing light novels)
- Creating sentence pairs between the query and all manga synopses
- Computing relevance scores using the cross-encoder model in batches
- Sorting results by score and returning the top matches
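The optional filtering step can be sketched with pandas. The `'light_novel'` type value follows the parameter description for `include_light_novels`; the column name `type` is an assumption about the dataset layout:

```python
import pandas as pd

def filter_light_novels(df: pd.DataFrame, include_light_novels: bool) -> pd.DataFrame:
    """Drop rows whose type is 'light_novel' unless they were requested."""
    if include_light_novels:
        return df
    return df[df["type"] != "light_novel"].reset_index(drop=True)

df = pd.DataFrame({
    "manga_id": [1, 2, 3],
    "type": ["manga", "light_novel", "manga"],
})
filtered = filter_light_novels(df, include_light_novels=False)
```

Filtering before scoring keeps excluded entries out of the sentence pairs entirely, so they cost no model inference time.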
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| query | str | The search query or description to match against manga synopses. This should be a descriptive text that captures the manga content the user is looking for. |
| num_results | int | Number of top matches to return, sorted by relevance score. Defaults to the value specified in constants.NUM_RESULTS. |
| batch_size | int | Number of sentence pairs to process at once with the model. Using batches helps manage memory usage with large datasets. Defaults to the value specified in constants.DEFAULT_BATCH_SIZE. |
| RETURNS | DESCRIPTION |
|---|---|
| List[Dict[str, Any]] | A list of dictionaries sorted by score in descending order, each containing: `id` (int, the manga ID), `title` (str, the manga title), `score` (float, the relevance score; higher is better), and `synopsis` (str, a preview of the manga synopsis, truncated to 500 chars) |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If the query is empty or consists only of whitespace |
Source code in src/models/manga_search_model.py