Datasets

AniSearch Model draws on multiple anime and manga datasets to provide broad search coverage. This page lists those datasets and describes how they are cleaned and merged.

Anime Datasets

The system combines multiple anime datasets to ensure broad coverage of titles:

MyAnimeList Dataset (Anime.csv)
- Source: Kaggle
- Contents: ~17,500 anime entries with ratings, genres, and synopses
Anime Dataset 2023 (anime-dataset-2023.csv)
- Source: Kaggle
- Contents: Updated anime entries with recent titles
Anime Database 2022 (Anime-2022.csv)
- Source: Kaggle
- Contents: ~15,000 anime entries with detailed metadata
Anime Dataset (animes.csv)
- Source: Kaggle
- Contents: Alternative set of anime entries
Anime DataSet (anime4500.csv)
- Source: Kaggle
- Contents: ~4,500 popular anime titles
Anime Data (Anime_data.csv)
- Source: Kaggle
- Contents: Detailed anime information with extended descriptions
Anime2 (anime2.csv)
- Source: Kaggle
- Contents: Additional anime entries
MAL Anime (mal_anime.csv)
- Source: Kaggle
- Contents: Comprehensive MyAnimeList data
Anime 270
- Source: Hugging Face
- Contents: Curated set of 270 anime entries
Wykonos Anime
- Source: Hugging Face
- Contents: Specialized anime dataset with detailed tags

Manga Datasets

For manga search functionality, the following datasets are used:

MyAnimeList Manga Dataset (Manga.csv)
- Source: Kaggle
- Contents: ~14,000 manga entries with ratings and synopses
MyAnimeList Jikan Database (jikan.csv)
- Source: Kaggle
- Contents: Data extracted from MyAnimeList via the Jikan API
Manga, Manhwa and Manhua Dataset (data.csv)
- Source: Kaggle
- Contents: Diverse collection of Japanese manga, Korean manhwa, and Chinese manhua

Dataset Processing

The `merge_datasets.py` script handles dataset preparation:

Cleaning:
- Removes duplicate entries
- Standardizes text fields
- Filters entries without synopses
Merging:
- Combines datasets based on unique identifiers
- Resolves conflicting information
- Creates unified CSV files
Output:
- `model/merged_anime_dataset.csv`: Combined anime dataset
- `model/merged_manga_dataset.csv`: Combined manga dataset
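
The cleaning and merging steps can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual code in `merge_datasets.py`: the function names `clean_text` and `merge_records`, the "first non-empty value wins" conflict rule, and the sample rows are all assumptions made for the example.

```python
import re


def clean_text(value):
    """Collapse runs of whitespace and strip the ends; None becomes ''."""
    return re.sub(r"\s+", " ", value or "").strip()


def merge_records(*sources):
    """Merge lists of row dicts keyed by 'id'.

    Assumed conflict rule (hypothetical): the first non-empty value
    seen for a field is kept; later sources only fill in gaps.
    """
    merged = {}
    for rows in sources:
        for row in rows:
            # Rows sharing an id collapse into one record (deduplication).
            record = merged.setdefault(row["id"], {})
            for field, value in row.items():
                if isinstance(value, str):
                    value = clean_text(value)  # standardize text fields
                if value not in ("", None) and record.get(field) in ("", None):
                    record[field] = value
    # Filter out entries that still have no synopsis after merging.
    return [r for r in merged.values() if r.get("synopsis")]


# Illustrative rows only; they are not taken from the real datasets.
source_a = [
    {"id": "1", "title": "Example  Anime ", "synopsis": ""},
    {"id": "2", "title": "No Synopsis Here", "synopsis": ""},
]
source_b = [
    {"id": "1", "title": "Example Anime (TV)", "synopsis": "A  placeholder synopsis."},
]

merged = merge_records(source_a, source_b)
```

In this sketch, entry `1` keeps the title from the first source and takes its synopsis from the second, while entry `2` is dropped because no source supplies a synopsis. A real pipeline would then write the result to the output CSV files listed above.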

Dataset Structure

The final merged datasets contain these key fields:

| Field | Description |
|-------|-------------|
| `id` | Unique identifier (typically MyAnimeList ID) |
| `title` | Primary title of the anime/manga |
| `title_english` | English title (if different) |
| `synopsis` | Plot summary/description |
| `genres` | List of genres |
| `type` | Media type (TV, Movie, OVA, Manga, etc.) |
| `score` | Average user rating |
| `popularity` | Popularity ranking |
| `episodes`/`chapters` | Number of episodes/chapters |
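
As a rough illustration of working with these fields, the sketch below parses one row into typed Python values. The sample row is invented for the example, and two details are assumptions rather than documented behavior: that `genres` is stored as a comma-separated string, and that missing numeric values appear as empty cells. `parse_row` is a hypothetical helper, not part of the project.

```python
import csv
import io

# Made-up sample in the shape of the merged anime CSV (values are placeholders).
SAMPLE = '''id,title,title_english,synopsis,genres,type,score,popularity,episodes
1,Example Anime,,A placeholder synopsis.,"Action, Sci-Fi",TV,8.5,10,12
'''


def parse_row(row):
    """Convert one CSV row (all strings) into typed Python values."""
    return {
        "id": int(row["id"]),
        "title": row["title"],
        # An empty English title is treated as "same as the primary title".
        "title_english": row["title_english"] or None,
        "synopsis": row["synopsis"],
        # Assumption: genres are serialized as a comma-separated string.
        "genres": [g.strip() for g in row["genres"].split(",") if g.strip()],
        "type": row["type"],
        "score": float(row["score"]) if row["score"] else None,
        "popularity": int(row["popularity"]) if row["popularity"] else None,
        "episodes": int(row["episodes"]) if row["episodes"] else None,
    }


records = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
```

For the manga file, the same approach would apply with `chapters` in place of `episodes`.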

Light Novels

For manga searches, you can optionally include light novels:

python src/main.py search --type manga --query "Fantasy world" --include-light-novels

By default, manga searches filter out entries whose type is "Light Novel". The `--include-light-novels` flag keeps those entries in the search results.
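
The default filtering behavior can be pictured with a small sketch. `filter_by_type` is a hypothetical helper written for illustration; how `src/main.py` actually applies the flag is not shown on this page, and the entries below are placeholders.

```python
def filter_by_type(entries, include_light_novels=False):
    """Return entries, dropping type 'Light Novel' unless explicitly included."""
    if include_light_novels:
        return list(entries)
    return [e for e in entries if e.get("type") != "Light Novel"]


# Placeholder entries, not real dataset rows.
entries = [
    {"id": 1, "title": "Example Manga", "type": "Manga"},
    {"id": 2, "title": "Example Novel", "type": "Light Novel"},
]

default_results = filter_by_type(entries)
with_novels = filter_by_type(entries, include_light_novels=True)
```

Here `default_results` contains only the manga entry, while `with_novels` contains both, mirroring the difference between running the search with and without `--include-light-novels`.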