Datasets
AniSearch Model uses a variety of anime and manga datasets to provide comprehensive search capabilities. This page details the datasets used and how they're processed.
Anime Datasets
The system combines multiple anime datasets to ensure broad coverage of titles:
- MyAnimeList Dataset (Anime.csv)
  - Source: Kaggle
  - Contents: ~17,500 anime entries with ratings, genres, and synopses
- Anime Dataset 2023 (anime-dataset-2023.csv)
  - Source: Kaggle
  - Contents: Updated anime entries with recent titles
- Anime Database 2022 (Anime-2022.csv)
  - Source: Kaggle
  - Contents: ~15,000 anime entries with detailed metadata
- Anime Dataset (animes.csv)
  - Source: Kaggle
  - Contents: Alternative set of anime entries
- Anime DataSet (anime4500.csv)
  - Source: Kaggle
  - Contents: ~4,500 popular anime titles
- Anime Data (Anime_data.csv)
  - Source: Kaggle
  - Contents: Detailed anime information with extended descriptions
- Anime2 (anime2.csv)
  - Source: Kaggle
  - Contents: Additional anime entries
- MAL Anime (mal_anime.csv)
  - Source: Kaggle
  - Contents: Comprehensive MyAnimeList data
- Anime 270
  - Source: Hugging Face
  - Contents: Curated set of 270 anime entries
- Wykonos Anime
  - Source: Hugging Face
  - Contents: Specialized anime dataset with detailed tags
Manga Datasets
For manga search functionality, the following datasets are used:
- MyAnimeList Manga Dataset (Manga.csv)
  - Source: Kaggle
  - Contents: ~14,000 manga entries with ratings and synopses
- MyAnimeList Jikan Database (jikan.csv)
  - Source: Kaggle
  - Contents: Data extracted from MyAnimeList via the Jikan API
- Manga, Manhwa and Manhua Dataset (data.csv)
  - Source: Kaggle
  - Contents: Diverse collection of Japanese manga, Korean manhwa, and Chinese manhua
Dataset Processing
The merge_datasets.py script handles dataset preparation:
- Cleaning:
  - Removes duplicate entries
  - Standardizes text fields
  - Filters entries without synopses
- Merging:
  - Combines datasets based on unique identifiers
  - Resolves conflicting information
  - Creates unified CSV files
- Output:
  - model/merged_anime_dataset.csv: Combined anime dataset
  - model/merged_manga_dataset.csv: Combined manga dataset
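The cleaning and merging steps can be illustrated with a minimal sketch. This is not the code in merge_datasets.py: the function names (`standardize`, `clean_rows`, `merge_datasets`, `to_csv`), the sample rows, and the conflict policy (keep the first non-empty value seen for each field) are all assumptions made for illustration, and only the Python standard library is used.

```python
import csv
import io


def standardize(text):
    """Collapse runs of whitespace and trim both ends of a text field."""
    return " ".join((text or "").split())


def clean_rows(rows):
    """Standardize text fields, drop rows without a synopsis, drop duplicate ids."""
    seen = set()
    cleaned = []
    for row in rows:
        row = {key: standardize(value) for key, value in row.items()}
        if not row.get("synopsis"):
            continue  # filter entries without synopses
        if row["id"] in seen:
            continue  # remove duplicate entries within one dataset
        seen.add(row["id"])
        cleaned.append(row)
    return cleaned


def merge_datasets(*datasets):
    """Combine datasets on `id`.

    Assumed conflict policy: the first non-empty value seen for a field wins.
    """
    merged = {}
    for rows in datasets:
        for row in clean_rows(rows):
            target = merged.setdefault(row["id"], {})
            for key, value in row.items():
                if value and not target.get(key):
                    target[key] = value
    return list(merged.values())


def to_csv(rows, fields):
    """Serialize merged rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, restval="", extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows standing in for two source datasets.
source_a = [
    {"id": "1", "title": "Example A", "synopsis": "  A   first synopsis. ", "score": ""},
    {"id": "1", "title": "Example A", "synopsis": "Duplicate row.", "score": "7.0"},
    {"id": "2", "title": "Example B", "synopsis": "", "score": "6.0"},
]
source_b = [
    {"id": "1", "title": "Example A", "synopsis": "Another synopsis.", "score": "7.5"},
    {"id": "3", "title": "Example C", "synopsis": "A third synopsis.", "score": "8.0"},
]

merged = merge_datasets(source_a, source_b)
print(to_csv(merged, ["id", "title", "synopsis", "score"]))
```

In this sketch, entry 1 keeps the synopsis from the first source and picks up its score from the second, while entry 2 is dropped because it has no synopsis. A real pipeline may resolve conflicts differently, for example by preferring the longest synopsis.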
Dataset Structure
The final merged datasets contain these key fields:
Field | Description |
---|---|
id | Unique identifier (typically MyAnimeList ID) |
title | Primary title of the anime/manga |
title_english | English title (if different) |
synopsis | Plot summary/description |
genres | List of genres |
type | Media type (TV, Movie, OVA, Manga, etc.) |
score | Average user rating |
popularity | Popularity ranking |
episodes / chapters | Number of episodes (anime) or chapters (manga) |
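As a rough illustration of how one row of a merged CSV might be represented in code, the sketch below maps the fields above onto a dataclass. The `Entry` class, the `parse_entry` helper, the Python types, and the assumption that genres are stored as a comma-separated string are all hypothetical; the actual column formats in the merged files may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    id: int
    title: str
    synopsis: str
    type: str
    title_english: Optional[str] = None
    genres: List[str] = field(default_factory=list)
    score: Optional[float] = None
    popularity: Optional[int] = None
    count: Optional[int] = None  # episodes (anime) or chapters (manga)


def parse_entry(row):
    """Build an Entry from a CSV row given as a dict of strings."""

    def optional(value, cast):
        # Treat missing or blank cells as None instead of failing the cast.
        value = (value or "").strip()
        return cast(value) if value else None

    # Anime rows are assumed to carry `episodes`, manga rows `chapters`.
    count_raw = row.get("episodes") or row.get("chapters")
    return Entry(
        id=int(row["id"]),
        title=row["title"].strip(),
        synopsis=row["synopsis"].strip(),
        type=row["type"].strip(),
        title_english=optional(row.get("title_english"), str),
        # Assumption: genres are serialized as "Action, Drama".
        genres=[g.strip() for g in (row.get("genres") or "").split(",") if g.strip()],
        score=optional(row.get("score"), float),
        popularity=optional(row.get("popularity"), int),
        count=optional(count_raw, int),
    )


# Hypothetical row, as csv.DictReader would yield it.
sample = {
    "id": "42",
    "title": "Example Title",
    "title_english": "",
    "synopsis": "A short plot summary.",
    "genres": "Action, Drama",
    "type": "TV",
    "score": "7.85",
    "popularity": "120",
    "episodes": "12",
}
entry = parse_entry(sample)
print(entry)
```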
Light Novels
For manga searches, you can optionally include light novels. Enabling this option adds entries with type "Light Novel" to the search results; they are filtered out by default.
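The option that enables this is not reproduced here. As a minimal sketch of the default filtering behavior only, the hypothetical `filter_manga` function below drops "Light Novel" entries unless the caller opts in; the function name, the `include_light_novels` parameter, and the sample entries are assumptions, not the project's actual interface.

```python
def filter_manga(entries, include_light_novels=False):
    """Return manga entries, excluding type "Light Novel" unless requested."""
    if include_light_novels:
        return list(entries)
    return [entry for entry in entries if entry.get("type") != "Light Novel"]


# Hypothetical entries from a merged manga dataset.
entries = [
    {"id": 1, "title": "Example Manga", "type": "Manga"},
    {"id": 2, "title": "Example Novel", "type": "Light Novel"},
    {"id": 3, "title": "Example Manhwa", "type": "Manhwa"},
]

default_results = filter_manga(entries)
all_results = filter_manga(entries, include_light_novels=True)
print([e["title"] for e in default_results])
print([e["title"] for e in all_results])
```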