Datasets

AniSearch Model draws on multiple anime and manga datasets to give broad search coverage. This page lists those datasets and describes how they are processed.

Anime Datasets

The system combines multiple anime datasets to ensure broad coverage of titles:

  1. MyAnimeList Dataset (Anime.csv)

    • Source: Kaggle
    • Contents: ~17,500 anime entries with ratings, genres, and synopses
  2. Anime Dataset 2023 (anime-dataset-2023.csv)

    • Source: Kaggle
    • Contents: Updated anime entries with recent titles
  3. Anime Database 2022 (Anime-2022.csv)

    • Source: Kaggle
    • Contents: ~15,000 anime entries with detailed metadata
  4. Anime Dataset (animes.csv)

    • Source: Kaggle
    • Contents: Alternative set of anime entries
  5. Anime DataSet (anime4500.csv)

    • Source: Kaggle
    • Contents: ~4,500 popular anime titles
  6. Anime Data (Anime_data.csv)

    • Source: Kaggle
    • Contents: Detailed anime information with extended descriptions
  7. Anime2 (anime2.csv)

    • Source: Kaggle
    • Contents: Additional anime entries
  8. MAL Anime (mal_anime.csv)

    • Source: Kaggle
    • Contents: Comprehensive MyAnimeList data
  9. Anime 270

    • Source: Hugging Face
    • Contents: Curated set of 270 anime entries
  10. Wykonos Anime

    • Source: Hugging Face
    • Contents: Specialized anime dataset with detailed tags

Manga Datasets

For manga search functionality, the following datasets are used:

  1. MyAnimeList Manga Dataset (Manga.csv)

    • Source: Kaggle
    • Contents: ~14,000 manga entries with ratings and synopses
  2. MyAnimeList Jikan Database (jikan.csv)

    • Source: Kaggle
    • Contents: Data extracted from MyAnimeList via the Jikan API
  3. Manga, Manhwa and Manhua Dataset (data.csv)

    • Source: Kaggle
    • Contents: Diverse collection of Japanese manga, Korean manhwa, and Chinese manhua

Dataset Processing

The merge_datasets.py script handles dataset preparation:

  1. Cleaning:

    • Removes duplicate entries
    • Standardizes text fields
    • Filters entries without synopses
  2. Merging:

    • Combines datasets based on unique identifiers
    • Resolves conflicting information
    • Creates unified CSV files
  3. Output:

    • model/merged_anime_dataset.csv: Combined anime dataset
    • model/merged_manga_dataset.csv: Combined manga dataset
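
The cleaning and merging steps above can be illustrated with a minimal sketch. This is not the code of merge_datasets.py, whose internals are not shown on this page. The function names clean_rows and merge_datasets, the whitespace normalization, and the "first non-empty value wins" conflict rule are all assumptions made for illustration, and the sample rows are made up.

```python
import re


def clean_rows(rows):
    """Standardize text, drop rows without a synopsis, and remove duplicate ids."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        # Standardize text fields: collapse runs of whitespace and trim the ends.
        row = {
            key: re.sub(r"\s+", " ", value).strip() if isinstance(value, str) else value
            for key, value in row.items()
        }
        if not row.get("synopsis"):
            continue  # filter entries without synopses
        if row["id"] in seen_ids:
            continue  # remove duplicate entries within one dataset
        seen_ids.add(row["id"])
        cleaned.append(row)
    return cleaned


def merge_datasets(*datasets):
    """Combine datasets on `id`; on conflict, keep the first non-empty value (assumed rule)."""
    merged = {}
    for rows in datasets:
        for row in clean_rows(rows):
            target = merged.setdefault(row["id"], {})
            for key, value in row.items():
                # Only fill fields that are still missing or empty.
                if target.get(key) in (None, ""):
                    target[key] = value
    return list(merged.values())


# Made-up sample rows standing in for two source datasets.
dataset_a = [
    {"id": "1", "title": "Example Title", "synopsis": "A  short   synopsis. ", "score": ""},
    {"id": "1", "title": "Example Title", "synopsis": "Duplicate row.", "score": ""},
    {"id": "2", "title": "No Synopsis", "synopsis": "", "score": "7.0"},
]
dataset_b = [
    {"id": "1", "title": "Example Title", "synopsis": "Another description.", "score": "8.5"},
]

merged = merge_datasets(dataset_a, dataset_b)
```

In this sketch the duplicate row and the row without a synopsis are dropped, and the missing score for id 1 is filled in from the second dataset. A real pipeline may resolve conflicts differently, for example by preferring the longest synopsis.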

Dataset Structure

The final merged datasets contain these key fields:

Field              Description
id                 Unique identifier (typically MyAnimeList ID)
title              Primary title of the anime/manga
title_english      English title (if different)
synopsis           Plot summary/description
genres             List of genres
type               Media type (TV, Movie, OVA, Manga, etc.)
score              Average user rating
popularity         Popularity ranking
episodes/chapters  Number of episodes/chapters
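
As a rough sketch of how these fields might be read, the example below parses a CSV row into a typed record. The Entry class and parse_entry function are hypothetical, the in-memory sample stands in for model/merged_anime_dataset.csv, and storing genres as a comma-separated string is an assumption about the file format. The episodes/chapters column is left out for brevity.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    id: int
    title: str
    title_english: Optional[str]
    synopsis: str
    genres: list
    type: str
    score: Optional[float]
    popularity: Optional[int]


def parse_entry(row):
    """Convert one CSV row (a dict of strings) into a typed Entry."""
    return Entry(
        id=int(row["id"]),
        title=row["title"],
        title_english=row["title_english"] or None,
        synopsis=row["synopsis"],
        # Assumption: genres are stored as a comma-separated string.
        genres=[g.strip() for g in row["genres"].split(",") if g.strip()],
        type=row["type"],
        score=float(row["score"]) if row["score"] else None,
        popularity=int(row["popularity"]) if row["popularity"] else None,
    )


# Made-up sample standing in for model/merged_anime_dataset.csv.
sample_csv = io.StringIO(
    "id,title,title_english,synopsis,genres,type,score,popularity\n"
    '1,Example Title,,A short synopsis.,"Action, Sci-Fi",TV,8.5,42\n'
)
entries = [parse_entry(row) for row in csv.DictReader(sample_csv)]
```

To read the real file, replace the io.StringIO sample with open("model/merged_anime_dataset.csv", newline="", encoding="utf-8") and adjust parse_entry to the actual column formats.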

Light Novels

For manga searches, you can optionally include light novels:

python src/main.py search --type manga --query "Fantasy world" --include-light-novels

By default, entries with type "Light Novel" are filtered out of manga search results. The --include-light-novels flag keeps them in.
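
The default filtering can be pictured with a small sketch. The function filter_by_type and its include_light_novels parameter are hypothetical names chosen to mirror the command-line flag; how src/main.py actually applies the filter is not shown on this page.

```python
def filter_by_type(results, include_light_novels=False):
    """Drop entries whose type is "Light Novel" unless explicitly included."""
    if include_light_novels:
        return list(results)
    return [r for r in results if r.get("type") != "Light Novel"]


# Made-up search results for illustration.
results = [
    {"title": "Example Manga", "type": "Manga"},
    {"title": "Example Novel", "type": "Light Novel"},
]

default_titles = [r["title"] for r in filter_by_type(results)]
all_titles = [r["title"] for r in filter_by_type(results, include_light_novels=True)]
```

Here default_titles contains only the manga entry, while all_titles contains both entries.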