MaxTokens
This module calculates the maximum token count for different transformer models across anime and manga datasets.
It tokenizes the text in multiple synopsis columns from each dataset with each model's tokenizer and records the longest result per model. This is useful for setting an appropriate maximum sequence length when training or using these models.
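A minimal sketch of the per-model computation, assuming pandas and Hugging Face transformers; the helper `max_tokens_for_model` and its internals are illustrative assumptions, not the module's actual implementation:

```python
import pandas as pd
from transformers import AutoTokenizer

def max_tokens_for_model(csv_path, columns, model_name, batch_size=64):
    """Return the longest tokenized text, in tokens, across the given columns."""
    df = pd.read_csv(csv_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    max_len = 0
    for column in columns:
        if column not in df.columns:
            continue  # merged datasets may lack some synopsis columns
        texts = df[column].dropna().astype(str).tolist()
        # Tokenize in batches; padding=False keeps each text's true length.
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            encodings = tokenizer(batch, truncation=False, padding=False)
            batch_max = max(len(ids) for ids in encodings["input_ids"])
            max_len = max(max_len, batch_max)
    return max_len
```

Running this helper once per model and collecting the results into a dictionary mirrors the shape of what `calculate_max_tokens` returns.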
anime_max_tokens
module-attribute
anime_max_tokens = calculate_max_tokens('model/merged_anime_dataset.csv', anime_synopsis_columns, model_list)
anime_synopsis_columns
module-attribute
anime_synopsis_columns = [
    'synopsis',
    'Synopsis anime_dataset_2023',
    'Synopsis animes dataset',
    'Synopsis anime_270 Dataset',
    'Synopsis Anime-2022 Dataset',
    'Synopsis anime4500 Dataset',
    'Synopsis wykonos Dataset',
    'Synopsis Anime_data Dataset',
    'Synopsis anime2 Dataset',
    'Synopsis mal_anime Dataset',
]
manga_max_tokens
module-attribute
manga_max_tokens = calculate_max_tokens('model/merged_manga_dataset.csv', manga_synopsis_columns, model_list)
manga_synopsis_columns
module-attribute
model_list
module-attribute
model_list = [
    'toobi/anime',
    'sentence-transformers/all-distilroberta-v1',
    'sentence-transformers/all-MiniLM-L6-v1',
    'sentence-transformers/all-MiniLM-L12-v1',
    'sentence-transformers/all-MiniLM-L6-v2',
    'sentence-transformers/all-MiniLM-L12-v2',
    'sentence-transformers/all-mpnet-base-v1',
    'sentence-transformers/all-mpnet-base-v2',
    'sentence-transformers/all-roberta-large-v1',
    'sentence-transformers/gtr-t5-base',
    'sentence-transformers/gtr-t5-large',
    'sentence-transformers/gtr-t5-xl',
    'sentence-transformers/multi-qa-distilbert-dot-v1',
    'sentence-transformers/multi-qa-mpnet-base-cos-v1',
    'sentence-transformers/multi-qa-mpnet-base-dot-v1',
    'sentence-transformers/paraphrase-distilroberta-base-v2',
    'sentence-transformers/paraphrase-mpnet-base-v2',
    'sentence-transformers/sentence-t5-base',
    'sentence-transformers/sentence-t5-large',
    'sentence-transformers/sentence-t5-xl',
    'sentence-transformers/sentence-t5-xxl',
]
calculate_max_tokens
calculate_max_tokens(dataset_path: str, synopsis_columns: List[str], model_names: List[str], batch_size: int = 64) -> Dict[str, int]
Calculate the maximum token count for each model across specified synopsis columns in a dataset.
PARAMETER | TYPE | DESCRIPTION
---|---|---
`dataset_path` | `str` | Path to the CSV dataset file.
`synopsis_columns` | `List[str]` | List of column names containing synopsis text to analyze.
`model_names` | `List[str]` | List of model names/paths to test for tokenization.
`batch_size` | `int` | Batch size for processing. Defaults to `64`.
RETURNS | DESCRIPTION
---|---
`Dict[str, int]` | Dictionary mapping model names to their maximum token counts, e.g. `{'model-name': max_token_count}`.
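A hedged usage sketch; the import path and the commented result values are assumptions for illustration:

```python
# Hypothetical import path; adjust to wherever the module lives in the project.
from max_tokens import calculate_max_tokens, anime_synopsis_columns, model_list

anime_max_tokens = calculate_max_tokens(
    "model/merged_anime_dataset.csv",
    anime_synopsis_columns,
    model_list,
    batch_size=64,
)
# Illustrative output shape: {'sentence-transformers/all-mpnet-base-v2': 431, ...}
for model_name, token_count in sorted(anime_max_tokens.items()):
    print(f"{model_name}: max {token_count} tokens")
```

The per-model maxima can then guide the maximum sequence length (e.g. `max_seq_length` in sentence-transformers) chosen when encoding or fine-tuning with each model.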