Common
This module provides utility functions for loading datasets, preprocessing text, and saving evaluation data for machine learning models.
| FUNCTION | DESCRIPTION |
|---|---|
| `load_dataset` | Load and preprocess a dataset from a CSV file. |
| `preprocess_text` | Clean and normalize text data for ML processing. |
| `save_evaluation_data` | Save model evaluation results to JSON. |
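A typical workflow ties these three functions together. The example below is only illustrative: the import path `src.common`, the CSV path, and the model name are placeholders, not values taken from the project.

```python
import src.common as common

# Load the raw data; missing 'Synopsis' values are filled during loading.
df = common.load_dataset("data/dataset.csv")  # path is a placeholder

# Clean each synopsis for downstream embedding / ML steps.
df["clean_synopsis"] = df["Synopsis"].apply(common.preprocess_text)

# Record the run in model/evaluation_results.json.
common.save_evaluation_data("example-model", batch_size=32,
                            num_embeddings=len(df))
```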
load_dataset
Load dataset from a CSV file and fill missing values in the 'Synopsis' column.
| PARAMETER | DESCRIPTION |
|---|---|
| `file_path` | Path to the CSV file containing the dataset. |
| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Loaded dataset with filled 'Synopsis' column. |
Source code in src/common.py
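The collapsed source is not reproduced here; below is a minimal sketch of what `load_dataset` likely looks like, assuming pandas and an empty-string fill value (the fill value and the `str` annotation are assumptions):

```python
import pandas as pd

def load_dataset(file_path: str) -> pd.DataFrame:
    """Load a dataset from CSV and fill missing values in the 'Synopsis' column."""
    df = pd.read_csv(file_path)
    # Missing synopses become empty strings so text preprocessing never
    # receives NaN; the exact fill value is an assumption.
    df["Synopsis"] = df["Synopsis"].fillna("")
    return df
```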
preprocess_text
Preprocess text data by applying various cleaning and normalization steps.
Steps include:
- Converting to lowercase
- Expanding contractions
- Removing accents
- Removing extra whitespace
- Removing URLs
- Removing source citations
- Removing stopwords
- Lemmatizing words
| PARAMETER | DESCRIPTION |
|---|---|
| `text` | Input text to preprocess. Can be a string or any other type. |
| RETURNS | DESCRIPTION |
|---|---|
| `Any` | Preprocessed text if the input was a string; otherwise the input is returned unchanged. |
Source code in src/common.py
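A sketch of the listed steps, assuming the `contractions` and `unidecode` packages plus NLTK (with the `stopwords` and `wordnet` corpora downloaded); the actual implementation in src/common.py may use different libraries, and the source-citation pattern shown is an assumption:

```python
import re
from typing import Any

import contractions                      # assumed package for expanding contractions
from unidecode import unidecode          # assumed helper for stripping accents
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

_STOPWORDS = set(stopwords.words("english"))
_LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text: Any) -> Any:
    # Non-string input (e.g. NaN) is passed through unchanged.
    if not isinstance(text, str):
        return text
    text = text.lower()
    text = contractions.fix(text)                    # expand contractions ("don't" -> "do not")
    text = unidecode(text)                           # remove accents
    text = re.sub(r"http\S+|www\.\S+", " ", text)    # remove URLs
    text = re.sub(r"\(source:[^)]*\)", " ", text)    # remove source citations (pattern is an assumption)
    text = re.sub(r"\s+", " ", text).strip()         # remove extra whitespace
    tokens = [w for w in text.split() if w not in _STOPWORDS]   # remove stopwords
    return " ".join(_LEMMATIZER.lemmatize(w) for w in tokens)   # lemmatize words
```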
save_evaluation_data
```python
save_evaluation_data(
    model_name: str,
    batch_size: int,
    num_embeddings: int,
    additional_info: Optional[Dict[str, Any]] = None,
) -> None
```
Save model evaluation data to a JSON file with timestamp and parameters.
Creates or appends to 'model/evaluation_results.json', storing evaluation metrics and model configuration details.
| PARAMETER | DESCRIPTION |
|---|---|
| `model_name` | Name/identifier of the model being evaluated.<br>TYPE: `str` |
| `batch_size` | Batch size used for generating embeddings.<br>TYPE: `int` |
| `num_embeddings` | Total number of embeddings generated.<br>TYPE: `int` |
| `additional_info` | Additional evaluation metrics or parameters.<br>TYPE: `Optional[Dict[str, Any]]`<br>DEFAULT: `None` |
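A minimal sketch of the append-to-JSON behaviour described above, followed by an example call; the record field names and the example model name are assumptions, not the project's actual schema:

```python
import json
import os
from datetime import datetime
from typing import Any, Dict, Optional

def save_evaluation_data(model_name: str, batch_size: int, num_embeddings: int,
                         additional_info: Optional[Dict[str, Any]] = None) -> None:
    path = "model/evaluation_results.json"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # Load any existing results so new runs are appended rather than overwritten.
    results = []
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            results = json.load(f)
    results.append({
        "timestamp": datetime.now().isoformat(),   # field names are assumptions
        "model_name": model_name,
        "batch_size": batch_size,
        "num_embeddings": num_embeddings,
        "additional_info": additional_info or {},
    })
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)

# Example call after generating embeddings (model name is a placeholder):
save_evaluation_data("example-model", batch_size=64, num_embeddings=10_000,
                     additional_info={"runtime_seconds": 42.0})
```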