PairGeneration
This module handles the generation of training pairs for a sentence transformer model.
It provides functionality to create three types of pairs:
- Positive pairs: Pairs of synopses from the same entry with high similarity (>=0.8)
- Partial positive pairs: Pairs from different entries with moderate similarity (>=0.5 and <0.8)
- Negative pairs: Pairs from different entries with low similarity (<0.5)
The similarity between entries is calculated based on their genres and themes using semantic embeddings. The module uses multiprocessing for efficient pair generation and includes functions for both single-row processing and batch processing.
FUNCTION | DESCRIPTION |
---|---|
calculate_semantic_similarity | Calculate weighted similarity between genres/themes |
create_positive_pairs | Generate pairs from same-entry synopses with high similarity |
generate_partial_positive_pairs | Generate pairs from different entries with moderate similarity |
create_partial_positive_pairs | Orchestrate partial positive pair generation |
generate_negative_pairs | Generate pairs from different entries with low similarity |
create_negative_pairs | Orchestrate negative pair generation with multiprocessing |
The module supports saving generated pairs to CSV files and includes proper error handling and logging throughout the pair generation process. For all pair types, the shorter synopsis must be at least 50% the length of the longer synopsis.
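The thresholds and the length rule above can be sketched as two small helpers. This is a minimal illustration, not the module's code: the names `length_ratio_ok` and `classify_cross_entry` are hypothetical, length is assumed to be measured in characters, the default `partial_threshold` of 0.5 is taken from the overview, and the 0.15 floor and the 0.01 margins are taken from the return descriptions further down. Whether the band endpoints are inclusive is also an assumption.

```python
from typing import Optional

HIGH_THRESHOLD = 0.8   # positive cutoff stated in the overview
NEGATIVE_FLOOR = 0.15  # lower bound quoted in the negative-pair return docs


def length_ratio_ok(text_a: str, text_b: str, min_ratio: float = 0.5) -> bool:
    """Return True when the shorter text is at least min_ratio of the longer one.

    Assumption: length is counted in characters.
    """
    shorter, longer = sorted((len(text_a), len(text_b)))
    if longer == 0:
        return False  # assumption: two empty synopses never form a usable pair
    return shorter / longer >= min_ratio


def classify_cross_entry(similarity: float, partial_threshold: float = 0.5) -> Optional[str]:
    """Map a genre/theme similarity score to a pair type for two different entries."""
    if partial_threshold + 0.01 <= similarity < HIGH_THRESHOLD:
        return "partial_positive"
    if NEGATIVE_FLOOR <= similarity <= partial_threshold - 0.01:
        return "negative"
    return None  # outside both documented bands, so the candidate is skipped
```

Under these assumptions a score that falls exactly on `partial_threshold` belongs to neither band, which keeps a small gap between negatives and partial positives.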
calculate_semantic_similarity
calculate_semantic_similarity(category_to_embedding: Dict[str, NDArray[float64]], genres_a: Set[str], genres_b: Set[str], themes_a: Set[str], themes_b: Set[str], genre_weight: float = 0.35, theme_weight: float = 0.65) -> float
Calculate the semantic similarity between two sets of genres and themes.
PARAMETER | DESCRIPTION |
---|---|
category_to_embedding | Dictionary mapping categories to embeddings. TYPE: Dict[str, NDArray[float64]] |
genres_a | Set of genres for the first item. TYPE: Set[str] |
genres_b | Set of genres for the second item. TYPE: Set[str] |
themes_a | Set of themes for the first item. TYPE: Set[str] |
themes_b | Set of themes for the second item. TYPE: Set[str] |
genre_weight | Weight for genre similarity. Defaults to 0.35. TYPE: float |
theme_weight | Weight for theme similarity. Defaults to 0.65. TYPE: float |
RETURNS | DESCRIPTION |
---|---|
float | Weighted semantic similarity score between 0 and 1 |
Source code in src/training/data/pair_generation.py
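One way such a weighted score could be computed is sketched below. The source does not state how the sets are compared, so everything here is an assumption: each set is represented by the mean of its category embeddings, the two means are compared with cosine similarity clipped to [0, 1], and the genre and theme scores are combined with the documented default weights. The name `weighted_similarity_sketch` is hypothetical, and plain Python lists stand in for the NumPy arrays so the example needs no third-party packages.

```python
import math
from typing import Dict, List, Sequence, Set


def _mean_vector(names: Set[str], table: Dict[str, Sequence[float]]) -> List[float]:
    """Average the embeddings of the known categories; empty list if none are known."""
    vectors = [table[name] for name in names if name in table]
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]


def _cosine01(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity clipped to [0, 1]; 0.0 when either vector is missing or zero."""
    if not u or not v:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 0.0
    return max(0.0, min(1.0, dot / norm))


def weighted_similarity_sketch(
    table: Dict[str, Sequence[float]],
    genres_a: Set[str],
    genres_b: Set[str],
    themes_a: Set[str],
    themes_b: Set[str],
    genre_weight: float = 0.35,
    theme_weight: float = 0.65,
) -> float:
    """Combine genre and theme similarity using the documented default weights."""
    genre_sim = _cosine01(_mean_vector(genres_a, table), _mean_vector(genres_b, table))
    theme_sim = _cosine01(_mean_vector(themes_a, table), _mean_vector(themes_b, table))
    return genre_weight * genre_sim + theme_weight * theme_sim
```

Because the default weights sum to 1 and each component is clipped to [0, 1], the combined score in this sketch stays within the documented 0 to 1 range.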
create_negative_pairs
create_negative_pairs(df: DataFrame, synopses_columns: List[str], partial_threshold: float, max_negative_per_row: int, negative_pairs_file: Optional[str], num_workers: int, category_to_embedding: Dict[str, NDArray[float64]])
Create negative pairs from the dataframe using multiprocessing.
PARAMETER | DESCRIPTION |
---|---|
df | DataFrame containing the data. TYPE: DataFrame |
synopses_columns | List of column names containing synopses. TYPE: List[str] |
partial_threshold | Maximum similarity threshold for negatives. TYPE: float |
max_negative_per_row | Maximum number of negative pairs per row. TYPE: int |
negative_pairs_file | Path to save pairs CSV, if provided. TYPE: Optional[str] |
num_workers | Number of worker processes for multiprocessing. TYPE: int |
category_to_embedding | Category embedding dict. TYPE: Dict[str, NDArray[float64]] |
RETURNS | DESCRIPTION |
---|---|
List[InputExample] | List of negative pairs with similarity between 0.15 and partial_threshold-0.01. Each pair consists of synopses from different entries where the shorter synopsis is at least 50% the length of the longer one. |
Source code in src/training/data/pair_generation.py
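The orchestration pattern described here (one worker call per row, results flattened into a single list) might look roughly like the sketch below. It is not the module's implementation: `run_per_row`, `toy_row_worker`, and `pair_all_rows` are hypothetical names, a plain tuple stands in for `InputExample`, and the serial fallback for `num_workers <= 1` is an added convenience rather than documented behaviour.

```python
from functools import partial
from multiprocessing import Pool
from typing import Callable, Iterable, List, Sequence, Tuple

Pair = Tuple[str, str, float]  # stand-in for InputExample (two texts and a label)


def toy_row_worker(i: int, texts: Sequence[str]) -> List[Pair]:
    """Hypothetical per-row worker: pairs row i with the following row."""
    j = (i + 1) % len(texts)
    return [(texts[i], texts[j], 0.0)]


def run_per_row(
    worker: Callable[[int], List[Pair]],
    indices: Iterable[int],
    num_workers: int,
) -> List[Pair]:
    """Apply worker to every row index and flatten the per-row lists."""
    indices = list(indices)
    if num_workers <= 1:
        per_row = [worker(i) for i in indices]  # serial fallback (assumption)
    else:
        # The worker must be picklable, e.g. a partial of a module-level function.
        with Pool(processes=num_workers) as pool:
            per_row = pool.map(worker, indices)
    return [pair for row in per_row for pair in row]


def pair_all_rows(texts: Sequence[str], num_workers: int = 1) -> List[Pair]:
    """Bind the shared arguments once, then fan out over the row indices."""
    worker = partial(toy_row_worker, texts=texts)
    return run_per_row(worker, range(len(texts)), num_workers)
```

On platforms that start worker processes by spawning, a call with `num_workers > 1` should be made from under an `if __name__ == "__main__":` guard.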
create_partial_positive_pairs
create_partial_positive_pairs(df: DataFrame, synopses_columns: List[str], partial_threshold: float, max_partial_per_row: int, partial_positive_pairs_file: Optional[str], num_workers: int, category_to_embedding: Dict[str, NDArray[float64]]) -> List[InputExample]
Create partial positive pairs from the dataframe using multiprocessing.
PARAMETER | DESCRIPTION |
---|---|
df | DataFrame containing the data. TYPE: DataFrame |
synopses_columns | List of column names containing synopses. TYPE: List[str] |
partial_threshold | Minimum similarity threshold for partial positives. TYPE: float |
max_partial_per_row | Maximum number of partial positive pairs per row. TYPE: int |
partial_positive_pairs_file | Path to save pairs CSV, if provided. TYPE: Optional[str] |
num_workers | Number of worker processes for multiprocessing. TYPE: int |
category_to_embedding | Category embedding dict. TYPE: Dict[str, NDArray[float64]] |
RETURNS | DESCRIPTION |
---|---|
List[InputExample] | List of partial positive pairs with similarity between partial_threshold+0.01 and 0.8. Each pair consists of synopses from different entries where the shorter synopsis is at least 50% the length of the longer one. |
Source code in src/training/data/pair_generation.py
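The optional CSV output controlled by a `*_pairs_file` argument could be handled along these lines. This is a sketch under assumptions: the helper name `save_pairs_csv` and the column names `text_a`, `text_b`, and `label` are hypothetical, since the source does not describe the file layout, and a plain tuple again stands in for `InputExample`.

```python
import csv
from typing import Iterable, Optional, Tuple

Pair = Tuple[str, str, float]  # stand-in for InputExample


def save_pairs_csv(pairs: Iterable[Pair], path: Optional[str]) -> int:
    """Write pairs to path and return the number of rows written.

    When path is None nothing is written, mirroring the "if provided" wording.
    """
    if path is None:
        return 0
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text_a", "text_b", "label"])  # hypothetical header
        for text_a, text_b, label in pairs:
            writer.writerow([text_a, text_b, label])
            count += 1
    return count
```

Opening the file with `newline=""` lets the `csv` module control line endings, which matters for synopses that contain embedded newlines.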
create_positive_pairs
create_positive_pairs(df: DataFrame, synopses_columns: List[str], encoder_model: SentenceTransformer, positive_pairs_file: Optional[str]) -> List[InputExample]
Create positive pairs of synopses from the same entry with high similarity.
PARAMETER | DESCRIPTION |
---|---|
df | DataFrame containing the data. TYPE: DataFrame |
synopses_columns | List of column names containing synopses. TYPE: List[str] |
encoder_model | Pre-trained sentence transformer model. TYPE: SentenceTransformer |
positive_pairs_file | Path to save positive pairs CSV, if provided. TYPE: Optional[str] |
RETURNS | DESCRIPTION |
---|---|
List[InputExample] | List of positive pairs with similarity scores >= 0.8. Each pair consists of synopses from the same entry where the shorter synopsis is at least 50% the length of the longer one. |
Source code in src/training/data/pair_generation.py
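For a single entry, the positive-pair rule can be illustrated as follows. This is a minimal sketch, not the module's code: `positive_pairs_for_row` is a hypothetical helper, the row is a plain dict rather than a DataFrame row, length is assumed to be counted in characters, and an arbitrary `encode` callable replaces the `SentenceTransformer` model so the example runs without downloading anything.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain cosine similarity; 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def positive_pairs_for_row(
    row: Dict[str, Optional[str]],
    synopses_columns: Sequence[str],
    encode: Callable[[str], Sequence[float]],
    min_similarity: float = 0.8,
    min_length_ratio: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Pair up the non-empty synopses of one entry that pass both filters."""
    texts = [row[col] for col in synopses_columns if row.get(col)]
    pairs = []
    for text_a, text_b in combinations(texts, 2):
        shorter, longer = sorted((len(text_a), len(text_b)))
        if shorter / longer < min_length_ratio:
            continue  # shorter synopsis is under 50% of the longer one
        similarity = _cosine(encode(text_a), encode(text_b))
        if similarity >= min_similarity:
            pairs.append((text_a, text_b, similarity))
    return pairs
```

Checking the length ratio before encoding avoids spending model time on pairs that would be discarded anyway.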
generate_negative_pairs
generate_negative_pairs(i, df, synopses_columns, partial_threshold, max_negative_per_row, category_to_embedding, valid_indices, max_attempts=50)
Generate negative pairs for a single row in the dataframe.
PARAMETER | DESCRIPTION |
---|---|
i | Index of the row to process |
df | DataFrame containing the data |
synopses_columns | List of column names containing synopses |
partial_threshold | Maximum similarity threshold for negatives |
max_negative_per_row | Maximum number of negative pairs per row |
category_to_embedding | Category embedding dict |
valid_indices | List of valid row indices to sample from |
max_attempts | Max attempts to find pairs. Defaults to 50 |
RETURNS | DESCRIPTION |
---|---|
List[InputExample] | List of negative pairs with similarity between 0.15 and partial_threshold-0.01. Each pair consists of synopses from different entries where the shorter synopsis is at least 50% the length of the longer one. |
Source code in src/training/data/pair_generation.py
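The role of `valid_indices`, the per-row cap, and `max_attempts` suggests a bounded random-sampling loop. The sketch below shows one plausible shape for it and should not be read as the actual implementation: `sample_partners` is a hypothetical helper, the `similarity` callable replaces the genre/theme computation, and skipping partners that were already tried is an assumption.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def sample_partners(
    i: int,
    valid_indices: Sequence[int],
    similarity: Callable[[int, int], float],
    low: float,
    high: float,
    max_pairs: int,
    max_attempts: int = 50,
    rng: Optional[random.Random] = None,
) -> List[Tuple[int, float]]:
    """Randomly draw partner rows for row i until the cap or attempt budget is hit."""
    if rng is None:
        rng = random.Random()
    found: List[Tuple[int, float]] = []
    tried = set()
    attempts = 0
    while len(found) < max_pairs and attempts < max_attempts:
        attempts += 1
        j = rng.choice(valid_indices)
        if j == i or j in tried:
            continue  # never pair a row with itself or retry a partner
        tried.add(j)
        score = similarity(i, j)
        if low <= score <= high:
            found.append((j, score))
    return found
```

For negative pairs the band would be `low=0.15` and `high=partial_threshold - 0.01`, per the return description. Bounding the loop with `max_attempts` keeps a row with few suitable partners from stalling the whole run.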
generate_partial_positive_pairs
generate_partial_positive_pairs(i: int, df: DataFrame, synopses_columns: List[str], partial_threshold: float, max_partial_per_row: int, category_to_embedding: Dict[str, NDArray[float64]], valid_indices: List[int], max_attempts: int = 200) -> List[InputExample]
Generate partial positive pairs for a single row in the dataframe.
PARAMETER | DESCRIPTION |
---|---|
i | Index of the row to process. TYPE: int |
df | DataFrame containing the data. TYPE: DataFrame |
synopses_columns | List of column names containing synopses. TYPE: List[str] |
partial_threshold | Minimum similarity threshold for partial positives. TYPE: float |
max_partial_per_row | Maximum number of partial positive pairs per row. TYPE: int |
category_to_embedding | Category embedding dict. TYPE: Dict[str, NDArray[float64]] |
valid_indices | List of valid row indices to sample from. TYPE: List[int] |
max_attempts | Max attempts to find pairs. Defaults to 200. TYPE: int |
RETURNS | DESCRIPTION |
---|---|
List[InputExample] | List of partial positive pairs with similarity between partial_threshold+0.01 and 0.8. Each pair consists of synopses from different entries where the shorter synopsis is at least 50% the length of the longer one. |
Source code in src/training/data/pair_generation.py