cerebras.modelzoo.data_preparation.nlp.data_dedup.generate_duplicate_pairs
This script generates duplicate pairs.
It includes some functions from the datasketch library for calculating the range and bands, namely _false_positive_probability, _false_negative_probability and optimal_param. The original source code can be found at: https://github.com/ekzhu/datasketch/blob/master/datasketch/lsh.py#L24
Functions
Compute the optimal MinHashLSH parameter that minimizes the weighted sum of the probabilities of false positives and false negatives.
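The parameter search described above can be illustrated with a minimal, standard-library-only sketch. It is not the module's code: the function names without a leading underscore, the `fp_weight` and `fn_weight` argument names, and the midpoint-rule helper `_integrate` are assumptions made for this example, and the midpoint rule is swapped in here for whatever numerical integration the original uses. It assumes the usual LSH banding model, in which two items with Jaccard similarity `s` become a candidate pair with probability `1 - (1 - s^r)^b` for `b` bands of `r` rows each.

```python
def _integrate(f, a, b, steps=200):
    """Midpoint-rule approximation of the integral of f over [a, b].

    Hypothetical helper for this sketch only.
    """
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def false_positive_probability(threshold, b, r):
    # Probability mass of pairs BELOW the threshold that still collide
    # in at least one band: integral of 1 - (1 - s^r)^b over [0, threshold].
    return _integrate(lambda s: 1.0 - (1.0 - s ** r) ** b, 0.0, threshold)


def false_negative_probability(threshold, b, r):
    # Probability mass of pairs ABOVE the threshold that collide in no band:
    # integral of (1 - s^r)^b over [threshold, 1].
    return _integrate(lambda s: (1.0 - s ** r) ** b, threshold, 1.0)


def optimal_param(threshold, num_perm, fp_weight=0.5, fn_weight=0.5):
    """Exhaustively search (b, r) with b * r <= num_perm.

    Returns the pair minimizing the weighted sum of the false positive
    and false negative probabilities.
    """
    best = (1, 1)
    min_error = float("inf")
    for b in range(1, num_perm + 1):
        max_r = num_perm // b  # rows per band cannot exceed the hash budget
        for r in range(1, max_r + 1):
            fp = false_positive_probability(threshold, b, r)
            fn = false_negative_probability(threshold, b, r)
            error = fp * fp_weight + fn * fn_weight
            if error < min_error:
                min_error = error
                best = (b, r)
    return best


if __name__ == "__main__":
    bands, rows = optimal_param(threshold=0.5, num_perm=32)
    print(f"bands={bands}, rows per band={rows}")
```

Raising `fp_weight` relative to `fn_weight` in this sketch penalizes spurious candidate pairs more heavily, while raising `fn_weight` favors recall.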