word_list_matching
- word_list_matching.debug_print(*args, doprint=True)
Prints arguments if the debug flag is set and doprint is True.
- Parameters:
args – Variable length argument list.
doprint (bool) – Boolean flag to control printing.
- Raises:
TypeError – if doprint is not a boolean.
- word_list_matching.main()
Main function to interact with the user and calculate naturalness.
Gets user input, tokenizes it, evaluates naturalness, and exports results.
- class word_list_matching.NaturalnessCalculator(wordlist_file='./data/words/words', excel_file=None)
Calculates the naturalness of words and identifiers.
- Parameters:
wordlist_file (str) – Path to the word list file.
excel_file (str or None) – Path to the Excel file containing pre-calculated word vectors.
- score_result(self, best_match_dict)
Calculates a score based on edit distance and ambiguity.
- Parameters:
best_match_dict (dict) – Dictionary containing best match information.
- Returns:
The calculated score.
- Return type:
float
- assign_label_from_score(self, score)
Assigns a naturalness label based on the score.
- Parameters:
score (float) – The calculated naturalness score.
- Returns:
The naturalness label (N1, N2, or N3).
- Return type:
str
- tokenize_identifier(self, identifier, camel_case=True)
Tokenizes an identifier string.
- Parameters:
identifier (str) – The identifier string.
camel_case (bool) – Whether to split on camel case.
- Returns:
A list of tokens.
- Return type:
list[str]
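The following is a minimal sketch of how identifier tokenization of this kind might work, not the library's implementation. The function name `tokenize_identifier_sketch`, the delimiter set, and the regular expression are all assumptions made for illustration.

```python
import re


def tokenize_identifier_sketch(identifier, camel_case=True):
    """Hypothetical stand-in for tokenize_identifier (assumed behavior)."""
    # Split on common delimiters first: underscore, hyphen, dot, whitespace.
    parts = [p for p in re.split(r"[_\-.\s]+", identifier) if p]
    if not camel_case:
        return parts
    tokens = []
    for part in parts:
        # Alternatives, in order: an acronym followed by a capitalized word,
        # a (possibly capitalized) lowercase word, a trailing acronym, digits.
        tokens.extend(
            re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
        )
    return tokens
```

With these assumptions, `tokenize_identifier_sketch("getHTTPResponse_code2")` yields `["get", "HTTP", "Response", "code", "2"]`, while passing `camel_case=False` splits only on the delimiters.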
- get_best_match(self, token=None, result_dict=None)
Retrieves the best match for a token.
- Parameters:
token (str or None) – The token to find the best match for.
result_dict (dict or None) – The result dictionary from evaluate_naturalness.
- Returns:
A dictionary containing the best match information.
- Return type:
dict
- Raises:
ValueError – if neither token nor result_dict is provided.
- calculate_score_and_distance_of_delim_and_composite(self, identifier)
Calculates score and distance for delimited and composite identifiers.
- Parameters:
identifier (str) – The identifier string.
- Returns:
A tuple containing a list of words and a dictionary of scores.
- Return type:
tuple[list, dict]
- _mp_batch_function(self, identifier_sublist, mp_result_dict, p_ix)
Batch function for multiprocessing score calculation.
- Parameters:
identifier_sublist (list[str]) – A sublist of identifiers.
mp_result_dict (dict) – A shared dictionary to store results.
p_ix (int) – The process index.
- mp_calculate_score_and_distance_of_delim_and_composite(self, identifier_list, num_processes=16)
Calculates scores and distances using multiprocessing.
- Parameters:
identifier_list (list[str]) – A list of identifiers.
num_processes (int) – The number of processes to use.
- Returns:
A list of identifiers.
- Return type:
list[str]
- get_min_distance_for_single_word(self, word)
Calculates the minimum edit distance for a single word.
- Parameters:
word (str) – The word to calculate the distance for.
- Returns:
The minimum edit distance.
- Return type:
int
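To make "minimum edit distance" concrete, here is a small sketch that computes the Levenshtein distance between a word and each entry of a word list and keeps the smallest value. The helper names `levenshtein` and `min_distance` and the explicit `dictionary_words` argument are assumptions; the real method presumably reads the word list loaded by the class.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def min_distance(word, dictionary_words):
    """Smallest edit distance from `word` to any word in `dictionary_words`."""
    return min(levenshtein(word, candidate) for candidate in dictionary_words)
```

For example, `min_distance("cnt", ["count", "cat", "dog"])` is 1, because one substitution turns "cnt" into "cat".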
- distance_batch_function(self, word_sublist, mp_result_dict)
Batch function for multiprocessing distance calculation.
- Parameters:
word_sublist (list[str]) – A sublist of words.
mp_result_dict (dict) – A shared dictionary to store results.
- mp_get_min_distance_for_word_list(self, word_list, num_processes=16)
Calculates minimum distances for a word list using multiprocessing.
- Parameters:
word_list (list[str]) – A list of words.
num_processes (int) – The number of processes to use.
- Returns:
A dictionary of minimum distances for each word.
- Return type:
dict
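The batch pattern described here (split the input into sublists, have each worker write into a shared dictionary) can be sketched as below. This is an illustration only: `split_into_batches` and `batch_function_sketch` are hypothetical names, the batches run sequentially rather than in separate processes, and a plain `dict` stands in for what would likely be a `multiprocessing.Manager().dict()` in a real multiprocess run.

```python
def split_into_batches(items, num_batches):
    """Distribute items round-robin into at most `num_batches` sublists."""
    num_batches = max(1, min(num_batches, len(items)))
    return [items[i::num_batches] for i in range(num_batches)]


def batch_function_sketch(word_sublist, result_dict, distance_fn):
    """Store distance_fn(word) for every word of one batch in result_dict."""
    for word in word_sublist:
        result_dict[word] = distance_fn(word)


# Sequential stand-in for the worker processes (assumption: each batch
# would normally be handed to its own process).
words = ["alpha", "be", "gamma", "pi", "omega"]
results = {}
for batch in split_into_batches(words, num_batches=2):
    batch_function_sketch(batch, results, distance_fn=len)  # len is a placeholder metric
```

Because every worker writes under distinct keys, the merged dictionary has one entry per input word regardless of how the batches are scheduled.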
- find_most_natural_in_composite(self, identifier, best_score=1000000, word_list=None, score_dict=None, reparse_distance=4, min_word_len=3)
Finds the most natural words in a composite identifier.
- Parameters:
identifier (str) – The identifier string.
best_score (int) – The best score found so far.
word_list (list or None) – A list to store the most natural words.
score_dict (dict or None) – A dictionary to store scores for each word.
reparse_distance (int) – The distance threshold for reparsing.
min_word_len (int) – The minimum word length to consider.
- Returns:
A tuple containing the word list and score dictionary.
- Return type:
tuple[list, dict]
- evaluate_naturalness(self, word)
Evaluates the naturalness of a word.
- Parameters:
word (str) – The word to evaluate.
- Returns:
A dictionary containing naturalness information.
- Return type:
dict
- iterate_possible_words(self, abbreviation)
Generates possible word pairs from an abbreviation.
- Parameters:
abbreviation (str) – The abbreviation string.
- Returns:
A list of word pairs.
- Return type:
list[tuple[str, str]]
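One plausible reading of "possible word pairs" is every way of cutting the abbreviation into a left and a right part. The sketch below follows that reading; the name `split_pairs` and the `min_len` parameter are assumptions, and the actual method may generate pairs differently.

```python
def split_pairs(abbreviation, min_len=1):
    """Return every (left, right) split with both parts at least min_len long."""
    return [
        (abbreviation[:i], abbreviation[i:])
        for i in range(min_len, len(abbreviation) - min_len + 1)
    ]
```

Under this reading, `split_pairs("abc")` returns `[("a", "bc"), ("ab", "c")]`, and a string too short to split returns an empty list.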
- get_levenshstein_dist_dict(self, word, word_df)
Calculates Levenshtein distances between a word and a DataFrame of words.
- Parameters:
word (str) – The word to calculate distances from.
word_df (pd.DataFrame) – The DataFrame of words.
- Returns:
A tuple containing a dictionary of distances and a dictionary of statistics.
- Return type:
tuple[dict, dict]
- compare_letter_order(self, short_word, long_word, first_letter_match=True)
Compares the letter order of two words.
- Parameters:
short_word (str) – The shorter word.
long_word (str) – The longer word.
first_letter_match (bool) – Whether the first letters must match.
- Returns:
True if the letters are in order, False otherwise.
- Return type:
bool
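A letter-order comparison like this is commonly implemented as a subsequence test: every letter of the short word must appear in the long word, in the same relative order. The sketch below assumes that interpretation; `letters_in_order` is a hypothetical name, and the treatment of empty strings is a choice made for this example.

```python
def letters_in_order(short_word, long_word, first_letter_match=True):
    """True if short_word is a subsequence of long_word (assumed semantics)."""
    if not short_word:
        return True  # an empty word is trivially in order
    if first_letter_match and (not long_word or short_word[0] != long_word[0]):
        return False
    # Consuming a single iterator enforces left-to-right order: each
    # membership test resumes where the previous one stopped.
    remaining = iter(long_word)
    return all(ch in remaining for ch in short_word)
```

Here `letters_in_order("cnt", "count")` is true, while `letters_in_order("nt", "count")` is false unless `first_letter_match=False` is passed.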
- get_words_with_letters_in_order(self, word, word_df)
Filters a DataFrame for words with letters in the same order as the given word.
- Parameters:
word (str) – The word to compare against.
word_df (pd.DataFrame) – The DataFrame of words.
- Returns:
A filtered DataFrame.
- Return type:
pd.DataFrame
- get_words_with_same_letters(self, word, word_df)
Filters a DataFrame for words with the same letters as the given word.
- Parameters:
word (str) – The word to compare against.
word_df (pd.DataFrame) – The DataFrame of words.
- Returns:
A filtered DataFrame.
- Return type:
pd.DataFrame
- create_word_letter_freq_vectors(self, filename='./words/words', min_word_length=3)
Creates word letter frequency vectors.
- Parameters:
filename (str) – The path to the word list file.
min_word_length (int) – The minimum word length to consider.
- Returns:
A DataFrame containing word letter frequency vectors.
- Return type:
pd.DataFrame
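As an illustration of what a letter frequency vector could look like, the sketch below counts each of the 26 ASCII letters in a word. It uses plain dictionaries and lists instead of a DataFrame to stay dependency-free; `letter_freq_vector`, `build_vectors`, and the 26-slot layout are assumptions, not the documented format.

```python
import string
from collections import Counter


def letter_freq_vector(word):
    """Counts of 'a'..'z' in word, case-insensitive; other characters ignored."""
    counts = Counter(ch for ch in word.lower() if ch in string.ascii_lowercase)
    return [counts[ch] for ch in string.ascii_lowercase]


def build_vectors(words, min_word_length=3):
    """Map each sufficiently long word to its letter frequency vector."""
    return {w: letter_freq_vector(w) for w in words if len(w) >= min_word_length}
```

Two words built from the same letters, such as "tea" and "eat", map to identical vectors, which is what makes this representation convenient for letter-based filtering.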
- default_value(self)
Returns the default value for dictionaries (0).
- Returns:
0
- Return type:
int
- default_value_list(self)
Returns the default value for dictionaries (empty list).
- Returns:
[]
- Return type:
list
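Functions that return a fixed default are typically passed as factories to `collections.defaultdict`. The sketch below shows that usage with module-level stand-ins named `zero_default` and `empty_list_default` (hypothetical names; whether the class uses its two methods this way is an assumption).

```python
from collections import defaultdict


def zero_default():
    """Stand-in for default_value: default for numeric entries."""
    return 0


def empty_list_default():
    """Stand-in for default_value_list: a fresh list for each missing key."""
    return []


counts = defaultdict(zero_default)
groups = defaultdict(empty_list_default)
for token in ["get", "user", "get"]:
    counts[token] += 1                 # missing keys start at 0
    groups[len(token)].append(token)   # missing keys start as []
```

One plausible reason for named factories instead of lambdas is that the standard `pickle` module cannot serialize lambdas, which matters when a `defaultdict` has to cross process boundaries. This is a guess about the design, not something the documentation states.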