word_list_matching

word_list_matching.debug_print(*args, doprint=True)

Prints arguments if the debug flag is set and doprint is True.

Parameters:
  • args – Variable length argument list.

  • doprint (bool) – Boolean flag to control printing.

Raises:

TypeError – if doprint is not a boolean.
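
The gating behaviour described above can be sketched as follows. This is a minimal illustration, not the module's code: the module-level name DEBUG is hypothetical (the actual debug flag is not documented here), and the boolean return value is added only to make the sketch easy to check.

```python
DEBUG = True  # hypothetical flag name; the real module's flag is not documented


def debug_print(*args, doprint=True):
    """Print args only when DEBUG is set and doprint is True."""
    if not isinstance(doprint, bool):
        raise TypeError("doprint must be a boolean")
    if DEBUG and doprint:
        print(*args)
        return True  # returned only so the sketch is testable
    return False


debug_print("token:", "cnt")         # printed
debug_print("noisy", doprint=False)  # suppressed
```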

word_list_matching.main()

Main function to interact with the user and calculate naturalness.

Gets user input, tokenizes it, evaluates naturalness, and exports results.

class word_list_matching.NaturalnessCalculator(wordlist_file='./data/words/words', excel_file=None)

Calculates the naturalness of words and identifiers.

Parameters:
  • wordlist_file (str) – Path to the word list file.

  • excel_file (str or None) – Path to the Excel file containing pre-calculated word vectors.

score_result(self, best_match_dict)

Calculates a score based on edit distance and ambiguity.

Parameters:

best_match_dict (dict) – Dictionary containing best match information.

Returns:

The calculated score.

Return type:

float

assign_label_from_score(self, score)

Assigns a naturalness label based on the score.

Parameters:

score (float) – The calculated naturalness score.

Returns:

The naturalness label (N1, N2, or N3).

Return type:

str

tokenize_identifier(self, identifier, camel_case=True)

Tokenizes an identifier string.

Parameters:
  • identifier (str) – The identifier string.

  • camel_case (bool) – Whether to split on camel case.

Returns:

A list of tokens.

Return type:

list[str]
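
One way such a tokenizer might work is shown below. The function name sketch_tokenize, the choice of delimiters (underscore, hyphen, whitespace), and the treatment of acronyms and digits are all assumptions for illustration; the actual method may split differently.

```python
import re


def sketch_tokenize(identifier, camel_case=True):
    """Split an identifier on delimiters and, optionally, on camel case."""
    # Assumed delimiters: underscores, hyphens, and whitespace.
    parts = [p for p in re.split(r"[_\-\s]+", identifier) if p]
    if not camel_case:
        return parts
    tokens = []
    for part in parts:
        # Acronym runs, then (capitalized) lowercase words, then digit runs.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return tokens


print(sketch_tokenize("getUserName"))
print(sketch_tokenize("max_word_len", camel_case=False))
```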

get_best_match(self, token=None, result_dict=None)

Retrieves the best match for a token.

Parameters:
  • token (str or None) – The token to find the best match for.

  • result_dict (dict or None) – The result dictionary from evaluate_naturalness.

Returns:

A dictionary containing the best match information.

Return type:

dict

Raises:

ValueError – if neither token nor result_dict is provided.

calculate_score_and_distance_of_delim_and_composite(self, identifier)

Calculates score and distance for delimited and composite identifiers.

Parameters:

identifier (str) – The identifier string.

Returns:

A tuple containing a list of words and a dictionary of scores.

Return type:

tuple[list, dict]

_mp_batch_function(self, identifier_sublist, mp_result_dict, p_ix)

Batch function for multiprocessing score calculation.

Parameters:
  • identifier_sublist (list[str]) – A sublist of identifiers.

  • mp_result_dict (dict) – A shared dictionary to store results.

  • p_ix (int) – The process index.

mp_calculate_score_and_distance_of_delim_and_composite(self, identifier_list, num_processes=16)

Calculates scores and distances using multiprocessing.

Parameters:
  • identifier_list (list[str]) – A list of identifiers.

  • num_processes (int) – The number of processes to use.

Returns:

A list of identifiers.

Return type:

list[str]

get_min_distance_for_single_word(self, word)

Calculates the minimum edit distance for a single word.

Parameters:

word (str) – The word to calculate the distance for.

Returns:

The minimum edit distance.

Return type:

int
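
For readers unfamiliar with edit distance, the sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and takes the minimum over a vocabulary. The helper names and the in-memory vocabulary list are assumptions; the class presumably draws its candidates from the configured word list file.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def min_distance(word, vocabulary):
    """Smallest edit distance from word to any vocabulary entry."""
    return min(levenshtein(word, candidate) for candidate in vocabulary)


print(min_distance("cnt", ["count", "cat", "dog"]))
```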

distance_batch_function(self, word_sublist, mp_result_dict)

Batch function for multiprocessing distance calculation.

Parameters:
  • word_sublist (list[str]) – A sublist of words.

  • mp_result_dict (dict) – A shared dictionary to store results.

mp_get_min_distance_for_word_list(self, word_list, num_processes=16)

Calculates minimum distances for a word list using multiprocessing.

Parameters:
  • word_list (list[str]) – A list of words.

  • num_processes (int) – The number of processes to use.

Returns:

A dictionary of minimum distances for each word.

Return type:

dict
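
The batching pattern implied by word_sublist and mp_result_dict can be sketched as below. To keep the example deterministic it runs the batches sequentially into a plain dict; in a real multiprocessing run the dictionary would typically be a multiprocessing.Manager().dict() and each batch would be handed to its own multiprocessing.Process. The helper names and the distance_fn parameter are hypothetical.

```python
def split_into_batches(items, num_batches):
    """Split items into at most num_batches contiguous, near-equal sublists."""
    num_batches = max(1, min(num_batches, len(items)))
    size, extra = divmod(len(items), num_batches)
    batches, start = [], 0
    for i in range(num_batches):
        # The first `extra` batches take one additional item.
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches


def distance_batch(word_sublist, result_dict, distance_fn):
    """Store distance_fn(word) for every word of one batch."""
    for word in word_sublist:
        result_dict[word] = distance_fn(word)


words = ["cnt", "idx", "count", "buffer", "tmp"]
results = {}
for batch in split_into_batches(words, 2):
    distance_batch(batch, results, len)  # len is a stand-in distance function
```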

find_most_natural_in_composite(self, identifier, best_score=1000000, word_list=None, score_dict=None, reparse_distance=4, min_word_len=3)

Finds the most natural words in a composite identifier.

Parameters:
  • identifier (str) – The identifier string.

  • best_score (int) – The best score found so far.

  • word_list (list or None) – A list to store the most natural words.

  • score_dict (dict or None) – A dictionary to store scores for each word.

  • reparse_distance (int) – The distance threshold for reparsing.

  • min_word_len (int) – The minimum word length to consider.

Returns:

A tuple containing the word list and score dictionary.

Return type:

tuple[list, dict]

evaluate_naturalness(self, word)

Evaluates the naturalness of a word.

Parameters:

word (str) – The word to evaluate.

Returns:

A dictionary containing naturalness information.

Return type:

dict

iterate_possible_words(self, abbreviation)

Generates possible word pairs from an abbreviation.

Parameters:

abbreviation (str) – The abbreviation string.

Returns:

A list of word pairs.

Return type:

list[tuple[str, str]]
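
A plausible reading of "word pairs" is every way of cutting the abbreviation into a non-empty prefix and suffix. The sketch below implements that reading only; whether the actual method enumerates pairs this way is an assumption, and split_pairs is a hypothetical name.

```python
def split_pairs(abbreviation):
    """Return every (prefix, suffix) split with both parts non-empty."""
    return [(abbreviation[:i], abbreviation[i:])
            for i in range(1, len(abbreviation))]


print(split_pairs("abc"))
```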

get_levenshstein_dist_dict(self, word, word_df)

Calculates the Levenshtein distance between a word and each word in a DataFrame.

Parameters:
  • word (str) – The word to calculate distances from.

  • word_df (pd.DataFrame) – The DataFrame of words.

Returns:

A tuple containing a dictionary of distances and a dictionary of statistics.

Return type:

tuple[dict, dict]

compare_letter_order(self, short_word, long_word, first_letter_match=True)

Compares the letter order of two words.

Parameters:
  • short_word (str) – The shorter word.

  • long_word (str) – The longer word.

  • first_letter_match (bool) – Whether the first letters must match.

Returns:

True if the letters of short_word appear in the same order in long_word, False otherwise.

Return type:

bool
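
The check described above resembles a subsequence test, which can be sketched as follows. The function name letters_in_order and the handling of an empty short_word (treated as a match) are assumptions for illustration.

```python
def letters_in_order(short_word, long_word, first_letter_match=True):
    """True if short_word's letters occur in long_word in the same order."""
    if not short_word:
        return True  # assumed behaviour for an empty abbreviation
    if first_letter_match and (not long_word or short_word[0] != long_word[0]):
        return False
    remaining = iter(long_word)
    # `letter in remaining` consumes the iterator up to the first match,
    # so each later letter must be found after the previous one.
    return all(letter in remaining for letter in short_word)


print(letters_in_order("cnt", "count"))
print(letters_in_order("nt", "count", first_letter_match=False))
```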

get_words_with_letters_in_order(self, word, word_df)

Filters a DataFrame for words with letters in the same order as the given word.

Parameters:
  • word (str) – The word to compare against.

  • word_df (pd.DataFrame) – The DataFrame of words.

Returns:

A filtered DataFrame.

Return type:

pd.DataFrame

get_words_with_same_letters(self, word, word_df)

Filters a DataFrame for words with the same letters as the given word.

Parameters:
  • word (str) – The word to compare against.

  • word_df (pd.DataFrame) – The DataFrame of words.

Returns:

A filtered DataFrame.

Return type:

pd.DataFrame

create_word_letter_freq_vectors(self, filename='./words/words', min_word_length=3)

Creates word letter frequency vectors.

Parameters:
  • filename (str) – The path to the word list file.

  • min_word_length (int) – The minimum word length to consider.

Returns:

A DataFrame containing word letter frequency vectors.

Return type:

pd.DataFrame
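
A letter frequency vector can be pictured as 26 counts, one per letter of the alphabet. The sketch below builds such vectors with the standard library and a plain dict rather than a DataFrame; the helper names, the restriction to lowercase ASCII letters, and the in-memory word list are assumptions.

```python
import string
from collections import Counter


def letter_frequency_vector(word):
    """Return 26 counts, one per letter a-z, for the given word."""
    counts = Counter(word.lower())
    return [counts[letter] for letter in string.ascii_lowercase]


def build_vectors(words, min_word_length=3):
    """Map each sufficiently long word to its letter frequency vector."""
    return {w: letter_frequency_vector(w)
            for w in words if len(w) >= min_word_length}


vectors = build_vectors(["an", "ant", "tan"])
```

Words with identical vectors, such as "ant" and "tan" here, contain exactly the same letters, which is the kind of comparison a same-letters filter could rely on.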

default_value(self)

Returns the default value for dictionaries (0).

Returns:

0

Return type:

int

default_value_list(self)

Returns the default value for dictionaries (empty list).

Returns:

[]

Return type:

list
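
Functions like these are commonly passed as the default_factory of a collections.defaultdict. The sketch below shows that pattern with module-level stand-ins; that the class uses them this way is an assumption. One possible motivation for named functions over lambdas is that they can be pickled, which matters when such dictionaries are passed between processes.

```python
from collections import defaultdict


def default_value():
    return 0


def default_value_list():
    return []


counts = defaultdict(default_value)       # missing keys start at 0
groups = defaultdict(default_value_list)  # missing keys start as []

counts["cnt"] += 1
groups["c"].append("cnt")
```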