word_list_matching

word_list_matching.debug_print(*args, doprint=True)

Prints arguments if the debug flag is set and doprint is True.

Parameters:

args – Variable length argument list.
doprint (bool) – Boolean flag to control printing.

Raises:

TypeError – if doprint is not a boolean.
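
The behaviour described above can be sketched as follows. This is an illustration only, not the module's source: the flag name DEBUG and the function name debug_print_sketch are assumptions, since the page does not say how the debug flag is stored.

```python
# Hypothetical module-level flag; the real flag's name is not documented here.
DEBUG = True


def debug_print_sketch(*args, doprint=True):
    """Print args only when DEBUG is set and doprint is True."""
    # Reject non-boolean values, matching the documented TypeError.
    if not isinstance(doprint, bool):
        raise TypeError("doprint must be a boolean")
    if DEBUG and doprint:
        print(*args)


debug_print_sketch("token:", "customer")      # printed
debug_print_sketch("hidden", doprint=False)   # suppressed
```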

word_list_matching.main()

Main function to interact with the user and calculate naturalness.

Gets user input, tokenizes it, evaluates naturalness, and exports results.

class word_list_matching.NaturalnessCalculator(wordlist_file='./data/words/words', excel_file=None)

Calculates the naturalness of words and identifiers.

Parameters:

wordlist_file (str) – Path to the word list file.
excel_file (str or None) – Path to the Excel file containing pre-calculated word vectors.

score_result(self, best_match_dict)

Calculates a score based on edit distance and ambiguity.

Parameters:

best_match_dict (dict) – Dictionary containing best match information.

Returns:

The calculated score.

Return type:

float

assign_label_from_score(self, score)

Assigns a naturalness label based on the score.

Parameters:

score (float) – The calculated naturalness score.

Returns:

The naturalness label (N1, N2, or N3).

Return type:

str

tokenize_identifier(self, identifier, camel_case=True)

Tokenizes an identifier string.

Parameters:

identifier (str) – The identifier string.
camel_case (bool) – Whether to split on camel case.

Returns:

A list of tokens.

Return type:

list[str]
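
A minimal sketch of identifier tokenization with optional camel-case splitting. The delimiter set, the regular expression, and the name tokenize_sketch are assumptions made for illustration; the documented method may split identifiers differently.

```python
import re


def tokenize_sketch(identifier, camel_case=True):
    """Split an identifier on delimiters and, optionally, camel-case boundaries."""
    # Assumed delimiters: underscore, hyphen, dot, and whitespace.
    parts = [p for p in re.split(r"[_\-\s.]+", identifier) if p]
    if not camel_case:
        return parts
    tokens = []
    for part in parts:
        # Acronym runs ("HTTP"), capitalised or lowercase words, digit runs.
        # Characters outside these classes are dropped by this sketch.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return tokens


print(tokenize_sketch("customerOrder_id"))
print(tokenize_sketch("customerOrder_id", camel_case=False))
```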

get_best_match(self, token=None, result_dict=None)

Retrieves the best match for a token.

Parameters:

token (str or None) – The token to find the best match for.
result_dict (dict or None) – The result dictionary from evaluate_naturalness.

Returns:

A dictionary containing the best match information.

Return type:

dict

Raises:

ValueError – if neither token nor result_dict is provided.

calculate_score_and_distance_of_delim_and_composite(self, identifier)

Calculates score and distance for delimited and composite identifiers.

Parameters:

identifier (str) – The identifier string.

Returns:

A tuple containing a list of words and a dictionary of scores.

Return type:

tuple[list, dict]

_mp_batch_function(self, identifier_sublist, mp_result_dict, p_ix)

Batch function for multiprocessing score calculation.

Parameters:

identifier_sublist (list[str]) – A sublist of identifiers.
mp_result_dict (dict) – A shared dictionary to store results.
p_ix (int) – The process index.

mp_calculate_score_and_distance_of_delim_and_composite(self, identifier_list, num_processes=16)

Calculates scores and distances using multiprocessing.

Parameters:

identifier_list (list[str]) – A list of identifiers.
num_processes (int) – The number of processes to use.

Returns:

A list of identifiers.

Return type:

list[str]

get_min_distance_for_single_word(self, word)

Calculates the minimum edit distance for a single word.

Parameters:

word (str) – The word to calculate the distance for.

Returns:

The minimum edit distance.

Return type:

int
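
To make "minimum edit distance" concrete, here is a self-contained sketch that computes the Levenshtein distance from a word to every entry of a small vocabulary and keeps the smallest. The function names and the in-memory vocabulary are assumptions; the documented method works against the word list loaded by the class, and its exact distance routine is not shown on this page.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def min_distance_sketch(word, vocabulary):
    """Smallest edit distance between word and any vocabulary entry."""
    return min(levenshtein(word, candidate) for candidate in vocabulary)


vocabulary = ["customer", "order", "item"]  # illustrative data only
print(min_distance_sketch("custmer", vocabulary))
```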

distance_batch_function(self, word_sublist, mp_result_dict)

Batch function for multiprocessing distance calculation.

Parameters:

word_sublist (list[str]) – A sublist of words.
mp_result_dict (dict) – A shared dictionary to store results.

mp_get_min_distance_for_word_list(self, word_list, num_processes=16)

Calculates minimum distances for a word list using multiprocessing.

Parameters:

word_list (list[str]) – A list of words.
num_processes (int) – The number of processes to use.

Returns:

A dictionary of minimum distances for each word.

Return type:

dict
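
The multiprocessing helpers take a full list plus num_processes, while the batch functions take a sublist, which suggests the list is divided into one sublist per worker. The sketch below shows one way such a split could be done; the helper name split_into_batches and the near-equal-size strategy are assumptions, and no processes are started here.

```python
def split_into_batches(items, num_batches):
    """Split items into at most num_batches contiguous, near-equal sublists."""
    if num_batches < 1:
        raise ValueError("num_batches must be at least 1")
    size, extra = divmod(len(items), num_batches)
    batches = []
    start = 0
    for i in range(num_batches):
        # The first `extra` batches take one additional item.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty batches when items are scarce
            batches.append(items[start:end])
        start = end
    return batches


words = ["cust", "ordr", "itm", "qty", "addr", "amt", "desc"]
for batch in split_into_batches(words, 3):
    print(batch)
```

In a real run each sublist would be handed to its own worker, which writes its results into the shared dictionary.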

find_most_natural_in_composite(self, identifier, best_score=1000000, word_list=None, score_dict=None, reparse_distance=4, min_word_len=3)

Finds the most natural words in a composite identifier.

Parameters:

identifier (str) – The identifier string.
best_score (int) – The best score found so far.
word_list (list or None) – A list to store the most natural words.
score_dict (dict or None) – A dictionary to store scores for each word.
reparse_distance (int) – The distance threshold for reparsing.
min_word_len (int) – The minimum word length to consider.

Returns:

A tuple containing the word list and score dictionary.

Return type:

tuple[list, dict]

evaluate_naturalness(self, word)

Evaluates the naturalness of a word.

Parameters:

word (str) – The word to evaluate.

Returns:

A dictionary containing naturalness information.

Return type:

dict

iterate_possible_words(self, abbreviation)

Generates possible word pairs from an abbreviation.

Parameters:

abbreviation (str) – The abbreviation string.

Returns:

A list of word pairs.

Return type:

list[tuple[str, str]]
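
The page does not say how the word pairs are formed. One plausible reading, sketched here purely as an assumption, is that the abbreviation is cut at every position into a (prefix, suffix) pair. The function name and the min_len parameter are hypothetical.

```python
def possible_splits_sketch(abbreviation, min_len=1):
    """Return every (prefix, suffix) split with both parts at least min_len long."""
    return [
        (abbreviation[:i], abbreviation[i:])
        for i in range(min_len, len(abbreviation) - min_len + 1)
    ]


print(possible_splits_sketch("custid", min_len=2))
```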

get_levenshstein_dist_dict(self, word, word_df)

Calculates Levenshtein distances between a word and a DataFrame of words.

Parameters:

word (str) – The word to calculate distances from.
word_df (pd.DataFrame) – The DataFrame of words.

Returns:

A tuple containing a dictionary of distances and a dictionary of statistics.

Return type:

tuple[dict, dict]

compare_letter_order(self, short_word, long_word, first_letter_match=True)

Compares the letter order of two words.

Parameters:

short_word (str) – The shorter word.
long_word (str) – The longer word.
first_letter_match (bool) – Whether the first letters must match.

Returns:

True if the letters are in order, False otherwise.

Return type:

bool
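
A letter-order comparison of this kind can be expressed as a subsequence test: every letter of the short word must appear in the long word, in the same order, with gaps allowed. The sketch below assumes that interpretation; the function name is hypothetical and the documented method may apply additional rules.

```python
def letters_in_order_sketch(short_word, long_word, first_letter_match=True):
    """True if short_word's letters occur in long_word in the same order."""
    # Optionally require both words to start with the same letter.
    if first_letter_match and short_word[:1] != long_word[:1]:
        return False
    remaining = iter(long_word)
    # `ch in remaining` consumes the iterator up to the first match,
    # so later letters can only match later positions.
    return all(ch in remaining for ch in short_word)


print(letters_in_order_sketch("cstmr", "customer"))
print(letters_in_order_sketch("ct", "scut"))
print(letters_in_order_sketch("ct", "scut", first_letter_match=False))
```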

get_words_with_letters_in_order(self, word, word_df)

Filters a DataFrame for words with letters in the same order as the given word.

Parameters:

word (str) – The word to compare against.
word_df (pd.DataFrame) – The DataFrame of words.

Returns:

A filtered DataFrame.

Return type:

pd.DataFrame

get_words_with_same_letters(self, word, word_df)

Filters a DataFrame for words with the same letters as the given word.

Parameters:

word (str) – The word to compare against.
word_df (pd.DataFrame) – The DataFrame of words.

Returns:

A filtered DataFrame.

Return type:

pd.DataFrame

create_word_letter_freq_vectors(self, filename='./words/words', min_word_length=3)

Creates word letter frequency vectors.

Parameters:

filename (str) – The path to the word list file.
min_word_length (int) – The minimum word length to consider.

Returns:

A DataFrame containing word letter frequency vectors.

Return type:

pd.DataFrame
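
A letter frequency vector counts how often each letter occurs in a word. The sketch below builds 26-element vectors with plain Python structures instead of a DataFrame so that it runs without pandas; the function names, the a-z alphabet, and the in-memory word list are assumptions, and the documented method reads its words from a file.

```python
from collections import Counter
import string


def letter_frequency_vector(word):
    """Return a 26-element list of counts for the letters a to z."""
    counts = Counter(word.lower())
    return [counts[letter] for letter in string.ascii_lowercase]


def build_vectors_sketch(words, min_word_length=3):
    """Map each sufficiently long word to its letter frequency vector."""
    return {
        word: letter_frequency_vector(word)
        for word in words
        if len(word) >= min_word_length
    }


vectors = build_vectors_sketch(["an", "listen", "silent"])
# Anagrams share a vector, which makes same-letter lookups cheap.
print(vectors["listen"] == vectors["silent"])
```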

default_value(self)

Returns the default value for dictionaries (0).

Returns:

0

Return type:

int

default_value_list(self)

Returns the default value for dictionaries (empty list).

Returns:

[]

Return type:

list
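
Functions like these are typically passed as the default factory of a collections.defaultdict. The sketch below shows that pattern with module-level functions; whether the class uses its methods this way is an assumption, as are the helper names count_tokens and group_by_length.

```python
from collections import defaultdict


def default_value():
    """Default for counting dictionaries."""
    return 0


def default_value_list():
    """Default for dictionaries that collect items per key."""
    return []


def count_tokens(tokens):
    counts = defaultdict(default_value)
    for token in tokens:
        counts[token] += 1  # missing keys start at 0
    return counts


def group_by_length(tokens):
    groups = defaultdict(default_value_list)
    for token in tokens:
        groups[len(token)].append(token)  # missing keys start as []
    return groups


print(dict(count_tokens(["id", "id", "name"])))
print(dict(group_by_length(["id", "no", "name"])))
```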
