load_consolidated_results¶

class ConsolidatedResultsLoader¶

Class for loading and storing the consolidated results of the NL-to-SQL annotation files, along with other analysis outputs such as token analysis and query statistics.

config_dict: dict¶: Dictionary containing the configuration parameters for the analysis

get_joined_dataframes(jointype='left')¶

Joins all of the loaded dataframes into a single dataframe. The join key is a composite of model, database name, naturalness level, and question number.

Parameters: jointype (str) – Type of join to perform. Defaults to a left join.
Returns: Joined dataframe.
Return type: pd.DataFrame
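
The composite-key join described above can be sketched with plain pandas. This is an illustration only, not the loader's actual implementation: the column names in `JOIN_KEYS`, the toy dataframes, and the `token_count` column are all hypothetical.

```python
import pandas as pd

# Hypothetical column names for the composite join key; the names used
# internally by ConsolidatedResultsLoader may differ.
JOIN_KEYS = ["model", "database", "naturalness", "question_number"]

# Toy stand-in for the annotation data (two rows).
annotations = pd.DataFrame({
    "model": ["model_a", "model_a"],
    "database": ["db_a", "db_a"],
    "naturalness": ["Regular", "Low"],
    "question_number": [1, 1],
    "is_correct": [True, False],
})

# Toy stand-in for one analysis output; it only covers the first row.
token_stats = pd.DataFrame({
    "model": ["model_a"],
    "database": ["db_a"],
    "naturalness": ["Regular"],
    "question_number": [1],
    "token_count": [12],
})

# A left join keeps every annotation row and fills unmatched columns with NaN.
joined_left = annotations.merge(token_stats, how="left", on=JOIN_KEYS)

# An inner join keeps only rows present in both dataframes.
joined_inner = annotations.merge(token_stats, how="inner", on=JOIN_KEYS)
```

With a left join, rows that have no counterpart in an analysis output are retained with missing values, which is usually what is wanted when the annotations are the primary table.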

load_prompt_tokens(file_directory='./data/tokenizer_analysis/')¶

Load the token analysis data generated by the tokenizer_analysis.ipynb workbook. The file contains all question prompts and their tokenizations as generated by each model used in the experiments.

Parameters: file_directory (str) – Directory where the files are stored.
Returns: pandas DataFrame containing all of the prompt token files.
Return type: pd.DataFrame

load_sentence_level_similarities(file_directory='./data/tokenizer_analysis/')¶

Load the sentence-level embedding similarity comparisons generated by the tokenizer_analysis.ipynb workbook. Scores are based on the cosine similarity between the semantic embeddings (SentenceTransformers) generated for an NL question and its corresponding gold query at each naturalness level.

Parameters: file_directory (str) – Directory where the files are stored.
Returns: pandas DataFrame containing all of the question-query similarity scores for each naturalness level.
Return type: pd.DataFrame
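
For readers unfamiliar with the metric, the following is a minimal sketch of cosine similarity between two embedding vectors. The three-dimensional vectors are made up for illustration; real SentenceTransformers embeddings have hundreds of dimensions, and the workbook's own computation may differ. Cosine distance, where it is used, is simply one minus the similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for a question and its gold query.
question_embedding = [1.0, 0.0, 1.0]
query_embedding = [1.0, 0.0, 0.0]

similarity = cosine_similarity(question_embedding, query_embedding)
distance = 1.0 - similarity  # cosine distance is the complement of similarity
```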

load_world_level_similarities(file_directory='./data/tokenizer_analysis/')¶

Load the word-to-identifier embedding similarity comparisons generated by the tokenizer_analysis.ipynb workbook. Scores are based on the highest cosine similarity between a schema identifier and any word in the natural language question.

Parameters: file_directory (str) – Directory where the files are stored.
Returns: pandas DataFrame containing all of the identifier-word similarity scores for each naturalness level.
Return type: pd.DataFrame
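
The "highest similarity" scoring can be illustrated as follows: for one schema identifier, compare its embedding against every question word and keep the best match. This is a sketch under assumed inputs; the `best_match` helper and the two-dimensional toy vectors are hypothetical and not taken from the workbook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(identifier_vec, word_vecs):
    """Return (word, score) for the question word most similar to the identifier."""
    scored = ((word, cosine_similarity(identifier_vec, vec))
              for word, vec in word_vecs.items())
    return max(scored, key=lambda pair: pair[1])

# Hypothetical embeddings: one schema identifier and two question words.
identifier_embedding = [1.0, 0.2]
question_word_embeddings = {
    "student": [0.9, 0.3],
    "count": [0.0, 1.0],
}

word, score = best_match(identifier_embedding, question_word_embeddings)
```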

load_identifier_token_analysis_files(file_directory='./data/tokenizer_analysis/')¶

Load the token analysis data generated by the tokenizer_analysis.ipynb workbook. The file contains all database identifiers and their tokenizations as generated by each model used in the experiments.

Parameters: file_directory (str) – Directory where the files are stored.
Returns: pandas DataFrame containing all of the token files.
Return type: pd.DataFrame

load_question_token_analysis_files(file_directory='./data/tokenizer_analysis/')¶

Load the token analysis data generated by the tokenizer_analysis.ipynb workbook.

Parameters: file_directory (str) – Directory where the files are stored.
Returns: pandas DataFrame containing all of the token files.
Return type: pd.DataFrame

load_query_token_character_ratio_file(file_directory='./data/tokenizer_analysis')¶

Load the query-level mean token-to-character ratios generated by the tokenizer_analysis.ipynb workbook.

Parameters: file_directory (str) – Directory where the files are stored.
Returns: pandas DataFrame containing the mean ratios at the question-query pair level.
Return type: pd.DataFrame
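
As a rough illustration of what a token-to-character ratio measures, the sketch below divides the number of tokens by the number of characters for each identifier, then averages over the identifiers in one query. Both the per-identifier averaging and the example tokenizations are assumptions made for this example; the workbook may aggregate differently.

```python
from statistics import mean

def token_char_ratio(tokens):
    """Number of tokens divided by total characters (0.0 for empty input)."""
    n_chars = sum(len(token) for token in tokens)
    return len(tokens) / n_chars if n_chars else 0.0

# Hypothetical tokenizations of the identifiers that appear in one query.
query_identifier_tokens = {
    "name": ["name"],              # 1 token / 4 chars = 0.25
    "students": ["stud", "ents"],  # 2 tokens / 8 chars = 0.25
    "id": ["id"],                  # 1 token / 2 chars = 0.5
}

ratios = {ident: token_char_ratio(toks)
          for ident, toks in query_identifier_tokens.items()}
mean_ratio = mean(ratios.values())
```

A higher ratio means the tokenizer needs more tokens per character, which tends to happen with abbreviated or unusual identifiers.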

load_annotation_files(annotation_directory=None, database=None, remove_error_columns=True)¶

Load all of the human-validated NL-to-SQL annotation files into a single dataframe.

Parameters:

annotation_directory (str or None) – Directory where the annotation files are stored.
database (str or None) – Name of the database to filter by. If None, all databases are loaded.
remove_error_columns (bool) – Whether to exclude error classification data from the annotations.

Returns:

Pandas Dataframe containing all of the annotation files

Return type:

pd.DataFrame
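
The behavior of the `database` and `remove_error_columns` parameters can be approximated with the sketch below. It is not the loader's implementation: the `combine_annotations` helper, the `database` column name, and the `error_` column prefix are hypothetical, and in-memory dataframes stand in for the annotation files.

```python
import pandas as pd

def combine_annotations(frames, database=None, remove_error_columns=True):
    """Concatenate annotation frames, optionally filtering and dropping error columns."""
    combined = pd.concat(frames, ignore_index=True)
    if database is not None:
        # Keep only rows for the requested database (assumed column name).
        combined = combined[combined["database"] == database].reset_index(drop=True)
    if remove_error_columns:
        # Assumed convention: error classification columns start with "error_".
        error_cols = [c for c in combined.columns if c.startswith("error_")]
        combined = combined.drop(columns=error_cols)
    return combined

# Toy stand-ins for two annotation files.
frames = [
    pd.DataFrame({"database": ["db_a", "db_a"], "question_number": [1, 2],
                  "is_correct": [True, False], "error_type": [None, "join"]}),
    pd.DataFrame({"database": ["db_b"], "question_number": [1],
                  "is_correct": [True], "error_type": [None]}),
]

only_db_a = combine_annotations(frames, database="db_a")
everything = combine_annotations(frames, remove_error_columns=False)
```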

load_identifier_crosswalks(file_directory='./db/schema-xwalks/consolidated_and_validated', SBODemo_full=False)¶

Load a single dataframe containing identifier naturalness crosswalk data from all databases.

Parameters:

file_directory (str) – Directory where the files are stored.
SBODemo_full (bool) – Use crosswalk containing ALL SBODemo identifiers (as opposed to the benchmark subset)

Returns:

A single dataframe containing identifier naturalness crosswalk data from all databases

Return type:

pd.DataFrame

export_gold_data()¶: Exports gold data to an Excel file.
