load_consolidated_results

class ConsolidatedResultsLoader

Class for loading and storing the consolidated results of the NL-to-SQL annotation files, along with other analysis outputs such as token analysis and query statistics.

config_dict: dict

Dictionary containing the configuration parameters for the analysis

get_joined_dataframes(jointype='left')

Joins all of the loaded dataframes into a single dataframe. The join condition is a composite key of model, database name, naturalness level, and question number.

Parameters:

jointype (str) – Type of join to perform. Defaults to left join.

Returns:

Joined dataframe.

Return type:

pd.DataFrame
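The composite-key join described above can be illustrated with a minimal pandas sketch. The column names in KEYS, the sample frames, and the helper join_on_keys are assumptions made for illustration; they are not taken from the class itself.

```python
import pandas as pd

# Hypothetical key columns standing in for the composite join condition
# (model, database name, naturalness level, question number).
KEYS = ["model", "database", "naturalness", "question_number"]

# Toy stand-ins for two of the loaded dataframes.
annotations = pd.DataFrame({
    "model": ["gpt", "gpt", "llama"],
    "database": ["db1", "db1", "db1"],
    "naturalness": ["regular", "low", "regular"],
    "question_number": [1, 1, 1],
    "correct": [True, False, True],
})
similarities = pd.DataFrame({
    "model": ["gpt", "llama"],
    "database": ["db1", "db1"],
    "naturalness": ["regular", "regular"],
    "question_number": [1, 1],
    "similarity": [0.82, 0.79],
})


def join_on_keys(left, right, jointype="left"):
    """Merge two frames on the composite key using the requested join type."""
    return left.merge(right, on=KEYS, how=jointype)


# A left join keeps every annotation row; rows without a match get NaN.
joined = join_on_keys(annotations, similarities)
print(joined)
```

With a left join, rows that have no counterpart in the right-hand frame are kept and padded with missing values, which is why it is a cautious default when the annotation data is the reference table.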

load_prompt_tokens(file_directory='./data/tokenizer_analysis/')

Load the token analysis data generated by the tokenizer_analysis.ipynb workbook. The file contains all question prompts and their tokenizations generated by each model used in the experiments.

Parameters:

file_directory (str) – Directory where the files are stored.

Returns:

Pandas Dataframe containing all of the prompt token files

Return type:

pd.DataFrame

load_sentence_level_similarities(file_directory='./data/tokenizer_analysis/')

Load the sentence-level embedding similarity comparisons generated by the tokenizer_analysis.ipynb workbook. Scores are based on the cosine similarity between the semantic embeddings (SentenceTransformers) generated for an NL question and its corresponding gold query at each naturalness level.

Parameters:

file_directory (str) – Directory where the files are stored.

Returns:

Pandas Dataframe containing all of the question-query similarity scores for each naturalness level

Return type:

pd.DataFrame
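For readers unfamiliar with the score, the following is a minimal sketch of cosine similarity between two embedding vectors. The vectors are toy values chosen for illustration; in the workbook the embeddings come from SentenceTransformers, and the exact computation used there is not shown in this reference.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here: a zero vector matches nothing.
    return dot / (norm_a * norm_b)


# Toy embeddings (assumed values) for a question and its gold query.
question_embedding = [1.0, 2.0, 0.0]
query_embedding = [2.0, 4.0, 0.0]

score = cosine_similarity(question_embedding, query_embedding)
print(round(score, 6))
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0, regardless of their lengths.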

load_world_level_similarities(file_directory='./data/tokenizer_analysis/')

Load the word-identifier-level embedding similarity comparisons generated by the tokenizer_analysis.ipynb workbook. Scores are based on the highest cosine similarity between a schema identifier and any word in the natural language question.

Parameters:

file_directory (str) – Directory where the files are stored.

Returns:

Pandas Dataframe containing all of the identifier-word similarity scores for each naturalness level

Return type:

pd.DataFrame
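The "highest similarity" scoring can be sketched as a maximum over the words of a question. Everything below is illustrative: the word vectors, the identifier vector, and the helper best_matching_word are assumptions, not the workbook's actual code.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_matching_word(identifier_vec, word_vecs):
    """Return the (word, score) pair with the highest cosine similarity."""
    scored = (
        (word, cosine_similarity(identifier_vec, vec))
        for word, vec in word_vecs.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Toy embeddings (assumed values) for the words of one question.
word_vectors = {
    "employee": [1.0, 0.0],
    "salary": [0.0, 1.0],
}
# Toy embedding for one hypothetical schema identifier.
identifier_vector = [0.6, 0.8]

word, score = best_matching_word(identifier_vector, word_vectors)
print(word, round(score, 3))
```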

load_identifier_token_analysis_files(file_directory='./data/tokenizer_analysis/')

Load the token analysis data generated by the tokenizer_analysis.ipynb workbook. The file contains all database identifiers and their tokenizations generated by each model used in the experiments.

Parameters:

file_directory (str) – Directory where the files are stored.

Returns:

Pandas Dataframe containing all of the token files

Return type:

pd.DataFrame

load_question_token_analysis_files(file_directory='./data/tokenizer_analysis/')

Load the token analysis data generated by the tokenizer_analysis.ipynb workbook

Parameters:

file_directory (str) – Directory where the files are stored.

Returns:

Pandas Dataframe containing all of the token files

Return type:

pd.DataFrame

load_query_token_character_ratio_file(file_directory='./data/tokenizer_analysis')

Load query-level mean token:char ratios generated by the tokenizer_analysis.ipynb workbook

Parameters:

file_directory (str) – Directory where the files are stored.

Returns:

Pandas Dataframe containing all of the ratio means at the question-query pair level

Return type:

pd.DataFrame
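As a rough illustration of a mean token:char ratio, the sketch below assumes the ratio is the number of tokens divided by the number of characters, averaged over the identifiers of one query. That definition, the sample tokenizations, and both helper names are assumptions; the workbook may compute the ratio differently.

```python
def token_char_ratio(tokens):
    """Tokens per character for one tokenized string (assumed definition)."""
    char_count = len("".join(tokens))
    if char_count == 0:
        return 0.0
    return len(tokens) / char_count


def mean_token_char_ratio(tokenizations):
    """Average the per-string ratios across all strings in a query."""
    if not tokenizations:
        return 0.0
    ratios = [token_char_ratio(tokens) for tokens in tokenizations]
    return sum(ratios) / len(ratios)


# Hypothetical tokenizations of two identifiers from one query.
tokenized_identifiers = [
    ["cust", "omer"],  # 2 tokens over 8 characters
    ["id"],            # 1 token over 2 characters
]

query_mean = mean_token_char_ratio(tokenized_identifiers)
print(query_mean)
```

Under this definition, a higher ratio means the tokenizer splits the text into more, shorter pieces.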

load_annotation_files(annotation_directory=None, database=None, remove_error_columns=True)

Load all of the human-validated NL-to-SQL annotation files into a single dataframe

Parameters:
  • annotation_directory (str or None) – Directory where the annotation files are stored

  • database (str or None) – Name of the database to filter by. If None, all databases will be loaded

  • remove_error_columns (bool) – Option to exclude error classification data from annotations.

Returns:

Pandas Dataframe containing all of the annotation files

Return type:

pd.DataFrame

load_identifier_crosswalks(file_directory='./db/schema-xwalks/consolidated_and_validated', SBODemo_full=False)

Load a single dataframe containing identifier naturalness crosswalk data from all databases

Parameters:
  • file_directory (str) – Directory where the files are stored.

  • SBODemo_full (bool) – Use crosswalk containing ALL SBODemo identifiers (as opposed to the benchmark subset)

Returns:

A single dataframe containing identifier naturalness crosswalk data from all databases

Return type:

pd.DataFrame

export_gold_data()

Exports gold data to an Excel file.