load_consolidated_results
- class ConsolidatedResultsLoader
Class for loading and storing the consolidated results of the NL-to-SQL annotation files and other analysis outputs, such as token analysis and query statistics.
- config_dict: dict
Dictionary containing the configuration parameters for the analysis.
- get_joined_dataframes(jointype='left')
Joins all of the dataframes into a single dataframe. The join key is the combination of model, database name, naturalness level, and question number.
- Parameters:
jointype (str) – Type of join to perform. Defaults to left join.
- Returns:
Joined dataframe.
- Return type:
pd.DataFrame
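A join on a composite key can be sketched with pandas as follows. This is a minimal illustration, not the class's actual implementation: the column names in `JOIN_KEYS`, the helper `join_frames`, and the toy data are all hypothetical.

```python
import pandas as pd

# Hypothetical column names for the composite join key;
# the real dataframes may name these columns differently.
JOIN_KEYS = ["model", "database", "naturalness", "question_number"]

# Toy annotation results (assumed layout).
annotations = pd.DataFrame({
    "model": ["m1", "m1", "m2"],
    "database": ["db_a", "db_a", "db_a"],
    "naturalness": ["regular", "low", "regular"],
    "question_number": [1, 1, 1],
    "correct": [True, False, True],
})

# Toy token statistics; note there is no row for (m1, db_a, low, 1).
token_stats = pd.DataFrame({
    "model": ["m1", "m2"],
    "database": ["db_a", "db_a"],
    "naturalness": ["regular", "regular"],
    "question_number": [1, 1],
    "prompt_tokens": [120, 135],
})


def join_frames(left, right, jointype="left"):
    """Merge two frames on the composite key using the requested join type."""
    return left.merge(right, on=JOIN_KEYS, how=jointype)


joined = join_frames(annotations, token_stats, jointype="left")
inner = join_frames(annotations, token_stats, jointype="inner")
```

With a left join, every row of the left frame is kept and unmatched rows receive missing values in the right-hand columns; an inner join drops those rows instead.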
- load_prompt_tokens(file_directory='./data/tokenizer_analysis/')
Load the token analysis data generated by the tokenizer_analysis.ipynb workbook. The file contains all question prompts and their tokenizations generated by each model used in the experiments.
- Parameters:
file_directory (str) – Directory where the files are stored.
- Returns:
Pandas DataFrame containing all of the prompt token files
- Return type:
pd.DataFrame
- load_sentence_level_similarities(file_directory='./data/tokenizer_analysis/')
Load the sentence-level embedding similarity comparisons generated by the tokenizer_analysis.ipynb workbook. Scores are the cosine similarity between the semantic embeddings (SentenceTransformers) generated for an NL question and its corresponding gold query at each naturalness level.
- Parameters:
file_directory (str) – Directory where the files are stored.
- Returns:
Pandas DataFrame containing all of the question-query similarity scores for each naturalness level
- Return type:
pd.DataFrame
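For reference, cosine similarity between two embedding vectors can be computed as in the sketch below. The vectors are toy values standing in for sentence embeddings, and the helper `cosine_similarity` is illustrative only; the workbook may compute the score differently.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Similarity is undefined for a zero vector; return 0.0 by convention here.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy embeddings (assumed values, not real model output).
question_embedding = [0.2, 0.7, 0.1]
query_embedding = [0.25, 0.6, 0.05]

score = cosine_similarity(question_embedding, query_embedding)
```

A score near 1 means the two vectors point in almost the same direction, and a score near 0 means they are close to orthogonal.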
- load_world_level_similarities(file_directory='./data/tokenizer_analysis/')
Load the word-to-identifier embedding similarity comparisons generated by the tokenizer_analysis.ipynb workbook. Scores are based on the highest cosine similarity between a schema identifier and a word in the natural language question.
- Parameters:
file_directory (str) – Directory where the files are stored.
- Returns:
Pandas DataFrame containing all of the identifier-word similarity scores for each naturalness level
- Return type:
pd.DataFrame
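The "highest similarity between an identifier and a question word" idea can be sketched as a maximum over per-word cosine similarities. The embeddings below are toy two-dimensional values and the helper `best_word_match` is hypothetical; this is not the workbook's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_word_match(identifier_vector, word_vectors):
    """Return (word, score) for the question word most similar to the identifier."""
    scored = (
        (word, cosine_similarity(identifier_vector, vector))
        for word, vector in word_vectors.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Toy embedding for one schema identifier (assumed values).
identifier_vector = [0.9, 0.1]

# Toy embeddings for the words of a natural language question (assumed values).
question_words = {
    "how": [0.1, 0.9],
    "many": [0.3, 0.7],
    "species": [0.8, 0.2],
}

word, score = best_word_match(identifier_vector, question_words)
```

Here the identifier is matched to "species", the word whose toy embedding points in nearly the same direction.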
- load_identifier_token_analysis_files(file_directory='./data/tokenizer_analysis/')
Load the token analysis data generated by the tokenizer_analysis.ipynb workbook. The file contains all database identifiers and their tokenizations generated by each model used in the experiments.
- Parameters:
file_directory (str) – Directory where the files are stored.
- Returns:
Pandas DataFrame containing all of the token files
- Return type:
pd.DataFrame
- load_question_token_analysis_files(file_directory='./data/tokenizer_analysis/')
Load the token analysis data generated by the tokenizer_analysis.ipynb workbook.
- Parameters:
file_directory (str) – Directory where the files are stored.
- Returns:
Pandas DataFrame containing all of the token files
- Return type:
pd.DataFrame
- load_query_token_character_ratio_file(file_directory='./data/tokenizer_analysis')
Load the query-level mean token-to-character ratios generated by the tokenizer_analysis.ipynb workbook.
- Parameters:
file_directory (str) – Directory where the files are stored.
- Returns:
Pandas DataFrame containing all of the ratio means at the question-query pair level
- Return type:
pd.DataFrame
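The documentation does not state exactly how the ratio is defined. One plausible reading, shown here purely as an assumption, is the number of tokens divided by the number of characters for each identifier, averaged over the identifiers in a query. The tokenizations and the helper `token_char_ratio` are hypothetical.

```python
from statistics import mean


def token_char_ratio(tokens):
    """Number of tokens divided by the number of characters they cover."""
    n_chars = sum(len(token) for token in tokens)
    return len(tokens) / n_chars if n_chars else 0.0


# Hypothetical tokenizations of the identifiers used in one query.
tokenized_identifiers = {
    "species": ["species"],                        # 1 token over 7 characters
    "tbl_spp_cd": ["tbl", "_", "spp", "_", "cd"],  # 5 tokens over 10 characters
}

ratios = {
    name: token_char_ratio(tokens)
    for name, tokens in tokenized_identifiers.items()
}

# Query-level mean of the per-identifier ratios.
query_mean_ratio = mean(ratios.values())
```

Under this reading, a higher ratio means the tokenizer splits an identifier into more pieces per character, which is typical of abbreviated or less natural names.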
- load_annotation_files(annotation_directory=None, database=None, remove_error_columns=True)
Load all of the human-validated NL-to-SQL annotation files into a single dataframe.
- Parameters:
annotation_directory (str or None) – Directory where the annotation files are stored.
database (str or None) – Name of the database to filter by. If None, all databases will be loaded.
remove_error_columns (bool) – Option to exclude error classification data from annotations.
- Returns:
Pandas DataFrame containing all of the annotation files
- Return type:
pd.DataFrame
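The general pattern of combining a directory of files, optionally filtering by database, and dropping error columns might look like the sketch below. Everything specific here is an assumption made for illustration: the CSV file format, the `database` column name, the `error_` column prefix, and the function name `load_annotations_sketch` are not taken from the actual loader.

```python
from pathlib import Path

import pandas as pd


def load_annotations_sketch(annotation_directory, database=None, remove_error_columns=True):
    """Concatenate every CSV file in a directory into one dataframe.

    Assumes CSV files with a 'database' column and error classification
    columns prefixed with 'error_' (hypothetical conventions).
    """
    paths = sorted(Path(annotation_directory).glob("*.csv"))
    frames = [pd.read_csv(path) for path in paths]
    if not frames:
        return pd.DataFrame()

    combined = pd.concat(frames, ignore_index=True)

    # Keep only rows for the requested database, if one was given.
    if database is not None:
        combined = combined[combined["database"] == database].reset_index(drop=True)

    # Optionally drop the (assumed) error classification columns.
    if remove_error_columns:
        error_columns = [c for c in combined.columns if c.startswith("error_")]
        combined = combined.drop(columns=error_columns)

    return combined
```

Passing `database=None` keeps rows from every file, mirroring the documented default of loading all databases.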
- load_identifier_crosswalks(file_directory='./db/schema-xwalks/consolidated_and_validated', SBODemo_full=False)
Load a single dataframe containing identifier naturalness crosswalk data from all databases.
- Parameters:
file_directory (str) – Directory where the files are stored.
SBODemo_full (bool) – Use the crosswalk containing all SBODemo identifiers rather than only the benchmark subset.
- Returns:
A single dataframe containing identifier naturalness crosswalk data from all databases
- Return type:
pd.DataFrame
- export_gold_data()
Exports gold data to an Excel file.