.. _load_consolidated_results:

load_consolidated_results
=========================


.. py:class:: ConsolidatedResultsLoader

   Class for loading and storing the consolidated results of the NL-to-SQL annotation files
   and other analysis outputs, such as token analysis and query statistics.

   .. py:attribute:: config_dict
      :type: dict

      Dictionary containing the configuration parameters for the analysis.


   .. py:method:: get_joined_dataframes(jointype='left')

      Joins all of the loaded dataframes into a single dataframe. The join key is the
      composite of model, database name, naturalness level, and question number.

      :param jointype: Type of join to perform. Defaults to left join.
      :type jointype: str
      :return: Joined dataframe.
      :rtype: pd.DataFrame
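
      The composite-key join described above can be illustrated with a minimal pandas sketch.
      The column names and the two toy frames below are hypothetical stand-ins, not the
      loader's actual schema; only ``DataFrame.merge`` with ``how='left'`` is taken as given.

      .. code-block:: python

         import pandas as pd

         # Hypothetical names for the four parts of the composite join key.
         KEYS = ["model", "database", "naturalness", "question_number"]

         # Toy stand-in for annotation results (two questions).
         annotations = pd.DataFrame({
             "model": ["m1", "m1"],
             "database": ["db_a", "db_a"],
             "naturalness": ["native", "native"],
             "question_number": [1, 2],
             "correct": [True, False],
         })

         # Toy stand-in for another analysis output (only question 1 present).
         token_stats = pd.DataFrame({
             "model": ["m1"],
             "database": ["db_a"],
             "naturalness": ["native"],
             "question_number": [1],
             "token_count": [42],
         })

         # A left join keeps every annotation row; unmatched rows get NaN.
         joined = annotations.merge(token_stats, on=KEYS, how="left")

         # An inner join would instead drop the unmatched question.
         inner = annotations.merge(token_stats, on=KEYS, how="inner")

      With a left join, rows that have no match in the other frame are retained with
      missing values, which is presumably why it is the default here.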


   .. py:method:: load_prompt_tokens(file_directory="./data/tokenizer_analysis/")

      Load the prompt token analysis data generated by the tokenizer_analysis.ipynb notebook.
      The file contains all question prompts and their tokenizations as generated
      by each model used in the experiments.

      :param file_directory: Directory where the files are stored.
      :type file_directory: str
      :return: pandas DataFrame containing the data from all of the prompt token files.
      :rtype: pd.DataFrame


   .. py:method:: load_sentence_level_similarities(file_directory="./data/tokenizer_analysis/")

      Load the sentence-level embedding similarity comparisons generated by the tokenizer_analysis.ipynb notebook.
      Scores are based on the cosine similarity between the semantic embeddings (SentenceTransformers)
      generated for an NL question and its corresponding gold query at each naturalness level.

      :param file_directory: Directory where the files are stored.
      :type file_directory: str
      :return: pandas DataFrame containing all of the question-query similarity scores for each naturalness level.
      :rtype: pd.DataFrame
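
      As a reminder of what the scores measure, here is a minimal pure-Python sketch of
      cosine similarity. The three-dimensional vectors are toy values standing in for
      real sentence embeddings; the notebook's own computation may differ in detail.

      .. code-block:: python

         import math

         def cosine_similarity(u, v):
             """Cosine of the angle between two equal-length, non-zero vectors."""
             dot = sum(a * b for a, b in zip(u, v))
             norm_u = math.sqrt(sum(a * a for a in u))
             norm_v = math.sqrt(sum(b * b for b in v))
             return dot / (norm_u * norm_v)

         # Toy embeddings for an NL question and its gold query.
         question_embedding = [1.0, 0.0, 1.0]
         query_embedding = [1.0, 1.0, 0.0]

         score = cosine_similarity(question_embedding, query_embedding)

      A score near 1 means the two embeddings point in nearly the same direction; a
      score near 0 means they are close to orthogonal.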


   .. py:method:: load_world_level_similarities(file_directory="./data/tokenizer_analysis/")

      Load the word-to-identifier embedding similarity comparisons generated by the tokenizer_analysis.ipynb notebook.
      Scores are based on the highest cosine similarity between a schema identifier and any word in the
      natural language question.

      :param file_directory: Directory where the files are stored.
      :type file_directory: str
      :return: pandas DataFrame containing all of the identifier-word similarity scores for each naturalness level.
      :rtype: pd.DataFrame
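
      The "highest similarity" selection can be sketched as a maximum over the words of a
      question. The helper name ``best_matching_word`` and the two-dimensional vectors are
      hypothetical and serve only as an illustration; real embeddings would come from an
      embedding model.

      .. code-block:: python

         import math

         def cosine_similarity(u, v):
             """Cosine of the angle between two equal-length, non-zero vectors."""
             dot = sum(a * b for a, b in zip(u, v))
             norm_u = math.sqrt(sum(a * a for a in u))
             norm_v = math.sqrt(sum(b * b for b in v))
             return dot / (norm_u * norm_v)

         def best_matching_word(identifier_vec, word_vecs):
             """Return the (word, score) pair with the highest cosine similarity."""
             scored = (
                 (word, cosine_similarity(identifier_vec, vec))
                 for word, vec in word_vecs.items()
             )
             return max(scored, key=lambda pair: pair[1])

         # Toy embeddings for the words of one question and one schema identifier.
         word_vecs = {
             "how": [0.0, 1.0],
             "many": [0.6, 0.8],
             "employees": [1.0, 0.1],
         }
         identifier_vec = [1.0, 0.0]

         word, score = best_matching_word(identifier_vec, word_vecs)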


   .. py:method:: load_identifier_token_analysis_files(file_directory='./data/tokenizer_analysis/')

      Load the identifier token analysis data generated by the tokenizer_analysis.ipynb notebook.
      The file contains all database identifiers and their tokenizations as generated
      by each model used in the experiments.

      :param file_directory: Directory where the files are stored.
      :type file_directory: str
      :return: pandas DataFrame containing the data from all of the identifier token files.
      :rtype: pd.DataFrame


   .. py:method:: load_question_token_analysis_files(file_directory='./data/tokenizer_analysis/')

      Load the question token analysis data generated by the tokenizer_analysis.ipynb notebook.

      :param file_directory: Directory where the files are stored.
      :type file_directory: str
      :return: pandas DataFrame containing the data from all of the question token files.
      :rtype: pd.DataFrame


   .. py:method:: load_query_token_character_ratio_file(file_directory='./data/tokenizer_analysis')

      Load the query-level mean token-to-character ratios generated by the tokenizer_analysis.ipynb notebook.

      :param file_directory: Directory where the files are stored.
      :type file_directory: str
      :return: pandas DataFrame containing the mean ratios at the question-query pair level.
      :rtype: pd.DataFrame
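
      The exact ratio definition used by the notebook is not stated here. The sketch below
      assumes the ratio is the number of tokens divided by the number of characters for
      each identifier, averaged over the identifiers in a query. The function names and
      the toy tokenizations are hypothetical.

      .. code-block:: python

         def token_char_ratio(tokens):
             """Tokens per character for one identifier (assumed definition)."""
             text = "".join(tokens)
             return len(tokens) / len(text)

         def mean_token_char_ratio(tokenized_identifiers):
             """Average the per-identifier ratios over one query."""
             ratios = [token_char_ratio(tokens) for tokens in tokenized_identifiers]
             return sum(ratios) / len(ratios)

         # Toy tokenizations of two identifiers appearing in one query:
         # "employee" split into 3 tokens (8 characters), "dept" kept whole (4 characters).
         tokenized = [["emp", "loy", "ee"], ["dept"]]

         mean_ratio = mean_token_char_ratio(tokenized)

      Under this assumed definition, a higher ratio means the tokenizer splits the
      identifiers into more pieces per character.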


   .. py:method:: load_annotation_files(annotation_directory=None, database=None, remove_error_columns=True)

      Load all of the human-validated NL-to-SQL annotation files into a single dataframe.

      :param annotation_directory: Directory where the annotation files are stored.
      :type annotation_directory: str or None
      :param database: Name of the database to filter by. If None, all databases are loaded.
      :type database: str or None
      :param remove_error_columns: If True, exclude error classification data from the annotations.
      :type remove_error_columns: bool
      :return: pandas DataFrame containing the data from all of the annotation files.
      :rtype: pd.DataFrame


   .. py:method:: load_identifier_crosswalks(file_directory="./db/schema-xwalks/consolidated_and_validated", SBODemo_full=False)

      Load a single dataframe containing identifier naturalness crosswalk data from all databases.

      :param file_directory: Directory where the files are stored.
      :type file_directory: str
      :param SBODemo_full: If True, use the crosswalk containing all SBODemo identifiers rather than only the benchmark subset.
      :type SBODemo_full: bool
      :return: A single dataframe containing identifier naturalness crosswalk data from all databases.
      :rtype: pd.DataFrame


.. py:function:: export_gold_data()

   Exports gold data to an Excel file.
   

.. toctree::
    :maxdepth: 2
    :caption: Contents: