data_dict_interpreter

class DataDictInterpreter(database_name: str = None, data_dict_file: str = None)

Base class for interpreting a data dictionary. Subclasses handle specific document types (PDF, XML, CSV, and JSON).

Parameters:
  • database_name (str or None) – The name of the database.

  • data_dict_file (str or None) – The path to the data dictionary file.

Variables:
  • index (defaultdict[str, list[tuple[int, int]]]) – A dictionary mapping words to a list of locations in the document.

  • data_dict_file (str) – The path to the data dictionary file.

  • filename (str) – The name of the few-shot prompt file for a data dictionary.

  • call_gpt (function) – A function that takes a prompt and returns a GPT response.

interactive_prompt_builder(num_examples: int = 5)

Interactively builds a few-shot prompt for a data dictionary.

The method prompts the user for a valid database identifier from the document, then asks the user to confirm whether the generated identifier is a good example. Confirmed examples are added to the few-shot prompt. These two steps repeat until the user has supplied num_examples examples. Finally, the user is asked whether the prompt should be saved to disk and used for the data dictionary.

Parameters:

num_examples (int) – The number of examples to generate for the few-shot prompt.

Returns:

None

Return type:

None
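
The confirmation loop described above can be sketched as follows. This is a minimal illustration, not the library's implementation: build_examples, ask, and generate are hypothetical names, and the input function is passed in as a parameter so the loop can be exercised without a terminal.

```python
def build_examples(num_examples, ask, generate):
    """Collect confirmed (identifier, generated text) pairs.

    `ask` stands in for input(); `generate` stands in for whatever
    produces the candidate text. Both are assumptions of this sketch.
    """
    examples = []
    while len(examples) < num_examples:
        identifier = ask("Enter a valid database identifier: ")
        candidate = generate(identifier)
        answer = ask(f"Is '{candidate}' a good example for {identifier}? [y/n] ")
        # Only confirmed examples count toward num_examples.
        if answer.strip().lower() == "y":
            examples.append((identifier, candidate))
    return examples


# Scripted answers replace a real user: the second candidate is rejected once.
scripted = iter(["PT_ID", "y", "PT_DOB", "n", "PT_DOB", "y"])
examples = build_examples(
    2,
    ask=lambda _prompt: next(scripted),
    generate=lambda name: name.lower().replace("_", " "),
)
```

Rejected candidates do not count toward the total, so the loop may ask more than num_examples times.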

make_zero_shot_prompt(identifier: str, context_limit: int = 10) str

Generates a zero-shot prompt for a data dictionary.

The prompt is built from the identifier provided by the user and the context surrounding that identifier in the data dictionary.

Parameters:
  • identifier (str) – The identifier to generate a zero-shot prompt for.

  • context_limit (int) – The maximum number of occurrences of the identifier in the data dictionary text to include in the prompt.

Returns:

A zero-shot prompt for the data dictionary.

Return type:

str
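
As a rough illustration of how context_limit might bound the prompt, the sketch below assembles a zero-shot prompt from a list of context snippets. The function name build_zero_shot_prompt and the prompt wording are assumptions; the actual template is not documented here.

```python
def build_zero_shot_prompt(identifier, contexts, context_limit=10):
    """Assemble a prompt from at most `context_limit` context snippets."""
    selected = contexts[:context_limit]  # drop occurrences beyond the limit
    lines = [
        f"The database identifier '{identifier}' appears in a data dictionary.",
        "Excerpts that mention it:",
    ]
    lines += [f"- {snippet}" for snippet in selected]
    lines.append(f"In plain English, what does '{identifier}' refer to?")
    return "\n".join(lines)


prompt = build_zero_shot_prompt(
    "PT_DOB",
    ["PT_DOB date of birth of patient", "PT_DOB format YYYY-MM-DD", "see PT_DOB"],
    context_limit=2,
)
```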

make_few_shot_prompt(identifier: str, context_limit: int = 10, verbose: bool = True) str

Generates a few-shot prompt for a data dictionary.

Parameters:
  • identifier (str) – The identifier to generate a few-shot prompt for.

  • context_limit (int) – The maximum number of occurrences of the identifier in the data dictionary text to include in the prompt.

  • verbose (bool) – Whether to print verbose output.

Returns:

A few-shot prompt for the data dictionary.

Return type:

str
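
A few-shot prompt differs from a zero-shot prompt by placing worked examples before the query. The sketch below shows one plausible layout; build_few_shot_prompt and the "Identifier / Natural identifier" labels are hypothetical and not taken from the library.

```python
def build_few_shot_prompt(examples, identifier, contexts, context_limit=10):
    """Prepend worked examples, then pose the new identifier as the final query."""
    blocks = [
        f"Identifier: {ex_identifier}\nNatural identifier: {ex_answer}"
        for ex_identifier, ex_answer in examples
    ]
    context_text = "\n".join(contexts[:context_limit])
    # The final block is left open so the model completes the answer.
    blocks.append(
        f"Context:\n{context_text}\nIdentifier: {identifier}\nNatural identifier:"
    )
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    [("PT_ID", "patient identifier"), ("VISIT_DT", "visit date")],
    "PT_DOB",
    ["PT_DOB date of birth"],
)
```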

getNaturalIdentifier(identifier: str, verbose: bool = False) str

Returns the natural identifier for a database identifier.

Parameters:
  • identifier (str) – The identifier to get the natural identifier for.

  • verbose (bool) – Whether to print verbose output.

Returns:

The natural identifier for the database identifier.

Return type:

str
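
One way to picture this method is as a prompt builder composed with the call_gpt function listed under Variables. The sketch below assumes exactly that composition, which the documentation does not confirm; get_natural_identifier, make_prompt, and fake_call_gpt are hypothetical stand-ins so the example runs without network access.

```python
def get_natural_identifier(identifier, make_prompt, call_gpt):
    """Build a prompt for `identifier`, send it, and tidy the response."""
    prompt = make_prompt(identifier)
    response = call_gpt(prompt)
    return response.strip()  # models often return stray whitespace


def fake_call_gpt(prompt):
    # Stand-in for a real model call; returns a canned response.
    return " patient date of birth \n"


natural = get_natural_identifier(
    "PT_DOB",
    make_prompt=lambda name: f"Describe the identifier {name}.",
    call_gpt=fake_call_gpt,
)
```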

get_context_around_identifier(identifier: str, beam_width: int = None) list

Returns the context around an identifier in a data dictionary.

Parameters:
  • identifier (str) – The identifier to get the context around.

  • beam_width (int or None) – The amount of context to include on either side of the identifier. The unit depends on the subclass: characters for PDF, lines for CSV, and words for XML and JSON.

Returns:

A list of strings containing the context around the identifier.

Return type:

list[str]
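
To make beam_width concrete, here is a minimal word-based sketch of the windowing idea. It assumes the document has already been split into a list of words; context_around is a hypothetical helper, not the method itself.

```python
def context_around(words, identifier, beam_width=3):
    """Return one snippet per occurrence, `beam_width` words on each side."""
    contexts = []
    for position, word in enumerate(words):
        if word == identifier:
            start = max(0, position - beam_width)  # clamp at document start
            end = position + beam_width + 1        # slicing clamps at the end
            contexts.append(" ".join(words[start:end]))
    return contexts


words = "the field PT_DOB stores the date of birth and PT_DOB is required".split()
snippets = context_around(words, "PT_DOB", beam_width=2)
```

Each occurrence yields its own snippet, so the length of the returned list equals the number of matches.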

index_dictionary_file(file_obj) defaultdict

Indexes a data dictionary file.

Parameters:

file_obj (type varies by subclass) – The file object to index.

Returns:

A defaultdict mapping words to a list of locations in the document.

Return type:

defaultdict[str, list]
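
The shape of the returned index can be illustrated with a small sketch. It assumes each location is a (page number, word position) tuple, which matches the documented tuple[int, int] type but is otherwise a guess; index_pages is a hypothetical helper operating on plain strings.

```python
from collections import defaultdict


def index_pages(pages):
    """Map each word to the (page, word position) pairs where it occurs."""
    index = defaultdict(list)
    for page_number, text in enumerate(pages):
        for word_number, word in enumerate(text.split()):
            index[word].append((page_number, word_number))
    return index


index = index_pages(["PT_ID patient identifier", "PT_DOB birth date for PT_ID"])
```

Because the index is a defaultdict, looking up a word that never occurs returns an empty list rather than raising KeyError.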

class PdfDataDictInterpreter(database_name: str = None, data_dict_file: str = None)

This class is used to interpret a PDF data dictionary. Inherits from DataDictInterpreter.

Variables:
  • pdf (PdfReader) – A PdfReader object representing the PDF data dictionary.

  • beam_width (int) – The number of characters to include on either side of the identifier.

get_context_around_identifier(identifier: str, beam_width: int = None) list

See DataDictInterpreter.get_context_around_identifier.

index_dictionary_file(file_obj: PdfReader) defaultdict

See DataDictInterpreter.index_dictionary_file.

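
For the PDF interpreter, beam_width counts characters. The sketch below shows a character-based window over text that is assumed to have been extracted from a page already; char_context is a hypothetical helper and does not use PdfReader.

```python
def char_context(text, identifier, beam_width=20):
    """Return `beam_width` characters on each side of every occurrence."""
    contexts = []
    start = text.find(identifier)
    while start != -1:
        left = max(0, start - beam_width)
        right = start + len(identifier) + beam_width
        contexts.append(text[left:right])
        start = text.find(identifier, start + 1)  # continue past this match
    return contexts


page_text = "Field PT_DOB holds the date of birth."
snippets = char_context(page_text, "PT_DOB", beam_width=6)
```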
class XmlDataDictInterpreter(database_name: str = None, data_dict_file: str = None)

This class is used to interpret an XML data dictionary. Inherits from DataDictInterpreter.

Variables:
  • xml_text (str) – A string containing the text of the XML data dictionary.

  • xml_list (list[str]) – A list of words in the XML data dictionary.

  • beam_width (int) – The number of words to include on either side of the identifier.

get_context_around_identifier(identifier: str, beam_width: int = None) list

See DataDictInterpreter.get_context_around_identifier.

index_dictionary_file(file_obj) defaultdict

See DataDictInterpreter.index_dictionary_file.

class CsvDataDictInterpreter(database_name: str = None, data_dict_file: str = None)

This class is used to interpret a CSV data dictionary. Inherits from DataDictInterpreter.

Variables:
  • csv (str) – The CSV data as a string.

  • csv_header (str) – The header row of the CSV.

  • beam_width (int) – The number of lines to include around the identifier.

get_context_around_identifier(identifier: str, beam_width: int = None) list

See DataDictInterpreter.get_context_around_identifier.

index_dictionary_file(file_obj) defaultdict

See DataDictInterpreter.index_dictionary_file.

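
For the CSV interpreter, beam_width counts lines. A minimal sketch of a line-based window follows. Prepending the header row to each snippet is an assumption suggested by the csv_header variable, and csv_context is a hypothetical helper.

```python
def csv_context(csv_text, identifier, beam_width=1):
    """Return the header plus `beam_width` rows around each matching row."""
    lines = csv_text.splitlines()
    header, rows = lines[0], lines[1:]
    contexts = []
    for row_number, row in enumerate(rows):
        if identifier in row:
            start = max(0, row_number - beam_width)
            end = row_number + beam_width + 1
            # Keep the header so column meanings travel with the rows.
            contexts.append("\n".join([header] + rows[start:end]))
    return contexts


csv_text = (
    "name,type,description\n"
    "PT_ID,int,patient id\n"
    "PT_DOB,date,date of birth\n"
    "PT_SEX,str,sex\n"
    "VISIT_ID,int,visit id"
)
snippets = csv_context(csv_text, "PT_DOB", beam_width=1)
```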
class JsonDataDictInterpreter(database_name: str = None, data_dict_file: str = None)

This class is used to interpret a JSON data dictionary. Inherits from DataDictInterpreter.

Variables:
  • json_text (str) – The JSON data as a string.

  • json_list (list[str]) – A list of words in the JSON data.

  • beam_width (int) – The number of words to include around the identifier.

get_context_around_identifier(identifier: str, beam_width: int = None) list

See DataDictInterpreter.get_context_around_identifier.

index_dictionary_file(file_obj) defaultdict

See DataDictInterpreter.index_dictionary_file.

class DataDictInterpreterFactory(database_name: str)

Factory class for creating DataDictInterpreter objects.

Parameters:

database_name (str) – The name of the database.

get_current_interpreter() DataDictInterpreter

Returns the current DataDictInterpreter object.

Returns:

The current DataDictInterpreter object.

Return type:

DataDictInterpreter

get_new_interpreter(database_name: str) DataDictInterpreter

Creates and returns a new DataDictInterpreter object based on the database name.

Parameters:

database_name (str) – The name of the database.

Returns:

A new DataDictInterpreter object.

Return type:

DataDictInterpreter
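
The documentation does not say how the factory chooses a subclass. One plausible approach, shown here purely as an assumption, is to dispatch on the file extension of the data dictionary. The mapping INTERPRETER_NAMES and the function pick_interpreter_name are hypothetical; the class names are returned as strings so the sketch runs without the library installed.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the documented class names.
INTERPRETER_NAMES = {
    ".pdf": "PdfDataDictInterpreter",
    ".xml": "XmlDataDictInterpreter",
    ".csv": "CsvDataDictInterpreter",
    ".json": "JsonDataDictInterpreter",
}


def pick_interpreter_name(data_dict_file):
    """Choose an interpreter class name from the file extension."""
    suffix = Path(data_dict_file).suffix.lower()
    try:
        return INTERPRETER_NAMES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported data dictionary type: {suffix!r}") from None


chosen = pick_interpreter_name("dicts/clinic.PDF")
```

Lowercasing the suffix makes the lookup case-insensitive, and unsupported extensions raise a ValueError instead of silently falling back to a default.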