Supported Models

`ChatGPT_QA`

Bases: AbstractChatGPT

A Subclass of AbstractChatGPT which provides functionality to act as a question answering model for tabular data.

Attributes:

Name	Type	Description
`api_key`	`str`	The API key for the OpenAI client.
`api_org`	`str`	The organization ID for the OpenAI account. Defaults to None.
`model_name`	`str`	The name of the model to use. Defaults to 'gpt-3.5-turbo-0613'.

Methods:

Name	Description
`name`	Property attribute which returns the model name.
`prompt`	Property attribute which provides instructions for the model in a defined format.
`process_input`	Converts input data into a format which model can interpret.
`_normalize_output`	Normalize the output for question answering.

Note

The model used in this class is "gpt-3.5-turbo-0613" but you can specify any version you want.
The prompt contains few-shot examples to improve the QA task results

Examples:

>>> import pandas as pd
>>> from qatch.models import ChatGPT_QA
>>>
>>> data = pd.DataFrame([
...     ["John Doe", "123-456-7890"],
...     ["Jane Doe", "098-765-4321"]
... ], columns=["Name", "Phone Number"])
>>>
>>> chatgpt_qa_instance =  ChatGPT_QA(api_key=credentials['api_key_chatgpt'],
>>>                                  api_org=credentials['api_org_chatgpt'],
>>>                                  model_name="gpt-3.5-turbo-0613")
>>> query = "What is John Doe's phone number?"
>>> answer = chatgpt_qa_instance.predict(table=data, query=query, tbl_name='Contact Info')
>>> print(answer)
[['123-456-7890']]

Source code in qatch/models/chatgpt/chatgpt_QA.py

class ChatGPT_QA(AbstractChatGPT):
    """
       A Subclass of `AbstractChatGPT` which provides functionality to act as a question answering model
       for tabular data.

       Attributes:
           api_key (str): The API key for the OpenAI client.
           api_org (str, optional): The organization ID for the OpenAI account. Defaults to None.
           model_name (str, optional): The name of the model to use. Defaults to 'gpt-3.5-turbo-0613'.

       Methods:
           name: Property attribute which returns the model name.
           prompt: Property attribute which provides instructions for the model in a defined format.
           process_input: Converts input data into a format which model can interpret.
           _normalize_output: Normalize the output for question answering.

       Note:
           - The model used in this class is "gpt-3.5-turbo-0613" but you can specify any version you want.
           - The prompt contains few-shot examples to improve the QA task results


       Examples:
           >>> import pandas as pd
           >>> from qatch.models import ChatGPT_QA
           >>>
           >>> data = pd.DataFrame([
           ...     ["John Doe", "123-456-7890"],
           ...     ["Jane Doe", "098-765-4321"]
           ... ], columns=["Name", "Phone Number"])
           >>>
           >>> chatgpt_qa_instance =  ChatGPT_QA(api_key=credentials['api_key_chatgpt'],
           >>>                                  api_org=credentials['api_org_chatgpt'],
           >>>                                  model_name="gpt-3.5-turbo-0613")
           >>> query = "What is John Doe's phone number?"
           >>> answer = chatgpt_qa_instance.predict(table=data, query=query, tbl_name='Contact Info')
           >>> print(answer)
           [['123-456-7890']]
     """

    def __init__(self, api_key: str,
                 api_org: str | None,
                 model_name="gpt-3.5-turbo-0613",
                 *args, **kwargs):
        super().__init__(api_key, api_org, model_name,
                         *args, **kwargs)
        self.encoding = tiktoken.encoding_for_model(model_name)

    @property
    def name(self):
        return "ChatGPT_QA"

    @property
    def prompt(self):
        return [
            {"role": "user", "content":
                """I want you to act as a question answering model for tabular data.
                   I will pass you a table with one question.
                   I want you to only reply with the output of the question executed on the table.
                   I want you to return the answer in format: list of list (row and columns).
                   The answer must be complete of all the data from the table.
                   If an aggregations is present, return only the aggregate values.
                   Do not write explanations. Do not type commands.
                   This is an Example:
                   Table:
                    [
                        [['Simone', '[H] Name'], ['Papicchio', '[H] Surname']],
                         [['Marco', '[H] Name'], ['Clemente', '[H] Surname']]
                    ],
                    Question: 'Show all information about each body builder']
                    I want you to output:
                    [['Simone', 'Papicchio'], ['Marco', 'Clemente']]
                    """},
            {"role": "user", "content":
                "Table:[[['24172', '[H] Student ID'], ['30', '[H] Grade'], ['3431223445', '[H] Phone Numbers']], "
                "[['281811', '[H] Student ID'], ['22', '[H] Grade'], ['3435227445', '[H] Phone Numbers']]] "
                "Question: 'what are all the phone numbers?'"},
            {"role": "assistant",
             "content": "[['3431223445'], ['3435227445']]"},
            {"role": "user", "content":
                "Table:[[['24172', '[H] Student ID'], ['28', '[H] Grade'], ['3431223445', '[H] Phone Numbers']], "
                "[['281811', '[H] Student ID'], ['24', '[H] Grade'], ['3435227445', '[H] Phone Numbers']]] "
                "Question: 'what is the average of the grade?'"},
            {"role": "assistant",
             "content": "[[26]]"}
        ]

    def process_input(self,
                      table: pd.DataFrame | None,
                      db_table_schema: list | list[list] | None,
                      query: str,
                      query_tbl_name: str | list[str]) -> Any | None:
        if table is None:
            raise ValueError('To use ChatGPT for QA, you need to pass the pandas table')
        linearized_table = linearize_table(table)
        prompt = f"Table: {linearized_table},\nQuestion: '{query}'"
        num_tokens = self._num_tokens_from_string(prompt)
        if num_tokens > 4098:
            self.logger.error('prompt cannot be passed num_tokens > 4098')
            return None
        else:
            return {"role": "user", "content": prompt}

    def _num_tokens_from_string(self, string: str) -> int:
        """Returns the number of tokens in a text string."""
        num_tokens = len(self.encoding.encode(string))
        return num_tokens

    def _normalize_api_output(self, api_output):
        prediction: str = api_output.choices[0].message.content
        prediction: list = _normalize_output_for_QA(prediction)
        return prediction

`ChatGPT_SP`

Bases: AbstractChatGPT

A Subclass of AbstractChatGPT which provides functionality to act as a semantic parsing model for tabular data.

Attributes:

Name	Type	Description
`api_key`	`str`	The API key for the OpenAI client.
`api_org`	`str`	The organization ID for the OpenAI account. Defaults to None.
`model_name`	`str`	The name of the model to use. Defaults to 'gpt-3.5-turbo-0613'.

Methods:

Name	Description
`name`	Property attribute which returns the model name.
`prompt`	Property attribute which provides instructions for the model in a defined format: Table name: "body-builder", Schema: "[Name, Surname]", Questions: "Show all information about each body builder"
`process_input`	Converts input data into a format which model can interpret.
`_normalize_output`	Normalize the output for question answering.

Note

The model used in this class is "gpt-3.5-turbo-0613" but you can specify any version you want.
The prompt contains few-shot examples to improve the QA task results

Examples:

>>> import pandas as pd
>>> from qatch.models import ChatGPT_SP
>>>
>>> data = pd.DataFrame([
...     ["John Doe", "123-456-7890"],
...     ["Jane Doe", "098-765-4321"]
... ], columns=["Name", "Phone Number"])
>>>
>>> chatgpt_sp_instance = ChatGPT_SP(api_key=credentials['api_key_chatgpt'],
   >>>                                  api_org=credentials['api_org_chatgpt'],
   >>>                                  model_name="gpt-3.5-turbo-0613")
>>> query = "What is John Doe's phone number?"
>>> answer = chatgpt_sp_instance.predict(table=data, query=query, tbl_name='Contact Info')
>>> print(answer)
SELECT "Phone Number" FROM "Contact Info" WHERE "Name" = "John Doe"

Source code in qatch/models/chatgpt/chatgpt_SP.py

class ChatGPT_SP(AbstractChatGPT):
    """
    A Subclass of `AbstractChatGPT` which provides functionality to act as a semantic parsing model for tabular data.

    Attributes:
        api_key (str): The API key for the OpenAI client.
        api_org (str, optional): The organization ID for the OpenAI account. Defaults to None.
        model_name (str, optional): The name of the model to use. Defaults to 'gpt-3.5-turbo-0613'.

    Methods:
        name: Property attribute which returns the model name.
        prompt: Property attribute which provides instructions for the model in a defined format: Table name: "body-builder",
            Schema: "[Name, Surname]", Questions: "Show all information about each body builder"
        process_input: Converts input data into a format which model can interpret.
        _normalize_output: Normalize the output for question answering.

    Note:
       - The model used in this class is "gpt-3.5-turbo-0613" but you can specify any version you want.
       - The prompt contains few-shot examples to improve the QA task results


    Examples:
        >>> import pandas as pd
        >>> from qatch.models import ChatGPT_SP
        >>>
        >>> data = pd.DataFrame([
        ...     ["John Doe", "123-456-7890"],
        ...     ["Jane Doe", "098-765-4321"]
        ... ], columns=["Name", "Phone Number"])
        >>>
        >>> chatgpt_sp_instance = ChatGPT_SP(api_key=credentials['api_key_chatgpt'],
           >>>                                  api_org=credentials['api_org_chatgpt'],
           >>>                                  model_name="gpt-3.5-turbo-0613")
        >>> query = "What is John Doe's phone number?"
        >>> answer = chatgpt_sp_instance.predict(table=data, query=query, tbl_name='Contact Info')
        >>> print(answer)
        SELECT "Phone Number" FROM "Contact Info" WHERE "Name" = "John Doe"
    """

    def __init__(self, api_key: str,
                 api_org: str | None,
                 model_name="gpt-3.5-turbo-0613",
                 *args, **kwargs):
        super().__init__(api_key, api_org, model_name,
                         *args, **kwargs)

    @property
    def name(self):
        return 'ChatGPT_SP'

    @property
    def prompt(self):
        return [
            {"role": "user", "content":
                """I want you to act as a text to SQL model for tabular data.
                   I will pass you the schema of the table and one question.
                   I want you to parse the question into the SQL command.
                   The SQL command must be executable with the schema of the table.
                   Do not write explanations. Do not type commands. 
                   REPLY ONLY WITH THE SQL COMMAND.
                   This is an Example:
                   Table name: "body-builder", 
                    Schema: [Name, Surname], 
                    Questions: "Show all information about each body builder"
                    I want you to output:
                    "SELECT * FROM "body-builder""
                    """},
            {"role": "user", "content":
                'Table name: "student",'
                "Schema: [StudentID, Grade, PhoneNumbers]"
                'Question: "what are all the phone numbers?"'},

            {"role": "assistant",
             "content": 'SELECT "PhoneNumbers" FROM student'},
            {"role": "user", "content":
                'Table name: "student"'
                "Schema: [StudentID, Grade, PhoneNumbers]"
                'Question: "what is the average grade?"'},
            {"role": "assistant",
             "content": "SELECT AVG(Grade) FROM student"},
        ]

    def process_input(self,
                      table: pd.DataFrame | None,
                      db_table_schema: list | list[list] | None,
                      query: str,
                      query_tbl_name: str | list[str]) -> Any | None:
        if not query_tbl_name:
            raise ValueError('For Semantic Parsing, it is need the table name '
                             'for the chatgpt input prompt')

        schema = table.columns.tolist()
        prompt = f'Table Name: "{query_tbl_name}",\nSchema: {schema},\nQuestion: "{query}"'
        return {"role": "user", "content": prompt}

    def _normalize_api_output(self, api_output):
        prediction: str = api_output.choices[0].message.content
        return prediction

`ChatGPT_SP_join`

Bases: AbstractChatGPT

Implementation of the Llama2 model specialized for semantic parsing (SP) with JOIN statements. Inherits from the Abstract Llama2 model class.

This model processes the provided schemas and queries, and after transformation, predicts the appropriate SQL statements.

Attributes:

Name	Type	Description
`api_key`	`str`	The API key for the OpenAI client.
`api_org`	`str`	The organization ID for the OpenAI account. Defaults to None.
`model_name`	`str`	The name of the model to use. Defaults to 'gpt-3.5-turbo-0613'.

Methods:

Name	Description
`name`	Property attribute which returns the model name.
`prompt`	Property attribute which provides instructions for the model in a defined format: Database table names: ["customer", "product"], Schema table "customer": [CustomerID, name, surname] Schema table "product": [ProductID, CustomerID, name, surname, price] Question: "which products did Simone buy?"
`process_input`	Processes given inputs into a form that model can consume. Extracts and structures relevant data for the SP task.
`_normalize_output`	Normalizes the text received from model predictions. Strips away unnecessary characters from the result SQL statement.

For this model, the table parameter in predict and process_input methods

is not used and can be set to None.

Examples:

>>> chatgpt_sp_join = ChatGPT_SP_join(api_key=credentials['api_key_chatgpt'],
   >>>                                  api_org=credentials['api_org_chatgpt'],
   >>>                                  model_name="gpt-3.5-turbo-0613")
>>> # you need to specify all the database table schema
>>> # if you are using QATCH, you can use database.get_all_table_schema_given(db_id='name_of_the_database')
>>> db_table_schema = {
...                    "student": {"name": ["StudentID", "Grade", "PhoneNumbers"]},
...                    "customer": {"name": ["CustomerID", "name", "surname"]},
...                    "product": {"name": ["ProductID", "CustomerID", "name", "surname", "price"]}
...                   }
>>> query = "which products did Simone buy?"
>>> chatgpt_sp_join.predict(table=None,
>>>                       query=query,
>>>                       tbl_name=["customer", "product"],
>>>                       db_table_schema=db_table_schema)
SELECT T1.name, T2.name FROM "customer" as T1 JOIN "product" as T2 WHERE T1.name == "Simone"

Source code in qatch/models/chatgpt/chatgpt_SP_join.py

class ChatGPT_SP_join(AbstractChatGPT):
    """
    Implementation of the Llama2 model specialized for semantic parsing (SP)
    with JOIN statements. Inherits from the Abstract Llama2 model class.

    This model processes the provided schemas and queries, and after transformation,
    predicts the appropriate SQL statements.

    Attributes:
        api_key (str): The API key for the OpenAI client.
        api_org (str, optional): The organization ID for the OpenAI account. Defaults to None.
        model_name (str, optional): The name of the model to use. Defaults to 'gpt-3.5-turbo-0613'.

    Methods:
        name: Property attribute which returns the model name.
        prompt: Property attribute which provides instructions for the model in a defined format:
                Database table names: ["customer", "product"], Schema table "customer": [CustomerID, name, surname]
                Schema table "product": [ProductID, CustomerID, name, surname, price]
                Question: "which products did Simone buy?"
        process_input: Processes given inputs into a form that model can consume.
                       Extracts and structures relevant data for the SP task.
        _normalize_output: Normalizes the text received from model predictions.
                           Strips away unnecessary characters from the result SQL statement.

    Note: For this model, the `table` parameter in predict and process_input methods
            is not used and can be set to None.

    Examples:
        >>> chatgpt_sp_join = ChatGPT_SP_join(api_key=credentials['api_key_chatgpt'],
           >>>                                  api_org=credentials['api_org_chatgpt'],
           >>>                                  model_name="gpt-3.5-turbo-0613")
        >>> # you need to specify all the database table schema
        >>> # if you are using QATCH, you can use database.get_all_table_schema_given(db_id='name_of_the_database')
        >>> db_table_schema = {
        ...                    "student": {"name": ["StudentID", "Grade", "PhoneNumbers"]},
        ...                    "customer": {"name": ["CustomerID", "name", "surname"]},
        ...                    "product": {"name": ["ProductID", "CustomerID", "name", "surname", "price"]}
        ...                   }
        >>> query = "which products did Simone buy?"
        >>> chatgpt_sp_join.predict(table=None,
        >>>                       query=query,
        >>>                       tbl_name=["customer", "product"],
        >>>                       db_table_schema=db_table_schema)
        SELECT T1.name, T2.name FROM "customer" as T1 JOIN "product" as T2 WHERE T1.name == "Simone"

    """

    def __init__(self, api_key: str,
                 api_org: str | None,
                 model_name="gpt-3.5-turbo-0613",
                 *args, **kwargs):
        super().__init__(api_key, api_org, model_name,
                         *args, **kwargs)

    @property
    def name(self):
        return 'ChatGPT_SP_join'

    @property
    def prompt(self):
        return [
            {"role": "user", "content":
                """I want you to act as a text to SQL model for tabular data.
                   I will pass you as prompt: 
                   - all the table names and the respective tables schema present in the database 
                   - one question. 
                   I want you to parse the question into the SQL command.
                   The SQL command must be executable over the database.
                   Do not write explanations. Do not type commands. 
                   REPLY ONLY WITH THE SQL COMMAND.
                   This is an Example:
                    Database table names: ["body-builder"], 
                    Table schema "body-builder": [Name, Surname], 
                    Question: "Show all information about each body builder"
                    I want you to output:
                    "SELECT * FROM "body-builder""
                    """},

            {"role": "user", "content":
                'Database table names: ["student"],'
                'Schema table "body-builder": [StudentID, Grade, PhoneNumbers]'
                'Question: "what are all the phone numbers?"'},
            {"role": "assistant",
             "content": 'SELECT "PhoneNumbers" FROM student'},

            {"role": "user", "content":
                'Database table names: ["student"],'
                'Schema table "student": [StudentID, Grade, PhoneNumbers]'
                'Question: "what is the average grade?"'},
            {"role": "assistant",
             "content": "SELECT AVG(Grade) FROM student"},

            {"role": "user", "content":
                'Database table names: ["customer", "product"]'
                'Schema table "customer": [CustomerID, name, surname]'
                'Schema table "product": [ProductID, CustomerID, name, surname, price]'
                'Question: "which products did Simone buy?"'},
            {"role": "assistant",
             "content": 'SELECT T1.name, T2.name FROM "customer" as T1 JOIN "product" as T2 WHERE T1.name == "Simone"'},
        ]

    def process_input(self, table: pd.DataFrame | None,
                      db_table_schema: dict[str, pd.DataFrame], query: str,
                      query_tbl_name: str | list[str]) -> Any | None:
        if not db_table_schema:
            raise ValueError('For Semantic Parsing JOIN, it is needed the schema of the database')

        prompts = [f'Database table names: {list(db_table_schema.keys())}']
        for name, schema in db_table_schema.items():
            prompts.append(f'Schema table "{name}": {schema["name"].tolist()}')
        prompts.append(f'Question: "{query}"')

        return {"role": "user", "content": "\n".join(prompts)}

    def _normalize_api_output(self, api_output):
        prediction: str = api_output.choices[0].message.content
        return prediction

`LLama2_QA`

Bases: AbstractLLama2

A Subclass of AbstractLLama2 which provides functionality to act as a question answering model for tabular data.

Attributes:

Name	Type	Description
`model_name`	`str`	Name of the Llama model.
`hugging_face_token`	`(str, None)`	Token for the Hugging Face.
`force_cpu`	`bool`	To force usage of cpu. Defaults to False.

Methods:

Name	Description
`name`	Property attribute which returns the model name.
`prompt`	Property attribute which provides instructions for the model in a defined format.
`process_input`	Converts input data into a format which model can interpret.
`_normalize_output`	Normalize the output for question answering.

Note

The model used in this class is "meta-llama/Llama-2-7b-chat-hf".
The prompt contains few-shot examples to improve the QA task results

Examples:

>>> import pandas as pd
>>> from qatch.models import LLama2_QA
>>>
>>> data = pd.DataFrame([
...     ["John Doe", "123-456-7890"],
...     ["Jane Doe", "098-765-4321"]
... ], columns=["Name", "Phone Number"])
>>>
>>> llama2_qa_instance = LLama2_QA("meta-llama/Llama-2-7b-chat-hf")
>>> query = "What is John Doe's phone number?"
>>> answer = llama2_qa_instance.predict(table=data, query=query, tbl_name='Contact Info')
>>> print(answer)
[['123-456-7890']]

Source code in qatch/models/llama2/llama2_QA.py

class LLama2_QA(AbstractLLama2):
    """
    A Subclass of `AbstractLLama2` which provides functionality to act as a question answering model
    for tabular data.

    Attributes:
        model_name (str): Name of the Llama model.
        hugging_face_token (str, None): Token for the Hugging Face.
        force_cpu (bool, optional): To force usage of cpu. Defaults to False.

    Methods:
        name: Property attribute which returns the model name.
        prompt: Property attribute which provides instructions for the model in a defined format.
        process_input: Converts input data into a format which model can interpret.
        _normalize_output: Normalize the output for question answering.

    Note:
        - The model used in this class is "meta-llama/Llama-2-7b-chat-hf".
        - The prompt contains few-shot examples to improve the QA task results


    Examples:
        >>> import pandas as pd
        >>> from qatch.models import LLama2_QA
        >>>
        >>> data = pd.DataFrame([
        ...     ["John Doe", "123-456-7890"],
        ...     ["Jane Doe", "098-765-4321"]
        ... ], columns=["Name", "Phone Number"])
        >>>
        >>> llama2_qa_instance = LLama2_QA("meta-llama/Llama-2-7b-chat-hf")
        >>> query = "What is John Doe's phone number?"
        >>> answer = llama2_qa_instance.predict(table=data, query=query, tbl_name='Contact Info')
        >>> print(answer)
        [['123-456-7890']]
  """

    @property
    def name(self):
        return 'LLama2_QA'

    @property
    def prompt(self):
        return """\
        <<SYS>> I want you to act as a question answering model for tabular data.
        I will pass you a table with one question. 
        I want you to return the elements in the table that answer the question.
        I want you to return the answer in format: list of list (row and columns).
        The answer must be complete of all the data from the table.
        If an aggregations is present, return only the aggregate values.
        The answer must be generated only from the table provided.
        The answer must have the same format of the Table passed as input.
        The answer must be a list of tuples. Then for each tuple, a list of elements. For each element a list of cell values and the header. 
        Do not use different formats in the answer. 
        Do not repeat the instruction in the answer.
        <</SYS>>
        [INST] Table Name: "Body_Builders"
        Table: "[[['Simone', '[H] Name'], ['Papicchio', '[H] Surname']], [['Marco', '[H] Name'], ['Clemente', '[H] Surname']]]"
        Question: "Show all information about each body builder"
        [/INST]
        [[Simone', 'Papicchio'], ['Marco', 'Clemente']]
        [INST]
        Table Name: "Students"
        Table: "[[['24172', '[H] Student ID'], ['30', '[H] Grade'], ['3431223445', '[H] Phone Numbers']], [['281811', '[H] Student ID'], ['22', '[H] Grade'], ['3435227445', '[H] Phone Numbers']]]"
        Question: "what are all the phone numbers?"
        [/INST]
        [['3431223445'], ['3435227445']]
        [INST] Table Name: "Students"
        Table: "[[['24172', '[H] Student ID'], ['28', '[H] Grade'], ['3431223445', '[H] Phone Numbers']], [['281811', '[H] Student ID'], ['24', '[H] Grade'], ['3435227445', '[H] Phone Numbers']]]"
        Question: "what is the average of the grade?"
        [/INST]
        [[26]]
        """

    @override
    def process_input(self,
                      table: pd.DataFrame | None,
                      db_table_schema: dict | None,
                      query: str,
                      query_tbl_name: str | list[str]) -> Any | None:
        if table.size > 512:
            return None
        linearized_table = linearize_table(table)
        model_input = \
            f"""[INST] Table Name: "{query_tbl_name}"
            Table: "{linearized_table}"
            Question: "{query}"
            [/INST]"""
        return model_input

    @override
    def _normalize_output(self, text):
        return _normalize_output_for_QA(text)

`LLama2_SP`

Bases: AbstractLLama2

A Subclass of AbstractLLama2 which provides functionality to act as a semantic parsing model for tabular data.

Attributes:

Name	Type	Description
`model_name`	`str`	Name of the Llama model.
`hugging_face_token`	`(str, None)`	Token for the Hugging Face.
`force_cpu`	`bool`	To force usage of cpu. Defaults to False.

Methods:

Name	Description
`name`	Property attribute which returns the model name.
`prompt`	Property attribute which provides instructions for the model in a defined format: Table name: "body-builder", Schema: "[Name, Surname]", Questions: "Show all information about each body builder"
`process_input`	Converts input data into a format which model can interpret.
`_normalize_output`	Normalize the output for question answering.

Note

The model used in this class is "codellama/CodeLlama-7b-Instruct-hf".
The prompt contains few-shot examples to improve the SP task results

Examples:

>>> import pandas as pd
>>> from qatch.models import LLama2_QA
>>>
>>> data = pd.DataFrame([
...     ["John Doe", "123-456-7890"],
...     ["Jane Doe", "098-765-4321"]
... ], columns=["Name", "Phone Number"])
>>>
>>> llama2_sp_instance = LLama2_SP("codellama/CodeLlama-7b-Instruct-hf")
>>> query = "What is John Doe's phone number?"
>>> answer = llama2_sp_instance.predict(table=data, query=query, tbl_name='Contact Info')
>>> print(answer)
SELECT "Phone Number" FROM "Contact Info" WHERE "Name" = "John Doe"

Source code in qatch/models/llama2/llama2_SP.py

class LLama2_SP(AbstractLLama2):
    """
    A Subclass of `AbstractLLama2` which provides functionality to act as a semantic parsing model for tabular data.

    Attributes:
        model_name (str): Name of the Llama model.
        hugging_face_token (str, None): Token for the Hugging Face.
        force_cpu (bool, optional): To force usage of cpu. Defaults to False.

    Methods:
        name: Property attribute which returns the model name.
        prompt: Property attribute which provides instructions for the model in a defined format:  Table name: "body-builder",
            Schema: "[Name, Surname]", Questions: "Show all information about each body builder"
        process_input: Converts input data into a format which model can interpret.
        _normalize_output: Normalize the output for question answering.

    Note:
        - The model used in this class is "codellama/CodeLlama-7b-Instruct-hf".
        - The prompt contains few-shot examples to improve the SP task results


    Examples:
        >>> import pandas as pd
        >>> from qatch.models import LLama2_QA
        >>>
        >>> data = pd.DataFrame([
        ...     ["John Doe", "123-456-7890"],
        ...     ["Jane Doe", "098-765-4321"]
        ... ], columns=["Name", "Phone Number"])
        >>>
        >>> llama2_sp_instance = LLama2_SP("codellama/CodeLlama-7b-Instruct-hf")
        >>> query = "What is John Doe's phone number?"
        >>> answer = llama2_sp_instance.predict(table=data, query=query, tbl_name='Contact Info')
        >>> print(answer)
        SELECT "Phone Number" FROM "Contact Info" WHERE "Name" = "John Doe"
    """

    @property
    def name(self):
        return 'LLama2_SP_code'

    @property
    def prompt(self):
        return \
            """[INST] I want you to act as a text to SQL model for tabular data.
            I will pass you the schema of the table and one question.
            I want you to parse the question into the SQL query.
            The SQL command must be executable with the schema of the table.
            Do not write explanations. Do not type commands. 
            [/INST] 
            ok pass me the input
            [INST]
            Table name: "body-builder", 
            Schema: "[Name, Surname]", 
            Questions: "Show all information about each body builder"
            [/INST]
            SELECT * FROM "body-builder"
            [INST] Table Name: "Body_Builders"
            Schema: "[Name, Surname]"
            Question: "Show all information about each body builder"
            [/INST]
            SELECT * FROM "Body_Builders"
            [INST] Table name: "student"
            Schema: "[StudentID, Grade, PhoneNumbers]"
            Question: "what are all the phone numbers?"
            [/INST]
            SELECT "PhoneNumbers" FROM "student"
            [INST] Table name: "student"
            Schema: "[StudentID, Grade, PhoneNumbers]"
            Question: "what are all the phone numbers?"
            [/INST]
            SELECT AVG("Grade") FROM "student"
            """

    def process_input(self,
                      table: pd.DataFrame | None,
                      db_table_schema: dict | None,
                      query: str,
                      query_tbl_name: str | list[str]) -> Any | None:
        schema = table.columns.tolist()
        model_input = f"""
        [INST] Table name: "{query_tbl_name}"
        Schema: "{schema}"
        Question: "{query}"
        [/INST] """
        return model_input

    def _normalize_output(self, text):
        return text.replace('\n', '').strip()

`LLama2_SP_join`

Bases: AbstractLLama2

Implementation of the Llama2 model specialized for semantic parsing (SP) with JOIN statements. Inherits from the Abstract Llama2 model class.

This model processes the provided schemas and queries, and after transformation, predicts the appropriate SQL statements.

Attributes:

Name	Type	Description
`model_name`	`str`	Name of the Llama model.
`hugging_face_token`	`(str, None)`	Token for the Hugging Face.
`force_cpu`	`bool`	To force usage of cpu. Defaults to False.

Methods:

Name	Description
`name`	Property attribute which returns the model name.
`prompt`	Property attribute which provides instructions for the model in a defined format: Database table names: ["customer", "product"], Schema table "customer": [CustomerID, name, surname] Schema table "product": [ProductID, CustomerID, name, surname, price] Question: "which products did Simone buy?"
`process_input`	Processes given inputs into a form that model can consume. Extracts and structures relevant data for the SP task.
`_normalize_output`	Normalizes the text received from model predictions. Strips away unnecessary characters from the result SQL statement.

For this model, the table parameter in predict and process_input methods

is not used and can be set to None.

Examples:

>>> llama_sp_join = LLama2_SP_join(model_name="codellama/CodeLlama-7b-Instruct-hf",
>>>                                hugging_face_token=credentials['hugging_face_token'])
>>> # you need to specify all the database table schema
>>> # if you are using QATCH, you can use database.get_all_table_schema_given(db_id='name_of_the_database')
>>> db_table_schema = {
...                    "student": {"name": ["StudentID", "Grade", "PhoneNumbers"]},
...                    "customer": {"name": ["CustomerID", "name", "surname"]},
...                    "product": {"name": ["ProductID", "CustomerID", "name", "surname", "price"]}
...                   }
>>> query = "which products did Simone buy?"
>>> llama_sp_join.predict(table=None,
>>>                       query=query,
>>>                       tbl_name=["customer", "product"],
>>>                       db_table_schema=db_table_schema)
SELECT T1.name, T2.name FROM "customer" as T1 JOIN "product" as T2 WHERE T1.name == "Simone"

Source code in qatch/models/llama2/llama2_SP_join.py

class LLama2_SP_join(AbstractLLama2):
    """
    Implementation of the Llama2 model specialized for semantic parsing (SP)
    with JOIN statements. Inherits from the Abstract Llama2 model class.

    This model processes the provided schemas and queries, and after transformation,
    predicts the appropriate SQL statements.

    Attributes:
        model_name (str): Name of the Llama model.
        hugging_face_token (str, None): Token for the Hugging Face.
        force_cpu (bool, optional): To force usage of cpu. Defaults to False.

    Methods:
        name: Property attribute which returns the model name.
        prompt: Property attribute which provides instructions for the model in a defined format:
                Database table names: ["customer", "product"], Schema table "customer": [CustomerID, name, surname]
                Schema table "product": [ProductID, CustomerID, name, surname, price]
                Question: "which products did Simone buy?"
        process_input: Processes given inputs into a form that model can consume.
                       Extracts and structures relevant data for the SP task.
        _normalize_output: Normalizes the text received from model predictions.
                           Strips away unnecessary characters from the result SQL statement.

    Note: For this model, the `table` parameter in predict and process_input methods
            is not used and can be set to None.

    Examples:
        >>> llama_sp_join = LLama2_SP_join(model_name="codellama/CodeLlama-7b-Instruct-hf",
        >>>                                hugging_face_token=credentials['hugging_face_token'])
        >>> # you need to specify all the database table schema
        >>> # if you are using QATCH, you can use database.get_all_table_schema_given(db_id='name_of_the_database')
        >>> db_table_schema = {
        ...                    "student": {"name": ["StudentID", "Grade", "PhoneNumbers"]},
        ...                    "customer": {"name": ["CustomerID", "name", "surname"]},
        ...                    "product": {"name": ["ProductID", "CustomerID", "name", "surname", "price"]}
        ...                   }
        >>> query = "which products did Simone buy?"
        >>> llama_sp_join.predict(table=None,
        >>>                       query=query,
        >>>                       tbl_name=["customer", "product"],
        >>>                       db_table_schema=db_table_schema)
        SELECT T1.name, T2.name FROM "customer" as T1 JOIN "product" as T2 WHERE T1.name == "Simone"
    """

    @property
    def name(self):
        return 'LLama2_SP_join_code'

    @property
    def prompt(self):
        return \
            """[INST]I want you to act as a text to SQL model for tabular data.
            I will pass you as prompt: 
            - all the table names and the respective tables schema present in the database 
            - one question. 
            I want you to parse the question into the SQL command.
            The SQL command must be executable over the database.
            Do not write explanations. Do not type commands. 
            REPLY ONLY WITH THE SQL COMMAND.
            This is an Example:
            Database table names: ["body-builder"] 
            Table schema "body-builder": [Name, Surname] 
            Question: "Show all information about each body builder"
            I want you to output:
            "SELECT * FROM "body-builder"" 
            [/INST] 
            [INST] Database table names: ["student"]
            Schema table "body-builder": [StudentID, Grade, PhoneNumbers]
            Question: "what are all the phone numbers?"
            [/INST]
            'SELECT "PhoneNumbers" FROM student'"
            [INST] Database table names: ["student"]
            Schema table "student": [StudentID, Grade, PhoneNumbers]
            Question: "what is the average grade?"
            [/INST]
            "SELECT AVG(Grade) FROM student"
            [INST] Database table names: ["customer", "product"]
            Schema table "customer": [CustomerID, name, surname]
            Schema table "product": [ProductID, CustomerID, name, surname, price]
            Question: "which products did Simone buy?"
            [/INST]
            SELECT T1.name, T2.name FROM "customer" as T1 JOIN "product" as T2 WHERE T1.name == "Simone"
            """

    def process_input(self,
                      table: pd.DataFrame | None,
                      db_table_schema: dict | None,
                      query: str,
                      query_tbl_name: str | list[str]) -> Any | None:
        if not db_table_schema:
            raise ValueError('For Semantic Parsing JOIN, it is needed the schema of the database')

        prompts = [f'[INST] Database table names: {list(db_table_schema.keys())}']

        for name, schema in db_table_schema.items():
            prompts.append(f'Schema table "{name}": {schema["name"].tolist()}')

        prompts.append(f'Question: "{query}"\n[/INST]')
        return "\n".join(prompts)

    def _normalize_output(self, text):
        """
        Normalizes the text received from model predictions.

        Simplifies the predicted SQL command by removing any new lines and extra quotations.

        Args:
            text (str): The raw text prediction from the underlying model.

        Returns:
            str: The normalized text.
        """
        return text.replace('\n', '').replace('"', '').strip()

`Omnitab`

Bases: AbstractModel

The Omnitab class inherits from the AbstractModel and specializes it to parse tables using the Omnitab model.

Attributes:

Name	Type	Description
`tokenizer`	`AutoTokenizer`	The tokenizer for input preprocessing.
`model`	`AutoModelForSeq2SeqLM`	The model used to answer the queries from the table.

Note

The model used in this class is 'neulab/omnitab-large-finetuned-wtq'.
The Omnitab model works specifically with tables that only contain strings and has a model input limit of 1024 tokens.

Examples:

>>>import pandas as pd
>>>from qatch.models import Tapas
>>>
>>> data = pd.DataFrame([
...     ["John Doe", "123-456-7890"],
...     ["Jane Doe", "098-765-4321"]
... ], columns=["Name", "Phone Number"])
>>>
>>> omnitab_model = Omnitab("'neulab/omnitab-large-finetuned-wtq'")
>>> query = "What is John Doe's phone number?"
>>> answer = omnitab_model.predict(table=data, query=query, tbl_name='Contact Info')
>>> print(answer)
[['123-456-7890']]

Source code in qatch/models/omnitab.py

class Omnitab(AbstractModel):
    """
    The Omnitab class inherits from the AbstractModel and specializes it to parse tables using the Omnitab model.

    Attributes:
        tokenizer (AutoTokenizer): The tokenizer for input preprocessing.
        model (AutoModelForSeq2SeqLM): The model used to answer the queries from the table.

    Note:
        - The model used in this class is 'neulab/omnitab-large-finetuned-wtq'.
        - The Omnitab model works specifically with tables that only contain strings
         and has a model input limit of 1024 tokens.

    Examples:
        >>>import pandas as pd
        >>>from qatch.models import Tapas
        >>>
        >>> data = pd.DataFrame([
        ...     ["John Doe", "123-456-7890"],
        ...     ["Jane Doe", "098-765-4321"]
        ... ], columns=["Name", "Phone Number"])
        >>>
        >>> omnitab_model = Omnitab("'neulab/omnitab-large-finetuned-wtq'")
        >>> query = "What is John Doe's phone number?"
        >>> answer = omnitab_model.predict(table=data, query=query, tbl_name='Contact Info')
        >>> print(answer)
        [['123-456-7890']]
    """

    def __init__(self, model_name: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.model.to(self.device)

    @override
    def process_input(self, table: pd.DataFrame,
                      query: str,
                      tbl_name: str) -> Any | None:
        # Check dimensions of the table (rows*columns). If the size is larger than 1024,
        # we return None with a warning log. This is due to the limitation of the model input size.
        if table.shape[0] * table.shape[1] > 1024:
            logging.warning("Input is too long for model")
            return None

        # The attributes of the DataFrame 'table' are converted to strings to make sure the table
        # representation is in a consistent format.
        table = table.astype(str)

        # The content of the DataFrame 'table' is turned to lowercase in order to standardize the information.
        for col in table.columns:
            table[col] = table[col].str.lower()

        # Also, the query is transformed into lowercase for the same standardization reasons.
        query = query.lower()

        try:
            # The 'table' and 'query' are tokenized using the tokenizer.
            model_input = self.tokenizer(table=table, queries=query, return_tensors="pt")
        except ValueError as e:
            # A ValueError is expected if the tokenized input exceeds the length accepted by the model.
            # If such error is raised, a warning log is returned, and the function returns None.
            logging.warning(e)
            return None

        # Before returning the tokenized input, it checks once again the length of the input.
        # If it's longer than 1024, a warning is logged and the function returns None.
        if model_input.input_ids.shape[1] > 1024:
            logging.warning("Input is too long for model")
            return None

        # If all the above checks pass, the processed input is returned, set to the appropriate device for feeding
        # into the model subsequently.
        return model_input.to(self.device)

    @override
    def predict_input(self, model_input, table) -> list[list[list[str]]]:
        outputs = self.model.generate(**model_input)
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)

`Tapas`

Bases: AbstractModel

The Tapas class inherits from the AbstractModel and specializes it to parse tables using the TAPAS model.

Attributes:

Name	Type	Description
`tokenizer`	`TapasTokenizer`	The tokenizer for input preprocessing.
`model`	`TapasForQuestionAnswering`	The model used to answer the queries from the table.

Note

The model used in this class is google/tapas-large-finetuned-wtq.
The TAPAS model works specifically with tables that only contain strings and has a model input limit of 512 tokens.

Examples:

>>>import pandas as pd
>>>from qatch.models import Tapas
>>>
>>> data = pd.DataFrame([
...     ["John Doe", "123-456-7890"],
...     ["Jane Doe", "098-765-4321"]
... ], columns=["Name", "Phone Number"])
>>>
>>> tapas_model = Tapas("google/tapas-large-finetuned-wtq")
>>> query = "What is John Doe's phone number?"
>>> answer = tapas_model.predict(table=data, query=query, tbl_name='Contact Info')
>>> print(answer)
[['123-456-7890']]

Source code in qatch/models/tapas.py

class Tapas(AbstractModel):
    """
    The Tapas class inherits from the AbstractModel and specializes it to parse tables using the TAPAS model.

    Attributes:
        tokenizer (TapasTokenizer): The tokenizer for input preprocessing.
        model (TapasForQuestionAnswering): The model used to answer the queries from the table.

    Note:
        - The model used in this class is `google/tapas-large-finetuned-wtq`.
        - The TAPAS model works specifically with tables that only contain strings
         and has a model input limit of 512 tokens.

    Examples:
        >>>import pandas as pd
        >>>from qatch.models import Tapas
        >>>
        >>> data = pd.DataFrame([
        ...     ["John Doe", "123-456-7890"],
        ...     ["Jane Doe", "098-765-4321"]
        ... ], columns=["Name", "Phone Number"])
        >>>
        >>> tapas_model = Tapas("google/tapas-large-finetuned-wtq")
        >>> query = "What is John Doe's phone number?"
        >>> answer = tapas_model.predict(table=data, query=query, tbl_name='Contact Info')
        >>> print(answer)
        [['123-456-7890']]
    """

    def __init__(self, model_name: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tokenizer = TapasTokenizer.from_pretrained(model_name)
        self.model = TapasForQuestionAnswering.from_pretrained(model_name)
        self.model.to(self.device)

    def process_input(self, table: pd.DataFrame,
                      query: str,
                      tbl_name: str) -> Any | None:
        if table.shape[0] * table.shape[1] > 512:
            return None

        # convert table to string
        table = table.astype(str)
        # tokenize inputs
        try:
            model_input = self.tokenizer(table=table,
                                         queries=query,
                                         padding="max_length",
                                         return_tensors="pt")
        except ValueError as e:
            # we get error when the tokenized input is longer than accepted from model
            logging.warning(e)
            return None

        return model_input.to(self.device)

    def predict_input(self, model_input, table) -> list[Any]:
        # Step 1: Pass the model_input to the model for forward propagation to generate outputs.
        outputs = self.model(**model_input)
        # Step 2: Move the model_input tensor to cpu.
        model_input.to('cpu')
        # Step 3: Move the output tensors to CPU and detach them from the computational graph.
        outputs = {idx: outputs[idx].cpu().detach() for idx in outputs}
        # Step 4: Detach any tensors in model_input from the computational graph.
        [model_input[idx].detach() for idx in model_input]

        # Step 5: Retrieve the coordinates from the logits using a tokenizer function.
        pred_query_cords, _ = self.tokenizer.convert_logits_to_predictions(
            model_input,
            outputs['logits'],
            outputs['logits_aggregation']
        )
        # Step 6: Construct a list of answers.
        answers = []
        for tbl_cords in pred_query_cords:
            query_answer = defaultdict(list)
            # For each coordinate set in the predicted query coordinates,
            # construct a dictionary of row-wise query answers.
            [query_answer[row].append(table.iat[(row, col)])
             for row, col in tbl_cords]
            answers.extend(list(query_answer.values()))
        # Step 7: Dispose of potentially memory-heavy variables.
        del model_input
        del outputs
        # Step 8: Return the list of processed query answers derived from the model's predictions.
        return answers

`Tapex`

Bases: AbstractModel

The Tapex class inherits from the AbstractModel and specializes it to parse tables using the TAPEX model.

Attributes:

Name	Type	Description
`tokenizer`	`TapexTokenizer`	The tokenizer for input preprocessing.
`model`	`BartForConditionalGeneration`	The model used to answer the queries from the table.

Note

The model used in this class is 'microsoft/tapex-large-finetuned-wtq'.
The TAPEX model works specifically with tables that only contain strings and has a model input limit of 1024 tokens.

Examples:

>>> import pandas as pd
>>> from qatch.models import Tapex
>>>
>>> data = pd.DataFrame([
...     ["John Doe", "123-456-7890"],
...     ["Jane Doe", "098-765-4321"]
... ], columns=["Name", "Phone Number"])
>>>
>>> tapex_model = Tapex("microsoft/tapex-large-finetuned-wtq")
>>> query = "What is John Doe's phone number?"
>>> answer = tapex_model.predict(table=data, query=query, tbl_name='Contact Info')
>>> print(answer)
[['123-456-7890']]

Source code in qatch/models/tapex.py

class Tapex(AbstractModel):
    """
    The Tapex class inherits from the AbstractModel and specializes it to parse tables using the TAPEX model.

    Attributes:
        tokenizer (TapexTokenizer): The tokenizer for input preprocessing.
        model (BartForConditionalGeneration): The model used to answer the queries from the table.

    Note:
        - The model used in this class is 'microsoft/tapex-large-finetuned-wtq'.
        - The TAPEX model works specifically with tables that only contain strings
         and has a model input limit of 1024 tokens.

    Examples:
        >>> import pandas as pd
        >>> from qatch.models import Tapex
        >>>
        >>> data = pd.DataFrame([
        ...     ["John Doe", "123-456-7890"],
        ...     ["Jane Doe", "098-765-4321"]
        ... ], columns=["Name", "Phone Number"])
        >>>
        >>> tapex_model = Tapex("microsoft/tapex-large-finetuned-wtq")
        >>> query = "What is John Doe's phone number?"
        >>> answer = tapex_model.predict(table=data, query=query, tbl_name='Contact Info')
        >>> print(answer)
        [['123-456-7890']]
    """

    def __init__(self, model_name: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tokenizer = TapexTokenizer.from_pretrained(model_name)
        self.model = BartForConditionalGeneration.from_pretrained(model_name)
        self.model.to(self.device)

    def process_input(self, table: pd.DataFrame,
                      query: str,
                      tbl_name: str) -> Any | None:
        if table.shape[0] * table.shape[1] > 1024:
            return None

        # convert table to string
        table = table.astype(str)
        # process table
        for col in table.columns:
            table[col] = table[col].str.lower()

        # tapex accepts uncased input since it is pre-trained on the uncased corpus
        query = query.lower()
        try:
            model_input = self.tokenizer(table=table, query=query,
                                         padding=True, return_tensors="pt")
        except ValueError as e:
            # we get error when the tokenized input is longer than accepted from model
            logging.warning(e)
            return None

        if model_input.input_ids.shape[1] > 1024:
            warnings.warn(f'After tokenization'
                          f' the input is longer than 1024 tokens: '
                          f'{model_input.input_ids.shape[1]}. '
                          'the input will be skipped')
            return None

        return model_input.to(self.device)

    def predict_input(self, model_input, table) -> list[Any]:
        outputs = self.model.generate(**model_input)
        model_input.to('cpu')

        [model_input[idx].detach() for idx in model_input]
        outputs = outputs.detach().cpu()

        # decode back to text
        pred_cells_queries = self.tokenizer.batch_decode(outputs,
                                                         skip_special_tokens=True)
        # the output contains list of string for each query. Manually transform the output
        answers = []
        for pred_query in pred_cells_queries:
            query_ans = self._return_cells_aggr_by_row(table, pred_query)
            answers.extend(query_ans)
        del model_input
        del outputs
        return answers

    @staticmethod
    def _return_cells_aggr_by_row(table, pred_query):
        """
        Perform an aggregation operation by row of the cells in a specified table based on a predicate query.

        Args:
            table (np.array): The table to perform the operation on.
            pred_query (str): The predicate query used for aggregation operation. It should be a string of comma
                              separated cell values, e.g., "cell1,cell2,cell3".

        Returns:
            list: Returns a list of lists where each sublist contains aggregated cell values from a single row of the table.

        Example:
            Let's assume we have a table as below:

            [["cell1", "cell2"],
            ["cell3", "cell1"],
            ["cell1", "cell2"]]

            And pred_query as "cell1,cell1,cell2"

            Calling _return_cells_aggr_by_row(table, pred_query) will give:

            [["cell1", "cell2"], ["cell1"], ["cell1", "cell2"]]

        Note:
            If a cell from the pred_query is not present in the table, the method treats it as if it's in an imaginary
            row indexed as -1. Therefore, if you see a [-1] in the result, it means one or more cells in your pred_query
            did not appear in the table.
        """
        # Initializing a defaultdict to store the results of the query
        query_ans = defaultdict(list)
        # Splitting the query into cells
        cells: list = pred_query.split(",")
        # Counting the occurrences of each cell in the query
        counted_cells = Counter(cells)
        # Iterating over each cell type and its count from the counted_cells
        for cell, count in counted_cells.items():
            # Finding the row ids where the current cell type exists in the table
            row_ids = np.where(table == cell.strip())[0]
            if len(row_ids) == 0:
                # If the cell is not present in the table, set the row_id as -1
                row_ids = [-1]

            # If the count of cell in the query > 1, select the first 'count' number of rows
            if count > 1:
                row_ids = row_ids[:count]
            # Appending the cell to the rows in the query_ans for each row_id
            [query_ans[idx].append(cell.strip()) for idx in row_ids]
        # Return the aggregated cells from each row as a list of lists
        return list(query_ans.values())