QATCH Evaluate

`MetricEvaluator`

Class for evaluating SQL query prediction metrics using target results and predicted outputs.

Attributes:

Name	Type	Description
`databases`	`MultipleDatabases`	Object representing database connections. This attribute stores information about multiple database connections.
`metrics`	`list[str]`	List of metric names to be evaluated. Default metrics include: ['cell_precision', 'cell_recall', 'tuple_cardinality', 'tuple_constraint', 'tuple_order']

Source code in qatch/metric_evaluator.py

class MetricEvaluator:
    """
    Class for evaluating SQL query prediction metrics using target results and predicted outputs.

    Attributes:
        databases (MultipleDatabases): Object representing database connections.
            This attribute stores information about multiple database connections.

        metrics (list[str]): List of metric names to be evaluated. Default metrics include:
            ['cell_precision', 'cell_recall', 'tuple_cardinality', 'tuple_constraint', 'tuple_order']
    """

    def __init__(self, databases: MultipleDatabases | None = None, metrics: list[str] | str | None = None):
        if metrics is None:
            metrics = ['cell_precision', 'cell_recall',
                       'tuple_cardinality', 'tuple_constraint',
                       'tuple_order']

        self.metrics = metrics if isinstance(metrics, list) else [metrics]

        self._tags_generator = {
            'cell_precision': CellPrecisionTag(),
            'cell_recall': CellRecallTag(),
            'tuple_cardinality': TupleCardinalityTag(),
            'tuple_constraint': TupleConstraintTag(),
            'tuple_order': TupleOrderTag(),
        }
        self.databases = databases

    def evaluate_with_df(self, df, prediction_col_name: str, task: str, target_col_name: str = 'query',
                         keep_target: bool = False
                         ) -> pd.DataFrame:
        """Evaluates SQL queries for various metrics including cell precision, cell recall,
        tuple cardinality, tuple constraint, tuple order.

        For each row in the input DataFrame, it evaluates either the task as a QA
        (Question Answering) or SP (Semantic Parsing). Then, it concatenates the original DataFrame
        and the evaluated metric DataFrame.

        Notes:
            - df must contains at lest two columns 'target_col_name' and 'prediction_col_name'
            - 'target_col_name' is the target SQL query that anwers the NL question | the target cell tables
            - 'prediction_col_name' can be either the predicted SQL or the predicted cells
            - for QA, return zeros if predicted cells are not compliant with expected format: [["wales", "scotland"], ["england"]]
            - for both tasks, return zeros if the 'target_col_name' SQL query cannot be executed over the input databases
            - if 'target_col_name' contains the table cells, Tuple Order is calculated by default. Check if it is necessary.

        Args:
            df (pd.DataFrame): Input DataFrame where each row represents a test.
            prediction_col_name (str): Name of the column in the DataFrame that contains predictions.
            task (str): Type of evaluation task. Could be `QA` or `SP`.
            target_col_name (str): Name of the column in the DataFrame that contains target queries.
            Default is 'query'.
            keep_target (bool): FALSE by default. If TRUE, keeps the target query.

        Returns:
            pd.DataFrame: Output DataFrame that has the original DataFrame along with the evaluated metric DataFrame.

        Examples:
            You do not have to specify the "databases" in case the "target" and "predictions" are already executed for QA:

            >>> eval_task = MetricEvaluator(databases=None, metrics=['cell_precision', 'cell_recall'])
            >>> test = {"sql_tags": "SELECT",
            ...         "prediction": [["wales", "scotland"], ["england"]],
            ...         "target": [["scotland", "wales"], ["england"]]}
            >>> df = pd.DataFrame(test)
            >>> prediction_col_name = "prediction"
            >>> target_col_name = "target"
            >>> result = eval_task.evaluate_with_df(df, prediction_col_name, 'QA', target_col_name)
            >>> print(result)
            {'cell_precision_prediction': 1.0, 'cell_recall_prediction': 1.0}

            If this is not the case, you have to load the "databases" to execute the "target" queries.

            >>> eval_task = MetricEvaluator(databases=databases, metrics=['cell_precision', 'cell_recall'])
            >>> test = {"sql_tags": "SELECT",
            ...         "prediction": [["wales", "scotland"], ["england"]],
            ...         "target": ['SELECT * FROM table']}
            >>> df = pd.DataFrame(test)
            >>> prediction_col_name = "prediction"
            >>> target_col_name = "target"
            >>> result = eval_task.evaluate_with_df(df, prediction_col_name, 'QA', target_col_name)
            >>> print(result)
            {'cell_precision_prediction': 1.0, 'cell_recall_prediction': 1.0}

        Note:
            For SP, if you have both the target and the predictions already executed, you have to specify the task as 'QA'

            This because when using task 'SP' there are automatic controls on the query syntactic which are not available if they have
            already been executed.

        """
        tqdm.pandas(desc=f'Evaluating {task} tests')
        if task.upper() == 'QA':
            # add the new metrics at the bottom of the dataframe
            df_metrics = df.progress_apply(lambda row: self.evaluate_single_test_QA(row.to_dict(),
                                                                                    prediction_col_name,
                                                                                    target_col_name),
                                           axis=1, result_type='expand')
        else:
            df_metrics = df.progress_apply(lambda row: self.evaluate_single_test_SP(row.to_dict(),
                                                                                    prediction_col_name,
                                                                                    target_col_name), axis=1,
                                           result_type='expand')
        return pd.concat([df, df_metrics], axis=1).replace({np.nan: None})

    def evaluate_single_test_QA(self, test: dict, prediction_col_name: str, target_col_name: str) -> dict:
        """
        Evaluates metric scores on a single test QA task where a test is a dictionary (or pd.Series) and the
        `prediction_col_name` and `target_col_name` are the column names in the test data containing model predictions
        and actual target values respectively.

        Args:
            test (dict | pd.Series): A dictionary or pandas Series containing a single test data. The keys (columns for Series)
                should include `prediction_col_name` and `target_col_name`.
            prediction_col_name (str): String representing the key in `test` dictionary (or column in `test` pandas Series)
                where the predicted values are.
            target_col_name (str): String representing the key in `test` dictionary (or column in `test` pandas Series)
                where the actual target values are.

        Returns:
            dict: A dictionary with keys are metric name and value is the evaluated metric score for each metric in `self.metrics`.

        Notes:
            - return zeros if prediction is not compliant with expected format: [["wales", "scotland"], ["england"]]
            - return zeros if target query cannot be executed over the databases

        Examples:
            >>> eval_task = MetricEvaluator(databases, metrics=['cell_precision', 'cell_recall'])
            >>> test = {"sql_tags": "SELECT",
            ...         "prediction": [["wales", "scotland"], ["england"]],
            ...         "target": [["scotland", "wales"], ["england"]]}
            >>> prediction_col_name = "prediction"
            >>> target_col_name = "target"
            >>>result = eval_task.evaluate_single_test_QA(test, prediction_col_name, target_col_name)
            >>> print(result)
            {'cell_precision_prediction': 1.0, 'cell_recall_prediction': 1.0}
        """
        output_in_case_error = {f'{metric}_{prediction_col_name}': 0 for metric in self.metrics}
        if not CellPrecisionTag.is_table_well_structured(test[prediction_col_name]):
            return output_in_case_error

        # Runs the target query on the database only if necessary
        new_target_col = f'{target_col_name}_result'
        if isinstance(test[target_col_name], list):
            logging.warning('The target tables is passed as input, '
                            'the TUPLE ORDER is calucated by default because there is no way to check if it is an ORDERBY test')
            # if the target is already a list of list, we do not need to run the SQL over the databases
            if not CellPrecisionTag.is_table_well_structured(test[target_col_name]):
                return output_in_case_error
            test[new_target_col] = test[target_col_name]
        else:
            if self.databases is None:
                raise ValueError(
                    f'The {target_col_name} is a query but no database is specified in the MetricEvaluator.'
                    f'Plese initialize is as MetricEvaluator(databases)')
            try:
                test[new_target_col] = self.databases.run_query(test['db_id'], test[target_col_name])
            except sqlite3.Error as e:
                # catch any possible error of prediction and return all zeros
                logging.error(e)
                return output_in_case_error

        # if there are no errors,  compute the metric results
        metric2evaluation = {f'{metric}_{prediction_col_name}': None for metric in self.metrics}
        for metric in self.metrics:
            generator = self._tags_generator[metric]
            # initialize the metric column
            # evaluate the metric only for the test where the prediction is not equal to the target
            tqdm.pandas(desc=f'Evaluating {metric}')
            if metric == 'tuple_order' and not isinstance(test[target_col_name], list) and 'order by' not in test[
                target_col_name].lower():
                continue
            evaluation = generator.evaluate_single_test_metric(test[new_target_col], test[prediction_col_name])
            metric2evaluation[f'{metric}_{prediction_col_name}'] = evaluation
        return metric2evaluation

    def evaluate_single_test_SP(self, test: dict, prediction_col_name: str, target_col_name: str) -> dict:
        """
        Evaluates metrics for a single SQL prediction test by fetching the results of the predicted and
        target queries from the database.

        This function fetches results based on provided `prediction_col_name` and `target_col_name`. Then it evaluates
        performance of the prediction by invoking `evaluate_single_test_QA`.

        Args:
            self (MetricEvaluator): The object instance the method is called on.
            test (dict | pd.Series): The test data as a dictionary or pandas Series. It contains the 'db_id' (database identifier).
                                     It is expected to have 'predictions_SP' and 'target_SP' keys/columns updated in process.
            prediction_col_name (str): The name of column where prediction is stored.
            target_col_name (str): The name of column where the target is stored.

        Returns:
            dict: A dictionary containing evaluation results obtained from `evaluate_single_test_QA`.

        Notes:
            If the predicted query cannot be run on the db, the resulting metrics are all zeros

        Examples:
            >>> test = {'db_id': 'database1', 'target': 'SELECT DISTINCT emailisfree FROM fraud', 'prediction': 'SELECT emailsisfree, income FROM fraud'}
            >>> evaluator = MetricEvaluator(databases)
            >>> results = evaluator.evaluate_single_test_SP(test, 'prediction', 'target')
            >>> print(results)
            {'cell_precision_prediction': 0.50, 'cell_recall_prediction': 1.0}
        """
        # Compares the target and predicted SQL queries after cleaning and formatting. If they are identical, it returns metrics as 1
        if self.are_cleaned_sql_identical(test[target_col_name], test[prediction_col_name]):
            metrics_result = {f'{metric}_{prediction_col_name}': 1 for metric in self.metrics}
            metrics_result[f'tuple_order_{prediction_col_name}'] = None \
                if 'order' not in test[target_col_name].lower() else 1
            return metrics_result

        # Tries to run the predicted query on the database. If there is an error (e.g. syntax error in the query),
        # it logs the error and returns metrics as 0
        try:
            test[prediction_col_name] = self.databases.run_query(test['db_id'], test[prediction_col_name])
        except sqlite3.Error as e:
            # catch any possible error of prediction and return all zeros
            logging.error(e)
            return {f'{metric}_{prediction_col_name}': 0 for metric in self.metrics}
        # Evaluates the results of the target and predicted queries using the evaluate_single_test_QA function
        return self.evaluate_single_test_QA(test, prediction_col_name, target_col_name)

    @staticmethod
    def are_cleaned_sql_identical(target: str, prediction: str) -> bool:
        """
        Create a mask based on whether the target and prediction strings are equal after cleaning.

        Args:
            target (str): The target string.
            prediction (str): The prediction string.

        Returns:
            bool: True if cleaned prediction equals cleaned target, False otherwise.
        """
        new_target = (target.lower()
                      .replace(" ,", ",")
                      .replace("  ", " ")
                      .replace('"', '')
                      .replace("'", '')
                      .strip())

        new_pred = (prediction.lower()
                    .replace(" ,", ",")
                    .replace("  ", " ")
                    .replace('"', '')
                    .replace("'", '')
                    .replace(' ( ', '(')
                    .replace(' )', ')')
                    .strip())
        return new_pred == new_target

`are_cleaned_sql_identical(target, prediction)` `staticmethod`

Create a mask based on whether the target and prediction strings are equal after cleaning.

Parameters:

Name	Type	Description	Default
`target`	`str`	The target string.	required
`prediction`	`str`	The prediction string.	required

Returns:

Name	Type	Description
`bool`	`bool`	True if cleaned prediction equals cleaned target, False otherwise.

Source code in qatch/metric_evaluator.py

@staticmethod
def are_cleaned_sql_identical(target: str, prediction: str) -> bool:
    """
    Create a mask based on whether the target and prediction strings are equal after cleaning.

    Args:
        target (str): The target string.
        prediction (str): The prediction string.

    Returns:
        bool: True if cleaned prediction equals cleaned target, False otherwise.
    """
    new_target = (target.lower()
                  .replace(" ,", ",")
                  .replace("  ", " ")
                  .replace('"', '')
                  .replace("'", '')
                  .strip())

    new_pred = (prediction.lower()
                .replace(" ,", ",")
                .replace("  ", " ")
                .replace('"', '')
                .replace("'", '')
                .replace(' ( ', '(')
                .replace(' )', ')')
                .strip())
    return new_pred == new_target

`evaluate_single_test_QA(test, prediction_col_name, target_col_name)`

Evaluates metric scores on a single test QA task where a test is a dictionary (or pd.Series) and the prediction_col_name and target_col_name are the column names in the test data containing model predictions and actual target values respectively.

Parameters:

Name	Type	Description	Default
`test`	`dict \| Series`	A dictionary or pandas Series containing a single test data. The keys (columns for Series) should include `prediction_col_name` and `target_col_name`.	required
`prediction_col_name`	`str`	String representing the key in `test` dictionary (or column in `test` pandas Series) where the predicted values are.	required
`target_col_name`	`str`	String representing the key in `test` dictionary (or column in `test` pandas Series) where the actual target values are.	required

Returns:

Name	Type	Description
`dict`	`dict`	A dictionary with keys are metric name and value is the evaluated metric score for each metric in `self.metrics`.

Notes

return zeros if prediction is not compliant with expected format: [["wales", "scotland"], ["england"]]
return zeros if target query cannot be executed over the databases

Examples:

>>> eval_task = MetricEvaluator(databases, metrics=['cell_precision', 'cell_recall'])
>>> test = {"sql_tags": "SELECT",
...         "prediction": [["wales", "scotland"], ["england"]],
...         "target": [["scotland", "wales"], ["england"]]}
>>> prediction_col_name = "prediction"
>>> target_col_name = "target"
>>>result = eval_task.evaluate_single_test_QA(test, prediction_col_name, target_col_name)
>>> print(result)
{'cell_precision_prediction': 1.0, 'cell_recall_prediction': 1.0}

Source code in qatch/metric_evaluator.py

def evaluate_single_test_QA(self, test: dict, prediction_col_name: str, target_col_name: str) -> dict:
    """
    Evaluates metric scores on a single test QA task where a test is a dictionary (or pd.Series) and the
    `prediction_col_name` and `target_col_name` are the column names in the test data containing model predictions
    and actual target values respectively.

    Args:
        test (dict | pd.Series): A dictionary or pandas Series containing a single test data. The keys (columns for Series)
            should include `prediction_col_name` and `target_col_name`.
        prediction_col_name (str): String representing the key in `test` dictionary (or column in `test` pandas Series)
            where the predicted values are.
        target_col_name (str): String representing the key in `test` dictionary (or column in `test` pandas Series)
            where the actual target values are.

    Returns:
        dict: A dictionary with keys are metric name and value is the evaluated metric score for each metric in `self.metrics`.

    Notes:
        - return zeros if prediction is not compliant with expected format: [["wales", "scotland"], ["england"]]
        - return zeros if target query cannot be executed over the databases

    Examples:
        >>> eval_task = MetricEvaluator(databases, metrics=['cell_precision', 'cell_recall'])
        >>> test = {"sql_tags": "SELECT",
        ...         "prediction": [["wales", "scotland"], ["england"]],
        ...         "target": [["scotland", "wales"], ["england"]]}
        >>> prediction_col_name = "prediction"
        >>> target_col_name = "target"
        >>>result = eval_task.evaluate_single_test_QA(test, prediction_col_name, target_col_name)
        >>> print(result)
        {'cell_precision_prediction': 1.0, 'cell_recall_prediction': 1.0}
    """
    output_in_case_error = {f'{metric}_{prediction_col_name}': 0 for metric in self.metrics}
    if not CellPrecisionTag.is_table_well_structured(test[prediction_col_name]):
        return output_in_case_error

    # Runs the target query on the database only if necessary
    new_target_col = f'{target_col_name}_result'
    if isinstance(test[target_col_name], list):
        logging.warning('The target tables is passed as input, '
                        'the TUPLE ORDER is calucated by default because there is no way to check if it is an ORDERBY test')
        # if the target is already a list of list, we do not need to run the SQL over the databases
        if not CellPrecisionTag.is_table_well_structured(test[target_col_name]):
            return output_in_case_error
        test[new_target_col] = test[target_col_name]
    else:
        if self.databases is None:
            raise ValueError(
                f'The {target_col_name} is a query but no database is specified in the MetricEvaluator.'
                f'Plese initialize is as MetricEvaluator(databases)')
        try:
            test[new_target_col] = self.databases.run_query(test['db_id'], test[target_col_name])
        except sqlite3.Error as e:
            # catch any possible error of prediction and return all zeros
            logging.error(e)
            return output_in_case_error

    # if there are no errors,  compute the metric results
    metric2evaluation = {f'{metric}_{prediction_col_name}': None for metric in self.metrics}
    for metric in self.metrics:
        generator = self._tags_generator[metric]
        # initialize the metric column
        # evaluate the metric only for the test where the prediction is not equal to the target
        tqdm.pandas(desc=f'Evaluating {metric}')
        if metric == 'tuple_order' and not isinstance(test[target_col_name], list) and 'order by' not in test[
            target_col_name].lower():
            continue
        evaluation = generator.evaluate_single_test_metric(test[new_target_col], test[prediction_col_name])
        metric2evaluation[f'{metric}_{prediction_col_name}'] = evaluation
    return metric2evaluation

`evaluate_single_test_SP(test, prediction_col_name, target_col_name)`

Evaluates metrics for a single SQL prediction test by fetching the results of the predicted and target queries from the database.

This function fetches results based on provided prediction_col_name and target_col_name. Then it evaluates performance of the prediction by invoking evaluate_single_test_QA.

Parameters:

Name	Type	Description	Default
`self`	`MetricEvaluator`	The object instance the method is called on.	required
`test`	`dict \| Series`	The test data as a dictionary or pandas Series. It contains the 'db_id' (database identifier). It is expected to have 'predictions_SP' and 'target_SP' keys/columns updated in process.	required
`prediction_col_name`	`str`	The name of column where prediction is stored.	required
`target_col_name`	`str`	The name of column where the target is stored.	required

Returns:

Name	Type	Description
`dict`	`dict`	A dictionary containing evaluation results obtained from `evaluate_single_test_QA`.

Notes

If the predicted query cannot be run on the db, the resulting metrics are all zeros

Examples:

>>> test = {'db_id': 'database1', 'target': 'SELECT DISTINCT emailisfree FROM fraud', 'prediction': 'SELECT emailsisfree, income FROM fraud'}
>>> evaluator = MetricEvaluator(databases)
>>> results = evaluator.evaluate_single_test_SP(test, 'prediction', 'target')
>>> print(results)
{'cell_precision_prediction': 0.50, 'cell_recall_prediction': 1.0}

Source code in qatch/metric_evaluator.py

def evaluate_single_test_SP(self, test: dict, prediction_col_name: str, target_col_name: str) -> dict:
    """
    Evaluates metrics for a single SQL prediction test by fetching the results of the predicted and
    target queries from the database.

    This function fetches results based on provided `prediction_col_name` and `target_col_name`. Then it evaluates
    performance of the prediction by invoking `evaluate_single_test_QA`.

    Args:
        self (MetricEvaluator): The object instance the method is called on.
        test (dict | pd.Series): The test data as a dictionary or pandas Series. It contains the 'db_id' (database identifier).
                                 It is expected to have 'predictions_SP' and 'target_SP' keys/columns updated in process.
        prediction_col_name (str): The name of column where prediction is stored.
        target_col_name (str): The name of column where the target is stored.

    Returns:
        dict: A dictionary containing evaluation results obtained from `evaluate_single_test_QA`.

    Notes:
        If the predicted query cannot be run on the db, the resulting metrics are all zeros

    Examples:
        >>> test = {'db_id': 'database1', 'target': 'SELECT DISTINCT emailisfree FROM fraud', 'prediction': 'SELECT emailsisfree, income FROM fraud'}
        >>> evaluator = MetricEvaluator(databases)
        >>> results = evaluator.evaluate_single_test_SP(test, 'prediction', 'target')
        >>> print(results)
        {'cell_precision_prediction': 0.50, 'cell_recall_prediction': 1.0}
    """
    # Compares the target and predicted SQL queries after cleaning and formatting. If they are identical, it returns metrics as 1
    if self.are_cleaned_sql_identical(test[target_col_name], test[prediction_col_name]):
        metrics_result = {f'{metric}_{prediction_col_name}': 1 for metric in self.metrics}
        metrics_result[f'tuple_order_{prediction_col_name}'] = None \
            if 'order' not in test[target_col_name].lower() else 1
        return metrics_result

    # Tries to run the predicted query on the database. If there is an error (e.g. syntax error in the query),
    # it logs the error and returns metrics as 0
    try:
        test[prediction_col_name] = self.databases.run_query(test['db_id'], test[prediction_col_name])
    except sqlite3.Error as e:
        # catch any possible error of prediction and return all zeros
        logging.error(e)
        return {f'{metric}_{prediction_col_name}': 0 for metric in self.metrics}
    # Evaluates the results of the target and predicted queries using the evaluate_single_test_QA function
    return self.evaluate_single_test_QA(test, prediction_col_name, target_col_name)

`evaluate_with_df(df, prediction_col_name, task, target_col_name='query', keep_target=False)`

Evaluates SQL queries for various metrics including cell precision, cell recall, tuple cardinality, tuple constraint, tuple order.

For each row in the input DataFrame, it evaluates either the task as a QA (Question Answering) or SP (Semantic Parsing). Then, it concatenates the original DataFrame and the evaluated metric DataFrame.

Notes

df must contains at lest two columns 'target_col_name' and 'prediction_col_name'
'target_col_name' is the target SQL query that anwers the NL question | the target cell tables
'prediction_col_name' can be either the predicted SQL or the predicted cells
for QA, return zeros if predicted cells are not compliant with expected format: [["wales", "scotland"], ["england"]]
for both tasks, return zeros if the 'target_col_name' SQL query cannot be executed over the input databases
if 'target_col_name' contains the table cells, Tuple Order is calculated by default. Check if it is necessary.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame where each row represents a test.	required
`prediction_col_name`	`str`	Name of the column in the DataFrame that contains predictions.	required
`task`	`str`	Type of evaluation task. Could be `QA` or `SP`.	required
`target_col_name`	`str`	Name of the column in the DataFrame that contains target queries.	`'query'`
`keep_target`	`bool`	FALSE by default. If TRUE, keeps the target query.	`False`

Returns:

Type	Description
`DataFrame`	pd.DataFrame: Output DataFrame that has the original DataFrame along with the evaluated metric DataFrame.

Examples:

You do not have to specify the "databases" in case the "target" and "predictions" are already executed for QA:

>>> eval_task = MetricEvaluator(databases=None, metrics=['cell_precision', 'cell_recall'])
>>> test = {"sql_tags": "SELECT",
...         "prediction": [["wales", "scotland"], ["england"]],
...         "target": [["scotland", "wales"], ["england"]]}
>>> df = pd.DataFrame(test)
>>> prediction_col_name = "prediction"
>>> target_col_name = "target"
>>> result = eval_task.evaluate_with_df(df, prediction_col_name, 'QA', target_col_name)
>>> print(result)
{'cell_precision_prediction': 1.0, 'cell_recall_prediction': 1.0}

If this is not the case, you have to load the "databases" to execute the "target" queries.

>>> eval_task = MetricEvaluator(databases=databases, metrics=['cell_precision', 'cell_recall'])
>>> test = {"sql_tags": "SELECT",
...         "prediction": [["wales", "scotland"], ["england"]],
...         "target": ['SELECT * FROM table']}
>>> df = pd.DataFrame(test)
>>> prediction_col_name = "prediction"
>>> target_col_name = "target"
>>> result = eval_task.evaluate_with_df(df, prediction_col_name, 'QA', target_col_name)
>>> print(result)
{'cell_precision_prediction': 1.0, 'cell_recall_prediction': 1.0}

Note

For SP, if you have both the target and the predictions already executed, you have to specify the task as 'QA'

This because when using task 'SP' there are automatic controls on the query syntactic which are not available if they have already been executed.

Source code in qatch/metric_evaluator.py

def evaluate_with_df(self, df, prediction_col_name: str, task: str, target_col_name: str = 'query',
                     keep_target: bool = False
                     ) -> pd.DataFrame:
    """Evaluates SQL queries for various metrics including cell precision, cell recall,
    tuple cardinality, tuple constraint, tuple order.

    For each row in the input DataFrame, it evaluates either the task as a QA
    (Question Answering) or SP (Semantic Parsing). Then, it concatenates the original DataFrame
    and the evaluated metric DataFrame.

    Notes:
        - df must contains at lest two columns 'target_col_name' and 'prediction_col_name'
        - 'target_col_name' is the target SQL query that anwers the NL question | the target cell tables
        - 'prediction_col_name' can be either the predicted SQL or the predicted cells
        - for QA, return zeros if predicted cells are not compliant with expected format: [["wales", "scotland"], ["england"]]
        - for both tasks, return zeros if the 'target_col_name' SQL query cannot be executed over the input databases
        - if 'target_col_name' contains the table cells, Tuple Order is calculated by default. Check if it is necessary.

    Args:
        df (pd.DataFrame): Input DataFrame where each row represents a test.
        prediction_col_name (str): Name of the column in the DataFrame that contains predictions.
        task (str): Type of evaluation task. Could be `QA` or `SP`.
        target_col_name (str): Name of the column in the DataFrame that contains target queries.
        Default is 'query'.
        keep_target (bool): FALSE by default. If TRUE, keeps the target query.

    Returns:
        pd.DataFrame: Output DataFrame that has the original DataFrame along with the evaluated metric DataFrame.

    Examples:
        You do not have to specify the "databases" in case the "target" and "predictions" are already executed for QA:

        >>> eval_task = MetricEvaluator(databases=None, metrics=['cell_precision', 'cell_recall'])
        >>> test = {"sql_tags": "SELECT",
        ...         "prediction": [["wales", "scotland"], ["england"]],
        ...         "target": [["scotland", "wales"], ["england"]]}
        >>> df = pd.DataFrame(test)
        >>> prediction_col_name = "prediction"
        >>> target_col_name = "target"
        >>> result = eval_task.evaluate_with_df(df, prediction_col_name, 'QA', target_col_name)
        >>> print(result)
        {'cell_precision_prediction': 1.0, 'cell_recall_prediction': 1.0}

        If this is not the case, you have to load the "databases" to execute the "target" queries.

        >>> eval_task = MetricEvaluator(databases=databases, metrics=['cell_precision', 'cell_recall'])
        >>> test = {"sql_tags": "SELECT",
        ...         "prediction": [["wales", "scotland"], ["england"]],
        ...         "target": ['SELECT * FROM table']}
        >>> df = pd.DataFrame(test)
        >>> prediction_col_name = "prediction"
        >>> target_col_name = "target"
        >>> result = eval_task.evaluate_with_df(df, prediction_col_name, 'QA', target_col_name)
        >>> print(result)
        {'cell_precision_prediction': 1.0, 'cell_recall_prediction': 1.0}

    Note:
        For SP, if you have both the target and the predictions already executed, you have to specify the task as 'QA'

        This because when using task 'SP' there are automatic controls on the query syntactic which are not available if they have
        already been executed.

    """
    tqdm.pandas(desc=f'Evaluating {task} tests')
    if task.upper() == 'QA':
        # add the new metrics at the bottom of the dataframe
        df_metrics = df.progress_apply(lambda row: self.evaluate_single_test_QA(row.to_dict(),
                                                                                prediction_col_name,
                                                                                target_col_name),
                                       axis=1, result_type='expand')
    else:
        df_metrics = df.progress_apply(lambda row: self.evaluate_single_test_SP(row.to_dict(),
                                                                                prediction_col_name,
                                                                                target_col_name), axis=1,
                                       result_type='expand')
    return pd.concat([df, df_metrics], axis=1).replace({np.nan: None})

`CellPrecisionTag`

Bases: AbstractMetric

Source code in qatch/metrics/cell_precision_tag.py

class CellPrecisionTag(AbstractMetric):
    def evaluate_single_no_special_case(self, target: list[list],
                                        prediction: list[list]) -> float:
        """
        Calculates the ratio of predicted cells that are in the target.
        Does not consider cardinality (measured by other tags).
        High precision indicates that the model is good at identifying relevant instances
        and has a low false positive rate.

        Args:
            target (list[list]): Target table to be compared with the prediction table.
            prediction (list[list]): Prediction table to be compared with the target table.

        Returns:
            float: Precision score between [0, 1].
                - 0 indicates no cell in the prediction is in the target.
                - 1 indicates all cells in the prediction are in the target.

        Examples:
            >>> evaluator = CellPrecisionTag()
            >>> target = [['a', 'b'], ['c', 'd']]
            >>> prediction = [['a', 'b'], ['c', 'd']
            >>> evaluator.evaluate_single_no_special_case(target, prediction)
            1.0

            >>> target = [['a', 'b'], ['c', 'd']]
            >>> prediction = [['a', 'b'], ['c', 'e']
            >>> evaluator.evaluate_single_no_special_case(target, prediction)
            0.75

            >>> target = [['a', 'b'], ['c', 'd']]
            >>> prediction = [['a'], ['b'], ['c'], ['d']]
            >>> evaluator.evaluate_single_no_special_case(target, prediction)
            1.0  # it is one even if the schema does not match (we introduce tuple constraints for this)
        """
        target = np.array(target)
        prediction = np.array(prediction)

        sum_cell_match = np.sum(np.isin(prediction, target))
        return round(sum_cell_match / prediction.size, 3)

`evaluate_single_no_special_case(target, prediction)`

Calculates the ratio of predicted cells that are in the target. Does not consider cardinality (measured by other tags). High precision indicates that the model is good at identifying relevant instances and has a low false positive rate.

Parameters:

Name	Type	Description	Default
`target`	`list[list]`	Target table to be compared with the prediction table.	required
`prediction`	`list[list]`	Prediction table to be compared with the target table.	required

Returns:

Name	Type	Description
`float`	`float`	Precision score between [0, 1]. - 0 indicates no cell in the prediction is in the target. - 1 indicates all cells in the prediction are in the target.

Examples:

>>> evaluator = CellPrecisionTag()
>>> target = [['a', 'b'], ['c', 'd']]
>>> prediction = [['a', 'b'], ['c', 'd']
>>> evaluator.evaluate_single_no_special_case(target, prediction)
1.0

>>> target = [['a', 'b'], ['c', 'd']]
>>> prediction = [['a', 'b'], ['c', 'e']
>>> evaluator.evaluate_single_no_special_case(target, prediction)
0.75

>>> target = [['a', 'b'], ['c', 'd']]
>>> prediction = [['a'], ['b'], ['c'], ['d']]
>>> evaluator.evaluate_single_no_special_case(target, prediction)
1.0  # it is one even if the schema does not match (we introduce tuple constraints for this)

Source code in qatch/metrics/cell_precision_tag.py

def evaluate_single_no_special_case(self, target: list[list],
                                    prediction: list[list]) -> float:
    """
    Calculates the ratio of predicted cells that are in the target.
    Does not consider cardinality (measured by other tags).
    High precision indicates that the model is good at identifying relevant instances
    and has a low false positive rate.

    Args:
        target (list[list]): Target table to be compared with the prediction table.
        prediction (list[list]): Prediction table to be compared with the target table.

    Returns:
        float: Precision score between [0, 1].
            - 0 indicates no cell in the prediction is in the target.
            - 1 indicates all cells in the prediction are in the target.

    Examples:
        >>> evaluator = CellPrecisionTag()
        >>> target = [['a', 'b'], ['c', 'd']]
        >>> prediction = [['a', 'b'], ['c', 'd']
        >>> evaluator.evaluate_single_no_special_case(target, prediction)
        1.0

        >>> target = [['a', 'b'], ['c', 'd']]
        >>> prediction = [['a', 'b'], ['c', 'e']
        >>> evaluator.evaluate_single_no_special_case(target, prediction)
        0.75

        >>> target = [['a', 'b'], ['c', 'd']]
        >>> prediction = [['a'], ['b'], ['c'], ['d']]
        >>> evaluator.evaluate_single_no_special_case(target, prediction)
        1.0  # it is one even if the schema does not match (we introduce tuple constraints for this)
    """
    target = np.array(target)
    prediction = np.array(prediction)

    sum_cell_match = np.sum(np.isin(prediction, target))
    return round(sum_cell_match / prediction.size, 3)

`CellRecallTag`

Bases: AbstractMetric

Source code in qatch/metrics/cell_recall_tag.py

class CellRecallTag(AbstractMetric):
    def evaluate_single_no_special_case(self, target: list[list],
                                        prediction: list[list]) -> float:
        """
        Calculates the ratio of target cells that are in the prediction.
        High recall indicates that the model is good at identifying all relevant instances
        and has a low false negative rate.

        Args:
            target (list[list]): Target table to be compared with the prediction table.
            prediction (list[list]): Prediction table to be compared with the target table.

        Returns:
            float: Recall score between [0, 1].
                - 0 indicates no cell in the target is in the prediction.
                - 1 indicates all cells in the target are in the prediction.

        Examples:
            >>> evaluator = CellRecallTag()
            >>> target = [['a', 'b'], ['c', 'd']]
            >>> prediction = [['a', 'b'], ['c', 'd']
            >>> evaluator.evaluate_single_no_special_case(target, prediction)
            1.0

            >>> target = [['a', 'b'], ['c', 'd']]
            >>> prediction = [['a', 'x'], ['y', 'd']]
            >>> evaluator.evaluate_single_no_special_case(target, prediction)
            0.5

            >>> target = [['a', 'b'], ['c', 'd']]
            >>> prediction = [['a', 'a'], ['b', 'b'], ['c', 'd']]
            >>> evaluator.evaluate_single_no_special_case(target, prediction)
            1.0
        """
        target = np.array(target)
        prediction = np.array(prediction)
        sum_cell_match = np.sum(np.isin(target, prediction))
        return round(sum_cell_match / target.size, 3)

`evaluate_single_no_special_case(target, prediction)`

Calculates the ratio of target cells that are in the prediction. High recall indicates that the model is good at identifying all relevant instances and has a low false negative rate.

Parameters:

Name	Type	Description	Default
`target`	`list[list]`	Target table to be compared with the prediction table.	required
`prediction`	`list[list]`	Prediction table to be compared with the target table.	required

Returns:

Name	Type	Description
`float`	`float`	Recall score between [0, 1]. - 0 indicates no cell in the target is in the prediction. - 1 indicates all cells in the target are in the prediction.

Examples:

>>> evaluator = CellRecallTag()
>>> target = [['a', 'b'], ['c', 'd']]
>>> prediction = [['a', 'b'], ['c', 'd']
>>> evaluator.evaluate_single_no_special_case(target, prediction)
1.0

>>> target = [['a', 'b'], ['c', 'd']]
>>> prediction = [['a', 'x'], ['y', 'd']]
>>> evaluator.evaluate_single_no_special_case(target, prediction)
0.5

>>> target = [['a', 'b'], ['c', 'd']]
>>> prediction = [['a', 'a'], ['b', 'b'], ['c', 'd']]
>>> evaluator.evaluate_single_no_special_case(target, prediction)
1.0

Source code in qatch/metrics/cell_recall_tag.py

def evaluate_single_no_special_case(self, target: list[list],
                                    prediction: list[list]) -> float:
    """
    Calculates the ratio of target cells that are in the prediction.
    High recall indicates that the model is good at identifying all relevant instances
    and has a low false negative rate.

    Args:
        target (list[list]): Target table to be compared with the prediction table.
        prediction (list[list]): Prediction table to be compared with the target table.

    Returns:
        float: Recall score between [0, 1].
            - 0 indicates no cell in the target is in the prediction.
            - 1 indicates all cells in the target are in the prediction.

    Examples:
        >>> evaluator = CellRecallTag()
        >>> target = [['a', 'b'], ['c', 'd']]
        >>> prediction = [['a', 'b'], ['c', 'd']
        >>> evaluator.evaluate_single_no_special_case(target, prediction)
        1.0

        >>> target = [['a', 'b'], ['c', 'd']]
        >>> prediction = [['a', 'x'], ['y', 'd']]
        >>> evaluator.evaluate_single_no_special_case(target, prediction)
        0.5

        >>> target = [['a', 'b'], ['c', 'd']]
        >>> prediction = [['a', 'a'], ['b', 'b'], ['c', 'd']]
        >>> evaluator.evaluate_single_no_special_case(target, prediction)
        1.0
    """
    target = np.array(target)
    prediction = np.array(prediction)
    sum_cell_match = np.sum(np.isin(target, prediction))
    return round(sum_cell_match / target.size, 3)

`TupleCardinalityTag`

Bases: AbstractMetric

Source code in qatch/metrics/tuple_cardinality_tag.py

class TupleCardinalityTag(AbstractMetric):
    def evaluate_single_no_special_case(self,
                                        target: list[list],
                                        prediction: list[list]) -> float:
        """
        Evaluates the ratio of the length of the smaller list to the length of the larger list.

        Calculates the ratio of the length of the target table to the length of the prediction table
        or vice-versa based on the maximum length to ensure the score falls between 0 and 1.

        Args:
            target (list[list]): Target table to be compared with the prediction table.
            prediction (list[list]): Prediction table to be compared with the target table.

        Returns:
            float: Score between [0, 1].
                - 0 indicates the target/prediction is zero and the other is not.
                - 1 indicates the target/prediction is the same size as the other.

        Examples:
            >>> evaluator = TupleCardinalityTag()
            >>> target = [[a, b], [c, d], [c, d], [f, g]]
            >>> prediction = [[a, b], [3, 2]]
            >>> evaluator.evaluate_single_no_special_case(target, prediction)
            0.5  # 2/4

            >>> evaluator = TupleCardinalityTag()
            >>> target = [[a, b], [3, 2]]
            >>> prediction = [[a, b], [c, d], [c, d], [f, g]]
            >>> evaluator.evaluate_single_no_special_case(target, prediction)
            0.5

            >>> evaluator = TupleCardinalityTag()
            >>> target = [[a, b], [3, 2]]
            >>> prediction = [[a, b], ['c', 'd']]
            >>> evaluator.evaluate_single_no_special_case(target, prediction)
            1.0
        """
        if len(prediction) >= len(target):
            # in case we have more elements in the prediction than in the target
            return round(len(target) / len(prediction), 3)

        # in case we have more elements in the target than in the prediction
        elif len(prediction) < len(target):
            return round(len(prediction) / len(target), 3)

`evaluate_single_no_special_case(target, prediction)`

Evaluates the ratio of the length of the smaller list to the length of the larger list.

Calculates the ratio of the length of the target table to the length of the prediction table or vice-versa based on the maximum length to ensure the score falls between 0 and 1.

Parameters:

Name	Type	Description	Default
`target`	`list[list]`	Target table to be compared with the prediction table.	required
`prediction`	`list[list]`	Prediction table to be compared with the target table.	required

Returns:

Name	Type	Description
`float`	`float`	Score between [0, 1]. - 0 indicates the target/prediction is zero and the other is not. - 1 indicates the target/prediction is the same size as the other.

Examples:

>>> evaluator = TupleCardinalityTag()
>>> target = [[a, b], [c, d], [c, d], [f, g]]
>>> prediction = [[a, b], [3, 2]]
>>> evaluator.evaluate_single_no_special_case(target, prediction)
0.5  # 2/4

>>> evaluator = TupleCardinalityTag()
>>> target = [[a, b], [3, 2]]
>>> prediction = [[a, b], [c, d], [c, d], [f, g]]
>>> evaluator.evaluate_single_no_special_case(target, prediction)
0.5

>>> evaluator = TupleCardinalityTag()
>>> target = [[a, b], [3, 2]]
>>> prediction = [[a, b], ['c', 'd']]
>>> evaluator.evaluate_single_no_special_case(target, prediction)
1.0

Source code in qatch/metrics/tuple_cardinality_tag.py

def evaluate_single_no_special_case(self,
                                    target: list[list],
                                    prediction: list[list]) -> float:
    """
    Evaluates the ratio of the length of the smaller list to the length of the larger list.

    Calculates the ratio of the length of the target table to the length of the prediction table
    or vice-versa based on the maximum length to ensure the score falls between 0 and 1.

    Args:
        target (list[list]): Target table to be compared with the prediction table.
        prediction (list[list]): Prediction table to be compared with the target table.

    Returns:
        float: Score between [0, 1].
            - 0 indicates the target/prediction is zero and the other is not.
            - 1 indicates the target/prediction is the same size as the other.

    Examples:
        >>> evaluator = TupleCardinalityTag()
        >>> target = [[a, b], [c, d], [c, d], [f, g]]
        >>> prediction = [[a, b], [3, 2]]
        >>> evaluator.evaluate_single_no_special_case(target, prediction)
        0.5  # 2/4

        >>> evaluator = TupleCardinalityTag()
        >>> target = [[a, b], [3, 2]]
        >>> prediction = [[a, b], [c, d], [c, d], [f, g]]
        >>> evaluator.evaluate_single_no_special_case(target, prediction)
        0.5

        >>> evaluator = TupleCardinalityTag()
        >>> target = [[a, b], [3, 2]]
        >>> prediction = [[a, b], ['c', 'd']]
        >>> evaluator.evaluate_single_no_special_case(target, prediction)
        1.0
    """
    if len(prediction) >= len(target):
        # in case we have more elements in the prediction than in the target
        return round(len(target) / len(prediction), 3)

    # in case we have more elements in the target than in the prediction
    elif len(prediction) < len(target):
        return round(len(prediction) / len(target), 3)

`TupleConstraintTag`

Bases: AbstractMetric

Source code in qatch/metrics/tuple_constraint_tag.py

class TupleConstraintTag(AbstractMetric):
    def evaluate_single_no_special_case(self,
                                        target: list[list],
                                        prediction: list[list]
                                        ) -> float:
        """
        Evaluates the ratio between the cardinality of the target tuples and the prediction.
        Returns a score between 0 and 1. It is 1 if the schema, the cardinality and the cell values are equal.

        Args:
            target (list[list]): Target table to be compared with the prediction table.
            prediction (list[list]): Prediction table to be compared with the target table.

        Returns:
            float: Score between [0, 1].
                - 0 indicates NONE of the schema/cardinality/cell_values  are the same in prediction.
                - 1 indicates the schema, the cardinality and the cell values of
                    the prediction tuples are equal to the target ones.

        Examples:
            >>> evaluator = TupleConstraintTag()
            >>> target = [['a', 'b'], ['c', 'd']]
            >>> prediction = [['a', 'b'], ['c', 'd']]
            >>> evaluator.evaluate_single_no_special_case(target, prediction)
            1.0

            >>> evaluator = TupleConstraintTag()
            >>> target = [['a', 'b'], ['c', 'd']]
            >>> prediction = [['a', 'b'], ['a', 'b'], ['c', 'd']]
            >>> evaluator.evaluate_single_no_special_case(target, prediction)
            0.5  # only ['c', 'd'] is the same in both tables

            >>> evaluator = TupleConstraintTag()
            >>> target = [['a', 'b'], ['c', 'd']]
            >>> prediction = [['a', 'b'], ['a', 'b'], ['c', 'd'], ['c', 'd']]
            >>> evaluator.evaluate_single_no_special_case(target, prediction)
            0.0
        """
        target = map(sorted, target)
        prediction = map(sorted, prediction)

        target = map(tuple, target)
        prediction = map(tuple, prediction)

        count_targ_dict = Counter(target)
        count_pred_dict = Counter(prediction)

        cardinality = [count_pred_dict[key] == count for key, count in count_targ_dict.items()]

        return round(sum(cardinality) / len(cardinality), 3)

`evaluate_single_no_special_case(target, prediction)`

Evaluates the ratio between the cardinality of the target tuples and the prediction. Returns a score between 0 and 1. It is 1 if the schema, the cardinality and the cell values are equal.

Parameters:

Name	Type	Description	Default
`target`	`list[list]`	Target table to be compared with the prediction table.	required
`prediction`	`list[list]`	Prediction table to be compared with the target table.	required

Returns:

Name	Type	Description
`float`	`float`	Score between [0, 1]. - 0 indicates NONE of the schema/cardinality/cell_values are the same in prediction. - 1 indicates the schema, the cardinality and the cell values of the prediction tuples are equal to the target ones.

Examples:

>>> evaluator = TupleConstraintTag()
>>> target = [['a', 'b'], ['c', 'd']]
>>> prediction = [['a', 'b'], ['c', 'd']]
>>> evaluator.evaluate_single_no_special_case(target, prediction)
1.0

>>> evaluator = TupleConstraintTag()
>>> target = [['a', 'b'], ['c', 'd']]
>>> prediction = [['a', 'b'], ['a', 'b'], ['c', 'd']]
>>> evaluator.evaluate_single_no_special_case(target, prediction)
0.5  # only ['c', 'd'] is the same in both tables

>>> evaluator = TupleConstraintTag()
>>> target = [['a', 'b'], ['c', 'd']]
>>> prediction = [['a', 'b'], ['a', 'b'], ['c', 'd'], ['c', 'd']]
>>> evaluator.evaluate_single_no_special_case(target, prediction)
0.0

Source code in qatch/metrics/tuple_constraint_tag.py

def evaluate_single_no_special_case(self,
                                    target: list[list],
                                    prediction: list[list]
                                    ) -> float:
    """
    Evaluates the ratio between the cardinality of the target tuples and the prediction.
    Returns a score between 0 and 1. It is 1 if the schema, the cardinality and the cell values are equal.

    Args:
        target (list[list]): Target table to be compared with the prediction table.
        prediction (list[list]): Prediction table to be compared with the target table.

    Returns:
        float: Score between [0, 1].
            - 0 indicates NONE of the schema/cardinality/cell_values  are the same in prediction.
            - 1 indicates the schema, the cardinality and the cell values of
                the prediction tuples are equal to the target ones.

    Examples:
        >>> evaluator = TupleConstraintTag()
        >>> target = [['a', 'b'], ['c', 'd']]
        >>> prediction = [['a', 'b'], ['c', 'd']]
        >>> evaluator.evaluate_single_no_special_case(target, prediction)
        1.0

        >>> evaluator = TupleConstraintTag()
        >>> target = [['a', 'b'], ['c', 'd']]
        >>> prediction = [['a', 'b'], ['a', 'b'], ['c', 'd']]
        >>> evaluator.evaluate_single_no_special_case(target, prediction)
        0.5  # only ['c', 'd'] is the same in both tables

        >>> evaluator = TupleConstraintTag()
        >>> target = [['a', 'b'], ['c', 'd']]
        >>> prediction = [['a', 'b'], ['a', 'b'], ['c', 'd'], ['c', 'd']]
        >>> evaluator.evaluate_single_no_special_case(target, prediction)
        0.0
    """
    target = map(sorted, target)
    prediction = map(sorted, prediction)

    target = map(tuple, target)
    prediction = map(tuple, prediction)

    count_targ_dict = Counter(target)
    count_pred_dict = Counter(prediction)

    cardinality = [count_pred_dict[key] == count for key, count in count_targ_dict.items()]

    return round(sum(cardinality) / len(cardinality), 3)

`TupleOrderTag`

Bases: AbstractMetric

Source code in qatch/metrics/tuple_order_tag.py

class TupleOrderTag(AbstractMetric):
    def evaluate_single_no_special_case(self,
                                        target: list[list],
                                        prediction: list[list]) -> float:
        """
        Evaluates the similarity in tuple order between the target and prediction.
        The score is based on the Spearman rank correlation coefficient normalized between 0 and 1.
        This metric ONLY checks whether the order of the tuples is the same in the target and prediction.
        Therefore, the elements that are in predictions but nor in target are ignored (and viceversa).

        Args:
            target (list[list]): Target table to be compared with the prediction table.
            prediction (list[list]): Prediction table to be compared with the target table.

        Returns:
            float: Score between [-1, 1].
            - 1 indicates that the order of rows in prediction is the same as in the target.
            - 0.5 indicates that there is no correlation between the two lists.
            - 0 indicates the order of rows in prediction is opposite to the target.

        Examples:
            >>> evaluator = TupleOrderTag()
            >>>  target = [['a', 'b'], ['c', 'd']]
            >>>  prediction = [['c', 'd'], ['a', 'b']]
            >>> evaluator.evaluate(target, prediction)
            0.0

            >>> evaluator = TupleOrderTag()
            >>>  target = [['apple', 'orange'], ['pear']]
            >>>  prediction = [['pear'], ['apple', 'orange']]
            >>> evaluator.evaluate(target, prediction)
            0.0

            >>> evaluator = TupleOrderTag()
            >>>  target = [['apple', 'orange'], ['pear']]
            >>>  prediction = [['pear']]
            >>> evaluator.evaluate(target, prediction)
            1.0
        """

        # take only prediction that are in target without duplicates
        # MAINTAINING the order
        new_pred = []
        [new_pred.append(pred) for pred in prediction
         if pred in target and pred not in new_pred]
        # same for target
        new_target = []
        [new_target.append(tar) for tar in target
         if tar in prediction and tar not in new_target]

        if len(new_target) == 0:

            rho = 0.0
        else:
            target_ranks = [i for i in range(len(new_target))]
            pred_ranks = [new_target.index(row) for row in new_pred]

            diff_rank_squared = [(tar - pred) ** 2
                                 for tar, pred in zip(target_ranks, pred_ranks)]

            sum_diff_rank_squared = sum(diff_rank_squared)

            n = len(new_target) if len(new_target) > 1 else 2
            rho = 1 - 6 * sum_diff_rank_squared / (n * (n ** 2 - 1))

        return self.normalize(round(rho, 3))

    @staticmethod
    def normalize(data: float):
        data = [-1, data, 1]
        data = (data - np.min(data)) / (np.max(data) - np.min(data))
        return data[1]

`evaluate_single_no_special_case(target, prediction)`

Evaluates the similarity in tuple order between the target and prediction. The score is based on the Spearman rank correlation coefficient normalized between 0 and 1. This metric ONLY checks whether the order of the tuples is the same in the target and prediction. Therefore, the elements that are in predictions but nor in target are ignored (and viceversa).

Parameters:

Name	Type	Description	Default
`target`	`list[list]`	Target table to be compared with the prediction table.	required
`prediction`	`list[list]`	Prediction table to be compared with the target table.	required

Returns:

Name	Type	Description
`float`	`float`	Score between [-1, 1].
	`float`	1 indicates that the order of rows in prediction is the same as in the target.
	`float`	0.5 indicates that there is no correlation between the two lists.
	`float`	0 indicates the order of rows in prediction is opposite to the target.

Examples:

>>> evaluator = TupleOrderTag()
>>>  target = [['a', 'b'], ['c', 'd']]
>>>  prediction = [['c', 'd'], ['a', 'b']]
>>> evaluator.evaluate(target, prediction)
0.0

>>> evaluator = TupleOrderTag()
>>>  target = [['apple', 'orange'], ['pear']]
>>>  prediction = [['pear'], ['apple', 'orange']]
>>> evaluator.evaluate(target, prediction)
0.0

>>> evaluator = TupleOrderTag()
>>>  target = [['apple', 'orange'], ['pear']]
>>>  prediction = [['pear']]
>>> evaluator.evaluate(target, prediction)
1.0

Source code in qatch/metrics/tuple_order_tag.py

def evaluate_single_no_special_case(self,
                                    target: list[list],
                                    prediction: list[list]) -> float:
    """
    Evaluates the similarity in tuple order between the target and prediction.
    The score is based on the Spearman rank correlation coefficient normalized between 0 and 1.
    This metric ONLY checks whether the order of the tuples is the same in the target and prediction.
    Therefore, the elements that are in predictions but nor in target are ignored (and viceversa).

    Args:
        target (list[list]): Target table to be compared with the prediction table.
        prediction (list[list]): Prediction table to be compared with the target table.

    Returns:
        float: Score between [-1, 1].
        - 1 indicates that the order of rows in prediction is the same as in the target.
        - 0.5 indicates that there is no correlation between the two lists.
        - 0 indicates the order of rows in prediction is opposite to the target.

    Examples:
        >>> evaluator = TupleOrderTag()
        >>>  target = [['a', 'b'], ['c', 'd']]
        >>>  prediction = [['c', 'd'], ['a', 'b']]
        >>> evaluator.evaluate(target, prediction)
        0.0

        >>> evaluator = TupleOrderTag()
        >>>  target = [['apple', 'orange'], ['pear']]
        >>>  prediction = [['pear'], ['apple', 'orange']]
        >>> evaluator.evaluate(target, prediction)
        0.0

        >>> evaluator = TupleOrderTag()
        >>>  target = [['apple', 'orange'], ['pear']]
        >>>  prediction = [['pear']]
        >>> evaluator.evaluate(target, prediction)
        1.0
    """

    # take only prediction that are in target without duplicates
    # MAINTAINING the order
    new_pred = []
    [new_pred.append(pred) for pred in prediction
     if pred in target and pred not in new_pred]
    # same for target
    new_target = []
    [new_target.append(tar) for tar in target
     if tar in prediction and tar not in new_target]

    if len(new_target) == 0:

        rho = 0.0
    else:
        target_ranks = [i for i in range(len(new_target))]
        pred_ranks = [new_target.index(row) for row in new_pred]

        diff_rank_squared = [(tar - pred) ** 2
                             for tar, pred in zip(target_ranks, pred_ranks)]

        sum_diff_rank_squared = sum(diff_rank_squared)

        n = len(new_target) if len(new_target) > 1 else 2
        rho = 1 - 6 * sum_diff_rank_squared / (n * (n ** 2 - 1))

    return self.normalize(round(rho, 3))

QATCH Evaluate

MetricEvaluator

are_cleaned_sql_identical(target, prediction) staticmethod

evaluate_single_test_QA(test, prediction_col_name, target_col_name)

evaluate_single_test_SP(test, prediction_col_name, target_col_name)

evaluate_with_df(df, prediction_col_name, task, target_col_name='query', keep_target=False)

CellPrecisionTag

evaluate_single_no_special_case(target, prediction)

CellRecallTag

evaluate_single_no_special_case(target, prediction)

TupleCardinalityTag

evaluate_single_no_special_case(target, prediction)

TupleConstraintTag

evaluate_single_no_special_case(target, prediction)

TupleOrderTag

evaluate_single_no_special_case(target, prediction)

`MetricEvaluator`

`are_cleaned_sql_identical(target, prediction)` `staticmethod`

`evaluate_single_test_QA(test, prediction_col_name, target_col_name)`

`evaluate_single_test_SP(test, prediction_col_name, target_col_name)`

`evaluate_with_df(df, prediction_col_name, task, target_col_name='query', keep_target=False)`

`CellPrecisionTag`

`evaluate_single_no_special_case(target, prediction)`

`CellRecallTag`

`evaluate_single_no_special_case(target, prediction)`

`TupleCardinalityTag`

`evaluate_single_no_special_case(target, prediction)`

`TupleConstraintTag`

`evaluate_single_no_special_case(target, prediction)`

`TupleOrderTag`

`evaluate_single_no_special_case(target, prediction)`