medmodels.matching.metrics

Metrics for comparing vectors in the context of matching classes.

absolute_metric

def absolute_metric(vector1: NDArray[np.float64],
                    vector2: NDArray[np.float64]) -> float

Calculates the Manhattan distance (L1 norm) between two vectors, providing a measure of the absolute difference between them. This distance is the sum of the absolute differences between each corresponding pair of elements in the two vectors.

The calculation is based on the formula:

D(x, y) = ||x - y||_1 = \sum_{i=1}^n |x_i - y_i| for x, y \in \mathbb{R}^n

Arguments:

vector1 NDArray[np.float64] - The first vector to be compared.
vector2 NDArray[np.float64] - The second vector to be compared.

Returns:

float - The Manhattan distance between the two vectors.
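The formula above can be illustrated with a minimal NumPy sketch. The name `absolute_metric_sketch` is hypothetical and the body is an assumption about how such a metric could be written, not the library's actual implementation.

```python
import numpy as np
from numpy.typing import NDArray


def absolute_metric_sketch(
    vector1: NDArray[np.float64], vector2: NDArray[np.float64]
) -> float:
    """Hypothetical stand-in: sum of absolute element-wise differences."""
    return float(np.sum(np.abs(vector1 - vector2)))


x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])

# |1 - 2| + |2 - 0| + |3 - 3.5| = 1 + 2 + 0.5
distance = absolute_metric_sketch(x, y)
print(distance)  # 3.5
```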

exact_metric

def exact_metric(vector1: NDArray[np.float64],
                 vector2: NDArray[np.float64]) -> float

Computes the exact metric for matching, which is particularly applicable for discrete or categorical covariates rather than continuous ones. This metric returns 0 if the two vectors are exactly identical, and infinity otherwise, making it suitable for scenarios where exact matches are necessary.

The exact metric is defined as:

D(x, y) = \begin{cases} 0 & \text{if } x = y, \\ \infty & \text{otherwise}. \end{cases}

Arguments:

vector1 NDArray[np.float64] - The first vector to be compared.
vector2 NDArray[np.float64] - The second vector to be compared.

Returns:

float - 0 if the vectors are equal, infinity if they are not.

Notes:

This function is designed for exactly two input vectors.
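A minimal sketch of the definition above, assuming element-wise equality of two NumPy arrays is the intended notion of "identical". The name `exact_metric_sketch` is hypothetical and this is not necessarily how the library implements the metric.

```python
import numpy as np
from numpy.typing import NDArray


def exact_metric_sketch(
    vector1: NDArray[np.float64], vector2: NDArray[np.float64]
) -> float:
    """Hypothetical stand-in: 0 for identical vectors, infinity otherwise."""
    return 0.0 if np.array_equal(vector1, vector2) else float("inf")


a = np.array([1.0, 0.0, 2.0])
b = np.array([1.0, 0.0, 2.0])
c = np.array([1.0, 1.0, 2.0])

same = exact_metric_sketch(a, b)       # identical vectors
different = exact_metric_sketch(a, c)  # differ in the second element
```

Because any mismatch yields infinity, a candidate that differs in even one covariate can never be selected as the nearest match.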

mahalanobis_metric

def mahalanobis_metric(vector1: NDArray[np.float64],
                       vector2: NDArray[np.float64],
                       inv_cov: NDArray[np.float64]) -> float

Computes the Mahalanobis distance between two vectors for matching. This metric is best suited to continuous covariates.

D(x, y) = \sqrt{(x-y)^T S^{-1} (x-y)} \\ \text{where } S \text{ is the covariance matrix of the whole distribution}

The covariance matrix and its inverse are computed at most once per item to be paired, so they are passed in as an argument rather than computed inside this function. This avoids repeating the computation for every comparison.

When matching without replacement, each matched item is removed from the set, so the covariance matrix of the whole distribution and its inverse must be recomputed for every entry. This can be time-consuming for large data sets, especially those with many features.

Arguments:

vector1 NDArray[np.float64] - The first vector to be compared.
vector2 NDArray[np.float64] - The second vector to be compared. Must have the same shape as vector1 and belong to the same distribution.
inv_cov NDArray[np.float64] - The inverse of the covariance matrix of the whole distribution (data set).

Returns:

float - The Mahalanobis distance between the two vectors.
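The sketch below illustrates the formula and the idea of computing the inverse covariance once, outside the metric. It assumes 1-D vectors and a data matrix with one row per observation. The name `mahalanobis_metric_sketch` and the toy data are hypothetical, not taken from the library.

```python
import numpy as np
from numpy.typing import NDArray


def mahalanobis_metric_sketch(
    vector1: NDArray[np.float64],
    vector2: NDArray[np.float64],
    inv_cov: NDArray[np.float64],
) -> float:
    """Hypothetical stand-in: sqrt((x - y)^T S^{-1} (x - y))."""
    diff = vector1 - vector2
    return float(np.sqrt(diff @ inv_cov @ diff))


# Toy data set: four observations (rows), two features (columns).
data = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])

# Computed once for the whole data set, then reused for every comparison.
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
distance = mahalanobis_metric_sketch(data[0], data[2], inv_cov)

# With the identity as inverse covariance, the metric reduces to the
# Euclidean distance: sqrt(3^2 + 4^2) = 5.
euclidean = mahalanobis_metric_sketch(
    np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.eye(2)
)
print(euclidean)  # 5.0
```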

medmodels.matching.evaluation

calculate_relative_diff

def calculate_relative_diff(row: pd.Series[float]) -> float

Calculates the absolute relative difference for a single feature, expressed as a percentage of the control's mean. To avoid division by zero, the function returns the absolute difference multiplied by 100 when the control mean is zero.

Arguments:

row pd.Series[float] - A Series object representing a row from the DataFrame of means, containing 'control_mean' and 'treated_mean' for a feature.

Returns:

float - The absolute relative difference in means, as a percentage.
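A minimal sketch of the per-feature calculation, assuming the zero-control case returns the absolute difference multiplied by 100, as described for `relative_diff_in_means`. The name `relative_diff_sketch` is hypothetical and the body is not necessarily the library's implementation.

```python
import pandas as pd


def relative_diff_sketch(row: pd.Series) -> float:
    """Hypothetical stand-in for the per-feature relative difference."""
    control = row["control_mean"]
    treated = row["treated_mean"]
    if control == 0:
        # Assumed fallback: no division, absolute difference times 100.
        return float(abs(treated - control) * 100)
    return float(abs((treated - control) / control) * 100)


regular = relative_diff_sketch(
    pd.Series({"control_mean": 4.0, "treated_mean": 7.0})
)  # |7 - 4| / 4 * 100 = 75.0

zero_control = relative_diff_sketch(
    pd.Series({"control_mean": 0.0, "treated_mean": 0.5})
)  # |0.5 - 0| * 100 = 50.0
```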

relative_diff_in_means

def relative_diff_in_means(control_set: pd.DataFrame,
                           treated_set: pd.DataFrame) -> pd.DataFrame

Calculates the absolute relative mean difference for each feature between control and treated sets, expressed as a percentage of the control set's mean. This measure provides an understanding of how much each feature's average value changes from the control to the treated group relative to the control.

Arguments:

control_set pd.DataFrame - DataFrame representing the control group.
treated_set pd.DataFrame - DataFrame representing the treated group.

Returns:

pd.DataFrame - A DataFrame containing the mean values of the control and treated sets for all features and the absolute relative difference in means, expressed as a percentage.

Internally, the function computes the relative difference for each feature. When the control mean of a feature is zero, it falls back to the absolute difference multiplied by 100. The result shows the percentage change in each feature's mean between the control and treated groups.
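The following sketch shows one way such a table could be assembled with pandas. The function name, the row labels (`control_mean`, `treated_mean`, `abs_relative_diff`), and the layout with features as columns are assumptions made for illustration. The library's actual labels and layout may differ.

```python
import pandas as pd


def relative_diff_in_means_sketch(
    control_set: pd.DataFrame, treated_set: pd.DataFrame
) -> pd.DataFrame:
    """Hypothetical stand-in: per-feature means and relative difference."""
    means = pd.DataFrame(
        {
            "control_mean": control_set.mean(),
            "treated_mean": treated_set.mean(),
        }
    )

    def _relative_diff(row: pd.Series) -> float:
        if row["control_mean"] == 0:
            # Assumed fallback for a zero control mean.
            return abs(row["treated_mean"] - row["control_mean"]) * 100
        return (
            abs((row["treated_mean"] - row["control_mean"]) / row["control_mean"])
            * 100
        )

    means["abs_relative_diff"] = means.apply(_relative_diff, axis=1)
    # Transpose so that features are columns and the difference is the last row.
    return means.T


control = pd.DataFrame({"a": [2.0, 6.0], "b": [1.0, 3.0]})  # means: 4.0, 2.0
treated = pd.DataFrame({"a": [6.0, 8.0], "b": [5.0, 9.0]})  # means: 7.0, 7.0
result = relative_diff_in_means_sketch(control, treated)
```

With this toy data the last row holds 75.0 for feature `a` and 250.0 for feature `b`.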

average_value_over_features

def average_value_over_features(df: pd.DataFrame) -> float

Calculates the average of the values in the last row of a DataFrame. This function is particularly useful for aggregating measures like differences or percentages across multiple features, providing a single summary statistic.

Arguments:

df pd.DataFrame - The DataFrame on which the calculation is to be performed.

Returns:

float - The average value of the last row across all columns.

Example:

Given a DataFrame with the last row containing differences in percentages between treated and control means across features 'a' and 'b', e.g., 75.0% for 'a' and 250.0% for 'b', this function will return the average difference, which is (75.0 + 250.0) / 2 = 162.5.
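The example above can be reproduced with a short pandas sketch. The name `average_last_row_sketch` and the row labels are hypothetical. Only the "mean of the last row" behaviour comes from the description.

```python
import pandas as pd


def average_last_row_sketch(df: pd.DataFrame) -> float:
    """Hypothetical stand-in: mean of the last row across all columns."""
    return float(df.iloc[-1].mean())


df = pd.DataFrame(
    {"a": [4.0, 7.0, 75.0], "b": [2.0, 7.0, 250.0]},
    index=["control_mean", "treated_mean", "abs_relative_diff"],
)

average = average_last_row_sketch(df)
print(average)  # (75.0 + 250.0) / 2 = 162.5
```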

average_abs_relative_diff

def average_abs_relative_diff(
    control_set: pd.DataFrame,
    treated_set: pd.DataFrame,
    covariates: Optional[Union[List[str], pd.Index[str]]] = None
) -> Tuple[float, pd.DataFrame]

Calculates the average absolute relative difference in means over specified covariates between control and treated sets. If covariates are not specified, the calculation includes all features.

This function is designed to assess the impact of a treatment across multiple features by computing the mean of absolute relative differences. It returns both a summary metric and a detailed DataFrame for further analysis.

Arguments:

control_set pd.DataFrame - DataFrame for the control group.
treated_set pd.DataFrame - DataFrame for the treated group.
covariates Optional[Union[List[str], pd.Index[str]]] - Covariate names to include. If None, all features are considered.

Returns:

Tuple[float, pd.DataFrame] - A tuple containing the average absolute relative difference as a float and a DataFrame with detailed mean values and absolute relative differences for all features.

The detailed DataFrame includes means for both control and treated sets and the absolute relative difference for each feature.
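As a rough end-to-end illustration, the sketch below restricts both sets to the chosen covariates, computes the per-feature differences, and averages them. All names and row labels are hypothetical. Taking the default covariates from the control set's columns and the zero-control fallback are assumptions, not confirmed library behaviour.

```python
from typing import List, Optional, Tuple

import pandas as pd


def average_abs_relative_diff_sketch(
    control_set: pd.DataFrame,
    treated_set: pd.DataFrame,
    covariates: Optional[List[str]] = None,
) -> Tuple[float, pd.DataFrame]:
    """Hypothetical stand-in returning (summary value, detail table)."""
    if covariates is None:
        # Assumption: "all features" means every column of the control set.
        covariates = list(control_set.columns)

    control_mean = control_set[covariates].mean()
    treated_mean = treated_set[covariates].mean()

    # Divide by |control mean|, or by 1 where the control mean is zero
    # (assumed fallback: absolute difference times 100).
    denominator = control_mean.abs().where(control_mean != 0, 1.0)
    relative = (treated_mean - control_mean).abs() / denominator * 100

    details = pd.DataFrame(
        {
            "control_mean": control_mean,
            "treated_mean": treated_mean,
            "abs_relative_diff": relative,
        }
    ).T
    return float(relative.mean()), details


control = pd.DataFrame({"a": [2.0, 6.0], "b": [1.0, 3.0]})
treated = pd.DataFrame({"a": [6.0, 8.0], "b": [5.0, 9.0]})

average_all, details_all = average_abs_relative_diff_sketch(control, treated)
average_a, details_a = average_abs_relative_diff_sketch(control, treated, ["a"])
print(average_all, average_a)  # 162.5 75.0
```

A lower summary value indicates that the control and treated groups are better balanced on the selected covariates.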

medmodels.matching.matching

Matching Objects

class Matching(metaclass=ABCMeta)

The base class for matching.

Covariates