medmodels.matching.metrics
Metrics for comparing vectors in the context of matching classes.
absolute_metric
def absolute_metric(vector1: NDArray[np.float64],
vector2: NDArray[np.float64]) -> float
Calculates the Manhattan distance (L1 norm) between two vectors, providing a measure of the absolute difference between them. This distance is the sum of the absolute differences between each corresponding pair of elements in the two vectors.
The calculation is based on the formula:
$$d(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{n} |a_i - b_i|$$
Arguments:
vector1
NDArray[np.float64] - The first vector to be compared.
vector2
NDArray[np.float64] - The second vector to be compared.
Returns:
float
- The Manhattan distance between the two vectors.
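As a rough illustration of the Manhattan distance described above, here is a minimal NumPy sketch. The name `manhattan_sketch` is hypothetical and this is not necessarily how `absolute_metric` is implemented; it only shows the sum of absolute element-wise differences.

```python
import numpy as np


def manhattan_sketch(vector1: np.ndarray, vector2: np.ndarray) -> float:
    """Sum of absolute element-wise differences (L1 distance).

    Hypothetical stand-in for illustration, not the library function.
    """
    return float(np.sum(np.abs(vector1 - vector2)))


a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.5])
distance = manhattan_sketch(a, b)  # |1 - 2| + |2 - 0| + |3 - 3.5| = 3.5
```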
exact_metric
def exact_metric(vector1: NDArray[np.float64],
vector2: NDArray[np.float64]) -> float
Computes the exact metric for matching, which is particularly applicable for discrete or categorical covariates rather than continuous ones. This metric returns 0 if the two vectors are exactly identical, and infinity otherwise, making it suitable for scenarios where exact matches are necessary.
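The exact metric is defined as:
$$d(\mathbf{a}, \mathbf{b}) = \begin{cases} 0 & \text{if } \mathbf{a} = \mathbf{b} \\ \infty & \text{otherwise} \end{cases}$$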
Arguments:
vector1
NDArray[np.float64] - The first vector to be compared.
vector2
NDArray[np.float64] - The second vector to be compared.
Returns:
float
- 0 if the vectors are equal, infinity if they are not.
Notes:
This function is designed for exactly two input vectors.
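The zero-or-infinity behaviour can be sketched in a few lines. The name `exact_sketch` is hypothetical, and the use of `np.array_equal` for the comparison is an assumption about how equality might be tested, not the library's confirmed implementation.

```python
import numpy as np


def exact_sketch(vector1: np.ndarray, vector2: np.ndarray) -> float:
    """Return 0.0 for identical vectors and infinity otherwise.

    Hypothetical stand-in for illustration, not the library function.
    """
    return 0.0 if np.array_equal(vector1, vector2) else float("inf")


same = exact_sketch(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
different = exact_sketch(np.array([1.0, 0.0]), np.array([1.0, 1.0]))
```

Because any mismatch yields infinity, a nearest-neighbour search using such a metric can only ever pair items whose covariates agree in every position.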
mahalanobis_metric
def mahalanobis_metric(vector1: NDArray[np.float64],
vector2: NDArray[np.float64],
inv_cov: NDArray[np.float64]) -> float
Returns the Mahalanobis metric for matching. It is best suited to continuous covariates.
The covariance matrix and its inverse are calculated at most once per item to be paired, so the inverse is passed in as an argument rather than computed inside the function, which avoids repeated computation.
When matching without replacement, each matched item is removed from the set, so the covariance matrix of the whole distribution and its inverse must be recalculated for every entry. This can be time-consuming for large data sets, especially those with many features.
Arguments:
vector1
NDArray[np.float64] - The first vector to be compared.
vector2
NDArray[np.float64] - The second vector to be compared. Must have the same shape as vector1 and belong to the same distribution.
inv_cov
NDArray[np.float64] - The inverse of the covariance matrix of the whole distribution (data set).
Returns:
float
- The Mahalanobis distance between the two vectors.
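To make the role of `inv_cov` concrete, the following sketch computes the inverse covariance once and reuses it. It assumes the standard definition of the Mahalanobis distance, the square root of `(x - y)^T S^{-1} (x - y)`; the name `mahalanobis_sketch` and the sample data are hypothetical, and the library function may differ in detail.

```python
import numpy as np


def mahalanobis_sketch(
    vector1: np.ndarray, vector2: np.ndarray, inv_cov: np.ndarray
) -> float:
    """Standard Mahalanobis distance for a precomputed inverse covariance.

    Hypothetical stand-in for illustration, not the library function.
    """
    diff = vector1 - vector2
    return float(np.sqrt(diff @ inv_cov @ diff))


# Illustrative data set: rows are items, columns are features.
data = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])

# Compute the inverse covariance once, outside the metric, and reuse it.
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
d = mahalanobis_sketch(data[0], data[1], inv_cov)
```

With the identity matrix as `inv_cov`, this reduces to the Euclidean distance, which is a convenient sanity check.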
medmodels.matching.evaluation
calculate_relative_diff
def calculate_relative_diff(row: pd.Series[float]) -> float
Calculates the absolute relative difference for a single feature, expressed as a percentage of the control's mean. Handles division by zero by returning the absolute difference times 100 when the control mean is zero.
Arguments:
row
pd.Series[float] - A Series object representing a row from the DataFrame of means, containing 'control_mean' and 'treated_mean' for a feature.
Returns:
float
- The absolute relative difference in means, as a percentage.
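A minimal sketch of this per-feature calculation follows. The name `relative_diff_sketch` is hypothetical, and the zero-control fallback (absolute difference times 100) is an assumption taken from the description of `relative_diff_in_means` rather than from the library source.

```python
import pandas as pd


def relative_diff_sketch(row: pd.Series) -> float:
    """Absolute relative difference in means, as a percentage.

    Hypothetical stand-in for illustration, not the library function.
    """
    diff = abs(row["treated_mean"] - row["control_mean"])
    if row["control_mean"] == 0:
        # Assumed fallback: no control mean to divide by.
        return float(diff * 100)
    return float(diff / abs(row["control_mean"]) * 100)


row = pd.Series({"control_mean": 4.0, "treated_mean": 5.0})
result = relative_diff_sketch(row)  # |5 - 4| / 4 * 100 = 25.0

zero_row = pd.Series({"control_mean": 0.0, "treated_mean": 0.5})
zero_result = relative_diff_sketch(zero_row)  # |0.5 - 0| * 100 = 50.0
```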
relative_diff_in_means
def relative_diff_in_means(control_set: pd.DataFrame,
treated_set: pd.DataFrame) -> pd.DataFrame
Calculates the absolute relative mean difference for each feature between control and treated sets, expressed as a percentage of the control set's mean. This measure provides an understanding of how much each feature's average value changes from the control to the treated group relative to the control.
Arguments:
control_set
pd.DataFrame - DataFrame representing the control group.
treated_set
pd.DataFrame - DataFrame representing the treated group.
Returns:
pd.DataFrame
- A DataFrame containing the mean values of the control and treated sets for all features and the absolute relative difference in means, expressed as a percentage.
The function internally computes the relative difference for each feature, handling cases where the control mean is zero by simply calculating the absolute difference times 100. It provides insights into the percentage change in feature means due to treatment.
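The table described above could be assembled along the following lines. This is a sketch under stated assumptions: the function name `relative_diff_in_means_sketch`, the row labels, and the layout (features as columns, the difference as the last row) are hypothetical choices for illustration and may not match the library's actual output.

```python
import pandas as pd


def relative_diff_in_means_sketch(
    control_set: pd.DataFrame, treated_set: pd.DataFrame
) -> pd.DataFrame:
    """Per-feature means plus the absolute relative difference in percent.

    Hypothetical stand-in for illustration, not the library function.
    """
    means = pd.DataFrame(
        {
            "control_mean": control_set.mean(),
            "treated_mean": treated_set.mean(),
        }
    )
    abs_diff = (means["treated_mean"] - means["control_mean"]).abs()
    # Where the control mean is zero, divide by 1 so that the result is the
    # absolute difference times 100.
    denominator = means["control_mean"].abs().where(means["control_mean"] != 0, 1.0)
    means["abs_relative_diff"] = abs_diff / denominator * 100
    # Transpose so that features are columns and the difference is the last row.
    return means.T


control = pd.DataFrame({"a": [3.0, 5.0], "b": [1.0, 3.0], "c": [0.0, 0.0]})
treated = pd.DataFrame({"a": [6.0, 8.0], "b": [6.0, 8.0], "c": [0.25, 0.75]})
table = relative_diff_in_means_sketch(control, treated)
```

Here feature `a` moves from a mean of 4 to 7 (75%), feature `b` from 2 to 7 (250%), and feature `c` has a zero control mean, so its entry is the absolute difference 0.5 times 100.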
average_value_over_features
def average_value_over_features(df: pd.DataFrame) -> float
Calculates the average of the values in the last row of a DataFrame. This function is particularly useful for aggregating measures like differences or percentages across multiple features, providing a single summary statistic.
Arguments:
df
pd.DataFrame - The DataFrame on which the calculation is to be performed.
Returns:
float
- The average value of the last row across all columns.
Example:
Given a DataFrame with the last row containing differences in percentages between treated and control means across features 'a' and 'b', e.g., 75.0% for 'a' and 250.0% for 'b', this function will return the average difference, which is (75.0 + 250.0) / 2 = 162.5.
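The example above can be reproduced directly with pandas. The row labels and the mean values in the first two rows are hypothetical; only the last row matters for the calculation.

```python
import pandas as pd

# Last row holds the percentage differences from the example: 75.0 and 250.0.
df = pd.DataFrame(
    {"a": [4.0, 7.0, 75.0], "b": [2.0, 7.0, 250.0]},
    index=["control_mean", "treated_mean", "abs_relative_diff"],
)

# Average of the last row across all columns.
average = float(df.iloc[-1].mean())  # (75.0 + 250.0) / 2 = 162.5
```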
average_abs_relative_diff
def average_abs_relative_diff(
control_set: pd.DataFrame,
treated_set: pd.DataFrame,
covariates: Optional[Union[List[str], pd.Index[str]]] = None
) -> Tuple[float, pd.DataFrame]
Calculates the average absolute relative difference in means over specified covariates between control and treated sets. If covariates are not specified, the calculation includes all features.
This function is designed to assess the impact of a treatment across multiple features by computing the mean of absolute relative differences. It returns both a summary metric and a detailed DataFrame for further analysis.
Arguments:
control_set
pd.DataFrame - DataFrame for the control group.
treated_set
pd.DataFrame - DataFrame for the treated group.
covariates
Optional[Union[List[str], pd.Index[str]]], optional - List of covariate names to include. If None, considers all features.
Returns:
Tuple[float, pd.DataFrame]
- A tuple containing the average absolute relative difference as a float and a DataFrame with detailed mean values and absolute relative differences for all features.
The detailed DataFrame includes means for both control and treated sets and the absolute relative difference for each feature.
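The sketch below shows how covariate selection and the two return values might fit together. It is an assumption-laden illustration: the name `average_abs_relative_diff_sketch` and the row labels are hypothetical, and restricting the detailed table to the selected covariates is a guess, since the description above does not say whether unselected features appear in it.

```python
from typing import List, Optional, Tuple

import pandas as pd


def average_abs_relative_diff_sketch(
    control_set: pd.DataFrame,
    treated_set: pd.DataFrame,
    covariates: Optional[List[str]] = None,
) -> Tuple[float, pd.DataFrame]:
    """Average absolute relative difference plus the per-feature table.

    Hypothetical stand-in for illustration, not the library function.
    """
    if covariates is None:
        covariates = list(control_set.columns)  # default: all features
    control_mean = control_set[covariates].mean()
    treated_mean = treated_set[covariates].mean()
    abs_diff = (treated_mean - control_mean).abs()
    # Assumed zero-control fallback: absolute difference times 100.
    denominator = control_mean.abs().where(control_mean != 0, 1.0)
    table = pd.DataFrame(
        {
            "control_mean": control_mean,
            "treated_mean": treated_mean,
            "abs_relative_diff": abs_diff / denominator * 100,
        }
    ).T
    # Summary metric: mean of the last row (the percentage differences).
    return float(table.iloc[-1].mean()), table


control = pd.DataFrame({"a": [3.0, 5.0], "b": [1.0, 3.0], "c": [9.0, 9.0]})
treated = pd.DataFrame({"a": [6.0, 8.0], "b": [6.0, 8.0], "c": [1.0, 1.0]})
summary, details = average_abs_relative_diff_sketch(control, treated, ["a", "b"])
```

With covariates `a` and `b` only, the per-feature differences are 75% and 250%, so the summary is 162.5 and column `c` does not influence it.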
medmodels.matching.matching
Matching Objects
class Matching(metaclass=ABCMeta)
The abstract base class for matching.