`distance.Rd`

Several matching methods require or can involve the distance between treated and control units. Options include the Mahalanobis distance, propensity score distance, or distance between user-supplied values. Propensity scores are also used for common support via the `discard` options and for defining calipers. This page documents the options that can be supplied to the `distance` argument to `matchit()`.

There are four ways to specify the `distance` argument: 1) as the string `"mahalanobis"`, 2) as a string containing the name of a method for estimating propensity scores, 3) as a vector of values whose pairwise differences define the distance between units, or 4) as a distance matrix containing all pairwise differences.

When `distance` is specified as one of the allowed strings (described below) other than `"mahalanobis"`, a propensity score is estimated using the variables in `formula` and the method corresponding to the given argument. This propensity score can be used to compute the distance between units as the absolute difference between the propensity scores of pairs of units. In this respect, the propensity score is more like a "position" measure than a distance measure, since it is the pairwise differences that form the distance rather than the propensity scores themselves. Still, this naming convention is used to reflect their primary purpose without committing to the status of the estimated values as propensity scores, since transformations of the scores are allowed and user-supplied values that are not propensity scores can also be supplied (detailed below). Propensity scores can also be used to create calipers and common support restrictions, whether or not they are used in the actual distance measure used in the matching, if any.
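The "position versus distance" idea can be sketched in base R. This is only an illustration on simulated data, not the code `matchit()` runs internally; the variable names (`treat`, `x1`, `x2`) and the object `ps_dist` are hypothetical.

```r
# Simulated data; all variable names here are hypothetical.
set.seed(123)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$treat <- rbinom(n, 1, plogis(-0.5 + 0.8 * dat$x1 - 0.4 * dat$x2))

# Estimate propensity scores with logistic regression
fit <- glm(treat ~ x1 + x2, data = dat, family = binomial(link = "logit"))
ps <- fitted(fit)

# Each score is a "position"; the distance between two units is the
# absolute difference between their scores.
ps_t <- ps[dat$treat == 1]
ps_c <- ps[dat$treat == 0]
ps_dist <- abs(outer(ps_t, ps_c, "-"))  # treated units in rows, controls in columns

dim(ps_dist)
```

Each entry of `ps_dist` is the propensity score distance between one treated unit and one control unit; the scores themselves only locate the units on a common scale.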

In addition to the `distance` argument, two other arguments can be specified that relate to the estimation and manipulation of the propensity scores. The `link` argument allows for different links to be used in models that require them such as generalized linear models, for which the logit and probit links are allowed, among others. In addition to specifying the link, the `link` argument can be used to specify whether the propensity score or the linearized version of the propensity score should be used; by specifying `link = "linear.{link}"`, the linearized version will be used.
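As a rough base-R illustration of what the linearized score is, the sketch below fits a probit model with `glm()` and compares the predicted probabilities with the linear predictor. The simulated data and the names `ps` and `lin_ps` are hypothetical; in `matchit()` the corresponding request would be `link = "linear.probit"`.

```r
# Simulated data; variable names are hypothetical.
set.seed(42)
n <- 300
dat <- data.frame(x = rnorm(n))
dat$treat <- rbinom(n, 1, plogis(0.3 + dat$x))

fit <- glm(treat ~ x, data = dat, family = binomial(link = "probit"))

ps     <- predict(fit, type = "response")  # propensity scores (probabilities)
lin_ps <- predict(fit, type = "link")      # linear predictor (linearized scores)

# With a probit link, the two are related through the normal quantile function
all.equal(unname(lin_ps), unname(qnorm(ps)))
```

The linear predictor is unbounded, so differences between units with scores near 0 or 1 are stretched relative to differences on the probability scale.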

The `distance.options` argument can also be specified, which should be a list of values passed to the propensity score-estimating function, for example, to choose specific options or tuning parameters for the estimation method. If `formula`, `data`, or `verbose` are not supplied to `distance.options`, the corresponding arguments from `matchit()` will be automatically supplied. See the Examples for demonstrations of the uses of `link` and `distance.options`. When `s.weights` is supplied in the call to `matchit()`, it will automatically be passed to the propensity score-estimating function as the `weights` argument unless otherwise described below.

Below are the allowed options for `distance`:

`"glm"`

The propensity scores are estimated using a generalized linear model (e.g., logistic regression). The `formula` supplied to `matchit()` is passed directly to `glm()`, and `predict.glm()` is used to compute the propensity scores. The `link` argument can be specified as a link function supplied to `binomial()`, e.g., `"logit"`, which is the default. When `link` is prepended by `"linear."`, the linear predictor is used instead of the predicted probabilities. `distance = "glm"` with `link = "logit"` (logistic regression) is the default in `matchit()`.

`"gam"`

The propensity scores are estimated using a generalized additive model. The `formula` supplied to `matchit()` is passed directly to `mgcv::gam()`, and `mgcv::predict.gam()` is used to compute the propensity scores. The `link` argument can be specified as a link function supplied to `binomial()`, e.g., `"logit"`, which is the default. When `link` is prepended by `"linear."`, the linear predictor is used instead of the predicted probabilities. Note that unless the smoothing functions `s()`, `te()`, `ti()`, or `t2()` are used in `formula`, a generalized additive model is identical to a generalized linear model and will estimate the same propensity scores as `glm`. See the documentation for `mgcv::gam()`, `mgcv::formula.gam()`, and `mgcv::gam.models()` for more information on how to specify these models. Also note that the formula returned in the `matchit()` output object will be a simplified version of the supplied formula with smoothing terms removed (but all named variables present).

`"gbm"`

The propensity scores are estimated using a generalized boosted model. The `formula` supplied to `matchit()` is passed directly to `gbm::gbm()`, and `gbm::predict.gbm()` is used to compute the propensity scores. The optimal number of trees is chosen using 5-fold cross-validation by default, and this can be changed by supplying a `method` argument in `distance.options`; see `gbm::gbm.perf()` for details. The `link` argument can be specified as `"linear"` to use the linear predictor instead of the predicted probabilities. No other links are allowed. The tuning parameter defaults differ from `gbm::gbm()`; they are as follows: `n.trees = 1e4`, `interaction.depth = 3`, `shrinkage = .01`, `bag.fraction = 1`, `cv.folds = 5`, `keep.data = FALSE`. These are the same defaults as used in *WeightIt* and *twang*, except for `cv.folds` and `keep.data`. Note this is not the same use of generalized boosted modeling as in *twang*; here, the number of trees is chosen based on cross-validation or out-of-bag error, rather than based on optimizing balance. *twang* should not be cited when using this method to estimate propensity scores.

`"lasso"`, `"ridge"`, `"elasticnet"`

The propensity scores are estimated using a lasso, ridge, or elastic net model, respectively. The `formula` supplied to `matchit()` is processed with `model.matrix()` and passed to `glmnet::cv.glmnet()`, and `glmnet::predict.cv.glmnet()` is used to compute the propensity scores. The `link` argument can be specified as a link function supplied to `binomial()`, e.g., `"logit"`, which is the default. When `link` is prepended by `"linear."`, the linear predictor is used instead of the predicted probabilities. When `link = "log"`, a Poisson model is used. For `distance = "elasticnet"`, the `alpha` argument, which controls how to prioritize the lasso and ridge penalties in the elastic net, is set to .5 by default and can be changed by supplying an argument to `alpha` in `distance.options`. For `"lasso"` and `"ridge"`, `alpha` is set to 1 and 0, respectively, and cannot be changed. The `cv.glmnet()` defaults are used to select the tuning parameters and generate predictions and can be modified using `distance.options`. If the `s` argument is passed to `distance.options`, it will be passed to `predict.cv.glmnet()`. Note that because there is a random component to choosing the tuning parameter, results will vary across runs unless a seed is set.

`"rpart"`

The propensity scores are estimated using a classification tree. The `formula` supplied to `matchit()` is passed directly to `rpart::rpart()`, and `rpart::predict.rpart()` is used to compute the propensity scores. The `link` argument is ignored, and predicted probabilities are always returned as the distance measure.

`"randomforest"`

The propensity scores are estimated using a random forest. The `formula` supplied to `matchit()` is passed directly to `randomForest::randomForest()`, and `randomForest::predict.randomForest()` is used to compute the propensity scores. The `link` argument is ignored, and predicted probabilities are always returned as the distance measure. When `s.weights` is supplied to `matchit()`, it will not be passed to `randomForest` because `randomForest` does not accept weights.

`"nnet"`

The propensity scores are estimated using a single-hidden-layer neural network. The `formula` supplied to `matchit()` is passed directly to `nnet::nnet()`, and `fitted()` is used to compute the propensity scores. The `link` argument is ignored, and predicted probabilities are always returned as the distance measure. An argument to `size` must be supplied to `distance.options` when using `distance = "nnet"`.

`"cbps"`

The propensity scores are estimated using the covariate balancing propensity score (CBPS) algorithm, which is a form of logistic regression where balance constraints are incorporated into a generalized method of moments estimation of the model coefficients. The `formula` supplied to `matchit()` is passed directly to `CBPS::CBPS()`, and `fitted()` is used to compute the propensity scores. The `link` argument can be specified as `"linear"` to use the linear predictor instead of the predicted probabilities. No other links are allowed. The `estimand` argument supplied to `matchit()` will be used to select the appropriate estimand for use in defining the balance constraints, so the `ATT` argument of `CBPS` does not need to be supplied.

`"bart"`

The propensity scores are estimated using Bayesian additive regression trees (BART). The `formula` supplied to `matchit()` is passed directly to `dbarts::bart2()`, and `dbarts::fitted()` is used to compute the propensity scores. The `link` argument can be specified as `"linear"` to use the linear predictor instead of the predicted probabilities. When `s.weights` is supplied to `matchit()`, it will not be passed to `bart2` because the `weights` argument in `bart2` does not correspond to sampling weights.

`"mahalanobis"`

No propensity scores are estimated. Rather than using the propensity score difference as the distance between units, the Mahalanobis distance is used instead. See `mahalanobis()` for details on how it is computed. The Mahalanobis distance is always computed using all the variables in `formula`. With this specification, calipers and common support restrictions cannot be used and the `distance` component of the output object will be empty because no propensity scores are estimated. The `link` and `distance.options` arguments are ignored. See individual methods pages for whether the Mahalanobis distance is allowed and how it is used. Sometimes this setting is just a placeholder to indicate that no propensity score is to be estimated (e.g., with `method = "genetic"`). To perform Mahalanobis distance matching *and* estimate propensity scores to be used for a purpose other than matching, the `mahvars` argument should be used along with a different specification to `distance`. See the individual matching method pages for details on how to use `mahvars`.
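For the `"mahalanobis"` option, the following base-R sketch shows how a treated-by-control matrix of Mahalanobis distances could be built with `mahalanobis()`. It is an illustration only: the toy covariates are made up, and using the full-sample covariance matrix (`S`) is an assumption of this sketch, since this page does not state which covariance matrix `matchit()` uses.

```r
# Toy covariates; the values and names are hypothetical.
set.seed(1)
X <- cbind(age = rnorm(6, 30, 8), educ = rnorm(6, 11, 2))
treat <- c(1, 1, 0, 0, 0, 0)

# Assumption of this sketch: covariance estimated from the full sample
S <- cov(X)

# mahalanobis() returns *squared* distances from each row of x to `center`,
# so take the square root to get the distance itself.
maha_dist <- t(sapply(which(treat == 1), function(i) {
  sqrt(mahalanobis(X[treat == 0, , drop = FALSE], center = X[i, ], cov = S))
}))
dimnames(maha_dist) <- list(which(treat == 1), which(treat == 0))

maha_dist  # rows: treated units, columns: control units
```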

`distance` can also be supplied as a numeric vector whose values will be taken to function like propensity scores; their pairwise differences will define the distance between units. This might be useful for supplying propensity scores computed outside `matchit()` or resupplying `matchit()` with previously estimated propensity scores without having to recompute them. `distance` can also be supplied as a matrix whose values represent the pairwise distances between units. The matrix should either be square, with a row and column for each unit (e.g., as the output of a call to `as.matrix(dist(.))`), or have as many rows as there are treated units and as many columns as there are control units (e.g., as the output of a call to `optmatch::match_on()`). Distance values of `Inf` will prevent the corresponding units from being matched. When `distance` is supplied as a numeric vector or matrix, `link` and `distance.options` are ignored.
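The two accepted matrix shapes can be sketched in base R as follows. The data and unit names are hypothetical, and Euclidean distance from `dist()` is used only for illustration; any pairwise distance could fill the matrix.

```r
# Toy data; values and unit names are hypothetical.
set.seed(7)
treat <- c(1, 0, 1, 0, 0)
X <- cbind(x1 = rnorm(5), x2 = rnorm(5))
rownames(X) <- paste0("unit", 1:5)

# Form 1: square matrix with a row and a column for every unit
d_square <- as.matrix(dist(X))

# Form 2: treated units in rows, control units in columns
d_tc <- d_square[treat == 1, treat == 0, drop = FALSE]

# Forbid a specific treated-control pair by setting its distance to Inf
d_tc["unit1", "unit2"] <- Inf

dim(d_square)  # 5 5
dim(d_tc)      # 2 3
```

According to the description above, either `d_square` or `d_tc` has a shape that could be passed to the `distance` argument.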

When specifying an argument to `distance` that estimates a propensity score, the output of the function called to estimate the propensity score (e.g., the `glm` object when `distance = "glm"`) will be included in the `matchit()` output object in the `model` component. When `distance` is anything other than `"mahalanobis"` and not a matrix, the estimated or supplied distance measures will be included in the `matchit()` output object in the `distance` component.

In versions of *MatchIt* prior to 4.0.0, `distance` was specified in a slightly different way. When specifying arguments using the old syntax, they will automatically be converted to the corresponding method in the new syntax but a warning will be thrown. `distance = "logit"`, the old default, will still work in the new syntax, though `distance = "glm", link = "logit"` is preferred (note that these are the default settings and don't need to be made explicit).

    library(MatchIt)
    data("lalonde")

    # Linearized probit regression PS:
    m.out1 <- matchit(treat ~ age + educ + race + married +
                        nodegree + re74 + re75, data = lalonde,
                      distance = "glm", link = "linear.probit")

    if (requireNamespace("mgcv", quietly = TRUE)) {
      # GAM logistic PS with smoothing splines (s()):
      m.out2 <- matchit(treat ~ s(age) + s(educ) + race + married +
                          nodegree + re74 + re75, data = lalonde,
                        distance = "gam")
      summary(m.out2$model)
    }
    #>
    #> Family: quasibinomial
    #> Link function: logit
    #>
    #> Formula:
    #> treat ~ s(age) + s(educ) + race + married + nodegree + re74 +
    #>     re75
    #>
    #> Parametric coefficients:
    #>               Estimate Std. Error t value Pr(>|t|)
    #> (Intercept)  5.436e-01  3.950e-01   1.376  0.16923
    #> racehispan  -2.447e+00  4.323e-01  -5.661 2.34e-08 ***
    #> racewhite   -2.995e+00  3.136e-01  -9.552  < 2e-16 ***
    #> married     -1.643e+00  3.437e-01  -4.781 2.20e-06 ***
    #> nodegree     7.893e-01  4.800e-01   1.645  0.10060
    #> re74        -9.838e-05  3.245e-05  -3.031  0.00254 **
    #> re75         5.113e-05  5.001e-05   1.022  0.30706
    #> ---
    #> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    #>
    #> Approximate significance of smooth terms:
    #>           edf Ref.df     F p-value
    #> s(age)  7.488  8.143 6.782  <2e-16 ***
    #> s(educ) 2.647  3.359 2.311  0.0628 .
    #> ---
    #> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    #>
    #> R-sq.(adj) =    0.5   Deviance explained = 46.1%
    #> GCV = 0.69813  Scale est. = 1.0287    n = 614

    if (requireNamespace("CBPS", quietly = TRUE)) {
      # CBPS for ATC matching w/replacement, using the just-identified
      # version of CBPS (setting method = "exact"):
      m.out3 <- matchit(treat ~ age + educ + race + married +
                          nodegree + re74 + re75, data = lalonde,
                        distance = "cbps", estimand = "ATC",
                        distance.options = list(method = "exact"),
                        replace = TRUE)
    }

    # Mahalanobis distance matching - no PS estimated
    m.out4 <- matchit(treat ~ age + educ + race + married +
                        nodegree + re74 + re75, data = lalonde,
                      distance = "mahalanobis")
    m.out4$distance #NULL
    #> NULL

    # Mahalanobis distance matching with PS estimated
    # for use in a caliper; matching done on mahvars
    m.out5 <- matchit(treat ~ age + educ + race + married +
                        nodegree + re74 + re75, data = lalonde,
                      distance = "glm", caliper = .1,
                      mahvars = ~ age + educ + race + married +
                        nodegree + re74 + re75)
    summary(m.out5)
    #>
    #> Call:
    #> matchit(formula = treat ~ age + educ + race + married + nodegree +
    #>     re74 + re75, data = lalonde, distance = "glm", mahvars = ~age +
    #>     educ + race + married + nodegree + re74 + re75, caliper = 0.1)
    #>
    #> Summary of Balance for All Data:
    #>            Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
    #> distance          0.5774        0.1822          1.7941     0.9211    0.3774
    #> age              25.8162       28.0303         -0.3094     0.4400    0.0813
    #> educ             10.3459       10.2354          0.0550     0.4959    0.0347
    #> raceblack         0.8432        0.2028          1.7615          .    0.6404
    #> racehispan        0.0595        0.1422         -0.3498          .    0.0827
    #> racewhite         0.0973        0.6550         -1.8819          .    0.5577
    #> married           0.1892        0.5128         -0.8263          .    0.3236
    #> nodegree          0.7081        0.5967          0.2450          .    0.1114
    #> re74           2095.5737     5619.2365         -0.7211     0.5181    0.2248
    #> re75           1532.0553     2466.4844         -0.2903     0.9563    0.1342
    #>            eCDF Max
    #> distance     0.6444
    #> age          0.1577
    #> educ         0.1114
    #> raceblack    0.6404
    #> racehispan   0.0827
    #> racewhite    0.5577
    #> married      0.3236
    #> nodegree     0.1114
    #> re74         0.4470
    #> re75         0.2876
    #>
    #>
    #> Summary of Balance for Matched Data:
    #>            Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
    #> distance          0.5096        0.4905          0.0865     1.0661    0.0244
    #> age              25.9459       25.0450          0.1259     0.4271    0.0878
    #> educ             10.4865       10.2793          0.1031     0.6640    0.0175
    #> raceblack         0.7387        0.7207          0.0496          .    0.0180
    #> racehispan        0.0991        0.0991          0.0000          .    0.0000
    #> racewhite         0.1622        0.1802         -0.0608          .    0.0180
    #> married           0.2072        0.2342         -0.0690          .    0.0270
    #> nodegree          0.6486        0.6577         -0.0198          .    0.0090
    #> re74           2667.1135     2215.3307          0.0925     1.8804    0.0429
    #> re75           1811.2988     1529.5967          0.0875     1.8724    0.0244
    #>            eCDF Max Std. Pair Dist.
    #> distance     0.1441          0.0933
    #> age          0.3243          0.9292
    #> educ         0.0811          0.7841
    #> raceblack    0.0180          0.0496
    #> racehispan   0.0000          0.0541
    #> racewhite    0.0180          0.1216
    #> married      0.0270          0.5751
    #> nodegree     0.0090          0.6143
    #> re74         0.2432          0.5596
    #> re75         0.0991          0.5166
    #>
    #> Percent Balance Improvement:
    #>            Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
    #> distance              95.2       22.0      93.5     77.6
    #> age                   59.3       -3.6      -8.0   -105.6
    #> educ                 -87.5       41.6      49.5     27.2
    #> raceblack             97.2          .      97.2     97.2
    #> racehispan           100.0          .     100.0    100.0
    #> racewhite             96.8          .      96.8     96.8
    #> married               91.6          .      91.6     91.6
    #> nodegree              91.9          .      91.9     91.9
    #> re74                  87.2        4.0      80.9     45.6
    #> re75                  69.9    -1303.5      81.8     65.5
    #>
    #> Sample Sizes:
    #>           Control Treated
    #> All           429     185
    #> Matched       111     111
    #> Unmatched     318      74
    #> Discarded       0       0
    #>

    # User-supplied propensity scores
    p.score <- fitted(glm(treat ~ age + educ + race + married +
                            nodegree + re74 + re75, data = lalonde,
                          family = binomial))

    m.out6 <- matchit(treat ~ age + educ + race + married +
                        nodegree + re74 + re75, data = lalonde,
                      distance = p.score)

    # User-supplied distance matrix using optmatch::match_on()
    if (requireNamespace("optmatch", quietly = TRUE)) {
      dist_mat <- optmatch::match_on(
        treat ~ age + educ + race + nodegree +
          married + re74 + re75, data = lalonde,
        method = "rank_mahalanobis")

      m.out7 <- matchit(treat ~ age + educ + race + nodegree +
                          married + re74 + re75, data = lalonde,
                        distance = dist_mat)
    }