Title: | E-Statistics: Multivariate Inference via the Energy of Data |
---|---|
Description: | E-statistics (energy) tests and statistics for multivariate and univariate inference, including distance correlation, one-sample, two-sample, and multi-sample tests for comparing multivariate distributions, are implemented. Measuring and testing multivariate independence based on distance correlation, partial distance correlation, multivariate goodness-of-fit tests, k-groups and hierarchical clustering based on energy distance, testing for multivariate normality, distance components (disco) for non-parametric analysis of structured data, and other energy statistics/methods are implemented. |
Authors: | Maria Rizzo [aut, cre], Gabor Szekely [aut] |
Maintainer: | Maria Rizzo <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.7-12 |
Built: | 2024-10-26 05:11:08 UTC |
Source: | https://github.com/mariarizzo/energy |
Maria L. Rizzo and Gabor J. Szekely
G. J. Szekely and M. L. Rizzo (2013). Energy statistics: A class of statistics based on distances, Journal of Statistical Planning and Inference, Vol. 143, No. 8, 1249-1272.
M. L. Rizzo and G. J. Szekely (2016). Energy Distance, WIRES Computational Statistics, Wiley, Volume 8 Issue 1, 27-38. Available online Dec., 2015, doi:10.1002/wics.1375.
G. J. Szekely and M. L. Rizzo (2017). The Energy of Data. The Annual Review of Statistics and Its Application 4:447-79.
G. J. Szekely and M. L. Rizzo (2023). The Energy of Data and Distance Correlation. Chapman & Hall/CRC Monographs on Statistics and Applied Probability. ISBN 9781482242744. https://www.routledge.com/The-Energy-of-Data-and-Distance-Correlation/Szekely-Rizzo/p/book/9781482242744.
Stand-alone double centering and U-centering functions that are applied in unbiased distance covariance, bias corrected distance correlation, and partial distance correlation.
Dcenter(x)
Ucenter(x)
U_center(Dx)
D_center(Dx)
x | dist object or data matrix |
Dx | distance or dissimilarity matrix |
In Dcenter and Ucenter, x must be a dist object or a data matrix. Both functions return a doubly centered distance matrix.
Note that pdcor, etc., functions include the centering operations (in C), so these stand-alone centering functions are needed only when one wants to compute just a double-centered or U-centered matrix. U_center and D_center are the Rcpp exports of the corresponding C++ functions.
All functions return a square symmetric matrix.

Dcenter returns the double-centered matrix with entries
$$\hat A_{ij} = a_{ij} - \bar a_{i.} - \bar a_{.j} + \bar a_{..},$$
as in classical multidimensional scaling. Ucenter returns the U-centered matrix, which has zero diagonal and off-diagonal entries
$$\tilde A_{ij} = a_{ij} - \frac{a_{i.}}{n-2} - \frac{a_{.j}}{n-2} + \frac{a_{..}}{(n-1)(n-2)},$$
where $a_{i.}$, $a_{.j}$, and $a_{..}$ denote row, column, and grand totals; this is the centering applied in pdcov and pdcor as well as in the unbiased dCov and bias-corrected dCor statistics.

The C++ versions D_center and U_center should typically be faster. The R versions are retained for historical reasons.
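As a quick check of the double-centering formula above, here is a minimal sketch (assuming the energy package is attached) that centers a distance matrix by hand and compares the result with Dcenter:

x <- iris[1:10, 1:4]
a <- as.matrix(dist(x))
## A_ij = a_ij - rowmean_i - colmean_j + grandmean
A <- sweep(sweep(a, 1, rowMeans(a)), 2, colMeans(a)) + mean(a)
all.equal(A, Dcenter(x), check.attributes = FALSE)  # should agree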
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G.J. and Rizzo, M.L. (2014), Partial Distance Correlation with Methods for Dissimilarities, Annals of Statistics, Vol. 42, No. 6, pp. 2382-2412.
x <- iris[1:10, 1:4]
dx <- dist(x)
Dx <- as.matrix(dx)
M <- U_center(Dx)
all.equal(M, U_center(M))  # idempotence
all.equal(M, D_center(M))  # invariance
Defunct: use dcorT.test and dcorT.
dcor.t(x, y, distance = FALSE)
dcor.ttest(x, y, distance = FALSE)
x | data or distances of first sample |
y | data or distances of second sample |
distance | TRUE if x and y are distances, otherwise FALSE |
See dcorT.
Maria L. Rizzo [email protected] and Gabor J. Szekely
Distance correlation t-test of multivariate independence for high dimension.
dcorT.test(x, y)
dcorT(x, y)
x | data or distances of first sample |
y | data or distances of second sample |
dcorT.test performs a nonparametric t-test of multivariate independence in high dimension (dimension close to or larger than sample size). As the dimension tends to infinity, the asymptotic distribution of the test statistic is approximately Student t with $\nu - 1$ degrees of freedom, where $\nu = n(n-3)/2$, and for large $\nu$ the statistic is approximately distributed as standard normal.

The sample sizes (number of rows) of the two samples must agree, and samples must not contain missing values.

The t statistic (dcorT) is a transformation of a bias-corrected version of distance correlation (see SR 2013 for details). Large values (upper tail) of the dcorT statistic are significant.
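A minimal sketch of the upper-tail t reference distribution described above. The degrees of freedom n(n-3)/2 - 1 below are an assumption based on SR 2013, so compare the result against dcorT.test directly:

n <- 30
x <- matrix(rnorm(n * 10), n)
y <- matrix(rnorm(n * 10), n)
tstat <- dcorT(x, y)
## assumed df; compare with dcorT.test(x, y)$p.value
pt(tstat, df = n * (n - 3) / 2 - 1, lower.tail = FALSE)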
dcorT returns the dcor t statistic, and dcorT.test returns a list with class htest containing
method | description of test |
statistic | observed value of the test statistic |
parameter | degrees of freedom |
estimate | (bias corrected) squared dCor(x,y) |
p.value | p-value of the t-test |
data.name | description of data |
dcor.t and dcor.ttest are deprecated.
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G.J. and Rizzo, M.L. (2013). The distance correlation t-test of independence in high dimension. Journal of Multivariate Analysis, Volume 117, pp. 193-213.
doi:10.1016/j.jmva.2013.02.012
Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
Measuring and Testing Dependence by Correlation of Distances,
Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
doi:10.1214/009053607000000505
Szekely, G.J. and Rizzo, M.L. (2009),
Brownian Distance Covariance,
Annals of Applied Statistics,
Vol. 3, No. 4, 1236-1265.
doi:10.1214/09-AOAS312
x <- matrix(rnorm(100), 10, 10)
y <- matrix(runif(100), 10, 10)
dcorT(x, y)
dcorT.test(x, y)
Distance covariance test and distance correlation test of multivariate independence. Distance covariance and distance correlation are multivariate measures of dependence.
dcov.test(x, y, index = 1.0, R = NULL)
dcor.test(x, y, index = 1.0, R)
x | data or distances of first sample |
y | data or distances of second sample |
R | number of replicates |
index | exponent on Euclidean distance, in (0,2] |
dcov.test and dcor.test are nonparametric tests of multivariate independence. The test decision is obtained via permutation bootstrap, with R replicates.
The sample sizes (number of rows) of the two samples must agree, and samples must not contain missing values.
The index is an optional exponent on Euclidean distance. Valid exponents for energy are in the open interval (0, 2); the endpoint 2 is excluded because characterization (2) below fails at index 2.
Argument types supported are numeric data matrix, data.frame, or tibble, with observations in rows; numeric vector; ordered or unordered factors. In case of unordered factors a 0-1 distance matrix is computed.
Optionally, pre-computed distances can be input as class "dist" objects or as distance matrices. For data-type arguments, distance matrices are computed internally.
The dcov test statistic is $n\mathcal{V}_n^2$, where $\mathcal{V}_n = \mathrm{dcov}(x,y)$, which is based on interpoint Euclidean distances $\|x_i - x_j\|^{index}$. The index is an optional exponent on Euclidean distance.
Similarly, the dcor test statistic is based on the normalized coefficient, the distance correlation. (See the manual page for dcor.)
Distance correlation is a new measure of dependence between random vectors introduced by Szekely, Rizzo, and Bakirov (2007). For all distributions with finite first moments, distance correlation $\mathcal{R}$ generalizes the idea of correlation in two fundamental ways:

(1) $\mathcal{R}(X,Y)$ is defined for $X$ and $Y$ in arbitrary dimension.

(2) $\mathcal{R}(X,Y) = 0$ characterizes independence of $X$ and $Y$.

Characterization (2) also holds for powers of Euclidean distance $\|x-y\|^s$, where $0 < s < 2$, but (2) does not hold when $s = 2$.
Distance correlation satisfies $0 \le \mathcal{R} \le 1$, and $\mathcal{R} = 0$ only if $X$ and $Y$ are independent. Distance covariance $\mathcal{V}$ provides a new approach to the problem of testing the joint independence of random vectors. The formal definitions of the population coefficients $\mathcal{V}$ and $\mathcal{R}$ are given in (SRB 2007). The definitions of the empirical coefficients are given in the energy dcov topic.
For all values of the index in (0,2), under independence the asymptotic distribution of $n\mathcal{V}_n^2$ is a quadratic form of centered Gaussian random variables, with coefficients that depend on the distributions of $X$ and $Y$. For the general problem of testing independence when the distributions of $X$ and $Y$ are unknown, the test based on $n\mathcal{V}_n^2$ can be implemented as a permutation test. See (SRB 2007) for theoretical properties of the test, including statistical consistency.
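The permutation test is performed internally by dcov.test; the following hand-rolled sketch (assuming the energy package is attached) only illustrates the idea:

set.seed(1)
n <- 30
x <- rnorm(n)
y <- rnorm(n)
stat <- n * dcov(x, y)^2                             # observed n V_n^2
reps <- replicate(199, n * dcov(x[sample(n)], y)^2)  # permute one sample
mean(c(stat, reps) >= stat)                          # approximate p-value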
dcov.test or dcor.test returns a list with class htest containing
method | description of test |
statistic | observed value of the test statistic |
estimate | dCov(x,y) or dCor(x,y) |
estimates | a vector: [dCov(x,y), dCor(x,y), dVar(x), dVar(y)] |
condition | logical, permutation test applied |
replicates | replicates of the test statistic |
p.value | approximate p-value of the test |
n | sample size |
data.name | description of data |
For the dcov test of independence, the distance covariance test statistic is the V-statistic $n\mathcal{V}_n^2$ (not dCov).
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
Measuring and Testing Dependence by Correlation of Distances,
Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
doi:10.1214/009053607000000505
Szekely, G.J. and Rizzo, M.L. (2009),
Brownian Distance Covariance,
Annals of Applied Statistics,
Vol. 3, No. 4, 1236-1265.
doi:10.1214/09-AOAS312
Szekely, G.J. and Rizzo, M.L. (2009), Rejoinder: Brownian Distance Covariance, Annals of Applied Statistics, Vol. 3, No. 4, 1303-1308.
dcov, dcor, pdcov.test, pdcor.test, dcor.ttest
x <- iris[1:50, 1:4]
y <- iris[51:100, 1:4]
set.seed(1)
dcor.test(dist(x), dist(y), R=199)
set.seed(1)
dcov.test(x, y, R=199)
For bivariate data only, these are fast O(n log n) implementations of distance correlation and distance covariance statistics. The U-statistic for dcov^2 is unbiased; the V-statistic is the original definition in SRB 2007. These algorithms do not store the distance matrices, so they are suitable for large samples.
dcor2d(x, y, type = c("V", "U"))
dcov2d(x, y, type = c("V", "U"), all.stats = FALSE)
x | numeric vector |
y | numeric vector |
type | "V" or "U", for V- or U-statistics |
all.stats | logical |
The unbiased (squared) dcov is documented in dcovU, for multivariate data in arbitrary, not necessarily equal, dimensions. dcov2d and dcor2d provide a faster O(n log n) algorithm for bivariate (x, y) only (X and Y are real-valued random variables). The O(n log n) algorithm was proposed by Huo and Szekely (2016). The algorithm is faster above a certain sample size n. It does not store the distance matrix, so the sample size can be very large.
By default, dcov2d returns the V-statistic $V_n^2$, and if type="U" it returns the U-statistic, which is unbiased for $\mathcal{V}^2(X,Y)$. The argument all.stats=TRUE is used internally when the function is called from dcor2d.

By default, dcor2d returns $R_n^2$, and if type="U" it returns a bias-corrected estimator of squared dcor equivalent to bcdcor.
These functions do not store the distance matrices so they are helpful when sample size is large and the data is bivariate.
The U-statistic can be negative in the lower tail, so the square root of the U-statistic is not applied. Similarly, dcor2d(x, y, "U") is bias corrected and can be negative in the lower tail, so we do not take the square root. The original definitions of dCov and dCor (SRB2007, SR2009) were based on V-statistics, which are non-negative, and were defined using the square root of the V-statistics.

It has been suggested that instead of taking the square root of the U-statistic, one could take the square root of its absolute value and then restore the sign, but that introduces more bias than the original dCor and should never be used.
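An illustrative timing sketch, assuming the energy package is attached; timings vary by machine, but the two V-statistics should agree:

set.seed(1)
x <- rnorm(2000)
y <- rnorm(2000)
system.time(v1 <- dcov(x, y)^2)   # O(n^2): builds distance matrices
system.time(v2 <- dcov2d(x, y))   # O(n log n): no distance matrix stored
all.equal(v1, v2, check.attributes = FALSE)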
Maria L. Rizzo [email protected] and Gabor J. Szekely
Huo, X. and Szekely, G.J. (2016). Fast computing for distance covariance. Technometrics, 58(4), 435-447.
Szekely, G.J. and Rizzo, M.L. (2014), Partial Distance Correlation with Methods for Dissimilarities. Annals of Statistics, Vol. 42 No. 6, 2382-2412.
Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
Measuring and Testing Dependence by Correlation of Distances,
Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
doi:10.1214/009053607000000505
dcov, dcov.test, dcor, dcor.test (multivariate statistics and permutation tests)
## these are equivalent, but 2d is faster for n > 50
n <- 100
x <- rnorm(n)
y <- rnorm(n)
all.equal(dcov(x, y)^2, dcov2d(x, y), check.attributes = FALSE)
all.equal(bcdcor(x, y), dcor2d(x, y, "U"), check.attributes = FALSE)

x <- rlnorm(400)
y <- rexp(400)
dcov.test(x, y, R=199)  # permutation test
dcor.test(x, y, R=199)
This function computes unbiased estimators of squared distance covariance, distance variance, and a bias-corrected estimator of (squared) distance correlation.
dcovU_stats(Dx, Dy)
Dx | distance matrix of first sample |
Dy | distance matrix of second sample |
The unbiased (squared) dcov is the inner-product definition of dCov, in the Hilbert space of U-centered distance matrices.
The sample sizes (number of rows) of the two samples must agree, and samples must not contain missing values. The arguments must be square symmetric matrices.
dcovU_stats returns a vector of the components of bias-corrected dcor: [dCovU, bcdcor, dVarXU, dVarYU].
Unbiased distance covariance (SR2014) corresponds to the biased (original) $\mathcal{V}_n^2$. Since dcovU is an unbiased statistic, it is signed and we do not take the square root. For the original distance covariance test of independence (SRB2007, SR2009), the distance covariance test statistic is the V-statistic $n\mathcal{V}_n^2$ (not dCov). Similarly, bcdcor is bias corrected, so we do not take the square root as with dCor.
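A minimal sketch of the inner-product definition above, assuming the energy package is attached; the inner product of the U-centered matrices, divided by n(n-3), should agree with the dCovU component:

x <- iris[1:50, 1:4]
y <- iris[51:100, 1:4]
Dx <- as.matrix(dist(x))
Dy <- as.matrix(dist(y))
n <- nrow(Dx)
A <- U_center(Dx)
B <- U_center(Dy)
sum(A * B) / (n * (n - 3))  # inner product of U-centered matrices
dcovU_stats(Dx, Dy)[1]      # dCovU: should agree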
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G.J. and Rizzo, M.L. (2014), Partial Distance Correlation with Methods for Dissimilarities. Annals of Statistics, Vol. 42 No. 6, 2382-2412.
Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
Measuring and Testing Dependence by Correlation of Distances,
Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
doi:10.1214/009053607000000505
Szekely, G.J. and Rizzo, M.L. (2009),
Brownian Distance Covariance,
Annals of Applied Statistics,
Vol. 3, No. 4, 1236-1265.
doi:10.1214/09-AOAS312
x <- iris[1:50, 1:4]
y <- iris[51:100, 1:4]
Dx <- as.matrix(dist(x))
Dy <- as.matrix(dist(y))
dcovU_stats(Dx, Dy)
E-statistics DIStance COmponents and tests, analogous to variance components and anova.
disco(x, factors, distance, index=1.0, R, method=c("disco","discoB","discoF"))
disco.between(x, factors, distance, index=1.0, R)
x | data matrix or distance matrix or dist object |
factors | matrix or data frame of factor labels or integers (not design matrix) |
distance | logical, TRUE if x is distance matrix |
index | exponent on Euclidean distance in (0,2] |
R | number of replicates for a permutation test |
method | test statistic |
disco calculates the distance components decomposition of total dispersion and, if R > 0, tests for significance using as test statistic the disco "F" ratio (default method="disco"), or the between component statistic (method="discoB"), each implemented by permutation test.
If x is a dist object, the argument distance is ignored. If x is a distance matrix, set distance=TRUE.
In the current release disco computes the decomposition for one-way models only.
When method="discoF"
, disco
returns a list similar to the
return value from anova.lm
, and the print.disco
method is
provided to format the output into a similar table. Details:
disco returns a class disco object, which is a list containing
call | call |
method | method |
statistic | vector of observed statistics |
p.value | vector of p-values |
k | number of factors |
N | number of observations |
between | between-sample distance components |
withins | one-way within-sample distance components |
within | within-sample distance component |
total | total dispersion |
Df.trt | degrees of freedom for treatments |
Df.e | degrees of freedom for error |
index | index (exponent on distance) |
factor.names | factor names |
factor.levels | factor levels |
sample.sizes | sample sizes |
stats | matrix containing decomposition |
When method="discoB"
, disco
passes the arguments to
disco.between
, which returns a class htest
object.
disco.between returns a class htest object, where the test statistic is the between-sample statistic (proportional to the numerator of the F ratio of the disco test).
The current version does all calculations via matrix arithmetic and the boot function. Support for more general additive models and a formula interface is under development.
disco methods have been added to the cluster distance summary function edist, and to the energy tests for equality of distribution (see eqdist.etest).
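A small sketch of the decomposition using the warpbreaks data; the between and within components should sum to the total dispersion:

data(warpbreaks)
d <- disco(warpbreaks$breaks, factors = warpbreaks$tension, R = 0)
d$between + d$within  # should equal d$total (decomposition of total dispersion)
d$total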
Maria L. Rizzo [email protected] and Gabor J. Szekely
M. L. Rizzo and G. J. Szekely (2010).
DISCO Analysis: A Nonparametric Extension of
Analysis of Variance, Annals of Applied Statistics,
Vol. 4, No. 2, 1034-1055.
doi:10.1214/09-AOAS245
edist, eqdist.e, eqdist.etest, ksample.e
## warpbreaks one-way decompositions
data(warpbreaks)
attach(warpbreaks)
disco(breaks, factors=wool, R=99)

## warpbreaks two-way wool+tension
disco(breaks, factors=data.frame(wool, tension), R=0)

## warpbreaks two-way wool*tension
disco(breaks, factors=data.frame(wool, tension, wool:tension), R=0)

## When index=2 for univariate data, we get ANOVA decomposition
disco(breaks, factors=tension, index=2.0, R=99)
aov(breaks ~ tension)

## Multivariate response
## Example on producing plastic film from Krzanowski (1998, p. 381)
tear <- c(6.5, 6.2, 5.8, 6.5, 6.5, 6.9, 7.2, 6.9, 6.1, 6.3, 6.7, 6.6, 7.2, 7.1, 6.8, 7.1, 7.0, 7.2, 7.5, 7.6)
gloss <- c(9.5, 9.9, 9.6, 9.6, 9.2, 9.1, 10.0, 9.9, 9.5, 9.4, 9.1, 9.3, 8.3, 8.4, 8.5, 9.2, 8.8, 9.7, 10.1, 9.2)
opacity <- c(4.4, 6.4, 3.0, 4.1, 0.8, 5.7, 2.0, 3.9, 1.9, 5.7, 2.8, 4.1, 3.8, 1.6, 3.4, 8.4, 5.2, 6.9, 2.7, 1.9)
Y <- cbind(tear, gloss, opacity)
rate <- factor(gl(2,10), labels=c("Low", "High"))

## test for equal distributions by rate
disco(Y, factors=rate, R=99)
disco(Y, factors=rate, R=99, method="discoB")

## Just extract the decomposition table
disco(Y, factors=rate, R=0)$stats

## Compare eqdist.e methods for rate
## disco between stat is half of original when sample sizes equal
eqdist.e(Y, sizes=c(10, 10), method="original")
eqdist.e(Y, sizes=c(10, 10), method="discoB")

## The between-sample distance component
disco.between(Y, factors=rate, R=0)
Computes distance covariance and distance correlation statistics, which are multivariate measures of dependence.
dcov(x, y, index = 1.0)
dcor(x, y, index = 1.0)
x | data or distances of first sample |
y | data or distances of second sample |
index | exponent on Euclidean distance, in (0,2] |
dcov and dcor compute the distance covariance and distance correlation statistics.
The sample sizes (number of rows) of the two samples must agree, and samples must not contain missing values.
The index is an optional exponent on Euclidean distance. Valid exponents for energy are in the open interval (0, 2); the endpoint 2 is excluded because characterization (2) below fails at index 2.
Argument types supported are numeric data matrix, data.frame, or tibble, with observations in rows; numeric vector; ordered or unordered factors. In case of unordered factors a 0-1 distance matrix is computed.
Optionally, pre-computed distances can be input as class "dist" objects or as distance matrices. For data-type arguments, distance matrices are computed internally.
Distance correlation is a new measure of dependence between random vectors introduced by Szekely, Rizzo, and Bakirov (2007). For all distributions with finite first moments, distance correlation $\mathcal{R}$ generalizes the idea of correlation in two fundamental ways:

(1) $\mathcal{R}(X,Y)$ is defined for $X$ and $Y$ in arbitrary dimension.

(2) $\mathcal{R}(X,Y) = 0$ characterizes independence of $X$ and $Y$.
Distance correlation satisfies $0 \le \mathcal{R} \le 1$, and $\mathcal{R} = 0$ only if $X$ and $Y$ are independent. Distance covariance $\mathcal{V}$ provides a new approach to the problem of testing the joint independence of random vectors. The formal definitions of the population coefficients $\mathcal{V}$ and $\mathcal{R}$ are given in (SRB 2007). The definitions of the empirical coefficients are as follows.

The empirical distance covariance $\mathcal{V}_n(X,Y)$ with index 1 is the nonnegative number defined by
$$\mathcal{V}_n^2(X,Y) = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl} B_{kl},$$
where $A_{kl}$ and $B_{kl}$ are
$$A_{kl} = a_{kl} - \bar{a}_{k.} - \bar{a}_{.l} + \bar{a}_{..}, \qquad B_{kl} = b_{kl} - \bar{b}_{k.} - \bar{b}_{.l} + \bar{b}_{..}.$$
Here $a_{kl} = \|x_k - x_l\|$ and $b_{kl} = \|y_k - y_l\|$, and the subscript $.$ denotes that the mean is computed for the index that it replaces. Similarly, $\mathcal{V}_n(X)$ is the nonnegative number defined by
$$\mathcal{V}_n^2(X) = \mathcal{V}_n^2(X,X) = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl}^2.$$
The empirical distance correlation $\mathcal{R}_n(X,Y)$ is the square root of
$$\mathcal{R}_n^2(X,Y) = \frac{\mathcal{V}_n^2(X,Y)}{\sqrt{\mathcal{V}_n^2(X)\,\mathcal{V}_n^2(Y)}}.$$
See dcov.test for a test of multivariate independence based on the distance covariance statistic.
dcov returns the sample distance covariance and dcor returns the sample distance correlation. Note that it is inefficient to compute dCor as dcov(x,y)/sqrt(dcov(x,x)*dcov(y,y)) because the individual calls to dcov involve unnecessary repetition of calculations.
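A minimal sketch of the empirical definition above (index 1), computing $\mathcal{V}_n$ by explicit double centering and comparing with dcov (assuming the energy package is attached):

x <- iris[1:50, 1:4]
y <- iris[51:100, 1:4]
a <- as.matrix(dist(x))
b <- as.matrix(dist(y))
## double center each distance matrix
A <- sweep(sweep(a, 1, rowMeans(a)), 2, colMeans(a)) + mean(a)
B <- sweep(sweep(b, 1, rowMeans(b)), 2, colMeans(b)) + mean(b)
sqrt(mean(A * B))  # V_n(X,Y): should agree with dcov(x, y)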
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
Measuring and Testing Dependence by Correlation of Distances,
Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
doi:10.1214/009053607000000505
Szekely, G.J. and Rizzo, M.L. (2009),
Brownian Distance Covariance,
Annals of Applied Statistics,
Vol. 3, No. 4, 1236-1265.
doi:10.1214/09-AOAS312
Szekely, G.J. and Rizzo, M.L. (2009), Rejoinder: Brownian Distance Covariance, Annals of Applied Statistics, Vol. 3, No. 4, 1303-1308.
dcov2d, dcor2d, bcdcor, dcovU, pdcor, dcov.test, dcor.test, pdcor.test
x <- iris[1:50, 1:4]
y <- iris[51:100, 1:4]
dcov(x, y)
dcov(dist(x), dist(y))  # same thing
Utilities for working with distance matrices. is.dmatrix checks whether the argument is a distance or dissimilarity matrix: is it square, symmetric, and non-negative, with zero diagonal? calc_dist computes a distance matrix directly from a data matrix.
is.dmatrix(x, tol = 100 * .Machine$double.eps)
calc_dist(x)
x | numeric matrix |
tol | tolerance for checking required conditions |
Energy functions work with the distance matrices of samples. The is.dmatrix function is used internally when converting arguments to distance matrices. The default tol is the same as the default tolerance of isSymmetric.
calc_dist is an exported Rcpp function that returns a Euclidean distance matrix from the input data matrix.
is.dmatrix returns TRUE if (within tolerance) x is a distance/dissimilarity matrix, otherwise FALSE. It returns FALSE if x is a class dist object.
calc_dist returns the Euclidean distance matrix for the data matrix x, which has observations in rows. In practice, if dist(x) is not yet computed, calc_dist(x) will be faster than as.matrix(dist(x)).
On working with non-Euclidean dissimilarities, see the references.
Maria L. Rizzo [email protected]
Szekely, G.J. and Rizzo, M.L. (2014), Partial Distance Correlation with Methods for Dissimilarities. Annals of Statistics, Vol. 42 No. 6, 2382-2412.
x <- matrix(rnorm(20), 10, 2)
D <- calc_dist(x)
is.dmatrix(D)
is.dmatrix(cov(x))
Returns the E-distances (energy statistics) between clusters.
edist(x, sizes, distance = FALSE, ix = 1:sum(sizes), alpha = 1,
      method = c("cluster", "discoB"))
x | data matrix of pooled sample or Euclidean distances |
sizes | vector of sample sizes |
distance | logical: if TRUE, x is a distance matrix |
ix | a permutation of the row indices of x |
alpha | distance exponent in (0,2] |
method | how to weight the statistics |
A vector containing the pairwise two-sample multivariate E-statistics for comparing clusters or samples is returned. The e-distance between clusters is computed from the original pooled data, stacked in matrix x where each row is a multivariate observation, or from the distance matrix x of the original data, or a distance object returned by dist. The first sizes[1] rows of the original data matrix are the first sample, the next sizes[2] rows are the second sample, etc. The permutation vector ix may be used to obtain e-distances corresponding to a clustering solution at a given level in the hierarchy.
The default method cluster summarizes the e-distances between clusters in a table.
The e-distance between two clusters $C_i$, $C_j$ of sizes $n_i$, $n_j$ proposed by Szekely and Rizzo (2005) is the e-distance $e(C_i, C_j)$, defined by
$$e(C_i, C_j) = \frac{n_i n_j}{n_i + n_j}\left[2 M_{ij} - M_{ii} - M_{jj}\right],$$
where
$$M_{ij} = \frac{1}{n_i n_j} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \|X_{ip} - X_{jq}\|^\alpha,$$
$\|\cdot\|$ denotes Euclidean norm, $\alpha =$ alpha, and $X_{ip}$ denotes the p-th observation in the i-th cluster. The exponent alpha should be in the interval (0,2].

The coefficient $\frac{n_i n_j}{n_i + n_j}$ is one-half of the harmonic mean of the sample sizes. The discoB method is related, but with different ways of summarizing the pairwise differences between samples. The disco methods apply the coefficient $\frac{n_i n_j}{2N}$, where N is the total number of observations. This weights each (i,j) statistic by sample size relative to N. See the disco topic for more details.
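A minimal sketch of the two-sample e-distance formula above, computed by hand for the first two iris samples; the result should agree with edist:

x <- as.matrix(iris[1:100, 1:4])  # two samples of size 50
D <- as.matrix(dist(x))
i <- 1:50
j <- 51:100
Mij <- mean(D[i, j])  # mean between-sample distance
Mii <- mean(D[i, i])  # mean within first sample (zero diagonal included)
Mjj <- mean(D[j, j])  # mean within second sample
(50 * 50 / (50 + 50)) * (2 * Mij - Mii - Mjj)  # compare: edist(x, c(50, 50))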
An object of class dist containing the lower triangle of the e-distance matrix of cluster distances corresponding to the permutation of indices ix is returned. The method attribute of the distance object is assigned a value of type, index.
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G. J. and Rizzo, M. L. (2005) Hierarchical Clustering
via Joint Between-Within Distances: Extending Ward's Minimum
Variance Method, Journal of Classification 22(2) 151-183.
doi:10.1007/s00357-005-0012-9
M. L. Rizzo and G. J. Szekely (2010).
DISCO Analysis: A Nonparametric Extension of
Analysis of Variance, Annals of Applied Statistics,
Vol. 4, No. 2, 1034-1055.
doi:10.1214/09-AOAS245
Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).
Szekely, G. J. (2000) Technical Report 03-05, E-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.
energy.hclust, eqdist.etest, ksample.e, disco
## compute cluster e-distances for 3 samples of iris data
data(iris)
edist(iris[,1:4], c(50,50,50))

## pairwise disco statistics
edist(iris[,1:4], c(50,50,50), method="discoB")

## compute e-distances from a distance object
edist(dist(iris[,1:4]), c(50, 50, 50), distance=TRUE, alpha = 1)

## compute e-distances from a distance matrix
d <- as.matrix(dist(iris[,1:4]))
edist(d, c(50, 50, 50), distance=TRUE, alpha = 1)
These deprecated functions have been replaced by revised functions and will be removed in future releases of the energy package.
DCOR(x, y, index=1.0)
x | data or distances of first sample |
y | data or distances of second sample |
index | exponent on Euclidean distance in (0, 2) |
DCOR is an R version replaced by faster compiled code.
Performs hierarchical clustering by the minimum (energy) E-distance method.
energy.hclust(dst, alpha = 1)
dst | dist object |
alpha | distance exponent |
Dissimilarities are $d(x,y) = \|x - y\|^\alpha$, where the exponent $\alpha$ is in the interval (0,2].
This function performs agglomerative hierarchical clustering. Initially, each of the n singletons is a cluster. At each of n-1 steps, the procedure merges the pair of clusters with minimum e-distance.
The e-distance between two clusters $C_i$, $C_j$ of sizes $n_i$, $n_j$ is given by
$$e(C_i, C_j) = \frac{n_i n_j}{n_i + n_j}\left[2 M_{ij} - M_{ii} - M_{jj}\right],$$
where $M_{ij} = \frac{1}{n_i n_j} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \|X_{ip} - X_{jq}\|^\alpha$, $\|\cdot\|$ denotes Euclidean norm, and $X_{ip}$ denotes the p-th observation in the i-th cluster.
The return value is an object of class hclust, so hclust methods such as the print and plot methods, plclust, and cutree are available. See the documentation for hclust.
The e-distance measures both the heterogeneity between clusters and the homogeneity within clusters. E-clustering ($\alpha = 1$) is particularly effective in high dimension, and is more effective than some standard hierarchical methods when clusters have equal means (see example below). For other advantages see the references.
edist computes the energy distances for the result (or any partition) and returns the cluster distances in a dist object. See the edist examples.
An object of class hclust which describes the tree produced by the clustering process. The object is a list with components:
merge: | an n-1 by 2 matrix, where row i of merge describes the merging of clusters at step i of the clustering |
height: | the clustering height: a vector of n-1 non-decreasing real numbers (the e-distance between merging clusters) |
order: | a vector giving a permutation of the indices of original observations suitable for plotting, in the sense that a cluster plot using this ordering and matrix merge will not have crossings of the branches |
labels: | labels for each of the objects being clustered. |
call: | the call which produced the result. |
method: | the cluster method that has been used (e-distance). |
dist.method: | the distance that has been used to create dst |
Currently stats::hclust implements Ward's method by method="ward.D2", which applies the squared distances. That method was previously "ward". Because both hclust and energy use the same type of Lance-Williams recursive formula to update cluster distances, now with the additional option method="ward.D" in hclust, the energy distance method is easily implemented by hclust. (Some "Ward" algorithms do not use Lance-Williams, however.) Energy clustering (with alpha=1) and "ward.D" now return the same result, except that the cluster heights of energy hierarchical clustering with alpha=1 are two times the heights from hclust. In order to ensure compatibility with hclust methods, energy.hclust now passes arguments through to hclust after possibly applying the optional exponent to distance.
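A small sketch of the relationship with "ward.D" noted above; per that note, the energy heights (alpha = 1) should be twice the hclust heights:

data(USArrests)
d <- dist(USArrests)
e <- energy.hclust(d)
w <- hclust(d, method = "ward.D")
all.equal(e$height, 2 * w$height)  # should be TRUE per the note above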
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G. J. and Rizzo, M. L. (2005) Hierarchical Clustering
via Joint Between-Within Distances: Extending Ward's Minimum
Variance Method, Journal of Classification 22(2) 151-183.
doi:10.1007/s00357-005-0012-9
Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).
Szekely, G. J. (2000) Technical Report 03-05: E-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.
edist, ksample.e, eqdist.etest, hclust
## Not run:
library(cluster)
data(animals)
plot(energy.hclust(dist(animals)))

data(USArrests)
ecl <- energy.hclust(dist(USArrests))
print(ecl)
plot(ecl)
cutree(ecl, k=3)
cutree(ecl, h=150)

## compare performance of e-clustering, Ward's method, group average method
## when sampled populations have equal means: n=200, d=5, two groups
z <- rbind(matrix(rnorm(1000), nrow=200), matrix(rnorm(1000, 0, 5), nrow=200))
g <- c(rep(1, 200), rep(2, 200))
d <- dist(z)
e <- energy.hclust(d)
a <- hclust(d, method="average")
w <- hclust(d^2, method="ward.D2")
list("E" = table(cutree(e, k=2) == g),
     "Ward" = table(cutree(w, k=2) == g),
     "Avg" = table(cutree(a, k=2) == g))
## End(Not run)
Performs the nonparametric multisample E-statistic (energy) test for equality of multivariate distributions.
eqdist.etest(x, sizes, distance = FALSE,
             method = c("original", "discoB", "discoF"), R)
eqdist.e(x, sizes, distance = FALSE,
         method = c("original", "discoB", "discoF"))
ksample.e(x, sizes, distance = FALSE,
          method = c("original", "discoB", "discoF"), ix = 1:sum(sizes))
x | data matrix of pooled sample |
sizes | vector of sample sizes |
distance | logical: if TRUE, first argument is a distance matrix |
method | use original (default) or distance components (discoB, discoF) |
R | number of bootstrap replicates |
ix | a permutation of the row indices of x |
The k-sample multivariate E-test of equal distributions is performed. The statistic is computed from the original pooled samples, stacked in matrix x where each row is a multivariate observation, or from the corresponding distance matrix. The first sizes[1] rows of x are the first sample, the next sizes[2] rows of x are the second sample, etc.

The test is implemented by nonparametric bootstrap, an approximate permutation test with R replicates.
The function eqdist.e returns the test statistic only; it simply passes the arguments through to eqdist.etest with R = 0.
The k-sample multivariate E-statistic for testing equal distributions is returned. The statistic is computed from the original pooled samples, stacked in matrix x where each row is a multivariate observation, or from the distance matrix x of the original data. The first sizes[1] rows of x are the first sample, the next sizes[2] rows of x are the second sample, etc.
The two-sample E-statistic proposed by Szekely and Rizzo (2004) is the e-distance $e(S_i, S_j)$, defined for two samples $S_i$, $S_j$ of sizes $n_i$, $n_j$ by
$$e(S_i, S_j) = \frac{n_i n_j}{n_i + n_j}\left[2 M_{ij} - M_{ii} - M_{jj}\right],$$
where
$$M_{ij} = \frac{1}{n_i n_j} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \|X_{ip} - X_{jq}\|,$$
$\|\cdot\|$ denotes Euclidean norm, and $X_{ip}$ denotes the p-th observation in the i-th sample.

The original (default method) k-sample E-statistic is defined by summing the pairwise e-distances over all $k(k-1)/2$ pairs of samples:
$$\mathcal{E} = \sum_{i<j} e(S_i, S_j).$$
Large values of $\mathcal{E}$ are significant.
The discoB method computes the between-sample disco statistic. For a one-way analysis, it is related to the original statistic as follows: in the equation above, the weights $\frac{n_i n_j}{n_i + n_j}$ are replaced with $\frac{n_i n_j}{2N}$, where N is the total number of observations, $N = n_1 + \cdots + n_k$. The discoF method is based on the disco F ratio, while the discoB method is based on the between-sample component.
Also see the disco and disco.between functions.
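A minimal sketch of the default k-sample statistic above: it should equal the sum of the pairwise e-distances returned by edist (assuming the energy package is attached):

data(iris)
eqdist.e(iris[ , 1:4], sizes = c(50, 50, 50))  # k-sample E-statistic
sum(edist(iris[ , 1:4], c(50, 50, 50)))        # sum of pairwise e-distances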
A list with class htest containing

method | description of test |
statistic | observed value of the test statistic |
p.value | approximate p-value of the test |
data.name | description of data |

eqdist.e returns the test statistic only.
The pairwise e-distances between samples can be conveniently computed by the edist function, which returns a dist object.
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).
M. L. Rizzo and G. J. Szekely (2010).
DISCO Analysis: A Nonparametric Extension of
Analysis of Variance, Annals of Applied Statistics,
Vol. 4, No. 2, 1034-1055.
doi:10.1214/09-AOAS245
Szekely, G. J. (2000) Technical Report 03-05: E-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.
ksample.e, edist, disco, disco.between, energy.hclust.
data(iris)
## test if the 3 varieties of iris data (d=4) have equal distributions
eqdist.etest(iris[,1:4], c(50,50,50), R = 199)

## example that uses method="disco"
x <- matrix(rnorm(100), nrow=20)
y <- matrix(rnorm(100), nrow=20)
X <- rbind(x, y)
d <- dist(X)

## should match edist default statistic
set.seed(1234)
eqdist.etest(d, sizes=c(20, 20), distance=TRUE, R = 199)

## comparison with edist
edist(d, sizes=c(20, 20), distance=TRUE)

## for comparison
g <- as.factor(rep(1:2, c(20, 20)))
set.seed(1234)
disco(d, factors=g, distance=TRUE, R=199)

## should match statistic in edist method="discoB", above
set.seed(1234)
disco.between(d, factors=g, distance=TRUE, R=199)
Pre-computed eigenvalues corresponding to the asymptotic sampling distribution of the energy test statistic for univariate normality, under the null hypothesis. Four Cases are computed:
Simple hypothesis, known parameters.
Estimated mean, known variance.
Known mean, estimated variance.
Composite hypothesis, estimated parameters.
Case 4 eigenvalues are used in the test function normal.test when method=="limit".
data(EVnormal)
Numeric matrix with 125 rows and 5 columns; column 1 is the index, and columns 2-5 are the eigenvalues of Cases 1-4.
Computed
Szekely, G. J. and Rizzo, M. L. (2005) A New Test for Multivariate Normality, Journal of Multivariate Analysis, 93/1, 58-80, doi:10.1016/j.jmva.2003.12.002.
Computes a multivariate nonparametric test of independence. The default method implements the distance covariance test dcov.test.
indep.test(x, y, method = c("dcov","mvI"), index = 1, R)
x | matrix: first sample, observations in rows |
y | matrix: second sample, observations in rows |
method | a character string giving the name of the test |
index | exponent on Euclidean distances |
R | number of replicates |
indep.test with the default method = "dcov" computes the distance covariance test of independence. index is an exponent on the Euclidean distances. Valid choices for index are in (0,2], with default value 1 (Euclidean distance). The arguments are passed to the dcov.test function. See the help topic dcov.test for the description and documentation, and also see the references below.
indep.test with method = "mvI" computes the independence coefficient $I_n$ and performs a nonparametric E-test of independence. The arguments are passed to mvI.test. The index argument is ignored (index = 1 is applied). See the help topic mvI.test and also the reference (2006) below for details.
The test decision is obtained via bootstrap, with R replicates. The sample sizes (number of rows) of the two samples must agree, and samples must not contain missing values.
These energy tests of independence are based on related theoretical results, but different test statistics. The dcov method is faster than the mvI method by approximately a factor of O(n).
indep.test returns a list with class htest containing

method | description of test |
statistic | observed value of the test statistic |
estimate | |
estimates | a vector: [dCov(x,y), dCor(x,y), dVar(x), dVar(y)] (method dcov) |
replicates | replicates of the test statistic |
p.value | approximate p-value of the test |
data.name | description of data |
As of energy-1.1-0, indep.etest is deprecated and replaced by indep.test, which has methods for two different energy tests of independence. indep.test applies the distance covariance test (see dcov.test) by default (method = "dcov"). The original indep.etest applied the independence coefficient $I_n$, which is now obtained by method = "mvI".
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G.J. and Rizzo, M.L. (2009),
Brownian Distance Covariance,
Annals of Applied Statistics, Vol. 3 No. 4, pp.
1236-1265. (Also see discussion and rejoinder.)
doi:10.1214/09-AOAS312
Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
Measuring and Testing Dependence by Correlation of Distances,
Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
doi:10.1214/009053607000000505
Bakirov, N.K., Rizzo, M.L., and Szekely, G.J. (2006), A Multivariate Nonparametric Test of Independence, Journal of Multivariate Analysis, Vol. 97, No. 8, 1742-1756.
doi:10.1016/j.jmva.2005.10.005
## independent multivariate data
x <- matrix(rnorm(60), nrow=20, ncol=3)
y <- matrix(rnorm(40), nrow=20, ncol=2)
indep.test(x, y, method = "dcov", R = 99)
indep.test(x, y, method = "mvI", R = 99)

## dependent multivariate data
if (require(MASS)) {
  Sigma <- matrix(c(1, .1, 0, 0, 1, 0, 0, .1, 1), 3, 3)
  x <- mvrnorm(30, c(0, 0, 0), diag(3))
  y <- mvrnorm(30, c(0, 0, 0), Sigma) * x
  indep.test(x, y, R = 99)  # dcov method
  indep.test(x, y, method = "mvI", R = 99)
}
Perform k-groups clustering by energy distance.
kgroups(x, k, iter.max = 10, nstart = 1, cluster = NULL)
x | data frame or data matrix or distance object |
k | number of clusters |
iter.max | maximum number of iterations |
nstart | number of restarts |
cluster | initial clustering vector |
K-groups is based on the multisample energy distance for comparing distributions. Based on the disco decomposition of total dispersion (a Gini-type mean distance), the objective is to either maximize the total between-cluster energy distance or, equivalently, minimize the total within-cluster energy distance. It is more computationally efficient to minimize within-cluster distances, which makes it possible to use a modified version of the Hartigan-Wong algorithm (1979) to implement K-groups clustering.
The within-cluster Gini mean distance of cluster $C_j$ with $n_j$ observations is
$$G(C_j) = \frac{1}{n_j^2} \sum_{i,m \in C_j} \|x_i - x_m\|,$$
and the K-groups within-cluster distance is
$$W_j = \frac{n_j}{2}\, G(C_j) = \frac{1}{2 n_j} \sum_{i,m \in C_j} \|x_i - x_m\|.$$
If z is the data matrix for cluster $C_j$, then $W_j$ could be computed as sum(dist(z)) / nrow(z).
If cluster is not NULL, the clusters are initialized by this vector (can be a factor or integer vector). Otherwise clusters are initialized with random labels in k approximately equal size clusters.
If x is not a distance object (class(x) == "dist"), then x is converted to a data matrix for analysis.
kgroups runs up to iter.max complete passes through the data set until a local minimum is reached. If nstart > 1, clusters are initialized at random on second and later starts, and the best result is returned.
An object of class kgroups containing the components

call | the function call |
cluster | vector of cluster indices |
sizes | cluster sizes |
within | vector of Gini within cluster distances |
W | sum of within cluster distances |
count | number of moves |
iterations | number of iterations |
k | number of clusters |
cluster is a vector containing the group labels, 1 to k. print.kgroups prints some of the components of the kgroups object.
Expect that count is 0 if the algorithm converged to a local min (that is, 0 moves happened on the last iteration). If iterations equals iter.max and count is positive, then the algorithm did not converge to a local min.
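A small sketch of the within-cluster formula above, assuming the iris data; the hand-computed values should match the within and W components:

x <- as.matrix(iris[ , 1:4])
set.seed(123)
kg <- kgroups(x, k = 3)
within <- sapply(1:3, function(j) {
  z <- x[kg$cluster == j, , drop = FALSE]
  sum(dist(z)) / nrow(z)   # W_j = sum(dist(z)) / n_j
})
within        # compare with kg$within
sum(within)   # compare with kg$W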
Maria Rizzo and Songzi Li
Li, Songzi (2015). "K-groups: A Generalization of K-means by Energy Distance." Ph.D. thesis, Bowling Green State University.
Li, S. and Rizzo, M. L. (2017). "K-groups: A Generalization of K-means Clustering". ArXiv e-print 1711.04359. https://arxiv.org/abs/1711.04359
Szekely, G. J., and M. L. Rizzo. "Testing for equal distributions in high dimension." InterStat 5, no. 16.10 (2004).
Rizzo, M. L., and G. J. Szekely. "Disco analysis: A nonparametric extension of analysis of variance." The Annals of Applied Statistics (2010): 1034-1055.
Hartigan, J. A. and Wong, M. A. (1979). "Algorithm AS 136: A K-means clustering algorithm." Applied Statistics, 28, 100-108. doi: 10.2307/2346830.
x <- as.matrix(iris[ ,1:4])
set.seed(123)
kg <- kgroups(x, k = 3, iter.max = 5, nstart = 2)
kg
fitted(kg)

d <- dist(x)
set.seed(123)
kg <- kgroups(d, k = 3, iter.max = 5, nstart = 2)
kg
kg$cluster
fitted(kg)
fitted(kg, method = "groups")
The test statistic is the sum of d-1 bias-corrected squared dcor statistics, where d is the number of variables. Implementation is by permutation test.
mutualIndep.test(x, R)
x | data matrix or data frame |
R | number of permutation replicates |
A population coefficient for mutual independence of d random variables, $X_1, \ldots, X_d$, is
$$\sum_{k=1}^{d-1} \mathcal{R}^2(X_k, [X_{k+1}, \ldots, X_d]),$$
which is non-negative and equals zero iff mutual independence holds. For example, if d=4 the population coefficient is
$$\mathcal{R}^2(X_1, [X_2, X_3, X_4]) + \mathcal{R}^2(X_2, [X_3, X_4]) + \mathcal{R}^2(X_3, X_4).$$
A permutation test is implemented based on the corresponding sample coefficient. To test mutual independence of $X_1, \ldots, X_d$, the test statistic is the sum of the d-1 bias-corrected statistics $\mathcal{R}_n^*(X_k, [X_{k+1}, \ldots, X_d])$, $k = 1, \ldots, d-1$.
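A minimal sketch of the sample statistic above, summing the d-1 bias-corrected statistics with bcdcor; the total should match the statistic reported by mutualIndep.test:

set.seed(1)
x <- matrix(rnorm(100), nrow = 20, ncol = 5)
d <- ncol(x)
## sum of bcdcor(X_k, [X_{k+1}, ..., X_d]) over k = 1, ..., d-1
sum(sapply(1:(d - 1), function(k) bcdcor(x[ , k], x[ , (k + 1):d])))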
mutualIndep.test returns an object of class power.htest. See Szekely and Rizzo (2014) for details on the unbiased and bias-corrected (bcdcor) statistics.
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007),
Measuring and Testing Dependence by Correlation of Distances,
Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
doi:10.1214/009053607000000505
Szekely, G.J. and Rizzo, M.L. (2014), Partial Distance Correlation with Methods for Dissimilarities. Annals of Statistics, Vol. 42 No. 6, 2382-2412.
x <- matrix(rnorm(100), nrow=20, ncol=5)
mutualIndep.test(x, 199)
Computes a type of multivariate nonparametric E-statistic and test of independence based on the independence coefficient $I_n$. This coefficient pre-dates and is different from distance covariance and distance correlation.
mvI.test(x, y, R)
mvI(x, y)
x | matrix: first sample, observations in rows |
y | matrix: second sample, observations in rows |
R | number of replicates |
mvI computes the coefficient $I_n$ and mvI.test performs a nonparametric test of independence. The test decision is obtained via permutation bootstrap, with R replicates.
The sample sizes (number of rows) of the two samples must agree, and samples must not contain missing values.
Historically this is the first energy test of independence. The distance covariance test dcov.test, distance correlation dcor, and related methods are more recent (2007, 2009). The distance covariance test dcov.test and distance correlation test dcor.test are much faster and have different properties than mvI.test. All are based on a population independence coefficient that characterizes independence, and all of these tests are statistically consistent. However, dCor is scale invariant while $I_n$ is not. In applications, dcor.test or dcov.test are the recommended tests.
The computing formula is given in Bakirov, Rizzo, and Szekely (2006), equation (2). Suppose the two samples are $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$; the independence coefficient $I_n$ is defined there in terms of the interpoint distances of the samples.

Some properties:

$0 \le I_n \le 1$ (Theorem 1).

Large values of $I_n$ (or of $n I_n^2$) support the alternative hypothesis that the sampled random variables are dependent.

$I_n$ is invariant to shifts and orthogonal transformations of X and Y.

$I_n$ determines a statistically consistent test of independence against all fixed dependent alternatives (Corollary 1).

The population independence coefficient $I$ is a normalized distance between the joint characteristic function and the product of the marginal characteristic functions. $I_n$ converges almost surely to $I$ as $n \to \infty$. X and Y are independent if and only if $I = 0$.

See the 2006 reference below for more details.
mvI returns the statistic $I_n$. mvI.test returns a list with class htest containing

method | description of test |
statistic | observed value of the test statistic |
estimate | |
replicates | permutation replicates |
p.value | p-value of the test |
data.name | description of data |
On scale invariance: distance correlation (dcor) has the property that if we change the scale of X from e.g. meters to kilometers, and the scale of Y from e.g. grams to ounces, the statistic and the test are not changed. $I_n$ does not have this property; it is invariant only under a common rescaling of X and Y by the same constant. Thus, if the units of measurement change for either or both variables, dCor is invariant, but $I_n$ and possibly the mvI.test decision change.
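A small sketch of this scale-invariance distinction (assuming the energy package is attached):

set.seed(1)
x <- matrix(rnorm(60), 30, 2)
y <- matrix(rnorm(60), 30, 2)
all.equal(dcor(x, y), dcor(1000 * x, y))  # dCor is scale invariant
mvI(x, y)
mvI(1000 * x, y)                          # generally differs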
Maria L. Rizzo [email protected] and Gabor J. Szekely
Bakirov, N.K., Rizzo, M.L., and Szekely, G.J. (2006), A Multivariate Nonparametric Test of Independence, Journal of Multivariate Analysis, Vol. 97, No. 8, 1742-1756.
Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007), Measuring and Testing Dependence by Correlation of Distances, Annals of Statistics, Vol. 35 No. 6, pp. 2769-2794.
Szekely, G.J. and Rizzo, M.L. (2009), Brownian Distance Covariance, Annals of Applied Statistics, Vol. 3, No. 4, 1236-1265.
dcov.test, dcov, dcor.test, dcor, dcov2d, dcor2d
mvI(iris[1:25, 1], iris[1:25, 2])
mvI.test(iris[1:25, 1], iris[1:25, 2], R=99)
Performs the E-statistic (energy) test of multivariate or univariate normality.
mvnorm.test(x, R)
mvnorm.etest(x, R)
mvnorm.e(x)
x | data matrix of multivariate sample, or univariate data vector |
R | number of bootstrap replicates |
If x is a matrix, each row is a multivariate observation. The data will be standardized to zero mean and identity covariance matrix using the sample mean vector and sample covariance matrix. If x is a vector, mvnorm.e returns the univariate statistic normal.e(x).
If the data contain missing values or the sample covariance matrix is singular, mvnorm.e returns NA.
The E-test of multivariate normality was proposed and implemented by Szekely and Rizzo (2005). The test statistic for d-variate normality is given by
$$\mathcal{E} = n\left(\frac{2}{n}\sum_{i=1}^{n} E\|y_i - Z\| - E\|Z - Z'\| - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|y_i - y_j\|\right),$$
where $y_1, \ldots, y_n$ is the standardized sample, $Z, Z'$ are iid standard d-variate normal, and $\|\cdot\|$ denotes Euclidean norm.
The E-test of multivariate (univariate) normality is implemented by parametric bootstrap with R replicates.
The value of the E-statistic for multivariate normality is returned by mvnorm.e. mvnorm.test returns a list with class htest containing
method | description of test |
statistic | observed value of the test statistic |
p.value | approximate p-value of the test |
data.name | description of data |
mvnorm.etest is replaced by mvnorm.test.

If the data is univariate, the test statistic is formally the same as in the multivariate case, but a more efficient computational formula is applied in normal.e. normal.test also provides an optional method for the test based on the asymptotic sampling distribution of the test statistic.
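As an illustration of the univariate statistic, here is a hand computation sketch using the closed forms E|a - Z| = 2a*pnorm(a) + 2*dnorm(a) - a and E|Z - Z'| = 2/sqrt(pi); the standardization convention (sd with denominator n-1) is an assumption here, so compare the result with normal.e:

x <- iris[1:50, 1]
n <- length(x)
y <- (x - mean(x)) / sd(x)                 # assumed standardization
EyZ <- 2 * y * pnorm(y) + 2 * dnorm(y) - y # E|y_i - Z|, Z standard normal
EZZ <- 2 / sqrt(pi)                        # E|Z - Z'|
Eyy <- mean(abs(outer(y, y, "-")))         # (1/n^2) sum |y_i - y_j|
n * (2 * mean(EyZ) - EZZ - Eyy)            # compare with normal.e(x)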
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G. J. and Rizzo, M. L. (2005) A New Test for Multivariate Normality, Journal of Multivariate Analysis, 93/1, 58-80, doi:10.1016/j.jmva.2003.12.002.
Mori, T. F., Szekely, G. J. and Rizzo, M. L. "On energy tests of normality." Journal of Statistical Planning and Inference 213 (2021): 1-15.
Rizzo, M. L. (2002). A New Rotation Invariant Goodness-of-Fit Test, Ph.D. dissertation, Bowling Green State University.
Szekely, G. J. (1989) Potential and Kinetic Energy in Statistics, Lecture Notes, Budapest Institute of Technology (Technical University).
normal.test for the energy test of univariate normality and normal.e for the statistic.
## compute normality test statistic for iris Setosa data
data(iris)
mvnorm.e(iris[1:50, 1:4])

## test if the iris Setosa data has multivariate normal distribution
mvnorm.test(iris[1:50, 1:4], R = 199)
Performs the energy test of univariate normality for the composite hypothesis Case 4, estimated parameters.
normal.test(x, method=c("mc","limit"), R)
normal.e(x)
x | univariate data vector |
method | method for p-value |
R | number of replications if Monte Carlo method |
If method="mc"
this test function applies the parametric
bootstrap method implemented in mvnorm.test
.
If method="limit"
, the p-value of the test is computed from
the asymptotic distribution of the test statistic under the null
hypothesis. The asymptotic
distribution is a quadratic form of centered Gaussian random variables,
which has the form
where are positive constants (eigenvalues) and
are iid standard normal variables. Eigenvalues are
pre-computed and stored internally.
A p-value is computed using Imhof's method as implemented in the
CompQuadForm package.
Note that the "limit" method is intended for moderately large samples because it applies the asymptotic distribution.
The energy test of normality was proposed and implemented by Szekely and Rizzo (2005). See mvnorm.test for more details.
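A small sketch comparing the two p-value methods on the same sample; for moderately large n the results should be similar:

set.seed(1)
x <- rnorm(100)
normal.test(x, method = "limit")$p.value
normal.test(x, method = "mc", R = 999)$p.value  # should be similar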
normal.e returns the energy goodness-of-fit statistic for a univariate sample. normal.test returns a list with class htest containing
statistic | observed value of the test statistic |
p.value | p-value of the test |
estimate | sample estimates: mean, sd |
data.name | description of data |
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G. J. and Rizzo, M. L. (2005) A New Test for Multivariate Normality, Journal of Multivariate Analysis, 93/1, 58-80, doi:10.1016/j.jmva.2003.12.002.
Mori, T. F., Szekely, G. J. and Rizzo, M. L. "On energy tests of normality." Journal of Statistical Planning and Inference 213 (2021): 1-15.
Rizzo, M. L. (2002). A New Rotation Invariant Goodness-of-Fit Test, Ph.D. dissertation, Bowling Green State University.
J. P. Imhof (1961). Computing the Distribution of Quadratic Forms in Normal Variables, Biometrika, Volume 48, Issue 3/4, 419-426.
See mvnorm.test and mvnorm.e for the energy test of multivariate normality and the test statistic for multivariate samples.
x <- iris[1:50, 1]
normal.e(x)
normal.test(x, R = 199)
normal.test(x, method = "limit")
Partial distance correlation (pdcor), partial distance covariance (pdcov), and tests.
pdcov.test(x, y, z, R)
pdcor.test(x, y, z, R)
pdcor(x, y, z)
pdcov(x, y, z)
x |
data or dist object of first sample |
y |
data or dist object of second sample |
z |
data or dist object of third sample |
R |
replicates for permutation test |
pdcor(x, y, z) and pdcov(x, y, z) compute the partial distance correlation and partial distance covariance, respectively, of x and y removing z.

A test for zero partial distance correlation (or zero partial distance covariance) is implemented in pdcor.test and pdcov.test.
Argument types supported are numeric data matrix, data.frame, tibble, numeric vector, class "dist" object, or factor. For unordered factors a 0-1 distance matrix is computed.
Each test returns an object of class htest.
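Under the definition in Szekely and Rizzo (2014), the sample pdcor can be reproduced from the exported helpers U_center and U_product by projecting the U-centered distance matrices of x and y onto the orthogonal complement of that of z. The following is a consistency sketch (assuming the energy package is attached), not the package's internal code path:

n <- 30
x <- rnorm(n); y <- rnorm(n); z <- rnorm(n)
U <- U_center(as.matrix(dist(x)))  # U-centered distance matrix of x
V <- U_center(as.matrix(dist(y)))
W <- U_center(as.matrix(dist(z)))
## project U and V onto the orthogonal complement of W
PU <- U - (U_product(U, W) / U_product(W, W)) * W
PV <- V - (U_product(V, W) / U_product(W, W)) * W
## cosine of the angle between the projections
U_product(PU, PV) / sqrt(U_product(PU, PU) * U_product(PV, PV))
pdcor(x, y, z)  # should agree up to numerical tolerance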
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G.J. and Rizzo, M.L. (2014), Partial Distance Correlation with Methods for Dissimilarities. Annals of Statistics, Vol. 42 No. 6, 2382-2412.
n <- 30
R <- 199
## mutually independent standard normal vectors
x <- rnorm(n)
y <- rnorm(n)
z <- rnorm(n)
pdcor(x, y, z)
pdcov(x, y, z)
set.seed(1)
pdcov.test(x, y, z, R = R)
set.seed(1)
pdcor.test(x, y, z, R = R)

if (require(MASS)) {
  p <- 4
  mu <- rep(0, p)
  Sigma <- diag(p)
  ## linear dependence
  y <- mvrnorm(n, mu, Sigma) + x
  print(pdcov.test(x, y, z, R = R))
  ## non-linear dependence
  y <- mvrnorm(n, mu, Sigma) * x
  print(pdcov.test(x, y, z, R = R))
}
Performs the mean distance goodness-of-fit test and the energy goodness-of-fit test of the Poisson distribution with unknown parameter.
poisson.e(x)
poisson.m(x)
poisson.etest(x, R)
poisson.mtest(x, R)
poisson.tests(x, R, test = "all")
x |
vector of nonnegative integers, the sample data |
R |
number of bootstrap replicates |
test |
name of test(s) |
Two distance-based tests of Poissonity are applied in poisson.tests, "M" and "E". The default is to do all tests and return the results in a data frame. Valid choices for test are "M", "E", or "all", with default "all". If test="all", all tests are performed by a single parametric bootstrap, computing all test statistics on each sample. The "M" choice runs two tests, one based on a Cramer-von Mises distance and the other on an Anderson-Darling distance. The "E" choice is the energy goodness-of-fit test.
R must be a positive integer for a test. If R is missing or 0, a warning is printed but test statistics are computed (without testing).
The mean distance test of Poissonity (M-test) is based on the result that the sequence of expected values E|X-j|, j=0,1,2,... characterizes the distribution of the random variable X. As an application of this characterization one can get an estimator \hat{F}(j) of the CDF. The test statistic (see poisson.m) is a Cramer-von Mises type of distance, with these M-estimates replacing the usual EDF estimates of the CDF:

M_n = n \sum_{j=0}^{\infty} \left( \hat{F}(j) - F(j) \right)^2 f(j),

where F(j) and f(j) are the CDF and pmf of the fitted Poisson distribution.
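To see the characterization at work, here is a minimal base-R sketch (for illustration only; poisson.m does not use this code). For integer-valued X, E|X - j| - E|X - (j-1)| = 2F(j-1) - 1, so sample means of |x_i - j| yield M-estimates of the CDF, which can be compared with the fitted Poisson CDF:

x <- rpois(200, 2)
jmax <- max(x) + 1
m <- sapply(0:jmax, function(j) mean(abs(x - j)))  # estimates of E|X - j|
Fhat <- (diff(m) + 1) / 2  # M-estimates of F(0), ..., F(jmax - 1)
round(rbind(Fhat = Fhat, Fpois = ppois(0:(jmax - 1), lambda = mean(x))), 3)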
In poisson.tests, an Anderson-Darling type of weight is also applied when test="M" or test="all".
The tests are implemented by parametric bootstrap with R replicates.
An energy goodness-of-fit test (E) is based on the test statistic

Q_n = n \left( \frac{2}{n} \sum_{i=1}^{n} E|x_i - X| - E|X - X'| - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j| \right),

where X and X' are iid with the hypothesized null distribution. For a test of H: X ~ Poisson(\lambda), we can express E|X-X'| in terms of Bessel functions, and E|x_i - X| in terms of the CDF of Poisson(\lambda).
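For intuition, the statistic can be approximated by simulating from the fitted null distribution (a Monte Carlo sketch only; poisson.e evaluates the expectations in closed form):

x <- rpois(50, 2)
n <- length(x)
lam <- mean(x)  # estimated Poisson mean
X1 <- rpois(1e5, lam)  # draws from the fitted null
X2 <- rpois(1e5, lam)  # independent draws for E|X - X'|
ExiX <- sapply(x, function(xi) mean(abs(xi - X1)))  # approx. E|x_i - X|
EXX <- mean(abs(X1 - X2))  # approx. E|X - X'|
Dxx <- mean(abs(outer(x, x, "-")))  # (1/n^2) sum |x_i - x_j|
n * (2 * mean(ExiX) - EXX - Dxx)  # compare with poisson.e(x)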
If test=="all" or not specified, all tests are run with a single parametric bootstrap. poisson.mtest
implements only the Poisson M-test with Cramer-von Mises type distance. poisson.etest
implements only the Poisson energy test.
The functions poisson.m and poisson.e return the test statistics. The functions poisson.mtest and poisson.etest each return an htest object containing
method |
Description of test |
statistic |
observed value of the test statistic |
p.value |
approximate p-value of the test |
data.name |
description of data and the number of replicates R |
estimate |
sample mean |
poisson.tests returns "M-CvM test", "M-AD test" and "Energy test" results in a data frame with columns
estimate |
sample mean |
statistic |
observed value of the test statistic |
p.value |
approximate p-value of the test |
method |
Description of test |
which can be coerced to a tibble.
The M-test runs much faster than the E-test.
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G. J. and Rizzo, M. L. (2004) Mean Distance Test of Poisson Distribution, Statistics and Probability Letters, 67/3, 241-247. doi:10.1016/j.spl.2004.01.005.
Szekely, G. J. and Rizzo, M. L. (2005) A New Test for Multivariate Normality, Journal of Multivariate Analysis, 93/1, 58-80, doi:10.1016/j.jmva.2003.12.002.
x <- rpois(50, 2)
poisson.m(x)
poisson.e(x)
poisson.etest(x, R = 199)
poisson.mtest(x, R = 199)
poisson.tests(x, R = 199)
A utility that returns a list with components equivalent to sort(x), order(x), and rank(x, ties.method = "first").
sortrank(x)
x |
vector compatible with sort(x) |
This utility exists to save a little time on large vectors when two or all three of the sort(), order(), rank() results are required. In case of ties, the ranks component matches rank(x, ties.method = "first").
A list with components
x |
the sorted input vector x |
ix |
the permutation = order(x) which rearranges x into ascending order |
r |
the ranks of x |
This function was benchmarked faster than the combined calls to sort and rank.
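A quick consistency check of the documented equivalence (assumes the energy package is attached):

x <- c(3, 1, 2, 1)  # includes a tie
s <- sortrank(x)
all.equal(s$x, sort(x))
all.equal(s$ix, order(x))
all.equal(as.integer(s$r), rank(x, ties.method = "first"))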
Maria L. Rizzo [email protected]
See sort.
sortrank(rnorm(5))
Stand-alone function to compute the inner product in the Hilbert space of U-centered distance matrices, as in the definition of partial distance covariance.
U_product(U, V)
U |
U-centered distance matrix |
V |
U-centered distance matrix |
Note that pdcor, etc. functions include the centering and projection operations, so these stand-alone versions are not needed except in case one wants to check the internal computations.
Exported from U_product.cpp.
U_product returns the inner product, a scalar.
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G.J. and Rizzo, M.L. (2014), Partial Distance Correlation with Methods for Dissimilarities, Annals of Statistics, Vol. 42, No. 6, pp. 2382-2412.
x <- iris[1:10, 1:4]
y <- iris[11:20, 1:4]
M1 <- as.matrix(dist(x))
M2 <- as.matrix(dist(y))
U <- U_center(M1)
V <- U_center(M2)
U_product(U, V)
dcovU_stats(M1, M2)
These functions compute unbiased estimators of squared distance covariance and a bias-corrected estimator of (squared) distance correlation.
bcdcor(x, y)
dcovU(x, y)
x |
data or dist object of first sample |
y |
data or dist object of second sample |
The unbiased (squared) dcov is the inner product definition of dCov in the Hilbert space of U-centered distance matrices.
The sample sizes (number of rows) of the two samples must agree, and samples must not contain missing values.
Argument types supported are numeric data matrix, data.frame, or tibble, with observations in rows; numeric vector; ordered or unordered factors. In case of unordered factors a 0-1 distance matrix is computed.
dcovU returns the unbiased estimator of squared dcov. bcdcor returns a bias-corrected estimator of squared dcor.
Unbiased distance covariance (SR2014) corresponds to the biased (original) dCov^2. Since dcovU is an unbiased statistic, it is signed and we do not take the square root. For the original distance covariance test of independence (SRB2007, SR2009), the distance covariance test statistic is the V-statistic n dCov^2 = n V_n^2 (not dCov). Similarly, bcdcor is bias-corrected, so we do not take the square root as with dCor.
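Because dcovU is defined as this inner product, it can be cross-checked against the exported helpers U_center and U_product (a consistency sketch, assuming the energy package is attached):

x <- as.matrix(iris[1:50, 1:4])
y <- as.matrix(iris[51:100, 1:4])
U <- U_center(as.matrix(dist(x)))
V <- U_center(as.matrix(dist(y)))
all.equal(U_product(U, V), dcovU(x, y))  # should agree up to tolerance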
Maria L. Rizzo [email protected] and Gabor J. Szekely
Szekely, G.J. and Rizzo, M.L. (2014), Partial Distance Correlation with Methods for Dissimilarities. Annals of Statistics, Vol. 42 No. 6, 2382-2412.
Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007), Measuring and Testing Dependence by Correlation of Distances, Annals of Statistics, Vol. 35, No. 6, pp. 2769-2794. doi:10.1214/009053607000000505.
Szekely, G.J. and Rizzo, M.L. (2009), Brownian Distance Covariance, Annals of Applied Statistics, Vol. 3, No. 4, pp. 1236-1265. doi:10.1214/09-AOAS312.
x <- iris[1:50, 1:4]
y <- iris[51:100, 1:4]
dcovU(x, y)
bcdcor(x, y)