Title: | Scalable Robust Estimators with High Breakdown Point |
---|---|
Description: | Robust Location and Scatter Estimation and Robust Multivariate Analysis with High Breakdown Point: principal component analysis (Filzmoser and Todorov (2013), <doi:10.1016/j.ins.2012.10.017>), linear and quadratic discriminant analysis (Todorov and Pires (2007)), multivariate tests (Todorov and Filzmoser (2010) <doi:10.1016/j.csda.2009.08.015>), outlier detection (Todorov et al. (2010) <doi:10.1007/s11634-010-0075-2>). See also Todorov and Filzmoser (2009) <urn:isbn:978-3838108148>, Todorov and Filzmoser (2010) <doi:10.18637/jss.v032.i03> and Boudt et al. (2019) <doi:10.1007/s11222-019-09869-x>. |
Authors: | Valentin Todorov [aut, cre] |
Maintainer: | Valentin Todorov <[email protected]> |
License: | GPL (>=3) |
Version: | 1.7-6 |
Built: | 2024-11-16 05:20:30 UTC |
Source: | https://github.com/valentint/rrcov |
The data on annual maximum streamflow at 104 gaging stations in the central Appalachia region of the United States contains the sample L-moments ratios (L-CV, L-skewness and L-kurtosis) as used by Hosking and Wallis (1997) to illustrate regional freqency analysis (RFA).
data(Appalachia)
data(Appalachia)
A data frame with 104 observations on the following 3 variables:
L-CV
L-coefficient of variation
L-skewness
L-coefficient of skewness
L-kurtosis
L-coefficient of kurtosis
The sample L-moment ratios (L-CV, L-skewness and L-kurtosis) of a site are regarded as a point in three dimensional space.
Hosking, J. R. M. and J. R. Wallis (1997), Regional Frequency Analysis: An Approach Based on L-moments. Cambridge University Press, p.175–185
Neykov, N.M., Neytchev, P.N., Van Gelder, P.H.A.J.M. and Todorov V. (2007), Robust detection of discordant sites in regional frequency analysis, Water Resources Research, 43, W06417, doi:10.1029/2006WR005322
data(Appalachia) # plot a matrix of scatterplots pairs(Appalachia, main="Appalachia data set", pch=21, bg=c("red", "green3", "blue")) mcd<-CovMcd(Appalachia) mcd plot(mcd, which="dist", class=TRUE) plot(mcd, which="dd", class=TRUE) ## identify the discordant sites using robust distances and compare ## to the classical ones mcd <- CovMcd(Appalachia) rd <- sqrt(getDistance(mcd)) ccov <- CovClassic(Appalachia) cd <- sqrt(getDistance(ccov)) r.out <- which(rd > sqrt(qchisq(0.975,3))) c.out <- which(cd > sqrt(qchisq(0.975,3))) cat("Robust: ", length(r.out), " outliers: ", r.out,"\n") cat("Classical: ", length(c.out), " outliers: ", c.out,"\n")
data(Appalachia) # plot a matrix of scatterplots pairs(Appalachia, main="Appalachia data set", pch=21, bg=c("red", "green3", "blue")) mcd<-CovMcd(Appalachia) mcd plot(mcd, which="dist", class=TRUE) plot(mcd, which="dd", class=TRUE) ## identify the discordant sites using robust distances and compare ## to the classical ones mcd <- CovMcd(Appalachia) rd <- sqrt(getDistance(mcd)) ccov <- CovClassic(Appalachia) cd <- sqrt(getDistance(ccov)) r.out <- which(rd > sqrt(qchisq(0.975,3))) c.out <- which(cd > sqrt(qchisq(0.975,3))) cat("Robust: ", length(r.out), " outliers: ", r.out,"\n") cat("Classical: ", length(c.out), " outliers: ", c.out,"\n")
Produces a biplot from an object (derived from) Pca-class
.
## S4 method for signature 'Pca' biplot(x, choices=1L:2L, scale=1, ...)
## S4 method for signature 'Pca' biplot(x, choices=1L:2L, scale=1, ...)
x |
an object of class (derived from) |
choices |
length 2 vector specifying the components to plot. Only the default is a biplot in the strict sense. |
scale |
The variables are scaled by |
... |
optional arguments to be passed to the internal graphical functions. |
a plot is produced on the current graphics device.
signature(x = Pca)
: Plot a biplot, i.e. represent both
the observations and variables of a matrix of multivariate data on the same
plot. See also biplot.princomp
.
Gabriel, K. R. (1971). The biplot graphical display of matrices with applications to principal component analysis. Biometrika, 58, 453–467.
Pca-class
,
PcaClassic
,
PcaRobust-class
.
require(graphics) biplot(PcaClassic(USArrests, k=2))
require(graphics) biplot(PcaClassic(USArrests, k=2))
The data set bus (Hettich and Bay, 1999) corresponds to a study in automatic vehicle recognition (see Maronna et al. 2006, page 213, Example 6.3)). This data set from the Turing Institute, Glasgow, Scotland, contains measures of shape features extracted from vehicle silhouettes. The images were acquired by a camera looking downward at the model vehicle from a fixed angle of elevation. Each of the 218 rows corresponds to a view of a bus silhouette, and contains 18 attributes of the image.
data(bus)
data(bus)
A data frame with 218 observations on the following 18 variables:
V1
compactness
V2
circularity
V3
distance circularity
V4
radius ratio
V5
principal axis aspect ratio
V6
maximum length aspect ratio
V7
scatter ratio
V8
elongatedness
V9
principal axis rectangularity
V10
maximum length rectangularity
V11
scaled variance along major axis
V12
scaled variance along minor axis
V13
scaled radius of gyration
V14
skewness about major axis
V15
skewness about minor axis
V16
kurtosis about minor axis
V17
kurtosis about major axis
V18
hollows ratio
Hettich, S. and Bay, S.D. (1999), The UCI KDD Archive, Irvine, CA:University of California, Department of Information and Computer Science, 'http://kdd.ics.uci.edu'
Maronna, R., Martin, D. and Yohai, V., (2006). Robust Statistics: Theory and Methods. Wiley, New York.
## Reproduce Table 6.3 from Maronna et al. (2006), page 213 data(bus) bus <- as.matrix(bus) ## calculate MADN for each variable xmad <- apply(bus, 2, mad) cat("\nMin, Max of MADN: ", min(xmad), max(xmad), "\n") ## MADN vary between 0 (for variable 9) and 34. Therefore exclude ## variable 9 and divide the remaining variables by their MADNs. bus1 <- bus[, -9] madbus <- apply(bus1, 2, mad) bus2 <- sweep(bus1, 2, madbus, "/", check.margin = FALSE) ## Compute classical and robust PCA (Spherical/Locantore, Hubert, MCD and OGK) pca <- PcaClassic(bus2) rpca <- PcaLocantore(bus2) pcaHubert <- PcaHubert(bus2, k=17, kmax=17, mcd=FALSE) pcamcd <- PcaCov(bus2, cov.control=CovControlMcd()) pcaogk <- PcaCov(bus2, cov.control=CovControlOgk()) ev <- getEigenvalues(pca) evrob <- getEigenvalues(rpca) evhub <- getEigenvalues(pcaHubert) evmcd <- getEigenvalues(pcamcd) evogk <- getEigenvalues(pcaogk) uvar <- matrix(nrow=6, ncol=6) svar <- sum(ev) svarrob <- sum(evrob) svarhub <- sum(evhub) svarmcd <- sum(evmcd) svarogk <- sum(evogk) for(i in 1:6){ uvar[i,1] <- i uvar[i,2] <- round((svar - sum(ev[1:i]))/svar, 3) uvar[i,3] <- round((svarrob - sum(evrob[1:i]))/svarrob, 3) uvar[i,4] <- round((svarhub - sum(evhub[1:i]))/svarhub, 3) uvar[i,5] <- round((svarmcd - sum(evmcd[1:i]))/svarmcd, 3) uvar[i,6] <- round((svarogk - sum(evogk[1:i]))/svarogk, 3) } uvar <- as.data.frame(uvar) names(uvar) <- c("q", "Classical","Spherical", "Hubert", "MCD", "OGK") cat("\nBus data: proportion of unexplained variability for q components\n") print(uvar) ## Reproduce Table 6.4 from Maronna et al. (2006), page 214 ## ## Compute classical and robust PCA extracting only the first 3 components ## and take the squared orthogonal distances to the 3-dimensional hyperplane ## pca3 <- PcaClassic(bus2, k=3) # classical rpca3 <- PcaLocantore(bus2, k=3) # spherical (Locantore, 1999) hpca3 <- PcaHubert(bus2, k=3) # Hubert dist <- pca3@od^2 rdist <- rpca3@od^2 hdist <- hpca3@od^2 ## calculate the quantiles of the distances to the 3-dimensional hyperplane qclass <- round(quantile(dist, probs = seq(0, 1, 0.1)[-c(1,11)]), 1) qspc <- round(quantile(rdist, probs = seq(0, 1, 0.1)[-c(1,11)]), 1) qhubert <- round(quantile(hdist, probs = seq(0, 1, 0.1)[-c(1,11)]), 1) qq <- cbind(rbind(qclass, qspc, qhubert), round(c(max(dist), max(rdist), max(hdist)), 0)) colnames(qq)[10] <- "Max" rownames(qq) <- c("Classical", "Spherical", "Hubert") cat("\nBus data: quantiles of distances to hiperplane\n") print(qq) ## ## Reproduce Fig 6.1 from Maronna et al. (2006), page 214 ## cat("\nBus data: Q-Q plot of logs of distances to hyperplane (k=3) \nfrom classical and robust estimates. The line is the identity diagonal\n") plot(sort(log(dist)), sort(log(rdist)), xlab="classical", ylab="robust") lines(sort(log(dist)), sort(log(dist)))
## Reproduce Table 6.3 from Maronna et al. (2006), page 213 data(bus) bus <- as.matrix(bus) ## calculate MADN for each variable xmad <- apply(bus, 2, mad) cat("\nMin, Max of MADN: ", min(xmad), max(xmad), "\n") ## MADN vary between 0 (for variable 9) and 34. Therefore exclude ## variable 9 and divide the remaining variables by their MADNs. bus1 <- bus[, -9] madbus <- apply(bus1, 2, mad) bus2 <- sweep(bus1, 2, madbus, "/", check.margin = FALSE) ## Compute classical and robust PCA (Spherical/Locantore, Hubert, MCD and OGK) pca <- PcaClassic(bus2) rpca <- PcaLocantore(bus2) pcaHubert <- PcaHubert(bus2, k=17, kmax=17, mcd=FALSE) pcamcd <- PcaCov(bus2, cov.control=CovControlMcd()) pcaogk <- PcaCov(bus2, cov.control=CovControlOgk()) ev <- getEigenvalues(pca) evrob <- getEigenvalues(rpca) evhub <- getEigenvalues(pcaHubert) evmcd <- getEigenvalues(pcamcd) evogk <- getEigenvalues(pcaogk) uvar <- matrix(nrow=6, ncol=6) svar <- sum(ev) svarrob <- sum(evrob) svarhub <- sum(evhub) svarmcd <- sum(evmcd) svarogk <- sum(evogk) for(i in 1:6){ uvar[i,1] <- i uvar[i,2] <- round((svar - sum(ev[1:i]))/svar, 3) uvar[i,3] <- round((svarrob - sum(evrob[1:i]))/svarrob, 3) uvar[i,4] <- round((svarhub - sum(evhub[1:i]))/svarhub, 3) uvar[i,5] <- round((svarmcd - sum(evmcd[1:i]))/svarmcd, 3) uvar[i,6] <- round((svarogk - sum(evogk[1:i]))/svarogk, 3) } uvar <- as.data.frame(uvar) names(uvar) <- c("q", "Classical","Spherical", "Hubert", "MCD", "OGK") cat("\nBus data: proportion of unexplained variability for q components\n") print(uvar) ## Reproduce Table 6.4 from Maronna et al. (2006), page 214 ## ## Compute classical and robust PCA extracting only the first 3 components ## and take the squared orthogonal distances to the 3-dimensional hyperplane ## pca3 <- PcaClassic(bus2, k=3) # classical rpca3 <- PcaLocantore(bus2, k=3) # spherical (Locantore, 1999) hpca3 <- PcaHubert(bus2, k=3) # Hubert dist <- pca3@od^2 rdist <- rpca3@od^2 hdist <- hpca3@od^2 ## calculate the quantiles of the distances to the 3-dimensional hyperplane qclass <- round(quantile(dist, probs = seq(0, 1, 0.1)[-c(1,11)]), 1) qspc <- round(quantile(rdist, probs = seq(0, 1, 0.1)[-c(1,11)]), 1) qhubert <- round(quantile(hdist, probs = seq(0, 1, 0.1)[-c(1,11)]), 1) qq <- cbind(rbind(qclass, qspc, qhubert), round(c(max(dist), max(rdist), max(hdist)), 0)) colnames(qq)[10] <- "Max" rownames(qq) <- c("Classical", "Spherical", "Hubert") cat("\nBus data: quantiles of distances to hiperplane\n") print(qq) ## ## Reproduce Fig 6.1 from Maronna et al. (2006), page 214 ## cat("\nBus data: Q-Q plot of logs of distances to hyperplane (k=3) \nfrom classical and robust estimates. The line is the identity diagonal\n") plot(sort(log(dist)), sort(log(rdist)), xlab="classical", ylab="robust") lines(sort(log(dist)), sort(log(dist)))
This data set is based on the bushfire data set which was used by
Campbell (1984) to locate bushfire scars - see bushfire
in package robustbase
. The original dataset contains satelite
measurements on five frequency bands, corresponding to each of 38 pixels.
data(bushmiss)
data(bushmiss)
A data frame with 190 observations on 6 variables.
The original data set consists of 38 observations in 5 variables.
Based on it four new data sets are created in which some of the data
items are replaced by missing values with a simple "missing completely
at random " mechanism. For this purpose independent Bernoulli trials
are realized for each data item with a probability of success 0.1, 0.2, 0.3, 0.4,
where success means that the corresponding item is set to missing. The obtained five
data sets, including the original one (each with probability of a data
item to be missing equal to 0, 0.1, 0.2, 0.3 and 0.4 which is reflected
in the new variable MPROB
) are merged. (See also Beguin and Hulliger (2004).)
Maronna, R.A. and Yohai, V.J. (1995) The Behavoiur of the Stahel-Donoho Robust Multivariate Estimator. Journal of the American Statistical Association 90, 330–341.
Beguin, C. and Hulliger, B. (2004) Multivariate outlier detection in incomplete survey data: the epidemic algorithm and transformed rank correlations. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 127, 2, 275–294.
## The following code will result in exactly the same output ## as the one obtained from the original data set data(bushmiss) bf <- bushmiss[bushmiss$MPROB==0,1:5] plot(bf) covMcd(bf) ## Not run: ## This is the code with which the missing data were created: ## ## Creates a data set with missing values (for testing purposes) ## from a complete data set 'x'. The probability of ## each item being missing is 'pr' (Bernoulli trials). ## getmiss <- function(x, pr=0.1) { n <- nrow(x) p <- ncol(x) done <- FALSE iter <- 0 while(iter <= 50){ bt <- rbinom(n*p, 1, pr) btmat <- matrix(bt, nrow=n) btmiss <- ifelse(btmat==1, NA, 0) y <- x+btmiss if(length(which(rowSums(nanmap(y)) == p)) == 0) return (y) iter <- iter + 1 } y } ## End(Not run)
## The following code will result in exactly the same output ## as the one obtained from the original data set data(bushmiss) bf <- bushmiss[bushmiss$MPROB==0,1:5] plot(bf) covMcd(bf) ## Not run: ## This is the code with which the missing data were created: ## ## Creates a data set with missing values (for testing purposes) ## from a complete data set 'x'. The probability of ## each item being missing is 'pr' (Bernoulli trials). ## getmiss <- function(x, pr=0.1) { n <- nrow(x) p <- ncol(x) done <- FALSE iter <- 0 while(iter <= 50){ bt <- rbinom(n*p, 1, pr) btmat <- matrix(bt, nrow=n) btmiss <- ifelse(btmat==1, NA, 0) y <- x+btmiss if(length(which(rowSums(nanmap(y)) == p)) == 0) return (y) iter <- iter + 1 } y } ## End(Not run)
A data frame containing 11 variables with different dimensions of 111 cars
data(Cars)
data(Cars)
A data frame with 111 observations on the following 11 variables.
length
a numeric vector
wheelbase
a numeric vector
width
a numeric vector
height
a numeric vector
front.hd
a numeric vector
rear.hd
a numeric vector
front.leg
a numeric vector
rear.seating
a numeric vector
front.shoulder
a numeric vector
rear.shoulder
a numeric vector
luggage
a numeric vector
Consumer reports. (April 1990). http://backissues.com/issue/Consumer-Reports-April-1990, pp. 235–288.
Chambers, J. M. and Hastie, T. J. (1992). Statistical models in S. Cole, Pacific Grove, CA: Wadsworth and Brooks, pp. 46–47.
M. Hubert, P. J. Rousseeuw, K. Vanden Branden (2005), ROBPCA: A new approach to robust principal components analysis, Technometrics, 47, 64–79.
data(Cars) ## Plot a pairwise scaterplot matrix pairs(Cars[,1:6]) mcd <- CovMcd(Cars[,1:6]) plot(mcd, which="pairs") ## Start with robust PCA pca <- PcaHubert(Cars, k=ncol(Cars), kmax=ncol(Cars)) pca ## Compare with the classical PCA prcomp(Cars) ## or PcaClassic(Cars, k=ncol(Cars), kmax=ncol(Cars)) ## If you want to print the scores too, use print(pca, print.x=TRUE) ## Using the formula interface PcaHubert(~., data=Cars, k=ncol(Cars), kmax=ncol(Cars)) ## To plot the results: plot(pca) # distance plot pca2 <- PcaHubert(Cars, k=4) plot(pca2) # PCA diagnostic plot (or outlier map) ## Use the standard plots available for prcomp and princomp screeplot(pca) # it is interesting with all variables biplot(pca) # for biplot we need more than one PCs ## Restore the covraiance matrix py <- PcaHubert(Cars, k=ncol(Cars), kmax=ncol(Cars)) cov.1 <- py@loadings %*% diag(py@eigenvalues) %*% t(py@loadings) cov.1
data(Cars) ## Plot a pairwise scaterplot matrix pairs(Cars[,1:6]) mcd <- CovMcd(Cars[,1:6]) plot(mcd, which="pairs") ## Start with robust PCA pca <- PcaHubert(Cars, k=ncol(Cars), kmax=ncol(Cars)) pca ## Compare with the classical PCA prcomp(Cars) ## or PcaClassic(Cars, k=ncol(Cars), kmax=ncol(Cars)) ## If you want to print the scores too, use print(pca, print.x=TRUE) ## Using the formula interface PcaHubert(~., data=Cars, k=ncol(Cars), kmax=ncol(Cars)) ## To plot the results: plot(pca) # distance plot pca2 <- PcaHubert(Cars, k=4) plot(pca2) # PCA diagnostic plot (or outlier map) ## Use the standard plots available for prcomp and princomp screeplot(pca) # it is interesting with all variables biplot(pca) # for biplot we need more than one PCs ## Restore the covraiance matrix py <- PcaHubert(Cars, k=ncol(Cars), kmax=ncol(Cars)) cov.1 <- py@loadings %*% diag(py@eigenvalues) %*% t(py@loadings) cov.1
The data on annual precipitation totals for the North Cascades region contains the sample L-moments ratios (L-CV, L-skewness and L-kurtosis) for 19 sites as used by Hosking and Wallis (1997), page 53, Table 3.4, to illustrate screening tools for regional freqency analysis (RFA).
data(Cascades)
data(Cascades)
A data frame with 19 observations on the following 3 variables.
L-CV
L-coefficient of variation
L-skewness
L-coefficient of skewness
L-kurtosis
L-coefficient of kurtosis
The sample L-moment ratios (L-CV, L-skewness and L-kurtosis) of a site are regarded as a point in three dimensional space.
Hosking, J. R. M. and J. R. Wallis (1997), Regional Frequency Analysis: An Approach Based on L-moments. Cambridge University Press, p. 52–53
Neykov, N.M., Neytchev, P.N., Van Gelder, P.H.A.J.M. and Todorov V. (2007), Robust detection of discordant sites in regional frequency analysis, Water Resources Research, 43, W06417, doi:10.1029/2006WR005322
data(Cascades) # plot a matrix of scatterplots pairs(Cascades, main="Cascades data set", pch=21, bg=c("red", "green3", "blue")) mcd<-CovMcd(Cascades) mcd plot(mcd, which="dist", class=TRUE) plot(mcd, which="dd", class=TRUE) ## identify the discordant sites using robust distances and compare ## to the classical ones rd <- sqrt(getDistance(mcd)) ccov <- CovClassic(Cascades) cd <- sqrt(getDistance(ccov)) r.out <- which(rd > sqrt(qchisq(0.975,3))) c.out <- which(cd > sqrt(qchisq(0.975,3))) cat("Robust: ", length(r.out), " outliers: ", r.out,"\n") cat("Classical: ", length(c.out), " outliers: ", c.out,"\n")
data(Cascades) # plot a matrix of scatterplots pairs(Cascades, main="Cascades data set", pch=21, bg=c("red", "green3", "blue")) mcd<-CovMcd(Cascades) mcd plot(mcd, which="dist", class=TRUE) plot(mcd, which="dd", class=TRUE) ## identify the discordant sites using robust distances and compare ## to the classical ones rd <- sqrt(getDistance(mcd)) ccov <- CovClassic(Cascades) cd <- sqrt(getDistance(ccov)) r.out <- which(rd > sqrt(qchisq(0.975,3))) c.out <- which(cd > sqrt(qchisq(0.975,3))) cat("Robust: ", length(r.out), " outliers: ", r.out,"\n") cat("Classical: ", length(c.out), " outliers: ", c.out,"\n")
The class Cov
represents an estimate of the
multivariate location and scatter of a data set. The objects of class Cov
contain the classical estimates and serve as base for deriving other
estimates, i.e. different types of robust estimates.
Objects can be created by calls of the form new("Cov", ...)
,
but the usual way of creating Cov
objects is a call to the function
Cov
which serves as a constructor.
call
:Object of class "language"
cov
:covariance matrix
center
:location
n.obs
:number of observations used for the computation of the estimates
mah
:mahalanobis distances
det
:determinant
flag
:flags (FALSE if suspected an outlier)
method
:a character string describing the method used to compute the estimate: "Classic"
singularity
:a list with singularity information for the
covariance matrix (or NULL
of not singular)
X
:data
signature(obj = "Cov")
: location vector
signature(obj = "Cov")
: covariance matrix
signature(obj = "Cov")
: correlation matrix
signature(obj = "Cov")
: data frame
signature(obj = "Cov")
: distances
signature(obj = "Cov")
: Computes and returns
the eigenvalues of the covariance matrix
signature(obj = "Cov")
: Computes and returns
the determinant of the covariance matrix (or 0 if the covariance matrix is singular)
signature(obj = "Cov")
: Computes and returns
the shape matrix corresponding to the covariance matrix (i.e. the covariance matrix scaled to have determinant =1)
signature(obj = "Cov")
: Flags observations as outliers if the corresponding mahalanobis distance is larger then qchisq(prob, p)
where prob
defaults to 0.975.
signature(obj = "Cov")
: returns TRUE by default. If necessary, the robust
classes will override
signature(x = "Cov")
: plot the object
signature(object = "Cov")
: display the object
signature(object = "Cov")
: calculate summary information
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
showClass("Cov")
showClass("Cov")
Computes the classical estimates of multivariate location and scatter.
Returns an S4 class CovClassic
with the estimated center
,
cov
, Mahalanobis distances and weights based on these distances.
CovClassic(x, unbiased=TRUE) Cov(x, unbiased=TRUE)
CovClassic(x, unbiased=TRUE) Cov(x, unbiased=TRUE)
x |
a matrix or data frame. As usual, rows are observations and columns are variables. |
unbiased |
whether to return the unbiased estimate of
the covariance matrix. Default is |
An object of class "CovClassic"
.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) cv <- CovClassic(hbk.x) cv summary(cv) plot(cv)
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) cv <- CovClassic(hbk.x) cv summary(cv) plot(cv)
The class CovClassic
represents an estimate of the
multivariate location and scatter of a data set. The objects of class CovClassic
contain the classical estimates.
Objects can be created by calls of the form new("CovClassic", ...)
,
but the usual way of creating CovClassic
objects is a call to the function
CovClassic
which serves as a constructor.
call
:Object of class "language"
cov
:covariance matrix
center
:location
n.obs
:number of observations used for the computation of the estimates
mah
:mahalanobis distances
method
:a character string describing the method used to compute the estimate: "Classic"
singularity
:a list with singularity information for the
ocvariance matrix (or NULL
of not singular)
X
:data
signature(obj = "CovClassic")
: location vector
signature(obj = "CovClassic")
: covariance matrix
signature(obj = "CovClassic")
: correlation matrix
signature(obj = "CovClassic")
: data frame
signature(obj = "CovClassic")
: distances
signature(obj = "CovClassic")
: Computes and returns
the eigenvalues of the covariance matrix
signature(x = "CovClassic")
: plot the object
signature(object = "CovClassic")
: display the object
signature(object = "CovClassic")
: calculate summary information
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) cv <- CovClassic(hbk.x) cv summary(cv) plot(cv)
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) cv <- CovClassic(hbk.x) cv summary(cv) plot(cv)
The class "CovControl" is a VIRTUAL base control class for the derived classes representing the control parameters for the different robust methods
trace |
whether to print intermediate results. Default is |
tolSolve |
numeric tolerance to be used for inversion
( |
A virtual Class: No objects may be created from it.
No methods defined with class "CovControl" in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
This function will create a control object CovControlMcd
containing the control parameters for CovMcd
CovControlMcd(alpha = 0.5, nsamp = 500, scalefn=NULL, maxcsteps=200, seed = NULL, trace= FALSE, use.correction = TRUE)
CovControlMcd(alpha = 0.5, nsamp = 500, scalefn=NULL, maxcsteps=200, seed = NULL, trace= FALSE, use.correction = TRUE)
alpha |
numeric parameter controlling the size of the subsets
over which the determinant is minimized, i.e., |
nsamp |
number of subsets used for initial estimates or For |
scalefn |
|
maxcsteps |
maximal number of concentration steps in the deterministic MCD; should not be reached. |
seed |
starting value for random generator. Default is |
trace |
whether to print intermediate results. Default is |
use.correction |
whether to use finite sample correction factors.
Default is |
A CovControlMcd
object
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlMcd", alpha=0.75) ctrl2 <- CovControlMcd(alpha=0.75) data(hbk) CovMcd(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlMcd", alpha=0.75) ctrl2 <- CovControlMcd(alpha=0.75) data(hbk) CovMcd(hbk, control=ctrl1)
This class extends the CovControl
class
and contains the control parameters for "CovMcd"
Objects can be created by calls of the form new("CovControlMcd", ...)
or by calling the constructor-function CovControlMcd
.
alpha
:numeric parameter controlling the size of the subsets
over which the determinant is minimized, i.e., alpha*n
observations are used for computing the determinant. Allowed values
are between 0.5 and 1 and the default is 0.5.
number of subsets used for initial estimates or "best"
,
"exact"
or "deterministic"
. Default is nsamp = 500
.
For nsamp="best"
exhaustive enumeration is done, as long as the
number of trials does not exceed 5000. For "exact"
,
exhaustive enumeration will be attempted however many samples are
needed. In this case a warning message will be displayed saying
that the computation can take a very long time.
For "deterministic"
, the deterministic MCD is computed; as
proposed by Hubert et al. (2012) it starts from the most
central observations of six (deterministic) estimators.
function
to compute a robust scale
estimate or character string specifying a rule determining such a
function.
maximal number of concentration steps in the deterministic MCD; should not be reached.
seed
:starting value for random generator. Default is seed = NULL
use.correction
: whether to use finite sample correction factors.
Default is use.correction=TRUE
.
trace
, tolSolve
:from the
"CovControl"
class.
Class "CovControl"
, directly.
signature(obj = "CovControlMcd")
: the generic
function restimate
allows the different methods for robust estimation to be
used polymorphically - this function will call CovMcd
passing it the control
object and will return the obtained CovRobust
object
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlMcd", alpha=0.75) ctrl2 <- CovControlMcd(alpha=0.75) data(hbk) CovMcd(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlMcd", alpha=0.75) ctrl2 <- CovControlMcd(alpha=0.75) data(hbk) CovMcd(hbk, control=ctrl1)
This function will create a control object CovControlMest
containing the control parameters for CovMest
CovControlMest(r = 0.45, arp = 0.05, eps = 0.001, maxiter = 120)
CovControlMest(r = 0.45, arp = 0.05, eps = 0.001, maxiter = 120)
r |
a numeric value specifying the required
breakdown point. Allowed values are between
|
arp |
a numeric value specifying the asympthotic
rejection point, i.e. the fraction of points receiving zero
weight (see Rocke (1996)). Default is |
eps |
a numeric value specifying the
relative precision of the solution of the M-estimate.
Defaults to |
maxiter |
maximum number of iterations allowed in the computation of the M-estimate. Defaults to 120 |
A CovControlMest
object
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlMest", r=0.4) ctrl2 <- CovControlMest(r=0.4) data(hbk) CovMest(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlMest", r=0.4) ctrl2 <- CovControlMest(r=0.4) data(hbk) CovMest(hbk, control=ctrl1)
This class extends the CovControl
class
and contains the control parameters for CovMest
Objects can be created by calls of the form new("CovControlMest", ...)
or by calling the constructor-function CovControlMest
.
r
:a numeric value specifying the required
breakdown point. Allowed values are between
(n - p)/(2 * n)
and 1 and the default is 0.45
arp
:a numeric value specifying the asympthotic
rejection point, i.e. the fraction of points receiving zero
weight (see Rocke (1996)). Default is 0.05
eps
:a numeric value specifying the
relative precision of the solution of the M-estimate.
Defaults to 1e-3
maxiter
:maximum number of iterations allowed in the computation of the M-estimate. Defaults to 120
trace
, tolSolve
:from the
"CovControl"
class.
Class "CovControl"
, directly.
signature(obj = "CovControlMest")
: the generic
function restimate
allowes the different methods for robust estimation to be
used polymorphically - this function will call CovMest
passing it the control
object and will return the obtained CovRobust
object
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlMest", r=0.4) ctrl2 <- CovControlMest(r=0.4) data(hbk) CovMest(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlMest", r=0.4) ctrl2 <- CovControlMest(r=0.4) data(hbk) CovMest(hbk, control=ctrl1)
This function will create a control object CovControlMMest
containing the control parameters for CovMMest
CovControlMMest(bdp = 0.5, eff=0.95, maxiter = 50, sest=CovControlSest(), trace = FALSE, tolSolve = 1e-7)
CovControlMMest(bdp = 0.5, eff=0.95, maxiter = 50, sest=CovControlSest(), trace = FALSE, tolSolve = 1e-7)
bdp |
a numeric value specifying the required breakdown point. Allowed values are between 0.5 and 1 and the default is 0.5 |
eff |
a numeric value specifying the required efficiency
for the MM estimates. Default is |
sest |
an |
maxiter |
maximum number of iterations allowed in the computation of the MM-estimate. Defaults to 150. |
trace |
whether to print intermediate results. Default is |
tolSolve |
numeric tolerance to be used as a convergence tolerance for the MM-iteration. |
A CovControlSest
object.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlMMest", bdp=0.25) ctrl2 <- CovControlMMest(bdp=0.25) data(hbk) CovMMest(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlMMest", bdp=0.25) ctrl2 <- CovControlMMest(bdp=0.25) data(hbk) CovMMest(hbk, control=ctrl1)
This class extends the CovControl
class
and contains the control parameters for CovMMest
Objects can be created by calls of the form new("CovControlMMest", ...)
or by calling the constructor-function CovControlMMest
.
a numeric value specifying the required
breakdown point. Allowed values are between
0.5 and 1 and the default is bdp=0.5
.
a numeric value specifying the required efficiency
for the MM estimates. Default is eff=0.95
.
an CovControlSest
object containing control parameters for the initial S-estimate.
maximum number of iterations allowed
in the computation of the MM-estimate.
Default is maxiter=50
.
trace
, tolSolve
:from the
"CovControl"
class. tolSolve
is used as
a convergence tolerance for the MM-iteration.
Class "CovControl"
, directly.
signature(obj = "CovControlMMest")
: the generic
function restimate
allowes the different methods for robust estimation to be
used polymorphically - this function will call CovMMest
passing it the control
object and will return the obtained CovRobust
object
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlMMest", bdp=0.25) ctrl2 <- CovControlMMest(bdp=0.25) data(hbk) CovMMest(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlMMest", bdp=0.25) ctrl2 <- CovControlMMest(bdp=0.25) data(hbk) CovMMest(hbk, control=ctrl1)
This function will create a control object CovControlMrcd
containing the control parameters for CovMrcd
CovControlMrcd(alpha = 0.5, h=NULL, maxcsteps=200, rho=NULL, target=c("identity", "equicorrelation"), maxcond=50, trace= FALSE)
CovControlMrcd(alpha = 0.5, h=NULL, maxcsteps=200, rho=NULL, target=c("identity", "equicorrelation"), maxcond=50, trace= FALSE)
alpha |
numeric parameter controlling the size of the subsets
over which the determinant is minimized, i.e., |
h |
the size of the subset (can be between ceiling(n/2) and n).
Normally NULL and then it |
maxcsteps |
maximal number of concentration steps in the deterministic MCD; should not be reached. |
rho |
regularization parameter. Normally NULL and will be estimated from the data. |
target |
structure of the robust positive definite target matrix:
a) "identity": target matrix is diagonal matrix with robustly estimated
univariate scales on the diagonal or b) "equicorrelation": non-diagonal
target matrix that incorporates an equicorrelation structure
(see (17) in paper). Default is |
maxcond |
maximum condition number allowed
(see step 3.4 in algorithm 1). Default is |
trace |
whether to print intermediate results. Default is |
A CovControlMrcd
object
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlMrcd", alpha=0.75) ctrl2 <- CovControlMrcd(alpha=0.75) data(hbk) CovMrcd(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlMrcd", alpha=0.75) ctrl2 <- CovControlMrcd(alpha=0.75) data(hbk) CovMrcd(hbk, control=ctrl1)
This class extends the CovControl
class
and contains the control parameters for "CovMrcd"
Objects can be created by calls of the form new("CovControlMrcd", ...)
or by calling the constructor-function CovControlMrcd
.
alpha
:numeric parameter controlling the size of the subsets
over which the determinant is minimized, i.e., alpha*n
observations are used for computing the determinant. Allowed values
are between 0.5 and 1 and the default is 0.5.
the size of the subset (can be between ceiling(n/2) and n).
Normally NULL and then it h
will be calculated as
h=ceiling(alpha*n)
. If h
is provided, alpha
will be calculated as alpha=h/n
.
maximal number of concentration steps in the deterministic MCD; should not be reached.
regularization parameter. Normally NULL and will be estimated from the data.
structure of the robust positive definite target matrix: a) "identity": target matrix is diagonal matrix with robustly estimated univariate scales on the diagonal or b) "equicorrelation": non-diagonal target matrix that incorporates an equicorrelation structure (see (17) in paper).
maximum condition number allowed (see step 3.4 in algorithm 1).
trace
, tolSolve
:from the "CovControl"
class.
Class "CovControl"
, directly.
signature(obj = "CovControlMrcd")
: the generic
function restimate
allows the different methods for robust estimation to be
used polymorphically - this function will call CovMrcd
passing it the control
object and will return the obtained CovRobust
object
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlMrcd", alpha=0.75) ctrl2 <- CovControlMrcd(alpha=0.75) data(hbk) CovMrcd(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlMrcd", alpha=0.75) ctrl2 <- CovControlMrcd(alpha=0.75) data(hbk) CovMrcd(hbk, control=ctrl1)
This function will create a control object CovControlMve
containing the control parameters for CovMve
CovControlMve(alpha = 0.5, nsamp = 500, seed = NULL, trace= FALSE)
CovControlMve(alpha = 0.5, nsamp = 500, seed = NULL, trace= FALSE)
alpha |
numeric parameter controlling the size of the subsets
over which the determinant is minimized, i.e., |
nsamp |
number of subsets used for initial estimates or |
seed |
starting value for random generator. Default is |
trace |
whether to print intermediate results. Default is |
A CovControlMve
object
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlMve", alpha=0.75) ctrl2 <- CovControlMve(alpha=0.75) data(hbk) CovMve(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlMve", alpha=0.75) ctrl2 <- CovControlMve(alpha=0.75) data(hbk) CovMve(hbk, control=ctrl1)
This class extends the CovControl
class
and contains the control parameters for "CovMve"
Objects can be created by calls of the form new("CovControlMve", ...)
or by calling the constructor-function CovControlMve
.
alpha
:numeric parameter controlling the size of the subsets
over which the determinant is minimized, i.e., alpha*n
observations are used for computing the determinant. Allowed values
are between 0.5 and 1 and the default is 0.5.
nsamp
: number of subsets used for initial estimates or "best"
or "exact"
. Default is nsamp = 500
. For
nsamp="best"
exhaustive enumeration is done, as long as the
number of trials does not exceed 5000. For "exact"
,
exhaustive enumeration will be attempted however many samples are
needed. In this case a warning message will be displayed saying
that the computation can take a very long time.
seed
:starting value for random generator. Default is seed = NULL
trace
, tolSolve
:from the
"CovControl"
class.
Class "CovControl"
, directly.
signature(obj = "CovControlMve")
: the generic
function restimate
allowes the different methods for robust estimation to be
used polymorphically - this function will call CovMve
passing it the control
object and will return the obtained CovRobust
object
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlMve", alpha=0.75) ctrl2 <- CovControlMve(alpha=0.75) data(hbk) CovMve(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlMve", alpha=0.75) ctrl2 <- CovControlMve(alpha=0.75) data(hbk) CovMve(hbk, control=ctrl1)
This function will create a control object CovControlOgk
containing the control parameters for CovOgk
CovControlOgk(niter = 2, beta = 0.9, mrob = NULL, vrob = .vrobGK, smrob = "scaleTau2", svrob = "gk")
CovControlOgk(niter = 2, beta = 0.9, mrob = NULL, vrob = .vrobGK, smrob = "scaleTau2", svrob = "gk")
niter |
number of iterations, usually 1 or 2 since iterations beyond the second do not lead to improvement. |
beta |
coverage parameter for the final reweighted estimate |
mrob |
function for computing the robust univariate location
and dispersion - one could use the |
vrob |
function for computing robust estimate
of covariance between two random vectors - one could use the function
proposed by Gnanadesikan and Kettenring (1972), see
|
smrob |
a string indicating the name of the function for computing
the robust univariate location and dispersion - defaults to
|
svrob |
a string indicating the name of the function for computing
robust estimate of covariance between two random vectors - defaults |
If the user does not specify a scale and covariance function to be used in
the computations or specifies one by using the arguments smrob
and svrob
(i.e. the names of the functions as strings), a native code written in C will be called which
is by far faster than the R version.
If the arguments mrob
and vrob
are not NULL, the specified functions
will be used via the pure R implementation of the algorithm. This could be quite slow.
A CovControlOgk
object
Valentin Todorov [email protected]
Maronna, R.A. and Zamar, R.H. (2002) Robust estimates of location and dispersion of high-dimensional datasets; Technometrics 44(4), 307–317.
Yohai, R.A. and Zamar, R.H. (1998) High breakdown point estimates of regression by means of the minimization of efficient scale JASA 86, 403–413.
Gnanadesikan, R. and John R. Kettenring (1972) Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28, 81–124.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlOgk", beta=0.95) ctrl2 <- CovControlOgk(beta=0.95) data(hbk) CovOgk(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlOgk", beta=0.95) ctrl2 <- CovControlOgk(beta=0.95) data(hbk) CovOgk(hbk, control=ctrl1)
This class extends the CovControl
class
and contains the control parameters for "CovOgk"
Objects can be created by calls of the form new("CovControlOgk", ...)
or by calling the constructor-function CovControlOgk
.
number of iterations, usually 1 or 2 since iterations beyond the second do not lead to improvement.
coverage parameter for the final reweighted estimate
function for computing the robust univariate location
and dispersion - defaults to the tau scale
defined in
Yohai and Zamar (1998)
function for computing robust estimate of covariance between two random vectors - defaults the one proposed by Gnanadesikan and Kettenring (1972)
A string indicating the name of the function for computing
the robust univariate location and dispersion - defaults to scaleTau2
-
the scale 'tau' function defined in Yohai and Zamar (1998)
A string indicating the name of the function for computing
robust estimate of covariance between two random vectors -
defaults to gk
, the one proposed by Gnanadesikan and Kettenring (1972).
trace
, tolSolve
:from the
"CovControl"
class.
Class "CovControl"
, directly.
signature(obj = "CovControlOgk")
: the generic
function restimate
allowes the different methods for robust estimation to be
used polymorphically - this function will call CovOgk
passing it the control
object and will return the obtained CovRobust
object
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlOgk", beta=0.95) ctrl2 <- CovControlOgk(beta=0.95) data(hbk) CovOgk(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlOgk", beta=0.95) ctrl2 <- CovControlOgk(beta=0.95) data(hbk) CovOgk(hbk, control=ctrl1)
This function will create a control object CovControlSde
containing the control parameters for CovSde
CovControlSde(nsamp = 0, maxres = 0, tune = 0.95, eps = 0.5, prob = 0.99, seed = NULL, trace = FALSE, tolSolve = 1e-14)
CovControlSde(nsamp = 0, maxres = 0, tune = 0.95, eps = 0.5, prob = 0.99, seed = NULL, trace = FALSE, tolSolve = 1e-14)
nsamp |
a positive integer giving the number of resamples required;
|
maxres |
a positive integer specifying the maximum number of
resamples to be performed including those that are discarded due to linearly
dependent subsamples. If |
tune |
a numeric value between 0 and 1 giving the fraction of the data to receive non-zero weight.
Defaults to |
prob |
a numeric value between 0 and 1 specifying the probability of high breakdown point;
used to compute |
eps |
a numeric value between 0 and 0.5 specifying the breakdown point;
used to compute |
seed |
starting value for random generator. Default is |
trace |
whether to print intermediate results. Default is |
tolSolve |
numeric tolerance to be used for inversion
( |
A CovControlSde
object.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlSde", nsamp=2000) ctrl2 <- CovControlSde(nsamp=2000) data(hbk) CovSde(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlSde", nsamp=2000) ctrl2 <- CovControlSde(nsamp=2000) data(hbk) CovSde(hbk, control=ctrl1)
This class extends the CovControl
class
and contains the control parameters for CovSde
Objects can be created by calls of the form new("CovControlSde", ...)
or by calling the constructor-function CovControlSde
.
nsamp
a positive integer giving the number of resamples required
maxres
a positive integer specifying the maximum number of resamples to be performed including those that are discarded due to linearly dependent subsamples.
tune
a numeric value between 0 and 1 giving the fraction of
the data to receive non-zero weight. Default is tune = 0.95
.
prob
a numeric value between 0 and 1 specifying
the probability of high breakdown point; used to compute
nsamp
when nsamp
is omitted. Default is prob = 0.99
.
eps
a numeric value between 0 and 0.5 specifying the breakdown point;
used to compute nsamp
when nresamp
is omitted.
Default is eps = 0.5
.
starting value for random generator. Default is seed = NULL
.
trace
, tolSolve
:from the
"CovControl"
class.
Class "CovControl"
, directly.
signature(obj = "CovControlSde")
: ...
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlSde", nsamp=2000) ctrl2 <- CovControlSde(nsamp=2000) data(hbk) CovSde(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlSde", nsamp=2000) ctrl2 <- CovControlSde(nsamp=2000) data(hbk) CovSde(hbk, control=ctrl1)
This function will create a control object CovControlSest
containing the control parameters for CovSest
CovControlSest(bdp = 0.5, arp = 0.1, eps = 1e-5, maxiter = 120, nsamp = 500, seed = NULL, trace = FALSE, tolSolve = 1e-14, method= "sfast")
CovControlSest(bdp = 0.5, arp = 0.1, eps = 1e-5, maxiter = 120, nsamp = 500, seed = NULL, trace = FALSE, tolSolve = 1e-14, method= "sfast")
bdp |
a numeric value specifying the required
breakdown point. Allowed values are between
|
arp |
a numeric value specifying the asympthotic
rejection point (for the Rocke type S estimates),
i.e. the fraction of points receiving zero
weight (see Rocke (1996)). Default is |
eps |
a numeric value specifying the
relative precision of the solution of the S-estimate (bisquare and Rocke type).
Defaults to |
maxiter |
maximum number of iterations allowed in the computation of the S-estimate (bisquare and Rocke type). Defaults to 120. |
nsamp |
the number of random subsets considered. Default is |
seed |
starting value for random generator. Default is |
trace |
whether to print intermediate results. Default is |
tolSolve |
numeric tolerance to be used for inversion
( |
method |
Which algorithm to use: 'sfast'=FAST-S or 'surreal'=SURREAL |
A CovControlSest
object.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlSest", bdp=0.4) ctrl2 <- CovControlSest(bdp=0.4) data(hbk) CovSest(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlSest", bdp=0.4) ctrl2 <- CovControlSest(bdp=0.4) data(hbk) CovSest(hbk, control=ctrl1)
This class extends the CovControl
class
and contains the control parameters for CovSest
Objects can be created by calls of the form new("CovControlSest", ...)
or by calling the constructor-function CovControlSest
.
a numeric value specifying the required
breakdown point. Allowed values are between
(n - p)/(2 * n)
and 1 and the default is bdp=0.45
.
a numeric value specifying the asympthotic
rejection point (for the Rocke type S estimates),
i.e. the fraction of points receiving zero
weight (see Rocke (1996)). Default is arp=0.1
.
a numeric value specifying the
relative precision of the solution of the S-estimate
(bisquare and Rocke type). Default is to eps=1e-5
.
maximum number of iterations allowed
in the computation of the S-estimate (bisquare and Rocke type).
Default is maxiter=120
.
the number of random subsets considered.
Default is nsamp = 500
.
starting value for random generator. Default is seed = NULL
.
Which algorithm to use: 'sfast'=FAST-S, 'surreal'=Ruppert's SURREAL algorithm, 'bisquare'=Bisquare S-estimation with HBDP start or 'rocke' for Rocke type S-estimates
trace
, tolSolve
:from the
"CovControl"
class.
Class "CovControl"
, directly.
signature(obj = "CovControlSest")
: the generic
function restimate
allowes the different methods for robust estimation to be
used polymorphically - this function will call CovSest
passing it the control
object and will return the obtained CovRobust
object
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## the following two statements are equivalent ctrl1 <- new("CovControlSest", bdp=0.4) ctrl2 <- CovControlSest(bdp=0.4) data(hbk) CovSest(hbk, control=ctrl1)
## the following two statements are equivalent ctrl1 <- new("CovControlSest", bdp=0.4) ctrl2 <- CovControlSest(bdp=0.4) data(hbk) CovSest(hbk, control=ctrl1)
Computes a robust multivariate location and scatter estimate with a high breakdown point, using the ‘Fast MCD’ (Minimum Covariance Determinant) estimator.
CovMcd(x, raw.only=FALSE, alpha=control@alpha, nsamp=control@nsamp, scalefn=control@scalefn, maxcsteps=control@maxcsteps, initHsets=NULL, save.hsets=FALSE, seed=control@seed, trace=control@trace, [email protected], control=CovControlMcd(), ...)
CovMcd(x, raw.only=FALSE, alpha=control@alpha, nsamp=control@nsamp, scalefn=control@scalefn, maxcsteps=control@maxcsteps, initHsets=NULL, save.hsets=FALSE, seed=control@seed, trace=control@trace, use.correction=control@use.correction, control=CovControlMcd(), ...)
x |
a matrix or data frame. |
raw.only |
should only the “raw” estimate be returned. |
alpha |
numeric parameter controlling the size of the subsets
over which the determinant is minimized, i.e., |
nsamp |
number of subsets used for initial estimates or For |
scalefn |
|
maxcsteps |
maximal number of concentration steps in the deterministic MCD; should not be reached. |
initHsets |
NULL or a |
save.hsets |
(for deterministic MCD) logical indicating if the
initial subsets should be returned as |
seed |
starting value for random generator. Default is |
trace |
whether to print intermediate results. Default is |
use.correction |
whether to use finite sample correction factors.
Default is |
control |
a control object (S4) of class |
... |
potential further arguments passed to robustbase's
|
This function computes the minimum covariance determinant estimator
of location and scatter and returns an S4 object of class
CovMcd-class
containing the estimates.
The implementation of the function is similar to the existing R function
covMcd()
which returns an S3 object.
The MCD method looks for the
observations (out of
) whose classical
covariance matrix has the lowest possible determinant. The raw MCD
estimate of location is then the average of these
points,
whereas the raw MCD estimate of scatter is their covariance matrix,
multiplied by a consistency factor and a finite sample correction factor
(to make it consistent at the normal model and unbiased at small samples).
Both rescaling factors are returned also in the vector
raw.cnp2
of length 2. Based on these raw MCD estimates, a reweighting step is performed
which increases the finite-sample efficiency considerably - see Pison et al. (2002).
The rescaling factors for the reweighted estimates are returned in the
vector cnp2
of length 2. Details for the computation of the finite
sample correction factors can be found in Pison et al. (2002).
The finite sample corrections can be suppressed by setting use.correction=FALSE
.
The implementation in rrcov uses the Fast MCD algorithm of Rousseeuw and Van Driessen (1999)
to approximate the minimum covariance determinant estimator.
An S4 object of class CovMcd-class
which is a subclass of the
virtual class CovRobust-class
.
Valentin Todorov [email protected]
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection. Wiley.
P. J. Rousseeuw and K. van Driessen (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212–223.
M. Hubert, P. Rousseeuw and T. Verdonck (2012) A deterministic algorithm for robust location and scatter. Journal of Computational and Graphical Statistics 21(3), 618–637.
Pison, G., Van Aelst, S., and Willems, G. (2002), Small Sample Corrections for LTS and MCD, Metrika, 55, 111-123.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
cov.rob
from package MASS
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) CovMcd(hbk.x) cD <- CovMcd(hbk.x, nsamp = "deterministic") summary(cD) ## the following three statements are equivalent c1 <- CovMcd(hbk.x, alpha = 0.75) c2 <- CovMcd(hbk.x, control = CovControlMcd(alpha = 0.75)) ## direct specification overrides control one: c3 <- CovMcd(hbk.x, alpha = 0.75, control = CovControlMcd(alpha=0.95)) c1
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) CovMcd(hbk.x) cD <- CovMcd(hbk.x, nsamp = "deterministic") summary(cD) ## the following three statements are equivalent c1 <- CovMcd(hbk.x, alpha = 0.75) c2 <- CovMcd(hbk.x, control = CovControlMcd(alpha = 0.75)) ## direct specification overrides control one: c3 <- CovMcd(hbk.x, alpha = 0.75, control = CovControlMcd(alpha=0.95)) c1
This class, derived from the virtual class "CovRobust"
accomodates
MCD Estimates of multivariate location and scatter computed by the
‘Fast MCD’ algorithm.
Objects can be created by calls of the form new("CovMcd", ...)
,
but the usual way of creating CovMcd
objects is a call to the function
CovMcd
which serves as a constructor.
alpha
:Object of class "numeric"
- the size of the
subsets over which the determinant is minimized (the default is (n+p+1)/2)
quan
:Object of class "numeric"
- the number of
observations on which the MCD is based. If quan
equals
n.obs
, the MCD is the classical covariance matrix.
best
:Object of class "Uvector"
- the best subset
found and used for computing the raw estimates. The size of best
is equal to quan
raw.cov
:Object of class "matrix"
the raw
(not reweighted) estimate of location
raw.center
:Object of class "vector"
- the raw
(not reweighted) estimate of scatter
raw.mah
:Object of class "Uvector"
- mahalanobis
distances of the observations based on the raw estimate of the
location and scatter
raw.wt
:Object of class "Uvector"
- weights of
the observations based on the raw estimate of the location and scatter
raw.cnp2
:Object of class "numeric"
- a vector of length
two containing the consistency correction factor and the finite sample
correction factor of the raw estimate of the covariance matrix
cnp2
:Object of class "numeric"
- a vector of length two
containing the consistency correction factor and the finite sample
correction factor of the final estimate of the covariance matrix.
iter
, crit
, wt
:from the
"CovRobust"
class.
call
, cov
, center
,
n.obs
, mah
, method
,
singularity
, X
:from the "Cov"
class.
Class "CovRobust"
, directly.
Class "Cov"
, by class "CovRobust"
.
No methods defined with class "CovMcd"
in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
CovMcd
, Cov-class
, CovRobust-class
showClass("CovMcd")
showClass("CovMcd")
Computes constrained M-Estimates of multivariate location and scatter
based on the translated biweight function (‘t-biweight’) using
a High breakdown point initial estimate. The default initial estimate
is the Minimum Volume Ellipsoid computed
with CovMve
. The raw (not reweighted) estimates are taken
and the covariance matrix is standardized to determinant 1.
covMest(x, cor=FALSE, r = 0.45, arp = 0.05, eps=1e-3, maxiter=120, control, t0, S0)
covMest(x, cor=FALSE, r = 0.45, arp = 0.05, eps=1e-3, maxiter=120, control, t0, S0)
x |
a matrix or data frame. |
cor |
should the returned result include a correlation matrix?
Default is |
.
r |
required breakdown point. Allowed values are between
|
arp |
asympthotic rejection point, i.e. the fraction of points
receiving zero weight (see Rocke (1996)). Default is |
eps |
a numeric value specifying the relative precision of the solution of
the M-estimate. Defaults to |
maxiter |
maximum number of iterations allowed in the computation of the M-estimate. Defaults to 120 |
control |
a list with estimation options - same as these provided in the fucntion specification. If the control object is supplied, the parameters from it will be used. If parameters are passed also in the invocation statement, they will override the corresponding elements of the control object. |
t0 |
optional initial high breakdown point estimates of the location. If not supplied MVE will be used. |
S0 |
optional initial high breakdown point estimates of the scatter. If not supplied MVE will be used. |
Rocke (1996) has shown that the S-estimates of multivariate location and scatter
in high dimensions can be sensitive to outliers even if the breakdown point
is set to be near 0.5. To mitigate this problem he proposed to utilize
the translated biweight (or t-biweight) method with a
standardization step consisting of equating the median of rho(d)
with the median under normality. This is then not an S-estimate, but is
instead a constrained M-estimate. In order to make the smooth estimators
to work, a reasonable starting point is necessary, which will lead reliably to a
good solution of the estimator. In covMest
the MVE computed by
CovMve
is used, but the user has the possibility to give her own
initial estimates.
An object of class "mest"
which is basically a list
with
the following components. This class is "derived" from "mcd"
so that
the same generic functions - print
, plot
, summary
- can
be used.
NOTE: this is going to change - in one of the next revisions covMest
will return an S4 class "mest"
which is derived (i.e. contains
)
form class "cov"
.
center |
the final estimate of location. |
cov |
the final estimate of scatter. |
cor |
the estimate of the correlation matrix (only if
|
mah |
mahalanobis distances of the observations using the M-estimate of the location and scatter. |
X |
the input data as a matrix. |
n.obs |
total number of observations. |
method |
character string naming the method (M-Estimates). |
call |
the call used (see |
The psi, rho and weight functions for the M estimation are encapsulated in a
virtual S4 class PsiFun
from which a PsiBwt
class, implementing
the translated biweight (t-biweight), is dervied. The base class PsiFun
contains also the M-iteration itself. Although not documented and not
accessibale directly by the user these classes will form the bases for adding
other functions (biweight, LWS, etc.) as well as S-estimates.
Valentin Todorov [email protected],
(some code from C. Becker - http://www.sfb475.uni-dortmund.de/dienst/de/content/struk-d/bereicha-d/tpa1softw-d.html)
D.L.Woodruff and D.M.Rocke (1994) Computable robust estimation of multivariate location and shape on high dimension using compound estimators, Journal of the American Statistical Association, 89, 888–896.
D.M.Rocke (1996) Robustness properties of S-estimates of multivariate location and shape in high dimension, Annals of Statistics, 24, 1327-1345.
D.M.Rocke and D.L.Woodruff (1996) Identification of outliers in multivariate data Journal of the American Statistical Association, 91, 1047–1061.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
Computes constrained M-Estimates of multivariate location and scatter
based on the translated biweight function (‘t-biweight’) using
a High breakdown point initial estimate as defined by Rocke (1996).
The default initial estimate is the Minimum Volume Ellipsoid computed
with CovMve
. The raw (not reweighted) estimates are taken
and the covariance matrix is standardized to determinant 1.
CovMest(x, r = 0.45, arp = 0.05, eps=1e-3, maxiter=120, control, t0, S0, initcontrol)
CovMest(x, r = 0.45, arp = 0.05, eps=1e-3, maxiter=120, control, t0, S0, initcontrol)
x |
a matrix or data frame. |
r |
required breakdown point. Allowed values are between
|
arp |
asympthotic rejection point, i.e. the fraction of points
receiving zero weight (see Rocke (1996)). Default is |
eps |
a numeric value specifying the relative precision of the solution of
the M-estimate. Defaults to |
maxiter |
maximum number of iterations allowed in the computation of the M-estimate. Defaults to 120 |
control |
a control object (S4) of class |
t0 |
optional initial high breakdown point estimates of the location. If not supplied MVE will be used. |
S0 |
optional initial high breakdown point estimates of the scatter. If not supplied MVE will be used. |
initcontrol |
optional control object - of class CovControl - specifing the initial high breakdown point estimates of location and scatter. If not supplied MVE will be used. |
Rocke (1996) has shown that the S-estimates of multivariate location and scatter
in high dimensions can be sensitive to outliers even if the breakdown point
is set to be near 0.5. To mitigate this problem he proposed to utilize
the translated biweight (or t-biweight) method with a
standardization step consisting of equating the median of rho(d)
with the median under normality. This is then not an S-estimate, but is
instead a constrained M-estimate. In order to make the smooth estimators
to work, a reasonable starting point is necessary, which will lead reliably to a
good solution of the estimator. In CovMest
the MVE computed by
CovMve
is used, but the user has the possibility to give her own
initial estimates.
An object of class CovMest-class
which is a subclass of the virtual class CovRobust-class
.
The psi, rho and weight functions for the M estimation are encapsulated in a
virtual S4 class PsiFun
from which a PsiBwt
class, implementing
the translated biweight (t-biweight), is dervied. The base class PsiFun
contains also the M-iteration itself. Although not documented and not
accessibale directly by the user these classes will form the bases for adding
other functions (biweight, LWS, etc.) as well as S-estimates.
Valentin Todorov [email protected],
(some code from C. Becker - http://www.sfb475.uni-dortmund.de/dienst/de/content/struk-d/bereicha-d/tpa1softw-d.html)
D.L.Woodruff and D.M.Rocke (1994) Computable robust estimation of multivariate location and shape on high dimension using compound estimators, Journal of the American Statistical Association, 89, 888–896.
D.M.Rocke (1996) Robustness properties of S-estimates of multivariate location and shape in high dimension, Annals of Statistics, 24, 1327-1345.
D.M.Rocke and D.L.Woodruff (1996) Identification of outliers in multivariate data Journal of the American Statistical Association, 91, 1047–1061.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
covMcd
, Cov-class
,
CovMve
,
CovRobust-class
, CovMest-class
library(rrcov) data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) CovMest(hbk.x) ## the following four statements are equivalent c0 <- CovMest(hbk.x) c1 <- CovMest(hbk.x, r = 0.45) c2 <- CovMest(hbk.x, control = CovControlMest(r = 0.45)) c3 <- CovMest(hbk.x, control = new("CovControlMest", r = 0.45)) ## direct specification overrides control one: c4 <- CovMest(hbk.x, r = 0.40, control = CovControlMest(r = 0.25)) c1 summary(c1) plot(c1)
library(rrcov) data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) CovMest(hbk.x) ## the following four statements are equivalent c0 <- CovMest(hbk.x) c1 <- CovMest(hbk.x, r = 0.45) c2 <- CovMest(hbk.x, control = CovControlMest(r = 0.45)) c3 <- CovMest(hbk.x, control = new("CovControlMest", r = 0.45)) ## direct specification overrides control one: c4 <- CovMest(hbk.x, r = 0.40, control = CovControlMest(r = 0.25)) c1 summary(c1) plot(c1)
This class, derived from the virtual class "CovRobust" accomodates constrained M-Estimates of multivariate location and scatter based on the translated biweight function (‘t-biweight’) using a High breakdown point initial estimate (Minimum Covariance Determinant - ‘Fast MCD’)
Objects can be created by calls of the form new("CovMest", ...)
,
but the usual way of creating CovMest
objects is a call to the function
CovMest
which serves as a constructor.
vt
:Object of class "vector"
- vector of weights (v)
iter
, crit
, wt
:from the
"CovRobust"
class.
call
, cov
, center
,
n.obs
, mah
, method
,
singularity
, X
:from the "Cov"
class.
Class "CovRobust"
, directly.
Class "Cov"
, by class "CovRobust"
.
No methods defined with class "CovMest" in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
CovMest
, Cov-class
, CovRobust-class
showClass("CovMest")
showClass("CovMest")
Computes MM-Estimates of multivariate location and scatter starting from an initial S-estimate
CovMMest(x, bdp = 0.5, eff = 0.95, eff.shape=TRUE, maxiter = 50, trace = FALSE, tolSolve = 1e-7, control)
CovMMest(x, bdp = 0.5, eff = 0.95, eff.shape=TRUE, maxiter = 50, trace = FALSE, tolSolve = 1e-7, control)
x |
a matrix or data frame. |
bdp |
a numeric value specifying the required
breakdown point. Allowed values are between
0.5 and 1 and the default is |
eff |
a numeric value specifying the required efficiency
for the MM estimates. Default is |
eff.shape |
logical; if TRUE, eff is with regard to shape-efficiency, otherwise location-efficiency. Default is |
maxiter |
maximum number of iterations allowed
in the computation of the S-estimate (bisquare and Rocke type).
Default is |
trace |
whether to print intermediate results. Default is |
tolSolve |
numeric tolerance to be used as a convergence tolerance for the MM-iteration |
control |
a control object (S4) of class |
Computes MM-estimates of multivariate location and scatter starting from an initial S-estimate.
An S4 object of class CovMMest-class
which is a subclass of the
virtual class CovRobust-class
.
Valentin Todorov [email protected]
Tatsuoka, K.S. and Tyler, D.E. (2000). The uniqueness of S and M-functionals under non-elliptical distributions. Annals of Statistics 28, 1219–1243
M. Salibian-Barrera, S. Van Aelstt and G. Willems (2006). Principal components analysis based on multivariate MM-estimators with fast and robust bootstrap. Journal of the American Statistical Association 101, 1198–1211.
R. A. Maronna, D. Martin and V. Yohai (2006). Robust Statistics: Theory and Methods. Wiley, New York.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
library(rrcov) data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) CovMMest(hbk.x) ## the following four statements are equivalent c0 <- CovMMest(hbk.x) c1 <- CovMMest(hbk.x, bdp = 0.25) c2 <- CovMMest(hbk.x, control = CovControlMMest(bdp = 0.25)) c3 <- CovMMest(hbk.x, control = new("CovControlMMest", bdp = 0.25)) ## direct specification overrides control one: c4 <- CovMMest(hbk.x, bdp = 0.40, control = CovControlMMest(bdp = 0.25)) c1 summary(c1) plot(c1) ## Deterministic MM-estmates CovMMest(hbk.x, control=CovControlMMest(sest=CovControlSest(method="sdet")))
library(rrcov) data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) CovMMest(hbk.x) ## the following four statements are equivalent c0 <- CovMMest(hbk.x) c1 <- CovMMest(hbk.x, bdp = 0.25) c2 <- CovMMest(hbk.x, control = CovControlMMest(bdp = 0.25)) c3 <- CovMMest(hbk.x, control = new("CovControlMMest", bdp = 0.25)) ## direct specification overrides control one: c4 <- CovMMest(hbk.x, bdp = 0.40, control = CovControlMMest(bdp = 0.25)) c1 summary(c1) plot(c1) ## Deterministic MM-estmates CovMMest(hbk.x, control=CovControlMMest(sest=CovControlSest(method="sdet")))
This class, derived from the virtual class "CovRobust"
accomodates MM Estimates of multivariate location and scatter.
Objects can be created by calls of the form new("CovMMest", ...)
,
but the usual way of creating CovSest
objects is a call to the function
CovMMest
which serves as a constructor.
det
, flag
, iter
, crit
:from the
"CovRobust"
class.
tuning parameter of the loss function for MM-estimation
(depend on control parameters eff
and eff.shape
).
Can be computed by the internal function
.csolve.bw.MM(p, eff, eff.shape=TRUE)
.
For the tuning parameters of the underlying S-estimate see the slot sest
and
"CovSest"
.
an CovSest
object containing the initial S-estimate.
call
, cov
, center
,
n.obs
, mah
, method
,
singularity
, X
:from the "Cov"
class.
Class "CovRobust"
, directly.
Class "Cov"
, by class "CovRobust"
.
No methods defined with class "CovMMest" in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
CovMMest
, Cov-class
, CovRobust-class
showClass("CovMMest")
showClass("CovMMest")
Computes a robust multivariate location and scatter estimate with a high breakdown point, using the Minimum Regularized Covariance Determonant (MRCD) estimator.
CovMrcd(x, alpha=control@alpha, h=control@h, maxcsteps=control@maxcsteps, initHsets=NULL, save.hsets=FALSE, rho=control@rho, target=control@target, maxcond=control@maxcond, trace=control@trace, control=CovControlMrcd())
CovMrcd(x, alpha=control@alpha, h=control@h, maxcsteps=control@maxcsteps, initHsets=NULL, save.hsets=FALSE, rho=control@rho, target=control@target, maxcond=control@maxcond, trace=control@trace, control=CovControlMrcd())
x |
a matrix or data frame. |
alpha |
numeric parameter controlling the size of the subsets
over which the determinant is minimized, i.e., |
h |
the size of the subset (can be between ceiling(n/2) and n).
Normally NULL and then it |
maxcsteps |
maximal number of concentration steps in the deterministic MCD; should not be reached. |
initHsets |
NULL or a |
save.hsets |
(for deterministic MCD) logical indicating if the
initial subsets should be returned as |
rho |
regularization parameter. Normally NULL and will be estimated from the data. |
target |
structure of the robust positive definite target matrix:
a) "identity": target matrix is diagonal matrix with robustly estimated
univariate scales on the diagonal or b) "equicorrelation": non-diagonal
target matrix that incorporates an equicorrelation structure
(see (17) in paper). Default is |
maxcond |
maximum condition number allowed
(see step 3.4 in algorithm 1). Default is |
trace |
whether to print intermediate results. Default is |
control |
a control object (S4) of class |
This function computes the minimum regularized covariance determinant estimator (MRCD)
of location and scatter and returns an S4 object of class
CovMrcd-class
containing the estimates.
Similarly like the MCD method, MRCD looks for the
observations (out of
) whose classical
covariance matrix has the lowest possible determinant, but
replaces the subset-based covariance by a regularized
covariance estimate, defined as a weighted average of the
sample covariance of the h-subset and a predetermined
positive definite target matrix. The Minimum Regularized Covariance
Determinant (MRCD) estimator is then the regularized covariance
based on the h-subset which makes the overall determinant the smallest.
A data-driven procedure sets the weight of the target matrix (
rho
), so that
the regularization is only used when needed.
An S4 object of class CovMrcd-class
which is a subclass of the
virtual class CovRobust-class
.
Kris Boudt, Peter Rousseeuw, Steven Vanduffel and Tim Verdonk. Improved by Joachim Schreurs and Iwein Vranckx. Adapted for rrcov by Valentin Todorov [email protected]
Kris Boudt, Peter Rousseeuw, Steven Vanduffel and Tim Verdonck (2020) The Minimum Regularized Covariance Determinant estimator, Statistics and Computing, 30, pp 113–128 doi:10.1007/s11222-019-09869-x.
Mia Hubert, Peter Rousseeuw and Tim Verdonck (2012) A deterministic algorithm for robust location and scatter. Journal of Computational and Graphical Statistics 21(3), 618–637.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## The result will be (almost) identical to the raw MCD ## (since we do not do reweighting of MRCD) ## data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) c0 <- CovMcd(hbk.x, alpha=0.75, use.correction=FALSE) cc <- CovMrcd(hbk.x, alpha=0.75) cc$rho all.equal(c0$best, cc$best) all.equal(c0$raw.center, cc$center) all.equal(c0$raw.cov/c0$raw.cnp2[1], cc$cov/cc$cnp2) summary(cc) ## the following three statements are equivalent c1 <- CovMrcd(hbk.x, alpha = 0.75) c2 <- CovMrcd(hbk.x, control = CovControlMrcd(alpha = 0.75)) ## direct specification overrides control one: c3 <- CovMrcd(hbk.x, alpha = 0.75, control = CovControlMrcd(alpha=0.95)) c1 ## Not run: ## This is the first example from Boudt et al. (2020). The first variable is ## the dependent one, which we remove and remain with p=226 NIR absorbance spectra data(octane) octane <- octane[, -1] # remove the dependent variable y n <- nrow(octane) p <- ncol(octane) ## Compute MRCD with h=33, which gives approximately 15 percent breakdown point. ## This value of h was found by Boudt et al. (2020) using a data driven approach, ## similar to the Forward Search of Atkinson et al. (2004). ## The default value of h would be 20 (i.e. alpha=0.5) out <- CovMrcd(octane, h=33) out$rho ## Please note that in the paper is indicated that the obtained rho=0.1149, however, ## this value of rho is obtained if the parameter maxcond is set equal to 999 (this was ## the default in an earlier version of the software, now the default is maxcond=50). ## To reproduce the result from the paper, change the call to CovMrcd() as follows ## (this will not influence the results shown further): ## out <- CovMrcd(octane, h=33, maxcond=999) ## out$rho robpca = PcaHubert(octane, k=2, alpha=0.75, mcd=FALSE) (outl.robpca = which(robpca@flag==FALSE)) # Observations flagged as outliers by ROBPCA: # 25, 26, 36, 37, 38, 39 # Plot the orthogonal distances versus the score distances: pch = rep(20,n); pch[robpca@flag==FALSE] = 17 col = rep('black',n); col[robpca@flag==FALSE] = 'red' plot(robpca, pch=pch, col=col, id.n.sd=6, id.n.od=6) ## Plot now the MRCD mahalanobis distances pch = rep(20,n); pch[!getFlag(out)] = 17 col = rep('black',n); col[!getFlag(out)] = 'red' plot(out, pch=pch, col=col, id.n=6) ## End(Not run)
## The result will be (almost) identical to the raw MCD ## (since we do not do reweighting of MRCD) ## data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) c0 <- CovMcd(hbk.x, alpha=0.75, use.correction=FALSE) cc <- CovMrcd(hbk.x, alpha=0.75) cc$rho all.equal(c0$best, cc$best) all.equal(c0$raw.center, cc$center) all.equal(c0$raw.cov/c0$raw.cnp2[1], cc$cov/cc$cnp2) summary(cc) ## the following three statements are equivalent c1 <- CovMrcd(hbk.x, alpha = 0.75) c2 <- CovMrcd(hbk.x, control = CovControlMrcd(alpha = 0.75)) ## direct specification overrides control one: c3 <- CovMrcd(hbk.x, alpha = 0.75, control = CovControlMrcd(alpha=0.95)) c1 ## Not run: ## This is the first example from Boudt et al. (2020). The first variable is ## the dependent one, which we remove and remain with p=226 NIR absorbance spectra data(octane) octane <- octane[, -1] # remove the dependent variable y n <- nrow(octane) p <- ncol(octane) ## Compute MRCD with h=33, which gives approximately 15 percent breakdown point. ## This value of h was found by Boudt et al. (2020) using a data driven approach, ## similar to the Forward Search of Atkinson et al. (2004). ## The default value of h would be 20 (i.e. alpha=0.5) out <- CovMrcd(octane, h=33) out$rho ## Please note that in the paper is indicated that the obtained rho=0.1149, however, ## this value of rho is obtained if the parameter maxcond is set equal to 999 (this was ## the default in an earlier version of the software, now the default is maxcond=50). ## To reproduce the result from the paper, change the call to CovMrcd() as follows ## (this will not influence the results shown further): ## out <- CovMrcd(octane, h=33, maxcond=999) ## out$rho robpca = PcaHubert(octane, k=2, alpha=0.75, mcd=FALSE) (outl.robpca = which(robpca@flag==FALSE)) # Observations flagged as outliers by ROBPCA: # 25, 26, 36, 37, 38, 39 # Plot the orthogonal distances versus the score distances: pch = rep(20,n); pch[robpca@flag==FALSE] = 17 col = rep('black',n); col[robpca@flag==FALSE] = 'red' plot(robpca, pch=pch, col=col, id.n.sd=6, id.n.od=6) ## Plot now the MRCD mahalanobis distances pch = rep(20,n); pch[!getFlag(out)] = 17 col = rep('black',n); col[!getFlag(out)] = 'red' plot(out, pch=pch, col=col, id.n=6) ## End(Not run)
This class, derived from the virtual class "CovRobust"
accomodates
MRCD Estimates of multivariate location and scatter computed by a variant of the
‘Fast MCD’ algorithm.
Objects can be created by calls of the form new("CovMrcd", ...)
,
but the usual way of creating CovMrcd
objects is a call to the function
CovMrcd
which serves as a constructor.
alpha
:Object of class "numeric"
- the size of the
subsets over which the determinant is minimized (the default is (n+p+1)/2)
quan
:Object of class "numeric"
- the number of
observations on which the MCD is based. If quan
equals
n.obs
, the MCD is the classical covariance matrix.
best
:Object of class "Uvector"
- the best subset
found and used for computing the raw estimates. The size of best
is equal to quan
cnp2
:Object of class "numeric"
- containing the consistency
correction factor of the estimate of the covariance matrix.
icov
:The inverse of the covariance matrix.
rho
:The estimated regularization parameter.
target
:The estimated target matrix.
crit
:from the
"CovRobust"
class.
call
, cov
, center
,
n.obs
, mah
, method
,
X
:from the "Cov"
class.
Class "CovRobust"
, directly.
Class "Cov"
, by class "CovRobust"
.
No methods defined with class "CovMrcd"
in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
CovMrcd
, Cov-class
, CovRobust-class
, CovMcd-class
showClass("CovMrcd")
showClass("CovMrcd")
Computes a robust multivariate location and scatter estimate with a high breakdown point, using the ‘MVE’ (Minimum Volume Ellipsoid) estimator.
CovMve(x, alpha = 1/2, nsamp = 500, seed = NULL, trace = FALSE, control)
CovMve(x, alpha = 1/2, nsamp = 500, seed = NULL, trace = FALSE, control)
x |
a matrix or data frame. |
alpha |
numeric parameter controlling the size of the subsets
over which the determinant is minimized, i.e., |
nsamp |
number of subsets used for initial estimates or |
seed |
starting value for random generator. Default is |
trace |
whether to print intermediate results. Default is |
control |
a control object (S4) of class |
This function computes the minimum volume ellipsoid estimator
of location and scatter and returns an S4 object of class
CovMve-class
containing the estimates.
The approximate estimate is
based on a subset of size alpha*n
with an enclosing ellipsoid of smallest volume.
The mean of the best found subset provides the raw estimate of the location,
and the rescaled covariance matrix is the raw estimate of scatter. The rescaling of
the raw covariance matrix is by median(dist)/qchisq(0.5, p)
and this scale factor
is returned in the slot raw.cnp2
. Currently no finite sample corrction factor is applied.
The Mahalanobis distances of all observations from the location estimate for the
raw covariance matrix are calculated, and those points within the 97.5
under Gaussian assumptions are declared to be good. The final (reweightd) estimates are the
mean and rescaled covariance of the good points. The reweighted covariance matrix is
rescaled by 1/pgamma(qchisq(alpha, p)/2, p/2 + 1)/alpha
(see Croux and Haesbroeck, 1999) and this scale factor is returned
in the slot cnp2
.
The search for the approximate solution is made over ellipsoids determined by the
covariance matrix of p+1
of the data points and applying
a simple but effective improvement of the subsampling procedure
as described in Maronna et al. (2006), p. 198.
Although there exists no formal proof of this improvement (as for MCD and LTS),
simulations show that it can be recommended as an approximation of the MVE.
An S4 object of class CovMve-class
which is a subclass of the
virtual class CovRobust-class
.
Main reason for implementing the MVE estimate was that it is the recommended
initial estimate for S estimation (see Maronna et al. (2006), p. 199) and will
be used by default in CovMest
(after removing the correction
factors from the covariance matrix and rescaling to determinant 1).
Valentin Todorov [email protected] and Matias Salibian-Barrera [email protected]
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection. Wiley.
C. Croux and G. Haesbroeck (1999). Influence function and efficiency of the minimum covariance determinant scatter matrix estimator. Journal of Multivariate Analysis, 71, 161–190.
R. A. Maronna, D. Martin and V. Yohai (2006). Robust Statistics: Theory and Methods. Wiley, New York.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
cov.rob
from package MASS
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) CovMve(hbk.x) ## the following three statements are equivalent c1 <- CovMve(hbk.x, alpha = 0.75) c2 <- CovMve(hbk.x, control = CovControlMve(alpha = 0.75)) ## direct specification overrides control one: c3 <- CovMve(hbk.x, alpha = 0.75, control = CovControlMve(alpha=0.95)) c1
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) CovMve(hbk.x) ## the following three statements are equivalent c1 <- CovMve(hbk.x, alpha = 0.75) c2 <- CovMve(hbk.x, control = CovControlMve(alpha = 0.75)) ## direct specification overrides control one: c3 <- CovMve(hbk.x, alpha = 0.75, control = CovControlMve(alpha=0.95)) c1
This class, derived from the virtual class "CovRobust"
accomodates
MVE Estimates of multivariate location and scatter computed by the
‘Fast MVE’ algorithm.
Objects can be created by calls of the form new("CovMve", ...)
,
but the usual way of creating CovMve
objects is a call to the function
CovMve
which serves as a constructor.
alpha
:Object of class "numeric"
- the size of the
subsets over which the volume of the ellipsoid is minimized (the default is (n+p+1)/2)
quan
:Object of class "numeric"
- the number of
observations on which the MVE is based. If quan
equals
n.obs
, the MVE is the classical covariance matrix.
best
:Object of class "Uvector"
- the best subset
found and used for computing the raw estimates. The size of best
is equal to quan
raw.cov
:Object of class "matrix"
the raw
(not reweighted) estimate of location
raw.center
:Object of class "vector"
- the raw
(not reweighted) estimate of scatter
raw.mah
:Object of class "Uvector"
- mahalanobis
distances of the observations based on the raw estimate of the
location and scatter
raw.wt
:Object of class "Uvector"
- weights of
the observations based on the raw estimate of the location and scatter
raw.cnp2
:Object of class "numeric"
- a vector of length
two containing the consistency correction factor and the finite sample
correction factor of the raw estimate of the covariance matrix
cnp2
:Object of class "numeric"
- a vector of length two
containing the consistency correction factor and the finite sample
correction factor of the final estimate of the covariance matrix.
iter
, crit
, wt
:from the
"CovRobust"
class.
call
, cov
, center
,
n.obs
, mah
, method
,
singularity
, X
:from the "Cov"
class.
Class "CovRobust"
, directly.
Class "Cov"
, by class "CovRobust"
.
No methods defined with class "CovMve" in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
CovMve
, Cov-class
, CovRobust-class
showClass("CovMve")
showClass("CovMve")
Computes a robust multivariate location and scatter estimate with a high breakdown point, using the pairwise algorithm proposed by Marona and Zamar (2002) which in turn is based on the pairwise robust estimator proposed by Gnanadesikan-Kettenring (1972).
CovOgk(x, niter = 2, beta = 0.9, control)
CovOgk(x, niter = 2, beta = 0.9, control)
x |
a matrix or data frame. |
niter |
number of iterations, usually 1 or 2 since iterations beyond the second do not lead to improvement. |
beta |
coverage parameter for the final reweighted estimate |
control |
a control object (S4) of class |
The method proposed by Marona and Zamar (2002) allowes to obtain
positive-definite and almost affine equivariant robust scatter matrices
starting from any pairwise robust scatter matrix. The default robust estimate
of covariance between two random vectors used is the one proposed by
Gnanadesikan and Kettenring (1972) but the user can choose any other method by
redefining the function in slot vrob
of the control object
CovControlOgk
. Similarly, the function for computing the robust
univariate location and dispersion used is the tau scale
defined
in Yohai and Zamar (1998) but it can be redefined in the control object.
The estimates obtained by the OGK method, similarly as in CovMcd
are returned
as 'raw' estimates. To improve the estimates a reweighting step is performed using
the coverage parameter beta
and these reweighted estimates are returned as
'final' estimates.
An S4 object of class CovOgk-class
which is a subclass of the
virtual class CovRobust-class
.
If the user does not specify a scale and covariance function to be used in
the computations or specifies one by using the arguments smrob
and svrob
(i.e. the names of the functions as strings), a native code written in C will be called which
is by far faster than the R version.
If the arguments mrob
and vrob
are not NULL, the specified functions
will be used via the pure R implementation of the algorithm. This could be quite slow.
See CovControlOgk
for details.
Valentin Todorov [email protected] and Kjell Konis [email protected]
Maronna, R.A. and Zamar, R.H. (2002) Robust estimates of location and dispersion of high-dimensional datasets; Technometrics 44(4), 307–317.
Yohai, R.A. and Zamar, R.H. (1998) High breakdown point estimates of regression by means of the minimization of efficient scale JASA 86, 403–413.
Gnanadesikan, R. and John R. Kettenring (1972) Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28, 81–124.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) CovOgk(hbk.x) ## the following three statements are equivalent c1 <- CovOgk(hbk.x, niter=1) c2 <- CovOgk(hbk.x, control = CovControlOgk(niter=1)) ## direct specification overrides control one: c3 <- CovOgk(hbk.x, beta=0.95, control = CovControlOgk(beta=0.99)) c1 x<-matrix(c(1,2,3,7,1,2,3,7), ncol=2) ## CovOgk(x) - this would fail because the two columns of x are exactly collinear. ## In order to fix it, redefine the default 'vrob' function for example ## in the following way and pass it as a parameter in the control ## object. cc <- CovOgk(x, control=new("CovControlOgk", vrob=function(x1, x2, ...) { r <- .vrobGK(x1, x2, ...) if(is.na(r)) r <- 0 r }) ) cc
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) CovOgk(hbk.x) ## the following three statements are equivalent c1 <- CovOgk(hbk.x, niter=1) c2 <- CovOgk(hbk.x, control = CovControlOgk(niter=1)) ## direct specification overrides control one: c3 <- CovOgk(hbk.x, beta=0.95, control = CovControlOgk(beta=0.99)) c1 x<-matrix(c(1,2,3,7,1,2,3,7), ncol=2) ## CovOgk(x) - this would fail because the two columns of x are exactly collinear. ## In order to fix it, redefine the default 'vrob' function for example ## in the following way and pass it as a parameter in the control ## object. cc <- CovOgk(x, control=new("CovControlOgk", vrob=function(x1, x2, ...) { r <- .vrobGK(x1, x2, ...) if(is.na(r)) r <- 0 r }) ) cc
This class, derived from the virtual class "CovRobust"
accomodates
OGK Estimates of multivariate location and scatter computed by the
algorithm proposed by Marona and Zamar (2002).
Objects can be created by calls of the form new("CovOgk", ...)
,
but the usual way of creating CovOgk
objects is a call to the function
CovOgk
which serves as a constructor.
raw.cov
:Object of class "matrix"
the raw
(not reweighted) estimate of covariance matrix
raw.center
:Object of class "vector"
- the raw
(not reweighted) estimate of the location vector
raw.mah
:Object of class "Uvector"
- mahalanobis
distances of the observations based on the raw estimate of the
location and scatter
raw.wt
:Object of class "Uvector"
- weights of
the observations based on the raw estimate of the location and scatter
iter
, crit
, wt
:from the
"CovRobust"
class.
call
, cov
, center
,
n.obs
, mah
, method
,
singularity
, X
:from the "Cov"
class.
Class "CovRobust"
, directly.
Class "Cov"
, by class "CovRobust"
.
No methods defined with class "CovOgk" in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
showClass("CovOgk")
showClass("CovOgk")
Computes a robust multivariate location and scatter estimate with a high breakdown point, using one of the available estimators.
CovRobust(x, control, na.action = na.fail)
CovRobust(x, control, na.action = na.fail)
x |
a matrix or data frame. |
control |
a control object (S4) for one of the available control classes,
e.g. |
na.action |
A function to specify the action to be taken if missing values are found. The default action is for the procedure to fail. An alternative is na.omit, which leads to rejection of cases with missing values on any required variable. |
This function simply calls the restimate
method of the control object
control
. If a character string naming an estimator is specified, a
new control object will be created and used (with default estimation options).
If this argument is missing or a character string
'auto' is specified, the function will select the robust estimator
according to the size of the dataset. If there are less than 1000
observations and less than 10 variables or less than 5000 observations
and less than 5 variables, Stahel-Donoho estimator will be used. Otherwise,
if there are less than 50000 observations either bisquare S-estimates
(for less than 10 variables) or Rocke type S-estimates (for 10 to 20 variables)
will be used. In both cases the S iteration starts at the initial MVE estimate.
And finally, if there are more than 50000 observations and/or more than 20 variables the Orthogonalized
Quadrant Correlation estimator (CovOgk
with the corresponding parameters) is used.
An object derived from a CovRobust
object, depending on the selected estimator.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) CovRobust(hbk.x) CovRobust(hbk.x, CovControlSest(method="bisquare"))
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) CovRobust(hbk.x) CovRobust(hbk.x, CovControlSest(method="bisquare"))
CovRobust
is a virtual base class used for deriving the concrete classes
representing different robust estimates of multivariate location and scatter. Here are implemeted the
standard methods common for all robust estimates like show
, summary
and plot
.
The derived classes can override these methods and can define new ones.
A virtual Class: No objects may be created from it.
iter
:number of iterations used to compute the estimates
crit
:value of the criterion function
wt
:weights
call
, cov
, center
,
n.obs
, mah
, method
,
singularity
, X
:from the "Cov"
class.
Class "Cov"
, directly.
signature(obj = "CovRobust")
: Will return FALSE, since this is a 'Robust' object
signature(obj = "CovRobust")
: Return the name of the particular robust method used (as a character string)
signature(object = "CovRobust")
: display the object
signature(x = "CovRobust")
: plot the object
signature(obj = "CovRobust")
: Return the object with the reweighted estimates replaced by the raw ones (only relevant for CovMcd, CovMve and CovOgk)
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
Cov-class
, CovMcd-class
, CovMest-class
, CovOgk-class
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) cv <- CovMest(hbk.x) # it is not possible to create an object of # class CovRobust, since it is a VIRTUAL class cv summary(cv) # summary method for class CovRobust plot(cv) # plot method for class CovRobust
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) cv <- CovMest(hbk.x) # it is not possible to create an object of # class CovRobust, since it is a VIRTUAL class cv summary(cv) # summary method for class CovRobust plot(cv) # plot method for class CovRobust
Compute a robust estimate of location and scale using the Stahel-Donoho projection based estimator
CovSde(x, nsamp, maxres, tune = 0.95, eps = 0.5, prob = 0.99, seed = NULL, trace = FALSE, control)
CovSde(x, nsamp, maxres, tune = 0.95, eps = 0.5, prob = 0.99, seed = NULL, trace = FALSE, control)
x |
a matrix or data frame. |
nsamp |
a positive integer giving the number of resamples required;
|
maxres |
a positive integer specifying the maximum number of
resamples to be performed including those that are discarded due to linearly
dependent subsamples. If |
tune |
a numeric value between 0 and 1 giving the fraction of the data to receive non-zero weight.
Defaults to |
prob |
a numeric value between 0 and 1 specifying the probability of high breakdown point;
used to compute |
eps |
a numeric value between 0 and 0.5 specifying the breakdown point; used to compute
|
seed |
starting value for random generator. Default is |
trace |
whether to print intermediate results. Default is |
control |
a control object (S4) of class |
The projection based Stahel-Donoho estimator posses very good statistical properties,
but it can be very slow if the number of variables is too large. It is recommended to use
this estimator if n <= 1000
and p<=10
or n <= 5000
and p<=5
.
The number of subsamples required is calculated to provide a breakdown point of
eps
with probability prob
and can reach values larger than
the larger integer value - in such case it is limited to .Machine$integer.max
.
Of course you could provide nsamp
in the call, i.e. nsamp=1000
but
this will not guarantee the required breakdown point of th eestimator.
For larger data sets it is better to use CovMcd
or CovOgk
.
If you use CovRobust
, the estimator will be selected automatically
according on the size of the data set.
An S4 object of class CovSde-class
which is a subclass of the
virtual class CovRobust-class
.
The Fortran code for the Stahel-Donoho method was taken almost with no changes from
package robust
which in turn has it from the Insightful Robust Library
(thanks to by Kjell Konis).
Valentin Todorov [email protected] and Kjell Konis [email protected]
R. A. Maronna and V.J. Yohai (1995) The Behavior of the Stahel-Donoho Robust Multivariate Estimator. Journal of the American Statistical Association 90 (429), 330–341.
R. A. Maronna, D. Martin and V. Yohai (2006). Robust Statistics: Theory and Methods. Wiley, New York.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) CovSde(hbk.x) ## the following four statements are equivalent c0 <- CovSde(hbk.x) c1 <- CovSde(hbk.x, nsamp=2000) c2 <- CovSde(hbk.x, control = CovControlSde(nsamp=2000)) c3 <- CovSde(hbk.x, control = new("CovControlSde", nsamp=2000)) ## direct specification overrides control one: c4 <- CovSde(hbk.x, nsamp=100, control = CovControlSde(nsamp=2000)) c1 summary(c1) plot(c1) ## Use the function CovRobust() - if no estimation method is ## specified, for small data sets CovSde() will be called cr <- CovRobust(hbk.x) cr
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) CovSde(hbk.x) ## the following four statements are equivalent c0 <- CovSde(hbk.x) c1 <- CovSde(hbk.x, nsamp=2000) c2 <- CovSde(hbk.x, control = CovControlSde(nsamp=2000)) c3 <- CovSde(hbk.x, control = new("CovControlSde", nsamp=2000)) ## direct specification overrides control one: c4 <- CovSde(hbk.x, nsamp=100, control = CovControlSde(nsamp=2000)) c1 summary(c1) plot(c1) ## Use the function CovRobust() - if no estimation method is ## specified, for small data sets CovSde() will be called cr <- CovRobust(hbk.x) cr
This class, derived from the virtual class "CovRobust"
accomodates Stahel-Donoho estimates of multivariate location and scatter.
Objects can be created by calls of the form new("CovSde", ...)
,
but the usual way of creating CovSde
objects is a call to the function
CovSde
which serves as a constructor.
iter
, crit
, wt
:from the
"CovRobust"
class.
call
, cov
, center
,
n.obs
, mah
, method
,
singularity
, X
:from the "Cov"
class.
Class "CovRobust"
, directly.
Class "Cov"
, by class "CovRobust"
.
No methods defined with class "CovSde" in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
CovSde
, Cov-class
, CovRobust-class
showClass("CovSde")
showClass("CovSde")
Computes S-Estimates of multivariate location and scatter based on Tukey's biweight function using a fast algorithm similar to the one proposed by Salibian-Barrera and Yohai (2006) for the case of regression. Alternativley, the Ruppert's SURREAL algorithm, bisquare or Rocke type estimation can be used.
CovSest(x, bdp = 0.5, arp = 0.1, eps = 1e-5, maxiter = 120, nsamp = 500, seed = NULL, trace = FALSE, tolSolve = 1e-14, scalefn, maxisteps=200, initHsets = NULL, save.hsets = missing(initHsets) || is.null(initHsets), method = c("sfast", "surreal", "bisquare", "rocke", "suser", "sdet"), control, t0, S0, initcontrol)
CovSest(x, bdp = 0.5, arp = 0.1, eps = 1e-5, maxiter = 120, nsamp = 500, seed = NULL, trace = FALSE, tolSolve = 1e-14, scalefn, maxisteps=200, initHsets = NULL, save.hsets = missing(initHsets) || is.null(initHsets), method = c("sfast", "surreal", "bisquare", "rocke", "suser", "sdet"), control, t0, S0, initcontrol)
x |
a matrix or data frame. |
bdp |
a numeric value specifying the required
breakdown point. Allowed values are between
|
arp |
a numeric value specifying the asympthotic
rejection point (for the Rocke type S estimates),
i.e. the fraction of points receiving zero
weight (see Rocke (1996)). Default is |
eps |
a numeric value specifying the
relative precision of the solution of the S-estimate
(bisquare and Rocke type). Default is to |
maxiter |
maximum number of iterations allowed
in the computation of the S-estimate (bisquare and Rocke type).
Default is |
nsamp |
the number of random subsets considered. The default is different for the different methods:
(i) for |
seed |
starting value for random generator. Default is |
trace |
whether to print intermediate results. Default is |
tolSolve |
numeric tolerance to be used for inversion
( |
scalefn |
|
maxisteps |
maximal number of concentration steps in the deterministic S-estimates; should not be reached. |
initHsets |
NULL or a |
save.hsets |
(for deterministic S-estimates) logical indicating if the
initial subsets should be returned as |
method |
Which algorithm to use: 'sfast'=C implementation of FAST-S, 'surreal'=SURREAL,
'bisquare', 'rocke'. The method 'suser' currently calls the R implementation of FAST-S
but in the future will allow the user to supply own |
control |
a control object (S4) of class |
t0 |
optional initial HBDP estimate for the center |
S0 |
optional initial HBDP estimate for the covariance matrix |
initcontrol |
optional control object to be used for computing the initial HBDP estimates |
Computes multivariate S-estimator of location and scatter. The computation will be performed by one of the following algorithms:
An algorithm similar to the one proposed by Salibian-Barrera and Yohai (2006) for the case of regression
Ruppert's SURREAL algorithm when method
is set to 'surreal'
Bisquare S-Estimate with method
set to 'bisquare'
Rocke type S-Estimate with method
set to 'rocke'
Except for the last algorithm, ROCKE, all other use Tukey biweight loss function.
The tuning parameters used in the loss function (as determined by bdp) are
returned in the slots cc
and kp
of the result object. They can be computed
by the internal function .csolve.bw.S(bdp, p)
.
An S4 object of class CovSest-class
which is a subclass of the
virtual class CovRobust-class
.
Valentin Todorov [email protected], Matias Salibian-Barrera [email protected] and Victor Yohai [email protected]. See also the code from Kristel Joossens, K.U. Leuven, Belgium and Ella Roelant, Ghent University, Belgium.
M. Hubert, P. Rousseeuw and T. Verdonck (2012) A deterministic algorithm for robust location and scatter. Journal of Computational and Graphical Statistics 21(3), 618–637.
M. Hubert, P. Rousseeuw, D. Vanpaemel and T. Verdonck (2015) The DetS and DetMM estimators for multivariate location and scatter. Computational Statistics and Data Analysis 81, 64–75.
H.P. Lopuhaä (1989) On the Relation between S-estimators and M-estimators of Multivariate Location and Covariance. Annals of Statistics 17 1662–1683.
D. Ruppert (1992) Computing S Estimators for Regression and Multivariate Location/Dispersion. Journal of Computational and Graphical Statistics 1 253–270.
M. Salibian-Barrera and V. Yohai (2006) A fast algorithm for S-regression estimates, Journal of Computational and Graphical Statistics, 15, 414–427.
R. A. Maronna, D. Martin and V. Yohai (2006). Robust Statistics: Theory and Methods. Wiley, New York.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
library(rrcov) data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) cc <- CovSest(hbk.x) cc ## summry and different types of plots summary(cc) plot(cc) plot(cc, which="dd") plot(cc, which="pairs") plot(cc, which="xydist") ## the following four statements are equivalent c0 <- CovSest(hbk.x) c1 <- CovSest(hbk.x, bdp = 0.25) c2 <- CovSest(hbk.x, control = CovControlSest(bdp = 0.25)) c3 <- CovSest(hbk.x, control = new("CovControlSest", bdp = 0.25)) ## direct specification overrides control one: c4 <- CovSest(hbk.x, bdp = 0.40, control = CovControlSest(bdp = 0.25)) c1 summary(c1) plot(c1) ## Use the SURREAL algorithm of Ruppert cr <- CovSest(hbk.x, method="surreal") cr ## Use Bisquare estimation cr <- CovSest(hbk.x, method="bisquare") cr ## Use Rocke type estimation cr <- CovSest(hbk.x, method="rocke") cr ## Use Deterministic estimation cr <- CovSest(hbk.x, method="sdet") cr
library(rrcov) data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) cc <- CovSest(hbk.x) cc ## summry and different types of plots summary(cc) plot(cc) plot(cc, which="dd") plot(cc, which="pairs") plot(cc, which="xydist") ## the following four statements are equivalent c0 <- CovSest(hbk.x) c1 <- CovSest(hbk.x, bdp = 0.25) c2 <- CovSest(hbk.x, control = CovControlSest(bdp = 0.25)) c3 <- CovSest(hbk.x, control = new("CovControlSest", bdp = 0.25)) ## direct specification overrides control one: c4 <- CovSest(hbk.x, bdp = 0.40, control = CovControlSest(bdp = 0.25)) c1 summary(c1) plot(c1) ## Use the SURREAL algorithm of Ruppert cr <- CovSest(hbk.x, method="surreal") cr ## Use Bisquare estimation cr <- CovSest(hbk.x, method="bisquare") cr ## Use Rocke type estimation cr <- CovSest(hbk.x, method="rocke") cr ## Use Deterministic estimation cr <- CovSest(hbk.x, method="sdet") cr
This class, derived from the virtual class "CovRobust"
accomodates S Estimates of multivariate location and scatter computed
by the ‘Fast S’ or ‘SURREAL’ algorithm.
Objects can be created by calls of the form new("CovSest", ...)
,
but the usual way of creating CovSest
objects is a call to the function
CovSest
which serves as a constructor.
iter
, crit
, wt
:from the
"CovRobust"
class.
iBest
, nsteps
, initHsets
:parameters for deterministic S-estimator (the best initial subset, number of concentration steps to convergence for each of the initial subsets, and the computed initial subsets, respectively).
cc
, kp
tuning parameters used in Tukey biweight
loss function, as determined by bdp. Can be computed by the internal function
.csolve.bw.S(bdp, p)
.
call
, cov
, center
,
n.obs
, mah
, method
,
singularity
, X
:from the "Cov"
class.
Class "CovRobust"
, directly.
Class "Cov"
, by class "CovRobust"
.
No methods defined with class "CovSest" in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
CovSest
, Cov-class
, CovRobust-class
showClass("CovSest")
showClass("CovSest")
The data set contains five measurements made on 145 non-obese adult patients classified into three groups.
The three primary variables are glucose intolerance (area under the straight line connecting glucose levels), insulin response to oral glucose (area under the straight line connecting insulin levels) and insulin resistance (measured by the steady state plasma glucose (SSPG) determined after chemical suppression of endogenous insulin secretion). Two additional variables, the relative weight and fasting plasma glucose, are also included.
Reaven and Miller, following Friedman and Rubin (1967), applied cluster analysis
to the three primary variables and identified three clusters: "normal",
"chemical diabetic", and "overt diabetic" subjects. The column group
contains the classifications of the subjects into these three groups,
obtained by current medical criteria.
data(diabetes)
data(diabetes)
A data frame with the following variables:
relative weight, expressed as the ratio of actual weight to expected weight, given the person's height.
fasting plasma glucose level.
area under plasma glucose curve after a three hour oral glucose tolerance test (OGTT).
area under plasma insulin curve after a three hour oral glucose tolerance test (OGTT).
Steady state plasma glucose, a measure of insulin resistance.
the type of diabetes: a factor with levels normal
, chemical
and overt
.
Reaven, G. M. and Miller, R. G. (1979). An attempt to define the nature of chemical diabetes using a multidimensional analysis. Diabetologia 16, 17–24. Andrews, D. F. and Herzberg, A. M. (1985). Data: A Collection of Problems from Many Fields for the Student and Research Worker, Springer-Verlag, Ch. 36.
Reaven, G. M. and Miller, R. G. (1979). An attempt to define the nature of chemical diabetes using a multidimensional analysis. Diabetologia 16, 17–24.
Friedman, H. P. and Rubin, J. (1967). On some invariant criteria for grouping data. Journal of the American Statistical Association 62, 1159–1178.
Hawkins, D. M. and McLachlan, G. J., 1997. High-breakdown linear discriminant analysis. Journal of the American Statistical Association 92 (437), 136–143.
data(diabetes) (cc <- Linda(group~insulin+glucose+sspg, data=diabetes)) (pr <- predict(cc))
data(diabetes) (cc <- Linda(group~insulin+glucose+sspg, data=diabetes)) (pr <- predict(cc))
The Fish Catch data set contains measurements on 159 fish caught in the lake Laengelmavesi, Finland.
data(fish)
data(fish)
A data frame with 159 observations on the following 7 variables.
Weight
Weight of the fish (in grams)
Length1
Length from the nose to the beginning of the tail (in cm)
Length2
Length from the nose to the notch of the tail (in cm)
Length3
Length from the nose to the end of the tail (in cm)
Height
Maximal height as % of Length3
Width
Maximal width as % of Length3
Species
Species
The Fish Catch data set contains measurements on 159 fish caught in the lake Laengelmavesi, Finland. For the 159 fishes of 7 species the weight, length, height, and width were measured. Three different length measurements are recorded: from the nose of the fish to the beginning of its tail, from the nose to the notch of its tail and from the nose to the end of its tail. The height and width are calculated as percentages of the third length variable. This results in 6 observed variables, Weight, Length1, Length2, Length3, Height, Width. Observation 14 has a missing value in variable Weight, therefore this observation is usually excluded from the analysis. The last variable, Species, represents the grouping structure: the 7 species are 1=Bream, 2=Whitewish, 3=Roach, 4=Parkki, 5=Smelt, 6=Pike, 7=Perch. This data set was also analyzed in the context of robust Linear Discriminant Analysis by Todorov (2007), Todorov and Pires (2007).
Journal of Statistical Education, Fish Catch Data Set, [https://jse.amstat.org/datasets/fishcatch.dat.txt] accessed November, 2023.
Todorov, V. (2007 Robust selection of variables in linear discriminant analysis, Statistical Methods and Applications, 15, 395–407, doi:10.1007/s10260-006-0032-6.
Todorov, V. and Pires, A.M. (2007) Comparative performance of several robust linear discriminant analysis methods, REVSTAT Statistical Journal, 5, 63–83.
data(fish) # remove observation #14 containing missing value fish <- fish[-14,] # The height and width are calculated as percentages # of the third length variable fish[,5] <- fish[,5]*fish[,4]/100 fish[,6] <- fish[,6]*fish[,4]/100 # plot a matrix of scatterplots pairs(fish[1:6], main="Fish Catch Data", pch=21, bg=c("red", "green3", "blue", "yellow", "magenta", "violet", "turquoise")[unclass(fish$Species)])
data(fish) # remove observation #14 containing missing value fish <- fish[-14,] # The height and width are calculated as percentages # of the third length variable fish[,5] <- fish[,5]*fish[,4]/100 fish[,6] <- fish[,6]*fish[,4]/100 # plot a matrix of scatterplots pairs(fish[1:6], main="Fish Catch Data", pch=21, bg=c("red", "green3", "blue", "yellow", "magenta", "violet", "turquoise")[unclass(fish$Species)])
A data set that contains the spectra of six different cultivars of the same fruit (cantaloupe - Cucumis melo L. Cantaloupensis group) obtained from Colin Greensill (Faculty of Engineering and Physical Systems, Central Queensland University, Rockhampton, Australia). The total data set contained 2818 spectra measured in 256 wavelengths. For illustrative purposes are considered only three cultivars out of it, named D, M and HA with sizes 490, 106 and 500, respectively. Thus the data set thus contains 1096 observations. For more details about this data set see the references below.
data(fruit)
data(fruit)
A data frame with 1096 rows and 257 variables (one grouping variable – cultivar
– and 256 measurement variables).
Colin Greensill (Faculty of Engineering and Physical Systems, Central Queensland University, Rockhampton, Australia).
Hubert, M. and Van Driessen, K., (2004). Fast and robust discriminant analysis. Computational Statistics and Data Analysis, 45(2):301–320. doi:10.1016/S0167-9473(02)00299-2.
Vanden Branden, K and Hubert, M, (2005). Robust classification in high dimensions based on the SIMCA Method. Chemometrics and Intelligent Laboratory Systems, 79(1-2), pp. 10–21. doi:10.1016/j.chemolab.2005.03.002.
Hubert, M, Rousseeuw, PJ and Verdonck, T, (2012). A Deterministic Algorithm for Robust Location and Scatter. Journal of Computational and Graphical Statistics, 21(3), pp 618–637. doi:10.1080/10618600.2012.672100.
data(fruit) table(fruit$cultivar)
data(fruit) table(fruit$cultivar)
Accessor methods to the slots of objects of classCov
and its subclasses
getCenter(obj) getCov(obj) getCorr(obj) getData(obj) getDistance(obj) getEvals(obj) getDet(obj) getShape(obj) getFlag(obj, prob=0.975) getMeth(obj) isClassic(obj) getRaw(obj)
getCenter(obj) getCov(obj) getCorr(obj) getData(obj) getDistance(obj) getEvals(obj) getDet(obj) getShape(obj) getFlag(obj, prob=0.975) getMeth(obj) isClassic(obj) getRaw(obj)
obj |
an object of class |
prob |
optional argument for |
generic functions - see getCenter
, getCov
, getCorr
, getData
, getDistance
, getEvals
, getDet
, getShape
, getFlag
, isClassic
generic functions - see getCenter
, getCov
, getCorr
, getData
, getDistance
, getEvals
, getDet
, getShape
, getFlag
, getMeth
, isClassic
A simple function to calculate the points of
a confidence ellipsoid, by default dist=qchisq(0.975, 2)
getEllipse(loc = c(0, 0), cov = matrix(c(1, 0, 0, 1), ncol = 2), crit = 0.975)
getEllipse(loc = c(0, 0), cov = matrix(c(1, 0, 0, 1), ncol = 2), crit = 0.975)
loc |
location vector |
cov |
a |
crit |
the confidence level, default is |
A matrix with two columns containing the calculated points.
Valentin Todorov, [email protected]
data(hbk) cc <- cov.wt(hbk) e1 <- getEllipse(loc=cc$center[1:2], cov=cc$cov[1:2,1:2]) e2 <- getEllipse(loc=cc$center[1:2], cov=cc$cov[1:2,1:2], crit=0.99) plot(X2~X1, data=hbk, xlim=c(min(X1, e1[,1], e2[,1]), max(X1,e1[,1], e2[,1])), ylim=c(min(X2, e1[,2], e2[,2]), max(X2,e1[,2], e2[,2]))) lines(e1, type="l", lty=1, col="red") lines(e2, type="l", lty=2, col="blue") legend("topleft", legend=c(0.975, 0.99), lty=1:2, col=c("red", "blue"))
data(hbk) cc <- cov.wt(hbk) e1 <- getEllipse(loc=cc$center[1:2], cov=cc$cov[1:2,1:2]) e2 <- getEllipse(loc=cc$center[1:2], cov=cc$cov[1:2,1:2], crit=0.99) plot(X2~X1, data=hbk, xlim=c(min(X1, e1[,1], e2[,1]), max(X1,e1[,1], e2[,1])), ylim=c(min(X2, e1[,2], e2[,2]), max(X2,e1[,2], e2[,2]))) lines(e1, type="l", lty=1, col="red") lines(e2, type="l", lty=2, col="blue") legend("topleft", legend=c(0.975, 0.99), lty=1:2, col=c("red", "blue"))
Accessor methods to the slots of objects of class Pca
and its subclasses
obj |
an object of class |
Accessors for object of class Pca
Accessors for object of class PcaRobust
Accessors for object of class PcaClassic
The hemophilia data set contains two measured variables on 75 women, belonging to two groups: n1=30 of them are non-carriers (normal group) and n2=45 are known hemophilia A carriers (obligatory carriers).
data(hemophilia)
data(hemophilia)
A data frame with 75 observations on the following 3 variables.
AHFactivity
AHF activity
AHFantigen
AHF antigen
gr
group - normal or obligatory carrier
Originally analized in the context of discriminant analysis by Habemma and Hermans (1974). The objective is to find a procedure for detecting potential hemophilia A carriers on the basis of two measured variables: X1=log10(AHV activity) and X2=log10(AHV-like antigen). The first group of n1=30 women consists of known non-carriers (normal group) and the second group of n2=45 women is selected from known hemophilia A carriers (obligatory carriers). This data set was also analyzed by Johnson and Wichern (1998) as well as, in the context of robust Linear Discriminant Analysis by Hawkins and McLachlan (1997) and Hubert and Van Driessen (2004).
Habemma, J.D.F, Hermans, J. and van den Broek, K. (1974) Stepwise Discriminant Analysis Program Using Density Estimation in Proceedings in Computational statistics, COMPSTAT'1974 (Physica Verlag, Heidelberg, 1974, pp 101–110).
Johnson, R.A. and Wichern, D. W. Applied Multivariate Statistical Analysis (Prentice Hall, International Editions, 2002, fifth edition)
Hawkins, D. M. and McLachlan, G.J. (1997) High-Breakdown Linear Discriminant Analysis J. Amer. Statist. Assoc. 92 136–143.
Hubert, M., Van Driessen, K. (2004) Fast and robust discriminant analysis, Computational Statistics and Data Analysis, 45 301–320.
data(hemophilia) plot(AHFantigen~AHFactivity, data=hemophilia, col=as.numeric(as.factor(gr))+1) ## ## Compute robust location and covariance matrix and ## plot the tolerance ellipses (mcd <- CovMcd(hemophilia[,1:2])) col <- ifelse(hemophilia$gr == "carrier", 2, 3) ## define clours for the groups plot(mcd, which="tolEllipsePlot", class=TRUE, col=col)
data(hemophilia) plot(AHFantigen~AHFactivity, data=hemophilia, col=as.numeric(as.factor(gr))+1) ## ## Compute robust location and covariance matrix and ## plot the tolerance ellipses (mcd <- CovMcd(hemophilia[,1:2])) col <- ifelse(hemophilia$gr == "carrier", 2, 3) ## define clours for the groups plot(mcd, which="tolEllipsePlot", class=TRUE, col=col)
”This radar data was collected by a system in Goose Bay, Labrador. This system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The targets were free electrons in the ionosphere. "good" radar returns are those showing evidence of some type of structure in the ionosphere. "bad" returns are those that do not; their signals pass through the ionosphere. Received signals were processed using an autocorrelation function whose arguments are the time of a pulse and the pulse number. There were 17 described by 2 attributes per pulse number, corresponding to the complex values returned by the function resulting from the complex electromagnetic signal.” [UCI archive]
data(ionosphere)
data(ionosphere)
A data frame with 351 rows and 33 variables: 32 measurements and one
(the last, Class
) grouping variable: 225 'good'
and 126 'bad'
.
The original dataset at UCI contains 351 rows and 35 columns. The first 34 columns are features, the last column contains the classification label of 'g' and 'b'. The first feature is binary and the second one is only 0s, one grouping variable - factor with labels 'good' and 'bad'.
Source: Space Physics Group; Applied Physics Laboratory; Johns Hopkins University; Johns Hopkins Road; Laurel; MD 20723
Donor: Vince Sigillito ([email protected])
The data have been taken from the UCI Repository Of Machine Learning Databases at https://archive.ics.uci.edu/ml/datasets/ionosphere
This data set, with the original 34 features is available in the package mlbench
and a different data set (refering to the same UCI repository) is available in
the package dprep
(archived on CRAN).
Sigillito, V. G., Wing, S. P., Hutton, L. V., and Baker, K. B. (1989). Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest, 10, 262-266.
data(ionosphere) ionosphere[, 1:6] |> pairs()
data(ionosphere) ionosphere[, 1:6] |> pairs()
Returns TRUE if the covariance matrix contained in a Cov-class
object (or derived from) is singular.
## S4 method for signature 'Cov' isSingular(obj)
## S4 method for signature 'Cov' isSingular(obj)
obj |
an object of class (derived from) |
signature(x = Cov)
: Check if a covariance matrix
(object of class Cov-class
) is singular.
Cov-class
,
CovClassic
,
CovRobust-class
.
data(hbk) cc <- CovClassic(hbk) isSingular(cc)
data(hbk) cc <- CovClassic(hbk) isSingular(cc)
The class Lda
serves as a base class for deriving
all other classes representing the results of classical
and robust Linear Discriminant Analisys methods
A virtual Class: No objects may be created from it.
call
:the (matched) function call.
prior
:prior probabilities used, default to group proportions
counts
:number of observations in each class
center
:the group means
cov
:the common covariance matrix
ldf
:a matrix containing the linear discriminant functions
ldfconst
:a vector containing the constants of each linear discriminant function
method
:a character string giving the estimation method used
X
:the training data set (same as the input parameter x of the constructor function)
grp
:grouping variable: a factor specifying the class for each observation.
covobj
:object of class "Cov"
containing the estimate
of the common covariance matrix of the centered data. It is not NULL
only in case of method "B".
control
:object of class "CovControl"
specifying which estimate
and with what estimation options to use for the group means and common covariance
(or NULL
for classical linear discriminant analysis)
signature(object = "Lda")
: calculates prediction using the results in
object
. An optional data frame or matrix in which to look for variables with which
to predict. If omitted, the training data set is used. If the original fit used a formula or
a data frame or a matrix with column names, newdata must contain columns with the
same names. Otherwise it must contain the same number of columns,
to be used in the same order.
signature(object = "Lda")
: prints the results
signature(object = "Lda")
: prints summary information
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
LdaClassic
, LdaClassic-class
, LdaRobust-class
showClass("Lda")
showClass("Lda")
Performs a linear discriminant analysis and returns the results as an object of class LdaClassic
(aka constructor).
LdaClassic(x, ...) ## Default S3 method: LdaClassic(x, grouping, prior = proportions, tol = 1.0e-4, ...)
LdaClassic(x, ...) ## Default S3 method: LdaClassic(x, grouping, prior = proportions, tol = 1.0e-4, ...)
x |
a matrix or data frame containing the explanatory variables (training set). |
grouping |
grouping variable: a factor specifying the class for each observation. |
prior |
prior probabilities, default to the class proportions for the training set. |
tol |
tolerance |
... |
arguments passed to or from other methods. |
Returns an S4 object of class LdaClassic
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## Example anorexia library(MASS) data(anorexia) ## rrcov: LdaClassic() lda <- LdaClassic(Treat~., data=anorexia) predict(lda)@classification ## MASS: lda() lda.MASS <- lda(Treat~., data=anorexia) predict(lda.MASS)$class ## Compare the prediction results of MASS:::lda() and LdaClassic() all.equal(predict(lda)@classification, predict(lda.MASS)$class)
## Example anorexia library(MASS) data(anorexia) ## rrcov: LdaClassic() lda <- LdaClassic(Treat~., data=anorexia) predict(lda)@classification ## MASS: lda() lda.MASS <- lda(Treat~., data=anorexia) predict(lda.MASS)$class ## Compare the prediction results of MASS:::lda() and LdaClassic() all.equal(predict(lda)@classification, predict(lda.MASS)$class)
Contains the results of a classical Linear Discriminant Analysis
Objects can be created by calls of the form new("LdaClassic", ...)
but the
usual way of creating LdaClassic
objects is a call to the function
LdaClassic
which serves as a constructor.
call
:The (matched) function call.
prior
:Prior probabilities used, default to group proportions
counts
:number of observations in each class
center
:the group means
cov
:the common covariance matrix
ldf
:a matrix containing the linear discriminant functions
ldfconst
:a vector containing the constants of each linear discriminant function
method
:a character string giving the estimation method used
X
:the training data set (same as the input parameter x of the constructor function)
grp
:grouping variable: a factor specifying the class for each observation.
Class "Lda"
, directly.
No methods defined with class "LdaClassic" in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
LdaRobust-class
, Lda-class
, LdaClassic
showClass("LdaClassic")
showClass("LdaClassic")
Performs robust linear discriminant analysis by the projection-pursuit approach -
proposed by Pires and Branco (2010) - and returns the results as an object of
class LdaPP
(aka constructor).
LdaPP(x, ...) ## S3 method for class 'formula' LdaPP(formula, data, subset, na.action, ...) ## Default S3 method: LdaPP(x, grouping, prior = proportions, tol = 1.0e-4, method = c("huber", "mad", "sest", "class"), optim = FALSE, trace=FALSE, ...)
LdaPP(x, ...) ## S3 method for class 'formula' LdaPP(formula, data, subset, na.action, ...) ## Default S3 method: LdaPP(x, grouping, prior = proportions, tol = 1.0e-4, method = c("huber", "mad", "sest", "class"), optim = FALSE, trace=FALSE, ...)
formula |
a formula of the form |
data |
an optional data frame (or similar: see
|
subset |
an optional vector used to select rows (observations) of the
data matrix |
na.action |
a function which indicates what should happen
when the data contain |
x |
a matrix or data frame containing the explanatory variables (training set). |
grouping |
grouping variable: a factor specifying the class for each observation. |
prior |
prior probabilities, default to the class proportions for the training set. |
tol |
tolerance |
method |
method |
optim |
wheather to perform the approximation using the Nelder and Mead simplex method
(see function |
trace |
whether to print intermediate results. Default is |
... |
arguments passed to or from other methods. |
Currently the algorithm is implemented only for binary classification and in the following will be assumed that only two groups are present.
The PP algorithm searches for low-dimensional projections of higher-dimensional
data where a projection index is maximized. Similar to the original Fisher's proposal
the squared standardized distance between the observations in the two groups is maximized.
Instead of the sample univariate mean and standard deviation (T,S)
robust
alternatives are used. These are selected through the argument method
and can be one of
the pair (T,S)
are the robust M-estimates of location and scale
(T,S)
are the Median and the Median Absolute Deviation
the pair (T,S)
are the robust S-estimates of location and scale
(T,S)
are the mean and the standard deviation.
The first approximation A1 to the solution is obtained by investigating
a finite number of candidate directions, the unit vectors defined
by all pairs of points such that one belongs to the first group
and the other to the second group. The found solution is stored in the slots
raw.ldf
and raw.ldfconst
.
The second approximation A2 (optional) is performed by
a numerical optimization algorithm using A1 as initial solution.
The Nelder and Mead method implemented in the function optim
is applied.
Whether this refinement will be used is controlled by the argument optim
.
If optim=TRUE
the result of the optimization is stored into the slots
ldf
and ldfconst
. Otherwise these slots are set equal to
raw.ldf
and raw.ldfconst
.
Returns an S4 object of class LdaPP-class
Still an experimental version! Only binary classification is supported.
Valentin Todorov [email protected] and Ana Pires [email protected]
Pires, A. M. and A. Branco, J. (2010) Projection-pursuit approach to robust linear discriminant analysis Journal Multivariate Analysis, Academic Press, Inc., 101, 2464–2485.
## ## Function to plot a LDA separation line ## lda.line <- function(lda, ...) { ab <- lda@ldf[1,] - lda@ldf[2,] cc <- lda@ldfconst[1] - lda@ldfconst[2] abline(a=-cc/ab[2], b=-ab[1]/ab[2],...) } data(pottery) x <- pottery[,c("MG", "CA")] grp <- pottery$origin col <- c(3,4) gcol <- ifelse(grp == "Attic", col[1], col[2]) gpch <- ifelse(grp == "Attic", 16, 1) ## ## Reproduce Fig. 2. from Pires and branco (2010) ## plot(CA~MG, data=pottery, col=gcol, pch=gpch) ## Not run: ppc <- LdaPP(x, grp, method="class", optim=TRUE) lda.line(ppc, col=1, lwd=2, lty=1) pph <- LdaPP(x, grp, method="huber",optim=TRUE) lda.line(pph, col=3, lty=3) pps <- LdaPP(x, grp, method="sest", optim=TRUE) lda.line(pps, col=4, lty=4) ppm <- LdaPP(x, grp, method="mad", optim=TRUE) lda.line(ppm, col=5, lty=5) rlda <- Linda(x, grp, method="mcd") lda.line(rlda, col=6, lty=1) fsa <- Linda(x, grp, method="fsa") lda.line(fsa, col=8, lty=6) ## Use the formula interface: ## LdaPP(origin~MG+CA, data=pottery) ## use the same two predictors LdaPP(origin~., data=pottery) ## use all predictor variables ## ## Predict method data(pottery) fit <- LdaPP(origin~., data = pottery) predict(fit) ## End(Not run)
## ## Function to plot a LDA separation line ## lda.line <- function(lda, ...) { ab <- lda@ldf[1,] - lda@ldf[2,] cc <- lda@ldfconst[1] - lda@ldfconst[2] abline(a=-cc/ab[2], b=-ab[1]/ab[2],...) } data(pottery) x <- pottery[,c("MG", "CA")] grp <- pottery$origin col <- c(3,4) gcol <- ifelse(grp == "Attic", col[1], col[2]) gpch <- ifelse(grp == "Attic", 16, 1) ## ## Reproduce Fig. 2. from Pires and branco (2010) ## plot(CA~MG, data=pottery, col=gcol, pch=gpch) ## Not run: ppc <- LdaPP(x, grp, method="class", optim=TRUE) lda.line(ppc, col=1, lwd=2, lty=1) pph <- LdaPP(x, grp, method="huber",optim=TRUE) lda.line(pph, col=3, lty=3) pps <- LdaPP(x, grp, method="sest", optim=TRUE) lda.line(pps, col=4, lty=4) ppm <- LdaPP(x, grp, method="mad", optim=TRUE) lda.line(ppm, col=5, lty=5) rlda <- Linda(x, grp, method="mcd") lda.line(rlda, col=6, lty=1) fsa <- Linda(x, grp, method="fsa") lda.line(fsa, col=8, lty=6) ## Use the formula interface: ## LdaPP(origin~MG+CA, data=pottery) ## use the same two predictors LdaPP(origin~., data=pottery) ## use all predictor variables ## ## Predict method data(pottery) fit <- LdaPP(origin~., data = pottery) predict(fit) ## End(Not run)
The class LdaPP
represents an algorithm for robust linear discriminant
analysis by projection-pursuit approach. The objects of class LdaPP
contain the results
of the robust linear discriminant analysis by projection-pursuit approach.
Objects can be created by calls of the form new("LdaPP", ...)
but the
usual way of creating LdaPP
objects is a call to the function
LdaPP
which serves as a constructor.
call
:The (matched) function call.
prior
:Prior probabilities used, default to group proportions
counts
:number of observations in each class
center
:the group means
cov
:the common covariance matrix
raw.ldf
:a matrix containing the raw linear discriminant functions - see Details in LdaPP
raw.ldfconst
:a vector containing the raw constants of each raw linear discriminant function - see Details in LdaPP
ldf
:a matrix containing the linear discriminant functions
ldfconst
:a vector containing the constants of each linear discriminant function
method
:a character string giving the estimation method used
X
:the training data set (same as the input parameter x of the constructor function)
grp
:grouping variable: a factor specifying the class for each observation.
Class "LdaRobust"
, directly.
Class "Lda"
, by class "LdaRobust", distance 2.
signature(object = "LdaPP")
: calculates prediction using the results in
object
. An optional data frame or matrix in which to look for variables with which
to predict. If omitted, the training data set is used. If the original fit used a formula or
a data frame or a matrix with column names, newdata must contain columns with the
same names. Otherwise it must contain the same number of columns,
to be used in the same order. If the argument raw=TRUE
is set the raw
(obtained by the first approximation algorithm) linear discriminant
function and constant will be used.
Valentin Todorov [email protected] and Ana Pires [email protected]
Pires, A. M. and A. Branco, J. (2010) Projection-pursuit approach to robust linear discriminant analysis Journal Multivariate Analysis, Academic Press, Inc., 101, 2464–2485.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
LdaRobust-class
, Lda-class
,
LdaClassic
, LdaClassic-class
,
Linda
, Linda-class
showClass("LdaPP")
showClass("LdaPP")
The class LdaRobust
searves as a base class for deriving all other
classes representing the results of the robust Linear Discriminant Analysis methods
A virtual Class: No objects may be created from it.
call
:The (matched) function call.
prior
:Prior probabilities used, default to group proportions
counts
:number of observations in each class
center
:the group means
cov
:the common covariance matrix
ldf
:a matrix containing the linear discriminant functions
ldfconst
:a vector containing the constants of each linear discriminant function
method
:a character string giving the estimation method used
X
:the training data set (same as the input parameter x of the constructor function)
grp
:grouping variable: a factor specifying the class for each observation.
Class "Lda"
, directly.
No methods defined with class "LdaRobust" in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
showClass("LdaRobust")
showClass("LdaRobust")
Robust linear discriminant analysis based on MCD and returns
the results as an object of class Linda
(aka constructor).
Linda(x, ...) ## Default S3 method: Linda(x, grouping, prior = proportions, tol = 1.0e-4, method = c("mcd", "mcdA", "mcdB", "mcdC", "fsa", "mrcd", "ogk"), alpha=0.5, l1med=FALSE, cov.control, trace=FALSE, ...)
Linda(x, ...) ## Default S3 method: Linda(x, grouping, prior = proportions, tol = 1.0e-4, method = c("mcd", "mcdA", "mcdB", "mcdC", "fsa", "mrcd", "ogk"), alpha=0.5, l1med=FALSE, cov.control, trace=FALSE, ...)
x |
a matrix or data frame containing the explanatory variables (training set). |
grouping |
grouping variable: a factor specifying the class for each observation. |
prior |
prior probabilities, default to the class proportions for the training set. |
tol |
tolerance |
method |
method |
alpha |
this parameter measures the fraction of outliers the algorithm should resist. In MCD alpha controls the size of the subsets over which the determinant is minimized, i.e. alpha*n observations are used for computing the determinant. Allowed values are between 0.5 and 1 and the default is 0.5. |
l1med |
whether to use L1 median (space median) instead of MCD to compute
the group means locations in order to center the data in methods |
cov.control |
specifies which covariance estimator to use by providing
a |
.
trace |
whether to print intermediate results. Default is |
... |
arguments passed to or from other methods |
details
Returns an S4 object of class Linda
Valentin Todorov [email protected]
Hawkins, D.M. and McLachlan, G.J. (1997) High-Breakdown Linear Discriminant Analysis, Journal of the American Statistical Association, 92, 136–143.
Todorov V. (2007) Robust selection of variables in linear discriminant analysis, Statistical Methods and Applications, 15, 395–407, doi:10.1007/s10260-006-0032-6.
Todorov, V. and Pires, A.M. (2007) Comparative Performance of Several Robust Linear Discriminant Analysis Methods. REVSTAT Statistical Journal, 5, p 63–83.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## Example anorexia library(MASS) data(anorexia) ## start with the classical estimates lda <- LdaClassic(Treat~., data=anorexia) predict(lda)@classification ## try now the robust LDA with the default method (MCD with pooled whitin cov matrix) rlda <- Linda(Treat~., data= anorexia) predict(rlda)@classification ## try the other methods Linda(Treat~., data= anorexia, method="mcdA") Linda(Treat~., data= anorexia, method="mcdB") Linda(Treat~., data= anorexia, method="mcdC") ## try the Hawkins&McLachlan method ## use the default method grp <- anorexia[,1] grp <- as.factor(grp) x <- anorexia[,2:3] Linda(x, grp, method="fsa") ## Do DA with Linda and method mcdB or mcdC, when some classes ## have very few observations. Use L1 median instead of MCD ## to compute the group means (l1med=TRUE). data(fish) # remove observation #14 containing missing value fish <- fish[-14,] # The height and width are calculated as percentages # of the third length variable fish[,5] <- fish[,5]*fish[,4]/100 fish[,6] <- fish[,6]*fish[,4]/100 table(fish$Species) Linda(Species~., data=fish, l1med=TRUE) Linda(Species~., data=fish, method="mcdC", l1med=TRUE)
## Example anorexia library(MASS) data(anorexia) ## start with the classical estimates lda <- LdaClassic(Treat~., data=anorexia) predict(lda)@classification ## try now the robust LDA with the default method (MCD with pooled whitin cov matrix) rlda <- Linda(Treat~., data= anorexia) predict(rlda)@classification ## try the other methods Linda(Treat~., data= anorexia, method="mcdA") Linda(Treat~., data= anorexia, method="mcdB") Linda(Treat~., data= anorexia, method="mcdC") ## try the Hawkins&McLachlan method ## use the default method grp <- anorexia[,1] grp <- as.factor(grp) x <- anorexia[,2:3] Linda(x, grp, method="fsa") ## Do DA with Linda and method mcdB or mcdC, when some classes ## have very few observations. Use L1 median instead of MCD ## to compute the group means (l1med=TRUE). data(fish) # remove observation #14 containing missing value fish <- fish[-14,] # The height and width are calculated as percentages # of the third length variable fish[,5] <- fish[,5]*fish[,4]/100 fish[,6] <- fish[,6]*fish[,4]/100 table(fish$Species) Linda(Species~., data=fish, l1med=TRUE) Linda(Species~., data=fish, method="mcdC", l1med=TRUE)
Robust linear discriminant analysis is performed by replacing the classical group means and withing group covariance matrix by robust equivalents based on MCD.
Objects can be created by calls of the form new("Linda", ...)
but the
usual way of creating Linda
objects is a call to the function
Linda
which serves as a constructor.
call
:The (matched) function call.
prior
:Prior probabilities used, default to group proportions
counts
:number of observations in each class
center
:the group means
cov
:the common covariance matrix
ldf
:a matrix containing the linear discriminant functions
ldfconst
:a vector containing the constants of each linear discriminant function
method
:a character string giving the estimation method used
X
:the training data set (same as the input parameter x of the constructor function)
grp
:grouping variable: a factor specifying the class for each observation.
l1med
:wheather L1 median was used to compute group means.
Class "LdaRobust"
, directly.
Class "Lda"
, by class "LdaRobust", distance 2.
No methods defined with class "Linda" in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
LdaRobust-class
, Lda-class
, LdaClassic
, LdaClassic-class
showClass("Linda")
showClass("Linda")
The data on annual maximum streamflow at 18 sites with smallest drainage area basin in southeastern USA contains the sample L-moments ratios (L-CV, L-skewness and L-kurtosis) as used by Hosking and Wallis (1997) to illustrate the discordancy measure in regional freqency analysis (RFA).
data(lmom32)
data(lmom32)
A data frame with 18 observations on the following 3 variables.
L-CV
L-coefficient of variation
L-skewness
L-coefficient of skewness
L-kurtosis
L-coefficient of kurtosis
The sample L-moment ratios (L-CV, L-skewness and L-kurtosis) of a site are regarded as a point in three dimensional space.
Hosking, J. R. M. and J. R. Wallis (1997), Regional Frequency Analysis: An Approach Based on L-moments. Cambridge University Press, p.49, Table 3.2
Neykov, N.M., Neytchev, P.N., Van Gelder, P.H.A.J.M. and Todorov V. (2007), Robust detection of discordant sites in regional frequency analysis, Water Resources Research, 43, W06417, doi:10.1029/2006WR005322
data(lmom32) # plot a matrix of scatterplots pairs(lmom32, main="Hosking and Wallis Data Set, Table 3.3", pch=21, bg=c("red", "green3", "blue")) mcd<-CovMcd(lmom32) mcd plot(mcd, which="dist", class=TRUE) plot(mcd, which="dd", class=TRUE) ## identify the discordant sites using robust distances and compare ## to the classical ones mcd <- CovMcd(lmom32) rd <- sqrt(getDistance(mcd)) ccov <- CovClassic(lmom32) cd <- sqrt(getDistance(ccov)) r.out <- which(rd > sqrt(qchisq(0.975,3))) c.out <- which(cd > sqrt(qchisq(0.975,3))) cat("Robust: ", length(r.out), " outliers: ", r.out,"\n") cat("Classical: ", length(c.out), " outliers: ", c.out,"\n")
data(lmom32) # plot a matrix of scatterplots pairs(lmom32, main="Hosking and Wallis Data Set, Table 3.3", pch=21, bg=c("red", "green3", "blue")) mcd<-CovMcd(lmom32) mcd plot(mcd, which="dist", class=TRUE) plot(mcd, which="dd", class=TRUE) ## identify the discordant sites using robust distances and compare ## to the classical ones mcd <- CovMcd(lmom32) rd <- sqrt(getDistance(mcd)) ccov <- CovClassic(lmom32) cd <- sqrt(getDistance(ccov)) r.out <- which(rd > sqrt(qchisq(0.975,3))) c.out <- which(cd > sqrt(qchisq(0.975,3))) cat("Robust: ", length(r.out), " outliers: ", r.out,"\n") cat("Classical: ", length(c.out), " outliers: ", c.out,"\n")
The data on annual maximum streamflow at 17 sites with largest drainage area basins in southeastern USA contains the sample L-moments ratios (L-CV, L-skewness and L-kurtosis) as used by Hosking and Wallis (1997) to illustrate the discordancy measure in regional freqency analysis (RFA).
data(lmom33)
data(lmom33)
A data frame with 17 observations on the following 3 variables.
L-CV
L-coefficient of variation
L-skewness
L-coefficient of skewness
L-kurtosis
L-coefficient of kurtosis
The sample L-moment ratios (L-CV, L-skewness and L-kurtosis) of a site are regarded as a point in three dimensional space.
Hosking, J. R. M. and J. R. Wallis (1997), Regional Frequency Analysis: An Approach Based on L-moments. Cambridge University Press, p.51, Table 3.3
Neykov, N.M., Neytchev, P.N., Van Gelder, P.H.A.J.M. and Todorov V. (2007), Robust detection of discordant sites in regional frequency analysis, Water Resources Research, 43, W06417, doi:10.1029/2006WR005322
data(lmom33) # plot a matrix of scatterplots pairs(lmom33, main="Hosking and Wallis Data Set, Table 3.3", pch=21, bg=c("red", "green3", "blue")) mcd<-CovMcd(lmom33) mcd plot(mcd, which="dist", class=TRUE) plot(mcd, which="dd", class=TRUE) ## identify the discordant sites using robust distances and compare ## to the classical ones mcd <- CovMcd(lmom33) rd <- sqrt(getDistance(mcd)) ccov <- CovClassic(lmom33) cd <- sqrt(getDistance(ccov)) r.out <- which(rd > sqrt(qchisq(0.975,3))) c.out <- which(cd > sqrt(qchisq(0.975,3))) cat("Robust: ", length(r.out), " outliers: ", r.out,"\n") cat("Classical: ", length(c.out), " outliers: ", c.out,"\n")
data(lmom33) # plot a matrix of scatterplots pairs(lmom33, main="Hosking and Wallis Data Set, Table 3.3", pch=21, bg=c("red", "green3", "blue")) mcd<-CovMcd(lmom33) mcd plot(mcd, which="dist", class=TRUE) plot(mcd, which="dd", class=TRUE) ## identify the discordant sites using robust distances and compare ## to the classical ones mcd <- CovMcd(lmom33) rd <- sqrt(getDistance(mcd)) ccov <- CovClassic(lmom33) cd <- sqrt(getDistance(ccov)) r.out <- which(rd > sqrt(qchisq(0.975,3))) c.out <- which(cd > sqrt(qchisq(0.975,3))) cat("Robust: ", length(r.out), " outliers: ", r.out,"\n") cat("Classical: ", length(c.out), " outliers: ", c.out,"\n")
A data set containing relative CPU performance data of 209 machines on 8 variables.
are predictive, one (PRP
) is the goal field and one (ERP
) is the
linear regression's guess. The estimated relative performance values were
estimated by the authors using a linear regression method. See their article
(Ein-Dor and Feldmesser, CACM 4/87, pp 308-317) for more details on how the
relative performance values were set.
data(machines)
data(machines)
A data frame with 209 rows and 8 variables The variables are as follows:
MMIN: minimum main memory in kilobytes (integer)
MMAX: maximum main memory in kilobytes (integer)
CACH: cache memory in kilobytes (integer)
CHMIN: minimum channels in units (integer)
CHMAX: maximum channels in units (integer)
PRP: published relative performance (integer)
ERP: estimated relative performance from the original article (integer)
Phillip Ein-Dor and Jacob Feldmesser (1987), Attributes of the performance of central processing units: A relative performance prediction model, Communications of the ACM, 30, 4, pp 308-317.
M. Hubert, P. J. Rousseeuw and T. Verdonck (2009), Robust PCA for skewed data and its outlier map, Computational Statistics & Data Analysis, 53, 2264–2274.
data(machines) ## Compute the medcouple of each variable of the Computer hardware data data.frame(MC=round(apply(machines, 2, mc),2)) ## Plot a pairwise scaterplot matrix pairs(machines[,1:6]) mcd <- CovMcd(machines[,1:6]) plot(mcd, which="pairs") ## Remove the rownames (too long) rownames(machines) <- NULL ## Start with robust PCA based on MCD (P << n) (pca1 <- PcaHubert(machines, k=3)) plot(pca1, main="ROBPCA-MCD", off=0.03) ## PCA with the projection algoritm of Hubert (pca2 <- PcaHubert(machines, k=3, mcd=FALSE)) plot(pca2, main="ROBPCA-SD", off=0.03) ## PCA with the adjusted for skewness algorithm of Hubert et al (2009) (pca3 <- PcaHubert(machines, k=3, mcd=FALSE, skew=TRUE)) plot(pca3, main="ROBPCA-AO", off=0.03)
data(machines) ## Compute the medcouple of each variable of the Computer hardware data data.frame(MC=round(apply(machines, 2, mc),2)) ## Plot a pairwise scaterplot matrix pairs(machines[,1:6]) mcd <- CovMcd(machines[,1:6]) plot(mcd, which="pairs") ## Remove the rownames (too long) rownames(machines) <- NULL ## Start with robust PCA based on MCD (P << n) (pca1 <- PcaHubert(machines, k=3)) plot(pca1, main="ROBPCA-MCD", off=0.03) ## PCA with the projection algoritm of Hubert (pca2 <- PcaHubert(machines, k=3, mcd=FALSE)) plot(pca2, main="ROBPCA-SD", off=0.03) ## PCA with the adjusted for skewness algorithm of Hubert et al (2009) (pca3 <- PcaHubert(machines, k=3, mcd=FALSE, skew=TRUE)) plot(pca3, main="ROBPCA-AO", off=0.03)
Simple artificial data set generated according the example by Marona and Yohai (1998). The data set consists of 20 bivariate normal observations generated with zero means, unit variances and correlation 0.8. The sample correlation is 0.81. Two outliers are introduced (i.e. these are 10% of the data) in the following way: two points are modified by interchanging the largest (observation 19) and smallest (observation 9) value of the first coordinate. The sample correlation becomes 0.05. This example provides a good example of the fact that a multivariate outlier need not be an outlier in any of its coordinate variables.
data(maryo)
data(maryo)
A data frame with 20 observations on 2 variables. To introduce the outliers x[9,1] with x[19,1] are interchanged.
R. A. Marona and V. J. Yohai (1998) Robust estimation of multivariate location and scatter. In Encyclopedia of Statistical Sciences, Updated Volume 2 (Eds. S.Kotz, C.Read and D.Banks). Wiley, New York p. 590
data(maryo) getCorr(CovClassic(maryo)) ## the sample correlation is 0.81 ## Modify 10%% of the data in the following way: ## modify two points (out of 20) by interchanging the ## largest and smallest value of the first coordinate imin <- which(maryo[,1]==min(maryo[,1])) # imin = 9 imax <- which(maryo[,1]==max(maryo[,1])) # imax = 19 maryo1 <- maryo maryo1[imin,1] <- maryo[imax,1] maryo1[imax,1] <- maryo[imin,1] ## The sample correlation becomes 0.05 plot(maryo1) getCorr(CovClassic(maryo1)) ## the sample correlation becomes 0.05 getCorr(CovMcd(maryo1)) ## the (reweighted) MCD correlation is 0.79
data(maryo) getCorr(CovClassic(maryo)) ## the sample correlation is 0.81 ## Modify 10%% of the data in the following way: ## modify two points (out of 20) by interchanging the ## largest and smallest value of the first coordinate imin <- which(maryo[,1]==min(maryo[,1])) # imin = 9 imax <- which(maryo[,1]==max(maryo[,1])) # imax = 19 maryo1 <- maryo maryo1[imin,1] <- maryo[imax,1] maryo1[imax,1] <- maryo[imin,1] ## The sample correlation becomes 0.05 plot(maryo1) getCorr(CovClassic(maryo1)) ## the sample correlation becomes 0.05 getCorr(CovMcd(maryo1)) ## the (reweighted) MCD correlation is 0.79
The octane data contains near infrared absorbance spectra (NIR) of
n=39
gasoline samples over p=226
wavelengths ranging
from 1102 nm to 1552 nm with measurements every two nanometers. For
each of the 39 production gasoline samples the octane number y
was measured.
Six of the samples (25, 26, and 36-39) contain added alcohol.
data(octane)
data(octane)
A data frame with 39 observations and 227 columns, the wavelengts are
by column and the first variable is the dependent one (y
).
K.H. Esbensen, S. Schoenkopf and T. Midtgaard Multivariate Analysis in Practice, Trondheim, Norway: Camo, 1994.
M. Hubert, P. J. Rousseeuw, K. Vanden Branden (2005), ROBPCA: a new approach to robust principal components analysis, Technometrics, 47, 64–79.
P. J. Rousseeuw, M. Debruyne, S. Engelen and M. Hubert (2006), Robustness and Outlier Detection in Chemometrics, Critical Reviews in Analytical Chemistry, 36(3–4), 221–242.
data(octane) octane <- octane[, -1] # remove the dependent variable y pca=PcaHubert(octane, k=10) screeplot(pca, type="lines") pca2 <- PcaHubert(octane, k=2) plot(pca2, id.n.sd=6) pca7 <- PcaHubert(octane, k=7) plot(pca7, id.n.sd=6)
data(octane) octane <- octane[, -1] # remove the dependent variable y pca=PcaHubert(octane, k=10) screeplot(pca, type="lines") pca2 <- PcaHubert(octane, k=2) plot(pca2, id.n.sd=6) pca7 <- PcaHubert(octane, k=7) plot(pca7, id.n.sd=6)
This dataset consists of 120 olive oil samples on measurements on 25 chemical compositions (fatty acids, sterols, triterpenic alcohols) of olive oils from Tuscany, Italy (Armanino et al. 1989). There are 4 classes corresponding to different production areas. Class 1, Class 2, Class 3, and Class 4 contain 50, 25, 34, and 11 observations, respectively.
data(olitos)
data(olitos)
A data frame with 120 observations on the following 26 variables.
X1
Free fatty acids
X2
Refractive index
X3
K268
X4
delta K
X5
Palmitic acid
X6
Palmitoleic acid
X7
a numeric vector
X8
a numeric vector
X9
a numeric vector
X10
a numeric vector
X11
a numeric vector
X12
a numeric vector
X13
a numeric vector
X14
a numeric vector
X15
a numeric vector
X16
a numeric vector
X17
a numeric vector
X18
a numeric vector
X19
a numeric vector
X20
a numeric vector
X21
a numeric vector
X22
a numeric vector
X23
a numeric vector
X24
a numeric vector
X25
a numeric vector
grp
a factor with levels 1
2
3
4
Prof. Roberto Todeschini, Milano Chemometrics and QSAR Research Group https://michem.unimib.it
C. Armanino, R. Leardi, S. Lanteri and G. Modi, 1989. Chemometric analysis of Tuscan olive oils. Cbemometrics and Intelligent Laboratoty Sysiem, 5: 343–354.
R. Todeschini, V. Consonni, A. Mauri, M. Pavan (2004) Software for the calculation of molecular descriptors. Pavan M. Talete slr, Milan, Italy, http://www.talete.mi.it
data(olitos) cc <- Linda(grp~., data=olitos, method="mcdC", l1med=TRUE) cc pr <- predict(cc) tt <- mtxconfusion(cc@grp, pr@classification, printit=TRUE)
data(olitos) cc <- Linda(grp~., data=olitos, method="mcdC", l1med=TRUE) cc pr <- predict(cc) tt <- mtxconfusion(cc@grp, pr@classification, printit=TRUE)
The oslo Transect data set contains 360 samples of different plant species collected along a 120 km transect running through the city of Oslo, Norway.
data(OsloTransect)
data(OsloTransect)
A data frame with 360 observations on the following 38 variables.
X.ID
a numeric vector, unique ID of the sample
X.MAT
a factor with levels BBA
BIL
BWO
FER
MOS
ROG
SNE
STW
TWI
XCOO
a numeric vector, X coordinate
YCOO
a numeric vector, Y coordinate
XCOO_km
a numeric vector
YCOO_km
a numeric vector
X.FOREST
a factor with levels BIRSPR
MIXDEC
PINE
SPRBIR
SPRPIN
SPRUCE
DAY
a numeric vector
X.WEATHER
a factor with levels CLOUD
MOIST
NICE
RAIN
ALT
a numeric vector
X.ASP
a factor with levels E
FLAT
N
NE
NW
S
SE
SW
W
X.GRVEG
a factor with levels BLGR
BLLY
BLMOLI
BLUE
BLUGRA
GRAS
GRBLU
GRFE
GRMO
LYLI
MIX
MOGR
MOSS
X.FLITHO
a factor with levels CAMSED
GNEID_O
GNEIS_O
GNEIS_R
MAGM
MICSH
Ag_ppb
a numeric vector
As_ash
a numeric vector
B
a numeric vector
Ba
a numeric vector
Ca
a numeric vector
Cd
a numeric vector
Co
a numeric vector
Cr
a numeric vector
Cu
a numeric vector
Fe
a numeric vector
Hg_ppb
a numeric vector
K
a numeric vector
La
a numeric vector
LOI
a numeric vector
Mg
a numeric vector
Mn
a numeric vector
Mo
a numeric vector
Ni
a numeric vector
P
a numeric vector
Pb
a numeric vector
S
a numeric vector
Sb
a numeric vector
Sr
a numeric vector
Ti
a numeric vector
Zn
a numeric vector
Samples of different plant species were collected along a 120 km
transect running through the city of Oslo,
Norway (forty samples each of leaves, needles,roots or
barks of several plant species), and the concentrations of
25 chemical elements for the sample materials are reported.
The factors that influenced the observed element
concentrations in the sample materials were investigated.
This data set was used in Todorov and Filzmoser (2007) for
illustration of the robust statistics for one-way MANOVA implemented in the function
Wilks.test
.
REIMANN,C., ARNOLDUSSEN,A., BOYD,R., FINNE,T.E., NORDGULEN,Oe., VOLDEN,T. and ENGLMAIER,P. (2006) The Influence of a city on element contents of a terrestrial moss (Hylocomium splendens), The Science of the Total Environment 369 419–432.
REIMANN,C., ARNOLDUSSEN,A., BOYD,R., FINNE,T.E., KOLLER,F., NORDGULEN,Oe., and ENGLMAIER,P. (2007) Element contents in leaves of four plant species (birch, mountain ash, fern and spruce) along anthropogenic and geogenic concentration gradients, The Science of the Total Environment 377 416–433.
REIMANN,C., ARNOLDUSSEN,A., FINNE,T.E., KOLLER,F., NORDGULEN,Oe., and ENGLMAIER,P., (2007) Element contents in birch leaves, bark and wood under different anthropogenic and geogenic conditions, Applied Geochemistry, 22 1549–1566.
Todorov V. and Filzmoser P. (2007) Robust statistic for the one-way MANOVA, submetted to the Journal of Environmetrics.
data(OsloTransect) str(OsloTransect) ## ## Log-transform the numerical part of the data, ## choose the desired groups and variables and ## perform the classical Wilks' Lambda test ## OsloTransect[,14:38] <- log(OsloTransect[,14:38]) grp <- OsloTransect$X.FLITHO ind <- which(grp =="CAMSED" | grp == "GNEIS_O" | grp == "GNEIS_R" | grp=="MAGM") (cwl <- Wilks.test(X.FLITHO~K+P+Zn+Cu,data=OsloTransect[ind,])) ## ## Perform now the robust MCD based Wilks' Lambda test. ## Use the already computed multiplication factor 'xd' and ## degrees of freedom 'xq' for the approximate distribution. ## xd <- -0.003708238 xq <- 11.79073 (mcdwl <- Wilks.test(X.FLITHO~K+P+Zn+Cu,data=OsloTransect[ind,], method="mcd", xd=xd, xq=xq))
data(OsloTransect) str(OsloTransect) ## ## Log-transform the numerical part of the data, ## choose the desired groups and variables and ## perform the classical Wilks' Lambda test ## OsloTransect[,14:38] <- log(OsloTransect[,14:38]) grp <- OsloTransect$X.FLITHO ind <- which(grp =="CAMSED" | grp == "GNEIS_O" | grp == "GNEIS_R" | grp=="MAGM") (cwl <- Wilks.test(X.FLITHO~K+P+Zn+Cu,data=OsloTransect[ind,])) ## ## Perform now the robust MCD based Wilks' Lambda test. ## Use the already computed multiplication factor 'xd' and ## degrees of freedom 'xq' for the approximate distribution. ## xd <- -0.003708238 xq <- 11.79073 (mcdwl <- Wilks.test(X.FLITHO~K+P+Zn+Cu,data=OsloTransect[ind,], method="mcd", xd=xd, xq=xq))
The class Pca
searves as a base class for deriving all other
classes representing the results of the classical and robust Principal
Component Analisys methods
A virtual Class: No objects may be created from it.
call
:Object of class "language"
center
:Object of class "vector"
the center of the data
scale
:Object of class "vector"
the scaling applied to each variable of the data
rank
:Object of class "numeric"
the rank of the data matrix
loadings
:Object of class "matrix"
the matrix
of variable loadings (i.e., a matrix whose columns contain the eigenvectors)
eigenvalues
:Object of class "vector"
the eigenvalues
scores
:Object of class "matrix"
the scores - the value
of the projected on the space of the principal components data (the centred
(and scaled if requested) data multiplied
by the loadings
matrix) is returned. Hence, cov(scores)
is the diagonal matrix diag(eigenvalues)
k
:Object of class "numeric"
number of (choosen) principal components
sd
:Object of class "Uvector"
Score distances within the robust PCA subspace
od
:Object of class "Uvector"
Orthogonal distances to the robust PCA subspace
cutoff.sd
:Object of class "numeric"
Cutoff value for the score distances
cutoff.od
:Object of class "numeric"
Cutoff values for the orthogonal distances
flag
:Object of class "Uvector"
The observations whose score distance is larger
than cutoff.sd or whose orthogonal distance is larger than cutoff.od can be considered
as outliers and receive a flag equal to zero.
The regular observations receive a flag 1
criterion to use for computing the cutoff values for the orthogonal and score distances. Default is 0.975.
n.obs
:Object of class "numeric"
the number of observations
eig0
:Object of class "vector"
all eigenvalues
totvar0
:Object of class "numeric"
the total variance explained (=sum(eig0)
)
signature(obj = "Pca")
: center of the data
signature(obj = "Pca")
: return the scaling applied to each variable
signature(obj = "Pca")
: the eigenvalues of the
covariance/correlation matrix, though the calculation is actually done
with the singular values of the data matrix)
signature(obj = "Pca")
: returns the matrix
loadings
(i.e., a matrix whose columns contain the eigenvectors).
The function prcomp returns this matrix in the element rotation.
signature(obj = "Pca")
: returns an S3 object prcomp
for compatibility with the functions prcomp() and princomp(). Thus the
standard plots screeplot() and biplot() can be used
signature(obj = "Pca")
: returns the rotated data (the centred
(and scaled if requested) data multiplied by the loadings matrix).
signature(obj = "Pca")
: returns the standard deviations of the
principal components (i.e., the square roots of the eigenvalues of the
covariance/correlation matrix, though the calculation is actually done
with the singular values of the data matrix)
signature(x = "Pca")
: produces a distance plot (if k=rank
) or
distance-distance plot (ifk<rank
)
signature(x = "Pca")
: prints the results. The difference to the show()
method is that additional parametesr are possible.
signature(object = "Pca")
: prints the results
signature(object = "Pca")
: calculates prediction using the results in
object
. An optional data frame or matrix in which to look for variables with which
to predict. If omitted, the scores are used. If the original fit used a formula or
a data frame or a matrix with column names, newdata must contain columns with the
same names. Otherwise it must contain the same number of columns,
to be used in the same order. See also predict.prcomp
and
predict.princomp
signature(x = "Pca")
: plots the variances against the
number of the principal component. See also plot.prcomp
and
plot.princomp
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
PcaClassic
, PcaClassic-class
, PcaRobust-class
showClass("Pca")
showClass("Pca")
Compute score and orthogonal distances for an object (derived from)Pca-class
.
pca.distances(obj, data, r, crit=0.975)
pca.distances(obj, data, r, crit=0.975)
obj |
an object of class (derived from) |
data |
The data matrix for which the |
r |
rank of data |
crit |
Criterion to use for computing the cutoff values. |
This function calculates the score and orthogonal distances and the appropriate cutoff values for identifying outlying observations. The computed values are used to create a vector a of flags, one for each observation, identifying the outliers.
An S4 object of class derived from the virtual class Pca-class
-
the same object passed to the function, but with the score and orthogonal
distances as well as their cutoff values and the corresponding flags appended to it.
Valentin Todorov [email protected]
M. Hubert, P. J. Rousseeuw, K. Vanden Branden (2005), ROBPCA: a new approach to robust principal components analysis, Technometrics, 47, 64–79.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## PCA of the Hawkins Bradu Kass's Artificial Data ## using all 4 variables data(hbk) pca <- PcaHubert(hbk) pca.distances(pca, hbk, rankMM(hbk))
## PCA of the Hawkins Bradu Kass's Artificial Data ## using all 4 variables data(hbk) pca <- PcaHubert(hbk) pca.distances(pca, hbk, rankMM(hbk))
Produces a score plot from an object (derived from) Pca-class
.
pca.scoreplot(obj, i=1, j=2, main, id.n, ...)
pca.scoreplot(obj, i=1, j=2, main, id.n, ...)
obj |
an object of class (derived from) |
i |
First score coordinate, defaults to |
j |
Second score coordinate, defaults to |
main |
The main title of the plot. |
id.n |
Number of observations to identify by a label. If missing and the total number of observations is less or equal to 10, all observations will be labelled. |
... |
Optional arguments to be passed to the internal graphical functions. |
Valentin Todorov [email protected]
Pca-class
,
PcaClassic
,
PcaRobust-class
.
require(graphics) ## PCA of the Hawkins Bradu Kass's Artificial Data ## using all 4 variables data(hbk) pca <- PcaHubert(hbk) pca pca.scoreplot(pca)
require(graphics) ## PCA of the Hawkins Bradu Kass's Artificial Data ## using all 4 variables data(hbk) pca <- PcaHubert(hbk) pca pca.scoreplot(pca)
Performs a principal components analysis and returns the results as an object of class PcaClassic (aka constructor).
PcaClassic(x, ...) ## Default S3 method: PcaClassic(x, k = ncol(x), kmax = ncol(x), scale=FALSE, signflip=TRUE, crit.pca.distances = 0.975, trace=FALSE, ...) ## S3 method for class 'formula' PcaClassic(formula, data = NULL, subset, na.action, ...)
PcaClassic(x, ...) ## Default S3 method: PcaClassic(x, k = ncol(x), kmax = ncol(x), scale=FALSE, signflip=TRUE, crit.pca.distances = 0.975, trace=FALSE, ...) ## S3 method for class 'formula' PcaClassic(formula, data = NULL, subset, na.action, ...)
formula |
a formula with no response variable, referring only to numeric variables. |
data |
an optional data frame (or similar: see
|
subset |
an optional vector used to select rows (observations) of the
data matrix |
na.action |
a function which indicates what should happen
when the data contain |
... |
arguments passed to or from other methods. |
x |
a numeric matrix (or data frame) which provides the data for the principal components analysis. |
k |
number of principal components to compute. If |
kmax |
maximal number of principal components to compute.
Default is |
scale |
a value indicating whether and how the variables should be scaled
to have unit variance (only possible if there are no constant
variables). If |
signflip |
a logical value indicating wheather to try to solve
the sign indeterminancy of the loadings - ad hoc approach setting
the maximum element in a singular vector to be positive. Default is
|
crit.pca.distances |
criterion to use for computing the cutoff values for the orthogonal and score distances. Default is 0.975. |
trace |
whether to print intermediate results. Default is |
An S4 object of class PcaClassic-class
which is a subclass of the
virtual class Pca-class
.
This function can be seen as a wrapper arround prcomp() from stats
which
returns the results of the PCA in a class compatible with the object model for robust PCA.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
Contains the results of a classical Principal Components Analysis
Objects can be created by calls of the form new("PcaClassic", ...)
but the
usual way of creating PcaClassic
objects is a call to the function
PcaClassic
which serves as a constructor.
call
:Object of class "language"
center
:Object of class "vector"
the center of the data
scale
:Object of class "vector"
the scaling applied to each variable
rank
:Object of class "numeric"
the rank of the data matrix
loadings
:Object of class "matrix"
the matrix
of variable loadings (i.e., a matrix whose columns contain the eigenvectors)
eigenvalues
:Object of class "vector"
the eigenvalues
scores
:Object of class "matrix"
the scores - the value
of the projected on the space of the principal components data (the centred
(and scaled if requested) data multiplied
by the loadings
matrix) is returned. Hence, cov(scores)
is the diagonal matrix diag(eigenvalues)
k
:Object of class "numeric"
number of (choosen) principal components
sd
:Object of class "Uvector"
Score distances within the robust PCA subspace
od
:Object of class "Uvector"
Orthogonal distances to the robust PCA subspace
cutoff.sd
:Object of class "numeric"
Cutoff value for the score distances
cutoff.od
:Object of class "numeric"
Cutoff values for the orthogonal distances
flag
:Object of class "Uvector"
The observations whose score distance is larger
than cutoff.sd or whose orthogonal distance is larger than cutoff.od can be considered
as outliers and receive a flag equal to zero.
The regular observations receive a flag 1
n.obs
:Object of class "numeric"
the number of observations
eig0
:Object of class "vector"
all eigenvalues
totvar0
:Object of class "numeric"
the total variance explained (=sum(eig0)
)
Class "Pca"
, directly.
signature(obj = "PcaClassic")
: returns the number of
observations used in the computation, i.e. n.obs
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
PcaRobust-class
, Pca-class
, PcaClassic
showClass("PcaClassic")
showClass("PcaClassic")
Robust PCA are obtained by replacing the classical covariance matrix
by a robust covariance estimator. This can be one of the available
in rrcov
estimators, i.e. MCD, OGK, M or S estimator.
PcaCov(x, ...) ## Default S3 method: PcaCov(x, k = ncol(x), kmax = ncol(x), cov.control=CovControlMcd(), scale = FALSE, signflip = TRUE, crit.pca.distances = 0.975, trace=FALSE, ...) ## S3 method for class 'formula' PcaCov(formula, data = NULL, subset, na.action, ...)
PcaCov(x, ...) ## Default S3 method: PcaCov(x, k = ncol(x), kmax = ncol(x), cov.control=CovControlMcd(), scale = FALSE, signflip = TRUE, crit.pca.distances = 0.975, trace=FALSE, ...) ## S3 method for class 'formula' PcaCov(formula, data = NULL, subset, na.action, ...)
formula |
a formula with no response variable, referring only to numeric variables. |
data |
an optional data frame (or similar: see
|
subset |
an optional vector used to select rows (observations) of the
data matrix |
na.action |
a function which indicates what should happen
when the data contain |
... |
arguments passed to or from other methods. |
x |
a numeric matrix (or data frame) which provides the data for the principal components analysis. |
k |
number of principal components to compute. If |
kmax |
maximal number of principal components to compute.
Default is |
cov.control |
specifies which covariance estimator to use by providing
a |
.
scale |
a value indicating whether and how the variables should be scaled
to have unit variance (only possible if there are no constant
variables). If |
signflip |
a logical value indicating wheather to try to solve the sign indeterminancy of the loadings -
ad hoc approach setting the maximum element in a singular vector to be positive. Default is |
crit.pca.distances |
criterion to use for computing the cutoff values for the orthogonal and score distances. Default is 0.975. |
trace |
whether to print intermediate results. Default is |
PcaCov
, serving as a constructor for objects of class PcaCov-class
is a generic function with "formula" and "default" methods. For details see the relevant references.
An S4 object of class PcaCov-class
which is a subclass of the
virtual class PcaRobust-class
.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## PCA of the Hawkins Bradu Kass's Artificial Data ## using all 4 variables data(hbk) pca <- PcaCov(hbk) pca ## Compare with the classical PCA prcomp(hbk) ## or PcaClassic(hbk) ## If you want to print the scores too, use print(pca, print.x=TRUE) ## Using the formula interface PcaCov(~., data=hbk) ## To plot the results: plot(pca) # distance plot pca2 <- PcaCov(hbk, k=2) plot(pca2) # PCA diagnostic plot (or outlier map) ## Use the standard plots available for for prcomp and princomp screeplot(pca) biplot(pca)
## PCA of the Hawkins Bradu Kass's Artificial Data ## using all 4 variables data(hbk) pca <- PcaCov(hbk) pca ## Compare with the classical PCA prcomp(hbk) ## or PcaClassic(hbk) ## If you want to print the scores too, use print(pca, print.x=TRUE) ## Using the formula interface PcaCov(~., data=hbk) ## To plot the results: plot(pca) # distance plot pca2 <- PcaCov(hbk, k=2) plot(pca2) # PCA diagnostic plot (or outlier map) ## Use the standard plots available for for prcomp and princomp screeplot(pca) biplot(pca)
Robust PCA are obtained by replacing the classical covariance matrix
by a robust covariance estimator. This can be one of the available
in rrcov
estimators, i.e. MCD, OGK, M, S or Stahel-Donoho estimator.
Objects can be created by calls of the form new("PcaCov", ...)
but the
usual way of creating PcaCov
objects is a call to the function
PcaCov
which serves as a constructor.
quan
:Object of class "numeric"
The quantile h
used throughout the algorithm
call
, center
, rank
, loadings
,
eigenvalues
, scores
, k
,
sd
, od
, cutoff.sd
, cutoff.od
,
flag
, n.obs
, eig0
, totvar0
:from the "Pca"
class.
Class "PcaRobust"
, directly.
Class "Pca"
, by class "PcaRobust", distance 2.
signature(obj = "PcaCov")
: ...
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
PcaRobust-class
, Pca-class
, PcaClassic
, PcaClassic-class
showClass("PcaCov")
showClass("PcaCov")
Computes an approximation of the PP-estimators for PCA using the grid search algorithm in the plane.
PcaGrid(x, ...) ## Default S3 method: PcaGrid(x, k = 0, kmax = ncol(x), scale=FALSE, na.action = na.fail, crit.pca.distances = 0.975, trace=FALSE, ...) ## S3 method for class 'formula' PcaGrid(formula, data = NULL, subset, na.action, ...)
PcaGrid(x, ...) ## Default S3 method: PcaGrid(x, k = 0, kmax = ncol(x), scale=FALSE, na.action = na.fail, crit.pca.distances = 0.975, trace=FALSE, ...) ## S3 method for class 'formula' PcaGrid(formula, data = NULL, subset, na.action, ...)
formula |
a formula with no response variable, referring only to numeric variables. |
data |
an optional data frame (or similar: see
|
subset |
an optional vector used to select rows (observations) of the
data matrix |
na.action |
a function which indicates what should happen
when the data contain |
... |
arguments passed to or from other methods. |
x |
a numeric matrix (or data frame) which provides the data for the principal components analysis. |
k |
number of principal components to compute. If |
kmax |
maximal number of principal components to compute.
Default is |
scale |
a value indicating whether and how the variables should be
scaled. If |
crit.pca.distances |
criterion to use for computing the cutoff values for the orthogonal and score distances. Default is 0.975. |
trace |
whether to print intermediate results. Default is |
PcaGrid
, serving as a constructor for objects of class PcaGrid-class
is a generic function with "formula" and "default" methods. For details see PCAgrid
and the relevant references.
An S4 object of class PcaGrid-class
which is a subclass of the
virtual class PcaRobust-class
.
Valentin Todorov [email protected]
C. Croux, P. Filzmoser, M. Oliveira, (2007). Algorithms for Projection-Pursuit Robust Principal Component Analysis, Chemometrics and Intelligent Laboratory Systems, 87, 225.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) # Here we calculate the principal components with PCAgrid pc <- PcaGrid(x, 6) # we could draw a biplot too: biplot(pc) # we could use another objective function, and # maybe only calculate the first three principal components: pc <- PcaGrid(x, 3, method="qn") biplot(pc) # now we want to compare the results with the non-robust principal components pc <- PcaClassic(x, k=3) # again, a biplot for comparision: biplot(pc)
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) # Here we calculate the principal components with PCAgrid pc <- PcaGrid(x, 6) # we could draw a biplot too: biplot(pc) # we could use another objective function, and # maybe only calculate the first three principal components: pc <- PcaGrid(x, 3, method="qn") biplot(pc) # now we want to compare the results with the non-robust principal components pc <- PcaClassic(x, k=3) # again, a biplot for comparision: biplot(pc)
Holds the results of an approximation of the PP-estimators for PCA using the grid search algorithm in the plane.
Objects can be created by calls of the form new("PcaGrid", ...)
but the
usual way of creating PcaGrid
objects is a call to the function
PcaGrid()
which serves as a constructor.
call
, center
, scale
, rank
, loadings
,
eigenvalues
, scores
, k
,
sd
, od
, cutoff.sd
, cutoff.od
,
flag
, n.obs
:from the "Pca"
class.
Class "PcaRobust"
, directly.
Class "Pca"
, by class "PcaRobust"
, distance 2.
signature(obj = "PcaGrid")
: ...
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
PcaRobust-class
, Pca-class
, PcaClassic
, PcaClassic-class
showClass("PcaGrid")
showClass("PcaGrid")
The ROBPCA algorithm was proposed by Hubert et al (2005) and stays for 'ROBust method for Principal Components Analysis'. It is resistant to outliers in the data. The robust loadings are computed using projection-pursuit techniques and the MCD method. Therefore ROBPCA can be applied to both low and high-dimensional data sets. In low dimensions, the MCD method is applied.
PcaHubert(x, ...) ## Default S3 method: PcaHubert(x, k = 0, kmax = 10, alpha = 0.75, mcd = TRUE, skew=FALSE, maxdir=250, scale = FALSE, signflip = TRUE, crit.pca.distances = 0.975, trace=FALSE, ...) ## S3 method for class 'formula' PcaHubert(formula, data = NULL, subset, na.action, ...)
PcaHubert(x, ...) ## Default S3 method: PcaHubert(x, k = 0, kmax = 10, alpha = 0.75, mcd = TRUE, skew=FALSE, maxdir=250, scale = FALSE, signflip = TRUE, crit.pca.distances = 0.975, trace=FALSE, ...) ## S3 method for class 'formula' PcaHubert(formula, data = NULL, subset, na.action, ...)
formula |
a formula with no response variable, referring only to numeric variables. |
data |
an optional data frame (or similar: see
|
subset |
an optional vector used to select rows (observations) of the
data matrix |
na.action |
a function which indicates what should happen
when the data contain |
... |
arguments passed to or from other methods. |
x |
a numeric matrix (or data frame) which provides the data for the principal components analysis. |
k |
number of principal components to compute. If |
kmax |
maximal number of principal components to compute.
Default is |
alpha |
this parameter measures the fraction of outliers the algorithm should resist. In MCD alpha controls the size of the subsets over which the determinant is minimized, i.e. alpha*n observations are used for computing the determinant. Allowed values are between 0.5 and 1 and the default is 0.75. |
mcd |
Logical - when the number of variables is sufficiently small,
the loadings are computed as the eigenvectors of the MCD covariance matrix,
hence the function |
skew |
Logical - whether the adjusted outlyingness algorithm for skewed
data (Hubert et al., 2009) will be used, default is |
maxdir |
maximal number of random directions to use for computing the
outlyingness (or the adjusted outlyingness when |
scale |
a value indicating whether and how the variables should be scaled.
If |
signflip |
a logical value indicating wheather to try to solve the sign indeterminancy of the loadings -
ad hoc approach setting the maximum element in a singular vector to be positive. Default is |
crit.pca.distances |
criterion to use for computing the cutoff values for the orthogonal and score distances. Default is 0.975. |
trace |
whether to print intermediate results. Default is |
PcaHubert
, serving as a constructor for objects of class PcaHubert-class
is a generic function with "formula" and "default" methods.
The calculation is done using the ROBPCA method of Hubert et al (2005) which can
be described briefly as follows. For details see the relevant references.
Let n
denote the number of observations, and p
the
number of original variables in the input data matrix X
. The
ROBPCA algorithm finds a robust center M (p x 1)
of the data
and a loading matrix P
which is (p x k)
dimensional.
Its columns are orthogonal and define a new coordinate system. The
scores T, an (n x k)
matrix, are the coordinates of the
centered observations with respect to the loadings:
The ROBPCA algorithm also yields a robust covariance matrix (often singular) which can be computed as
where is the diagonal matrix with the eigenvalues
.
This is done in the following three main steps:
Step 1: The data are preprocessed by reducing their data space to
the subspace spanned by the n
observations. This is done by
singular value decomposition of the input data matrix. As a result
the data are represented using at most n-1=rank(X)
without
loss of information.
Step 2: In this step for each data point a measure of outlyingness
is computed. For this purpose the high-dimensional data points are
projected on many univariate directions, each time the univariate
MCD estimator of location and scale is computed and the
standardized distance to the center is measured. The largest of
these distances (over all considered directions) is the outlyingness
measure of the data point. The h
data points with smallest
outlyingness measure are used to compute the covariance matrix
and to select the number
k
of principal
components to retain. This is done by finding such k
that
and
Alternatively the number of principal components
k
can be
specified by the user after inspecting the scree plot.
Step 3: The data points are projected on the k-dimensional subspace
spanned by the k
eigenvectors corresponding to the largest
k
eigenvalues of the matrix . The location and
scatter of the projected data are computed using the
reweighted MCD estimator and the eigenvectors of this scatter matrix
yield the robust principal components.
An S4 object of class PcaHubert-class
which is a subclass of the
virtual class PcaRobust-class
.
The ROBPCA algorithm is implemented on the bases of the Matlab implementation, available as part of LIBRA, a Matlab Library for Robust Analysis to be found at www.wis.kuleuven.ac.be/stat/robust.html
Valentin Todorov [email protected]
M. Hubert, P. J. Rousseeuw, K. Vanden Branden (2005), ROBPCA: a new approach to robust principal components analysis, Technometrics, 47, 64–79.
M. Hubert, P. J. Rousseeuw and T. Verdonck (2009), Robust PCA for skewed data and its outlier map, Computational Statistics & Data Analysis, 53, 2264–2274.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## PCA of the Hawkins Bradu Kass's Artificial Data ## using all 4 variables data(hbk) pca <- PcaHubert(hbk) pca ## Compare with the classical PCA prcomp(hbk) ## or PcaClassic(hbk) ## If you want to print the scores too, use print(pca, print.x=TRUE) ## Using the formula interface PcaHubert(~., data=hbk) ## To plot the results: plot(pca) # distance plot pca2 <- PcaHubert(hbk, k=2) plot(pca2) # PCA diagnostic plot (or outlier map) ## Use the standard plots available for prcomp and princomp screeplot(pca) biplot(pca) ## Restore the covraiance matrix py <- PcaHubert(hbk) cov.1 <- py@loadings %*% diag(py@eigenvalues) %*% t(py@loadings) cov.1
## PCA of the Hawkins Bradu Kass's Artificial Data ## using all 4 variables data(hbk) pca <- PcaHubert(hbk) pca ## Compare with the classical PCA prcomp(hbk) ## or PcaClassic(hbk) ## If you want to print the scores too, use print(pca, print.x=TRUE) ## Using the formula interface PcaHubert(~., data=hbk) ## To plot the results: plot(pca) # distance plot pca2 <- PcaHubert(hbk, k=2) plot(pca2) # PCA diagnostic plot (or outlier map) ## Use the standard plots available for prcomp and princomp screeplot(pca) biplot(pca) ## Restore the covraiance matrix py <- PcaHubert(hbk) cov.1 <- py@loadings %*% diag(py@eigenvalues) %*% t(py@loadings) cov.1
The ROBPCA algorithm was proposed by Hubert et al (2005) and stays for 'ROBust method for Principal Components Analysis'. It is resistant to outliers in the data. The robust loadings are computed using projection-pursuit techniques and the MCD method. Therefore ROBPCA can be applied to both low and high-dimensional data sets. In low dimensions, the MCD method is applied.
Objects can be created by calls of the form new("PcaHubert", ...)
but the
usual way of creating PcaHubert
objects is a call to the function
PcaHubert
which serves as a constructor.
alpha
:Object of class "numeric"
the fraction of outliers
the algorithm should resist - this is the argument alpha
quan
:The quantile h
used throughout the algorithm
skew
:Whether the adjusted outlyingness algorithm for skewed data was used
ao
:Object of class "Uvector"
Adjusted outlyingness within the robust PCA subspace
call
, center
, scale
, rank
, loadings
,
eigenvalues
, scores
, k
,
sd
, od
, cutoff.sd
, cutoff.od
,
flag
, n.obs
, eig0
, totvar0
:from the "Pca"
class.
Class "PcaRobust"
, directly.
Class "Pca"
, by class "PcaRobust", distance 2.
signature(obj = "PcaHubert")
: Returns the quantile
used throughout the algorithm
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
PcaRobust-class
, Pca-class
, PcaClassic
, PcaClassic-class
showClass("PcaHubert")
showClass("PcaHubert")
The Spherical Principal Components procedure was proposed by Locantore et al., (1999) as a functional data analysis method. The idea is to perform classical PCA on the data, projected onto a unit sphere. The estimates of the eigenvectors are consistent and the procedure is extremely fast. The simulations of Maronna (2005) show that this method has very good performance.
PcaLocantore(x, ...) ## Default S3 method: PcaLocantore(x, k = ncol(x), kmax = ncol(x), delta = 0.001, na.action = na.fail, scale = FALSE, signflip = TRUE, crit.pca.distances = 0.975, trace=FALSE, ...) ## S3 method for class 'formula' PcaLocantore(formula, data = NULL, subset, na.action, ...)
PcaLocantore(x, ...) ## Default S3 method: PcaLocantore(x, k = ncol(x), kmax = ncol(x), delta = 0.001, na.action = na.fail, scale = FALSE, signflip = TRUE, crit.pca.distances = 0.975, trace=FALSE, ...) ## S3 method for class 'formula' PcaLocantore(formula, data = NULL, subset, na.action, ...)
formula |
a formula with no response variable, referring only to numeric variables. |
data |
an optional data frame (or similar: see
|
subset |
an optional vector used to select rows (observations) of the
data matrix |
na.action |
a function which indicates what should happen
when the data contain |
... |
arguments passed to or from other methods. |
x |
a numeric matrix (or data frame) which provides the data for the principal components analysis. |
k |
number of principal components to compute. If |
kmax |
maximal number of principal components to compute.
Default is |
delta |
an accuracy parameter |
scale |
a value indicating whether and how the variables should be scaled
to have unit variance (only possible if there are no constant
variables). If |
signflip |
a logical value indicating wheather to try to solve
the sign indeterminancy of the loadings - ad hoc approach setting
the maximum element in a singular vector to be positive. Default is
|
crit.pca.distances |
criterion to use for computing the cutoff values for the orthogonal and score distances. Default is 0.975. |
trace |
whether to print intermediate results. Default is |
PcaLocantore
, serving as a constructor for objects of class
PcaLocantore-class
is a generic function with "formula"
and "default" methods. For details see the relevant references.
An S4 object of class PcaLocantore-class
which is a subclass of the
virtual class PcaRobust-class
.
Valentin Todorov [email protected] The SPC algorithm is implemented on the bases of the available from the web site of the book Maronna et al. (2006) code https://www.wiley.com/legacy/wileychi/robust_statistics/
N. Locantore, J. Marron, D. Simpson, N. Tripoli, J. Zhang and K. Cohen K. (1999), Robust principal components for functional data. Test, 8, 1-28.
R. Maronna, D. Martin and V. Yohai (2006), Robust Statistics: Theory and Methods. Wiley, New York.
R. Maronna (2005). Principal components and orthogonal regression based on robust scales. Technometrics, 47, 264-273.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## PCA of the Hawkins Bradu Kass's Artificial Data ## using all 4 variables data(hbk) pca <- PcaLocantore(hbk) pca ## Compare with the classical PCA prcomp(hbk) ## or PcaClassic(hbk) ## If you want to print the scores too, use print(pca, print.x=TRUE) ## Using the formula interface PcaLocantore(~., data=hbk) ## To plot the results: plot(pca) # distance plot pca2 <- PcaLocantore(hbk, k=2) plot(pca2) # PCA diagnostic plot (or outlier map) ## Use the standard plots available for for prcomp and princomp screeplot(pca) biplot(pca)
## PCA of the Hawkins Bradu Kass's Artificial Data ## using all 4 variables data(hbk) pca <- PcaLocantore(hbk) pca ## Compare with the classical PCA prcomp(hbk) ## or PcaClassic(hbk) ## If you want to print the scores too, use print(pca, print.x=TRUE) ## Using the formula interface PcaLocantore(~., data=hbk) ## To plot the results: plot(pca) # distance plot pca2 <- PcaLocantore(hbk, k=2) plot(pca2) # PCA diagnostic plot (or outlier map) ## Use the standard plots available for for prcomp and princomp screeplot(pca) biplot(pca)
The Spherical Principal Components procedure was proposed by Locantore et al., (1999) as a functional data analysis method. The idea is to perform classical PCA on the the data, \ projected onto a unit sphere. The estimates of the eigenvectors are consistent and the procedure is extremly fast. The simulations of Maronna (2005) show that this method has very good performance.
Objects can be created by calls of the form new("PcaLocantore", ...)
but the
usual way of creating PcaLocantore
objects is a call to the function
PcaLocantore
which serves as a constructor.
delta
:Accuracy parameter
quan
:Object of class "numeric"
The quantile h used throughout the algorithm
call
, center
, scale
, rank
, loadings
,
eigenvalues
, scores
, k
,
sd
, od
, cutoff.sd
, cutoff.od
,
flag
, n.obs
, eig0
, totvar0
:from the "Pca"
class.
Class "PcaRobust"
, directly.
Class "Pca"
, by class "PcaRobust", distance 2.
signature(obj = "PcaLocantore")
: ...
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
PcaRobust-class
, Pca-class
, PcaClassic
, PcaClassic-class
showClass("PcaLocantore")
showClass("PcaLocantore")
A fast and simple algorithm for approximating the PP-estimators for PCA: Croux and Ruiz-Gazen (2005)
PcaProj(x, ...) ## Default S3 method: PcaProj(x, k = 0, kmax = ncol(x), scale=FALSE, na.action = na.fail, crit.pca.distances = 0.975, trace=FALSE, ...) ## S3 method for class 'formula' PcaProj(formula, data = NULL, subset, na.action, ...)
PcaProj(x, ...) ## Default S3 method: PcaProj(x, k = 0, kmax = ncol(x), scale=FALSE, na.action = na.fail, crit.pca.distances = 0.975, trace=FALSE, ...) ## S3 method for class 'formula' PcaProj(formula, data = NULL, subset, na.action, ...)
formula |
a formula with no response variable, referring only to numeric variables. |
data |
an optional data frame (or similar: see
|
subset |
an optional vector used to select rows (observations) of the
data matrix |
na.action |
a function which indicates what should happen
when the data contain |
... |
arguments passed to or from other methods. |
x |
a numeric matrix (or data frame) which provides the data for the principal components analysis. |
k |
number of principal components to compute. If |
kmax |
maximal number of principal components to compute.
Default is |
scale |
a value indicating whether and how the variables should be
scaled. If |
crit.pca.distances |
criterion to use for computing the cutoff values for the orthogonal and score distances. Default is 0.975. |
trace |
whether to print intermediate results. Default is |
PcaProj
, serving as a constructor for objects of class PcaProj-class
is a generic function with "formula" and "default" methods. For details see
PCAproj
and the relevant references.
An S4 object of class PcaProj-class
which is a subclass of the
virtual class PcaRobust-class
.
Valentin Todorov [email protected]
C. Croux, A. Ruiz-Gazen (2005). High breakdown estimators for principal components: The projection-pursuit approach revisited, Journal of Multivariate Analysis, 95, 206–226.
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) # Here we calculate the principal components with PcaProj pc <- PcaProj(x, 6) # we could draw a biplot too: biplot(pc) # we could use another calculation method and another objective function, and # maybe only calculate the first three principal components: pc <- PcaProj(x, k=3, method="qn", CalcMethod="sphere") biplot(pc) # now we want to compare the results with the non-robust principal components pc <- PcaClassic(x, k=3) # again, a biplot for comparision: biplot(pc)
# multivariate data with outliers library(mvtnorm) x <- rbind(rmvnorm(200, rep(0, 6), diag(c(5, rep(1,5)))), rmvnorm( 15, c(0, rep(20, 5)), diag(rep(1, 6)))) # Here we calculate the principal components with PcaProj pc <- PcaProj(x, 6) # we could draw a biplot too: biplot(pc) # we could use another calculation method and another objective function, and # maybe only calculate the first three principal components: pc <- PcaProj(x, k=3, method="qn", CalcMethod="sphere") biplot(pc) # now we want to compare the results with the non-robust principal components pc <- PcaClassic(x, k=3) # again, a biplot for comparision: biplot(pc)
Holds the results of an approximation of the PP-estimators for PCA by a fast and simple algorithm: Croux and Ruiz-Gazen (2005) algorithm.
Objects can be created by calls of the form new("PcaProj", ...)
but the
usual way of creating PcaProj
objects is a call to the function
PcaProj()
which serves as a constructor.
call
, center
, scale
, rank
, loadings
,
eigenvalues
, scores
, k
,
sd
, od
, cutoff.sd
, cutoff.od
,
flag
, n.obs
:from the "Pca"
class.
Class "PcaRobust"
, directly.
Class "Pca"
, by class "PcaRobust"
, distance 2.
signature(obj = "PcaProj")
: ...
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
PcaRobust-class
, Pca-class
, PcaClassic
, PcaClassic-class
showClass("PcaProj")
showClass("PcaProj")
The class PcaRobust
searves as a base class for deriving all other
classes representing the results of the robust Principal Component Analisys methods
A virtual Class: No objects may be created from it.
call
:Object of class "language"
center
:Object of class "vector"
the center of the data
loadings
:Object of class "matrix"
the matrix
of variable loadings (i.e., a matrix whose columns contain the eigenvectors)
eigenvalues
:Object of class "vector"
the eigenvalues
scores
:Object of class "matrix"
the scores - the value
of the projected on the space of the principal components data (the centred
(and scaled if requested) data multiplied
by the loadings
matrix) is returned. Hence, cov(scores)
is the diagonal matrix diag(eigenvalues)
k
:Object of class "numeric"
number of (choosen) principal components
sd
:Object of class "Uvector"
Score distances within the robust PCA subspace
od
:Object of class "Uvector"
Orthogonal distances to the robust PCA subspace
cutoff.sd
:Object of class "numeric"
Cutoff value for the score distances
cutoff.od
:Object of class "numeric"
Cutoff values for the orthogonal distances
flag
:Object of class "Uvector"
The observations whose score distance is larger
than cutoff.sd or whose orthogonal distance is larger than cutoff.od can be considered
as outliers and receive a flag equal to zero.
The regular observations receive a flag 1
n.obs
:Object of class "numeric"
the number of observations
Class "Pca"
, directly.
No methods defined with class "PcaRobust" in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
showClass("PcaRobust")
showClass("PcaRobust")
Shows the Mahalanobis distances based on robust and/or classical estimates of the location and the covariance matrix in different plots. The following plots are available:
- index plot of the robust and mahalanobis distances
- distance-distance plot
- Chisquare QQ-plot of the robust and mahalanobis distances
- plot of the tolerance ellipses (robust and classic)
- Scree plot - Eigenvalues comparison plot
## S4 method for signature 'CovClassic' plot(x, which = c("all","distance","qqchi2","tolellipse","screeplot"), ask=(which=="all" && dev.interactive()), cutoff, id.n, tol=1e-7, ...) ## S4 method for signature 'CovRobust' plot(x, which = c("all","dd","distance","qqchi2","tolellipse","screeplot"), classic=FALSE, ask=(which=="all" && dev.interactive()), cutoff, id.n, tol=1e-7, ...)
## S4 method for signature 'CovClassic' plot(x, which = c("all","distance","qqchi2","tolellipse","screeplot"), ask=(which=="all" && dev.interactive()), cutoff, id.n, tol=1e-7, ...) ## S4 method for signature 'CovRobust' plot(x, which = c("all","dd","distance","qqchi2","tolellipse","screeplot"), classic=FALSE, ask=(which=="all" && dev.interactive()), cutoff, id.n, tol=1e-7, ...)
x |
an object of class |
which |
Which plot to show? See Details for description of the options. Default is |
.
classic |
whether to plot the classical distances too. Default is |
.
ask |
logical; if 'TRUE', the user is asked before each plot, see 'par(ask=.)'.
Default is |
cutoff |
The cutoff value for the distances. |
id.n |
Number of observations to identify by a label. If not supplied, the number of observations with distance larger than |
tol |
tolerance to be used for computing the inverse see 'solve'. Default is |
... |
other parameters to be passed through to plotting functions. |
Plot mahalanobis distances for x
.
Plot robust and classical mahalanobis distances for x
.
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) cv <- CovClassic(hbk.x) plot(cv) rcv <- CovMest(hbk.x) plot(rcv)
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) cv <- CovClassic(hbk.x) plot(cv) rcv <- CovMest(hbk.x) plot(rcv)
The Archaic Greek Pottery data set contains data on fragments of Greek pottery which were classified into two groups according to their origin: Attic or Eritrean. Six chemical variables, metallic oxide constituents, were measured: Si, Al, Fe, Ca and Ti. The main data set consists of 13 Attic objects and 14 Eritrean ones. There is a separate data set with 13 observations which can be used as a test data set. It consists of 4 observations classified as "probably Attic" and the remaining 9 as "probably Eritrean".
data(pottery)
data(pottery)
Two data frames with 27 an 13 observations on the following 7 variables.
SI
Si content
AL
Al content
FE
Fe content
MG
Mg content
CA
Ca content
TI
Ti content
origin
Origin - factor with two levels: Attic
and Eritrean
The Archaic Greek Pottery data set was first published by Stern and Descoeudres (1977) and later reproduced in Cooper and Weeks (1983) for illustration of linear discriminant analisys. The data set was used by Pires and Branco (2010) for illustration of their projection pursuit approach to linear discriminant analysis.
STERN, W. B. and DESCOEUDRES, J.-P. (1977) X-RAY FLUORESCENCE ANALYSIS OF ARCHAIC GREEK POTTERY Archaeometry, Blackwell Publishing Ltd, 19, 73–86.
Cooper, R.A. and Weekes, A.J.. 1983 Data, Models, and Statistical Analysis, (Lanham, MD: Rowman & Littlefield).
Pires, A. M. and A. Branco, J. (2010) Projection-pursuit approach to robust linear discriminant analysis Journal Multivariate Analysis, Academic Press, Inc., 101, 2464–2485.
data(pottery) x <- pottery[,c("MG", "CA")] grp <- pottery$origin ## ## Compute robust location and covariance matrix and ## plot the tolerance ellipses library(rrcov) (mcd <- CovMcd(x)) col <- c(3,4) gcol <- ifelse(grp == "Attic", col[1], col[2]) gpch <- ifelse(grp == "Attic", 16, 1) plot(mcd, which="tolEllipsePlot", class=TRUE, col=gcol, pch=gpch) ## ## Perform classical LDA and plot the data, 0.975 tolerance ellipses ## and LDA separation line ## x <- pottery[,c("MG", "CA")] grp <- pottery$origin lda <- LdaClassic(x, grp) lda e1 <- getEllipse(loc=lda@center[1,], cov=lda@cov) e2 <- getEllipse(loc=lda@center[2,], cov=lda@cov) plot(CA~MG, data=pottery, col=gcol, pch=gpch, xlim=c(min(MG,e1[,1], e2[,1]), max(MG,e1[,1], e2[,1])), ylim=c(min(CA,e1[,2], e2[,2]), max(CA,e1[,2], e2[,2]))) ab <- lda@ldf[1,] - lda@ldf[2,] cc <- lda@ldfconst[1] - lda@ldfconst[2] abline(a=-cc/ab[2], b=-ab[1]/ab[2], col=2, lwd=2) lines(e1, type="l", col=col[1]) lines(e2, type="l", col=col[2]) ## ## Perform robust (MCD) LDA and plot data, classical and ## robust separation line ## plot(CA~MG, data=pottery, col=gcol, pch=gpch) lda <- LdaClassic(x, grp) ab <- lda@ldf[1,] - lda@ldf[2,] cc <- lda@ldfconst[1] - lda@ldfconst[2] abline(a=-cc/ab[2], b=-ab[1]/ab[2], col=2, lwd=2) abline(a=-cc/ab[2], b=-ab[1]/ab[2], col=4, lwd=2) rlda <- Linda(x, grp, method="mcd") rlda ab <- rlda@ldf[1,] - rlda@ldf[2,] cc <- rlda@ldfconst[1] - rlda@ldfconst[2] abline(a=-cc/ab[2], b=-ab[1]/ab[2], col=2, lwd=2)
data(pottery) x <- pottery[,c("MG", "CA")] grp <- pottery$origin ## ## Compute robust location and covariance matrix and ## plot the tolerance ellipses library(rrcov) (mcd <- CovMcd(x)) col <- c(3,4) gcol <- ifelse(grp == "Attic", col[1], col[2]) gpch <- ifelse(grp == "Attic", 16, 1) plot(mcd, which="tolEllipsePlot", class=TRUE, col=gcol, pch=gpch) ## ## Perform classical LDA and plot the data, 0.975 tolerance ellipses ## and LDA separation line ## x <- pottery[,c("MG", "CA")] grp <- pottery$origin lda <- LdaClassic(x, grp) lda e1 <- getEllipse(loc=lda@center[1,], cov=lda@cov) e2 <- getEllipse(loc=lda@center[2,], cov=lda@cov) plot(CA~MG, data=pottery, col=gcol, pch=gpch, xlim=c(min(MG,e1[,1], e2[,1]), max(MG,e1[,1], e2[,1])), ylim=c(min(CA,e1[,2], e2[,2]), max(CA,e1[,2], e2[,2]))) ab <- lda@ldf[1,] - lda@ldf[2,] cc <- lda@ldfconst[1] - lda@ldfconst[2] abline(a=-cc/ab[2], b=-ab[1]/ab[2], col=2, lwd=2) lines(e1, type="l", col=col[1]) lines(e2, type="l", col=col[2]) ## ## Perform robust (MCD) LDA and plot data, classical and ## robust separation line ## plot(CA~MG, data=pottery, col=gcol, pch=gpch) lda <- LdaClassic(x, grp) ab <- lda@ldf[1,] - lda@ldf[2,] cc <- lda@ldfconst[1] - lda@ldfconst[2] abline(a=-cc/ab[2], b=-ab[1]/ab[2], col=2, lwd=2) abline(a=-cc/ab[2], b=-ab[1]/ab[2], col=4, lwd=2) rlda <- Linda(x, grp, method="mcd") rlda ab <- rlda@ldf[1,] - rlda@ldf[2,] cc <- rlda@ldfconst[1] - rlda@ldfconst[2] abline(a=-cc/ab[2], b=-ab[1]/ab[2], col=2, lwd=2)
The prediction of a "Lda" object
Objects can be created by calls of the form new("PredictLda", ...)
but most often by invoking 'predict' on a "Lda" object. They contain values
meant for printing by 'show'
classification
:a factor variable containing the classification of each object
posterior
:a matrix containing the posterior probabilities
x
:matrix with the discriminant scores
ct
:re-classification table of the training sample
signature(object = "PredictLda")
: Prints the results
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
showClass("PredictLda")
showClass("PredictLda")
The prediction of a "Qda" object
Objects can be created by calls of the form new("PredictQda", ...)
but most often by invoking 'predict' on a "Qda" object. They contain values
meant for printing by 'show'
classification
:a factor variable containing the classification of each object
posterior
:a matrix containing the posterior probabilities
x
:matrix with the discriminant scores
ct
:re-classification table of the training sample
signature(object = "PredictQda")
: prints the results
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
showClass("PredictQda")
showClass("PredictQda")
The class Qda
serves as a base class for deriving
all other classes representing the results of classical
and robust Quadratic Discriminant Analisys methods
A virtual Class: No objects may be created from it.
call
:the (matched) function call.
prior
:prior probabilities used, default to group proportions
counts
:number of observations in each class
center
:the group means
cov
:the group covariance matrices
covinv
:the inverse of the group covariance matrices
covdet
:the determinants of the group covariance matrices
method
:a character string giving the estimation method used
X
:the training data set (same as the input parameter x of the constructor function)
grp
:grouping variable: a factor specifying the class for each observation.
control
:object of class "CovControl"
specifying which estimate
and with what estimation options to use for the group means and covariances
(or NULL
for classical discriminant analysis)
signature(object = "Qda")
: calculates prediction using the results in
object
. An optional data frame or matrix in which to look for variables with which
to predict. If omitted, the scores are used. If the original fit used a formula or
a data frame or a matrix with column names, newdata must contain columns with the
same names. Otherwise it must contain the same number of columns,
to be used in the same order.
signature(object = "Qda")
: prints the results
signature(object = "Qda")
: prints summary information
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
QdaClassic
, QdaClassic-class
, QdaRobust-class
showClass("Qda")
showClass("Qda")
Performs quadratic discriminant analysis and returns the results as an object
of class QdaClassic
(aka constructor).
QdaClassic(x, ...) ## Default S3 method: QdaClassic(x, grouping, prior = proportions, tol = 1.0e-4, ...)
QdaClassic(x, ...) ## Default S3 method: QdaClassic(x, grouping, prior = proportions, tol = 1.0e-4, ...)
x |
a matrix or data frame containing the explanatory variables (training set). |
grouping |
grouping variable: a factor specifying the class for each observation. |
prior |
prior probabilities, default to the class proportions for the training set. |
tol |
tolerance |
... |
arguments passed to or from other methods. |
Returns an S4 object of class QdaClassic
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
Contains the results of classical Quadratic Discriminant Analysis
Objects can be created by calls of the form new("QdaClassic", ...)
but the
usual way of creating QdaClassic
objects is a call to the function
QdaClassic
which serves as a constructor.
call
:The (matched) function call.
prior
:Prior probabilities used, default to group proportions
counts
:number of observations in each class
center
:the group means
cov
:the group covariance matrices
covinv
:the inverse of the group covariance matrices
covdet
:the determinants of the group covariance matrices
method
:a character string giving the estimation method used
X
:the training data set (same as the input parameter x of the constructor function)
grp
:grouping variable: a factor specifying the class for each observation.
control
:Object of class "CovControl"
inherited from class Qda
specifying which estimate and with what estimation options to use for the group
means and covariances. It is always NULL
for classical discriminant analysis.
Class "Qda"
, directly.
No methods defined with class "QdaClassic" in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
QdaRobust-class
, Qda-class
, QdaClassic
showClass("QdaClassic")
showClass("QdaClassic")
Performs robust quadratic discriminant analysis and returns
the results as an object of class QdaCov
(aka constructor).
QdaCov(x, ...) ## Default S3 method: QdaCov(x, grouping, prior = proportions, tol = 1.0e-4, method = CovControlMcd(), ...)
QdaCov(x, ...) ## Default S3 method: QdaCov(x, grouping, prior = proportions, tol = 1.0e-4, method = CovControlMcd(), ...)
x |
a matrix or data frame containing the explanatory variables (training set). |
grouping |
grouping variable: a factor specifying the class for each observation. |
prior |
prior probabilities, default to the class proportions for the training set. |
tol |
tolerance |
method |
method |
... |
arguments passed to or from other methods |
details
Returns an S4 object of class QdaCov
Still an experimental version!
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
## Example anorexia library(MASS) data(anorexia) ## start with the classical estimates qda <- QdaClassic(Treat~., data=anorexia) predict(qda)@classification ## try now the robust LDA with the default method (MCD with pooled whitin cov matrix) rqda <- QdaCov(Treat~., data= anorexia) predict(rqda)@classification ## try the other methods QdaCov(Treat~., data= anorexia, method="sde") QdaCov(Treat~., data= anorexia, method="M") QdaCov(Treat~., data= anorexia, method=CovControlOgk())
## Example anorexia library(MASS) data(anorexia) ## start with the classical estimates qda <- QdaClassic(Treat~., data=anorexia) predict(qda)@classification ## try now the robust LDA with the default method (MCD with pooled whitin cov matrix) rqda <- QdaCov(Treat~., data= anorexia) predict(rqda)@classification ## try the other methods QdaCov(Treat~., data= anorexia, method="sde") QdaCov(Treat~., data= anorexia, method="M") QdaCov(Treat~., data= anorexia, method=CovControlOgk())
Robust quadratic discriminant analysis is performed by replacing the classical group means and withing group covariance matrices by their robust equivalents.
Objects can be created by calls of the form new("QdaCov", ...)
but the
usual way of creating QdaCov
objects is a call to the function
QdaCov
which serves as a constructor.
call
:The (matched) function call.
prior
:Prior probabilities used, default to group proportions
counts
:number of observations in each class
center
:the group means
cov
:the group covariance matrices
covinv
:the inverse of the group covariance matrices
covdet
:the determinants of the group covariance matrices
method
:a character string giving the estimation method used
X
:the training data set (same as the input parameter x of the constructor function)
grp
:grouping variable: a factor specifying the class for each observation.
control
:Object of class "CovControl"
specifying which estimate to
use for the group means and covariances
Class "QdaRobust"
, directly.
Class "Qda"
, by class "QdaRobust", distance 2.
No methods defined with class "QdaCov" in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
QdaRobust-class
, Qda-class
, QdaClassic
, QdaClassic-class
showClass("QdaCov")
showClass("QdaCov")
The class QdaRobust
searves as a base class for deriving all other
classes representing the results of robust Quadratic Discriminant Analysis methods
A virtual Class: No objects may be created from it.
call
:The (matched) function call.
prior
:Prior probabilities used, default to group proportions
counts
:number of observations in each class
center
:the group means
cov
:the group covariance matrices
covinv
:the inverse of the group covariance matrices
covdet
:the determinants of the group covariance matrices
method
:a character string giving the estimation method used
X
:the training data set (same as the input parameter x of the constructor function)
grp
:grouping variable: a factor specifying the class for each observation.
control
:Object of class "CovControl"
specifying which estimate to
use for the group means and covariances
Class "Qda"
, directly.
No methods defined with class "QdaRobust" in the signature.
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
showClass("QdaRobust")
showClass("QdaRobust")
Each concrete control class, like CovControlMest
, CovControlOgk
,
etc., should implement an restimate
method which will call the correponding
(constructor)-function and will return the obtained S4 class, derived from
CovRobust
.
## S4 method for signature 'CovControlMest' restimate(obj, x, ...)
## S4 method for signature 'CovControlMest' restimate(obj, x, ...)
obj |
an object of class |
x |
Data frame or matrix containing the data |
.
... |
other parameters to be passed through to the estimation function. |
Compute the MCD estimates of multivariate location and
scatter by callingCovMcd
Compute the constrained M-estimates of multivariate location and
scatter by callingCovMest
Compute the Ortogonalized Gnanadesikan-Kettenring (OGK) estimates of multivariate location and
scatter by callingCovOgk
The rice taste data consists of five inputs and a single output whose values are associated with subjective evaluations as follows: xl: flavor, x2: appearance, x3: taste, x4: stickiness, x5: toughness, y: overall evaluation. Sensory test data have been obtained by such subjective evaluations for 105 kinds of rice (e.g., Sasanishiki, Akita-Komachi, etc.). The data set was used by Nozaki et al. (1997) to demonstrate the high performance of a proposed for automatically generating fuzzy if-then rules from numerical data.
data(rice)
data(rice)
A data frame with 105 observations on the following 6 variables:
Favor
compactness
Appearance
circularity
Taste
distance circularity
Stickiness
radius ratio
Toughness
principal axis aspect ratio
Overall_evaluation
maximum length aspect ratio
Nozaki, K., Ishibuchi, H, and Tanaka, H. (1997) A simple but powerful heuristic method for generating fuzzy rules from numerical data Fuzzy Sets and Systems 86 3 p. 251–270.
The salmon data contains two measurements of the growth rings on
the scale of Alaskan and Canadian salmon as well as the gender
of the fishes. There are 50 Alaskan-born and 50 Canadian-born salmon, and this information
is coded in the variable Origin
.
data(salmon)
data(salmon)
A data frame with 100 observations on the following 4 variables.
Gender
female=1 and male=2
Freshwater
diameter of rings for the first-year freshwater growth (hundrets of an inch)
Marine
diameter of rings for the first-year marine growth (hundrets of an inch)
Origin
Origin of the fish: a factor with levels Alaskan
Canadian
Johnson, R.A. and Wichern, D. W. Applied Multivariate Statistical Analysis (Prentice Hall, International Editions, 2002, fifth edition)
data(salmon)
data(salmon)
Produces a score plot from an object (derived from) Pca-class
.
## S4 method for signature 'Pca' scorePlot(x, i=1, j=2, ...)
## S4 method for signature 'Pca' scorePlot(x, i=1, j=2, ...)
x |
an object of class (derived from) |
i |
First score coordinate, defaults to |
j |
Second score coordinate, defaults to |
... |
optional arguments to be passed to the internal graphical functions. |
a plot is produced on the current graphics device.
signature(x = Pca)
: Plot a scatter plot of ith against jth score
of the Pca object with superimposed tollerance (0.975) ellipse. See also biplot
, screeplot
.
Pca-class
,
PcaClassic
,
PcaRobust-class
.
require(graphics) ## PCA of the Hawkins Bradu Kass's Artificial Data ## using all 4 variables data(hbk) pca <- PcaHubert(hbk) pca scorePlot(pca)
require(graphics) ## PCA of the Hawkins Bradu Kass's Artificial Data ## using all 4 variables data(hbk) pca <- PcaHubert(hbk) pca scorePlot(pca)
The forest soil data set contains measurements on 58 soil pits in the Hubbard Brook Experimental Forest in north-central New Hampshire. The excavations were done in 1983 and 1986. The soil samples were analyzed for the exchangeable cations of aluminium, calcium, magnesium, potassium and sodium. The pit locations in both data sets can be classified by the type of the forest:
1: spruce-fir (11 samples),
2: high elevation hardwood (23 samples) and
3: low elevation hardwood (24 samples)).
Additionally the degree of logging disturbance can be considered (all 0 in the 1983 data set):
0: uncut forest,
1: cut, undisturbed by machinery and
2: cut, disturbed.
The observations are expressed in grams of exchangeable cations per square meter.
data(soil)
data(soil)
A data frame with 116 observations on the following 7 variables.
F
Type of forest
D
Degree of logging disturbance
Al
Level of the exchangable cations in Al
Ca
Level of the exchangable cations in Ca
Mg
Level of the exchangable cations in Mg
K
Level of the exchangable cations in K
Na
Level of the exchangable cations in Na
Morrison D.F., 2005, Multivariate Statistical Methods, Thompson
Vanden Branden K, Hubert M (2005). Robust Classiffication in High Dimensions Based on the SIMCA Method. Cbemometrics and Intelligent Laboratoty Sysiem, 79: 10–21.
data(soil) soil1983 <- soil[soil$D == 0, -2] # only 1983, remove column D (always 0) (cc <- Linda(F~., data=soil)) (pr <- predict(cc)) pr@classification
data(soil) soil1983 <- soil[soil$D == 0, -2] # only 1983, remove column D (always 0) (cc <- Linda(F~., data=soil)) (pr <- predict(cc)) pr@classification
The "Cov" object plus some additional summary information
Objects can be created by calls of the form new("SummaryCov", ...)
,
but most often by invoking 'summary' on a "Cov" object. They contain values
meant for printing by 'show'.
covobj
:Object of class "Cov"
evals
:eigenvalues of the covariance or correlation matrix
signature(obj = "SummaryCov")
: location vector
signature(obj = "SummaryCov")
: covariance matrix
signature(obj = "SummaryCov")
: vector of distances
signature(obj = "SummaryCov")
: vector of eignevalues
signature(obj = "SummaryCov")
: is the estimate a classic one
signature(object = "SummaryCov")
: display the object
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
showClass("SummaryCov")
showClass("SummaryCov")
Summary information for CovRobust objects meants for printing by 'show'
Objects can be created by calls of the form new("SummaryCovRobust", ...)
,
but most often by invoking 'summary' on an "Cov" object. They contain values
meant for printing by 'show'.
covobj
:Object of class "Cov"
evals
:Eigenvalues of the covariance or correlation matrix
Class "SummaryCov"
, directly.
signature(object = "SummaryCovRobust")
: ...
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
CovRobust-class
, SummaryCov-class
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) cv <- CovMest(hbk.x) cv summary(cv)
data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) cv <- CovMest(hbk.x) cv summary(cv)
Contains summary information about an Lda
object - Linear Discriminant Analysis object
Objects can be created by calls of the form new("SummaryLda", ...)
,
but most often by invoking 'summary' on an "Lda" object. They contain values
meant for printing by 'show'.
ldaobj
:Object of class "Lda"
signature(object = "SummaryLda")
: display the object
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
showClass("SummaryLda")
showClass("SummaryLda")
The "Pca" object plus some additional summary information
Objects can be created by calls of the form new("SummaryPca", ...)
,
but most often by invoking 'summary' on a "Pca" object. They contain values
meant for printing by 'show'.
pcaobj
:Object of class "Pca"
importance
:matrix with additional information: importance of components
signature(object = "SummaryPca")
: display the object
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
showClass("SummaryPca")
showClass("SummaryPca")
Summary information about a Qda
- Quadratic Discriminant Analysis object
Objects can be created by calls of the form new("SummaryQda", ...)
,
but most often by invoking 'summary' on an "Qda" object. They contain values
meant for printing by 'show'.
qdaobj
:Object of class "Qda"
signature(object = "SummaryQda")
: display the object
Valentin Todorov [email protected]
Todorov V & Filzmoser P (2009), An Object Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1–47. doi:10.18637/jss.v032.i03.
showClass("SummaryQda")
showClass("SummaryQda")
Performs one and two sample Hotelling T2 tests as well as robust one-sample Hotelling T2 test
T2.test(x, ...) ## Default S3 method: T2.test(x, y = NULL, mu = 0, conf.level = 0.95, method=c("c", "mcd"), ...) ## S3 method for class 'formula' T2.test(formula, data, subset, na.action, ...)
T2.test(x, ...) ## Default S3 method: T2.test(x, y = NULL, mu = 0, conf.level = 0.95, method=c("c", "mcd"), ...) ## S3 method for class 'formula' T2.test(formula, data, subset, na.action, ...)
x |
a (non-empty) numeric data frame or matrix. |
y |
an optional (non-empty) numeric data frame or matrix. |
mu |
an optional (non-empty) numeric vector of data values (or a single number which will be repeated p times) indicating the true value of the mean (or difference in means if you are performing a two sample test). |
conf.level |
confidence level of the interval |
method |
the method to be used - 'c' for sample mean and covariance matrix and 'mcd' for minimum covariance determinant estimator. A two-sample MCD based T2-test is not yet implemented. |
formula |
a formula of the form |
data |
an optional matrix or data frame (or similar: see
|
subset |
an optional vector specifying a subset of observations to be used (currently not used) |
na.action |
a function which indicates what should happen when
the data contain |
... |
further arguments to be passed to or from methods. |
The formula interface is only applicable for the two-sample tests.
A list with class "htest"
containing the following components:
statistic |
the value of the T2-statistic. |
parameter |
the degrees of freedom for the T2-statistic. |
p.value |
the p-value for the test. |
conf.int |
a confidence interval for the mean vector appropriate to the specified alternative hypothesis. |
estimate |
the estimated mean vector or vectors depending on whether it was a one-sample test or a two-sample test. |
null.value |
the specified hypothesized value of the mean or mean difference depending on whether it was a one-sample test or a two-sample test. |
alternative |
a character string describing the alternative hypothesis. |
method |
a character string indicating what type of T2-test was performed. |
data.name |
a character string giving the name(s) of the data. |
Valentin Todorov [email protected]
Willems G., Pison G., Rousseeuw P. and Van Aelst S. (2002), A robust hotelling test, Metrika, 55, 125–138.
## One-sample classical test data(delivery) delivery.x <- delivery[,1:2] T2.test(delivery.x) ## One-sample robust test data(delivery) delivery.x <- delivery[,1:2] T2.test(delivery.x, method="mcd") ## Two-sample classical test data(hemophilia) grp <-as.factor(hemophilia[,3]) x <- hemophilia[which(grp==levels(grp)[1]),1:2] y <- hemophilia[which(grp==levels(grp)[2]),1:2] T2.test(x,y) ## or using the formula interface T2.test(as.matrix(hemophilia[,-3])~hemophilia[,3]) ## Not run: ## Two-sample robust test T2.test(x,y, method="mcd") ## error - not yet implemented ## End(Not run)
## One-sample classical test data(delivery) delivery.x <- delivery[,1:2] T2.test(delivery.x) ## One-sample robust test data(delivery) delivery.x <- delivery[,1:2] T2.test(delivery.x, method="mcd") ## Two-sample classical test data(hemophilia) grp <-as.factor(hemophilia[,3]) x <- hemophilia[which(grp==levels(grp)[1]),1:2] y <- hemophilia[which(grp==levels(grp)[2]),1:2] T2.test(x,y) ## or using the formula interface T2.test(as.matrix(hemophilia[,-3])~hemophilia[,3]) ## Not run: ## Two-sample robust test T2.test(x,y, method="mcd") ## error - not yet implemented ## End(Not run)
This data set consists of seven socioeconomic variables observed for 73 countries.
data(un86)
data(un86)
A data frame with 73 observations on the following 7 variables.
POP
Total population in milions
MOR
Number of infant deaths per thousand births
CAR
Number of motorized vehicles per hundred inhabitants
DR
Number of medical doctors per thousand inhabitants
GNP
Gross national product per inhabitant in thousands of US dollars
DEN
Density in inhabitants per square kilometer
TB
Trade balance, defined as total exports/(total exports + total imports)
The data set is from World Statistics in Brief, Number 10, a 1986 UN publication. It was used in Daigle et al. (1992) to illustrate a robust biplot method.
World Statistics in Brief, Number 10, a 1986 United Nations publication
Daigle, G. and Rivest, L. (1992) A robust biplot, The canadian Journal of Statistics, 20, pp 241–255
data(un86) pairs(un86)
data(un86) pairs(un86)
The data are from a national sample of 6000 households with a male head earning less than USD 15,000 annually in 1966. The data were clasified into 39 demographic groups for analysis. The study was undertaken in the context of proposals for a guaranteed annual wage (negative income tax). At issue was the response of labor supply (average hours) to increasing hourly wages. The study was undertaken to estimate this response from available data.
data(wages)
data(wages)
A data frame with 39 observations on the following 10 variables:
HRS
Average hours worked during the year
RATE
Average hourly wage (USD)
ERSP
Average yearly earnings of spouse (USD)
ERNO
Average yearly earnings of other family members (USD)
NEIN
Average yearly non-earned income
ASSET
Average family asset holdings (Bank account, etc.) (USD)
AGE
Average age of respondent
DEP
Average number of dependents
RACE
Percent of white respondents
SCHOOL
Average highest grade of school completed
DASL library 'http://lib.stat.cmu.edu/DASL/Datafiles/wagesdat.html'
D.H. Greenberg and M. Kosters, (1970). Income Guarantees and the Working Poor, The Rand Corporation.
data(wages) names(wages) x <- as.matrix(wages) ok <- is.finite(x %*% rep(1, ncol(x))) wages <- wages[ok, , drop = FALSE] wages.lm <- lm(HRS~AGE, data=wages) plot(HRS ~ AGE, data = wages) abline(wages.lm) class(wages.lm) names(wages.lm) summary(wages.lm) wages.mm <- lmrob(HRS~AGE, data=wages) plot(HRS ~ AGE, data = wages) abline(wages.mm) class(wages.mm) names(wages.mm) summary(wages.mm)
data(wages) names(wages) x <- as.matrix(wages) ok <- is.finite(x %*% rep(1, ncol(x))) wages <- wages[ok, , drop = FALSE] wages.lm <- lm(HRS~AGE, data=wages) plot(HRS ~ AGE, data = wages) abline(wages.lm) class(wages.lm) names(wages.lm) summary(wages.lm) wages.mm <- lmrob(HRS~AGE, data=wages) plot(HRS ~ AGE, data = wages) abline(wages.mm) class(wages.mm) names(wages.mm) summary(wages.mm)
Classical and Robust One-way MANOVA: Wilks Lambda
## S3 method for class 'formula' Wilks.test(formula, data, ..., subset, na.action) ## Default S3 method: Wilks.test(x, grouping, method=c("c", "mcd", "rank"), approximation=c("Bartlett", "Rao", "empirical"), xd=NULL, xq=NULL, xfn = NULL, xwl=NULL, nrep=3000, trace=FALSE, ...) ## S3 method for class 'data.frame' Wilks.test(x, ...) ## S3 method for class 'matrix' Wilks.test(x, grouping, ..., subset, na.action)
## S3 method for class 'formula' Wilks.test(formula, data, ..., subset, na.action) ## Default S3 method: Wilks.test(x, grouping, method=c("c", "mcd", "rank"), approximation=c("Bartlett", "Rao", "empirical"), xd=NULL, xq=NULL, xfn = NULL, xwl=NULL, nrep=3000, trace=FALSE, ...) ## S3 method for class 'data.frame' Wilks.test(x, ...) ## S3 method for class 'matrix' Wilks.test(x, grouping, ..., subset, na.action)
formula |
A formula of the form |
data |
Data frame from which variables specified in |
x |
(required if no formula is given as the principal argument.) a matrix or data frame or Matrix containing the explanatory variables. |
grouping |
grouping variable - a factor specifying the class for each observation (required if no formula argument is given.) |
subset |
An index vector specifying the cases to be used. |
na.action |
A function to specify the action to be taken if |
method |
|
approximation |
|
xd |
multiplication factor for the approximate distribution of
the robust Lambda statistic. If |
xq |
the degrees of freedom for the approximate |
xfn |
the empirical distribution function. If |
xwl |
the simulated values of the robust statistic. If |
nrep |
number of trials for the simulations for computing the
multiplication factor |
trace |
whether to print intermediate results. Default is |
... |
arguments passed to or from other methods. |
The classical Wilks' Lambda statistic for testing the equality of
the group means of two or more groups is modified into a robust
one through substituting the classical estimates by the highly robust
and efficient reweighted MCD estimates, which can be computed efficiently
by the FAST-MCD algorithm - see CovMcd
.
An approximation for the finite sample distribution of the
Lambda statistic is obtained, based on matching the mean and
variance of a multiple of an distribution which
are computed by simultaion.
A list with class "htest"
containing the following components:
statistic |
the value of the Wilks' Lambda statistic. |
parameter |
The corresponding approximation of the Wilks' lambda statistic and the degrees of freedom. |
p.value |
the p-value for the test. |
estimate |
the estimated mean vectors. |
method |
a character string indicating what type of test was performed. |
data.name |
a character string giving the name of the data. |
xd |
multiplication factor for the approximate distribution of the robust Lambda statistic. |
xq |
the degrees of freedom for the approximate |
This function may be called giving either a formula and optional data frame, or a matrix and grouping factor as the first two arguments. All other arguments are optional.
Valentin Todorov [email protected]
Todorov, V. and Filzmoser, P. (2007) Robust statistic for the one-way MANOVA, submetted to the Journal of Environmetrics.
Todorov, V. (2007) Robust selection of variables in linear discriminant analysis, Statistical Methods and Applications, 15, 395.407, doi:10.1007/s10260-006-0032-6.
Nath, R. and Pavur, R. (1985) A new statistic in the one way multivariate analysis of variance, Computatational Statistics and Data Analysis, 2, 297–315
library(MASS) data(anorexia) grp <- as.factor(anorexia[,1]) x <- as.matrix(anorexia[,2:3]) ## Using the default interface, classical test Wilks.test(x, grouping=grp, method="c") ## Using the default interface, rank based test Wilks.test(x, grouping=grp, method="rank") ## For this data set: p=2, n=n1+n2+n3=29+26+17 ## were computed the following multiplication factor xd and degrees of freedom xq ## for the MCD estimates with alpha=0.5 xd <- -0.02162666 xq <- 3.63971 Wilks.test(x, grouping=grp, method="mcd", xd=xd, xq=xq) ## Now the same with the formula interface Wilks.test(Treat~Prewt+Postwt, data=anorexia, method="mcd", xd=xd, xq=xq) ##Iris data with formula interface data(iris) Wilks.test(Species~., data=iris, method="c") ## and with default interface Wilks.test(iris[,1:4],grouping=iris[,5], method="c") # hemophilia data - classical, rank and MCD test data(hemophilia) hemophilia$gr <- as.factor(hemophilia$gr) Wilks.test(gr~., data=hemophilia, method="c") Wilks.test(gr~., data=hemophilia, method="rank") ## already simulated parameters for MCD with alpha=0.5 xd <- -0.01805436 xq <- 1.950301 Wilks.test(gr~., data=hemophilia, xd=xd, xq=xq, method="mcd")
library(MASS) data(anorexia) grp <- as.factor(anorexia[,1]) x <- as.matrix(anorexia[,2:3]) ## Using the default interface, classical test Wilks.test(x, grouping=grp, method="c") ## Using the default interface, rank based test Wilks.test(x, grouping=grp, method="rank") ## For this data set: p=2, n=n1+n2+n3=29+26+17 ## were computed the following multiplication factor xd and degrees of freedom xq ## for the MCD estimates with alpha=0.5 xd <- -0.02162666 xq <- 3.63971 Wilks.test(x, grouping=grp, method="mcd", xd=xd, xq=xq) ## Now the same with the formula interface Wilks.test(Treat~Prewt+Postwt, data=anorexia, method="mcd", xd=xd, xq=xq) ##Iris data with formula interface data(iris) Wilks.test(Species~., data=iris, method="c") ## and with default interface Wilks.test(iris[,1:4],grouping=iris[,5], method="c") # hemophilia data - classical, rank and MCD test data(hemophilia) hemophilia$gr <- as.factor(hemophilia$gr) Wilks.test(gr~., data=hemophilia, method="c") Wilks.test(gr~., data=hemophilia, method="rank") ## already simulated parameters for MCD with alpha=0.5 xd <- -0.01805436 xq <- 1.950301 Wilks.test(gr~., data=hemophilia, xd=xd, xq=xq, method="mcd")
A data set containing skull morphometric measurements on Rocky Mountain and Arctic wolves (Canis Lupus L.). The tdata are published in Morrison (1990), originally from Jolicoeur (1959).
data(wolves)
data(wolves)
A data frame with 25 rows and 12 variables. The variables are as follows (all measurements are in milimeters):
class
: a factor presenting the combinations of location
and sex
. The levels are arf
arm
rmf
and rmm
location
: a factor with levels ar
=Arctic, rm
=Rocky Mountain
sex
: a factor with levels f
=female, m
=male
x1
: palatal length
x2
: postpalatal length
x3
: zygomatic width
x4
: palatal width outside first upper molars
x5
: palatal width inside second upper molars
x6
: postglenoid foramina width
x7
: interorbital width
x8
: braincase width
x9
: crown length
Jolicoeur, P. Multivariate geographical variation in the wolf Canis lupis L., Evolution, XIII, 283–299.
Morrison, D. F. Multivariate Statistical Methods, (3rd ed.), 1990. New York: McGraw-Hill, p. 288–289.
data(wolves) ## Remove the factors location and sex which we will not use for now x <- wolves[,-c(2:3)] ## Plot a pairwise scaterplot matrix pairs(x[,2:10]) mcd <- CovMcd(x[, 2:10]) plot(mcd, which="pairs") lda <- LdaClassic(class~., data=x) lda@center lda@cov predict(lda)
data(wolves) ## Remove the factors location and sex which we will not use for now x <- wolves[,-c(2:3)] ## Plot a pairwise scaterplot matrix pairs(x[,2:10]) mcd <- CovMcd(x[, 2:10]) plot(mcd, which="pairs") lda <- LdaClassic(class~., data=x) lda@center lda@cov predict(lda)