# bestglm in R

## All subset regression with leaps, bestglm, glmulti, and meifly

### Summary

• For linear regression, use leaps, which supports adjusted \( R^2 \) and Mallows' Cp.
• For logistic regression, use glmulti.
• For Cox regression, use glmulti.

http://www.umass.edu/statdata/statdata/data/lowbwt.txt

SOURCE: Hosmer and Lemeshow (2000) Applied Logistic Regression: Second Edition. These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly. Data were collected at Baystate Medical Center, Springfield, Massachusetts during 1986.

### leaps (regression subset selection)

Regression subset selection including exhaustive search. This is only for linear regression.

Reference: http://www.statmethods.net/stats/regression.html

Perform all subset regression, and choose “nbest” model(s) for each number of predictors up to nvmax.
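As a minimal sketch (using the built-in mtcars data as a stand-in for the low birth weight data), the search is done with regsubsets() from leaps:

```r
library(leaps)

## Exhaustive search over all subsets of predictors of mpg, keeping
## the single best model (nbest = 1) at each size up to nvmax = 10.
fit <- regsubsets(mpg ~ ., data = mtcars, nbest = 1, nvmax = 10)
summ <- summary(fit)
summ$outmat   # which variables enter the best model of each size
```

summary(fit) also carries the adjr2, cp, and bic vectors used for the comparisons discussed in this section.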

The result shows how it was performed.

Best model at each variable number

The best model in the 10-variable case includes all variables, as that is the only way to have 10 variables.

Graphical table of best subsets (plot.regsubsets)

By adjusted \( R^2 \), the best model includes lwt, race.cat, preterm, ht, and ui (variables that have black boxes at the highest Y-axis value).

Plot Output from regsubsets Function in leaps package

This is just another way of presenting the same information for adjusted \( R^2 \). The model with 7 variables (counting dummy variables separately) has the highest adjusted \( R^2 \).

Mallows' Cp is used to decide on the number of predictors to include. The stopping rule is to start with the smallest model, gradually increase the number of variables, and stop the first time Mallows' Cp is approximately equal to the number of regressors plus 1 (the broken line). In this case, the model with 6 regressors is the first to satisfy this condition.
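The stopping rule can be sketched directly from the summary object (again using mtcars as a hypothetical stand-in for the actual data):

```r
library(leaps)

fit  <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
summ <- summary(fit)

## Cp rule: the smallest model whose Cp is close to p + 1,
## where p is the number of regressors in that model.
p <- seq_along(summ$cp)
which(summ$cp <= p + 1)[1]   # first model size satisfying the rule

## For comparison, adjusted R^2 picks the size maximizing adjr2.
which.max(summ$adjr2)
```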

See which model has the highest adjusted \( R^2 \)

The model with 7 variables (counting dummy variables separately) has the highest adjusted \( R^2 \). Variables marked with TRUE are the ones chosen.

Do regression with the best model

The best model had to be recreated by hand from the output above.

### bestglm (Best subset GLM)

Best subset GLM using AIC, BIC, EBIC, BICq or cross-validation. In the Gaussian (normal) case, 'leaps' is used; otherwise, a slower exhaustive search is performed. The 'xtable' package is needed for the vignette 'SimExperimentBICq.Rnw' accompanying this package.


Reformat data

The outcome variable must be named y, and no extraneous variables should be present in the dataset.
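A minimal sketch of this reshaping, using mtcars as a hypothetical stand-in with mpg as the outcome:

```r
## bestglm wants predictors first and the outcome last, named y
dat <- mtcars
Xy <- data.frame(dat[, setdiff(names(dat), "mpg")],  # predictors first
                 y = dat$mpg)                         # outcome last, named y
```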

Perform all-subset linear (Gaussian) regression based on the Akaike Information Criterion (AIC)

Show result for the best model

The selected model is identical to the one chosen by adjusted \( R^2 \).

Logistic regression

Do the same, but as a logistic regression model. The resulting model was identical to the best linear model in this case.
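A hedged sketch of the logistic call, using a few mtcars columns with the binary am column as a stand-in outcome:

```r
library(bestglm)

## predictors first, 0/1 outcome last and named y
Xy  <- data.frame(mtcars[, c("wt", "hp", "qsec")], y = mtcars$am)
out <- bestglm(Xy, family = binomial, IC = "AIC")
summary(out$BestModel)
```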

### glmulti (Model selection and multimodel inference made easy)

Automated model selection and model-averaging. Provides a wrapper for glm and other functions, automatically generating all possible models (under constraints set by the user) with the specified response and explanatory variables, and finding the best models in terms of some Information Criterion (AIC, AICc or BIC). Can handle very large numbers of candidate models. Features a Genetic Algorithm to find the best models when an exhaustive screening of the candidates is not feasible.


All-subset linear regression using lm() based on AIC

All-subset logistic regression using glm() based on AIC
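A hedged sketch of the glmulti call (mtcars with its binary am column stands in for the real data; extra arguments such as family are passed through to glm()):

```r
library(glmulti)

## Exhaustive search (method = "h") over main effects only (level = 1),
## ranked by AIC; 'family' is passed through to glm().
res <- glmulti(am ~ mpg + wt + hp + qsec, data = mtcars,
               family = binomial, level = 1, method = "h", crit = "aic")
print(res)  # summarizes the best models found
```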

Load pbc survival data in survival package

All-subset Cox regression using coxph() based on AIC
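For Cox models, glmulti's fitter can be switched from glm to coxph; the fitfunction argument and the small set of pbc covariates below are a sketch, not the full analysis:

```r
library(glmulti)
library(survival)

data(pbc, package = "survival")
## fitfunction swaps the underlying fitter from glm() to coxph()
res <- glmulti(Surv(time, status == 2) ~ age + edema + bili + albumin,
               data = pbc, level = 1, method = "h", crit = "aic",
               fitfunction = coxph)
print(res)
```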

### meifly (Interactive model exploration using GGobi)

Exploratory model analysis. Fit and graphically explore ensembles of linear models.

This function simply conducts all-subset regression, so it can handle coxph without problems, but users must do model comparison on the result object themselves. Interaction terms cannot be handled, so including an interaction requires creating the product term beforehand.


Fit all subsets (main effects only)

For x, give a data.frame without the outcome variable.
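A sketch with mtcars; the fitall(y, x, method) signature is assumed from the package:

```r
library(meifly)

## All-subset (main effects only) linear models of mpg on the other columns
models <- fitall(y = mtcars$mpg, x = mtcars[, -1], method = "lm")

## The result is a list of lm fits; model comparison is left to the user,
## e.g. by AIC:
aics <- sapply(models, AIC)
which.min(aics)   # position of the lowest-AIC model
```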

Show the result

As expected, it is the same as the model chosen here ( http://rpubs.com/kaz_yos/exhaustive ).

### Exhaustive model selection without using packages

See: http://rpubs.com/kaz_yos/exhaustive

Source: https://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html

Best subset GLM using information criteria or cross-validation, carried out using the 'leaps' algorithm (Furnival and Wilson, 1974) <doi:10.2307/1267601> or complete enumeration (Morgan and Tatar, 1972) <doi:10.1080/00401706.1972.10488918>. Implements PCR and PLS using AIC/BIC. Implements the one-standard-deviation rule for use with the 'caret' package.

Author: A.I. McLeod, Changjiang Xu and Yuanhao Lai
Maintainer: Yuanhao Lai <[email protected]>
Version: 0.37.3

bestglm documentation built on March 26, 2020, 7:25 p.m.
Source: https://rdrr.io/cran/bestglm/
Source: https://rdrr.io/cran/bestglm/man/bestglm-package.html

### Exercises

• Use the function from the package to obtain the 50 best models for each model size and plot their RSS against model dimension.

• Use the and and options to perform forward, backward and exhaustive model selection and compare the results (only find a single best model at each dimension). Hint: The function returns several information criteria, choose for example Mallow’s Cp.

• Use the plot method associated with objects to visualise the BIC for the various models identified.

• The package offers alternative plot methods to visualise the results from a call to . The code below plots \(C_p\) against dimension for a object called . Try plotting adjusted \(R^2\) against dimension.

Source: https://garthtarr.github.io/avfs/lab02_soln.html

## bestglm: Best Subset GLM using Information Criterion or Cross-Validation

### Description

Best subset selection using the 'leaps' algorithm (Furnival and Wilson, 1974) or complete enumeration (Morgan and Tatar, 1972). Complete enumeration is used for the non-Gaussian case and for the case where the input matrix contains factor variables with more than 2 levels. The best fit may be found using the information criterion IC: AIC, BIC, EBIC, or BICq. Alternatively, with IC="CV", various types of cross-validation may be used.

### Usage

```r
bestglm(Xy, family = gaussian, IC = "BIC", t = "default",
        CVArgs = "default", qLevel = 0.99, TopModels = 5,
        method = "exhaustive", intercept = TRUE, weights = NULL,
        nvmax = "default", RequireFullEnumerationQ = FALSE, ...)
```

### Arguments

Xy

Dataframe containing the design matrix X and the output variable y. All columns must be named.

family

One of the glm distribution functions. The glm function is not used in the Gaussian case; instead, for efficiency, either 'leaps' is used or, when factor variables with more than 2 levels are present, 'lm' may be used.

IC

Information criteria to use: "AIC", "BIC", "BICg", "BICq", "LOOCV", "CV".

t

Adjustable parameter for BICg, BICq or CV. For BICg, the default is g=t=1. For BICq, the default is q=t=0.25. For CV, the default is the delete-d method with d=ceil(n(1-1/(log n - 1))) and REP=t=1000. The default value of the parameter may be changed by changing t.

CVArgs

Used when IC is set to "CV". The default is to use the delete-d algorithm with d=ceil(n(1-1/(log n - 1))) and t=100 repetitions. Note that the number of repetitions can be changed using t. More generally, CVArgs is a list with 3 named components: Method, K, REP, where Method is one of "HTF", "DH", "d", corresponding to the functions CVHTM (Hastie et al., 2009, K-fold CV), CVDH (adjusted K-fold CV, Davison and Hartigan, 1997) and CVd (delete-d CV with random subsamples, Shao, 1997).

qLevel

The alpha level for determining the interval for the best q. Larger values of alpha result in larger intervals.

TopModels

The number of best models to find.

method

Method used in leaps algorithm for searching for the best subset.

intercept

Default TRUE means the intercept term is always included. If set to FALSE, no intercept term is included. If you want to include the intercept term only when it is significant, then set IncludeInterceptQ=FALSE and include a column of 1's in the design matrix.

nvmax

Maximum number of independent variables allowed. By default, all variables are allowed.

RequireFullEnumerationQ

Use the exhaustive search algorithm instead of 'leaps'.

...

Optional arguments passed on to the underlying fitting function.

### Value

A list with class attribute 'bestglm' and named components:

BestModel

An lm-object representing the best fitted regression.

Title

A brief title describing the algorithm used: CV(K=K), CVadj(K=K), CVd(d=K). The range of q for an equivalent BICq model is given.

Subsets

The best subsets of size k=0,1,...,p are indicated, as well as the value of the log-likelihood and information criterion for each best subset. In the case of categorical variables with more than 2 levels, the degrees of freedom are also shown.

qTable

Table showing range of q for choosing each possible subset size. Assuming intercept=TRUE, k=1 corresponds to model with only an intercept term and k=p+1, where p is the number of input variables, corresponds to including all variables.

Bestq

Optimal q

ModelReport

A list with components: NullModel, LEAPSQ, glmQ, gaussianQ, NumDF, CategoricalQ, Bestk.

BestModels

Variables in the TopModels best models.

The methods 'print.bestglm' and 'summary.bestglm' are provided.

### Details

In the Gaussian case, the log-likelihood may be written \(\log L = -(n/2) \log (\mathrm{RSS}/n)\), where RSS is the residual sum of squares and n is the number of observations. When the function 'glm' is used, the log-likelihood, logL, is obtained using 'logLik'. The penalty for EBIC and BICq depends on the tuning parameter argument t. The argument t also controls the number of replications used when the delete-d CV is used as default; in this case, the parameter d is chosen using the formula recommended by Shao (1997).

In the binomial GLM (non-logistic) case, the last two columns of Xy are the counts of 'successes' and 'failures'.

Cross-validation may also be used to select the best subset. When cross-validation is used, the best models of size k according to the log-likelihood are compared for k=0,1,...,p, where p is the number of inputs. Cross-validation is not available when there are categorical variables, since the training sample may then not contain all levels, in which case the response in the validation sample cannot be predicted. In the case of GLM, the "DH" method for CV is not available.

Usually it is a good idea to keep the intercept term even if it is not significant. See discussion in vignette.

Cross-validation is not available for models with no intercept term, when weights is non-null, or when nvmax is set to less than the full number of independent variables.

Please see the package vignette for more details and examples.

### References

Xu, C. and McLeod, A.I. (2009). Bayesian Information Criterion with Bernoulli Prior.

Chen, J. and Chen, Z. (2008). Extended Bayesian Information Criteria for Model Selection with Large Model Space. Biometrika 2008 95: 759-771.

Furnival, G.M. and Wilson, R. W. (1974). Regressions by Leaps and Bounds. Technometrics, 16, 499--511.

Morgan, J. A. and Tatar, J. F. (1972). Calculation of the Residual Sum of Squares for All Possible Regressions. Technometrics 14, 317-325.

Miller, A. J. (2002), Subset Selection in Regression, 2nd Ed. London, Chapman and Hall.

Shao, Jun (1997). An Asymptotic Theory for Linear Model Selection. Statistica Sinica 7, 221-264.


### Examples

```r
## Not run:
# Example 1. White noise test.
set.seed(123321123)
p <- 25   # number of inputs
n <- 100  # number of observations
X <- matrix(rnorm(n * p), ncol = p)
y <- rnorm(n)
Xy <- as.data.frame(cbind(X, y))
names(Xy) <- c(paste("X", 1:p, sep = ""), "y")
bestAIC  <- bestglm(Xy, IC = "AIC")
bestBIC  <- bestglm(Xy, IC = "BIC")
bestEBIC <- bestglm(Xy, IC = "BICg")
bestBICq <- bestglm(Xy, IC = "BICq")
NAIC  <- length(coef(bestAIC$BestModel)) - 1
NBIC  <- length(coef(bestBIC$BestModel)) - 1
NEBIC <- length(coef(bestEBIC$BestModel)) - 1
NBICq <- length(coef(bestBICq$BestModel)) - 1
ans <- c(NAIC, NBIC, NEBIC, NBICq)
names(ans) <- c("AIC", "BIC", "BICg", "BICq")
ans
#  AIC  BIC BICg BICq
#    3    1    0    0

# Example 2. bestglm with BICq.
# Find best model. Default is BICq with q=0.25.
data(znuclear)  # standardized data; the rest of the examples assume it is loaded
out <- bestglm(znuclear, IC = "BICq")
out
out$Bestq       # the optimal range for q
out$qTable      # the possible models that can be chosen
out$Subsets     # the best models for each subset size
out$BestModels  # the overall best models

# Example 3. Normal probability plot, residuals, best model.
ans <- bestglm(znuclear, IC = "BICq")
e <- resid(ans$BestModel)
qqnorm(e, ylab = "residuals, best model")

# To save time, none of the remaining examples are run.

# Example 4. bestglm using EBIC, g=1.
bestglm(znuclear, IC = "BICg")
# EBIC with g=0.5:
bestglm(znuclear, IC = "BICg", t = 0.5)

# Example 5. bestglm, CV.
data(zprostate)
train <- (zprostate[zprostate[, 10], ])[, -10]
# The default CV method takes too long; set t=10 to do only
# 10 replications instead of the recommended 1000.
bestglm(train, IC = "CV", t = 10)
bestglm(train, IC = "CV", CVArgs = list(Method = "HTF", K = 10, REP = 1))
# Compare with the DH algorithm. Normally REP=100 is recommended.
bestglm(train, IC = "CV", CVArgs = list(Method = "DH", K = 10, REP = 1))
# Compare LOOCV.
bestglm(train, IC = "LOOCV")

# Example 6. Optimal q for the manpower dataset.
data(manpower)
out <- bestglm(manpower)
out$Bestq

# Example 7. Factors with more than 2 levels.
data(AirQuality)
bestglm(AirQuality)

# Example 8. Logistic regression.
data(SAheart)
bestglm(SAheart, IC = "BIC", family = binomial)
# BIC agrees with the backward stepwise approach.
out <- glm(chd ~ ., data = SAheart, family = binomial)
step(out, k = log(nrow(SAheart)))
# But BICq with q=0.25:
bestglm(SAheart, IC = "BICq", t = 0.25, family = binomial)

# Cross-validation with glm. Make results reproducible:
set.seed(33997711)
# Takes about 15 seconds and selects 5 variables.
bestglm(SAheart, IC = "CV", family = binomial)
# About 6 seconds; selects 2 variables.
bestglm(SAheart, IC = "CV", CVArgs = list(Method = "HTF", K = 10, REP = 1),
        family = binomial)
# Will produce an error (NA):
## Not run: bestglm(SAheart, IC = "CV", CVArgs = list(Method = "DH", K = 10, REP = 1), family = binomial)
## Not run: bestglm(SAheart, IC = "LOOCV", family = binomial)

# Example 9. Model with no intercept term.
X <- matrix(rnorm(200 * 3), ncol = 3)
b <- c(0, 1.5, 0)
y <- X %*% b + rnorm(40)
Xy <- data.frame(as.matrix.data.frame(X), y = y)
bestglm(Xy, intercept = FALSE)
## End(Not run)
```
Source: https://www.rdocumentation.org/packages/bestglm/versions/0.37.3/topics/bestglm

### Abstract

While purposeful selection is performed partly by software and partly by hand, the stepwise and best subset approaches are performed automatically by software. Two R functions, stepAIC() and bestglm(), are well designed for stepwise and best subset regression, respectively. The stepAIC() function begins with a full or null model, and the method for stepwise regression can be specified in the direction argument with the character values "forward", "backward" and "both". The bestglm() function begins with a data frame containing the explanatory variables and the response variable; the response variable should be in the last column. A variety of goodness-of-fit criteria can be specified in the IC argument. The Bayesian information criterion (BIC) usually results in a more parsimonious model than the Akaike information criterion.

Keywords: Logistic regression, interaction, R, best subset, stepwise, Bayesian information criterion

### Introduction

The previous article introduced purposeful selection for regression models, which allows incorporation of clinical experience and/or subject-matter knowledge into statistical science. There are several other methods for variable selection, namely the stepwise and best subsets regression. In stepwise regression, the selection procedure is performed automatically by statistical packages. The criteria for variable selection include adjusted R-square, Akaike information criterion (AIC), Bayesian information criterion (BIC), Mallows's Cp, PRESS, and the false discovery rate (1,2). The main approaches to stepwise selection are forward selection, backward elimination, and a combination of the two (3). The procedure is advantageous when there are numerous potential explanatory variables, but it is also criticized as a paradigmatic example of data dredging, in which apparently significant variables may be obtained from "noise" variables (4,5). Clinical experience and expertise are not allowed in the model-building process.

While stepwise regression selects variables sequentially, the best subsets approach aims to find the best-fitting model among all possible subset models (2). If there are p covariates, the number of all subsets is \(2^p\). There are also a variety of statistical methods for comparing the fit of subset models. In this article, I will introduce how to perform stepwise and best subset selection using R.

### Working example

The working example used in this tutorial is from the package MASS. You can take a look at what each variable represents.

The bwt data frame contains 9 columns and 189 rows. The variable low is an indicator variable, with "0" indicating birth weight >2.5 kg and "1" indicating the presence of low birth weight. age is the mother's age in years. The variable lwt is the mother's weight in pounds. race is the mother's race and smoke is smoking status during pregnancy. The number of previous premature labours is ptl. Other information includes history of hypertension (ht), presence of uterine irritability (ui), and the number of physician visits during the first trimester (ftv).

### Stepwise selection

We can begin with the full model. The full model can be denoted by using the symbol "." on the right-hand side of the formula.

As you can see in the output, all variables except low are included in the logistic regression model. The variables lwt, race, ptd and ht are found to be statistically significant at the conventional level. With the full model at hand, we can begin our stepwise selection procedure.

All arguments in the stepAIC() function are set to default. If you want to set direction of stepwise regression (e.g., backward, forward, both), the direction argument should be assigned. The default is both.

Because forward stepwise regression here begins with the full model, there are no additional variables that can be added, and the final model is the full model. Forward selection should instead begin with the null (intercept-only) model.
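A sketch of forward selection from the intercept-only model; the data frame bwt is assumed to have been prepared from MASS::birthwt (as in the ?birthwt example):

```r
library(MASS)

## null (intercept-only) and full logistic models; 'bwt' assumed prepared
null <- glm(low ~ 1, data = bwt, family = binomial)
full <- glm(low ~ ., data = bwt, family = binomial)

## forward selection must be given an explicit upper scope
stepAIC(null, direction = "forward",
        scope = list(lower = ~ 1, upper = formula(full)))
```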

The backward elimination procedure eliminated the variables ftv and age, which is exactly the same result as the "both" procedure.

Different criteria can be assigned to the stepAIC() function for stepwise selection. The default is AIC, which is performed by assigning the argument k to 2 (the default option).
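The k argument carries the penalty per parameter, so switching from AIC to BIC is a one-argument change (bwt assumed prepared from MASS::birthwt):

```r
library(MASS)

full <- glm(low ~ ., data = bwt, family = binomial)  # 'bwt' assumed
stepAIC(full, k = 2)               # AIC: the default penalty per parameter
stepAIC(full, k = log(nrow(bwt)))  # BIC: penalty log(n)
```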

The stepAIC() function also allows specification of the range of variables to be included in the model via the scope argument. The lower model is the model with the smallest number of variables and the upper model is the largest possible model. Both the upper and lower components of scope can be explicitly specified. If scope is a single formula, it specifies the upper component, and the lower model is empty. If scope is missing, the initial model is the upper model.

When I specify that the smallest model must include the age variable, it will not be excluded by stepwise regression (otherwise, age would be excluded, as shown above). This function can help investigators keep variables that are considered relevant on subject-matter grounds. Next, we can supply a more complicated model for stepwise selection.
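Forcing a variable into every candidate model is done through the lower component of scope (bwt assumed prepared from MASS::birthwt):

```r
library(MASS)

full <- glm(low ~ ., data = bwt, family = binomial)  # 'bwt' assumed
## the lower scope keeps 'age' in every candidate model
stepAIC(full, scope = list(lower = ~ age, upper = formula(full)))
```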

Recall that the "^" symbol denotes interactions up to a specified degree. In our case, we specified two-way interactions among all possible combinations of variables. Elements within I() are interpreted arithmetically. The function scale(age) centers the variable age at its mean and scales it by its standard deviation. "~ .^2 + I(scale(age)^2) + I(scale(lwt)^2)" is the scope argument, and a single formula implies the upper component. The results show that the interactions between age and ftv, and between smoke and ui, remain in the final model. Other interactions and quadratic terms are removed.

### Best subset regression

Best subset regression selects the best model from all possible subsets according to some goodness-of-fit criterion. This approach has been in use for linear regression for several decades, with the branch-and-bound algorithm (6). Later on, Lawless and Singhal proposed an extension that can be used for non-normal error models (7). The application of best subsets to logistic regression was described by Hosmer and coworkers (8). An R package called "bestglm" contains functions for performing best subsets selection. The bestglm() function employs a simple exhaustive search algorithm, as described by Morgan and Tatar (9).

Xy is a data frame containing the independent variables and the response variable. For a logistic regression model with family set to binomial, the last column is the response variable. The column order of Xy is important because a formula specifying response and independent variables is not allowed with the bestglm() function. We can move the response variable low to the last column and assign a new name to the new data frame. The IC argument specifies the information criterion to use; its values can be "AIC", "BIC", "BICg", "BICq", "LOOCV" and "CV" (10).

Furthermore, factors with more than two levels should be converted to dummy variables. Otherwise, it returns an error message.

To create dummy variables for factors with more than two levels, we use the dummies package. The dummy() function takes a single variable and returns a matrix with the number of rows equal to the length of the given variable and the number of columns equal to the number of levels of that variable. Because only n-1 dummy variables are needed to define a factor with n levels, I remove the base level by simple manipulation of vectors. Finally, a new data frame containing the dummy variables is created, with the response variable in the last column.
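As an alternative to the dummies package, base R's model.matrix() produces the same n-1 indicator columns and drops the base level automatically; the data frame bwt (with factor column race and binary outcome low) is assumed:

```r
## model.matrix() expands the factor 'race' into indicator columns;
## dropping column 1 removes the intercept, leaving n-1 dummies.
race.ind <- model.matrix(~ race, data = bwt)[, -1]   # 'bwt' assumed
Xy <- data.frame(bwt[, setdiff(names(bwt), c("race", "low"))],
                 race.ind,
                 low = bwt$low)   # response in the last column, as bestglm requires
```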

Model selection by AIC keeps more variables in the model, as follows.

Readers can try other options available in bestglm() function. Different options may result in different models.

### Summary

The article introduces variable selection with the stepwise and best subset approaches. Two R functions, stepAIC() and bestglm(), are well designed for these purposes. The stepAIC() function begins with a full or null model, and the method for stepwise regression can be specified in the direction argument with the character values "forward", "backward" and "both". The bestglm() function begins with a data frame containing the explanatory variables and the response variable; the response variable should be in the last column. Factor variables with more than two levels should be converted before running bestglm(). The dummies package contains a good function for converting a factor variable to dummy variables. There is a variety of information criteria for selecting the best subset model.

### Biography

•

Author’s introduction: Zhongheng Zhang, MMed. Department of Critical Care Medicine, Jinhua Municipal Central Hospital, Jinhua Hospital of Zhejiang University. Dr. Zhongheng Zhang is a fellow physician of the Jinhua Municipal Central Hospital. He graduated from School of Medicine, Zhejiang University in 2009, receiving Master Degree. He has published more than 35 academic papers (science citation indexed) that have been cited for over 200 times. He has been appointed as reviewer for 10 journals, including Journal of Cardiovascular Medicine, Hemodialysis International, Journal of Translational Medicine, Critical Care, International Journal of Clinical Practice, Journal of Critical Care. His major research interests include hemodynamic monitoring in sepsis and septic shock, delirium, and outcome study for critically ill patients. He is experienced in data management and statistical analysis by using R and STATA, big data exploration, systematic review and meta-analysis.

### Footnotes

Conflicts of Interest: The author has no conflicts of interest to declare.

### References

1. Hocking RR. A Biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics 1976;32:1-49. doi:10.2307/2529336

2. Hosmer DW Jr, Lemeshow S, Sturdivant RX. Applied Logistic Regression. Hoboken: John Wiley & Sons, Inc, 2013.

3. Harrell FE. Regression Modeling Strategies: with Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer-Verlag, 2013.

4. Freedman DA. A note on screening regression equations. The American Statistician 1983;37:152-5.

5. Flack VF, Chang PC. Frequency of selecting noise variables in subset regression analysis: a simulation study. The American Statistician 1987;41:84-6.

6. Furnival GM, Wilson RW. Regressions by leaps and bounds. Technometrics 1974;16:499-511. doi:10.1080/00401706.1974.10489231

7. Lawless JF, Singhal K. ISMOD: an all-subsets regression program for generalized linear models. II. Program guide and examples. Comput Methods Programs Biomed 1987;24:125-34. doi:10.1016/0169-2607(87)90023-X

8. Hosmer DW, Jovanovic B, Lemeshow S. Best subsets logistic regression. Biometrics 1989;45:1265-70. doi:10.2307/2531779

9. Morgan JA, Tatar JF. Calculation of the residual sum of squares for all possible regressions. Technometrics 1972;14:317-25. doi:10.1080/00401706.1972.10488918

10. McLeod AI, Xu C. bestglm: best subset GLM. R package version 0.34, 2014. Available online: https://cran.r-project.org/web/packages/bestglm/bestglm.pdf

Articles from Annals of Translational Medicine are provided here courtesy of AME Publications

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4842399/
