---
title: "Model Selection (introduction)"
---

```{r}
#| label: load_packages
#| echo: false

# Load required packages
pacman::p_load(
  ## data manipulation
  dplyr, tibble, tidyverse, broom, broom.mixed,

  ## model fitting
  ape, arm, brms, cmdstanr, emmeans, glmmTMB, MASS, phytools, rstan, TreeTools,

  ## model checking and evaluation
  DHARMa, loo, MuMIn, parallel,

  ## visualisation
  bayesplot, ggplot2, patchwork, tidybayes,

  ## reporting and utilities
  gt, here, kableExtra, knitr
)
```

# Model selection

This section outlines general guidelines for model selection and refinement, based on the results from the previous examples (sections XX and Beyond Gaussian 1). These steps can be applied to any model-fitting process, whether using `glmmTMB`, `brms`, or other packages.

1. **Identify data type and plot raw data** to understand the distribution and structure of the response variable. This helps in selecting an appropriate model family (e.g., Gaussian, Poisson, Negative Binomial, Beta).
2. **Begin with a simpler model** (e.g., a location-only model assuming homoscedasticity) to establish a baseline for interpretation.
3. **Check residual diagnostics** to detect possible model misspecification, such as overdispersion or non-uniformity of residuals. If issues are detected, consider more complex models (e.g., location-scale models) that account for heterogeneity in variance.
4. **Gradually increase model complexity**: Introduce additional components, such as group-level effects on the scale (variance), and consider estimating correlations between random effects when theoretically or empirically justified, especially when supported by sufficient sample size.
5. **Compare models using information criteria**:
   - For frequentist models: use **AIC** (Akaike Information Criterion) or its small-sample correction, **AICc**.
   - For Bayesian models: use **LOO** (Leave-One-Out Cross-Validation) or **WAIC** (Widely Applicable Information Criterion).
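
The steps above can be sketched roughly as follows with `glmmTMB`, `DHARMa`, and `MuMIn`. This is only a minimal illustration on simulated data, not one of the earlier examples: the data frame `dat`, its columns `y`, `group`, and `id`, the simulation settings, and the model names `m_loc` and `m_ls` are all hypothetical and chosen purely for demonstration.

```r
# Minimal sketch of steps 2-5 on SIMULATED data (all names are hypothetical)
library(glmmTMB)  # location-only and location-scale models
library(DHARMa)   # simulation-based residual diagnostics
library(MuMIn)    # AICc

set.seed(42)

# Simulate two groups that differ in both mean (location) and SD (scale)
n_id  <- 20
n_obs <- 200
dat <- data.frame(
  group = factor(rep(c("A", "B"), each = n_obs / 2)),
  id    = factor(rep(seq_len(n_id), times = n_obs / n_id))
)
u     <- rnorm(n_id, mean = 0, sd = 0.5)          # random intercepts per id
mu    <- 1 + 0.5 * (dat$group == "B") + u[as.integer(dat$id)]
sigma <- exp(-0.5 + 1.0 * (dat$group == "B"))     # group B is more variable
dat$y <- rnorm(n_obs, mean = mu, sd = sigma)

# Step 2: location-only baseline (assumes constant residual variance)
m_loc <- glmmTMB(y ~ group + (1 | id), data = dat, family = gaussian())

# Step 3: residual diagnostics for the baseline model
res_loc <- simulateResiduals(m_loc, n = 500, plot = FALSE)
testUniformity(res_loc, plot = FALSE)
testCategorical(res_loc, catPred = dat$group, plot = FALSE)

# Step 4: location-scale model; dispformula lets residual variation
# depend on group
m_ls <- glmmTMB(y ~ group + (1 | id), dispformula = ~ group,
                data = dat, family = gaussian())
res_ls <- simulateResiduals(m_ls, n = 500, plot = FALSE)
testCategorical(res_ls, catPred = dat$group, plot = FALSE)

# Step 5: compare the two frequentist fits with AICc (lower is better)
aicc_tab <- AICc(m_loc, m_ls)
aicc_tab
```

For models fitted with `brms`, the analogous comparison in step 5 would typically call `loo()` on each fit and pass the results to `loo_compare()`; that part is omitted here because it requires compiling and sampling Stan models.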

Ultimately, the goal is not only to improve statistical fit but also to gain biological insights. For example, finding that a scale predictor improves model fit may suggest biologically meaningful heterogeneity.

We demonstrate these steps using the previous examples, where we started with a location-only model and then moved to a location-scale model to account for heterogeneity in variance. We also compared models using AICc and LOO criteria to select the best-fitting model. Although parts of the examples below may repeat content from earlier sections, we present them again here in sequence to illustrate a typical approach to model selection and refinement.

::: {.callout-tip appearance="simple" icon="false"}

One important thing to keep in mind is that `DHARMa` does not directly support models fitted using `brms`. This means that you cannot check residual diagnostics for the location-only double-hierarchical model (model 3) in the usual way if it was fitted with `brms`. If you want to use `DHARMa` for residual diagnostics, you need to run model 2.5 instead; in other words, you cannot include the correlation part.