1. Model selection

In this document, we are using relationship between two size parameters as an example: Predicting tree height from trunk diameter. See the vignette Get started with allometree for a full description of the dataset and allometric equations.

For one species

For a particular species of interest, data will be fit to all allometric equations. The best-fit model is selected based on the lowest bias-corrected Aikaike’s information criterion (AICc) value. Here’s an example to select the best-fit model for the species Albizia saman, using the function ss_modelselect().

data(urbantrees, package = "allometree")

Alb_sam <- urbantrees[urbantrees$species == "Albizia saman", ]  # subset data for one species

results <- ss_modelselect(Alb_sam, 
                          response = "height", predictor = "diameter")  # specify colnames of variables

results is a list of 3 elements. More information on the output can be found at ?ss_modelselect.

  1. Table showing models ranked by AICc value, named all_models_rank.
  2. Best-fit model object, named best_model.
  3. Table showing information on the best-fit model, named best_model_info.
head(results$all_models_rank)
#>   df     AICc    model
#> 1  3 591.9422   lin_w1
#> 2  4 593.0074  quad_w1
#> 3  5 593.4045   cub_w1
#> 4  6 594.0030 quart_w1
#> 5  3 594.0139   lin_w2
#> 6  5 595.5983   cub_w2
results$best_model
#> 
#> Call:
#> lm(formula = y ~ x)
#> 
#> Coefficients:
#> (Intercept)            x  
#>       6.717        9.464
results$best_model_info
#>   modelcode        a        b  c  d  e response_geom_mean correctn_factor
#> 1    lin_w1 6.717431 9.464096 NA NA NA           13.56608               1
#>   predictor_min predictor_max response_min response_max residual_SE mean_SE
#> 1     0.3119437      1.527887            8           20       2.205  4.7889
#>   adj_R2   n
#> 1 0.4276 133

For multiple species

We can also run ss_modelselect() across multiple species in the full dataset urbantrees. For this, we use the wrapper function ss_modelselect_multi().

results_all <- ss_modelselect_multi(urbantrees, species = "species", # specify colname of species
                                    response = "height", predictor = "diameter")

results_all is a list of 3 elements:

  1. List of tables showing each species’ candidate models ranked by AICc value, named ss_models_rank.
  2. List of each species’ best-fit model object, named ss_models.
  3. Table showing each species’ best-fit model information, named ss_models_info.

 

An overview of the best-fit models across the 5 species:

results_all$ss_models_info
#>                    species modelcode         a           b          c        d
#> 1            Albizia saman    lin_w1  6.717431   9.4640964         NA       NA
#> 2            Hopea odorata  quart_w4 -1.149669 149.3178483 -1522.6122 7908.410
#> 3     Syzygium myrtifolium  quart_w2 -4.614100 158.5128488  -748.2492 1638.676
#> 4       Terminalia mantaly   expo_w1  1.241983   3.8600410         NA       NA
#> 5 Xanthostemon chrysanthus loglog_w1  2.947907   0.5916555         NA       NA
#>            e response_geom_mean correctn_factor predictor_min predictor_max
#> 1         NA          13.566083        1.000000    0.31194369     1.5278875
#> 2 -14225.442           5.390829        1.000000    0.03183099     0.2928451
#> 3  -1321.221           7.735643        1.000000    0.04138029     0.5665916
#> 4         NA           7.596013        1.000533    0.03501409     0.5602254
#> 5         NA           5.041605        1.000883    0.02864789     0.3533240
#>   response_min response_max residual_SE mean_SE adj_R2   n
#> 1            8           20      2.2050  4.7889 0.4276 133
#> 2            2           15     11.9430  2.6339 0.5460 483
#> 3            1           18      2.1674  2.2536 0.6662 353
#> 4            3           18      0.2493  0.0615 0.6702 197
#> 5            2           13      0.2123  0.0449 0.6137 418

2. Outlier removal

Let’s say we want to check for outlier/influential data points, remove them, and re-fit the filtered dataset to the pre-selected models as defined in results_all$ss_models_info. First, lets visualise these outliers, say, for the species Xanthostemon chrysanthus.

par(mfrow=c(2,2)); plot(results_all$ss_models$`Xanthostemon chrysanthus`)

 

There are numerous ways to deal with outliers. One way is to remove them based on Cook’s Distance (bottom-right plot). In general use, outliers may be defined as points with a Cook’s distance more than four times the mean Cook’s distance (red line in plot below). Note that this threshold is not fixed, and may be adjusted according to biological knowledge of each tree species.

cooks_dist <- cooks.distance(results_all$ss_models$`Xanthostemon chrysanthus`)
plot(cooks_dist, pch = ".", cex = 2, main = "Outliers by Cook's Distance")
abline(h = 4 * mean(cooks_dist, na.rm = T), col = "red")  # threshold line 
text(x = 1:length(cooks_dist) + 1, y = cooks_dist, 
     labels = ifelse(cooks_dist > 4 * mean(cooks_dist, na.rm=T), names(cooks_dist), ""), col = "red")  # outlier labels

 

Lets remove outliers across all species in urbantrees. We can create a function uncooker() that does so for each species using a for() loop, based on their respective best-fit model in results_all$ss_models:

uncooker <- function(data, modellist){
  
  result <- data[0,]
  
  for(i in 1:length(modellist)){ # loop over all species 
    
    cooks_dist <- cooks.distance(modellist[[i]]) 
    outliers <- as.numeric(which(cooks_dist > 4 * mean(cooks_dist, na.rm=T)))  # threshold
    subset <- data[(data$species == names(modellist[i])),][-outliers,]
    result <- rbind.data.frame(result, subset)
  }
  return(result)
}

# run the function uncooker()
urbantrees_clean <- uncooker(urbantrees, results_all$ss_models)

nrow(urbantrees_clean)
#> [1] 1525

59 data points were removed to obtain the filtered dataset urbantrees_clean (n = 1525).

3. Re-fit filtered data

For one species

We can now fit urbantrees_clean to the best-fit equations we defined in Section 1. Here’s an example for the species Albizia saman again, using the function ss_modelfit(). This function has an additional argument modelcode. In the case of Albizia saman, it is lin_w1 (see results_all$ss_models_info). Note that you can also pick any one of the options found in eqns_info$modelcode.

 

We’ll overwrite our previously-defined variables Alb_sam and results, this time with the filtered dataset:

Alb_sam <- urbantrees_clean[urbantrees_clean$species == "Albizia saman", ]

results <- ss_modelfit(Alb_sam, modelcode = "lin_w1", # specify modelcode
                       response = "height", predictor = "diameter") 

results is now a list of 2 elements:

  1. Resulting model object, named fitted_model.
  2. Table showing information on the resulting model, named fitted_model_info.

For multiple species

Similar to the the wrapper function ss_modelselect_multi(), we can also run ss_modelfit_multi() to fit pre-selected models across multiple species. In this function, we need to input a reference table ref_table (i.e. results_all$ss_models_info) that provides information on the species and their corresponding modelcode.

results_all <- ss_modelfit_multi(urbantrees_clean, 
                                 ref_table = results_all$ss_models_info,
                                 species = "species", # colname in both data & ref_table
                                 modelcode = "modelcode", # colname in ref_table
                                 response = "height", predictor = "diameter")

results_all is now a list of 2 elements:

  1. List of each species’ resulting model object, named ss_models.
  2. Table showing each species’ resulting model information, named ss_models_info.

 

To make our subsequent code less verbose, let’s assign the elements ss_models and ss_models_info to variables with the same name:

ss_models <- results_all$ss_models
ss_models_info <- results_all$ss_models_info

4. Make predictions

We can simulate some data for a species of interest, and use it’s model to make predictions. We’ll use the species Albizia saman again as an example. Let’s first generate the data based on the range of values for the predictor variable (i.e. diameter size) that was used to fit the model, using the function generate_x():

Alb_sam <- urbantrees[urbantrees$species == 'Albizia saman', ]  

# generate data for subsequent predictions
newdata <- generate_x(Alb_sam, response = "height", predictor = "diameter")

 

ss_predict() can be used to make predictions on the simulated data predict_range_full. This appends columns for predicted height (fit), as well as the lower (lwr) and upper (upr) bounds of the prediction interval:

predictions <- ss_predict(newdata, 
                          models = ss_models, 
                          ref_table = ss_models_info, 
                          predictor = "predictor")
head(predictions)
#>         species predictor extrapolated      fit      lwr      upr
#> 1 Albizia saman 0.3119437           No 9.011007 4.679395 13.34262
#> 2 Albizia saman 0.3242260           No 9.145375 4.818864 13.47189
#> 3 Albizia saman 0.3365082           No 9.279744 4.958194 13.60129
#> 4 Albizia saman 0.3487905           No 9.414112 5.097386 13.73084
#> 5 Albizia saman 0.3610727           No 9.548481 5.236439 13.86052
#> 6 Albizia saman 0.3733550           No 9.682849 5.375353 13.99035

 

Note that you can simulate data automatically using the ss_simulate() function, which is also a wrapper to ss_predict(). This can be done for one or multiple species, using the argument select_sp. The simulated data can be extrapolated beyond the range used to fit the model, using the argument extrapolate. Here’s an example using the models that we previously derived from the filtered dataset urbantrees_clean. Let’s say we’re interested in Albizia saman, Terminalia mantaly, and Xanthostemon chrysanthus, and would like to predict their height across a diameter size of 0 – 1 m. We’ll overwrite our previously-defined variable predictions:

predictions <- ss_simulate(ref_table = ss_models_info, models = ss_models, 
                           select_sp = c("Albizia saman", "Terminalia mantaly", "Xanthostemon chrysanthus"),
                           extrapolate = c(0,1)) 
head(predictions)
#> # A tibble: 6 × 6
#> # Groups:   species [1]
#>   species       predictor   fit   lwr   upr extrapolated
#>   <chr>             <dbl> <dbl> <dbl> <dbl> <chr>       
#> 1 Albizia saman     0.312  9.01  4.68  13.3 No          
#> 2 Albizia saman     0.321  9.11  4.78  13.4 No          
#> 3 Albizia saman     0.330  9.21  4.89  13.5 No          
#> 4 Albizia saman     0.340  9.31  4.99  13.6 No          
#> 5 Albizia saman     0.349  9.41  5.10  13.7 No          
#> 6 Albizia saman     0.358  9.51  5.20  13.8 No

5. Plot predictions

We can plot out our predictions using ggplot():

library(ggplot2)

ggplot() +
  facet_wrap(~ species)+ 
  geom_point(data = dplyr::filter(urbantrees, species %in% predictions$species), aes(x = diameter, y = height),
             size=0.35, alpha = 0.9, color = "grey50") +
  
  # prediction interval
  geom_ribbon(data = predictions, aes(x = predictor, ymin = lwr, ymax = upr), 
              alpha = 0.10) +
  
  # regression line
  geom_line(data = predictions, aes(x = predictor, y = fit, lty = extrapolated),
            size = 0.8, alpha = 0.6) +
  
  scale_linetype_manual(values=c("No"= 1 , "Low" = 3, "High" = 3), name = "Extrapolated") +
  
  xlab("Diameter (m)") + ylab("Height (m)") +
  
  coord_cartesian(ylim=c(0, 25), xlim= c(0,1))

 

Interpretations of the allometric relationships will vary depending on what type of parameter is analysed (e.g. tree height, trunk diameter, age, crown height, crown diameter and leaf area). In our example, we’ll have to be cautious in interpretating extrapolated predictions. As you can see in the figure, extrapolations down to a diameter size of zero for Albizia saman and Terminalia mantaly would not make sense (i.e. the tree cannot have a positive height at that diameter size). Also, while an exponential relationship shows rapid scaling in height for Terminalia mantaly, it would have a natural height limit (i.e. would not continue growing taller at such a steep slope). Hence, it is important that allometric relationships are interpreted in conjunction with information on the biology and growth (e.g. environmental and management) conditions associated with the trees.


References

McPherson E. G., van Doorn N. S. & Peper P. J. (2016) Urban Tree Database and Allometric Equations. General Technical Report PSW-GTR-253, USDA Forest Service, 86.

Song, X. P., Lai, H. R., Wijedasa, L. S., Tan, P. Y., Edwards, P. J., & Richards, D. R. (2020), Height–diameter allometry for the management of city trees in the tropics. Environmental Research Letters, 15, 114017. https://doi.org/10.1088/1748-9326/abbbad