Building the BIOCLIM model

Justification for this use case: SImpleSDMLayers can be used as a platform to build your own species distribution models. In this example, which assumes that you have read the vignettes on GBIF integration and variable selection through VIF, we will build our own version of the BIOCLIM model, and apply it to the distribution of Hypomyces lactifluorum in North America.

using SimpleSDMLayers
using GBIF
using Plots
using GLM
using StatsBase
using Statistics
using GeometryBasics

BIOCLIM is a very simple model, which only requires presence information. The first step is therefore to get occurrences of Hypomyces lactifluorum in North America. Because the data in GBIF is only as good as the original data source, sometimes searching by continent gives fewer results than searching by country.

observations = occurrences(
    GBIF.taxon("Hypomyces lactifluorum"; strict=true),
    "hasCoordinate" => "true",
    "country" => "CA",
    "country" => "US",
    "limit" => 300,
)
GBIF records: downloaded 300 out of 2921

We will now page through additional results (300 at a time).

while length(observations) < size(observations)
    occurrences!(observations)
end

At this point, we could read the whole predictor variables directly, and then clip them. This would be fairly wasteful, as we need a small area. For this reason, we will calculate the bounding box first, and then use it to only read the section we want.

left, right = extrema(longitudes(observations)) .+ (-5, 5)
bottom, top = extrema(latitudes(observations)) .+ (-5, 5)
boundaries = (left=left, right=right, bottom=bottom, top=top)
(left = -164.022167, right = -55.250858, bottom = 23.547132, top = 72.105833)

With this information in hand, we can start getting our variables. In this example, we will take all of the BioClim data from WorldClim, at the 10 arc minute resolution, and add the elevation layer. Note that using the bounding box coordinates when calling the layers is much faster than clipping after the fact (assuming that you already have the files downloaded).

predictors =
    convert.(
        Float32, SimpleSDMPredictor(WorldClim, BioClim, 1:19; resolution=10.0, boundaries...)
    );

We will add the elevation to the stack of variables we use – we need to convert everything to Float32 layers, because elevation is originally an Int16 one and a number of operations we will make will require floating points

push!(
    predictors,
    convert(
        Float32, SimpleSDMPredictor(WorldClim, Elevation; resolution=10.0, boundaries...)
    ),
);

It is not a bad idea to plot all of the predictors:

plot(plot.(predictors, grid=:none, axes=false, frame=:none, leg=false, c=:imola)...)

Clearly, some of them show strong autocorrelation; we will therefore re-use our VIF code to select a subset that has uncorrelated variables.

function vif(model)
    R² = r2(model)
    return 1 / (1-R²)
end

function stepwisevif(
    layers::Vector{T}, selection=collect(1:length(layers)), threshold::Float64=5.0
) where {T<:SimpleSDMLayer}
    x = hcat([layer[keys(layer)] for layer in layers[selection]]...)
    X = (x .- mean(x; dims=1)) ./ std(x; dims=1)
    vifs = zeros(Float64, length(selection))
    for i in eachindex(selection)
        vifs[i] = vif(lm(X[:, setdiff(eachindex(selection), i)], X[:, i]))
    end
    all(vifs .<= threshold) && return selection
    drop = last(findmax(vifs))
    popat!(selection, drop)
    @info "Variables remaining: $(selection)"
    return stepwisevif(layers, selection, threshold)
end
stepwisevif (generic function with 3 methods)

We will apply this function with the default parameters:

layers_to_keep = stepwisevif(predictors)
7-element Vector{Int64}:
  7
  8
  9
 15
 18
 19
 20

When this is done, we can plot the layers again to check that they are all more or less unique:

plot(
    plot.(
        predictors[layers_to_keep], grid=:none, axes=false, frame=:none, leg=false, c=:imola
    )...,
)