Node-based analysis of species distributions

This example demonstrates how to do a node-based comparison of species distributions, as described in Borregaard, M.K., Rahbek, C., Fjeldså, J., Parra, J.L., Whittaker, R.J. and Graham, C.H. (2014). Node-based analysis of species distributions. Methods in Ecology and Evolution 5: 1225-1235.

We will reimplement the method from the paper from first principles, using SpatialEcology functionality and the ecojulia phylogenetics package Phylo. We start by loading the basic data objects and end up with defining a function with the full functionality of the published paper.

We will work with the distributions and phylogenetic relationships for all species of the family Tyrannidae in the Americas. The species occurrences are defined on a regular grid with a cellsize of 1 lat/long degree. This is one of the datasets used in the Borregaard et al. (2014) paper.

Load data and create objects

First, let's load the data.

Species occurrence data for spatial ecological analysis exists in a variety of different formats. A common format is to have the data in one or several CSV files.

In this case, we have the data in two CSV tables, one of species occurrences in each grid cell, and one with the lat-long coordinates of each grid cell.

The CSV table of occurrences is in the widely used phylocom format, which is a long-form format for associating the occurrence of species in sites. It consists of three columns, a column of species names, one of abundances (here all have the value 1, as it's a presence-absence data set) and a column of sites.

using CSV, DataFrames, SpatialEcology
phylocom = CSV.read("../../data/tyrann_phylocom.tsv", DataFrame)

4 rows × 3 columns

PlotRecordSpecies
Int64Int64String6…
165041Empidonax_hammondii
265041Empidonax_alnorum
365041Sayornis_saya
465041Contopus_cooperi

The coordinates is a simple DataFrame with a column of sites, one of latitude and one of longitude

coord = CSV.read("../../data/tyrann_coords.tsv", DataFrame)

4 rows × 3 columns

LongLatcell
Float64Float64Int64
1-156.571.56504
2-155.571.56505
3-162.570.56858
4-161.570.56859

We ensure that the column of sites are represented as strings in both data sets. We then construct the Assemblage object. The site columns are used to match the two DataFrames together.

phylocom.Plot = string.(phylocom.Plot)
coord.cell = string.(coord.cell)
tyrants = Assemblage(phylocom, coord)
Assemblage with 390 species in 3716 sites

Species names:
Empidonax_hammondii, Empidonax_alnorum, Sayornis_saya...Muscisaxicola_capistratus, Neoxolmis_rufiventris

Site names:
6504, 6505, 6858...51588, 51589

Let's have a look at the data

using Plots
default(color = cgrad(:Spectral, rev = true))
plot(tyrants)

Next, we'll read in the phylogenetic tree

using Phylo
tree = open(parsenewick, "../../data/tyrannid_tree.tre")
sort!(tree) # sort the nodes on the tree in order of size - useful for plotting
plot(tree, treetype = :fan, tipfont = (5,))

Extract information from a single clade

The Phylo package uses iterators over vertices in the phylogeny for almost everything. For example, to get a vector of all internal (non-tip) nodes in the phylogeny, we would create an iterator over the names of all nodes in the tree, filtered by the function isleaf, which checks whether a node has any descendants, and then collect the iterator to a vector.

nodes = nodenamefilter(!isleaf, tree)
nodevec = collect(nodes)
first(nodevec, 4) # hide

Let's pick a random node from the vector to demonstrate how we can get information on that node.

randnode = nodevec[131]
"Node 311"

We can get a list of the names of all tips/species descending from the node by getting all descendant nodes with getdescendants and filtering with isleaf. We need to pass an anonymous function to filter here, because the isleaf function takes two arguments.

nodespecies(tree, node) = filter(x -> isleaf(tree, x), getdescendants(tree, node))
nodespecies(tree, randnode)
first(nodespecies(tree, randnode), 4) # hide

We can use that species list to subset an Assemblage object. For instance, we can make a function to create a smaller Assemblage of all species descending from our selected node.

get_clade(assemblage, tree, node) = view(assemblage, species = nodespecies(tree, node))

rand_clade = get_clade(tyrants, tree, randnode)
plot(rand_clade, title = randnode)

Comparing the richness of sister clades

The question we are interested in addressing here is: At a given node where the lineage splits into two sister clades - are the two descendant clades distributed differently? This could be an indication that an evolutionary or biogeographic event happened at that time, of consequence for the current distribution of the species.

So let's get the two descendant clades, and plot their distribution in comparison to the parent clade

function plot_node(assemblage, tree, node)
    ch1, ch2 = getchildren(tree, node)[1:2]
    assm = get_clade(assemblage, tree, node)
    assmch1 = get_clade(assm, tree, ch1)
    assmch2 = get_clade(assm, tree, ch2)

    plot(
        plot(assm), plot(assmch1), plot(assmch2),
        layout = (1,3), size = (1000, 350), title = ["parent" "child 1" "child 2"]
    )
end

plot_node(tyrants, tree, randnode)

It is clear that the two clades have distinct distributions, with the first descendant appearing to be overrepresented in the tropical rainforest biome, mainly in the Amazon.

But is the difference great enough that we can say that this is not just a random pattern? We can use randomization to find out.

Using randomization to assess significance of distribution differences

SpatialEcology Assemblages can be randomized using the curveball matrix randomization algorithm defined in RandomBooleanMatrices.jl.

This algorithm randomizes a species-by-site matrix while keeping row and column sums constant, and is very fast. We can instantiate a matrixrandomizer object from our assemblage, and then use this object to repeatedly generate randomized communities

rmg = matrixrandomizer(rand_clade)
newcomm = rand(rmg)
plot(newcomm, title = "randomized version of $randnode")

Because row and column sums are kept constant, the richness of the randomized community is the same. But the richness of the two descendant clades will be different - let's look at our focal node:

ch1, ch2 = getchildren(tree, randnode)[1:2]
randch1 = get_clade(newcomm, tree, ch1)
randch2 = get_clade(newcomm, tree, ch2)
plot(
    plot(rand_clade), plot(randch1), plot(randch2),
    layout = (1,3), size = (1000, 350), title = ["parent" "child 1" "child 2"]
)

This represents a random expectation for the species richness of the two descendant clades should be.

We can repeat this process 100 times and store the species richness of one of the clades in order to get a sampling distribution. The mean and standard deviation of this distribution can be used to assess how unexpected our empirically observed distribution is.

We will focus just on one of the descendants, child clade ch2. The pattern for the other descendant is essentially a mirror image.

using Random: rand!
function simulate_descendants(clade, tree, descendant; nsims = 99)
    rmg = matrixrandomizer(clade)
    ret = zeros(nsims + 1, nsites(clade))  # a matrix to hold the richness values from the simulations
    # the empirical richness in the first row
    ret[1, :] = richness(get_clade(clade, tree, descendant))
    for i in 2:nsims + 1
        # and simulated richness in the rest of the nsims rows
        ret[i, :] .= richness(get_clade(rand!(rmg), tree, descendant))
    end
    ret
end

sims = simulate_descendants(rand_clade, tree, ch2)
100×3716 Array{Float64,2}:
 4.0  4.0  5.0  4.0  4.0  4.0  4.0  4.0  …  4.0  6.0  8.0  8.0  4.0  6.0  8.0
 3.0  3.0  3.0  3.0  4.0  2.0  3.0  1.0     2.0  2.0  4.0  1.0  2.0  3.0  1.0
 4.0  2.0  0.0  4.0  2.0  3.0  2.0  3.0     0.0  2.0  5.0  3.0  2.0  2.0  4.0
 1.0  2.0  3.0  1.0  2.0  2.0  0.0  2.0     1.0  3.0  3.0  1.0  2.0  3.0  2.0
 0.0  3.0  3.0  2.0  1.0  2.0  2.0  3.0     1.0  2.0  2.0  0.0  2.0  2.0  5.0
 2.0  2.0  2.0  2.0  2.0  1.0  3.0  1.0  …  1.0  2.0  4.0  5.0  0.0  3.0  5.0
 2.0  4.0  3.0  2.0  4.0  1.0  4.0  2.0     2.0  2.0  3.0  6.0  2.0  0.0  5.0
 2.0  2.0  1.0  1.0  1.0  1.0  2.0  3.0     2.0  1.0  4.0  2.0  2.0  3.0  6.0
 0.0  3.0  2.0  1.0  1.0  1.0  3.0  4.0     1.0  3.0  2.0  5.0  1.0  4.0  4.0
 3.0  2.0  0.0  2.0  0.0  1.0  2.0  3.0     0.0  3.0  6.0  4.0  1.0  5.0  4.0
 ⋮                        ⋮              ⋱       ⋮                        ⋮
 2.0  3.0  1.0  2.0  3.0  2.0  1.0  2.0     1.0  2.0  3.0  3.0  2.0  4.0  2.0
 2.0  2.0  1.0  3.0  0.0  2.0  2.0  2.0     2.0  2.0  2.0  2.0  0.0  2.0  6.0
 2.0  0.0  4.0  1.0  0.0  3.0  2.0  1.0     2.0  4.0  3.0  2.0  2.0  3.0  6.0
 1.0  1.0  4.0  2.0  2.0  1.0  3.0  2.0     1.0  3.0  4.0  6.0  2.0  2.0  2.0
 0.0  2.0  4.0  1.0  3.0  2.0  1.0  1.0  …  0.0  2.0  5.0  3.0  3.0  2.0  4.0
 1.0  2.0  5.0  2.0  1.0  2.0  2.0  3.0     2.0  2.0  2.0  3.0  3.0  3.0  4.0
 1.0  3.0  2.0  3.0  3.0  1.0  1.0  2.0     3.0  3.0  4.0  4.0  0.0  2.0  5.0
 2.0  1.0  3.0  3.0  2.0  2.0  2.0  1.0     2.0  3.0  4.0  4.0  2.0  1.0  6.0
 2.0  2.0  2.0  1.0  2.0  2.0  3.0  2.0     2.0  1.0  3.0  5.0  3.0  3.0  3.0

Then we calculate the mean and standard deviation across simulations and use this to express the empirical richness values as standardized effect sizes. The resulting standardized effect size for each grid cell constitutes the SOS metric of Borregaard et al. (2014).

To calculate this for our focal cell and plot it we can do:

using Statistics
function calculate_SOS(sims)
    sd = std.(eachcol(sims))
    me = mean.(eachcol(sims))
    (sims[1, :] .- me) ./ sd
end

sims = simulate_descendants(rand_clade, tree, ch1)
SOS = calculate_SOS(sims)
plot(SOS, rand_clade, clim = (-8,8), fillcolor = :RdYlBu, title = "SOS for clade $randnode)

We see a clear distinction between the two clades descending from our focal node, where one descendant is overrepresented in tropical moist forest and the other in colder regions.

The strength of divergence among the two clades is summarized by the GND value (Borregaard et al. 2014)

using StatsBase: tiedrank

function calculate_GND(sims)
  # two internal convenience functions
  logit(p) = log(p/(1-p))
  invlogit(p) = exp(p)/(1+exp(p))

  n = size(sims, 1)
  r = [tiedrank(x)[1]/(n + 1) for x in eachcol(sims)]
  p = 1 .- 2 .* abs.(r .- 0.5) .- 1/n
  α = mean(logit.(p))
  1-invlogit(α)
end

GND = calculate_GND(sims)
first(GND, 4) # hide

Putting it all together

We can use all of the above to go through the entire phylogeny and generate SOS and GND values.

First, let us create a function that calculates both metrics

function process_node(assemblage, tree, nodename; nsims = 100)
    clade = get_clade(assemblage, tree, nodename)
    children = getchildren(tree, nodename)

    if length(children) != 2 || any(x -> isleaf(tree, x) || nspecies(get_clade(assemblage, tree, x)) < 4, children)
        return (fill(NaN, nsites(clade)), NaN)
    end

    sims = simulate_descendants(clade, tree, children[1]; nsims)
    calculate_SOS(sims), calculate_GND(sims)
end

# use as
SOS, GND = process_node(tyrants, tree, randnode)
([-2.431332243434026, -2.09175660340568, -2.4058378542322436, -2.06400048343385, -2.0201275189798236, -1.8533956175485518, -2.0050251716560235, -2.1454962570494778, -1.9487144769227502, -2.291100974449801  …  -2.634051875071559, -2.9233194431337335, -2.2248222961117308, -1.9669023811084203, -2.54525467435664, -3.0876640072803037, -3.030760580303149, -2.0619783863640926, -2.6304877337720867, -2.861876303342024], 0.8801736719060362)

Final step: Applying the method to the entire tree

Finally, we can now go through every node on the tree and calculate the metrics.

This function recreates the functionality of the main Node_analysis function of the nodiv R package

using ProgressLogging
function node_based_analysis(assemblage::Assemblage, tree::AbstractTree)
   nodevec = [getnodename(tree, x) for x in traversal(tree, preorder) if !isleaf(tree, x)] #shuffle!(collect(nodenamefilter(!isleaf, tree)))
   SOSs = Matrix{Float64}(undef, nsites(tyrants), length(nodevec))
   GNDs = Vector{Float64}(undef, length(nodevec))
   @progress for (i, node) in enumerate(nodevec)
       SOSs[:,i], GNDs[i] = process_node(tyrants, tree, node)
   end
   SOSs, GNDs
end

SOSs, GNDs = node_based_analysis(tyrants, tree);
([-0.369677634540273 NaN … NaN NaN; -0.4155618940343646 NaN … NaN NaN; … ; -0.6230210280811734 NaN … NaN NaN; -0.5943756214313964 NaN … NaN NaN], [0.36249997009864054, NaN, NaN, NaN, NaN, 0.2274335316899314, NaN, NaN, 0.7744303831313963, NaN  …  NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN])

Let's visualize the GND values on the tree

plot(tree,
     showtips = false, marker_z = GNDs,
     color = cgrad(:YlOrRd, 10, categorical = true),
     markersize = 15 .* GNDs, markerstrokewidth = 0,
     size = (600, 1000), clim = (0,1)
     )

We notice that a few of the nodes stand clearly out with significant distributional changes. We can then proceed to map the SOS values for these nodes and interpret the distributional history.