Working with DataFrames

Both SimpleSDMLayers.jl and GBIF.jl offer an optional integration with the DataFrames.jl package. Therefore, our previous example with the kingfisher Megaceryle alcyon could also be approached with a DataFrame-centered workflow.

We will illustrate this using the same data and producing the same figures as in the previous example. We will use GBIF.jl to build the occurrence DataFrame used throughout this example. However, you can also use a DataFrame of your own instead of one generated by GBIF.jl, as long as it holds one occurrence per row, with one column for the latitude and one column for the longitude. Beyond that, it can hold whatever information you like. Most of our functions assume by default that the coordinates are stored in columns named :latitude and :longitude (the order doesn't matter), but you can generally pass other names through keyword arguments, for example latitude = :lat and longitude = :lon, if you don't want to rename the columns (we will show you how below).
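As a minimal sketch of what such a user-supplied DataFrame could look like, the snippet below builds one by hand. The name my_occurrences, the coordinates, and the observer column are all made up for illustration; only the :latitude and :longitude column names come from the defaults described above.

```julia
using DataFrames

# Hypothetical occurrence records: one row per observation.
# The coordinates and the extra column are invented for this sketch.
my_occurrences = DataFrame(
    latitude  = [45.50, 46.81, 43.65],
    longitude = [-73.57, -71.21, -79.38],
    observer  = ["A", "B", "C"],   # any additional columns are allowed
)

# Check that the default coordinate columns are present
has_coordinates = all(in(propertynames(my_occurrences)), [:latitude, :longitude])
```

Running a check like this before passing the DataFrame around makes it obvious whether the default column names apply or whether other names have to be given.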

So let's start by getting our data:

# Load packages
using SimpleSDMLayers
using GBIF
using Plots
using Statistics
# Load DataFrames too
using DataFrames

# Load environmental data
temperature, precipitation = SimpleSDMPredictor(WorldClim, BioClim, [1,12])

# Get GBIF occurrences
kingfisher = GBIF.taxon("Megaceryle alcyon", strict=true)
kf_occurrences = occurrences(kingfisher,
                             "hasCoordinate" => "true",
                             "decimalLatitude" => (0.0, 65.0),
                             "decimalLongitude" => (-180.0, -50.0),
                             "limit" => 200)
for i in 1:4
  occurrences!(kf_occurrences)
end
@info kf_occurrences
[ Info: Loading DataFrames support for SimpleSDMLayers.jl
[ Info: Loading DataFrames support for GBIF.jl
[ Info: GBIF records: downloaded 1000 out of 100000

Once the data is loaded, we can easily convert the environmental layers to a DataFrame with the corresponding coordinates. We can do this for a single layer or for multiple layers at the same time:

# Single layer
temperature_df = DataFrame(temperature)
# Multiple layers
env_layers = [temperature, precipitation]
env_df = DataFrame(env_layers)
rename!(env_df, :x1 => :temperature, :x2 => :precipitation)
first(env_df, 5)

5 rows × 4 columns

Row │ longitude │ latitude │ temperature │ precipitation
    │ Float64   │ Float64  │ Float32?    │ Float32?
1   │ -179.917  │ -89.9167 │ -31.0171    │ 43.0
2   │ -179.917  │ -89.75   │ -30.3919    │ 40.0
3   │ -179.917  │ -89.5833 │ -33.4822    │ 40.0
4   │ -179.917  │ -89.4167 │ -33.6104    │ 37.0
5   │ -179.917  │ -89.25   │ -33.7199    │ 40.0

Note that the resulting DataFrame will include missing values for the elements set to nothing in the layers. We might want to remove those rows using filter! or dropmissing!:

dropmissing!(env_df, [:temperature, :precipitation]);
last(env_df, 5)

5 rows × 4 columns

Row │ longitude │ latitude │ temperature │ precipitation
    │ Float64   │ Float64  │ Float32     │ Float32
1   │ 179.917   │ 70.9167  │ -11.1038    │ 153.0
2   │ 179.917   │ 71.0833  │ -12.7957    │ 161.0
3   │ 179.917   │ 71.25    │ -12.8151    │ 148.0
4   │ 179.917   │ 71.4167  │ -12.3703    │ 136.0
5   │ 179.917   │ 71.5833  │ -12.3328    │ 122.0
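For comparison, here is a small sketch of the filter! route on a toy DataFrame. The name toy_df and its values are hypothetical stand-ins for env_df. One difference worth knowing: by default dropmissing! also narrows the column types so they no longer allow missing, while filter! only removes rows, so we call disallowmissing! afterwards to get the same result.

```julia
using DataFrames

# Hypothetical stand-in for env_df, with a few missing cells
toy_df = DataFrame(
    longitude     = [-179.9, -179.9, 179.9],
    latitude      = [-89.9, 10.0, 71.0],
    temperature   = [missing, 25.1, -12.3],
    precipitation = [43.0, missing, 122.0],
)

# Keep only the rows where both variables are present
filter!(row -> !ismissing(row.temperature) && !ismissing(row.precipitation), toy_df)

# filter! leaves the element types as Union{Missing, Float64};
# narrow them now that no missing values remain
disallowmissing!(toy_df)
```

Only the last row of toy_df has both values, so it is the only one kept.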

GBIF.jl allows us to convert a set of occurrences to a DataFrame just as easily:

kf_df = DataFrame(kf_occurrences)
last(kf_df, 5)

5 rows × 18 columns

Row │ key        │ name              │ dataset                                 │ published_in │ observed_in │ latitude  │ longitude │ date                │ rank      │ observer         │ license                                                 │ kingdom  │ phylum   │ class   │ order         │ family      │ genus      │ species
    │ Int64      │ Abstrac…?         │ Abstrac…?                               │ Abstrac…?    │ Abstrac…?   │ Abstrac…? │ Abstrac…? │ DateTim…?           │ Abstrac…? │ Abstrac…?        │ Abstrac…?                                               │ String?  │ String?  │ String? │ String?       │ String?     │ String?    │ String?
1   │ 3058875313 │ Megaceryle alcyon │ iNaturalist research-grade observations │ US           │ CA          │ 49.1578   │ -123.789  │ 2021-03-01T16:57:00 │ SPECIES   │ Carol McDougall  │ http://creativecommons.org/licenses/by-nc/4.0/legalcode │ Animalia │ Chordata │ Aves    │ Coraciiformes │ Alcedinidae │ Megaceryle │ Megaceryle alcyon
2   │ 3058875510 │ Megaceryle alcyon │ iNaturalist research-grade observations │ US           │ US          │ 29.5821   │ -95.4131  │ 2021-03-03T11:22:00 │ SPECIES   │ ltjeffers        │ http://creativecommons.org/licenses/by-nc/4.0/legalcode │ Animalia │ Chordata │ Aves    │ Coraciiformes │ Alcedinidae │ Megaceryle │ Megaceryle alcyon
3   │ 3058877591 │ Megaceryle alcyon │ iNaturalist research-grade observations │ US           │ US          │ 43.5766   │ -116.149  │ 2021-03-02T13:50:56 │ SPECIES   │ Lauren Studley   │ http://creativecommons.org/licenses/by-nc/4.0/legalcode │ Animalia │ Chordata │ Aves    │ Coraciiformes │ Alcedinidae │ Megaceryle │ Megaceryle alcyon
4   │ 3058885594 │ Megaceryle alcyon │ iNaturalist research-grade observations │ US           │ US          │ 37.2213   │ -121.744  │ 2021-03-03T15:51:00 │ SPECIES   │ Howard Friedman  │ http://creativecommons.org/licenses/by-nc/4.0/legalcode │ Animalia │ Chordata │ Aves    │ Coraciiformes │ Alcedinidae │ Megaceryle │ Megaceryle alcyon
5   │ 3058886482 │ Megaceryle alcyon │ iNaturalist research-grade observations │ US           │ MX          │ 17.1508   │ -96.7574  │ 2021-03-07T12:56:00 │ SPECIES   │ Manuel Grosselet │ http://creativecommons.org/licenses/by-nc/4.0/legalcode │ Animalia │ Chordata │ Aves    │ Coraciiformes │ Alcedinidae │ Megaceryle │ Megaceryle alcyon

We can then extract the temperature values for all the occurrences by indexing the layer with the DataFrame.

temperature[kf_df]
1000-element Array{Float32,1}:
 18.931166
 18.931166
 17.857887
 17.08904
  9.419209
 12.142196
 10.534634
 12.142196
 24.394215
 14.874448
  ⋮
 18.559376
 18.931166
 18.26275
 10.0455265
  9.921742
 20.455812
  9.230989
 15.448521
 20.617886
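To give an intuition for this kind of coordinate-based extraction, here is a rough sketch that uses a plain matrix in place of a layer. It is not the package's implementation: the grid, the bounds, and the helper functions cell_index and value_at are hypothetical, and we assume that rows run from south to north and columns from west to east.

```julia
# Toy grid standing in for a layer (hypothetical values).
# Assumption: rows run south to north, columns run west to east.
grid = Float32[1 2 3 4; 5 6 7 8; 9 10 11 12]
bounds = (left = -4.0, right = 4.0, bottom = 0.0, top = 3.0)

# Index of the cell containing x on an axis split into n equal cells,
# or nothing when x falls outside [lo, hi]
function cell_index(x, lo, hi, n)
    lo <= x <= hi || return nothing
    return min(n, floor(Int, (x - lo) / (hi - lo) * n) + 1)
end

# Value of the grid cell containing the point (lon, lat)
function value_at(grid, bounds, lon, lat)
    i = cell_index(lat, bounds.bottom, bounds.top, size(grid, 1))
    j = cell_index(lon, bounds.left, bounds.right, size(grid, 2))
    (isnothing(i) || isnothing(j)) && return nothing
    return grid[i, j]
end

# One value per pair of coordinates, in the same order as the input
lons = [-3.5, 0.5, 3.9]
lats = [0.2, 1.5, 2.9]
extracted = [value_at(grid, bounds, lon, lat) for (lon, lat) in zip(lons, lats)]
```

The result has one element per coordinate pair, which matches the shape of the output above: 1000 occurrences give 1000 values.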

Or we can clip the layers according to the occurrences:

temperature_clip = clip(temperature, kf_df)
precipitation_clip = clip(precipitation, kf_df)
SDM predictor → 289×738 grid with 75299 Float32-valued cells
  Latitudes	(13.083333333333334, 61.083333333333336)
  Longitudes	(-172.58333333333334, -49.75)

In case your DataFrame has different column names for the coordinates, for example :lat and :lon, you can clip it like this:

kf_df_shortnames = rename(kf_df, :latitude => :lat, :longitude => :lon)
clip(temperature, kf_df_shortnames; latitude = :lat, longitude = :lon)
SDM predictor → 289×738 grid with 75299 Float32-valued cells
  Latitudes	(13.083333333333334, 61.083333333333336)
  Longitudes	(-172.58333333333334, -49.75)
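The same keyword pattern is easy to reuse in your own helpers. The sketch below computes the bounding box of the coordinates in a DataFrame, which is roughly the information needed to clip a layer around occurrences. The function bounding_box and the toy data are hypothetical and not part of SimpleSDMLayers.jl, and the sketch assumes the coordinate columns contain no missing values.

```julia
using DataFrames

# Hypothetical helper, not part of SimpleSDMLayers.jl.
# The keywords default to the usual column names but accept any others.
function bounding_box(df; latitude = :latitude, longitude = :longitude)
    lat_min, lat_max = extrema(df[!, latitude])
    lon_min, lon_max = extrema(df[!, longitude])
    return (left = lon_min, right = lon_max, bottom = lat_min, top = lat_max)
end

# Toy coordinates stored under short column names
toy = DataFrame(lat = [17.15, 49.16, 29.58], lon = [-96.76, -123.79, -95.41])
bbox = bounding_box(toy; latitude = :lat, longitude = :lon)
```

Using default keyword values keeps the common case short while still supporting DataFrames whose columns are named differently.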

Finally, we can plot the layer values and the values at the occurrences, much as we would with any DataFrame or Array.

histogram2d(temperature_clip, precipitation_clip, c = :viridis)
scatter!(temperature_clip[kf_df], precipitation_clip[kf_df],
         lab= "", c = :white, msc = :orange)

To plot the occurrence values over space, you can use:

contour(temperature_clip, c = :alpine, title = "Temperature",
        frame = :box, fill = true)
scatter!(kf_df.longitude, kf_df.latitude,
         lab = "", c = :white, msc = :orange, ms = 2)