Working with microbial abundances
The primary type for working with microbial abundances is the CommunityProfile, which is a sparse matrix with MicrobiomeSamples as column indices and features (eg Taxons or GeneFunctions) as row indices.
For example, let's make a CommunityProfile with 3 samples, 5 species and 5 genera.
julia> samps = MicrobiomeSample.(["s1", "s2", "s3"])
3-element Vector{MicrobiomeSample}:
MicrobiomeSample("s1", {})
MicrobiomeSample("s2", {})
MicrobiomeSample("s3", {})
julia> taxa = [[Taxon("s$i", :species) for i in 1:5]; [Taxon("g$i", :genus) for i in 1:5]]
10-element Vector{Taxon}:
Taxon("s1", :species)
Taxon("s2", :species)
Taxon("s3", :species)
Taxon("s4", :species)
Taxon("s5", :species)
Taxon("g1", :genus)
Taxon("g2", :genus)
Taxon("g3", :genus)
Taxon("g4", :genus)
Taxon("g5", :genus)
julia> using SparseArrays
julia> mat = spzeros(10, 3); # 10 x 3 matrix filled with zeros
julia> for i in 1:10, j in 1:3
if all(isodd, (i,j))
mat[i,j] = i+j
elseif all(iseven, (i, j))
mat[i,j] = i / j
end
end
julia> mat
10×3 SparseMatrixCSC{Float64, Int64} with 15 stored entries:
2.0 ⋅ 4.0
⋅ 1.0 ⋅
4.0 ⋅ 6.0
⋅ 2.0 ⋅
6.0 ⋅ 8.0
⋅ 3.0 ⋅
8.0 ⋅ 10.0
⋅ 4.0 ⋅
10.0 ⋅ 12.0
⋅ 5.0 ⋅
julia> comm = CommunityProfile(mat, taxa, samps)
CommunityProfile{Float64, Taxon, MicrobiomeSample} with 10 features in 3 samples
Feature names:
s1, s2, s3...g4, g5
Sample names:
s1, s2, s3Accessing CommunityProfile contents
It is easy to get out the underlying abundance matrix, features, and samples using abundances, features, and samples respectively:
julia> abundances(comm)
10×3 SparseMatrixCSC{Float64, Int64} with 15 stored entries:
2.0 ⋅ 4.0
⋅ 1.0 ⋅
4.0 ⋅ 6.0
⋅ 2.0 ⋅
6.0 ⋅ 8.0
⋅ 3.0 ⋅
8.0 ⋅ 10.0
⋅ 4.0 ⋅
10.0 ⋅ 12.0
⋅ 5.0 ⋅
julia> features(comm)
10-element Vector{Taxon}:
Taxon("s1", :species)
Taxon("s2", :species)
Taxon("s3", :species)
Taxon("s4", :species)
Taxon("s5", :species)
Taxon("g1", :genus)
Taxon("g2", :genus)
Taxon("g3", :genus)
Taxon("g4", :genus)
Taxon("g5", :genus)
julia> samples(comm)
3-element Vector{MicrobiomeSample}:
MicrobiomeSample("s1", {})
MicrobiomeSample("s2", {})
MicrobiomeSample("s3", {})You can also just get an array of feature and sample names using featurenames and samplenames respectively.
julia> featurenames(comm)
10-element Vector{String}:
"s1"
"s2"
"s3"
"s4"
"s5"
"g1"
"g2"
"g3"
"g4"
"g5"
julia> samplenames(comm)
3-element Vector{String}:
"s1"
"s2"
"s3"Finally, you can pull out the metadata of all samples or a subset using get. The returned value is a vector of NamedTuples, which is compliant with the Tables.jl interface, so it's easy to load into other formats (like DataFrames.jl for example):
julia> get(comm)
3-element Vector{NamedTuple{(:sample,), Tuple{String}}}:
(sample = "s1",)
(sample = "s2",)
(sample = "s3",)Of course, for now, the metadata is pretty boring - jump ahead to Working with metadata to do some more fun things.
Indexing and selecting
CommunityProfiles wrap a sparse matrix, and you can access the values as you would a normal matrix. In julia, you can pull out specific values using [row, col]. So for example, to get the 3rd row, 2nd column, of matrix mat:
julia> mat = reshape(1:12, 4, 3) |> collect
4×3 Matrix{Int64}:
1 5 9
2 6 10
3 7 11
4 8 12
julia> mat[3,2]
7You can also get "slices", eg to get rows 2-4, column 1:
julia> mat[2:4, 1]
3-element Vector{Int64}:
2
3
4To get all of one dimension, you can just use a bare :
julia> mat[:, 1:2]
4×2 Matrix{Int64}:
1 5
2 6
3 7
4 8For CommunityProfiles, indexing with integer values will return the value of the matrix at that position, while indexing with slices will return a new CommunityProfile. To get the values of a matrix slice, use abundances after indexing.
julia> comm[1,3]
4.0
julia> comm[6:8,3]
CommunityProfile{Float64, Taxon, MicrobiomeSample} with 3 features in 1 samples
Feature names:
g1, g2, g3
Sample names:
s3
julia> comm[1:3,3] |> abundances
3×1 SparseMatrixCSC{Float64, Int64} with 2 stored entries:
4.0
⋅
6.0Indexing with strings and regular expressions
It is often inconvenient to find the numerical index of a particular feature or sample. Instead, you can use strings or regular expressions to get slices of a CommunityProfile, which will match on the String representation of the features or samples. As with numerical indexing, if the index returns a unique (feature, sample) pair, it will return the abundance of that pair. Otherwise, it will return a new CommunityProfile.
julia> comm["g__g1", "s1"]
0.0
julia> comm[r"[gs]1", "s1"]
CommunityProfile{Float64, Taxon, MicrobiomeSample} with 2 features in 1 samples
Feature names:
s1, g1
Sample names:
s1Working with metadata
The metadata dictionaries of MicrobiomeSamples can accessed and updated inside a CommunityProfile using insert!, delete!, set!, and unset!. either by directly acting on the underlying sample object, or using special methods that take the sample name as the second argument:
julia> set!(samples(comm)[1], :subject, "kevin")
MicrobiomeSample("s1", {:subject = "kevin"})
julia> insert!(comm, "s2", :subject, "anika")
MicrobiomeSample("s2", {:subject = "anika"})You can also retrieve all metadata associated with a table using metadata, which will return a Tables.jl-compatible vector or NamedTuples, where each "row" corresponds to one sample. All metadata fields found in any sample will be returned in every row, with the value missing in any samples that do not have that field set.
julia> get(comm)
3-element Vector{NamedTuple{(:sample, :subject)}}:
(sample = "s1", subject = "kevin")
(sample = "s2", subject = "anika")
(sample = "s3", subject = missing)And you can bulk-insert! or set! metadata by passing a similar Table-like object with the a field (:sample by default) matching sample names found in the CommunityProfile.
julia> md = [(sample="s1", subject="kevin", foo="bar"), (sample="s3", subject="annelle", foo="baz")]
2-element Vector{NamedTuple{(:sample, :subject, :foo), Tuple{String, String, String}}}:
(sample = "s1", subject = "kevin", foo = "bar")
(sample = "s3", subject = "annelle", foo = "baz")
julia> set!(comm, md)
julia> md2 = [(name="s1", other="Hello, World!"), (name="s2", other="Goodbye!")]
2-element Vector{NamedTuple{(:name, :other), Tuple{String, String}}}:
(name = "s1", other = "Hello, World!")
(name = "s2", other = "Goodbye!")
julia> insert!(comm, md2; namecol=:name)
julia> get(comm)
3-element Vector{NamedTuple{(:sample, :subject, :foo, :other)}}:
(sample = "s1", subject = "kevin", foo = "bar", other = "Hello, World!")
(sample = "s2", subject = "anika", foo = missing, other = "Goodbye!")
(sample = "s3", subject = "annelle", foo = "baz", other = missing)Importing from tabular data
The CommunityProfile type is compatible with Tables.jl, so should be able to import data from tabular sources with the help of some other packages.
For example, using the DataFrames.jl package (which you can install at the REPL with ] add DataFrames):
julia> using DataFrames, Microbiome
julia> df = DataFrame(features = ["gene$i|s__Some_species" for i in 1:5], s1 = 1:2:10, s2 = 0:2:9)
5×3 DataFrame
Row │ features s1 s2
│ String Int64 Int64
─────┼─────────────────────────────────────
1 │ gene1|s__Some_species 1 0
2 │ gene2|s__Some_species 3 2
3 │ gene3|s__Some_species 5 4
4 │ gene4|s__Some_species 7 6
5 │ gene5|s__Some_species 9 8
julia> gfs = genefunction.(df.features)
5-element Vector{GeneFunction}:
GeneFunction("gene1", Taxon("Some_species", :species))
GeneFunction("gene2", Taxon("Some_species", :species))
GeneFunction("gene3", Taxon("Some_species", :species))
GeneFunction("gene4", Taxon("Some_species", :species))
GeneFunction("gene5", Taxon("Some_species", :species))
julia> mss = MicrobiomeSample.(names(df)[2:end])
2-element Vector{MicrobiomeSample}:
MicrobiomeSample("s1", {})
MicrobiomeSample("s2", {})
julia> CommunityProfile(Matrix(df[!, 2:end]), gfs, mss)
CommunityProfile{Int64, GeneFunction, MicrobiomeSample} with 5 features in 2 samples
Feature names:
gene1, gene2, gene3, gene4, gene5
Sample names:
s1, s2Alternatively, with a CSV file, and the CSV.jl package:
julia> println.(eachline("taxprof.csv"));
features,s1,s2
s__Escherichia_coli,1,0
s__Bifidobacterium_longum,3,2
s__Prevotella_copri,5,4
julia> using CSV, CSV.Tables
julia> tbl = CSV.read("taxprof.csv", Tables.columntable);
julia> txs = taxon.(tbl[1])
3-element Vector{Taxon}:
Taxon("Escherichia_coli", :species)
Taxon("Bifidobacterium_longum", :species)
Taxon("Prevotella_copri", :species)
julia> mss = [MicrobiomeSample(string(k)) for k in keys(tbl)[2:end]]
2-element Vector{MicrobiomeSample}:
MicrobiomeSample("s1", {})
MicrobiomeSample("s2", {})
julia> mat = hcat([tbl[i] for i in 2:length(tbl)]...)
3×2 Matrix{Int64}:
1 0
3 2
5 4
julia> CommunityProfile(mat, txs, mss)
CommunityProfile{Int64, Taxon, MicrobiomeSample} with 3 features in 2 samples
Feature names:
Escherichia_coli, Bifidobacterium_longum, Prevotella_copri
Sample names:
s1, s2You may also be interested in BiobakeryUtils.jl which has convenient functions for reading in file types generated by the bioBakery suite of computational tools.
Types and Methods
Microbiome.CommunityProfile — TypeCommunityProfile{T, F, S} <: AbstractAbundanceTable{T, F, S}An AbstractAssemblage from EcoBase.jl that uses a SparseMatrixCSC under the hood.
CommunityProfiles are tables with AbstractFeature-indexed rows and AbstractSample-indexed columns. Note - we can use the name of samples and features to index.
Microbiome.samples — Functionsamples(at::AbstractAbundanceTable)Returns samples in at. To get samplenames instead, use samplenames.
samples(at::AbstractAbundanceTable, name::AbstractString)Returns sample in at with name name.
Microbiome.features — Functionfeatures(at::AbstractAbundanceTable)Returns features in at. To get featurenames instead, use featurenames.
Microbiome.samplenames — Functionsamplenames(at::AbstractAbundanceTable)Get a vector of sample names from at, equivalent to name.(samples(at))