KneadData Tutorial with BiobakeryUtils.jl
- This tutorial is meant to be run in parallel with, and to mirror, the official KneadData tutorial
- If you have questions about KneadData itself, please direct them to the bioBakery help forum
- If you have questions about using the KneadData tools in julia, please open an issue, or start a discussion over on Microbiome.jl
- For a function / type reference, jump to the bottom
Installation and setup
If you haven't already, check out the "Getting Started" page to install julia, create an environment, install BiobakeryUtils.jl, and hook up or install the KneadData command line tools.
This tutorial assumes:
- You are running julia v1.6 or greater
- You have activated a julia Project that has BiobakeryUtils.jl installed
- The kneaddata python package is installed, and accessible from your PATH.
If any of those things aren't true, or you don't know if they're true, go back to "Getting Started" to see if you skipped a step. If you're still confused, please ask (see 3rd bullet point at the top)!
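The three assumptions above can be checked from the julia REPL. The following is a minimal sketch using only Base functions; it inspects the PATH of the current julia session, so a kneaddata installed inside a conda environment that has not been added to PATH will be reported as not found.

```julia
# Sanity checks for the assumptions listed above (Base julia only).

# julia must be v1.6 or newer.
julia_ok = VERSION >= v"1.6"

# Sys.which returns the full path of an executable found on PATH, or `nothing`.
kneaddata_path = Sys.which("kneaddata")

# Base.find_package returns `nothing` if the package cannot be found
# from the currently active project.
pkg_path = Base.find_package("BiobakeryUtils")

println("julia >= 1.6:             ", julia_ok)
println("kneaddata on PATH:        ", kneaddata_path === nothing ? "not found" : kneaddata_path)
println("BiobakeryUtils installed: ", pkg_path === nothing ? "not found" : pkg_path)
```

If any line reports "not found", "Getting Started" is the place to look.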
Contamination databases
By default, kneaddata will only trim reads based on quality scores. If you would also like to remove contaminating sequences (e.g., human or mouse DNA reads), you'll need to download a reference database of those sequences.
BiobakeryUtils.kneaddata_database - Function
kneaddata_database(db, kind, path)
See kneaddata_database --help
kneaddata_database("human_genome", "bowtie2", "/some/database/dir/")
To see what databases are available, you need to use the command line: kneaddata_database --available.
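If you would rather stay inside julia, the same command line call can be issued with a backtick command. This is a small sketch that assumes the kneaddata_database executable is on the PATH of your julia session; it only wraps the kneaddata_database --available command mentioned above and is not a BiobakeryUtils.jl function.

```julia
# Check whether the CLI is reachable before trying to run it.
have_cli = Sys.which("kneaddata_database") !== nothing

if have_cli
    # Prints the list of downloadable databases to the terminal.
    run(`kneaddata_database --available`)
else
    @warn "kneaddata_database was not found on PATH; see \"Getting Started\""
end
```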
Demo files
The demo files for the kneaddata tutorial can be found in this package's test folder, which you can find with
julia> demo = abspath(joinpath(dirname(Base.find_package("BiobakeryUtils")), "..", "test", "files", "kneaddata"));
julia> readdir(demo)
10-element Vector{String}:
"SE_extra.fastq"
"demo_db.1.bt2"
"demo_db.2.bt2"
"demo_db.3.bt2"
"demo_db.4.bt2"
"demo_db.rev.1.bt2"
"demo_db.rev.2.bt2"
"seq1.fastq"
"seq2.fastq"
"singleEnd.fastq"
Running on single-end sequencing data
You can run the kneaddata command line tool using the kneaddata() function from BiobakeryUtils.jl
julia> kneaddata(joinpath(demo, "singleEnd.fastq"), "kneaddataOutputSingleEnd"; reference_db=joinpath(demo, "demo_db"))
┌ Info: Running command: kneaddata -i /home/kevin/.julia/dev/BiobakeryUtils/test/files/kneaddata/singleEnd.fastq -o
│ kneaddataOutputSingleEnd --trimmomatic /home/kevin/.julia/conda/3/envs/BiobakeryUtils/share/trimmomatic -db
└ /home/kevin/.julia/dev/BiobakeryUtils/test/files/kneaddata/demo_db
Reformatting file sequence identifiers ...
Initial number of reads ( /tmp/jl_JXPuAs/kneaddataOutputSingleEnd/reformatted_identifiersjlcp_ry6_singleEnd ): 16902.0
# ... etc
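To sanity-check the read counts that kneaddata reports, you can count the records in a FASTQ file yourself. The helper below is a hypothetical illustration (count_fastq_reads is not part of BiobakeryUtils.jl) and assumes an uncompressed FASTQ file with exactly four lines per record.

```julia
"""
Count records in an uncompressed FASTQ file,
assuming exactly four lines per record.
"""
function count_fastq_reads(path::AbstractString)
    nlines = countlines(path)
    # A line count that is not divisible by 4 suggests a truncated
    # or multi-line FASTQ file, which this sketch does not handle.
    nlines % 4 == 0 || @warn "line count is not a multiple of 4" path nlines
    return nlines ÷ 4
end

# Usage with the demo files from this tutorial (`demo` is defined above):
# count_fastq_reads(joinpath(demo, "singleEnd.fastq"))
```

For the demo input, the result should agree with the "Initial number of reads" line in the log, provided the file follows the four-line layout.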
Running on paired-end sequencing data
To run on paired-end data, pass an array of file paths to the input argument.
julia> kneaddata([joinpath(demo, "seq1.fastq"), joinpath(demo, "seq2.fastq")],
"kneaddataOutputPairedEnd"; reference_db=joinpath(demo, "demo_db"))
┌ Info: Running command: kneaddata -i /home/kevin/.julia/dev/BiobakeryUtils/test/files/kneaddata/seq1.fastq -i
│ /home/kevin/.julia/dev/BiobakeryUtils/test/files/kneaddata/seq2.fastq -o kneaddataOutputPairedEnd --trimmomatic
│ /home/kevin/.julia/conda/3/envs/BiobakeryUtils/share/trimmomatic -db
└ /home/kevin/.julia/dev/BiobakeryUtils/test/files/kneaddata/demo_db
Initial number of reads ( /home/kevin/.julia/dev/BiobakeryUtils/test/files/kneaddata/seq1.fastq ): 42473.0
Initial number of reads ( /home/kevin/.julia/dev/BiobakeryUtils/test/files/kneaddata/seq2.fastq ): 42473.0
Running Trimmomatic ...
Total reads after trimming ( /tmp/jl_JXPuAs/kneaddataOutputPairedEnd/seq1_kneaddata.trimmed.1.fastq ): 35341.0
Total reads after trimming ( /tmp/jl_JXPuAs/kneaddataOutputPairedEnd/seq1_kneaddata.trimmed.2.fastq ): 35341.0
Total reads after trimming ( /tmp/jl_JXPuAs/kneaddataOutputPairedEnd/seq1_kneaddata.trimmed.single.1.fastq ): 5385.0
Total reads after trimming ( /tmp/jl_JXPuAs/kneaddataOutputPairedEnd/seq1_kneaddata.trimmed.single.2.fastq ): 847.0
Changing Defaults
To change any of the default options, pass them as keyword arguments to the kneaddata() function.
For example, to set maximum memory utilization to 200 MB, add max_memory="200m" to the function call.
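To make the keyword convention concrete, here is a rough sketch of how keyword arguments could be translated into command line flags. The kwargs_to_flags helper is hypothetical and is not the BiobakeryUtils.jl implementation; it assumes that underscores become hyphens, that `true` produces a bare flag, and that an array repeats the flag once per element. The logs above show reference_db being passed as -db, so the real function handles at least that keyword differently.

```julia
# Hypothetical helper, for illustration only:
# NOT how BiobakeryUtils.jl is actually implemented.
function kwargs_to_flags(; kwargs...)
    flags = String[]
    for (key, value) in kwargs
        # bypass_trim -> --bypass-trim
        flag = "--" * replace(string(key), "_" => "-")
        if value === true
            push!(flags, flag)                 # boolean switch, no value
        elseif value === false
            continue                           # switch turned off: emit nothing
        elseif value isa AbstractVector
            for v in value                     # repeat the flag per element
                push!(flags, flag, string(v))
            end
        else
            push!(flags, flag, string(value))  # ordinary flag/value pair
        end
    end
    return flags
end

flags = kwargs_to_flags(max_memory = "200m", bypass_trim = true)
println(join(flags, " "))
```

Under these assumptions, max_memory="200m" corresponds to --max-memory 200m on the command line.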
API Reference
BiobakeryUtils.kneaddata - Function
kneaddata(inputfile, outputfile; kwargs...)
Run the kneaddata command line tool on inputfile, creating outputfile. Requires kneaddata to be installed and accessible in the PATH (see Getting Started).
kneaddata options can be passed via keyword arguments. For example, if on the command line you would run:
$ kneaddata -i some.fastq.gz -o test --n 8 --bypass-trim
using this function, you would write:
kneaddata("some.fastq.gz", "test"; n = 8, bypass_trim=true)
To pass multiple databases, pass an array of paths to the reference_db argument.
Conda installations of trimmomatic (a dependency of kneaddata) don't work properly out of the box. If you have installed kneaddata using command line conda (instead of Conda.jl), use trimmomatic = "/path/to/trimmomatic", where /path/to/trimmomatic is something like /home/username/miniconda3/envs/biobakery3/share/trimmomatic. If you used BiobakeryUtils.install_deps(), you don't need to worry about this.
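As a small sketch of the trimmomatic workaround, the snippet below builds a candidate path and checks that it exists before it is used. The miniconda3 location and the biobakery3 environment name are assumptions taken from the example path above; substitute the directory of your own conda environment.

```julia
# Assumed location; replace with the path of your own conda environment.
trimmomatic_dir = joinpath(homedir(), "miniconda3", "envs", "biobakery3", "share", "trimmomatic")

if isdir(trimmomatic_dir)
    println("Found trimmomatic directory: ", trimmomatic_dir)
    # With BiobakeryUtils loaded, the path would then be passed as a keyword:
    # kneaddata("some.fastq.gz", "test"; trimmomatic = trimmomatic_dir)
else
    @warn "No trimmomatic directory at the assumed location" trimmomatic_dir
end
```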