KneadData Tutorial with BiobakeryUtils.jl
- This tutorial is meant to be run in parallel with / mirror the official KneadData tutorial
- If you have questions about KneadData itself, please direct them to the bioBakery help forum
- If you have questions about using the KneadData tools in julia, please open an issue, or start a discussion over on Microbiome.jl!
- For a function / type reference, jump to the bottom
Installation and setup
If you haven't already, check out the "Getting Started" page to install julia, create an environment, install BiobakeryUtils.jl, and hook up or install the KneadData command line tools.
This tutorial assumes:
- You are running julia v1.6 or greater
- You have activated a julia Project that has BiobakeryUtils.jl installed
- The kneaddata python package is installed, and accessible from your PATH.
If any of those things aren't true, or you don't know if they're true, go back to "Getting Started" to see if you skipped a step. If you're still confused, please ask (see 3rd bullet point at the top)!
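The three assumptions above can be checked from the julia REPL before going further. This is a minimal sketch using only functions from Base; the name checks and the printed labels are illustrative, and finding kneaddata on the PATH does not by itself guarantee that the tool runs correctly.

```julia
# Each entry pairs one assumption from the list above with a Bool.
checks = [
    "julia v1.6 or greater" => VERSION >= v"1.6",
    # find_package returns `nothing` when the package is not visible
    # from the currently active project.
    "BiobakeryUtils.jl available in the active project" =>
        Base.find_package("BiobakeryUtils") !== nothing,
    # Sys.which returns `nothing` when no executable of that name is on PATH.
    "kneaddata found on PATH" => Sys.which("kneaddata") !== nothing,
]

for (label, ok) in checks
    println(ok ? "[ok]      " : "[missing] ", label)
end
```

Any line marked [missing] points at the step in "Getting Started" that is worth revisiting.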
Contamination databases
By default, kneaddata will only trim reads based on quality scores. If you would also like to remove contaminating sequences (e.g., human or mouse DNA reads), you'll need to download a reference database for them.
BiobakeryUtils.kneaddata_database - Function
kneaddata_database(db, kind, path)
See kneaddata_database --help
kneaddata_database("human_genome", "bowtie2", "/some/database/dir/")
To see what databases are available, you need to use the command line: kneaddata_database --available.
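If you would rather stay in the julia REPL, the command line listing can be captured by shelling out. The sketch below assumes only that the kneaddata_database executable mentioned above is on your PATH; the helper name available_kneaddata_databases is hypothetical and is not part of BiobakeryUtils.jl.

```julia
# Hypothetical helper (not part of BiobakeryUtils.jl): return the text printed
# by `kneaddata_database --available`, or `nothing` if the tool is not on PATH.
function available_kneaddata_databases()
    exe = Sys.which("kneaddata_database")
    exe === nothing && return nothing
    # `read` on a command runs it and collects its standard output.
    return read(`$exe --available`, String)
end

listing = available_kneaddata_databases()
if listing === nothing
    @warn "kneaddata_database was not found on PATH"
else
    print(listing)
end
```

The database and build names in that listing are the values to pass as the db and kind arguments of kneaddata_database.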
Demo files
The demo files for the kneaddata tutorial can be found in this package's test folder, which you can find with
julia> demo = abspath(joinpath(dirname(Base.find_package("BiobakeryUtils")), "..", "test", "files", "kneaddata"));
julia> readdir(demo)
10-element Vector{String}:
"SE_extra.fastq"
"demo_db.1.bt2"
"demo_db.2.bt2"
"demo_db.3.bt2"
"demo_db.4.bt2"
"demo_db.rev.1.bt2"
"demo_db.rev.2.bt2"
"seq1.fastq"
"seq2.fastq"
"singleEnd.fastq"
Running on single-end sequencing data
You can run the kneaddata command line tool with the kneaddata() function from BiobakeryUtils.jl:
julia> kneaddata(joinpath(demo, "singleEnd.fastq"), "kneaddataOutputSingleEnd"; reference_db=joinpath(demo, "demo_db"))
┌ Info: Running command: kneaddata -i /home/kevin/.julia/dev/BiobakeryUtils/test/files/kneaddata/singleEnd.fastq -o
│ kneaddataOutputSingleEnd --trimmomatic /home/kevin/.julia/conda/3/envs/BiobakeryUtils/share/trimmomatic -db
└ /home/kevin/.julia/dev/BiobakeryUtils/test/files/kneaddata/demo_db
Reformatting file sequence identifiers ...
Initial number of reads ( /tmp/jl_JXPuAs/kneaddataOutputSingleEnd/reformatted_identifiersjlcp_ry6_singleEnd ): 16902.0
# ... etc
Running on paired-end sequencing data
To run on paired end data, simply pass an array of file paths to the input argument.
julia> kneaddata([joinpath(demo, "seq1.fastq"), joinpath(demo, "seq2.fastq")],
"kneaddataOutputPairedEnd"; reference_db=joinpath(demo, "demo_db"))
┌ Info: Running command: kneaddata -i /home/kevin/.julia/dev/BiobakeryUtils/test/files/kneaddata/seq1.fastq -i
│ /home/kevin/.julia/dev/BiobakeryUtils/test/files/kneaddata/seq2.fastq -o kneaddataOutputPairedEnd --trimmomatic
│ /home/kevin/.julia/conda/3/envs/BiobakeryUtils/share/trimmomatic -db
└ /home/kevin/.julia/dev/BiobakeryUtils/test/files/kneaddata/demo_db
Initial number of reads ( /home/kevin/.julia/dev/BiobakeryUtils/test/files/kneaddata/seq1.fastq ): 42473.0
Initial number of reads ( /home/kevin/.julia/dev/BiobakeryUtils/test/files/kneaddata/seq2.fastq ): 42473.0
Running Trimmomatic ...
Total reads after trimming ( /tmp/jl_JXPuAs/kneaddataOutputPairedEnd/seq1_kneaddata.trimmed.1.fastq ): 35341.0
Total reads after trimming ( /tmp/jl_JXPuAs/kneaddataOutputPairedEnd/seq1_kneaddata.trimmed.2.fastq ): 35341.0
Total reads after trimming ( /tmp/jl_JXPuAs/kneaddataOutputPairedEnd/seq1_kneaddata.trimmed.single.1.fastq ): 5385.0
Total reads after trimming ( /tmp/jl_JXPuAs/kneaddataOutputPairedEnd/seq1_kneaddata.trimmed.single.2.fastq ): 847.0
Changing Defaults
To change any of the default options, pass them as keyword arguments to the kneaddata() function.
For example, to set the maximum memory utilization to 200 MB, add max_memory="200m" to the function call.
API Reference
BiobakeryUtils.kneaddata - Function
kneaddata(inputfile, outputfile; kwargs...)
Run kneaddata command line tool on inputfile, creating outputfile. Requires kneaddata to be installed and accessible in the PATH (see Getting Started).
kneaddata options can be passed via keyword arguments. For example, if on the command line you would run:
$ kneaddata -i some.fastq.gz -o test --n 8 --bypass-trim
using this function, you would write:
kneaddata("some.fastq.gz", "test"; n = 8, bypass_trim=true)
To pass multiple databases, pass an array of paths to the reference_db argument.
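The correspondence between keyword arguments and command line flags described above can be illustrated with a small stand-alone function. This is a sketch of the idea, not the code BiobakeryUtils.jl actually uses: the name sketch_cli_flags is hypothetical, and the rules it applies (underscores become hyphens, true adds a bare flag, an array repeats the flag once per element) are assumptions inferred from the examples on this page.

```julia
# Hypothetical illustration, NOT the BiobakeryUtils.jl implementation:
# turn keyword arguments into a flat vector of command line tokens.
function sketch_cli_flags(; kwargs...)
    tokens = String[]
    for (key, value) in kwargs
        # bypass_trim -> --bypass-trim
        flag = "--" * replace(string(key), "_" => "-")
        if value isa Bool
            # `true` adds the bare flag, `false` adds nothing
            value && push!(tokens, flag)
        elseif value isa AbstractVector
            # assumed behavior: one flag/value pair per element
            for item in value
                push!(tokens, flag, string(item))
            end
        else
            push!(tokens, flag, string(value))
        end
    end
    return tokens
end

sketch_cli_flags(n = 8, bypass_trim = true)   # ["--n", "8", "--bypass-trim"]
sketch_cli_flags(max_memory = "200m")         # ["--max-memory", "200m"]
sketch_cli_flags(reference_db = ["db1", "db2"])
# ["--reference-db", "db1", "--reference-db", "db2"]
```

Keyword order is preserved, so the tokens appear in the same order as the arguments were written.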
Conda installations of trimmomatic (a dependency of kneaddata) don't work properly out of the box. If you installed kneaddata using command line conda (instead of Conda.jl), pass trimmomatic = "/path/to/trimmomatic", where /path/to/trimmomatic is something like /home/username/miniconda3/envs/biobakery3/share/trimmomatic. If you used BiobakeryUtils.install_deps(), you don't need to worry about this.