We have an in-house pipeline for processing 16S amplicon sequencing data that’s currently written in R Markdown and run locally. We want to move the pipeline to AnVIL to process larger datasets, but we’re unsure how to modify the pipeline’s paths to the fastq files and our in-house reference database. We’ve successfully uploaded the fastq files to “Files/uploads/” in the workspace on AnVIL. Should we upload the in-house reference database to the same location? And how should we modify the paths in the pipeline? The current path to the in-house database is written as follows: “…/data/db/custom.16S.db/sears.refseq16S-v2.fa.gz”.
Many thanks in advance!
Thanks so much for your question!
The “Files” location is a good place to store files that you will need for your pipeline. You can store your reference database here too. This storage is part of your Workspace’s cloud storage.
If you are using the RStudio environment to run your pipeline, you will first need to copy these files over to the compute environment’s persistent disk. You can do that in R using the avfiles_restore() function from the AnVIL package. Once transferred, you should see the files in the “Files” pane in RStudio and can choose the appropriate path.
Note that persistent disk files are deleted when you delete your compute environment, but this will not affect the original copies of the files in your Workspace’s cloud storage.
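As a rough sketch of that copy step, assuming the AnVIL package is installed and the code runs inside the workspace’s RStudio environment: the file name “sample1.fastq.gz” and the local folder “data/fastq” are hypothetical placeholders, so substitute your own.

```r
library(AnVIL)

# Hypothetical folder on the persistent disk; adjust to your own layout.
fastq_dir <- file.path("data", "fastq")
dir.create(fastq_dir, recursive = TRUE, showWarnings = FALSE)

# Copy one file from the workspace's cloud storage to the persistent disk.
# The source path is relative to the workspace bucket.
# "sample1.fastq.gz" is a placeholder file name.
avfiles_restore("uploads/sample1.fastq.gz", destination = fastq_dir)

# Point the pipeline at the local copy instead of the old local-machine path.
fastq_path <- file.path(fastq_dir, "sample1.fastq.gz")
file.exists(fastq_path)
```

The pipeline’s path variables can then be set to locations like fastq_path on the persistent disk.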
You can see a tutorial that uses avfiles_restore() here: 11 Exercises | AnVIL Demos
Other resources that might be helpful:
Thanks so much for your quick reply. I was able to get the ‘avfiles_restore()’ command to work, but only for individual fastq files, by including the path to the source (e.g., “uploads/subfolder/.fastq file”). Is there a way to do this for an entire folder of files, such as the subfolder? I tried a few different file paths but got errors unless I included only one file. I also tried the gsutil cp command in the terminal, which it seems the ‘avfiles_restore()’ command may be running under the hood. From reviewing the AnVIL vignette, it seems that using the data tables and the ‘avtables()’ command may be a way to achieve this for the fastq files; however, I’m not sure about the database folder, which also contains numerous files. Any additional advice is greatly appreciated. Many thanks!
Would avfiles_restore(recursive=TRUE) do what you need? More options are described in the Reference Manual.
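For example, a minimal sketch using the “uploads/subfolder” path from your message, assuming the code runs in the workspace’s RStudio environment. Where the copied folder lands is an assumption based on how gsutil cp -r usually behaves, so check the “Files” pane afterwards.

```r
library(AnVIL)

# Copy the whole folder from the workspace's cloud storage;
# recursive = TRUE includes every file beneath it.
avfiles_restore("uploads/subfolder", destination = ".", recursive = TRUE)

# Assumption: the copy lands in ./subfolder, as with `gsutil cp -r`.
fastq_files <- list.files(
    "subfolder",
    pattern = "\\.fastq(\\.gz)?$",
    recursive = TRUE,
    full.names = TRUE
)
length(fastq_files)
```

The same call, pointed at the folder that holds the reference database, should copy that folder as well.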
Yes! Adding the ‘recursive=TRUE’ to the ‘avfiles_restore()’ command allowed me to upload entire folders. Thanks so much!