Longer than normal start up times

I have noticed that creating computational environments on Terra/AnVIL is taking much longer than usual. Using the same startup script, which installs packages using BiocManager::install(), it took about 8 minutes to finish last year and now takes ~25 minutes. I’ve noticed this with a number of startup scripts installing different packages, so I don’t think it is due to a specific package in R. Do you have any thoughts on this? Thanks!

This is directly related to a change in personnel that has affected the ability of the Bioconductor team to keep producing precompiled binaries for Terra/AnVIL cloud environments. A short-term solution can be achieved by a) running the 20-minute installation process in an AnVIL RStudio cloud env, b) collecting the compiled images of all the necessary packages from the .libPaths() folder where they were installed, and making a tar.gz of these, c) placing the tarball in accessible cloud storage, d) writing a startup script that retrieves the tarball and untars it into the .libPaths() destination. The environment that runs that script will have R with all the packages installed.

We are at work on producing appropriate container images and binaries for current Bioconductor, based on the new sudo-enabled container. Once this is completed the workaround noted above will be unnecessary.

@Vincent_Carey Thanks so much for that suggestion, I think that will work great for the workshop that I am helping to run! I will try it out this week and let you know how it goes. Thanks again!

I have a tarball for your workshop because the problem was raised on an internal AnVIL Slack.

With this startup script, you get all pkgs except pryr (archived at the maintainer's request). If really needed, we can figure out a way to introduce it.

#!/usr/bin/env bash
wget https://mghp.osn.xsede.org/bir190004-bucket01/BiocMonetSoft/monetpkgs.tar.gz
tar zxvf monetpkgs.tar.gz -C /home/rstudio/R/x86_64-pc-linux-gnu-library/4.5-3.20 --overwrite

I made the tar.gz and put it in egress-free storage. Others can figure out how to put it in GCP and simplify matters. This gives an RStudio in AnVIL with defaults for bioc 3.21 in, I would say, about 2 minutes.

The package binaries in this tarball will probably not work in any system other than the current AnVIL RStudio env. We need to take care that they don’t get handed to users of other systems who may install them and then encounter incompatibilities.
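One way to lower the risk of these binaries being unpacked on the wrong system is to have the startup script refuse to extract unless the expected library folder is already present. This is only a minimal sketch: the function name `unpack_if_matching` is hypothetical, and a matching folder name is a heuristic, not proof that the binaries are compatible.

```shell
#!/usr/bin/env bash
# Hypothetical helper: unpack a package tarball only when the target
# library folder already exists, i.e. the image uses the same
# R/Bioconductor library layout the binaries were built for.
unpack_if_matching() {
  local tarball="$1" libdir="$2"
  if [ ! -d "$libdir" ]; then
    echo "library folder $libdir not found; not unpacking $tarball" >&2
    return 1
  fi
  tar zxf "$tarball" -C "$libdir"
}

# Example call (path taken from the startup script above):
# unpack_if_matching monetpkgs.tar.gz /home/rstudio/R/x86_64-pc-linux-gnu-library/4.5-3.20
```

Because the folder name encodes the R and Bioconductor versions, an image with a different version pair would skip the extraction and fall back to whatever packages it already has.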

Thanks so much for this! Do you happen to have a script you used to make the tar.gz on Terra/AnVIL? I actually need to do this for 8 different modules that use different sets of packages (some of them need me to install archived CRAN packages or very specific versions of GitHub repos as well). As for the tar.gz storage, the workshop has a template workspace that everyone clones, and in that Google Cloud Storage there’s a folder for each module that the startup scripts can pull from, so I will probably end up putting all the tar.gz files in there. Thanks again for solving this, this is a really great solution for the workshop!

This should get you most of the way there. An issue is ensuring that dependencies are satisfied. You can use the BiocPkgTools package to enumerate dependencies; installed.packages() also reports the Depends, Imports and LinkingTo fields.

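packs2zip = function(packs, outz="~/mypacks.zip", force=FALSE, Ncpus=2, update=FALSE) {
  # install the requested packages; dependencies get installed too,
  # but only names listed in packs end up in the zip
  BiocManager::install(packs, force=force, Ncpus=Ncpus, update=update)
  curd = getwd()
  on.exit(setwd(curd))  # restore the working directory on exit
  setwd(.libPaths()[1]) # idiosyncratic: assumes everything landed in the first library
  allp = dir()
  # exact name match; grep() would treat names as regexes and also pick up
  # e.g. stats4 when asked for stats
  kp = intersect(packs, allp)
  zip(outz, kp)
}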

I think for this to work you are obligated to list in “packs” all the dependencies of packages not already installed in the Rstudio cloud env. Another approach would be to identify the “new” packages (those you request plus the ones installed by BiocManager as dependencies) using date timestamps in the library folder.
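The timestamp idea can be sketched in a few lines of bash. This is an illustration only: the helper name `new_packages` and the marker file path are hypothetical, and it assumes that every package folder written by the install has a modification time later than the marker.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the top-level package folders in a library
# that were modified after a marker file was last touched.
new_packages() {
  local libdir="$1" marker="$2"
  find "$libdir" -mindepth 1 -maxdepth 1 -type d -newer "$marker" \
    -exec basename {} \; | sort
}

# Sketch of use (paths are assumptions):
# LIB=/home/rstudio/R/x86_64-pc-linux-gnu-library/4.5-3.20
# touch /tmp/before_install              # 1. record the time before installing
# (run the BiocManager::install() step)  # 2. install as usual
# new_packages "$LIB" /tmp/before_install > /tmp/newpkgs.txt
# tar -czf newpkgs.tar.gz -C "$LIB" -T /tmp/newpkgs.txt
```

The resulting tarball holds only the requested packages and the dependencies that were pulled in with them, so there is no need to list the dependencies by hand.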

@Vincent_Carey Thanks for sending that code. I was able to get this done for all 7 RStudio modules and they now run great and only take a few minutes to spin up. I did it with a couple of bash scripts and I’m sure some dependency issues could arise, but luckily none did for these modules. First I would launch an environment on Terra with a build startup script:

#!/usr/bin/env bash

R -e 'BiocManager::install(c("BiocStyle", "iNETgrate", "pryr", "enrichplot", "org.Hs.eg.db", "clusterProfiler", "dendextend", "IlluminaHumanMethylation450kanno.ilmn12.hg19"))'

cd /home/rstudio
tar -czvf MONET_Module3_Packages.tar.gz -C /home/rstudio/R/x86_64-pc-linux-gnu-library/4.5-3.20 .

gcloud storage cp MONET_Module3_Packages.tar.gz "gs://NNNNNNNNNN/module_data/module_packages/"

and then the other environments that need the packages pull the tar file with this startup script:

#!/usr/bin/env bash

gcloud storage cp "gs://NNNNNNNNNN/module_data/module_packages/MONET_Module3_Packages.tar.gz" /home/rstudio

cd /home/rstudio
tar zxvf MONET_Module3_Packages.tar.gz -C /home/rstudio/R/x86_64-pc-linux-gnu-library/4.5-3.20 --overwrite
rm MONET_Module3_Packages.tar.gz

This seems to be working, but I am definitely not a package building expert, so if anything about this approach looks fragile or is likely to fail in the future, just let me know and I can try to change it. Thanks again for all the help with this!

Great, I am glad it is working. I don’t see any obvious flaws, just be sure to test as much as you can, because things can install cleanly but fail on use when some C symbol is needed but is not found in dynamic linking. If you run into such problems let me know. With luck this whole process can be avoided once we have the improved RStudio image and container binaries running, which may happen in the next couple of weeks.
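A quick smoke test along these lines is to try loading each unpacked package in a fresh R process and report the ones that fail. The sketch below is an assumption-laden illustration: `check_packages` and the `RSCRIPT` override are hypothetical names, and a successful library() call only shows that the package and its shared objects load, not that every function works.

```shell
#!/usr/bin/env bash
# Hypothetical helper: load each named package in its own R process and
# print ok/FAILED per package. Set RSCRIPT to use a different Rscript binary.
check_packages() {
  local rscript="${RSCRIPT:-Rscript}" failed=0 pkg
  for pkg in "$@"; do
    if "$rscript" -e "library($pkg)" >/dev/null 2>&1; then
      echo "ok      $pkg"
    else
      echo "FAILED  $pkg"
      failed=1
    fi
  done
  return "$failed"
}

# Example call with a few of the workshop packages:
# check_packages BiocStyle enrichplot clusterProfiler dendextend
```

Running each package in a separate process keeps one broken shared object from hiding problems in the others, and the non-zero return code makes the check easy to wire into a startup script.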