AnVIL Office Hours 16JUN2022 @ 11 AM ET

The AnVIL Outreach Working Group is hosting virtual AnVIL Office Hours on Thursday, June 16, 2022 at 11:00 am ET. These Office Hours are an opportunity to get your questions about working on AnVIL answered live, whether you’re trying to set up a billing account, launch Galaxy or RStudio, find methods and featured workspaces, or something else. Members of the AnVIL team will be available to help users, including PIs, analysts, and data submitters, get unstuck, troubleshoot issues, and discover online resources that provide further information.

Please post your questions in this thread ahead of the session!

Register here to receive the meeting link: https://forms.gle/Yx5cXsPrTnPsJd539.

I’ve uploaded my own data (in .RData form) to my workspace. However, I’m not sure how to access the data from an R script. Also, is there a way to run R scripts other than through a Jupyter Notebook on AnVIL?

Hi @Yi-Ting_Tsai,

You can run RStudio in AnVIL in addition to Jupyter Notebooks. Here is a video on how to start an environment for RStudio: 6.2 Starting RStudio | Getting Started on AnVIL. Another useful tool is the AnVIL package (Bioconductor - AnVIL).


Thanks for writing in! I also feel that an RStudio environment might be the best solution for working with your data. Since it’s already in the Workspace Bucket, all we’ll need to do once we’ve started our RStudio Environment is run a few commands from the Terminal pane:

  1. Run bucket="$WORKSPACE_BUCKET" (this sets a terminal variable for where our data is stored, which we can refer to later as $bucket).
  2. Then run gsutil ls $bucket to see all of the data in the workspace bucket. Locate the name of your file in the output.
  3. Then run gsutil cp $bucket/yourfile.rdata ~/yourfile.rdata (change "yourfile" to the name of the file in your workspace). This will bring the data into the RStudio environment for processing.

For more information, you can read this document: How (and why) to save RStudio data to workspace storage. Please let me know if this information is helpful for you.
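If you prefer to stay inside R rather than the terminal, the Bioconductor AnVIL package mentioned above wraps the same gsutil commands. Here is a minimal sketch, assuming the package is installed and your file is named yourfile.RData (adjust the name to whatever appears in your bucket listing):

library(AnVIL)

## the workspace bucket, the same location as $WORKSPACE_BUCKET in the terminal
bucket <- avbucket()

## list the bucket contents and locate your file
gsutil_ls(bucket)

## copy the file into the RStudio environment, then load it into the R session
gsutil_cp(paste0(bucket, "/yourfile.RData"), "~/yourfile.RData")
load("~/yourfile.RData")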

Q: How can RStudio be used to split multiple for loops into parallel jobs that run independently? Each for loop writes its own output file, and each loop is independent of the others. The process is used to simulate data and currently runs on a cluster, but is taking too long. There are 505,000 sets of data to simulate, and each simulation is submitted as a separate job. The current resources requested are 33 MB, 96 CPUs, and 360 GB of memory, and each job takes about 15 minutes.
A: By default, code run in RStudio runs sequentially. You could have multiple workspaces and run RStudio in each of them to focus on running different code.

For this type of job, workflows would be best suited. Apache Spark is another interactive possibility, but workflows, which provide a batch submission system, would be the best fit. Once the workflow (WDL) is created, it can be reused. You can set up the WDL to run the R code, and the output would be consolidated in a workspace bucket that can then be explored using RStudio. RStudio does not have the capability to set up VMs and shut them down as needed.

It is helpful to review WDL tutorials, and to look at and adapt existing WDLs.

This workspace from a 2019 ASHG workshop shows an analysis using the GENESIS R package:

At a later point, there will need to be a Docker container for the workflow. Feel free to write in if you have any questions when you get there!

Q: My workflows are creating PNGs. Is there a way to display PNGs in the Terra Data tab? Currently the approach is to download the PNGs and store them in a publicly accessible folder in Google Cloud; in Google Sheets, you can then link from the bucket to display the images. The goal of the display is exploration: for each dataset, they are looking for binding motifs, and showing the image helps viewers determine whether a motif is present.
A: Currently, the Data tab doesn’t have the capability to display images. Once a workflow finishes creating the PNGs, an additional step in the workflow could generate a dashboard. A notebook could also be created to explore the images. Images can be embedded in the workspace Dashboard as well, if the purpose is for others to view the workspace and explore the images.
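As a rough sketch of the notebook / RStudio approach, you could copy an image out of the workspace bucket and render it in R. This assumes the AnVIL and png packages are available; motif_plot.png is a placeholder file name:

library(AnVIL)
library(png)
library(grid)

## copy one of the workflow-generated images from the workspace bucket
gsutil_cp(paste0(avbucket(), "/motif_plot.png"), "motif_plot.png")

## read the PNG and draw it in the current graphics device / notebook cell
img <- readPNG("motif_plot.png")
grid.raster(img)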

Q: Is there a way to export the status or results of the jobs? It also appears that some jobs failed, but it is difficult to identify exactly what happened. Is there a way to see the peak or average resource usage of the jobs in order to adjust provisioning for the next runs? We were using the free command to check memory, but this makes the log very messy.
A: There is a Job History page where you can explore the status of jobs. There is also a tool named FISS that can interact with Terra to pull submission information, though there is not a direct export tool in the UI:
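On the R side, recent versions of the Bioconductor AnVIL package can also list workflow submissions for the current workspace. This is only a sketch, assuming avworkflow_jobs() is available in your installed version:

library(AnVIL)

## summarize submissions and their statuses for the current workspace
jobs <- avworkflow_jobs()
jobs

## export the table for record keeping
write.csv(jobs, "submission_status.csv", row.names = FALSE)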

There is not currently a built-in mechanism for monitoring the resources used, though some users have built this themselves. Here is a Feature Request where you can comment to register your support; the Terra Support team will communicate the interest to the Product team:

Q: There is a GPU quota of 300, but it doesn’t seem that 300 GPUs are running at the same time.
A: If multiple people are using the quota simultaneously (for example, when working in the same workspace), you may be waiting for others to finish with the GPUs. It may also be that Google does not have enough GPUs on hand at the moment; this is more likely if you are using preemptible resources. If there is interest in examining the quota or submitting a quota increase request, feel free to post here.

Q: There is a tutorial on how to link Google Cloud billing so that the cost of each submission is reported. The tutorial mentions that the Terra-generated project cannot be used.
A: The Google project will need to be created fresh, outside of Terra. It needs to be a non-Terra-created project because Terra retains ownership of the projects it creates; a project you create yourself gives you the additional ownership privileges required.

Suppose you have a for loop that accumulates results in a vector result

result <- numeric(10)
for (i in 1:10) {
    ## do a lot of work, represented by...
    Sys.sleep(5)
    ## after the calculation add a value to the result
    result[i] <- sqrt(i)
}

Convert the body of the loop into a function that returns a value

my_fun <- function(i) {
    ## do a lot of work, represented by...
    Sys.sleep(5)
    ## return the result
    sqrt(i)
}

and instead of using a for loop, use lapply()

result <- lapply(1:10, my_fun)

So far so good, but it still takes 5 seconds x 10 tasks = 50 seconds

system.time(result <- lapply(1:10, my_fun))

Now use BiocParallel to do the computation in parallel; if you have 10 cores, then this will take just 5 seconds

library(BiocParallel)
system.time(bpresult <- bplapply(1:10, my_fun))
identical(result, bpresult)

Currently the easiest way to get a performance improvement is to simply request a machine with more ‘cores’. Also, the amount of memory needs to be enough to support all cores working simultaneously, so if one iteration of my_fun() takes 4 GB, and you request a machine with 16 cores, you would need 16 x 4 = 64 GB. Performing parallel computations like this on a single machine is much easier than mastering Spark or workflows, although in the long run these might be ‘better’ solutions.
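If you want to tie the worker count explicitly to the cores you requested, BiocParallel lets you pass a parallel back end. A small sketch (the default back end already uses the cores it detects):

library(BiocParallel)

## use 10 workers; match this to the number of cores on your cloud environment
param <- MulticoreParam(workers = 10)
system.time(bpresult <- bplapply(1:10, my_fun, BPPARAM = param))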

One ‘lesson learned’ is that R code, like any computer code, can be written in a way that is very inefficient, e.g., try this

x <- integer(); for (i in 1:1000) x <- c(x, i)
x <- integer(); for (i in 1:100000) x <- c(x, i)

This is just making a sequence 1, 2, …; the second one takes phenomenally long! But check out

x <- integer(); for (i in 1:100000) x[i] <- i

This executes almost instantly! So if you have code that takes a long time, and sort-of intuitively it seems like a modern computer should be doing much better than that, then perhaps it would pay to speak with an R ‘expert’ to see if there are obvious inefficiencies.
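For this toy example the truly idiomatic fix is, of course, to drop the loop altogether and use a vectorized constructor:

x <- seq_len(100000)    ## same sequence, no looping or copying at all

but the pre-allocation pattern above is the one that generalizes to real work.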
