AnVIL Office Hours 28JUL2022 @ 11 AM ET

The AnVIL Outreach Working Group is hosting virtual AnVIL Office Hours on Thursday, July 28, 2022 at 11:00 am ET. These Office Hours are an opportunity for you to get your questions about working on AnVIL answered in person – whether you’re trying to set up a billing account, launch Galaxy or RStudio, find methods and featured workspaces, or something else. Members of the AnVIL team will be available to help users including PIs, analysts, and data submitters get unstuck, troubleshoot issues, and discover online resources that provide further information.

Please post your questions in this thread ahead of the session!

Register here to receive the meeting link: https://forms.gle/iuyX7BaZTX8AUVxa8.

Hi,

I have 2 questions regarding workflows.

  1. I am now a Writer on a workspace. I can run notebooks with this billing account. However, when I try to create a workflow, it says that I don’t have a billing project. Do you know how I can use the same billing account to run a workflow?

  2. I followed the “workflow quick start” (https://support.terra.bio/hc/en-us/articles/360043454592-Workflows-Quickstart-Part-1-Run-pre-configured-workflow) but I couldn’t find the “Part1_CRAM-to_BAM” workflow. I’m currently using the “cram-to-bam” workflow, but there are no options under the “Select root entity type” question.

Any guidance is appreciated.

From Terra Support 2:25 pm ET:

Your request (290775) has been received and is being reviewed by our support staff.

Thanks for writing in! Here are my answers to your two questions:

  1. In most instances, having Writer access should allow you to run a workflow without being prompted. There may be modifications to the workspace you’re accessing that prompt for a billing account, or the workflow may be trying to access data that requires one. Can you provide a screenshot of the error message you’re getting when you try to run the workflow? Could you also provide a link to the workspace where you’re seeing this issue? As a workaround, you could try cloning the workspace and then running the workflow there. That might help resolve the problem and get you working quickly.

  2. The Workflows Quickstart Part 1 - Run pre-configured workflow document is meant to be used with a copy of the Workflows Quickstart Workspace. The Part1_CRAM-to_BAM workflow is on the Workflows page of that workspace. If you haven’t already, my suggestion would be to make a clone of the Quickstart Workspace and run the workflow from there.

I hope that information was helpful. Please let me know if you have any questions.

Q: I’m trying to put together a WDL workflow from an R script (https://support.terra.bio/hc/en-us/articles/4404673920539-Wrapping-R-scripts-in-WDL), but I’m having issues creating the Docker image based on the tutorial Docker image. It’s also not clear how to upload the WDL file to Terra.

A: The Dockerfile location might be invalid. The file has to be named Dockerfile with no extension (note that the name is case sensitive). It can be difficult to strip the extension off the file in a GUI (especially on a Mac), so you can rename it in the terminal using mv. Once the Docker image is built, it must be pushed to a public repository like Docker Hub. Terra can then pull the image from the repository using a path that your WDL file cites; in Docker Hub, that path can be found in the Tags tab. Then, in the tutorial’s example WDL, the sum_docker string can be replaced with your image’s name.
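As a rough sketch (hypothetical image and script names, not the tutorial’s exact WDL), the pushed image is what the docker attribute in the task’s runtime section points to:

```wdl
version 1.0

task sum_values {
  input {
    File input_csv
  }
  command <<<
    # the R script is assumed to be baked into the image at this path
    Rscript /scripts/sum.R ~{input_csv} > output.csv
  >>>
  output {
    File results = "output.csv"
  }
  runtime {
    # Replace with the image path shown on your Docker Hub Tags tab;
    # "myuser/sum-r:1.0" is a hypothetical name standing in for sum_docker.
    docker: "myuser/sum-r:1.0"
  }
}
```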

Q: How can you share the WDL on Terra?

A: Using these instructions, you can upload the workflow to the Broad Methods Repository: https://support.terra.bio/hc/en-us/articles/360031366091-Create-edit-and-share-a-new-workflow.

Q: Are the inputs created automatically?

A: The optional variables likely have a default value in the script. You can upload your own data to the workspace and use it as workflow input, as long as the workflow expects an input of that type. To do so, use the gs:// file path for files in your workspace bucket.
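For illustration, here is a minimal WDL (hypothetical names) with a File input that can be pointed at a gs:// path in your workspace bucket, plus an input with a default value, which behaves as optional:

```wdl
version 1.0

workflow count_lines {
  input {
    # On Terra's workflow inputs page, point this at a file in your
    # workspace bucket, e.g. "gs://fc-your-bucket/mydata.csv" (hypothetical)
    File data_file
  }
  call count { input: f = data_file }
  output {
    Int line_count = count.n
  }
}

task count {
  input {
    File f
    Int header_lines = 1  # has a default, so this input behaves as optional
  }
  command <<<
    # the file is localized to the VM automatically; count non-header lines
    tail -n +~{header_lines + 1} ~{f} | wc -l
  >>>
  output {
    Int n = read_int(stdout())
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}
```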

Q: I would like to run the R scripts in parallel; there are a hundred parameters, one per script run, and they are all independent. It would be the same R data file and WDL for each task. Each task takes about 20 minutes.

A: You may want to leverage the data tables for this. Under tables, you could create a .tsv file with your sample IDs and the numbers you are using as parameters. Instead of running with inputs defined by file paths, you would run workflows with inputs defined by data table, select the root entity type, and the jobs would launch in parallel. You can also write a scatter into the WDL itself, which works like a for loop (see training-resources/WDL in the ucsc-cgp/training-resources repository on GitHub). Note that scatter spins up a separate VM for each run, so it is worth estimating the cost based on your tasks first.
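Here is a minimal scatter sketch, assuming a hypothetical R script that takes the data file and one parameter as command-line arguments:

```wdl
version 1.0

workflow run_all_params {
  input {
    File r_script
    File r_data
    Array[Float] params  # one entry per independent run
  }
  scatter (p in params) {
    call run_one {
      input:
        script = r_script,
        data = r_data,
        param = p
    }
  }
  output {
    Array[File] all_results = run_one.result
  }
}

task run_one {
  input {
    File script
    File data
    Float param
  }
  command <<<
    # assumes the script takes the data file and one parameter as arguments
    Rscript ~{script} ~{data} ~{param} > result.txt
  >>>
  output {
    File result = "result.txt"
  }
  runtime {
    docker: "rocker/r-base:4.2.1"  # example public R image
  }
}
```

Each element of params becomes its own shard (and its own VM), and the per-run outputs are gathered back into a single array.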

Q: It is currently running on the local cluster, and about 100 jobs in parallel would take about 1 month. We want to run it in the cloud to speed up the task and to possibly run about 500,000 jobs.

A: A good strategy would be to run one job first and check its cost, to make sure you have the resources to run the number of jobs you want. Google also offers preemptible machines, which might make this cheaper, especially if your task runs quickly and doesn’t get bumped by another user.

You may want to consider whether you want to provision a low- or high-resource instance. The best recommendation here may be to run in parallel in R (see: AnVIL Office Hours 16JUN2022 @ 11 AM ET - #8 by Martin_Morgan).
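For reference, machine size and preemptible behavior are both declared per task in the WDL runtime section; a minimal sketch with example values:

```wdl
version 1.0

task one_run {
  input {
    File script
    File data
  }
  command <<<
    Rscript ~{script} ~{data}
  >>>
  output {
    File log = stdout()
  }
  runtime {
    docker: "rocker/r-base:4.2.1"  # example public R image
    cpu: 1                         # a low-resource machine may suit a 20-minute task
    memory: "2 GB"
    preemptible: 3                 # retry on preemptible VMs up to 3 times before full price
  }
}
```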

Q: I have run some genomics jobs that generated a number of output tables for each participant, so in my summary tables each participant has links to participant- and analysis-specific output tables. I’d like to create an overall summary table from multiple Terra tables: for each participant, I want to pull a column from the linked table and collate everything into one table. Is there a way to do this programmatically with WDL or another Terra-specific method?

A: During the analysis, you could have configured the workflow to inject the values into a column of the data table, which is set by writing an output in the workflow configuration. Inside the WDL, a command like sed or awk can extract the value.
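A minimal sketch of that pattern (hypothetical file layout and column names), where awk pulls one value and the task output can be mapped back into the data table:

```wdl
version 1.0

task extract_summary_value {
  input {
    File participant_table
    Int column = 2  # which column to pull; hypothetical default
  }
  command <<<
    # take one column from the first data row (assumes a TSV with a header row)
    awk -F'\t' -v col=~{column} 'NR == 2 { print $col }' ~{participant_table}
  >>>
  output {
    # In the Terra workflow configuration, map this output to a data-table
    # attribute such as this.summary_value (hypothetical column name).
    String summary_value = read_string(stdout())
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}
```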

You could interactively test that this is being done correctly with RStudio or Jupyter. Using an API to pull the gs:// path to the file and then collating in a Jupyter or RStudio notebook might be the quickest way. There is a code snippet on Terra showing how to pull files into a data table. For your use case, where you will need to query into the linked file, a bit more code would need to be added.

If you’re more comfortable at the command line, the commands will all have to localize the files; you should be able to localize them all at once using gsutil, then write your own bash command to pull the specific column out of each file. This is a more manual process. Note that gsutil run from the Terra command line keeps the data in the cloud, so it does not egress to your local machine.

To get the links, you could download the data table and parse it with an R or Python function.