Q: I'm trying to put together a WDL workflow from an R script (https://support.terra.bio/hc/en-us/articles/4404673920539-Wrapping-R-scripts-in-WDL), but I'm having issues creating the Docker image based on the tutorial's image. It's also not clear how to upload the WDL file to Terra.
A: The Dockerfile location might be invalid. The file has to be named `Dockerfile` with no extension (note the name is case-sensitive). Stripping the extension can be awkward in a file browser (especially on Mac), so the easiest route is to rename the file in the terminal with `mv` (e.g. `mv Dockerfile.txt Dockerfile`, if that happens to be its name). Once the Docker image is built, push it to a public registry such as Docker Hub (with the standard `docker build -t <user>/<image>:<tag> .` and `docker push <user>/<image>:<tag>` commands). Terra can then pull the image from the registry using a path your WDL file cites; in Docker Hub, that path is shown under the Tags tab. Finally, in the tutorial's example WDL, replace the `sum_docker` string with your image's name.
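As a minimal sketch of what the task ends up looking like (the script name and image path here are hypothetical placeholders for your own):

```wdl
version 1.0

task sum_task {
  input {
    File input_file
  }
  command <<<
    Rscript sum.R ~{input_file}
  >>>
  runtime {
    # Use the repository:tag string shown on your Docker Hub Tags tab
    docker: "your-dockerhub-username/terra-r-sum:1.0"
  }
}
```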
Q: How can you share the WDL with Terra?
A: Using these instructions, you can upload the workflow to the Broad Methods Repository: https://support.terra.bio/hc/en-us/articles/360031366091-Create-edit-and-share-a-new-workflow.
Q: Are the inputs created automatically?
A: The optional variables likely have default values in the script. You can upload your own data to the workspace and use those data as inputs to the workflow, as long as the workflow expects that input. To do so, use the gs:// file path of the files in your workspace bucket, as in the sketch below.
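As a minimal sketch (the workflow, input, and bucket names are hypothetical), a File input declared in the WDL gets its gs:// path pasted into the matching field on Terra's workflow Inputs tab:

```wdl
version 1.0

workflow sum_wf {
  input {
    # On the Terra Inputs tab, set this to an object in your workspace
    # bucket, e.g. "gs://fc-your-bucket-id/my_data.rds"
    File data_file
  }
}
```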
Q: I would like to run the R scripts in parallel. There are a hundred parameters for each script, and the runs are all independent; each task would use the same R data file and the same WDL, and each takes about 20 minutes.
A: You may want to leverage the data tables for this. Under the Tables section of your workspace, you could create a .tsv file with your sample IDs and the numbers you are using as parameters. Instead of running the workflow with inputs defined by file paths, you would run it with inputs defined by the data table, select the entity type, and Terra would launch the jobs in parallel. Alternatively, you can write a scatter into the WDL itself, which works like a parallel for loop (training-resources/WDL at main · ucsc-cgp/training-resources · GitHub); see the sketch below. Note that scatter spins up a separate VM for each shard, so it is worth estimating the cost of your tasks first.
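A minimal sketch of a scatter (every name here is hypothetical, including the script and image):

```wdl
version 1.0

workflow run_all {
  input {
    File r_data          # the shared R data file
    Array[Int] params    # one entry per independent run
  }
  # Each iteration becomes its own job, on its own VM
  scatter (p in params) {
    call run_r { input: r_data = r_data, param = p }
  }
}

task run_r {
  input {
    File r_data
    Int param
  }
  command <<<
    Rscript analysis.R ~{r_data} ~{param}
  >>>
  runtime {
    docker: "your-dockerhub-username/your-r-image:1.0"
  }
}
```

With the data-table route, by contrast, the WDL stays a single-run workflow and Terra fans out one workflow per selected row.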
Q: It is currently running on the local cluster, and about 100 jobs in parallel would take about 1 month. We want to run it in the cloud to speed up the task and to possibly run about 500,000 jobs.
A: A good strategy would be to run one job first and check what it costs, to make sure you have the resources to run the number of jobs you want. Google also offers preemptible machines, which can make this cheaper, especially if your task runs quickly and finishes before the VM gets reclaimed; Cromwell exposes this through the preemptible runtime attribute, sketched below.
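A minimal sketch (the script and image names are hypothetical); the integer is how many preemptible attempts Cromwell makes before falling back to a full-price VM:

```wdl
version 1.0

task run_r_preemptible {
  input {
    File r_data
  }
  command <<<
    Rscript analysis.R ~{r_data}
  >>>
  runtime {
    docker: "your-dockerhub-username/your-r-image:1.0"
    preemptible: 3   # try up to 3 preemptible VMs before using a standard one
  }
}
```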
You may also want to consider whether to provision a low- or high-resource instance; the best recommendation here may be to parallelize within R on a single larger machine (see: AnVIL Office Hours 16JUN2022 @ 11 AM ET - #8 by Martin_Morgan).