Hi,
I have a bioinformatics scenario and several questions about accessing data files on cloud storage.
For example, I would like to use the sequence dataset in a Terra workspace. It seems GVCF files are available for each of the 3,202 samples, and I would like to create a multi-sample, joint-called VCF from a subset of those samples.
Below is an example command to run GATK and combine those GVCF files:
gatk CombineGVCFs \
  -R reference.fasta \
  --variant sample1.g.vcf.gz \
  --variant sample2.g.vcf.gz \
  -O cohort.g.vcf.gz
If I write a WDL to run the command above, it seems that I need to provide links to sample1.g.vcf.gz and sample2.g.vcf.gz, which would be the absolute paths of their cloud bucket locations.
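For reference, here is a rough sketch of the WDL task I have in mind (input names, runtime values, and the Docker tag are just placeholders I made up, not something I have tested):

version 1.0

task CombineGVCFs {
  input {
    File reference_fasta
    File reference_fasta_index   # passed so the .fai is localized alongside the fasta
    File reference_dict          # same for the sequence dictionary
    Array[File] gvcfs            # e.g. gs:// paths to sample1.g.vcf.gz, sample2.g.vcf.gz
    Array[File] gvcf_indexes     # matching .tbi files
    String output_name = "cohort.g.vcf.gz"
  }

  command <<<
    gatk CombineGVCFs \
      -R ~{reference_fasta} \
      ~{sep=" " prefix("--variant ", gvcfs)} \
      -O ~{output_name}
  >>>

  output {
    File combined_gvcf = output_name
  }

  runtime {
    docker: "broadinstitute/gatk:4.5.0.0"   # placeholder image tag
    memory: "8 GB"
    disks: "local-disk 100 HDD"             # placeholder disk sizing
  }
}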
My questions are the following:
- Should I use the direct link to their gs:// location?
- Can I use a Terra data table (created for the subset of samples) in combination with the WDL to accomplish the task? (Something like the example TSV sketched after this list.)
- Where is the data table actually stored? In the bucket of my workspace? If I upload my TSV directly to my bucket (using gsutil, or by transferring it from GitHub to the bucket), will it show up in the DATA tab in Terra?
- When the WDL runs GATK, will each of those g.vcf.gz files be copied over to my VM?
- To run GATK, Terra has a special environment setup where GATK is installed. My guess is that when the environment starts, the software package and the corresponding reference databases are duplicated on the VM. Is that correct? Would it be fair to say that whatever the computing environment needs, the files/software must be copied from the cloud storage bucket into the computing environment? When I configure my environment, should I use the size of those files as a guide (e.g., for disk sizing)?
- If I only have cloud credits for Google Cloud Platform, will I be able to access AnVIL data stored on the Amazon cloud, or any public datasets hosted on the Amazon cloud?
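For the data-table question above, the kind of tab-separated TSV I have in mind would look roughly like this (sample IDs and gs:// paths are placeholders, and I am only guessing at the expected column header format):

entity:sample_subset_id    gvcf                                      gvcf_index
HG00096                    gs://my-bucket/gvcfs/HG00096.g.vcf.gz     gs://my-bucket/gvcfs/HG00096.g.vcf.gz.tbi
HG00097                    gs://my-bucket/gvcfs/HG00097.g.vcf.gz     gs://my-bucket/gvcfs/HG00097.g.vcf.gz.tbi

My hope is that I could then point the workflow's gvcfs input at the corresponding column of this table for just the subset of samples I care about.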
Thanks.