Hi,
I have a bioinformatics scenario and several questions about accessing data files on cloud storage.
For example, I would like to use the sequence dataset in a Terra workspace. It seems GVCF files are available for each of the 3,202 samples, and I would like to create a multi-sample, joint-called VCF from a subset of those samples.
Below is an example command to run GATK and combine those GVCF files:
gatk CombineGVCFs \
  -R reference.fasta \
  --variant sample1.g.vcf.gz \
  --variant sample2.g.vcf.gz \
  -O cohort.g.vcf.gz
If I write a WDL to run the command above, it seems that I need to provide links to sample1.g.vcf.gz and sample2.g.vcf.gz, which would be the absolute paths of their cloud bucket locations.
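For reference, here is a rough sketch of the WDL task I have in mind (input names, runtime values, and the Docker tag are just placeholders I made up, not something I have tested):

version 1.0

task CombineGVCFs {
  input {
    File reference_fasta
    File reference_fasta_index   # passed so the .fai is localized alongside the fasta
    File reference_dict          # same for the sequence dictionary
    Array[File] gvcfs            # e.g. gs:// paths to sample1.g.vcf.gz, sample2.g.vcf.gz
    Array[File] gvcf_indexes     # matching .tbi files
    String output_name = "cohort.g.vcf.gz"
  }

  command <<<
    gatk CombineGVCFs \
      -R ~{reference_fasta} \
      ~{sep=" " prefix("--variant ", gvcfs)} \
      -O ~{output_name}
  >>>

  output {
    File combined_gvcf = output_name
  }

  runtime {
    docker: "broadinstitute/gatk:4.5.0.0"   # placeholder image tag
    memory: "8 GB"
    disks: "local-disk 100 HDD"             # placeholder disk sizing
  }
}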
My questions are the following:
- Should I use the direct link to their gs:// location?
- Can I use a Terra data table (created for the subset of samples) in combination with the WDL to accomplish the task? (Something like the example TSV sketched after this list.)
- Where is the data table actually stored? In the bucket of my workspace? If I upload my TSV directly to my bucket (using gsutil, or by transferring it from GitHub to the bucket), will it show up in the DATA tab in Terra?
- When the WDL runs GATK, will each of those g.vcf.gz files be copied over to my VM?
- To run GATK, Terra has a special environment setup where GATK is installed. My guess is that when the environment starts, the software package and the corresponding reference databases are duplicated on the VM. Is that correct? Would it be fair to say that whatever the computing environment needs, the files/software must be copied from the cloud storage bucket into the computing environment? When I configure my environment, should I use the size of those files as a guide (e.g., for disk sizing)?
- If I only have cloud credits for Google Cloud Platform, will I be able to access AnVIL data stored on the Amazon cloud, or any public datasets hosted on the Amazon cloud?
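For the data-table question above, the kind of tab-separated TSV I have in mind would look roughly like this (sample IDs and gs:// paths are placeholders, and I am only guessing at the expected column header format):

entity:sample_subset_id    gvcf                                      gvcf_index
HG00096                    gs://my-bucket/gvcfs/HG00096.g.vcf.gz     gs://my-bucket/gvcfs/HG00096.g.vcf.gz.tbi
HG00097                    gs://my-bucket/gvcfs/HG00097.g.vcf.gz     gs://my-bucket/gvcfs/HG00097.g.vcf.gz.tbi

My hope is that I could then point the workflow's gvcfs input at the corresponding column of this table for just the subset of samples I care about.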
Thanks.