Q: Regarding dbGaP: our PI has permission to access dbGaP datasets for projects that are in progress. How can we access those dbGaP datasets in AnVIL?
A: On your Profile page in AnVIL, there is a place to link external identities. Link your eRA Commons ID; this connects your credentials in AnVIL and is what grants access to dbGaP datasets. See Requesting Data Access - AnVIL Portal.
Your Terra account uses Google Log-in with your institutional email, which has been linked to a Google identity. Under the hood, Terra uses the eRA Commons connection to ensure you can view and analyze the datasets you have access to through dbGaP.
Q: Where can I find dbGaP data in AnVIL?
A: You can browse the available data in the AnVIL Data Explorer (Datasets - AnVIL Data Explorer). If you don't already have access, you can select which datasets you'd like to request access to. Once access is granted, you can select the datasets and move them into a Terra workspace (a new workspace or an existing one).
Q: What dbGaP data are in AnVIL?
A: Not all dbGaP datasets are in AnVIL, but a subset of NHGRI datasets is available there. You can look for these data in the Data Explorer. If you want to analyze data that are not in AnVIL, you would need to upload those data to AnVIL yourself.
–
Q: When I started on AnVIL, interacting with our data in a workspace bucket and accessing it through gsutil was a new concept. Now we are leveraging workflows. Our typical analyses run through FASTQ, GATK, PLINK, and other downstream steps. But it's important to check what you've done and make sure you've done it correctly, because it's rare to run a workflow end-to-end without any issues, so it's important to do analysis steps in stages. I had been using dsub to submit jobs and command-line tools to quickly view my datasets. Working on AnVIL, you need to stay aware that you're working in buckets, keep track of where your data is, and stay aware of your storage footprint.
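One lightweight way to keep track of that footprint is to summarize the output of `gsutil du` (which prints a byte count followed by an object URL on each line) by top-level folder in the workspace bucket. Below is a minimal stdlib-only sketch; the bucket name and object paths are invented for illustration:

```python
from collections import defaultdict

def footprint_by_prefix(du_lines):
    """Sum object sizes per top-level folder from `gsutil du`-style output.

    Each input line is assumed to look like: "<bytes>  gs://bucket/path/to/object"
    """
    totals = defaultdict(int)
    for line in du_lines:
        line = line.strip()
        if not line:
            continue
        size, url = line.split(None, 1)
        # First path component under the bucket, e.g. "submissions"
        parts = url.replace("gs://", "").split("/")
        prefix = parts[1] if len(parts) > 1 else ""
        totals[prefix] += int(size)
    return dict(totals)

# Hypothetical listing of a workspace bucket
sample = [
    "1048576  gs://my-workspace-bucket/submissions/run1/out.bam",
    "2048     gs://my-workspace-bucket/submissions/run1/out.bai",
    "512      gs://my-workspace-bucket/notebooks/qc.ipynb",
]
print(footprint_by_prefix(sample))
# {'submissions': 1050624, 'notebooks': 512}
```

In practice you would pipe `gsutil du gs://<your-workspace-bucket>` into a script like this; it is a sketch of the bookkeeping, not a Terra or gsutil feature.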
How many AnVIL users use BigQuery in the cloud? How do users interact with their data when they have a large number of datasets in AnVIL but only want to look into a small chunk of files? If there were a database, I could query a small slice of it. Are there better tools to use?
I don't want to restart my analysis just to look into my files again. In All of Us (AoU) you have to start over, but it's nice that in AnVIL you can view the files in your bucket. It's also challenging because many analyses are browser-based.
A: Hail is likely a good community to connect with for the questions you're asking: Scaling variant discovery to a million genomes with the Genomic Variant Store - Terra. Hail is more limited than BigQuery, but it is helpful for working with VCF files. It may be worth asking the Hail community whether the work you'd like to do can be done in Hail: https://discuss.hail.is
We anticipate maybe 1-2 power users using BigQuery, but there are likely more Hail users than BigQuery users. We're investigating the opportunities there internally.
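For the narrower case of peeking at a small slice of a large VCF without standing up a database or a Hail cluster, plain stream filtering is often enough. A stdlib-only sketch (the file contents, contig names, and coordinates below are invented):

```python
import io

def slice_vcf(handle, chrom, start, end):
    """Yield VCF data records on `chrom` with POS in [start, end].

    Header lines (starting with '#') are skipped; records are returned
    as lists of tab-separated fields.
    """
    for line in handle:
        if line.startswith("#"):
            continue  # meta-information and column-header lines
        fields = line.rstrip("\n").split("\t")
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield fields

# A tiny in-memory VCF standing in for a large file in a bucket
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t250\t.\tC\tT\t40\tPASS\t.\n"
    "chr2\t100\t.\tG\tA\t60\tPASS\t.\n"
)

records = list(slice_vcf(io.StringIO(vcf_text), "chr1", 50, 200))
print(len(records))  # 1 matching record: chr1 at POS 100
```

Because this streams line by line, it works on files far larger than memory; for repeated region queries at scale, indexed tools (tabix/bcftools) or Hail are the better fit.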
Q: One of the challenges of working with population data is that some datasets are good but some are problematic? A: Yes.
Q: The challenge with Hail is that it uses a lot of resources to process VCFs, and a friction point is converting back and forth between VCF and VDS files. It's also challenging to work with large quantities of data, encounter issues, pause the environment to ask questions, and then restart to resume the work.
A: It could be helpful to ask the tool communities about the best ways to interact with these analyses at scale.