AnVIL Demo: Open Discussion Forum on August 20, 2025

Join our next AnVIL Demo!

Topic: Open Discussion Forum

August 20, 2025 at 10:00 AM ET on Zoom

11:00 AM - 12:00 PM ET – Open Q&A

We’ll open up the floor to questions about the demo presented, and will have AnVIL and Terra support on call to answer any questions about AnVIL you might have!

:pencil: Sign up: Register for AnVIL Demos

What are AnVIL Demos?

AnVIL Demos are a monthly, virtual meeting where we highlight what you can do on the NHGRI Analysis, Visualization, and Informatics Lab-space (AnVIL), a cloud-based computing platform for genomic data science! AnVIL Demos will start out with a 30-minute demonstration on the platform followed by open time for Q&A and user support.

The demos will highlight a range of topics, from a capability of the platform to a scientific analysis powered by AnVIL. If you’re interested in showcasing how you use AnVIL at a future AnVIL Demos session, reach out to Natalie Kucher (nkucher3@jhu.edu). After the demo, we’ll open up the floor to answer questions about the demo and to answer any general questions you might have about AnVIL.

:play_or_pause_button: Watch past AnVIL Demos recordings from our YouTube playlist!

Resources

Upcoming Events

Sign up to hear about future AnVIL Demos and announcements at bit.ly/anvil-mailing-list and learn about upcoming events at our AnVIL Portal Events Page!

Q: With regard to dbGaP, our PI has permission to see dbGaP datasets for projects that are in progress. How can we access dbGaP datasets in AnVIL?

A: On the Profile page in AnVIL, there is a place where you can link external identities. Linking your eRA Commons account connects your credentials in AnVIL and is what allows you to access dbGaP datasets. See Requesting Data Access - AnVIL Portal.

Your Terra account uses Google Log-in with your institutional email which has been linked to a Google Identity. The external identity connection through eRA Commons is used under the hood in your Terra account to ensure you are able to view and analyze datasets which you have access to through dbGaP.

Q: Where can I find dbGaP data in AnVIL?

A: You can browse the available data in the AnVIL Data Explorer (Datasets - AnVIL Data Explorer). You can select which datasets you’d like to request access to if you don’t have access already. Once access is granted, you’ll be able to select the datasets and move them into your Terra Workspace (a new workspace or an existing one).

Q: What dbGaP data are in AnVIL?

A: Not all dbGaP datasets are in AnVIL, but a subset of NHGRI datasets is available. You can look for these data in the Data Explorer. If you want to work with data that are not in AnVIL, you would need to upload those data to AnVIL on your own.

Q: When we started on AnVIL, keeping our data in a workspace bucket and accessing it through gsutil was a new concept. Now we are leveraging workflows. Typical analyses run through FASTQ, GATK, PLINK, and other downstream steps. But it’s important to assess what you’ve done and make sure you’ve done it correctly, because it’s rare that you can run a workflow end-to-end without any issues. So it’s important to do analysis steps in stages. I had been using dsub to submit jobs and command-line tools to quickly view my datasets. Working on AnVIL, you need to stay aware that you’re working in buckets, keep track of where your data is, and stay aware of your footprint.

How many AnVIL users use BigQuery in the cloud? How do users interact with their data on AnVIL when they have a large number of datasets in AnVIL but they want to look into a small chunk of files? If there is a database, I could query a small chunk of it. Are there better tools to use?

I don’t want to restart my analysis to look into my files again. In AoU you have to start over, but it’s nice that in AnVIL you can view your files in your bucket. It’s also challenging because many analyses are browser-based.

A: Hail is likely a good community to connect with for the questions you’re asking; see Scaling variant discovery to a million genomes with the Genomic Variant Store - Terra. Hail is a bit more limited than BigQuery, but it can be helpful for working with VCF files. It may be worth asking the Hail community whether you can perform the work you’d like to do using Hail: https://discuss.hail.is

We estimate there are maybe 1-2 power users who are using BigQuery, and there are likely more Hail users than BigQuery users. We’re investigating the opportunities there internally.
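On the footprint point raised in the question: one lightweight way to see where bucket storage is going is to summarize the output of `gsutil ls -l` by top-level folder. The sketch below is only an illustration, not an AnVIL or Terra tool. It assumes the usual `gsutil ls -l` row layout (size in bytes, timestamp, object URL); the function name `footprint_by_prefix` and the bucket `gs://fc-example` are hypothetical.

```python
from collections import defaultdict


def footprint_by_prefix(listing: str, depth: int = 1) -> dict:
    """Sum object sizes in bytes per leading folder from `gsutil ls -l` text."""
    totals = defaultdict(int)
    for line in listing.splitlines():
        parts = line.strip().split(None, 2)
        # Object rows are assumed to look like: "<size>  <timestamp>  gs://bucket/path"
        if len(parts) != 3 or not parts[0].isdigit() or not parts[2].startswith("gs://"):
            continue  # skips the TOTAL line and bare folder rows
        size, url = int(parts[0]), parts[2]
        _bucket, _, key = url[len("gs://"):].partition("/")
        folders = key.split("/")[:-1][:depth]
        prefix = "/".join(folders) if folders else "(top level)"
        totals[prefix] += size
    return dict(totals)


# Hypothetical listing, made up for illustration only.
SAMPLE = """\
   100  2025-08-20T14:00:00Z  gs://fc-example/fastq/s1.fastq.gz
   250  2025-08-20T14:05:00Z  gs://fc-example/fastq/s2.fastq.gz
    40  2025-08-20T15:00:00Z  gs://fc-example/vcf/chr1/s1.vcf.gz
     5  2025-08-20T15:10:00Z  gs://fc-example/notes.txt
TOTAL: 4 objects, 395 bytes (395 B)
"""

for prefix, size in sorted(footprint_by_prefix(SAMPLE).items()):
    print(f"{prefix}\t{size} bytes")
```

To try this on a real workspace bucket, one option is to save a recursive listing with `gsutil ls -l "gs://<your-bucket>/**" > listing.txt` and pass the file contents to the function.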

Is one of the challenges in working with population data that some datasets are good but some are problematic? Yes.

Q: The challenge with Hail is that it uses a lot of resources to process VCFs, and a friction point is converting back and forth between VCF and VDS files. It’s also challenging to work with large quantities of data, encounter issues, pause the environment to ask questions, and then restart to resume the work.

A: It could be helpful to ask the tool communities about the best ways to interact with these analyses at scale.
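For quick checks on a large compressed VCF, such as confirming the header or looking at the first few records, it may not be necessary to load the whole file or start a Hail session. The sketch below streams a gzip-compressed VCF and stops after a handful of records. It is a minimal illustration using only the Python standard library; `peek_vcf` is a hypothetical helper name, and this does not replace indexed queries or Hail for real analysis.

```python
import gzip


def peek_vcf(stream, n_records=5):
    """Return (header_lines, first n_records data lines) from a gzip-compressed VCF.

    `stream` is a path or a binary file object. Reading stops as soon as
    enough records have been seen, so the rest of the file is never decompressed.
    """
    header, records = [], []
    if n_records <= 0:
        return header, records
    with gzip.open(stream, "rt") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith("#"):
                header.append(line)  # meta-information and column header lines
            else:
                records.append(line)
                if len(records) >= n_records:
                    break
    return header, records


# Possible usage (an assumption, not an official AnVIL recipe): pipe the object
# from the bucket and read it from standard input, e.g.
#   gsutil cat gs://<your-bucket>/path/to/file.vcf.gz | python peek.py
# where peek.py calls peek_vcf(sys.stdin.buffer) and prints the result.
```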