AnVIL Demos: Introducing the All of Us + AnVIL Imputation Service on October 15, 2025

Topic: Introducing the All of Us and AnVIL Imputation Service

October 15, 2025 at 10:00 AM ET on Zoom

10:00 AM - 10:30 AM ET – Demo on AnVIL

Learn how to use the All of Us + AnVIL Imputation Service, a tool for increasing the number of variants in population studies. This presentation will discuss the motivation behind creating this new service, scientific validation demonstrating higher imputation accuracy across ancestries, and how to use the new imputation service portal as well as the command-line tool to impute your data.

10:30 AM - 11:00 AM ET – Q&A

We’ll open up the floor to questions about the demo presented, and will have AnVIL and Terra support on call to answer any questions about AnVIL you might have!

:pencil: Sign up: Register for AnVIL Demos

What are AnVIL Demos?

AnVIL Demos are a monthly, virtual meeting where we highlight what you can do on the NHGRI Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org/), a cloud-based computing platform for genomic data science! AnVIL Demos will start out with a 30-minute demonstration on the platform followed by open time for Q&A and user support.

The demos will highlight a range of topics, from a capability of the platform to a scientific analysis powered by AnVIL. If you’re interested in showcasing how you use AnVIL at a future AnVIL Demos session, reach out to Natalie Kucher (nkucher3@jhu.edu). After the demo, we’ll open up the floor to answer questions about the demo and to answer any general questions you might have about AnVIL.

:play_or_pause_button: Watch past AnVIL Demos recordings from our YouTube playlist!

Resources

Upcoming Events

Sign up to hear about future AnVIL Demos and announcements at http://bit.ly/anvil-mailing-list and learn about upcoming events at https://anvilproject.org/events!

Helpful Links:

Q&A

Q: Is it possible to have VCF version conversion as part of the imputation service, or is it easy enough to do separately?

A: This is not currently offered by the service. The team considered the tradeoffs of offering file conversion. Because the service is funded by NIH, conversion will remain the user’s responsibility in order to keep costs low and prioritize compute for imputation. One consideration worth highlighting is that VCF file conversion may drop data, so researchers may prefer to handle these conversions on their own.

The imputation service documentation explains how to convert between VCF file formats.

Q: Can you give some insight into the audience of the service? Is this generally done by researchers or clinicians who are running imputation for patient care? What is the scale of the imputation?

A: The team did some user research while the service was being developed. The most common follow-on uses of imputation results are genome-wide association studies (GWAS) and polygenic risk score (PRS) analyses, both of which aim to connect phenotypes (e.g., disease or drug response) to genetic makeup. The team is doing outreach to researchers and labs, focusing on an academic audience.

The service was released about 1.5 months ago, so usage data is still very early. For comparison, existing imputation services such as TOPMed typically see about 1 million samples a year for array imputation, and the Michigan Imputation Server processes about 10 million samples a year.

Q: Can anyone access the server? What permissions do people need for their data? Do you need an IRB?

A: Users only need a Terra account to access the imputation service, upload their data, and run the tool. The AoU+AnVIL Imputation Service securely stores and analyzes the user-provided samples.

Q: Do the imputation service jobs run in GCP us-central1?

A: Currently, yes. As the service grows, there may be expansion into other regions.

Q: Can you talk a little bit about your reference panels? In terms of having short read/long read dataset references? And is it a mix of sequencing technologies?

A: The All of Us and AnVIL reference panel for array imputation is made up entirely of short-read data. Some reference panels are being developed for structural variant imputation, and these will use long-read data. AoU used Illumina / DRAGEN for sequencing.

Q: What sort of population isolates do you have? For some isolates, population-specific reference panels are needed to get relevant and accurate imputation.

A: Users can review which population-specific reference panels are provided as part of the All of Us and AnVIL imputation service. As long as your input sample population is represented in the overall panel, the service can impute based on that specific population.

Q: What is required for adding more references to the reference panel? Will there be future periodic updates to the reference panel?

A: Preparing a new or updated reference panel would require re-compute across the entire reference panel, so it takes a large amount of time and resources. The next version of the AoU-AnVIL reference panel would likely come the next time All of Us releases a large call set that could augment the imputation in a meaningful way.

Q: Could users add their data to a reference panel?

A: Not through the service, since adding data to the reference panel would require re-compute across the entire panel. Users are also not able to access the panel directly because of access restrictions on the data that make up the panel.

Q: Are there ways for users to determine which reference panel performs better for their input VCF, based on the known representation in the panel vs. their input population?

A: Users may refer to analyses put out by the imputation panel service providers. Likely the most trusted determination on performance would be to test some samples on each service and review the results. The services also use different metrics to evaluate performance, so these may be difficult to compare without your own testing.

Q: Does TOPMed use Beagle?

A: No, TOPMed uses minimac3.

Q: It’s interesting that the minimum quota cost for a submission is 500. Is this to encourage users to batch their submissions to minimize the backend infrastructure costs to operate/provide the service?

A: This is to keep costs fair for all users: a job with fewer than 500 samples costs the same to run as a job with 500 samples. The service wants to provide cost-efficient analysis that enables the greatest number of samples to be imputed. Users can still submit jobs with fewer than 500 samples, but they will be “billed” from their quota as if that job ran on 500 samples.
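
As a rough illustration of the minimum described above, here is a small Python sketch of how quota usage might add up across jobs. The function names are hypothetical, and the sketch assumes that jobs with 500 or more samples are charged one quota unit per sample, which the answer implies but does not state outright.

```python
# Minimum number of samples charged per job, per the answer above.
MIN_BILLED_SAMPLES = 500


def billed_samples(n_samples: int) -> int:
    """Quota charged for one job (hypothetical helper).

    Assumption: jobs at or above the minimum are charged one unit per sample.
    """
    if n_samples <= 0:
        raise ValueError("a job needs at least one sample")
    return max(n_samples, MIN_BILLED_SAMPLES)


def total_billed(job_sizes) -> int:
    """Total quota charged across several jobs."""
    return sum(billed_samples(n) for n in job_sizes)


# Three separate 100-sample jobs vs. one combined 300-sample job.
print(total_billed([100, 100, 100]))  # 1500
print(total_billed([300]))            # 500
```

Under these assumptions, combining small cohorts into one submission uses less quota than submitting them separately.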

Q: Does the user need to have Terra Billing set up for their Terra account? Or is basic registration sufficient to use the imputation service?

A: Users do not need a Terra Billing account to use the AoU-AnVIL imputation service. The team is developing a way for users to pay for the imputation service themselves if they wish to exceed the 2,500-sample quota. This method would be outside of the Terra billing mechanisms.

Q: Do you have estimates for average run times? Like 1 hour, 1 day, 1 week, etc. assuming no errors?

A: Based on testing, running 1,000 samples or fewer typically takes less than a day. Running more than 10,000 samples takes 2-3 days, as the most time-consuming step is the indexing at the end. Currently there are no issues with queuing for the jobs that users submit, so the timing depends largely on the number of samples that a user provides.