AnVIL Demos: Data validation workflows in AnVIL on May 20, 2026

Topic: Data validation workflows in AnVIL

May 20, 2026 at 10:00 AM ET on Zoom

Slides: AnVIL demo May 2026: Data validation - Google Slides

10:00 AM - 10:30 AM ET – Demo on AnVIL

Learn how the GREGoR consortium uses an automated validation workflow to ensure a consistent, high-quality dataset that combines uploads from multiple data submitters. This demo will introduce the concept of a JSON-formatted data model and demonstrate running a validation workflow to check that submitted data follows the data model. We’ll also talk about the code behind the workflow so you can see how to adapt it for your own data submissions.

10:30 AM - 11:00 AM ET – Q&A

We’ll open up the floor to questions about the demo presented, and will have AnVIL and Terra support on call to answer any questions about AnVIL you might have!

Sign up: Register for AnVIL Demos

What are AnVIL Demos?

AnVIL Demos is a monthly virtual meeting where we highlight what you can do on the NHGRI Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org/), a cloud-based computing platform for genomic data science! Each session starts with a 30-minute demonstration on the platform, followed by open time for Q&A and user support.

The demos will highlight a range of topics, from a capability of the platform to a scientific analysis powered by AnVIL. If you’re interested in showcasing how you use AnVIL at a future AnVIL Demos session, reach out to Natalie Kucher (nkucher3 / at / jhu.edu). After the demo, we’ll open up the floor to answer questions about the demo and to answer any general questions you might have about AnVIL.

Watch past AnVIL Demos recordings from our YouTube playlist!

Resources

Upcoming Events

Sign up to hear about future AnVIL Demos and announcements at http://bit.ly/anvil-mailing-list and learn about upcoming events at https://anvilproject.org/events!

Q: Can you give a feel for the typical size of a dataset that a center would submit? How many participants and samples? Could you also speak about the people on the center side who are doing the preparation, training folks that use this workflow?

A: GREGoR is up to 12,000 participants now, with quarterly upload cycles (hundreds to thousands of participants added in each cycle). As centers upload their data, they validate it. Each cycle has a cutoff date, and data that misses it is uploaded in the next cycle.

We give similar trainings to the uploaders who are part of the GREGoR consortium. Some centers have one person who is responsible and who usually gets information from multiple people in the center. Other centers have a team in which different people are responsible for different data types.

Q: Some of the folks who do this preparation work on AnVIL and some run it locally, is that right?

A: Our impression is that most people do data preparation locally, then upload to the AnVIL bucket to run the data validation workflow presented. For PRIMED, Stephanie has done all the preparation in AnVIL. Folks have a local process for the preparation, but come to AnVIL for the validation workflow.

Q: Do the uploaders usually have minor or major validation failures?

A: We usually see a number of validation failures in the history. It typically takes a few iterations before the workflow succeeds.

Q: Early on, you showed a GREGoR-specific data table that looked different from the data dictionary in AnVIL, the core AnVIL Data Model. Is the data table you showed aligned with the AnVIL requirements for data upload? In AnVIL, the data model has required fields and optional fields. Do you treat the AnVIL Data Model optional fields as required, and how do you decide which you submit or don’t submit?

A: We developed the GREGoR Data Model using the AnVIL Data Model as an inspiration. What is optional/required for the GREGoR Data Model is based on what an analyst will need to use the data effectively vs what would be nice to know if you had the information. We do encourage all data submitters to fill in optional data as much as possible, but if it’s optional it’ll pass validation even if it’s not included.
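To make the required/optional distinction concrete, here is a minimal Python sketch of that kind of check. It is not the GREGoR workflow itself. The JSON layout, the table name, and the column names are all hypothetical and only illustrate the idea that a missing optional value passes while a missing required value fails.

```python
import json

# Hypothetical data model fragment; the layout and names are illustrative only.
DATA_MODEL_JSON = """
{
  "tables": [
    {
      "table": "participant",
      "columns": [
        {"column": "participant_id", "required": true},
        {"column": "consent_code", "required": true},
        {"column": "ancestry_detail", "required": false}
      ]
    }
  ]
}
"""


def validate_rows(model, table_name, rows):
    """Return a list of error strings; an empty list means the table passes."""
    table = next(t for t in model["tables"] if t["table"] == table_name)
    known = {c["column"] for c in table["columns"]}
    required = {c["column"] for c in table["columns"] if c["required"]}
    errors = []
    for i, row in enumerate(rows, start=1):
        # A required column must be present and non-empty.
        for col in sorted(required):
            if row.get(col) in (None, ""):
                errors.append(f"row {i}: missing required column '{col}'")
        # Columns that the data model does not define are also reported.
        for col in sorted(set(row) - known):
            errors.append(f"row {i}: column '{col}' is not in the data model")
    return errors


model = json.loads(DATA_MODEL_JSON)
rows = [
    # Optional value left blank: still passes.
    {"participant_id": "P1", "consent_code": "GRU", "ancestry_detail": ""},
    # Required value left blank: fails.
    {"participant_id": "P2", "consent_code": ""},
]
errors = validate_rows(model, "participant", rows)
for error in errors:
    print(error)
```

Only the second row is reported, because the blank value in the first row is in an optional column.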

In recent data release cycles, we have analyzed the completeness of the data model by center. We can recommend going back to fill in missing data if it’s missing for most datasets.

For the AnVIL Data Model vs. the GREGoR Data Model, the requirements for AnVIL data submission have changed. At the time, a submission had to include a certain amount of information, but it didn’t need to be named a certain way. The requirements may have changed since then.
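A per-center completeness summary of this kind could be sketched as follows. This is an assumption about what such an analysis might look like, not the consortium's actual code. The `center` column and the optional column names are hypothetical.

```python
from collections import defaultdict


def optional_completeness(rows, optional_columns, center_column="center"):
    """Return {center: fraction of optional cells that are filled in}."""
    filled = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        center = row[center_column]
        for col in optional_columns:
            total[center] += 1
            # Treat a missing key, None, or an empty string as "not filled in".
            if row.get(col) not in (None, ""):
                filled[center] += 1
    return {center: filled[center] / total[center] for center in total}


# Hypothetical rows; column names are illustrative only.
rows = [
    {"center": "A", "age_at_enrollment": "34", "ancestry_detail": "x"},
    {"center": "A", "age_at_enrollment": "", "ancestry_detail": "y"},
    {"center": "B", "age_at_enrollment": "", "ancestry_detail": ""},
]
completeness = optional_completeness(rows, ["age_at_enrollment", "ancestry_detail"])
for center, fraction in sorted(completeness.items()):
    print(f"{center}: {fraction:.0%} of optional fields filled")
```

A low fraction for one center would be a prompt to ask that center to go back and fill in optional fields.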

A: Another thing we emphasized in the GREGoR Data Model was linking the data tables. This isn’t necessarily present in the base AnVIL Data Model.
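Linking tables means that an ID used in one table should resolve to a row in another. A minimal sketch of such a cross-table check, with hypothetical table and column names and no claim to match the actual workflow, might look like this:

```python
def find_unlinked(child_rows, link_column, parent_rows, parent_id_column):
    """Return child rows whose link value has no matching row in the parent table."""
    parent_ids = {row[parent_id_column] for row in parent_rows}
    return [row for row in child_rows if row.get(link_column) not in parent_ids]


# Hypothetical tables; names and IDs are illustrative only.
participants = [{"participant_id": "P1"}, {"participant_id": "P2"}]
analytes = [
    {"analyte_id": "A1", "participant_id": "P1"},
    {"analyte_id": "A2", "participant_id": "P9"},  # P9 is not in the participant table
]
unlinked = find_unlinked(analytes, "participant_id", participants, "participant_id")
for row in unlinked:
    print(f"{row['analyte_id']} refers to unknown participant {row['participant_id']}")
```

A check like this catches rows that pass per-table validation but point at a participant that was never submitted.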

Q: Are there any data tables that are not included?

A: There are many more data types in GREGoR that weren’t part of the presentation. Each data type has its own table because each data type has different metadata for the experiment. No data types beyond the called variants tables.

Q: Are there any lessons learned from bringing people onto AnVIL?

A: It helped to have tutorial sessions on getting started with AnVIL. We relied on documentation that already exists, and have links to AnVIL resources and documentation. The Coordinating Center was new to AnVIL as well. As they got started, they collected the resources they used to share with the other uploaders. At various consortium and working group meetings, we went through different tutorials with the groups.

Q: Your AnVIL Data Model’s R Package sounds great and useful. You said this can be adapted to any data model, and I could envision it being useful for other AnVIL users to streamline data organization and upload. It sounds like this was a game changer in your consortium. Is this something that, if AnVIL offered it, you presumably wouldn’t have needed to develop? Or is your use case pretty specialized, with periodic upload cycles that span many months and years?

A: I think it could be useful for anyone uploading data. Specifically useful when you have multiple different groups uploading data and ensuring they’re aligned to a specific format. It’s a challenge any time you’ve got multiple people who are tasked with data collection and uploading. It could be useful for someone who is doing a single upload, to ensure you’re providing data that matches the set model for your data.

Q: Have you been working with other consortia that are adapting your approach, to show them how you use this data model?

A: We haven’t yet, but if others are interested, we’d be happy to meet and help them out.

Q: I saw you have the workflow on GitHub and in Dockstore. Is the example workflow public? It may be helpful for people to see.

A: The workspace I showed is not public, but there is another workspace that has 1000 Genomes data in PRIMED data model that is public. This uses the PRIMED Data Model Terra .