Q: Can you give a feel for the typical size of a dataset that a center would submit? How many participants and samples? Could you also speak about the people on the center side who are doing the preparation, training folks that use this workflow?
A: GREGoR is up to 12,000 participants now, using quarterly upload cycles (hundreds to thousands of participants added in each cycle). As folks upload the data, the center will validate. There’s a cutoff date when this is run, and after that folks upload in the next cycle.
We give similar trainings to the uploaders who are part of the GREGoR consortium. Some centers have one person who is responsible and who usually gets information from multiple folks in their center. Other centers have a team where folks are responsible for different data types.
Q: The folks who do this preparation, some do this on AnVIL and some run it locally - is this right?
A: Impression is most people are doing data preparation locally then uploading to the AnVIL bucket to run the data validation workflow presented. For PRIMED, Stephanie has done all the preparation in AnVIL. Folks have a process for the preparation locally, but come to AnVIL for the validation workflow.
Q: Do the uploaders usually have minor or major validation failures?
A: We usually see a number of validation failures in the history; it takes a few iterations before the workflow succeeds.
Q: Early on, you showed a data table that was GREGoR-specific and looked different from the data dictionary in the core AnVIL Data Model. Is the data table you showed aligned with the AnVIL requirements for data upload? In AnVIL, the data model has required fields and optional fields. Do you treat the AnVIL Data Model optional fields as required, and how do you decide which you submit or don’t submit?
A: We developed the GREGoR Data Model using the AnVIL Data Model as an inspiration. What is optional/required for the GREGoR Data Model is based on what an analyst will need to use the data effectively vs what would be nice to know if you had the information. We do encourage all data submitters to fill in optional data as much as possible, but if it’s optional it’ll pass validation even if it’s not included.
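As a rough illustration of the required vs optional distinction described above, here is a minimal Python sketch. The column names, the required flags, and the validate_rows helper are all hypothetical; they are not the actual GREGoR Data Model or the validation workflow that was presented.

```python
# Hypothetical data model: column names and required flags are illustrative
# only, not the actual GREGoR Data Model.
PARTICIPANT_MODEL = {
    "participant_id": True,    # required
    "consent_code": True,      # required
    "ancestry_detail": False,  # optional
}


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def validate_rows(rows, model):
    """Return a list of error strings; an empty list means validation passed."""
    errors = []
    for i, row in enumerate(rows, start=1):
        # Only required columns can fail; optional columns may be blank or absent.
        for column, required in model.items():
            if required and is_missing(row.get(column)):
                errors.append(f"row {i}: missing required column '{column}'")
        # Columns that are not in the model are reported as well.
        for column in row:
            if column not in model:
                errors.append(f"row {i}: unexpected column '{column}'")
    return errors


rows = [
    {"participant_id": "P1", "consent_code": "GRU", "ancestry_detail": ""},
    {"participant_id": "P2", "consent_code": ""},
]
errors = validate_rows(rows, PARTICIPANT_MODEL)
for message in errors:
    print(message)
```

In this sketch the first row passes even though its optional column is blank, while the second row fails because a required column is empty.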
In recent data release cycles, we have analyzed the completeness of the data model by center. We can recommend going back to fill in data if it’s missing for most datasets.
As for the AnVIL Data Model vs the GREGoR Data Model: the requirements for AnVIL data submission have changed over time. At the time, you had to have a certain amount of information, but it didn’t need to be named a certain way. The requirements may be different now.
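A per-center completeness check of this kind could be sketched as follows. This is an assumption about how such a summary might be computed, not the Coordinating Center's actual analysis; the "center" column, the optional column names, and the optional_completeness helper are hypothetical.

```python
from collections import defaultdict


def optional_completeness(rows, optional_columns, center_column="center"):
    """Return, per center, the fraction of optional cells that are filled in."""
    filled = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        center = row[center_column]
        for column in optional_columns:
            total[center] += 1
            value = row.get(column)
            # Count a cell as filled only if it is present and not blank.
            if value is not None and str(value).strip() != "":
                filled[center] += 1
    return {center: filled[center] / total[center] for center in total}


# Hypothetical rows from two submitting centers.
rows = [
    {"center": "A", "age_at_enrollment": "34", "ancestry_detail": "x"},
    {"center": "A", "age_at_enrollment": "", "ancestry_detail": "y"},
    {"center": "B", "age_at_enrollment": "", "ancestry_detail": ""},
]
completeness = optional_completeness(rows, ["age_at_enrollment", "ancestry_detail"])
for center, fraction in sorted(completeness.items()):
    print(f"{center}: {fraction:.0%} of optional fields filled")
```

A low fraction for a center would be the kind of signal that prompts a recommendation to go back and fill in missing data.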
A: Another thing we emphasized in GREGoR Data Model was to link the data tables. This isn’t necessarily present in base AnVIL Data Model.
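Linking tables means that an identifier in one table must point to a row in another. A minimal sketch of such a check, assuming hypothetical table and column names (not the actual GREGoR tables) and a hypothetical check_links helper:

```python
def check_links(child_rows, child_key, parent_rows, parent_key):
    """Return the sorted child key values that have no match in the parent table."""
    parent_ids = {row[parent_key] for row in parent_rows}
    orphans = {row[child_key] for row in child_rows if row[child_key] not in parent_ids}
    return sorted(orphans)


# Hypothetical tables: every sample row should reference an existing participant.
participants = [{"participant_id": "P1"}, {"participant_id": "P2"}]
samples = [
    {"sample_id": "S1", "participant_id": "P1"},
    {"sample_id": "S2", "participant_id": "P3"},  # no such participant
]
orphans = check_links(samples, "participant_id", participants, "participant_id")
print(orphans)
```

Here the second sample refers to a participant that is not in the participant table, so a validation step built this way would flag it.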
Q: Are there any data tables that are not included?
A: There are many more data types in GREGoR that weren’t part of the presentation. Each data type has its own table because each data type has different metadata for the experiment. No data types beyond the called variants tables.
Q: Are there any lessons learned from bringing people onto AnVIL?
A: It helped to have tutorial sessions on getting started with AnVIL. We relied on documentation that already exists and have links to AnVIL resources and documentation. The Coordinating Center was new to AnVIL as well. As they did this, they collected the resources they used to share with the other uploaders. At various consortium and WG meetings, they went through different tutorials with the groups.
Q: Your AnVIL Data Model’s R Package sounds great and useful. You said this can be adapted to any data model, and I could envision this being useful for other AnVIL users to streamline data organization and upload. It sounds like this was a game changer in your consortium. Is this something you wouldn’t have needed to develop if AnVIL offered it? Or is your use case pretty specialized, with periodic upload cycles that span many months and years?
A: I think it could be useful for anyone uploading data. Specifically useful when you have multiple different groups uploading data and ensuring they’re aligned to a specific format. It’s a challenge any time you’ve got multiple people who are tasked with data collection and uploading. It could be useful for someone who is doing a single upload, to ensure you’re providing data that matches the set model for your data.
Q: Have you been working with other consortia that are adapting your approach, to show them how you use this data model?
A: We haven’t yet, but if others are interested, we’d be happy to meet and help them out.
Q: I saw you have the workflow on GitHub and in Dockstore. Is the example workflow public? It may be helpful for people to see.
A: The workspace I showed is not public, but there is another workspace that has 1000 Genomes data in the PRIMED Data Model that is public. This uses the PRIMED Data Model Terra .