Q: Getting started on AnVIL and interested in uploading and accessing data. Target is mainly genomic analysis and functional genomics. Has a lot of wet lab sequencing experience, including analyzing short repeats. Interested in gaining deeper experience in programming and genomic data science.
A: The 1000 Genomes Project has sequence data for 3202 genomes and is very well characterized. Another project, MAGE, collected RNA-Seq data from these samples and can be helpful as well. MAGE data is available through Zenodo and Dropbox links and on SRA, and it is being ingested into AnVIL.
Q: Still at the beginning of analysis and working on bringing more trainees into AnVIL. Comfortable with all the resources. Initially, the accounts could not be linked to cloud computing funding. It took some work with their IT department to fix issues with Google Artifact Registry.
Currently working with Kion to monitor spending. At the last meeting we discussed planning the spending and setting up budgets. They calculated the monthly average and set budget alerts. The alerts assume a linear spending rate, but the project sometimes spends a lot more or a lot less in a given period. The monitoring relieves pressure on reporting to the funding office. Kion has been a great resource for tracking spending and providing spending reports, so they can focus on the science. We work on AnVIL, Terra, and All of Us, and Kion provides a dashboard where you can view spending on each platform and track spending at the user level.
A: Another new resource for managing costs in AnVIL can be found here: Tools to Manage Terra Costs - Terra.
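A minimal sketch of the linear-rate assumption behind the budget alerts discussed above, to show why bursty spending can look "ahead" or "behind" even when the total is on track. The function names, the 20% tolerance band, and the dollar figures are illustrative assumptions, not Kion or Terra settings.

```python
from datetime import date


def linear_expected_spend(total_budget, start, end, today):
    """Spend expected by `today` if the budget were used at a constant daily rate."""
    total_days = (end - start).days
    # Clamp so dates before the start or after the end stay within the budget period.
    elapsed = min(max((today - start).days, 0), total_days)
    return total_budget * elapsed / total_days


def pace_status(actual_spend, expected_spend, tolerance=0.2):
    """Classify actual spend relative to the linear expectation.

    `tolerance` is a hypothetical band (20% by default), not a platform setting.
    """
    if expected_spend == 0:
        return "on pace" if actual_spend == 0 else "ahead"
    ratio = actual_spend / expected_spend
    if ratio > 1 + tolerance:
        return "ahead"
    if ratio < 1 - tolerance:
        return "behind"
    return "on pace"


# Illustrative numbers only: a 12,000 budget spread over one calendar year.
start, end = date(2025, 1, 1), date(2026, 1, 1)
expected = linear_expected_spend(12_000, start, end, date(2025, 7, 1))
print(round(expected, 2), pace_status(7_500, expected))
```

A project that runs one large analysis early would be flagged as "ahead" under this model even if little further spending is planned, which is the mismatch noted in the discussion.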
Q: Recently applied for an opportunity in the NHGRI Office of Data Science Initiative, looking for more trainees to join.
Q: Want to know the sequence of operations when using All of Us VariantDataset (VDS) files. In a workflow it is often much simpler to use VCF files than VDS, and for genome-wide work VCF files come with many accompanying tools. In the Hail forums, users post about filtering the intervals first. New data formats bring complications because little information is available on how to use them. It would be helpful to have a demo on working with All of Us VDS files. There are challenges to using Hail, including understanding how to localize files and mount the resources. Hail uses Spark, so its jobs are already distributed and don’t necessarily fit well as a WDL. Hail seems to be adding new features while the documentation lags behind. There is a tradeoff between waiting for Hail advances to mature and documentation to catch up versus moving forward with existing VCF pipelines and moving them to the cloud. There were also issues accessing large numbers of files (5000+): alerts reported that the cloud traffic was too high and the access could not be completed. The workaround is to copy sections to the environment. Want to use WDL, but hit a bottleneck.
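One way to picture the "copy sections to the environment" workaround is to split the file list into smaller batches and copy one batch at a time. The sketch below only builds the `gsutil -m cp` command lists. The bucket name, file naming, and batch size of 500 are hypothetical, and batching is not guaranteed to avoid the traffic alert described above.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` entries each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(items), size):
        yield list(items[i:i + size])


def copy_commands(uris: Sequence[str], dest_dir: str, batch_size: int = 500) -> List[List[str]]:
    """Build one `gsutil -m cp <sources...> <dest_dir>` command per batch."""
    return [["gsutil", "-m", "cp", *batch, dest_dir] for batch in batches(uris, batch_size)]


# Hypothetical bucket and file names, for illustration only.
uris = [f"gs://example-bucket/shard-{i:04d}.vcf.gz" for i in range(5000)]
cmds = copy_commands(uris, "./data/", batch_size=500)
print(len(cmds), "copy commands")

# To actually run the copies, each command could be passed to
# subprocess.run(cmd, check=True), one batch at a time.
```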
A: Working in the cloud, things are scattered across locations and technologies. It sounds like you’ve been able to connect with the right communities (All of Us, Hail) that are committing to the VDS file format and the underlying Hail technology.
We would love to hear your feature requests (including a Hail demo): Feature Requests - AnVIL. This can help us communicate to the developers what users need.
It would be great to have Hail attend to share an AnVIL Demo. We’ve only seen Hail used in Jupyter, not in WDLs. It would be great to hear where Hail is in 2025.
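As a starting point for the "filter the intervals first" pattern mentioned in the question, here is a minimal sketch that reads a VDS and restricts it to a few regions before any other work. It uses `hl.vds.read_vds`, `hl.parse_locus_interval`, and `hl.vds.filter_intervals`. The helper names, the example path, and the GRCh38 default are assumptions, and the exact argument handling should be checked against the Hail documentation for the installed version.

```python
from typing import Iterable, List, Tuple


def to_interval_strings(regions: Iterable[Tuple[str, int, int]]) -> List[str]:
    """Format (contig, start, end) tuples as 'contig:start-end' strings."""
    return [f"{contig}:{start}-{end}" for contig, start, end in regions]


def filter_vds_to_regions(vds_path: str, regions, reference_genome: str = "GRCh38"):
    """Read a VDS and keep only the requested intervals.

    `reference_genome` defaults to GRCh38 as an assumption about the dataset.
    """
    import hail as hl  # imported lazily so the helper above works without Hail installed

    intervals = [
        hl.parse_locus_interval(s, reference_genome=reference_genome)
        for s in to_interval_strings(regions)
    ]
    vds = hl.vds.read_vds(vds_path)
    # Filtering before densifying or exporting keeps later steps smaller.
    return hl.vds.filter_intervals(vds, intervals, keep=True)


# Example call with a hypothetical path (requires Hail and data access):
# small_vds = filter_vds_to_regions("gs://example-bucket/cohort.vds",
#                                   [("chr1", 100000, 200000)])
print(to_interval_strings([("chr1", 100000, 200000)]))
```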