Demo Workspace: Terra
Q: Will VCF to VRS variant conversion sometimes result in a difference in chromosome coordinates from the original VCF?
A: Yes, for two reasons. First, VRS uses inter-residue coordinates, so the VRS location start value is typically 1 lower than the VCF position value. Second, because of fully justified normalization, there will be places where the VRS interval is larger than the interval in the VCF.
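As a minimal sketch of the first point only (the coordinate shift, not normalization), assuming a 1-based VCF POS and a REF allele string. The function name is hypothetical and is not part of any VRS tooling:

```python
def vcf_to_inter_residue(pos: int, ref: str) -> tuple[int, int]:
    """Convert a 1-based VCF POS/REF pair to an inter-residue (start, end) interval.

    Inter-residue coordinates count the boundaries between residues, so the
    interval covering the REF allele starts at pos - 1 and spans len(ref).
    Normalization, which can widen the interval, is deliberately left out.
    """
    if pos < 1:
        raise ValueError("VCF positions are 1-based")
    start = pos - 1
    end = start + len(ref)
    return start, end


# A single-base REF at POS 100 covers the inter-residue interval (99, 100).
print(vcf_to_inter_residue(100, "A"))
# A three-base REF at the same POS covers (99, 102).
print(vcf_to_inter_residue(100, "ACG"))
```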
Q: After converting a VCF to VRS file, is there a summary of changes or differences?
A: There are no changes made to the VCF positions directly. The annotated VCF includes two new fields, one that describes the VRS Allele IDs, another that specifies the VRS coordinates.
Q: I see, so the output VCF file is the original VCF + new fields?
A: Yes. Specifically, new key-value pairs carrying this information are added to the VCF INFO field (the VRS_ID, VRS_Start, VRS_End, and VRS_State keys).
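To illustrate what reading those keys back might look like, here is a small sketch in plain Python that pulls the VRS keys out of the INFO column of one VCF data line. The key names are the ones mentioned above. The record, its values, and the helper names are made up for illustration, and a real pipeline would more likely use a VCF library:

```python
# Key names as mentioned in the answer above.
VRS_KEYS = ("VRS_ID", "VRS_Start", "VRS_End", "VRS_State")


def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    parsed = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def vrs_annotations(vcf_line: str) -> dict:
    """Return only the VRS key-value pairs from one tab-separated VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])  # INFO is the eighth VCF column
    return {key: info[key] for key in VRS_KEYS if key in info}


# Hypothetical record; the VRS_ID value is a placeholder, not a real digest.
line = "chr1\t100\t.\tA\tT\t.\tPASS\tDP=30;VRS_ID=ga4gh:VA.PLACEHOLDER;VRS_Start=99;VRS_End=100;VRS_State=T"
print(vrs_annotations(line))
```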
Q: Can you share more on possible annotation sources that are available?
A: By default, VRS makes use of SeqRepo, a software package that allows high-throughput retrieval of sequence content from different sequence assemblies, including GRCh37 and GRCh38, as well as transcript and protein sequence collections. SeqRepo is more of an underlying library, used for fully justified normalization and for computing digests, because a justified representation of a variant requires knowledge of its sequence context.
Major community resources that have adopted VRS include gnomAD and MaveDB. VRS was particularly helpful for defining the experimental sequences on a consistent basis. There are many private implementations of VRS, too. VRS computed identifiers are well suited to aggregating and indexing.
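For a sense of why computed identifiers work well for aggregation, the sketch below shows the GA4GH truncated digest (SHA-512, truncated to 24 bytes, base64url-encoded) applied to a serialized object. The digest function follows the published algorithm. The serialization step here is a simplified stand-in (sorted-key compact JSON) rather than the full VRS serialization rules, and `toy_identifier` is a hypothetical helper, so its output will not match real VRS IDs:

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """GA4GH truncated digest: SHA-512, first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def toy_identifier(obj: dict, type_prefix: str = "VA") -> str:
    """Illustrative only: digest a simplified canonical JSON form of obj."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return f"ga4gh:{type_prefix}.{sha512t24u(blob)}"


# The same content yields the same identifier regardless of key order,
# which is what lets independent resources arrive at matching IDs.
a = {"start": 99, "end": 100, "state": "T"}
b = {"state": "T", "end": 100, "start": 99}
print(toy_identifier(a) == toy_identifier(b))
```

Because the identifier depends only on the content, two resources that describe the same variant the same way can be joined on the ID without a shared registry.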
Q: Are many annotations about where the variant has been observed, or are they additional annotations of the variant itself?
A: Computed identifiers help you look consistently across resources to find prior evidence or observations of these variants. VRS as a data model is helpful for attaching many different evidence and knowledge statements to variant subjects. ClinVar has a mix of high-level VCV classifications and underlying SCV assertions. You can go further and extract the evidence lines used. These can point to the same variant or to related variants. VRS is used to create documents that describe them.
MetaKB is on the somatic cancer knowledge side and allows representations that combine evidence across the different contexts in which it is observed.
Q: Are there still challenges with long reads and structural variants that VRS is tackling?
A: Some limitations have been addressed. One barrier to using fully justified normalization is that in very large repetitive regions you get records with very large alternate alleles, even when the actual difference could be 3 nucleotides. This is a challenge when the VCF and HGVS conventions give shorter, more convenient representations. A recent update to the standard, VRS 2, adds a reference length encoding that points to the region, the reference subsequence that is repeating, and the change in reference length that results from the variant, using an integer representation.
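A rough sketch of the idea behind the reference length encoding: instead of storing the full alternate sequence, store the length of the resulting sequence and the length of the repeating unit, then rebuild the sequence from the reference when needed. The field names `length` and `repeatSubunitLength` are an assumption modeled on the VRS 2 object of this kind, and the helper functions are hypothetical:

```python
def reference_length_expression(alt_length: int, repeat_subunit_length: int) -> dict:
    """Build a compact, integer-only description of the alternate state."""
    return {
        "type": "ReferenceLengthExpression",
        "length": alt_length,
        "repeatSubunitLength": repeat_subunit_length,
    }


def expand(ref_subsequence: str, rle: dict) -> str:
    """Rebuild the alternate sequence by cycling the reference subsequence."""
    if not ref_subsequence:
        raise ValueError("reference subsequence must be non-empty")
    length = rle["length"]
    repeats = -(-length // len(ref_subsequence))  # ceiling division
    return (ref_subsequence * repeats)[:length]


# A CAG repeat region of 12 nt losing one 3-nt unit: the fully justified
# interval spans all 12 nt, but the state needs only two integers.
ref_region = "CAGCAGCAGCAG"
deletion = reference_length_expression(alt_length=9, repeat_subunit_length=3)
print(expand(ref_region, deletion))

# The same region gaining one unit.
insertion = reference_length_expression(alt_length=15, repeat_subunit_length=3)
print(expand(ref_region, insertion))
```

The stored object stays the same size however long the repeat region is, which is the point of using integers rather than a literal sequence.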
VRS 2 also captures adjacencies under the same vocabulary. For SVs where entire regions move, with rotations and various orientations, VRS 2 handles this as well. It is cool to see the SV and small variant worlds united under a shared vocabulary and shared structures.
A big thing on the horizon is the pangenome: graph and k-mer representations of pangenomes and of the variants that exist on these graphs. The VRS group is working with the HPRC community to build compatibility so that the same data objects can represent these variants in a pangenome. They will discuss this at GA4GH Connect in April 2025.
Q: What if a researcher uses VRS Annotator and identifies a cool variant underlying a cool phenotype? If they publish, do they use a VRS ID? How do they communicate it to the wider community?
A: There are many communities that use their own conventions to describe a variant. VRS is not prescriptive about what to call a variant, only about how to represent its data. In conjunction with the Categorical Variation Representation Specification (Cat-VRS), you can create JSON structures that represent variants in any context. Anything from EGFR loss of function to BRAF V600E can be described as a JSON object with an ID, which can be a VRS computed ID or a resource-specific ID.
You can add this JSON document as a supplementary file in a journal submission. In the main text you just use your ID, which ties the variant to the representation specification, and you can still follow the norms of your community.