Collapsing and aggregating subgene regions into major gene Symbol groupings.
The basic analysis completed in the previous post was good for a start, but several issues came to light as I examined the gene/submission lists:
- Many gene Symbols appear in the form [gene]-[suffix]
- These suffixed genes do not appear in the Genetic Testing Registry.
- Many of these subgene Symbols carry a substantial number of Submissions (over 10) that are not counted toward the parent gene.
For example, the gene region known as “SNAR” contains many named subregions; here’s a listing of all of the subregions of SNAR known to the National Library of Medicine.
Since the Submission counts for various genes in ClinVar appear to be spread out across these subgene Symbols, the final analysis of whether any particular gene had a test in GTR could be significantly impacted by whether its subgene(s) were considered along with it.
I asked a top expert in the genetic testing field (a former coworker) whether it would be valid to “aggregate” the ClinVar Submission counts for each of these Symbols. Her gene specialty is TTN (the gene for the protein titin), for which a highly submitted ClinVar region is “TTN-AS1”. Her expert opinion was that combining the TTN-AS1 results with the TTN results made good sense for the purposes of cross-referencing with the Genetic Testing Registry. (Emphasis on this aggregation being for these purposes only, not necessarily for any other type of analysis.)
The focus of my Data Management task thus became transforming the ClinVar gene-to-submission table by “collapsing” every Symbol that appears to be a subgene region into its parent gene Symbol and summing the Submission counts.
A manual inspection of the highest submission count gene regions showed that we can “collapse” the Submission counts in this way with high enough confidence that we are not unduly amplifying signal, even if some false positives are included in the aggregate.
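As a rough illustration of the “collapse” step, here is a minimal Python sketch. It assumes the subgene suffix is everything after the first hyphen in the Symbol, which is my simplification for this example and not necessarily the exact rule used to produce the tables. The function names and the counts are hypothetical, made up for illustration only.

```python
from collections import defaultdict


def collapse_symbol(symbol: str) -> str:
    """Return the part of a gene Symbol before the first hyphen.

    Assumption: the text after the first hyphen is a subgene suffix.
    """
    return symbol.split("-", 1)[0]


def aggregate_submissions(counts: dict[str, int]) -> dict[str, int]:
    """Sum Submission counts over all Symbols sharing a collapsed name."""
    totals = defaultdict(int)
    for symbol, n_submissions in counts.items():
        totals[collapse_symbol(symbol)] += n_submissions
    return dict(totals)


# Made-up counts for illustration only (not real ClinVar figures).
example = {"TTN": 100, "TTN-AS1": 40, "SNAR-A1": 3, "SNAR-B2": 2}
collapsed = aggregate_submissions(example)
print(collapsed)
```

Note that a rule this simple also merges Symbols whose hyphen is part of the official name (HLA-A would collapse to HLA, for example), which is the kind of false positive mentioned above.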
Below are two generated tables containing the gene Symbol, the number of ClinVar Submissions for that gene region, and the number of Unique Tests in GTR for that gene region. Since the full tables run to thousands of rows, results in this write-up have been limited to the top 15 in each dimension, sorted in Figure 1 by Submissions (ClinVar) and in Figure 2 by unique_tests (GTR).