ClinVar / GTR Conclusion


A comparison of ClinVar and the Genetic Testing Registry (GTR) by gene Symbol, using the number of ClinVar Submissions and the number of unique tests in GTR for each gene, shows a positive linear relationship: genes with more research submitted to ClinVar tend to have more clinical and/or research tests registered in GTR.

While the vast majority of genes reported in ClinVar and tests in GTR follow this positive linear relationship, as the graph below shows, certain notable outliers emerge from the data:

  • The LMNA gene has a high number of GTR tests (272) but a relatively low number of ClinVar submissions (515) in proportion to other genes.
  • BRCA2 ranks high in both number of GTR tests (174) and number of ClinVar submissions (7584). Its nearest neighbor in terms of submissions and unique_tests is BRCA1. Together, these two genes are the most tested and best-researched cancer-causing genes in the data.
  • TTN, the gene that encodes titin, has far more ClinVar submissions (4609) than most genes, yet shows only 6 tests in GTR.

Positive Linear Relationship

[Click to see source code]

This linear regression scatterplot demonstrates a positive linear relationship between ClinVar Submissions (the x axis) and number of unique tests in GTR (the y axis).

Scatterplot for x=Submissions (ClinVar) per gene and y=unique_tests per gene

ClinVar Submissions: univariate distribution

The following right-skewed distribution of Submissions per gene Symbol shows that most genes have between 1 and 1000 ClinVar submissions, while some heavily researched genes like TTN, BRCA1, and BRCA2 have many thousands.

The number of distinct genes in ClinVar is roughly 26,000. Since most of these genes have relatively low Submission counts, the values have been log10 normalized to make the graph more readable.
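A small sketch of what that log10 step does to the counts, using only the standard library. The TTN, BRCA2, and LMNA counts are the ones quoted in this post; the GENE_* entries are hypothetical, and the order-of-magnitude binning is my own choice for illustration rather than the binning used in the graph.

```python
import math
from collections import Counter

# Submissions per gene Symbol. GENE_* entries are made up for illustration.
submissions = {
    "TTN": 4609,
    "BRCA2": 7584,
    "LMNA": 515,
    "GENE_A": 3,
    "GENE_B": 12,
    "GENE_C": 1,
}

# log10 is undefined at 0, so only strictly positive counts are transformed.
log_counts = {gene: math.log10(n) for gene, n in submissions.items() if n > 0}

# Bin by order of magnitude: bin 0 holds counts 1-9, bin 1 holds 10-99, etc.
histogram = Counter(math.floor(value) for value in log_counts.values())
```

On the raw scale the handful of genes with thousands of submissions stretch the x axis so far that the low-count majority collapses into a single bar; on the log scale each order of magnitude gets equal width.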

Log10-normalized distribution of ClinVar Submissions per gene.

GTR unique_tests: univariate distribution

The following right-skewed distribution of unique_tests per gene Symbol shows that most genes have few tests (under 25), while some heavily researched genes like BRCA1 and LMNA have far more registered genetic tests (over 200).

GTR: distribution of unique_tests per gene Symbol (skewed right)

A log10 normalization across the same data produces this graph:

GTR: distribution of unique_tests per gene Symbol, log10 normalized.

We might be able to explain the outliers by looking at the pattern of assignment of ClinicalSignificance to the genes recorded in ClinVar Submissions. For example, we might expect to see a very low rate of “pathogenic” calls on variants within the TTN gene, or a very high rate of “pathogenic” calls on variants in LMNA. (An exercise for another day.)

The above graph does not contain “NA” values; that is, genes noted in ClinVar without tests in GTR cannot be shown on this graph.


Clinvar / GTR Data Management

Collapsing and aggregating subgene regions into major gene Symbol groupings.

[Click to see source code]

The basic analysis completed in the previous post was good for a start, but several issues came to light as I examined the gene/submission lists:

  • Many gene Symbols appear in the form [gene]-[suffix]
  • These suffixed genes do not appear in the Genetic Testing Registry.
  • Many genes have significant Submissions quantities (over 10) attached to these subgene Symbols.

For example, the gene region known as “SNAR” contains many named subregions; here’s a listing of all of the subregions of SNAR known to the National Library of Medicine.

Since the Submission counts for various genes in ClinVar appear to be spread out across these subgene Symbols, the final analysis of whether any particular gene had a test in GTR could be significantly impacted by whether its subgene(s) were considered along with it.

I asked a top expert in the genetic testing field (a former coworker) whether it would be valid to “aggregate” the ClinVar Submission counts for each of these Symbols. Her gene specialty is TTN (the gene that encodes titin), for which a highly submitted ClinVar region is “TTN-AS1”. Her expert opinion was that combining the TTN-AS1 results with the TTN results for the purposes of cross-reference with the Genetic Testing Registry made good sense. (Emphasis on this aggregation being for these purposes only, not necessarily for any other type of analysis.)

The focus of my Data Management task thus became transforming the ClinVar gene-to-submission table by “collapsing” all genes appearing to be subgene regions.

A manual inspection of the highest submission count gene regions showed that we can “collapse” the Submission counts in this way with high enough confidence that we are not unduly amplifying signal, even if some false positives are included in the aggregate.
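As a rough sketch of the collapsing step (not the code linked above), the simplest rule is to treat everything before the first hyphen as the parent gene and sum the Submission counts under it. That rule is an assumption on my part, and it is exactly where false positives come from: a legitimately hyphenated Symbol such as HLA-A would be folded into “HLA”. All counts below are invented.

```python
from collections import defaultdict


def parent_symbol(symbol: str) -> str:
    """Return the text before the first hyphen, e.g. 'TTN-AS1' -> 'TTN'."""
    return symbol.split("-", 1)[0]


def collapse_subgenes(submissions: dict[str, int]) -> dict[str, int]:
    """Sum Submission counts of [gene]-[suffix] Symbols into their parent gene."""
    totals = defaultdict(int)
    for symbol, count in submissions.items():
        totals[parent_symbol(symbol)] += count
    return dict(totals)


# Hypothetical counts, for illustration only.
raw = {"TTN": 100, "TTN-AS1": 40, "SNAR-A1": 3, "SNAR-B2": 2, "LMNA": 7}
collapsed = collapse_subgenes(raw)
```

Because GTR does not list the suffixed Symbols, only the collapsed totals can be matched against its per-gene test counts.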

Below are two generated tables containing gene Symbol, number of ClinVar Submissions for this gene region, and number of Unique Tests in GTR for this gene region. Since each table runs to thousands of rows, results in this write-up have been limited to the top 15 in each dimension, sorted in Figure 1 by Submissions (ClinVar) and in Figure 2 by unique_tests (GTR).

Continue reading “Clinvar / GTR Data Management”


Clinvar / GTR Basic Analysis


[ Click to view: Source code | Data ]

Data analysis consisted of joining targeted Genetic Testing Registry and ClinVar tables making use of the ‘Symbol’ column in both tables as the shared key.

Frequency distributions for each gene in ClinVar and GTR were calculated, showing that gene coverage in ClinVar is distributed quite unevenly relative to the tests available in GTR for each gene.

As a result of this analysis, some basic questions could then be posed and answered: which genes are well covered in ClinVar but not represented at all in GTR, and which genes reported as tests in GTR are the least well supported by evidence in ClinVar? The following two tables show the top 10 genes in each category.

Top 10 ClinVar genes by Submissions having no tests available in GTR
Top 10 GTR gene tests least well supported by ClinVar Submission evidence.
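For readers who want the shape of the logic without opening the source, here is a minimal sketch of the Symbol join and the two rankings, written with plain dictionaries rather than the database tables. Every Symbol and count is hypothetical. Reading “least well supported” as “fewest ClinVar Submissions, with a missing gene counted as zero” is my assumption for the sketch.

```python
# Hypothetical per-gene counts keyed by Symbol.
clinvar_submissions = {"GENE_A": 900, "GENE_B": 50, "GENE_C": 1200, "GENE_D": 4}
gtr_unique_tests = {"GENE_A": 30, "GENE_D": 15, "GENE_E": 8}

# Full outer join on Symbol: None marks a gene absent from one side.
symbols = sorted(set(clinvar_submissions) | set(gtr_unique_tests))
joined = [
    (symbol, clinvar_submissions.get(symbol), gtr_unique_tests.get(symbol))
    for symbol in symbols
]

# Genes with ClinVar Submissions but no GTR tests, most submissions first.
no_tests = sorted(
    (row for row in joined if row[1] is not None and row[2] is None),
    key=lambda row: row[1],
    reverse=True,
)[:10]

# Genes with GTR tests, fewest ClinVar Submissions first (missing counts as 0).
least_supported = sorted(
    (row for row in joined if row[2] is not None),
    key=lambda row: row[1] or 0,
)[:10]
```

An outer join matters here: an inner join on Symbol would silently drop exactly the genes that appear in only one dataset, which are the ones both tables are about.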

Data Prep Discussion

Three variables were chosen for this analysis:

  • Symbol (i.e. standardized gene name)
  • Submissions (ClinVar submission count, i.e. number of times gene referenced in ClinVar)
  • unique_tests (GTR unique test count, i.e. number of separate tests represented for this gene in the Genetic Testing Registry)

The “Submissions” column comes from the clinvar.gene_specific_summary table generated by the NCBI. Because NCBI precomputes this table, there is no way to detect missing data in it, and none is suspected.

The “unique_tests” column was generated in the attached Python code by subselecting rows within the GTR.test_condition_gene table to restrict to the following conditions:

  • concept_type="gene"
  • Symbol contains a gene name (not NULL or '-')

Subselecting the GTR data in this way eliminated all rows that did not specify a Symbol (gene name) for their test.
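The filter and count described above can be sketched as follows. This is an illustration, not the attached code: the rows are invented, and the `test_id` column name is hypothetical (the source names only `concept_type` and `Symbol`).

```python
from collections import defaultdict

# Invented rows shaped like GTR.test_condition_gene.
# "test_id" is a hypothetical column name standing in for the test identifier.
rows = [
    {"test_id": "T1", "concept_type": "gene", "Symbol": "BRCA1"},
    {"test_id": "T1", "concept_type": "gene", "Symbol": "BRCA1"},  # repeat row
    {"test_id": "T2", "concept_type": "gene", "Symbol": "BRCA1"},
    {"test_id": "T2", "concept_type": "condition", "Symbol": None},
    {"test_id": "T3", "concept_type": "gene", "Symbol": "-"},
    {"test_id": "T3", "concept_type": "gene", "Symbol": "ENG"},
]


def has_gene_symbol(row: dict) -> bool:
    """Keep gene rows whose Symbol is a real name (not NULL, empty, or '-')."""
    return row["concept_type"] == "gene" and row["Symbol"] not in (None, "", "-")


# Collect the distinct tests seen for each gene Symbol.
tests_by_symbol = defaultdict(set)
for row in rows:
    if has_gene_symbol(row):
        tests_by_symbol[row["Symbol"]].add(row["test_id"])

unique_tests = {symbol: len(ids) for symbol, ids in tests_by_symbol.items()}
```

Using a set per Symbol means a test that appears in several rows for the same gene is counted once.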

Results from this initial data analysis can be found below. The code that produced this readout can be found here (bitbucket).

Data used in this experiment can be downloaded here (bitbucket).

Program Output

Continue reading “Clinvar / GTR Basic Analysis”


Clinvar / GTR Research Questions

ClinVar and GTR: discussion

The subject of this data analysis experiment is a mashup of two datasets, ClinVar and the Genetic Testing Registry. Please see the linked blog posts for a detailed introduction to these datasets, as well as their technical details and links to formal documentation.

Gene (Symbol) based analysis

Does the research represented in ClinVar, indicated by the HUGO gene name symbols assigned to individual variant accessions, demonstrate a relationship with the frequency of distribution of genetic tests for these genes in the Genetic Testing Registry?

Which gene tests in GTR are backed by the most ClinVar submissions?

Are there genes well-represented in terms of ClinVar submissions that are not well represented in the GTR database in terms of gene panel coverage? Or are these two distributions fairly well aligned?

Condition (concept) based analysis

Graph the frequency of conditions (represented by regularized MedGen concept codes aka CUIs) cited in ClinVar versus the frequency of conditions tested for in GTR.

What is the apparent coverage for condition-based testing (GTR) in terms of numbers of accessioned variants for those conditions (ClinVar)?

Further analysis

How does the data landscape change when GTR test_type is restricted to “Clinical”?

Is there a correlation between frequency of pubmed citations for a particular variant and number of GTR tests for the gene in which that variant is found?


  1. Genetic testing (as represented in GTR) follows a gene distribution pattern similar to the distribution of ClinVar submissions.
  2. The greater the number of pubmed citations for variants within particular genes, the greater the number of genetic tests for those genes.

Links to Relevant Research

The NIH genetic testing registry: a new, centralized database of genetic tests to enable access to comprehensive information and improve transparency

Database resources of the National Center for Biotechnology Information

Evaluating the NIH’s New Genetic Testing Registry

Free the Data: The End of Genetic Data as Trade Secrets

A general framework for estimating the relative pathogenicity of human genetic variants

ClinVitae: a unified database of clinically-observed genetic variants aggregated from public sources

ClinVar: public archive of relationships among sequence variation and human phenotype

In Tackling the VUS Challenge, Are Public Databases the Solution or a Liability for Labs?


Continue reading “Clinvar / GTR Research Questions”


Genetic Testing Registry Dataset

The Genetic Testing Registry (GTR) is an NCBI dataset (like ClinVar, available via FTP and eutils) that publishes information on specific genetic tests provided by various institutions. Some tests focus on a particular gene (e.g. BRCA1); some tests comprise disease or condition “panels”; and some entries in GTR bundle the most commonly problematic genes (especially for cancer) into a single test.

For example, a typical disease testing panel for HHT (hereditary hemorrhagic telangiectasia) should encompass at least the ENG and ACVRL1 genes, and potentially also SMAD4. These genes are bundled into “panel” groupings (for example, this HHT Panel by GeneDx) to enable a genetic test provider to respond comprehensively to a doctor’s diagnostic indications for a patient.

As with the ClinVar dataset, the GTR data will be imported for analysis into MySQL using the medgen-mysql toolkit.

Technical Details: Access and Manipulation

Continue reading “Genetic Testing Registry Dataset”


ClinVar Dataset

From the ClinVar NCBI home page:

ClinVar aggregates information about genomic variation and its relationship to human health.

As of September 14, 2015, there are 158,991 accessioned submissions in ClinVar. Each data point represents an observation of a gene variation “in nature”: a human DNA variation read from a sequenced genomic sample, analyzed by variant interpretation scientists, and compared against the clinical literature for information about its pathogenicity and relevance to particular disease conditions.

The genetic testing industry has come to rely heavily on ClinVar for reporting on the (probable) pathogenicity of any given variant. These variants are described in ClinVar and (to varying degrees) within the clinical literature by a short piece of text formatted according to the HGVS (Human Genome Variation Society) standard.

Most ClinVar submissions contain reference to a gene, though many (over 50,000 of them) do not.

The gene names in ClinVar make use of the HUGO gene naming convention, which is the convention that this blog and research will use as well.

In the variant_summary table, ClinVar also makes reference to a “GeneID” variable; this field refers to a gene entry in the NCBI Gene database, which we will access programmatically via metapub.

Conveniently, the GTR database also uses HUGO gene names — a convenience we will exploit in this data exploration.

Technical Details: Access and Manipulation

Continue reading “ClinVar Dataset”
