ClinVar / GTR Basic Analysis

Summary


The analysis joined selected Genetic Testing Registry (GTR) and ClinVar tables, using the ‘Symbol’ column present in both tables as the shared key.
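A minimal sketch of such a join in pandas follows. It assumes an outer merge, which is a guess based on the NaN values in the program output rather than something the original code confirms. The frames hold a few rows copied from the program output, and the variable names (clinvar, gtr, merged) are illustrative.

```python
import pandas as pd

# Toy stand-ins for the two tables, using a few rows from the program output.
# The ClinVar frame is truncated, so LMNA has no Submissions value here
# purely because of the toy data.
clinvar = pd.DataFrame({
    "Symbol": ["BRCA2", "BRCA1", "TTN-AS1"],
    "Submissions": [7584, 5562, 1562],
})
gtr = pd.DataFrame({
    "Symbol": ["BRCA2", "LMNA"],
    "unique_tests": [174, 272],
})

# An outer join on the shared key keeps genes that appear in only one table;
# the column from the missing side is filled with NaN.
merged = clinvar.merge(gtr, on="Symbol", how="outer")
print(merged)
```

Genes present in only one table keep their row, which is what makes the "no tests in GTR" question answerable later on.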

Frequency distributions were calculated for each gene in ClinVar and GTR, showing a highly uneven distribution of gene coverage in ClinVar relative to the tests available in GTR for each gene.

This analysis made it possible to pose and answer two basic questions: which genes are well covered in ClinVar but not represented at all in GTR, and which genes reported as tests in GTR are the least well supported by evidence in ClinVar? The following two tables list the top 10 genes in each category.

Top 10 ClinVar genes by Submissions having no tests available in GTR
Top 10 GTR gene tests least well supported by ClinVar Submission evidence
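
Both questions come down to filtering the joined table on whether unique_tests is missing and then sorting by Submissions. The sketch below shows that on a few rows copied from the program output. The frame name merged and the exact filtering steps are assumptions, not taken from the original code.

```python
import pandas as pd

nan = float("nan")

# A few rows of the joined table, copied from the program output.
merged = pd.DataFrame({
    "Symbol": ["TTN-AS1", "IRAK1", "BRCA2", "ATPAF1", "TOMM40"],
    "Submissions": [1562, 1029, 7584, 1, 1],
    "unique_tests": [nan, nan, 174, 1, 4],
})

# Question 1: genes with ClinVar submissions but no GTR tests,
# most-submitted first.
no_tests = (
    merged[merged["unique_tests"].isna()]
    .sort_values("Submissions", ascending=False)
    .head(10)
)

# Question 2: genes with at least one GTR test,
# fewest ClinVar submissions first.
least_supported = (
    merged[merged["unique_tests"].notna()]
    .sort_values("Submissions", ascending=True)
    .head(10)
)

print(no_tests)
print(least_supported)
```

Rows that tie on Submissions come back in no particular order, which may be why the second table in the output is not sorted by any second column.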

Data Prep Discussion

Three variables were chosen for this analysis:

  • Symbol (i.e. standardized gene name)
  • Submissions (ClinVar submission count, i.e. number of times gene referenced in ClinVar)
  • unique_tests (GTR unique test count, i.e. number of separate tests represented for this gene in the Genetic Testing Registry)

The “Submissions” column comes from the clinvar.gene_specific_summary table generated by NCBI. Because this table is precomputed, missing data cannot be detected, and none is suspected.

The “unique_tests” column was generated in the attached Python code by selecting only those rows of the GTR.test_condition_gene table that meet the following conditions:

  • concept_type="gene"
  • Symbol contains a gene name (not NULL or '-')

Subselecting the GTR data in this way eliminated all rows that did not specify a Symbol (gene name) for their test.
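
The filtering and counting step could look roughly like the sketch below. The column names concept_type and Symbol come from the description above. The test identifier column test_id and the sample rows are hypothetical, since the actual column name is not given here.

```python
import pandas as pd

# Toy rows; "test_id" is a hypothetical name for the test identifier column.
gtr_rows = pd.DataFrame({
    "test_id": ["T1", "T2", "T2", "T3", "T4", "T5"],
    "concept_type": ["gene", "gene", "gene", "condition", "gene", "gene"],
    "Symbol": ["LMNA", "LMNA", "LMNA", "LMNA", "-", None],
})

# Keep only gene rows that carry a real gene symbol (not NULL or '-').
is_gene = gtr_rows["concept_type"] == "gene"
has_symbol = gtr_rows["Symbol"].notna() & (gtr_rows["Symbol"] != "-")
genes_only = gtr_rows[is_gene & has_symbol]

# Count distinct tests per gene symbol.
unique_tests = (
    genes_only.groupby("Symbol")["test_id"].nunique().rename("unique_tests")
)
print(unique_tests)
```

Using nunique rather than a plain row count means a test listed more than once for the same gene is only counted once.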

Results from this initial data analysis can be found below. The code that produced this readout can be found here (bitbucket).

Data used in this experiment can be downloaded here (bitbucket).

Program Output

=======
CLINVAR: loaded clinvar.gene_specific_summary.csv

> Total genes in ClinVar: 26375

> ClinVar top genes by Submission count

      Symbol  Submissions
1745   BRCA2         7584
1744   BRCA1         5562
24557    TTN         3047
24405   TSC2         1987
883      APC         1800

=======
GTR: loaded GTR.test_condition_gene.csv

> Total genes tested according to GTR: 3863

> GTR top genes by number of tests

Symbol
LMNA     272
PTEN     185
FGFR3    177
BRCA2    174
CFTR     170
Name: unique_tests, dtype: int64

=======
CLINVAR/GTR Basic Analysis

> Number of genes known in ClinVar not found in any Genetic Testing Registry test

	Expected: 26375 - 3863 = 22512

	Actual:  22495


----------------------------
Questions we can now answer!
----------------------------
Which genes well-covered in Clinvar have no tests in GTR?
   (showing top 10 results)

               Symbol  Submissions  unique_tests
24558         TTN-AS1         1562           NaN
8871            IRAK1         1029           NaN
17285         PCDH11Y          469           NaN
11891    LOC102723833          438           NaN
16045       NIPBL-AS1          320           NaN
1895         C11orf65          262           NaN
19787  RPL36A-HNRNPH2          244           NaN
4704            DGCR9          177           NaN
4697           DGCR10          176           NaN
24481           TSSK2          173           NaN

------------------------
Which genes tested in GTR are the least well supported in ClinVar?
   (showing top 10 results)

         Symbol  Submissions  unique_tests
1392     ATPAF1            1             1
23507    TOMM40            1             4
24931    UQCRC1            1             1
24934     UQCRH            1             1
18298      PPT2            1             4
1336      ATP5B            1             1
3184   CEACAM16            1            13
15382     MT-TD            1            10
2660       CBLC            1             1
15391     MT-TM            1            10

Done!
=======
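
The expected and actual counts in the output differ by 17 (22512 versus 22495). The expected figure is exact only if every GTR symbol also appears in ClinVar and each symbol occurs once per table. The sketch below uses made-up symbols to show two checks that could help account for such a gap. It assumes the total counts rows while the actual figure counts distinct symbols, and whether either effect applies to the real data is not established here.

```python
from collections import Counter

# Made-up symbols, not taken from the real tables.
clinvar_symbols = ["A", "B", "C", "C", "D", "D"]  # "C" and "D" occur twice
gtr_symbols = ["A", "B", "E"]                     # "E" is missing from ClinVar

clinvar_set = set(clinvar_symbols)
gtr_set = set(gtr_symbols)

# Naive expectation: ClinVar row count minus number of tested genes.
expected = len(clinvar_symbols) - len(gtr_set)

# Direct count of distinct ClinVar symbols that have no GTR entry.
actual = len(clinvar_set - gtr_set)

# Check 1: rows beyond the first for each ClinVar symbol (duplicates).
extra_rows = sum(n - 1 for n in Counter(clinvar_symbols).values())

# Check 2: GTR symbols that never appear in ClinVar.
gtr_only = gtr_set - clinvar_set

print(expected, actual, extra_rows, sorted(gtr_only))

# Under the stated assumptions, the gap is explained by these two effects.
assert expected - actual == extra_rows - len(gtr_only)
```

Duplicate ClinVar rows push the actual count below the naive expectation, while GTR symbols absent from ClinVar push it above, so the sign of the gap hints at which effect dominates.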