ClinVar Dataset

From the ClinVar NCBI home page:

ClinVar aggregates information about genomic variation and its relationship to human health.

As of September 14, 2015, there are 158,991 accessioned submissions in ClinVar. Each data point represents an observation of a gene variation “in nature”: a human DNA variation read from a sequenced genomic sample, analyzed by variant interpretation scientists, and compared against the clinical literature for information about its pathogenicity and relevance to particular disease conditions.

The genetic testing industry has come to rely heavily on ClinVar for reporting on the (probable) pathogenicity of any given variant. These variants are described in ClinVar and (to varying degrees) within the clinical literature by a short piece of text formatted according to the HGVS (Human Genome Variation Society) standard.
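To make the "short piece of text" concrete, here is a minimal sketch that splits a simple HGVS string into its parts. The regular expression is our own simplification and covers only the common "reference:type.change" shape, not the full HGVS grammar; the accession in the example is made up.

```python
import re

# Simplified pattern: "<accession>.<version>[(gene)]:<type>.<change>".
# This is an illustrative assumption, not a complete HGVS parser.
HGVS_SIMPLE = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+)"
    r"(?:\((?P<gene>[^)]+)\))?"
    r":(?P<kind>[cgmnpr])\.(?P<change>.+)$"
)


def split_hgvs(text):
    """Return the parts of a simple HGVS string as a dict, or None if it does not match."""
    match = HGVS_SIMPLE.match(text.strip())
    return match.groupdict() if match else None


# Made-up accession and change, for illustration only.
parts = split_hgvs("NM_000000.1:c.100A>G")
print(parts)
```

Anything the pattern does not recognize comes back as None, so unusual expressions can be set aside for manual review rather than silently misparsed.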

Most ClinVar submissions reference a gene, though many (over 50,000 of them) do not.

The gene names in ClinVar follow the HUGO gene naming convention, which this blog and research will use as well.

In the variant_summary table, ClinVar also makes reference to a “GeneID” variable; this field refers to a gene entry in the NCBI Gene database, which we will access programmatically via metapub.

Conveniently, the GTR (Genetic Testing Registry) database also uses HUGO gene names, which we will exploit in this data exploration.

Technical Details: Access and Manipulation

The ClinVar dataset is made publicly downloadable by NCBI via FTP.

The files in the tab_delimited subdirectory contain the data of interest to this project.
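For readers who want to peek at a tab-delimited file before loading it into MySQL, a minimal sketch using only the Python standard library might look like the following. The sample rows are made up, and the column names are borrowed from the MySQL table shown later on this page; the header in the raw download may differ.

```python
import csv
import io

# Two made-up rows standing in for the real download; values are illustrative only.
SAMPLE = (
    "AlleleID\tGeneID\tSymbol\tClinicalSignificance\n"
    "1\t101\tGENE1\tPathogenic\n"
    "2\t-1\t-\tBenign\n"
)


def read_rows(handle):
    """Return each line of a tab-delimited file as a dict keyed by the header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_rows(io.StringIO(SAMPLE))
print(len(rows), rows[0]["Symbol"])
```

The same function should work on a real file opened with open(path, newline=""), where path points at an uncompressed copy of the download.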

The medgen-mysql toolkit will be used to load ClinVar tables into a MySQL database for analysis.

Additionally, the metapub toolkit will be used to explore ClinVar via Python. Details about the pathways metapub uses to query and crosslink information between ClinVar, MedGen, and GTR can be found in NCBI’s ClinVar Maintenance and Use documentation.

Tabular Data of Interest

We will be mainly concerned with the information in the variant_summary table. The var_citations table may come into play for additional research questions about how frequently variants are cited in PubMed articles.

mysql> show tables;
+----------------------------+
| Tables_in_clinvar          |
+----------------------------+
| README                     |
| clingen_gene_curation_list |
| clinvar_hgvs               |
| cross_references           |
| disease_names              |
| gene_condition_source_id   |
| gene_specific_summary      |
| log                        |
| molecular_consequences     |
| var_citations              |
| variant_summary            |
+----------------------------+

mysql> describe variant_summary;
+----------------------+--------------+------+-----+---------+-------+
| Field                | Type         | Null | Key | Default | Extra |
+----------------------+--------------+------+-----+---------+-------+
| AlleleID             | int(11)      | NO   | MUL | NULL    |       |
| variant_type         | varchar(50)  | NO   |     | NULL    |       |
| variant_name         | varchar(255) | YES  | MUL | NULL    |       |
| GeneID               | int(11)      | NO   |     | NULL    |       |
| Symbol               | varchar(20)  | NO   |     | NULL    |       |
| ClinicalSignificance | varchar(200) | YES  | MUL | NULL    |       |
| rs                   | int(11)      | YES  | MUL | NULL    |       |
| dbvar_nsv            | text         | YES  |     | NULL    |       |
| RCVaccession         | text         | YES  |     | NULL    |       |
| TestedInGTR          | char(1)      | YES  | MUL | NULL    |       |
| PhenotypeIDs         | varchar(500) | YES  |     | NULL    |       |
| Origin               | text         | YES  |     | NULL    |       |
| Assembly             | text         | YES  |     | NULL    |       |
| Chromosome           | varchar(20)  | YES  |     | NULL    |       |
| Start                | int(11)      | YES  | MUL | NULL    |       |
| Stop                 | int(11)      | YES  | MUL | NULL    |       |
| Cytogenetic          | text         | YES  |     | NULL    |       |
| ReviewStatus         | text         | YES  |     | NULL    |       |
| HGVS_c               | varchar(200) | YES  | MUL | NULL    |       |
| HGVS_p               | varchar(200) | YES  | MUL | NULL    |       |
| NumberSubmitters     | int(11)      | YES  |     | NULL    |       |
| LastEvaluated        | text         | YES  |     | NULL    |       |
| Guidelines           | text         | YES  |     | NULL    |       |
| OtherIDs             | varchar(500) | YES  |     | NULL    |       |
| VariationID          | int(11)      | YES  | MUL | NULL    |       |
+----------------------+--------------+------+-----+---------+-------+

mysql> describe var_citations;
+-----------------+--------------+------+-----+---------+-------+
| Field           | Type         | Null | Key | Default | Extra |
+-----------------+--------------+------+-----+---------+-------+
| AlleleID        | int(11)      | NO   | MUL | NULL    |       |
| VariationID     | int(11)      | NO   | MUL | NULL    |       |
| rs              | int(11)      | YES  | MUL | NULL    |       |
| nsv             | int(11)      | YES  | MUL | NULL    |       |
| citation_source | varchar(100) | YES  | MUL | NULL    |       |
| citation_id     | int(11)      | YES  | MUL | NULL    |       |
+-----------------+--------------+------+-----+---------+-------+
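As a sketch of how the two tables relate, the following counts PubMed citations per variant by joining var_citations to variant_summary on AlleleID. It uses an in-memory SQLite database with a cut-down schema and made-up rows so that it runs anywhere; the real analysis would run against the MySQL tables described above. The 'PubMed' value for citation_source is an assumption about how the data is labeled.

```python
import sqlite3

# In-memory stand-in for the MySQL database, with only the columns we need.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variant_summary (AlleleID INTEGER, VariationID INTEGER, Symbol TEXT);
CREATE TABLE var_citations (AlleleID INTEGER, VariationID INTEGER,
                            citation_source TEXT, citation_id INTEGER);
""")

# Made-up rows, for illustration only.
conn.executemany("INSERT INTO variant_summary VALUES (?, ?, ?)",
                 [(1, 11, "GENE1"), (2, 22, "GENE2")])
conn.executemany("INSERT INTO var_citations VALUES (?, ?, ?, ?)",
                 [(1, 11, "PubMed", 1001),
                  (1, 11, "PubMed", 1002),
                  (1, 11, "PubMedCentral", 5001)])

# LEFT JOIN keeps variants that have no citations; they get a count of 0.
QUERY = """
SELECT vs.AlleleID, vs.Symbol, COUNT(DISTINCT vc.citation_id) AS n_pubmed
FROM variant_summary AS vs
LEFT JOIN var_citations AS vc
       ON vc.AlleleID = vs.AlleleID AND vc.citation_source = 'PubMed'
GROUP BY vs.AlleleID, vs.Symbol
ORDER BY vs.AlleleID
"""
counts = conn.execute(QUERY).fetchall()
print(counts)
```

Putting the citation_source filter in the join condition rather than a WHERE clause is what keeps uncited variants in the result.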

Experiment Codebook

variant_summary.RCVaccession: character, list of RCV accessions that report this variant
variant_summary.VariationID: unique Variation ID assigned to each variant (confirmation needed)
variant_summary.GeneID: integer, GeneID in NCBI’s Gene database
variant_summary.Symbol: character, comma-separated list of gene symbols overlapping the variation (NULL or '-' if not named)
variant_summary.HGVS_c: character, RefSeq cDNA-based HGVS expression
variant_summary.NumberSubmitters: integer, number of submissions with this variant
variant_summary.ClinicalSignificance: character, comma-separated list of values of clinical significance reported for this variation
variant_summary.AlleleID: integer value as stored in the AlleleID field in ClinVar
var_citations.AlleleID: integer, corresponds to AlleleID in variant_summary
var_citations.VariationID: integer, corresponds to VariationID in variant_summary
var_citations.citation_source: character, name of citation index to which citation_id belongs
var_citations.citation_id: integer, unique ID within citation_source index for this article (citation)
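
Two codebook conventions are easy to trip over: a Symbol of NULL or '-' means no gene was named, and ClinicalSignificance packs several values into one comma-separated string. A minimal sketch of handling both follows; the helper names and the sample rows are our own, made up for illustration.

```python
from collections import Counter


def has_gene(symbol):
    """True when Symbol names a gene; None, empty, and '-' mean no gene was named."""
    return symbol not in (None, "", "-")


def significance_values(raw):
    """Split the comma-separated ClinicalSignificance field into trimmed values."""
    if not raw:
        return []
    return [value.strip() for value in raw.split(",") if value.strip()]


# Made-up rows, for illustration only.
rows = [
    {"Symbol": "GENE1", "ClinicalSignificance": "Pathogenic, Likely pathogenic"},
    {"Symbol": "-", "ClinicalSignificance": "Benign"},
    {"Symbol": None, "ClinicalSignificance": None},
]

without_gene = sum(1 for row in rows if not has_gene(row["Symbol"]))
tally = Counter(
    value for row in rows for value in significance_values(row["ClinicalSignificance"])
)
print(without_gene, tally)
```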
