Exploring and building a knowledge base for the gene-disease association in mitochondrial diseases

Building the knowledge base on mitochondrial disease genes

The MitDisease knowledge base used the MalaCards (http://www.malacards.org/)16, GeneCards (https://www.genecards.org/)17, and MITOMAP (https://mitomap.org/)8 databases to expand mitochondrial disease names and extract disease-related genes. In addition, we followed the rules of the Phenolyzer19 tool for scoring and ranking genes and diseases from different databases. The construction of the mitochondrial disease gene knowledge base is described in the following three steps (see Fig. 6).

Figure 6

Process of building the knowledge base of mitochondrial disease genes.

Acquiring the names of diseases related to mitochondria

First, in the MalaCards16 human disease database, we used keywords such as mitochondr* and mtDNA to search for the names of diseases related to mitochondria. In the GeneCards17 database, we used the 37 genes encoded by the mitochondrial genome (13 polypeptide-encoding genes, 22 tRNAs and 2 rRNAs) as keywords to derive gene-related disease names. In addition, the keyword mitochondr* was used in the GeneCards database to obtain the top 2000 ranked genes, and the disease names related to these genes were then batch-extracted to retrieve mitochondria-related disease names. From MITOMAP8, a comprehensive database of human mitochondrial DNA, we collected the information on mitochondria-related diseases provided on its website. The disease names obtained from the different databases were integrated as preliminary candidate mitochondrial diseases. Second, web-crawling programs were used to capture the alias information of candidate diseases in batches, and diseases with mitochondria-related words (mitochondria, mitochondrial, mtDNA) in their aliases were retained. Finally, we manually checked whether the candidate diseases were mitochondrial diseases by referring to MeSH, MalaCards, OMIM and other disease databases such as HmtDB9, MSeqDR10 and the Human Mitochondrial Genome Polymorphism Database11, combined with literature reports on the related diseases from the NCBI and the criteria for mitochondrial diseases20,21,22, thus obtaining the final names of mitochondrial diseases (see Supplementary Material 1, mitoDiseaseAlias.txt). To ensure the reliability and quality of the data on mitochondria-related diseases, a total of 5 curators verified the results, and a conclusion was accepted only when at least 4 of the 5 curators gave the same result.
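The alias filter and the curator agreement rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the regular expression and the candidate records are assumptions made for the example, and only the keywords (mitochondria, mitochondrial, mtDNA) and the 4-of-5 agreement rule come from the text.

```python
import re

# Keywords from the text; "mitochondria" also matches "mitochondrial".
# Case-insensitive matching is an assumption of this sketch.
MITO_PATTERN = re.compile(r"mitochondria|mtdna", re.IGNORECASE)


def has_mito_alias(aliases):
    """Return True if any alias contains a mitochondria-related word."""
    return any(MITO_PATTERN.search(alias) for alias in aliases)


def curator_verdict(votes, min_agree=4):
    """Return the shared verdict if at least `min_agree` curators agree, else None."""
    yes = sum(1 for vote in votes if vote)
    no = len(votes) - yes
    if yes >= min_agree:
        return True
    if no >= min_agree:
        return False
    return None  # no consensus reached


# Hypothetical candidate diseases with hypothetical aliases.
candidates = {
    "Disease A": ["Mitochondrial disorder type A", "Disorder A"],
    "Disease B": ["Disorder B"],
}
retained = [name for name, aliases in candidates.items() if has_mito_alias(aliases)]
```

In this sketch a candidate without consensus returns `None`, so it can be sent back for further review instead of being silently accepted or rejected.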

Collecting information on mitochondrial diseases

Using the final mitochondrial disease names as keywords, we extracted the genetic information of mitochondrial diseases from the MalaCards and GeneCards databases in batches, standardized all genes to Entrez IDs, and then filtered out genes that could not be matched to an Entrez ID by symbol or alias (see Supplementary Material 2, Homo_sapiens.gene_split). From the GeneCards database, we obtained disease-related genes, gene descriptions, and gene-disease relevance scores. Additionally, we extracted the information in the “Aliases & Classifications” and “Variations” sections in batches from the MalaCards database (see Supplementary Material 3, Variations_GENE_DISEASE).
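The standardization step can be illustrated with a small lookup that resolves a gene name by symbol first and by alias second, and sets aside names that match neither. This is a sketch under assumptions: the record layout, the function names and the placeholder gene records are hypothetical and do not reproduce the format of the supplementary gene file.

```python
def build_lookup(records):
    """Map upper-cased symbols and aliases to gene IDs; symbols take priority."""
    lookup = {}
    for gene_id, _symbol, aliases in records:
        for alias in aliases:
            lookup.setdefault(alias.upper(), gene_id)
    for gene_id, symbol, _aliases in records:
        lookup[symbol.upper()] = gene_id  # an official symbol overrides any alias
    return lookup


def standardize(names, lookup):
    """Split gene names into those matched to an ID and those left unmatched."""
    matched, unmatched = {}, []
    for name in names:
        gene_id = lookup.get(name.strip().upper())
        if gene_id is None:
            unmatched.append(name)
        else:
            matched[name] = gene_id
    return matched, unmatched


# Hypothetical records: (gene ID, official symbol, aliases).
records = [
    ("1001", "GENEA", ["GA1", "A-ALIAS"]),
    ("1002", "GENEB", ["GB1"]),
]
lookup = build_lookup(records)
matched, unmatched = standardize(["genea", "GB1", "UNKNOWN"], lookup)
```

Giving symbols priority over aliases is a design choice of the sketch: it avoids mapping a name to the wrong gene when one gene's alias happens to equal another gene's official symbol.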

Notation and classification of mitochondrial genes and diseases

Disease-related genes obtained from the MalaCards and GeneCards databases were accompanied by gene classifications and scores, and the scores indicated the reliability of the gene-disease association. First, in order to merge related genes from different databases for the same disease, scores from the different databases were normalized to the range 0 to 1 (see Supplementary Material 4, DB_GENECARDS_GENE_DISEASE_SCORE; see Supplementary Material 5, DB_MALACARDS_GENE_DISEASE_SCORE). Based on the evidence for the gene-disease relationship and the extent to which this relationship had been confirmed, we set a score of 100 as the threshold for each gene-disease pair in the gene-disease databases (GeneCards and MalaCards). If the gene-disease score was greater than 100, it was normalized to 1; otherwise, the score was divided by 100 to give the normalized value. Second, according to the rules of the Phenotype Based Gene Analyzer (Phenolyzer)19, a tool focused on gene discovery based on user-specified disease/phenotype terms, the gene-disease scores from the different databases were normalized again and ranked. The specific algorithms used were as follows:

Gene-disease association score Eq. (1):

$$ S\left( {{\text{Gene}},\;{\text{Disease}}} \right) = \sum {{\text{Score}}\left( {{\text{Gene}},\;{\text{Disease}}} \right) \times {\text{Reliability}}} $$

(1)

Score(Gene, Disease) in Eq. (1) represented the normalized gene-disease association score retrieved from a data source, as calculated in the first step. Reliability in Eq. (1) represented the weight of the data source, which was determined by the reliability of that source. The reliability of the GeneCards and MalaCards databases was 0.25 and 1, respectively. S(Gene, Disease) represented the sum of the weighted gene-disease association scores retrieved from all data sources.
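As a worked illustration of the first-step normalization and Eq. (1), the sketch below caps each raw score at the threshold of 100, divides by 100, and sums the results weighted by the reliability of each source. The threshold and the reliability values come from the text; the function names and the raw scores in the example are hypothetical.

```python
# Reliability weights and threshold as stated in the text.
RELIABILITY = {"GeneCards": 0.25, "MalaCards": 1.0}
THRESHOLD = 100.0


def normalize_raw(raw_score, threshold=THRESHOLD):
    """First step: scores above the threshold become 1, others are divided by it."""
    return min(raw_score / threshold, 1.0)


def association_score(raw_scores_by_source):
    """Eq. (1): sum over sources of normalized score times source reliability."""
    return sum(
        normalize_raw(raw) * RELIABILITY[source]
        for source, raw in raw_scores_by_source.items()
    )


# Hypothetical raw scores for one gene-disease pair.
example_pair = {"GeneCards": 80.0, "MalaCards": 150.0}
s_value = association_score(example_pair)  # 0.8 * 0.25 + 1.0 * 1.0
```

A pair reported by only one database simply contributes a single term to the sum, so pairs supported by both sources tend to rank higher.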

Normalized score, Eq. (2):

$$ \tilde{S}\left( {{\text{Gene}},\;{\text{Disease}}} \right) = \frac{{S\left( {{\text{Gene}},\;{\text{Disease}}} \right)}}{{\max \left\{ {S\left( {{\text{Gene}},\;{\text{Disease}}} \right)} \right\}}} $$

(2)

Eq. (2) normalized each score by dividing it by the maximum score, so that the normalized gene-disease association score lay between 0 and 1.
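The normalization in Eq. (2) and the subsequent ranking can be sketched as below. The text does not state whether the maximum is taken over all gene-disease pairs or within each disease; this sketch assumes it is taken over whatever set of pairs is passed in. The S values and the gene and disease labels are hypothetical.

```python
def normalize_by_max(scores):
    """Eq. (2): divide every S(Gene, Disease) by the largest S in the set."""
    top = max(scores.values())
    if top == 0:
        # Guard against division by zero when every score is 0.
        return {pair: 0.0 for pair in scores}
    return {pair: value / top for pair, value in scores.items()}


# Hypothetical S(Gene, Disease) values computed with Eq. (1).
s_scores = {
    ("GENE1", "Disease A"): 1.25,
    ("GENE2", "Disease A"): 0.5,
    ("GENE3", "Disease A"): 0.25,
}
normalized = normalize_by_max(s_scores)

# Rank pairs from the strongest to the weakest association.
ranked = sorted(normalized.items(), key=lambda item: item[1], reverse=True)
```

To normalize within each disease instead, the same function could be applied separately to the pairs of each disease.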

Annotation of gene function

We downloaded the pathway and gene ontology annotation information from the Reactome (https://reactome.org/, version 1)23, KEGG PATHWAY (https://www.kegg.jp/kegg/pathway.html, KEGG API as of September 2020)24,25, and GO (http://geneontology.org/, version 1.4)26 databases (see Supplementary Material 6, Reactome_enrichment; see Supplementary Material 7, KEGG_enrichment; see Supplementary Material 8, GOBP_enrichment/GOCC_enrichment/GOMF_enrichment), and then extracted the database IDs and the corresponding gene information to complete the classification of the gene function annotation library (see Fig. 7). Based on the gene function annotation library, we performed enrichment analysis of the disease-related genes in Python, using the scipy package (scipy.stats.hypergeom, for the Fisher test) to calculate p-values and the statsmodels package (statsmodels.stats.multitest.fdrcorrection) to correct them for multiple testing by the false discovery rate (FDR).
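The enrichment calculation can be illustrated with the standard library alone, so that the arithmetic is visible. This sketch is not the authors' code: it computes the upper tail of the hypergeometric distribution directly, which should correspond to `scipy.stats.hypergeom.sf(k - 1, M, n, N)`, and applies a Benjamini-Hochberg adjustment, assuming that this is the procedure applied by `fdrcorrection` with its default settings. The term names and gene counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, population, annotated, sample):
    """P(X >= k) when drawing `sample` genes from `population`, of which
    `annotated` genes belong to the term being tested."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    hits = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical counts: 10 background genes, 3 disease-related genes.
population_size = 10
disease_gene_count = 3
# term -> (genes annotated to the term, overlap with the disease genes)
term_counts = {"TERM_A": (4, 2), "TERM_B": (5, 1)}

pvalues = [
    hypergeom_upper_tail(overlap, population_size, annotated, disease_gene_count)
    for annotated, overlap in term_counts.values()
]
fdr = benjamini_hochberg(pvalues)
```

For real annotation libraries with thousands of genes, the library routines named in the text are the more practical choice; the functions here only make the calculation explicit.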

Figure 7

Annotation of gene function.

Website realization

The MitDisease website provided a user-friendly web interface. HTML 5.5.31 and JavaScript were used for front-end development of the MitDisease knowledge base, and the non-relational MongoDB database was used for data storage and management. Web crawling, the invocation of the calculation programs, and the API were all implemented in Python and its dependent packages.

Donald E. Patel