GAG relies on a MySQL database with 8 tables, which gather annotation data from several sources together with computed cross-references.
GenBank and Ensembl data are split into 6 tables, 3 per institution (prefixed ncbi_ and embl_, respectively). These tables store RNA transcript data (embl_seq_gene and ncbi_seq_gene), the links between RNA transcripts and gene IDs (embl_map_gene and ncbi_map_gene), and gene annotation (embl_annot_gene and ncbi_annot_gene, including outgoing references to UniProt/Swiss-Prot).
HGNC data are stored in a separate table (hgnc_data).
Finally, the main table, xref_gene, stores the computed cross-references and includes the following fields:
Taxa_ID | The numerical taxon ID of the gene under consideration |
NCBI_Gene_ID | GenBank Gene ID |
Embl_Gene_ID | Ensembl Gene ID |
Status | A descriptive field indicating whether the proposed cross-reference is currently accepted by GenBank/Ensembl (Known/Validated and Known/Corrected) or predicted by the GAG process (Predicted) |
Annotation_Score | Number of words shared by the GenBank and Ensembl descriptive annotations |
Common_Symbol | Whether the Ensembl and GenBank genes share the same symbol |
Common_Human_Homologs | Whether the Ensembl and GenBank genes share at least one human homolog gene |
Common_Mouse_Homologs | Whether the Ensembl and GenBank genes share at least one mouse homolog gene |
Common_Chicken_Homologs | Whether the Ensembl and GenBank genes share at least one chicken homolog gene |
Common_UniProt_ID | Whether the Ensembl and GenBank genes share the same UniProt ID |
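
As an illustration of how the xref_gene fields listed above might be queried, the following minimal sketch builds a table with those field names and selects the predicted cross-references that are supported by a shared symbol or a shared UniProt ID. Several things here are assumptions rather than part of GAG: the column types, the 0/1 encoding of the Common_* flags, the use of an in-memory SQLite database in place of the MySQL server (so that the sketch runs without any setup), the helper name supported_predictions, and the three placeholder rows, which are invented and are not real GAG records.

```python
import sqlite3

# Assumption: column types are guesses; the source only names the fields.
# An in-memory SQLite database stands in for the MySQL server.
SCHEMA = """
CREATE TABLE xref_gene (
    Taxa_ID                 INTEGER,
    NCBI_Gene_ID            INTEGER,
    Embl_Gene_ID            TEXT,
    Status                  TEXT,
    Annotation_Score        INTEGER,
    Common_Symbol           INTEGER,  -- 0/1 flag (assumed encoding)
    Common_Human_Homologs   INTEGER,
    Common_Mouse_Homologs   INTEGER,
    Common_Chicken_Homologs INTEGER,
    Common_UniProt_ID       INTEGER
)
"""

# Placeholder rows, invented for illustration only (not real GAG records).
ROWS = [
    (1, 101, "GENE_A", "Known/Validated", 5, 1, 1, 1, 0, 1),
    (1, 102, "GENE_B", "Predicted", 3, 1, 0, 0, 0, 1),
    (1, 103, "GENE_C", "Predicted", 0, 0, 0, 0, 0, 0),
]


def supported_predictions(conn, taxa_id):
    """Return predicted cross-references backed by a shared symbol or UniProt ID.

    Hypothetical helper: results are ordered by decreasing Annotation_Score.
    """
    query = """
        SELECT NCBI_Gene_ID, Embl_Gene_ID, Annotation_Score
        FROM xref_gene
        WHERE Taxa_ID = ?
          AND Status = 'Predicted'
          AND (Common_Symbol = 1 OR Common_UniProt_ID = 1)
        ORDER BY Annotation_Score DESC
    """
    return conn.execute(query, (taxa_id,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO xref_gene VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", ROWS
)
print(supported_predictions(conn, 1))
```

Against the actual MySQL database, the same SELECT statement would presumably apply once the connection is opened with a MySQL client library and its parameter placeholder style is used in place of `?`.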