Today I was asked by a bioinformatics developer how my work on ontology has informed my work in genomics. (For over a decade, I have worked on both biological ontology development and genome annotation & analysis.) I gave a pretty rambling answer, so I wanted to take a moment to clarify my thoughts and then write about it, hence this blog post. Consider this a mini-primer on what ontology can do for genomics, as well as how ontologies, biocuration, and genome annotation might relate to one another. Thanks for reading!
One major lesson I have learned about ontologies as they apply to genomics is that ontologies are very powerful tools for standardizing genomic information—they allow us to make it more searchable, add value to its interpretation, and make it easier to manage. I also learned early on to not trust database annotations on their face without scrutinizing how they were generated or understanding by what evidence they are supported. I’ll address that later – but first some background on ontologies and then genome annotation.
What Is an Ontology?
An ontology is a type of controlled vocabulary where each term is clearly defined and so has a precise meaning, much like words in a dictionary or products in a catalog. But ontologies go further by connecting terms to one another with defined relationships, such as ‘is a’ or ‘part of’. For example, a nucleus is ‘part of’ a cell, and DNA ‘is a’ nucleic acid. There are a number of other relations as well.
Ontologies are generally structured in a tree-like fashion, with a few general terms at the root and many more specific terms farthest from the root (at the “leaf nodes”).
The Gene Ontology (GO) is perhaps the best known biological ontology, and one I have both learned from and contributed to, so I will talk about GO in this mini-primer.
The GO describes biological processes and molecular functions of gene products (including proteins), as well as their locations in the cell. GO terms describe attributes of gene products, and gene products can have multiple attributes; for example, a gene product may be known to participate in several biological processes.
To “annotate” something is to make a note about it. When you annotate a protein with a given GO term, you are describing an attribute of that protein, in essence saying that whatever is described by the GO term’s definition applies to that protein.
Each GO term has a unique numeric ID, a short descriptive name (called a label by ontologists), and a definition that spells out the term’s complete meaning.
Terms have other parts, too, such as synonyms, which allow us to use different phrases to search for the same GO term – and importantly arrive at the same name and definition. Synonyms are very useful to humans, who use inconsistent language.
Each GO term is related to other GO terms with defined relationships like ‘part of’ or ‘is a’. This structures the ontology and makes it easy to traverse by computers. Because ontologies have class-subclass (“parent-child”) relationships, it is possible to find all the subclasses under a common higher level class.
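To make the parent-child traversal concrete, here is a minimal Python sketch of finding all the subclasses under a common higher-level class. The term names are illustrative stand-ins, not real GO terms, and a real ontology would be queried with dedicated tooling rather than a hand-built dictionary.

```python
# Toy ontology: each term maps to its 'is a' parents.
# Term names are illustrative, not real GO terms.
PARENTS = {
    "catalytic activity": [],
    "hydrolase activity": ["catalytic activity"],
    "peptidase activity": ["hydrolase activity"],
    "serine peptidase activity": ["peptidase activity"],
    "transferase activity": ["catalytic activity"],
}

def subclasses_of(term):
    """Return every term that is a (direct or indirect) subclass of `term`."""
    found = set()
    for child, parents in PARENTS.items():
        if term in parents:
            found.add(child)
            found |= subclasses_of(child)  # descend recursively
    return found

print(sorted(subclasses_of("hydrolase activity")))
# ['peptidase activity', 'serine peptidase activity']
```

Because the relationships are explicit, a query for a general class automatically pulls in everything beneath it, which is exactly what makes ontologies easy for computers to traverse.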
We’ll return to GO in a bit.
I have annotated numerous genomes including bacteria, apicomplexans, plants, animals, oomycetes, and fungi. The goal of genome annotation is to determine gene structures and, where possible, assign functions across the entire genome.
Structural annotations can be assigned using various approaches including detecting intrinsic patterns, for example locating simple repeats, start codons, and intron splice sites. Similarity to a reference sequence can also be used, such as local alignment to characterized proteins from other organisms. Alignment of expression data (e.g. RNA-sequencing) originating from the subject being annotated itself can also be used.
Functional annotation uses similarity approaches such as matches to hidden Markov model libraries or similarity to characterized proteins from external sources. Signals such as signal peptide cleavage sites can also be considered. Other ways to determine function include a huge number of laboratory-based experimental methods.
Transitive Annotation at Biological Databases
I often rely on biological databases for functional annotation. Yet databases that contain “known” proteins/genes are often full of transitive annotation errors and are often not reliable for annotation.
For example, if you performed a BLAST search and saw that “protein A” is 80% identical to “protein B”, you might consider that sufficient to annotate protein A with whatever function protein B shows in the database. But what if protein B got its annotation by similarity to protein C, and protein C similarly derived its annotation from protein D?
What do we really know about protein D? How was its function determined? What’s more, proteins A and D might share very low similarity, depending on each protein’s similarity along the transitive chain. There may be no apparent tracking of any of the above information in the database. This is one reason why a BLAST query against NCBI non-redundant database is not necessarily highly informative, per se. You should be skeptical about such results.
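The problem can be sketched as a provenance trace. In this toy Python example (all protein names, functions, and identity percentages are invented), following the chain of similarity-based annotations bottoms out at a protein whose supporting evidence is simply unknown:

```python
# Toy database: each protein's annotation records where it came from.
# All names, functions, and identity percentages are made up for illustration.
ANNOTATIONS = {
    "protein_A": {"function": "kinase", "basis": ("similarity", "protein_B", 80)},
    "protein_B": {"function": "kinase", "basis": ("similarity", "protein_C", 75)},
    "protein_C": {"function": "kinase", "basis": ("similarity", "protein_D", 70)},
    "protein_D": {"function": "kinase", "basis": ("unknown", None, None)},
}

def trace_provenance(protein):
    """Follow the chain of similarity-based annotations back to its origin."""
    chain = [protein]
    while True:
        kind, source, _identity = ANNOTATIONS[chain[-1]]["basis"]
        if kind != "similarity":
            return chain, kind  # reached the ultimate (non-similarity) basis
        chain.append(source)

chain, origin = trace_provenance("protein_A")
print(" -> ".join(chain), "| ultimate evidence:", origin)
# protein_A -> protein_B -> protein_C -> protein_D | ultimate evidence: unknown
```

In real databases this provenance is frequently not recorded at all, which is precisely why the chain is so hard to audit.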
Contrast the above situation with manual GO curation, described below – and keep in mind that one area where GO excels is in its use of evidence to bolster its annotations.
Model Organism Database Annotation
GO annotations are generated by both manual (human) and automatic (computational) methods. Model organism databases (MODs) such as Saccharomyces, mouse, zebrafish, and Arabidopsis (to name a few) have curators on staff who make annotations based on the scientific literature, but they use computational tools as well, such as phylogenetic inference.
Many high-quality annotations are made by reading the literature. A paper might report a particular gene function as determined by a laboratory experiment, for example mutagenesis that resulted in loss of function. Some other gene might be said to have a similar function, but this time based on a computational approach such as orthology determination. But please do not assume that all experiment-based annotations are “better” than sequence-based ones, as this is not always the case.
At a MOD the determined function for each gene, regardless of how it was derived, would be noted by a curator who adds this information to an annotation file.
Each gene annotation gets one row in the file, and each column represents a different attribute, such as gene ID, date, GO term ID, the source paper (typically a PubMed identifier), and so on. The curator reads the paper (the methods and results in particular), considers what they have read, draws some conclusions of their own, and searches the GO for the term that best describes the attribute of the gene product.
To add a particular GO term to the row for a particular gene product annotation is to say that the gene product carries out a function, exists in a location, or participates in a process, as described in that GO term’s definition. We call that associating a GO term with a gene product.
Annotation files may contain many types of data or metadata. Sometimes a given MOD will record some information that is kept “in house” and not shared widely, while a subset is shared with others. MODs that participate in GO annotations share a set of annotations using a common standard to facilitate interoperability.
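As a rough illustration of the row-and-column layout, here is a sketch of parsing one tab-separated annotation line. It is modeled loosely on the GO annotation file (GAF) format but heavily simplified: real GAF files have many more columns with strict conventions, and the database name, gene ID, and paper ID below are made up (GO:0005634, the nucleus, is a real term).

```python
import csv
import io

# A simplified, illustrative annotation row in a GAF-like tab-separated layout.
# Real GAF files carry more columns; the database, gene, and paper IDs are fake.
FIELDS = ["db", "gene_id", "symbol", "go_id", "reference", "evidence", "date"]
ROW = "FakeDB\tGENE0001\tabc1\tGO:0005634\tPMID:0000000\tIDA\t20240101\n"

# Parse the tab-separated row into a field-name -> value record.
record = dict(zip(FIELDS, next(csv.reader(io.StringIO(ROW), delimiter="\t"))))
print(record["gene_id"], record["go_id"], record["evidence"])
# GENE0001 GO:0005634 IDA
```

The point is simply that one row ties a gene product, a GO term, a literature reference, and an evidence code together in a machine-readable way.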
“This gene product has this function, as evidenced by some particular methodology…”
One important field in an annotation file is for evidence, i.e. what methods the authors of the paper used to support their conclusions that the gene has a given function, process, or location.
Most MODs represent evidence with the Evidence and Conclusion Ontology (ECO), an ontology I worked on for about eight years, serving for three as Principal Investigator (PI). By including evidence in the annotation row, it becomes possible to now say, in essence, “this gene product has this function, as evidenced by some particular methodology…” Sounds much more reliable, doesn’t it?
By incorporating evidence into the annotation file, once that annotation gets stored at a giant genomic/proteomic database, any user who downloads it now has the ability to know why we believe what we believe about that protein.
Manual curation isn’t the only way proteins & genes are annotated; it’s very expensive to pay people to generate such high-quality annotations.
Much annotation is automated. But to go beyond simple sequence similarity searches (e.g. BLAST), manual effort has been spent creating mappings between resources. For example, mappings to GO exist in the form of InterPro2GO and EC2GO.
Thanks to these efforts, if your protein has an InterPro domain or an Enzyme Commission number, a GO term can be assigned automatically, adding information. Other such efforts exist.
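As a sketch of how such a mapping might be consumed, the snippet below parses lines modeled loosely on the ec2go file layout. Treat both the exact line format and the EC-to-GO pairing shown as illustrative; check the current mapping file before relying on either.

```python
# Sketch of reading an ec2go-style mapping file. The line layout is modeled
# loosely on the real ec2go file; verify against the current release before
# relying on any specific pairing.
EC2GO_LINES = """\
!version date: illustrative
EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022
"""

def parse_ec2go(text):
    """Build an EC-number -> [GO ID, ...] mapping from ec2go-style lines."""
    mapping = {}
    for line in text.splitlines():
        if line.startswith("!") or " > " not in line:
            continue  # skip comment headers and malformed lines
        ec, rest = line.split(" > ", 1)
        go_id = rest.rsplit(" ; ", 1)[1]  # GO ID trails after the term label
        mapping.setdefault(ec, []).append(go_id)
    return mapping

print(parse_ec2go(EC2GO_LINES))
# {'EC:1.1.1.1': ['GO:0004022']}
```

With such a table in hand, any protein already carrying an EC number can pick up a GO term without any new manual curation.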
What Does Ontology Do for Genomics Data?
At the outset of my career in genomics, before working on ontologies or really understanding how databases receive or manage their data, I took so much for granted. Like many of the students I have taught, initially I didn’t ask enough about the methods or the evidence.
But knowing how ontologies are developed, how biocuration works, how annotations are generated, and how databases receive this information has changed how I look at the data I consume.
Following are a few examples of how ontologies can help you with genomics data.
Study Diverse Taxa
What do you do when two distinct but related projects exist, having arisen independently and using different protocols and terminologies? Can you make useful comparisons?
Ontologies are great for helping to compare disparate data.
Because all MODs contributing such information to GO use standard protocols, the information is similar within a given annotation file, across databases, and across taxa.
The GO annotation file is a standard format. Anyone using it knows what to expect. Further, annotators at different databases working on different organisms but using the GO are using a common controlled vocabulary (an ontology is a type of CV) to describe biology in their respective organisms.
This standardized approach gives us the ability to meaningfully compare annotations across databases and across taxa; this is the basic principle behind MODs using GO. For example, a biological process in a human can be compared to one in a rat or a zebrafish. The ontology is species agnostic.
Data Consistency Checks
It took me a while to fully appreciate the power that ontologies can bring to enforcing quality/consistency checking rules within a database system. When I was writing an ECO paper, the ontologist & bioinformatician Chris Mungall explained this to me.
GO has many rules, for example that certain types of GO annotations can be supported only by certain types of evidence. (There are clearly spelled out explanations for such rules in the documentation.)
When database managers run checks on their data, if certain evidence-annotation combinations are detected, those annotations are flagged as suspect and must be reviewed.
Many other such checks exist that can be enforced with the careful use of ontologies. This helps maintain the integrity of the annotations.
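Here is a minimal sketch of such a check, loosely modeled on GO’s rule that the ND (“no biological data available”) evidence code should be used only with the three root terms. The gene IDs are invented, and the non-root term is used purely for illustration.

```python
# Minimal sketch of an evidence-consistency check, loosely modeled on GO's
# rule that ND ("no biological data available") evidence may only support
# annotations to the three root terms. Gene IDs are illustrative.
ROOT_TERMS = {"GO:0003674", "GO:0008150", "GO:0005575"}  # MF, BP, CC roots

annotations = [
    {"gene": "geneX", "go_id": "GO:0003674", "evidence": "ND"},   # allowed
    {"gene": "geneY", "go_id": "GO:0004022", "evidence": "ND"},   # suspect
    {"gene": "geneZ", "go_id": "GO:0004022", "evidence": "IDA"},  # allowed
]

def flag_suspect(rows):
    """Flag ND-evidence annotations made to non-root terms for review."""
    return [r for r in rows
            if r["evidence"] == "ND" and r["go_id"] not in ROOT_TERMS]

for row in flag_suspect(annotations):
    print("review:", row["gene"], row["go_id"])
# review: geneY GO:0004022
```

Because both the terms and the evidence codes come from ontologies, a rule like this can be written once and applied mechanically to millions of rows.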
Better User Queries
Associating ‘omics data with ontologies also gives users more control over how they query the data. Imagine if you wanted to find all the genes within a given taxonomic group whose functions were determined only by a particular type of evidence (or perhaps not supported by an unwanted type of evidence). This is easy to do when the data are properly associated with evidence. GO provides for this by incorporating ECO terms into its query page.
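A sketch of such an evidence-aware query, assuming the annotations already carry evidence codes (the gene IDs are invented; the evidence codes are real GO experimental codes):

```python
# Evidence-aware query sketch: keep only annotations supported by
# experimental evidence codes, excluding purely electronic ones (IEA).
# Gene IDs are illustrative.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

annotations = [
    {"gene": "gene1", "go_id": "GO:0005634", "evidence": "IDA"},  # experimental
    {"gene": "gene2", "go_id": "GO:0005634", "evidence": "IEA"},  # electronic
]

experimental_only = [a for a in annotations if a["evidence"] in EXPERIMENTAL]
print([a["gene"] for a in experimental_only])
# ['gene1']
```

The same filter inverted (`not in EXPERIMENTAL`) would exclude an unwanted evidence type instead.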
Connecting Ontologies To One Another
Did you know that ontologies themselves can also be connected?
Many ontologies were originally designed for a specific type of application. But you can create linkages among them, which can prove useful both for standardizing ontology development and for drawing inferences about data annotated with them. The process of resolving differences between ontologies and helping them relate better is called “harmonization”.
Ontology terms from different ontologies can be cross-referenced so that a given term from Ontology A has a defined relationship to a given term from Ontology B. If the relationship is one of equivalency, it could be expressed like “Term 123 from Ontology A ‘equivalent to’ Term 456 from Ontology B”. Any annotations to Term 123 in Ontology A would have the same meaning as annotations to Term 456 from Ontology B.
Terms don’t have to have one-to-one relationships, though. They can be composed of combinations of other terms, for example “Term 123 from Ontology A ‘equivalent to’ (Term 456 from Ontology B ‘undergoing some process’ Term 789 from Ontology C)”. In essence, this is my way of saying that ontologies can be built from other ontologies, which helps maintain standardization and speeds development.
Inferencing from Connected Ontologies
By interrelating ontologies, we can learn new things about our data associated with the ontologies, too.
Logically defined inter-ontology relationships allow us to use a computer (via a reasoner) to infer otherwise unseen relationships. We could, for example, encode in our ontology the information that a particular biological process is executed only in a particular part of the cell by relating the term that describes the process to the term that describes the location.
If we then annotated a protein with the term describing that particular process, then we would be able to infer that the protein was found in some part of the cell, irrespective of whether we had a specific annotation to that effect. In a big dataset being searched by a computer, we might find multiple proteins like that that would otherwise have been missed.
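The inference described above can be sketched in a few lines. The process-to-location axiom and the protein names below are purely illustrative, and a real system would use an OWL reasoner rather than a dictionary lookup:

```python
# Minimal sketch of the inference described above: if the ontology states
# that a process occurs only in a given cellular location, any protein
# annotated to that process can be inferred to occur in that location.
# Axiom and protein names are illustrative.
OCCURS_IN = {"DNA replication": "nucleus"}  # process -> location axiom

protein_annotations = {
    "protA": ["DNA replication"],  # location never stated directly
    "protB": ["glycolysis"],       # no location axiom applies
}

def inferred_locations(protein):
    """Infer cellular locations from process annotations plus axioms."""
    return {OCCURS_IN[p] for p in protein_annotations[protein] if p in OCCURS_IN}

print(inferred_locations("protA"))
# {'nucleus'}
```

Note that protA was never annotated to the nucleus directly; the location falls out of combining its process annotation with the axiom.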
Sloppy Names & Good Annotations
Now, to bring it all back to my genome annotation work, consider a FASTA file (a common sequence file format) containing annotated proteins, like you may find at a protein database. Perhaps you want to download this file, and you plan to group proteins by function so that you can design some sort of experiment. This may prove more difficult than you expect.
Many protein entries at databases follow naming conventions very poorly. The FASTA header line (“>…”) for a given protein might contain inconsistent language, taxon-specific information, variable spelling, words like “putative”, “potential”, or “partial”, gene names/identifiers, molecular weights, relative protein sizes, and a host of other information.
I have run NCBI’s annotation consistency checkers routinely after functional annotation of genomes using similarity-based approaches (BLAST & HMM searches) and always find numerous naming issues to correct.
So if you download such a file of inconsistently named predicted proteins, even if many are functionally similar, it’s harder to use. You would have to go through the file manually, looking at a hodgepodge of names, or you would have to script out a solution to determine which proteins you wanted. For a big file with much variation, this is not trivial.
And that’s another great reason for using ontologies. If that same file contained GO IDs, we could easily group proteins by function (using an existing script) and pull out any proteins we wanted from the file based on such information – we wouldn’t need to know a protein’s name at all. (Keep in mind that GO is not a naming system.)
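A sketch of that grouping, assuming (purely for illustration) that GO IDs are embedded in the FASTA headers; real header layouts vary widely, so the pattern would need adapting to the file you actually have:

```python
import re
from collections import defaultdict

# Sketch: group proteins in a FASTA file by GO IDs embedded in the headers.
# The header layout, protein names, and sequences are invented; GO:0016301
# (kinase activity) and GO:0005634 (nucleus) are real terms.
FASTA = """\
>prot1 putative kinase-like protein GO:0016301
MKT...
>prot2 hypothetical protein, partial GO:0016301
MSA...
>prot3 unknown GO:0005634
MLV...
"""

by_go = defaultdict(list)
for line in FASTA.splitlines():
    if line.startswith(">"):                      # header lines only
        name = line[1:].split()[0]                # first token is the ID
        for go_id in re.findall(r"GO:\d{7}", line):
            by_go[go_id].append(name)

print(dict(by_go))
# {'GO:0016301': ['prot1', 'prot2'], 'GO:0005634': ['prot3']}
```

Despite the messy, inconsistent names (“putative”, “hypothetical”, “unknown”), the GO IDs group the functionally similar proteins cleanly.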
Ontologies Make Genomics Better!
The bottom line is that ‘omics data are numerous, messy, and ever growing. Researchers are not like-minded, standards are often not comprehensive, and curators are not perfect. Many sub-communities of researchers have formed over time to study diverse taxa, and they often use different terminologies.
There exists great potential for data inconsistency, data silos, improper annotations, trapped knowledge waiting to be discovered, and difficult-to-use sequence data.
By using ontologies in conjunction with good annotation practices and working to interconnect disparate resources, we can organize our data better. This allows us to standardize our data, help with naming issues, free data from silos, control annotation quality, and even infer some previously unknown things about our data.
Working on ontologies, learning about biocuration, and communicating with developers at many protein, gene, and ‘omics resources has given me a deeper understanding of this bigger picture. This has made me much more aware of the nature of the genomic and proteomic data that I generate and use.
If you want to learn more I suggest you read some of the featured publications at biological resources including major ‘omics databases, MODs, GO, and ECO. (The publications are too numerous to list here.)
I hope this was helpful!