PathCase Architecture & Data Model

1. Introduction

Living organisms behave as complex systems that are flexible and adaptive to their surroundings. At the molecular level, organisms consist of intricate networks of molecular reactions, which are often called "biochemical pathways". The PathCase system is being developed to maintain, visualize, and ultimately analyze the organism functions that result from biochemical pathways. The system contains a pathways database and the associated tools to store, compare, query, and visualize biochemical pathways through a web client. The aim is an integrated database, with associated tools, that supports computational analysis and visualization of biochemical pathways. The ultimate goal is to describe, utilize, and predict the system-level functions and behaviors of living organisms.

The PathCase system employs a client-server architecture (see Figure 1). On the server side, PathCase data is kept in a relational database management system, MS SQL Server 2005. In addition, the PathCase development team built an object-oriented data-access interface between the relational database and the application layer. This interface provides flexible and intuitive access to the data and shields the application from major changes when the schema changes as PathCase evolves. The object model behind the system is a persistent object model, which hides the details of the actual database implementation and database access. For instance, a developer can create an object that is stored in the database, or search for a specific object, by using the PathCase API without writing any SQL query.

Figure 1. Illustration of PathCase Client/Server Architecture

2. Database Model

Pathways are the sequential and cumulative action of genetically distinct, but functionally related molecules.
Each reaction in a pathway is a biochemical step from specific substrates (input molecules) to products (output molecules) that are chemically modified substrates. Each step may also use various combinations of molecules as cofactors, activators, inhibitors, and regulators, and usually involves at least one genetically unique gene product that catalyzes the reaction step. PathCase integrates genomic and biological information that can be managed, analyzed, queried, and displayed dynamically at various levels of biological and genetic detail, to provide insight into diverse biological processes in health and disease.

The design of the underlying database in PathCase is the primary influence on how data is acquired, presented, edited, and entered. While different tools can take slightly different approaches, the logical steps for handling the data must be the same, and this is ultimately reflected in an overall similarity among all tools that access and work with the database. For this reason, a solid understanding of the data model is fundamental to a positive experience with any of the tools available now or in the future for the Pathways Database System. This chapter describes the three primary players in the database model (molecular entities, processes, and pathways) and the various other data items that help complete the system.

2.1 Molecular Entities

All biochemical pathways can be described simply as interactions between molecular entities throughout the organism. Molecules floating around inside (and outside) the cell sometimes come together, often resulting in changes to those molecules. The changed molecules then go on to interact with other molecules, continuing the procedure and giving rise to cascading pathways. At the very foundation of the notion of a pathway, then, is the single molecule, which is referred to here as a molecular entity.
Lumping all molecular entities together into a single group, though, is not an efficient storage method, because different classes of entities possess unique attributes. For example, RNAs are molecular entities with a set of characteristics quite distinct from those of amino acids (another class of molecular entity). To solve this storage problem, the molecular entities are divided into five classes: genes, RNAs, proteins, amino acids, and basic molecules. These classes and their unique properties are discussed in detail in the remainder of this section.

2.1.1 Genes

Genes are often considered the blueprint of life. They are stretches along the genome of an organism that are, have been, or will be engaged in some function of that organism's life. All of a creature's genetic information is stored in long strings (chromosomes) of connected nucleotides, whose bases are adenine, cytosine, guanine, and thymine. The set of chromosomes (if there is more than one) is usually called the genome of that creature. Genes, then, are just functional substrings along the longer chromosomal strings. Strictly speaking, the chromosomes themselves are the true molecular entities that store genetic information, but the Pathways Database adopts the convention that genes are viewed as distinct molecular entities. Note, though, that the true biological model is still represented in the database. Every gene in the Pathways Database lies on a unique chromosome, which itself is unique to a specific organism. This policy is strictly enforced, and it prevents arbitrary genes and chromosomes that cannot be traced back to an originating species.

2.1.1.1 Multiple Organisms, Multiple Chromosomes

As mentioned above, each gene must be part of a chromosome, which must exist in some organism. A subtle side effect of this policy is that genes shared between species are not considered the same in the PathCase system.
These genes may in fact have the same name across a group of species, but since each species has a different set of chromosomes, there must be as many copies of these genes as there are species in which the genes exist. There is justification for this approach: the genes are not the same. While genes shared between species are usually identical in function and highly similar in nucleotide sequence, there is nevertheless always some difference (usually at the sequence level). For data storage, these differences manifest themselves as separate identifiers in external data sources. For example, both humans and mice share the Abl2 gene, although each gene has its own GenBank[?] reference. Each nucleotide difference in the gene can lead to an amino acid change at the protein level (discussed below), which can change the conformation (three-dimensional shape) and thus the actual biochemistry of the final enzyme. There is also the case of genes that are entirely unique to one species, which fits this model equally well.

2.1.1.2 Gene Homologues

Each gene in the database is stored with some pertinent information, such as its sequence (or a GenBank link to it) and its containing chromosome. Each chromosome is in turn stored in the database, with its most important attribute being the organism it belongs to. As described above, each organism has a separate entry in the database, primarily so that genes and gene products can be traced back to their originating creature.

One advantage of this approach is the ability to compare different species at the pathway level. Since genes are the starting point of each pathway, pathways that are shared between organisms are ultimately stored as distinct pathways (because, once again, they are distinct at the genetic level), and thus they can be compared. Genes that are considered the same across species are identified as homologue pairs, and this is the primary relationship on which the comparisons can be made.
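The gene, chromosome, and organism rules described above can be illustrated with a minimal Python sketch. It is not the PathCase API: the class names (`Organism`, `Chromosome`, `Gene`, `HomologueIndex`), the placeholder chromosome labels, and the rule that a homologue pair must span two organisms are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Organism:
    name: str


@dataclass(frozen=True)
class Chromosome:
    name: str
    organism: Organism  # every chromosome belongs to exactly one organism


@dataclass(frozen=True)
class Gene:
    symbol: str
    chromosome: Chromosome  # every gene lies on exactly one chromosome

    @property
    def organism(self) -> Organism:
        # A gene is traced back to its species through its chromosome.
        return self.chromosome.organism


class HomologueIndex:
    """Hypothetical store of homologue pairs (unordered pairs of genes)."""

    def __init__(self):
        self._pairs = set()

    def add_pair(self, a: Gene, b: Gene) -> None:
        # Assumption for this sketch: pairs always link two different organisms.
        if a.organism == b.organism:
            raise ValueError("homologue pair must span two organisms")
        self._pairs.add(frozenset((a, b)))

    def homologues_of(self, gene: Gene) -> list:
        # Collect the partner gene from every pair that contains `gene`.
        return [g for pair in self._pairs if gene in pair for g in pair if g != gene]


# Placeholder chromosome labels; real chromosome assignments are not given here.
human = Organism("Homo sapiens")
mouse = Organism("Mus musculus")
human_abl2 = Gene("Abl2", Chromosome("human-chromosome-placeholder", human))
mouse_abl2 = Gene("Abl2", Chromosome("mouse-chromosome-placeholder", mouse))

index = HomologueIndex()
index.add_pair(human_abl2, mouse_abl2)
```

Although both genes carry the symbol Abl2, they compare as different objects because their chromosomes (and therefore their organisms) differ, which mirrors the one-copy-per-species policy.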
In fact, pathways can be computationally derived for new species once their genomes are sequenced. Consider a well-known pathway that exists in humans, where each step in the pathway is ultimately controlled by a single gene. Now imagine another species whose genome has recently been sequenced, with full computational gene annotation. Gene homologues are often identified at this step, usually as a result of extraordinarily high sequence similarity. For each gene in the newly sequenced genome, if a human homologue exists, we can create a step in an analogous pathway for the new species. Of course, all computationally derived information (from gene homologues to complete pathways) must be carefully examined by appropriate experts for verification, but the time saved and the potential for discovery that come with such a procedure are considerable.

2.1.1.3 Gene Products

Beyond the gene homologues and the chromosomal (and, through it, organismal) information, there is the gene product data. Since the genes themselves are only substrings of chromosomes, they are not physically involved in metabolic pathway steps. Instead, the products of the genes are the true physical players in the cell. The central dogma of molecular biology describes two basic processes from which the gene products are derived: transcription and translation. Both RNA molecules and proteins ultimately originate from gene sequences, and thus are often known as gene products, with different genes encoding different RNA molecules and proteins. Each of the gene products is described in detail below as a separate class of molecular entity.

2.1.2 RNA Molecules and Transcription

The result of transcription is the manifestation of a gene apart from its containing chromosome. Instead of existing as a substring of DNA, a copy of the gene is made as an independent RNA molecule.
Most often this molecule exists for only a short time as an mRNA molecule, to be translated into a protein, but both tRNAs and rRNAs play functional roles as nucleotide complexes themselves. In fact, the discovery of catalytic RNAs has shown that RNAs play a much larger role in basic life functions than previously thought. RNA molecules are thus stored in the Pathways Database as a class of molecular entity. Associated with each RNA molecule is commonly useful information, such as the specific type of molecule (mRNA, tRNA, rRNA), the sequence, external database links, and more. Since RNA molecules come directly from genes through transcription, a gene product link is created between each RNA molecule and its originating gene (if known).

2.1.3 Proteins, Amino Acids, and Translation

If genes are the blueprints of life, proteins are the building blocks. Apart from catalytic RNAs, proteins govern the ultimate functionality of nearly all aspects of life. Proteins are obtained through the other step in the central dogma of molecular biology: translation. Essentially, proteins are strings of molecules attached end to end, much like DNA and RNA strands. The primary difference, of course, is that proteins are built of amino acids, not nucleotides (thus the term polypeptide chains). Since amino acids are primarily limited to this role in an organism, and possess characteristics that are essential to understanding the final protein (such as water attraction due to electrostatic forces), they are stored as their own class of molecular entity in the database, with their own set of unique attributes.

While a gene sequence maps nearly directly onto its RNA transcript, translation is a bit more complicated. Every three RNA nucleotides form a codon, which specifies the amino acid to be added to the protein chain. Proteins therefore fall into yet another class of molecular entity, and are often stored with their own sequence information (strings over an alphabet of 20 amino acids).
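The codon rule just described can be sketched in a few lines of Python. This is only an illustration: the function name `translate` is hypothetical, the table lists a small subset of the 64 standard codons, and reading is assumed to start at the first nucleotide rather than at a located start codon.

```python
# Partial codon table (a small subset of the standard genetic code).
CODON_TABLE = {
    "AUG": "M",                                       # methionine
    "UUU": "F", "UUC": "F",                           # phenylalanine
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",   # glycine
    "AAA": "K", "AAG": "K",                           # lysine
    "UGG": "W",                                       # tryptophan
    "UAA": "*", "UAG": "*", "UGA": "*",               # stop codons
}


def translate(mrna: str) -> str:
    """Translate an mRNA string into one-letter amino acid codes.

    Reads consecutive three-nucleotide codons from position 0, stops at the
    first stop codon, and ignores any trailing partial codon.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon not in CODON_TABLE:
            raise KeyError(f"codon {codon!r} is not in this partial table")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGUUUGGCAAAUGGUAA"))  # prints MFGKW
```

The resulting string is the kind of sequence information a protein entry could carry, one letter per amino acid.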
The folding of proteins into their shapes is an active area of research, since it is these shapes that define the protein's role in an organism. Even though protein sequences are derived from mRNA sequences, the original genes and the proteins are still related, with the protein being a gene product.

Most often, proteins serve as catalysts for reactions between other molecular entities by allowing the reactants to come together in a physical conformation determined by the protein's shape. These proteins (and indeed, most proteins studied are of this type) are called catalytic enzymes, and have been cataloged by the Enzyme Commission[ref] using a hierarchical numbering system (the EC number). Each well-known catalytic protein has its associated EC numbers stored in the database as well. Multiple EC numbers are possible for a single protein, as its particular shape may allow for multiple enzymatic reactions (i.e. different reactants).

2.1.4 Basic Molecules

The remaining class of molecular entities is the basic molecule. Basic molecules are just that: basic. That is not to say that they are not complex, but that they can generally be found both inside and outside the cell (and indeed outside the body in some cases). Probably the most common basic molecule, involved in most pathways, is water. Other examples of common basic molecules are magnesium (Mg++), ATP, ADP, and oxygen. Stored with this class are some helpful attributes, such as the molecular formula and, if possible, an image.

2.1.5 Name Pool

Research in biochemistry goes on constantly in all parts of the world, and very often multiple groups in different parts of a country, or in different countries, work on the same topic. With each person, lab, and publication, many different names or terms are assigned to the same molecule.
Thus, choosing a single name for any molecular entity is highly restrictive: a researcher looking for a particular compound under a name not selected in the system would not find it. To prevent this problem, the notion of a name pool was introduced. The Pathways Database stores a large repository of names for all classes of molecular entities, and each entity can have as many names from this pool as is appropriate.

An immediate bonus can be seen when considering the naming scheme for many genes and their protein gene products. In many cases, the names of the genes and the proteins are identical. In fact, when symbols are used instead of full names, the distinction between a gene and a protein in many systems is made only through capitalization of the first letter! This is hardly a reliable method for entering and identifying data, as human error is certain to take its toll: a letter will be capitalized when it shouldn't be, and vice versa. When names are instead pulled from a large pool, and the class of the molecule is identified by the class stored internally in the database, shared names are no longer a problem at all.

From the computer science and performance point of view, this allows for much speedier text searches. Too often in web-based and older systems, searches must be focused on a single class, requiring a separate index for each group and multiple searches. By compiling all the names into one pool, a single fast index can be built over all molecular entities. Duplicate strings are thus also eliminated from the system. This method also allows for more useful approximate matching, with full regular expression support for text searches planned for the future.

2.2 Processes

The process is the database annotation of an interaction between molecular entities. A process is also sometimes seen as a single step (or reaction) in a metabolic pathway.
Most of the chemical reactions included in metabolic pathways are catalyzed by some particular enzyme (as described above). During these reactions, the enzyme allows molecules that would otherwise not interact to be brought into contact with each other in such a way that a reaction occurs. Once the reaction completes, the enzyme usually releases the resulting molecules and is itself entirely unchanged by the process. There are also cases where no catalyzing enzyme (or catalyzing RNA) is present, and the reactants engage in a process entirely on their own. In either sort of process, there is a series of roles that different molecular entities can play. For each process in the Pathways Database, all of the known reactants are stored (each with a quantity attribute, to allow for correct stoichiometry), as well as some other process-specific attributes (such as whether or not the process is reversible, its cellular location, etc.).

2.2.1 Entity Roles in Processes

As mentioned, each molecular entity involved in a process falls into some category of role-players for that process. The major categories of roles are those entities that are consumed (or destroyed), those that are created (or produced), and those that are not modified.

One of the non-modified roles, already described, is the catalyzing protein. Without this component, all of the other reactants can be present in large quantities and the process would still take place only very rarely. Once the process is complete, the catalyzing enzyme releases the participating molecules and moves on, completely unmodified (hence the term catalyzing enzyme).

The next two most frequently seen and discussed roles are the substrates and the products. These are the primary reactants in the process, with the substrates being consumed and the products being produced.
If a molecular entity is described as a substrate in a process, it will be consumed during the course of the reaction, and will not be present (at least not in the same quantity) after the process has completed. In the same light, a product is not present at the start of the process, but will be at the end. It is through the substrates and products that most pathways are defined (more on this below).

Two other roles, mathematically the same as substrates and products, are the cofactors. A cofactor can be either in or out (cofactor in, or cofactor out), and cofactors are likewise consumed or produced during the course of a process. Cofactors fall into their own special category, though, because they are common to many processes. Quite often cofactors are highly abundant molecules (water, ATP, etc.) and are used in exactly the same way in many different reactions. As an example, consider any phosphorylation reaction. Almost always, such a reaction involves a substrate and a product that are very similar, in that the product is basically just the substrate with a phosphate group added (which is usually enough to make the entity active or inactive, or to induce some conformational change). The cofactors for such a reaction will almost always be the pair ATP and ADP (ATP as the cofactor in, and ADP as the cofactor out).

Besides the catalyzing protein, other molecular entities can be involved in the reaction without being modified. They are often called regulators when it is not entirely clear how they affect the process. If the reaction depends on a regulator, or if the regulator speeds up the rate of the reaction, the regulator is usually called an activator. Inhibitors, on the other hand, are regulators that play a negative role in the completion (or initiation) of the process.

Keep in mind that some processes are reversible, and are thus usually governed by concentration equilibria.
Since the roles of a process must be stored in at least one direction, the choice of direction is somewhat arbitrary for reversible reactions. When a process is reversed, the entire set of substrates becomes the set of products, and vice versa. The same dichotomy exists for cofactors (in becomes out, out becomes in), as they are often functionally the same as products and substrates. For regulators, it is not always clear whether an activator will become an inhibitor and vice versa; since this is not fully understood biochemically, it cannot be modeled yet.

2.3 Pathways

The pathway is finally where we can see the processes in action. Described simply, a pathway is nothing more than a collection of processes. A pathway assembled arbitrarily would be rather meaningless, however, so processes are grouped to describe a particular aspect of life function. The cascading property of pathways becomes readily apparent when one looks at a first pathway. Usually the pathway has a starting point (if it is not shown as a cycle), with some initial substrate(s). As the pathway progresses from process to process, those substrates are consumed and products are produced, which are often used as substrates in processes downstream.

It is at this level that the need to differentiate between substrates/products and cofactors becomes apparent. If these roles were treated as identical in importance, pathways would be too complicated to view and understand (since so many processes would be connected to so many others). Logically, a pathway becomes a graph of processes, with an edge between two processes if a substrate/product relationship exists between them (i.e. a product of one process is used as a substrate of another). A disconnected graph should be a flag for a possible error, or a hint that the curator should consider making each component a separate pathway.
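The graph construction described above can be sketched as follows. The names `Process`, `build_edges`, and `connected_components` are hypothetical, as are the generic molecule labels; the sketch only shows edges derived from product-to-substrate matches (cofactors are deliberately ignored) and a connectivity check that a curation tool might use to flag a disconnected pathway.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    name: str
    substrates: frozenset
    products: frozenset
    cofactors_in: frozenset = frozenset()
    cofactors_out: frozenset = frozenset()


def build_edges(processes):
    """Directed edges (producer, consumer): a product of one process is a substrate of another."""
    edges = set()
    for p in processes:
        for q in processes:
            # Cofactors are ignored on purpose; only substrates/products link processes.
            if p is not q and p.products & q.substrates:
                edges.add((p.name, q.name))
    return edges


def connected_components(processes, edges):
    """Group process names into components, treating edges as undirected."""
    adjacency = {p.name: set() for p in processes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


pathway = [
    Process("p1", frozenset({"A"}), frozenset({"B"}), frozenset({"ATP"}), frozenset({"ADP"})),
    Process("p2", frozenset({"B"}), frozenset({"C"})),
    Process("p3", frozenset({"D"}), frozenset({"E"}), frozenset({"ADP"}), frozenset({"ATP"})),
]
edges = build_edges(pathway)
components = connected_components(pathway, edges)
if len(components) > 1:
    print("pathway graph is disconnected:", sorted(sorted(c) for c in components))
```

In this example p1 and p3 exchange ATP and ADP only as cofactors, so no edge joins them and p3 ends up in its own component. Had cofactors been treated like substrates and products, the two would have been linked, which is the clutter the text warns about.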
From the database perspective, a pathway is stored very easily, with a few attributes (name, type, references, etc.) and a collection of the processes that make it up. Once the process collection for a pathway is available, the graph can be created dynamically by finding the appropriate edges between processes.

Since a pathway is simply a collection of processes, one can question the creation of pathways at all, arguing instead that it would be useful to create just one super-pathway of all processes collected for an organism. The problem is that the complexity of such a graph becomes too high for people to work with. Separate pathways, however, raise another question: how can one understand the connections that exist between the pathways themselves? The solution lies in the tools. One useful visualization technique is to draw links from processes to special nodes that represent entire pathways, which can then be explored at the user's leisure. Some of these links may be useful and a topic of further study; others may not be. For those that are useful, the database can store information about selected pathway links. How to resolve some of the complexity of pathway links thus remains primarily a problem for the user-interface side of a system, as the data is stored essentially at the process level. Just remember that pathways have no biological meaning per se; they are human-defined collections of processes.

2.4 External Database

With the popularity of the Internet and the increasing speed and technology present in laboratories worldwide, we are now beginning to deal with the problem of information overload in the life sciences. There is too much data to integrate completely in a single system, and thus we rely on other authoritative sources that specialize in certain areas (e.g. GenBank for nucleotide sequence data). These sources are considered external databases, and each entry in these sources should have some identifier.
Each item in the Pathways Database can be linked to an appropriate entry in an external database by a local-id-to-external-id relationship. This relationship associates a data item in the Pathways Database (a molecular entity, process, pathway link, chromosome, etc.) with an external database and the appropriate id in that external database.

3. PathCase Schema

We represent the data classes and the relationships between biological entities in an entity-relationship model (see Figure 2).

Figure 2. PathCase ER diagram.

Mapping the ER model into tables and foreign-key constraints yields the PathCase data model illustrated in detail in Figure 3.

Figure 3. PathCase database diagram.
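As a rough illustration of mapping one relationship into tables and foreign-key constraints, the sketch below models the local-id-to-external-id relationship of Section 2.4 in an in-memory SQLite database. It is not the actual PathCase schema: the table names, column names, and the placeholder accession string are all assumptions, and SQLite merely stands in for the MS SQL Server database used by PathCase.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a hypothetical two-table schema for external database links."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
    conn.executescript("""
        CREATE TABLE external_database (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE external_link (
            local_id    TEXT    NOT NULL,  -- id of a molecular entity, process, etc.
            database_id INTEGER NOT NULL REFERENCES external_database(id),
            external_id TEXT    NOT NULL,  -- id of the entry in the external source
            PRIMARY KEY (local_id, database_id, external_id)
        );
    """)
    return conn


def add_link(conn, local_id, database_id, external_id):
    conn.execute(
        "INSERT INTO external_link (local_id, database_id, external_id) VALUES (?, ?, ?)",
        (local_id, database_id, external_id),
    )


def links_for(conn, local_id):
    """Return (external database name, external id) pairs for one local item."""
    rows = conn.execute(
        "SELECT d.name, l.external_id FROM external_link AS l "
        "JOIN external_database AS d ON d.id = l.database_id "
        "WHERE l.local_id = ? ORDER BY d.name, l.external_id",
        (local_id,),
    )
    return rows.fetchall()


conn = build_demo_db()
conn.execute("INSERT INTO external_database (id, name) VALUES (1, 'GenBank')")
add_link(conn, "gene-1", 1, "PLACEHOLDER-ACCESSION")  # placeholder, not a real accession
```

With the foreign key in place, a link that names an unknown external database is rejected by the database itself rather than by application code.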



