As computational models of complete pathways, cellular metabolism of specific tissues, and organs are developed and interconnected, new enhanced functionalities and database‐enabled software tools are required to
(i) link the ever-expanding body of molecular information to an understanding of how intact organisms function via multiscale mechanistic models of the system; and
(ii) facilitate interactive model development and dynamic analysis of responses in an effective and efficient manner.
We propose to build a new set of integrated database‐enabled tools for regulatory metabolic networks, called PathCaseSB, with interfaces for model‐based querying, visualization, simulation, and model building. The aim of PathCaseSB is to provide a database-enabled framework and tools for effective and efficient development of multiscale mechanistic systems biology models of biological systems.
The PathCaseSB system employs a client-server architecture (see Figure 1). On the server side, PathCaseSB data is kept in a relational database management system, MS SQL Server 2008. In addition, the PathCaseSB development team built an object-oriented data-access interface between the relational database and the application layer, in order to provide flexible and intuitive access to data and to prevent major changes in the application when the schema changes during the evolution of the PathCaseSB system. The object model behind this system is a persistent object model, which hides the details of the actual database implementation and database access. For instance, a developer can create an object that is stored in the database, or search for a specific object, by using the PathCaseSB API without writing any SQL query.
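The idea of a persistent object model can be illustrated with a minimal sketch. The actual PathCaseSB data-access layer is not shown in this document; the class and method names below (ObjectStore, save, find_by_name) are hypothetical, and an in-memory SQLite database stands in for MS SQL Server so that the example runs anywhere.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Species:
    """A plain object; it carries no knowledge of the database."""
    id: int
    name: str


class ObjectStore:
    """Hypothetical data-access layer: callers never write SQL themselves."""

    def __init__(self) -> None:
        # In-memory SQLite is only a stand-in for the real server.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT)"
        )

    def save(self, obj: Species) -> None:
        # The SQL lives here, hidden from application code.
        self.conn.execute(
            "INSERT OR REPLACE INTO species (id, name) VALUES (?, ?)",
            (obj.id, obj.name),
        )

    def find_by_name(self, name: str) -> list:
        rows = self.conn.execute(
            "SELECT id, name FROM species WHERE name = ?", (name,)
        ).fetchall()
        return [Species(*row) for row in rows]


store = ObjectStore()
store.save(Species(1, "glucose"))
found = store.find_by_name("glucose")
```

If the schema changes, only the ObjectStore methods need to be revised; code that calls save and find_by_name is unaffected, which is the benefit the paragraph above describes.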
In designing the PathCaseSB database, we were confronted with two major issues. The first was whether or not we should merge common information among systems biology data sources (e.g., BioModels) and biochemical reaction network data sources (e.g., KEGG). In our earlier database designs, we did merge such information, and were soon faced with many data integration and data repopulation issues. Our current approach is to keep separate tables for the common entities of different systems biology and/or biochemical network data sources, e.g., the BioModels [BioM, BioM06] and KEGG [Kegg] data sources. As an example, we maintain a species table for species occurring in BioModels models and a molecular_entities table for KEGG molecular entities. This approach allows us to separate data from different data sources cleanly, and to add new data sources seamlessly, without worrying about future data cleansing and data integration problems. It does, however, require an additional “mapping” effort: we create mappings between all pairs of data sources on their corresponding entities. For BioModels and KEGG, there are three such mappings: <species, molecular entities>, <reactions, process-entities>, and <models, pathways>.
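A minimal sketch of the separate-tables-plus-mapping design follows. The species and molecular_entities table names come from the text; the mapping table name (species_entity_map), the column names, and the sample rows are assumptions made only for illustration, and SQLite is used in place of the production database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One table per data source for the "same" kind of entity.
    CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE molecular_entities (id INTEGER PRIMARY KEY, name TEXT);
    -- Hypothetical mapping table for the <species, molecular entities> pair.
    CREATE TABLE species_entity_map (
        species_id INTEGER REFERENCES species(id),
        entity_id  INTEGER REFERENCES molecular_entities(id),
        PRIMARY KEY (species_id, entity_id)
    );
    """
)
# Illustrative rows only; not taken from BioModels or KEGG.
conn.execute("INSERT INTO species VALUES (1, 'Glc')")
conn.execute("INSERT INTO molecular_entities VALUES (10, 'D-Glucose')")
conn.execute("INSERT INTO species_entity_map VALUES (1, 10)")


def entities_for_species(conn, species_id):
    """Follow the mapping from a model species to reaction-network entities."""
    rows = conn.execute(
        "SELECT m.name FROM species_entity_map AS x "
        "JOIN molecular_entities AS m ON m.id = x.entity_id "
        "WHERE x.species_id = ?",
        (species_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

Because each source keeps its own tables, one source can be repopulated by reloading its tables and its mappings, without touching rows that belong to the other source.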
The second issue was whether or not we should merge the common information in different systems biology data sources into a single set of tables. While we do not yet have much experience in this issue, again for ease in repopulating data from a given data source, at this point in time, we have chosen to also keep separate tables for different systems biology data sources. That is, currently, we generate distinct sets of tables for, say, BioModels and CellML [CellML] data sources. Two reasons for this decision are: (i) our brief experience with BioModels and CellML data sources is that integrating data from these two data sources sometimes translates into further data curation, which we think is not proper for us to carry out; (ii) we have designed user interfaces that naturally separate data from systems biology data sources (while tightly coupling them with biochemical reaction network data sources, e.g., KEGG).
At present, the PathCaseSB database has data from the BioModels data source (as the systems biology data source) and from the KEGG data source (as the metabolic pathways data source).
Currently, PathCase Systems Biology database contains four classes of tables capturing the following information:
The PathCaseMAW architecture has five distinct subsystems.
Database: This subsystem contains the actual pathway information, which is compiled from a biochemistry textbook (Devlin 2006) and an atlas of human metabolism (Salway 1999). Unlike many web-based systems, it captures the location information of the pathways, which avoids incorrect interconnection of pathways. In addition, transport processes that carry a metabolite from one compartment to another are also modeled and captured in the database. The database server used is Microsoft SQL Server 2005.
The Data Object Library: This subsystem provides a programmer-friendly interface to the system content stored in the databases that allows for data to be accessed, created, and updated. This library is written in Microsoft C# 2005.
Metabolomics Analysis Library: This subsystem, which is the subject of most of this thesis, contains libraries to perform Automated Prediction and also includes tools for Data Acquisition and standalone graph visualization. This library is written in Microsoft C# 2005.
The Web Server: This subsystem includes the PathCaseMAW web site and the web services, both of which are written in C# and ASP.NET, and generate standard HTML pages and XML data. This allows the site to be accessed by users from a standard web browser on any operating system.
Client Side Components: This subsystem includes the basic HTML that renders the main site interface to the user, the JavaScript with AJAX that makes the site highly responsive to the user, and, finally, the graph viewer Java applet used for interactive pathway graph and Closure Tree visualization. Java is chosen for its powerful and highly responsive dynamic graph drawing. The graph viewer applet uses the web service subsystem to request additional data as needed and to enhance the graph visualization without requiring the user to reload the web page. All graph manipulations, such as zooming in and out, panning, and application of different layouts, are carried out on the client side with no server-side requests, which makes the viewer highly scalable.
We adopt an object-oriented data model to represent mammalian metabolism. Objects are structured data types which contain basic types (e.g., string, int, etc.) or other structured data types (i.e., objects) as their fields. We employ the metabolic principles summarized in Section 2 as the main motivation for, and as a guide in, our modeling effort. Figure 2 shows the object definitions and their fields for the essential constituents of the metabolism, where id fields are omitted for brevity. In our data model, metabolism, at the highest level, consists of a set of pathways. Each pathway contains a collection of reactions, a set of substrates, a set of products, and a set of cofactors. To satisfy principle 1, we explicitly specify pathway inputs, which prepares the infrastructure required for implementing the corresponding query processing rule in the query processing stage. To satisfy principles 5 and 6, pathways may optionally record their committed and rate-limiting steps (if known). Input and output molecules of a pathway are associated with a particular pool of a metabolite through MetabolitePoolLink objects, each of which is associated with a particular MetabolitePool and has a rate field specifying the contribution (or consumption) rate to the overall pool through that link.
A pool of a metabolite has a location field, and optional size and parent fields, where the parent field makes hierarchical organization of pools possible for a particular metabolite. Such an organization allows for the creation of conceptual groupings of metabolite pools. To satisfy principle 7.a and the corresponding query processing rule, we have a MetabolitePool object for each metabolite. For principle 7.b, we have a size field for each MetabolitePool object. For principle 8, we capture the compartment information for each pool, and allow a metabolite to have more than one pool. For principles 9 and 17, a rate field in each MetabolitePoolLink object is introduced. For principle 10, a size field and a reference to its owner metabolite is created in each MetabolitePool. For principle 11, an optional parent field is created for a hierarchical organization. For principle 18 and its corresponding query processing rule, we include triggerCondition field in each MetabolitePoolLink.
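The pool-related objects described above can be sketched as follows. Field names follow the text (location, size, parent, rate, triggerCondition); the Python types, the snake_case spellings, the helper pool_ancestors, and the sample values are assumptions, since Figure 2 is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetabolitePool:
    metabolite: str                           # owner metabolite (principle 10)
    location: str                             # compartment of the pool (principle 8)
    size: Optional[float] = None              # optional pool size (principle 7.b)
    parent: Optional["MetabolitePool"] = None  # optional hierarchy (principle 11)


@dataclass
class MetabolitePoolLink:
    pool: MetabolitePool
    rate: float                               # contribution/consumption rate
    trigger_condition: Optional[str] = None   # principle 18


@dataclass
class Pathway:
    name: str
    substrates: List[MetabolitePoolLink] = field(default_factory=list)
    products: List[MetabolitePoolLink] = field(default_factory=list)


def pool_ancestors(pool: MetabolitePool) -> List[MetabolitePool]:
    """Walk the parent links to list the conceptual groupings above a pool."""
    chain = []
    while pool.parent is not None:
        pool = pool.parent
        chain.append(pool)
    return chain


# Illustrative values only; units and numbers are placeholders.
body_glucose = MetabolitePool("glucose", "body")
blood_glucose = MetabolitePool("glucose", "blood", size=5.0, parent=body_glucose)
glycolysis = Pathway(
    "glycolysis", substrates=[MetabolitePoolLink(blood_glucose, rate=0.4)]
)
```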
A Metabolite object has name and type fields, where type may be “basic molecule”, “hormone”, “protein”, etc. Moreover, each metabolite stores its default pool for each biological compartment. This information is required during query processing to associate a metabolite referenced in a query with one of its specific pools, if it has multiple pools in a biological compartment. Hormones have a single pool in blood, but they influence a large number of tissues through cascading signaling steps.
EnergyCurrencyMetabolite directly extends from Metabolite, and represents those metabolites which are considered to be energy carriers in a cell. An additional peer field links an energy metabolite to its reduced (or oxidized) peer, and the chargeStatus field describes whether a given metabolite is a highEnergy or a lowEnergy metabolite. For principle 12, overall sizes of EnergyCurrencyMetabolite pools and their peers can be used to reason about the energy state of a cell.
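A sketch of the peer and chargeStatus fields follows. The text says only that overall pool sizes "can be used to reason about the energy state of a cell"; the specific measure computed below (the fraction of a peer pair held in the high-energy form) is an assumption chosen for illustration, as are the pool sizes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Metabolite:
    name: str
    type: str = "basic molecule"


@dataclass
class EnergyCurrencyMetabolite(Metabolite):
    charge_status: str = "highEnergy"  # "highEnergy" or "lowEnergy"
    # Excluded from repr/compare because the two peers reference each other.
    peer: Optional["EnergyCurrencyMetabolite"] = field(
        default=None, repr=False, compare=False
    )


def high_energy_fraction(
    pool_sizes: Dict[str, float], m: EnergyCurrencyMetabolite
) -> float:
    """Share of the (metabolite, peer) pair that is in the high-energy form."""
    if m.charge_status == "highEnergy":
        high, low = m, m.peer
    else:
        high, low = m.peer, m
    total = pool_sizes[high.name] + pool_sizes[low.name]
    return pool_sizes[high.name] / total


atp = EnergyCurrencyMetabolite("ATP", charge_status="highEnergy")
adp = EnergyCurrencyMetabolite("ADP", charge_status="lowEnergy", peer=atp)
atp.peer = adp
```

With placeholder pool sizes {"ATP": 3.0, "ADP": 1.0}, the function gives the same fraction whether it is called with atp or adp, because the peer link is followed in either direction.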
Each reaction has a collection of substrates, products, cofactors, enzymes, and regulators, each of which (except enzymes) is a metabolite pool instance. For principle 4, we can obtain the reactions of a pathway and decide which ones are regulated (through the regulator field). Since some biochemical reactions are reversible but usually work in one direction within a particular pathway, the direction information is also stored for each pathway in which a reaction participates. Enzymes can reside in multiple compartments (e.g., isozymes). Hence, each reaction is associated with a specific instance of a reaction in a particular compartment. The location information for a reaction is implied by the location of its enzymes. For principle 13, we associate each enzyme with a particular compartment. Compartments have a name, and optional size and parent fields. As with MetabolitePools, the parent field allows for the definition of a compartment hierarchy (e.g., organ -> tissue -> organelle -> inner membrane). In our data model, an organelle in tissue A is different from the same type of organelle in tissue B (e.g., mitochondrion in liver vs. mitochondrion in red blood cells), since the same type of organelle in different tissues may have different metabolic functions and/or enzyme coverage. For principle 16, we have a parent field in the Compartment object for a possible hierarchical organization. In addition, each compartment has a set of transportProcesses that carry metabolites in and out of the compartment. A transport process is modeled as an instance of a regular reaction, where compartment refers to the one that a particular transport process belongs to, and substrates and products refer to different pools of the same metabolite. For principle 14, we can determine the input and output metabolites of a compartment from the substrates and products of its transport processes. For principle 15, we can find out whether a transport process is complex or simple by checking whether or not it has regulators.
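The compartment hierarchy and the two transport-process checks (principles 14 and 15) can be sketched as below. The field names parent and transportProcesses follow the text; the source/target fields, the helper functions, and the sample compartments and transport are hypothetical simplifications (in the data model itself a transport process is a regular reaction over metabolite pools).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison: liver mitochondrion != RBC mitochondrion
class Compartment:
    name: str
    parent: Optional["Compartment"] = None
    transport_processes: List["TransportProcess"] = field(default_factory=list)

    def path(self) -> str:
        """Render the hierarchy from the outermost compartment inward."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return " -> ".join(reversed(names))


@dataclass(eq=False)
class TransportProcess:
    metabolite: str
    source: Compartment            # compartment of the substrate pool
    target: Compartment            # compartment of the product pool
    regulators: List[str] = field(default_factory=list)

    def is_complex(self) -> bool:
        # Principle 15: a transport process with regulators is complex.
        return bool(self.regulators)


def compartment_inputs(c: Compartment) -> List[str]:
    # Principle 14: inputs are metabolites carried into the compartment.
    return sorted(t.metabolite for t in c.transport_processes if t.target is c)


def compartment_outputs(c: Compartment) -> List[str]:
    return sorted(t.metabolite for t in c.transport_processes if t.source is c)


# Illustrative compartments and transport only.
liver = Compartment("liver")
cytosol = Compartment("cytosol", parent=liver)
mito = Compartment("mitochondrion", parent=liver)
pyruvate_import = TransportProcess("pyruvate", source=cytosol, target=mito)
mito.transport_processes.append(pyruvate_import)
```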
A regulator involves a specific pool of a metabolite and an optional precedence value, which is required when multiple regulators with conflicting effects act simultaneously on the rate of the same reaction (the one with the highest precedence value determines the final effect on the rate of the reaction). In addition, a regulator may optionally be defined based on a ratio of metabolite pools (e.g., glucagon/insulin in Figure 1). A regulator also has a triggerCondition field that differentiates between reactions requiring accumulation of metabolites and those for which availability is sufficient (principles 17 & 18). Finally, regulators have a type field that captures the mechanism of regulation (i.e., allosteric, covalent, expression control) as described by principles 2.a-2.d. We model all enzyme regulation mechanisms through Regulator objects. For principle 3, we have an optional precedence field for each Regulator object.
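The precedence rule can be sketched in a few lines. The precedence field is from the text; the effect field (activate or inhibit), the function name, and the sample regulators are assumptions for illustration, and the text does not say how ties in precedence are broken (here the first regulator listed wins).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Regulator:
    pool: str            # the metabolite pool acting as regulator
    effect: str          # assumed field: "activate" or "inhibit"
    precedence: int = 0  # optional in the model; 0 used as a default here


def net_effect(regulators: List[Regulator]) -> Optional[str]:
    """Return the effect of the regulator with the highest precedence."""
    if not regulators:
        return None
    return max(regulators, key=lambda r: r.precedence).effect


# Placeholder regulators with conflicting effects on the same reaction.
conflicting = [
    Regulator("regulator A", "activate", precedence=1),
    Regulator("regulator B", "inhibit", precedence=2),
]
```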
A DietaryState object represents a dietary state (e.g., fasting) as a set of metabolite concentration changes that collectively form the “signature” of that dietary state.
A ConcentrationChange refers to a specific pool of a particular metabolite, and the direction of its concentration change (i.e., increase or decrease). As an example, fasting state can be represented by the following signature involving concentration change objects: {insulin↓, glucagon↑, glucose↓, fatty acids↑, ketone bodies↑}. A PhysiologicalCondition stands for a condition or a disease (e.g., diabetes), and it directly extends from the DietaryState object, as we reuse the same representation model. In addition, PhysiologicalCondition allows for the specification of a set of changes on the shares of different reactions in metabolite pools. By allowing rate changes, we allow representation of physiological conditions where the rate of a metabolic process can increase/decrease significantly, leading to changes in its contribution or consumption rates for a particular metabolic pool. The modified behavior may affect the accumulation or availability of different metabolites.
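The fasting signature given above can be written down directly as objects. The class names follow the text; the matching rule (a state matches when every change in its signature is observed) is an assumption for illustration, not a rule stated in the model.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class ConcentrationChange:
    pool: str        # a specific pool of a particular metabolite
    direction: str   # "increase" or "decrease"


@dataclass(frozen=True)
class DietaryState:
    name: str
    signature: FrozenSet[ConcentrationChange]


def matches(state: DietaryState, observed: Iterable[ConcentrationChange]) -> bool:
    """Assumed rule: all signature changes must appear among the observations."""
    return state.signature <= set(observed)


fasting = DietaryState(
    "fasting",
    frozenset(
        {
            ConcentrationChange("insulin", "decrease"),
            ConcentrationChange("glucagon", "increase"),
            ConcentrationChange("glucose", "decrease"),
            ConcentrationChange("fatty acids", "increase"),
            ConcentrationChange("ketone bodies", "increase"),
        }
    ),
)
```

A PhysiologicalCondition could extend DietaryState with an extra collection of rate changes, mirroring the inheritance described in the text.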
Graph Representation Model
In our graph representation model, compartments (e.g., liver in Figure 1) are modeled as large “super-nodes” which contain subnetworks of the overall metabolism, as well as other compartments (e.g., mitochondrion in liver in Figure 1). In each subnetwork, nodes represent metabolite pools (e.g., Acetyl CoA in mitochondrion in Figure 1). Reactions are represented as hyper-edges, which connect multiple end-points (i.e., substrates and products) (e.g., the reaction that converts oxalacetate and Acetyl CoA to citrate in the TCA Cycle in Figure 1). Regulation is represented by an edge between a metabolite pool (i.e., a regulator) and a hyper-edge (i.e., a reaction) (e.g., NADH as inhibitor for two different reactions in the TCA cycle in Figure 1).
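A hyper-edge differs from an ordinary edge in that it joins several end-points at once. The sketch below encodes the citrate example from the text; the ReactionEdge class, its field names, and the helper function are hypothetical, and regulators are stored as plain references from a pool to the hyper-edge.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Pool = Tuple[str, str]  # (metabolite, compartment): a node of the graph


@dataclass
class ReactionEdge:
    """A hyper-edge: connects any number of substrate and product pools."""
    name: str
    substrates: List[Pool]
    products: List[Pool]
    regulators: List[Pool] = field(default_factory=list)  # pool -> hyper-edge

    def endpoints(self) -> List[Pool]:
        return self.substrates + self.products


MITO = "mitochondrion in liver"
citrate_formation = ReactionEdge(
    "citrate formation",
    substrates=[("oxalacetate", MITO), ("Acetyl CoA", MITO)],
    products=[("citrate", MITO)],
)


def reactions_touching(edges: List[ReactionEdge], pool: Pool) -> List[str]:
    """Names of all hyper-edges that have the given pool as an end-point."""
    return [e.name for e in edges if pool in e.endpoints()]
```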
Living organisms behave as complex systems that are flexible and adaptive to their surroundings. At the molecular level, organisms consist of intricate networks of molecular reactions, which are often called biochemical pathways. The PathCase system is being developed in order to maintain, visualize, and ultimately analyze the organism functions that result from biochemical pathways. The system contains a pathways database and the associated tools to store, compare, query, and visualize biochemical pathways through a web client. The aim is to develop an integrated database and the associated tools to support computational analysis and visualization of biochemical pathways. The ultimate goal of the system is to describe, utilize, and predict the system functions and behaviors of living organisms.
The PathCase system employs a client-server architecture (see Figure 1). On the server side, PathCase data is kept in a relational database management system, MS SQL Server 2005. In addition, the PathCase development team built an object-oriented data-access interface between the relational database and the application layer, in order to provide flexible and intuitive access to data and to prevent major changes in the application when the schema changes during the evolution of the PathCase system. The object model behind this system is a persistent object model, which hides the details of the actual database implementation and database access. For instance, a developer can create an object that is stored in the database, or search for a specific object, by using the PathCase API without writing any SQL query.
Pathways are the sequential and cumulative action of genetically distinct, but functionally related molecules. Each reaction in a pathway is a biochemical step from specific substrates (input molecules) to products (output molecules) that are chemically modified substrates. Each step may also use various combinations of molecules as cofactors, activators, inhibitors, and regulators, and usually involves at least one genetically unique gene product that catalyzes the reaction step. PathCase integrates genomic and biological information which can be managed, analyzed, queried and displayed in dynamic ways at various levels of biological and genetic detail to provide insight into diverse biological processes in health and disease.
The design of the underlying database in PathCase is the primary influence on how data is acquired, presented, edited, and entered. While different tools can provide slightly different approaches, the logical steps for handling the data must be the same, and ultimately this is reflected in an overall similarity among all tools that access and work with the database. For this reason, a keen understanding of the data model is fundamental to a positive experience with any of the tools available now or in the future for the Pathways Database System. This chapter describes the three primary players in the database model (molecular entities, processes, and pathways) and the various other data items that help complete the system.
All biochemical pathways can be simply described as molecular entity interaction throughout the organism. Basically, molecules that are floating around in (and out) of the cell sometimes get together, resulting often in changes to the molecules. These changed molecules then go off and interact with other molecules, continuing the procedure, and thus giving birth to cascading pathways. So, at the very foundation of the notion of a pathway is the single molecule, which is referred to here as a molecular entity.
Lumping all molecular entities together though into a single group is not an efficient storage method, as different classes of entities possess unique attributes. For example, RNAs are certainly molecular entities with a set of characteristics that is quite distinct from amino acids (another class of molecular entity). To solve this storage problem, the molecular entities are broken up into five classes: genes, RNAs, proteins, amino acids, and basic molecules. These classes and their unique properties are discussed in detail for the remainder of this section.
Genes are often considered the blueprint of life. They are stretches along the genome of an organism that are, have been, or will be engaged in some function of that organism's life. All of a creature's genetic information is stored in long strings (chromosomes) of connected nucleotides (adenine, cytosine, guanine, and thymine), the combination of which (if there is more than one) is usually called the genome of that creature. Genes, then, are just functional substrings along the longer chromosomal strings. So, to be strict, the chromosomes themselves are the true molecular entities that store genetic information, but it is the adopted convention in the Pathways Database that the genes are instead viewed as distinct molecular entities. It should be noted, though, that the true-life model is still represented in the database. Every gene in the Pathways Database lies on a unique chromosome, which itself is unique to a specific organism. This is a strictly enforced policy, preventing arbitrary genes and chromosomes which cannot be traced back to an originating species.
As mentioned above, each gene must be a part of a chromosome, which must exist in some organism. A subtle side-effect of this policy is that genes that are shared between species are not considered the same in the PathCase System. These genes may in fact have the same name across a group of species, but since each species has a different set of chromosomes, there must be as many copies of these genes as there are species the genes exist in. There is justification for this approach: the genes are not the same. While shared genes between species are usually identical in function, and highly similar in nucleotide sequence, there nevertheless is always some difference (usually at the sequence level). As far as data storage, these differences manifest themselves as separate identifiers to external data sources. For example, both humans and mice share the Abl2 gene, although each gene has its own GenBank[?] reference. Each nucleotide difference in the gene can lead to an amino acid change at the protein level (discussed below), which can change the configuration (three-dimensional shape) and thus the actual biochemistry of the final enzyme. Of course, there is also the case of some genes being entirely unique to one species, which fits perfectly into this model as well.
Each gene in the database is stored with some pertinent information, such as its sequence (or a GenBank link to it) and its containing chromosome. Each chromosome is in turn stored in the database, with the most important attribute being the organism it belongs to. As described above, each organism has a separate entry in the database, primarily so that genes and gene products can be traced back to their originating creature. One of the advantages of this approach is the ability to compare different species at the pathway level. Since genes are the starting point of each pathway, pathways that are shared between organisms are ultimately stored as being distinct (because, once again, they are distinct at the genetic level), and thus they can be compared.
Genes that are considered the same across species are identified as homologue pairs, which is the primary relationship on which the comparisons can be made. In fact, pathways can be computationally derived for new species as their genome sequencing is done. Consider a well-known pathway that exists in humans, where each step in the pathway is controlled ultimately by a single gene. Now imagine another species whose genome has recently been sequenced, with full computational gene annotation. Gene homologues are often identified at this step, usually as a result of extraordinarily high sequence similarity. For each gene in this newly sequenced genome, if a human homologue exists, we can create a step in an analogous pathway for the new species. Of course, all computationally derived information (from gene homologues to complete pathways) must be carefully examined by appropriate experts for verification, but the time-saving value and potential for discovery that comes with such a procedure is beyond measure.
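The homologue-based derivation described above amounts to a lookup per pathway step. The sketch below assumes a one-gene-per-step reference pathway, as in the text's example; the function name, the data layout, and the gene identifiers are all hypothetical placeholders, and the output would still need expert verification.

```python
from typing import Dict, List, Tuple


def derive_pathway(
    reference_steps: List[Tuple[str, str]],
    homologues: Dict[str, str],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Map each (step, reference gene) to the new species' homologue.

    Returns the derived steps and the names of steps with no known homologue.
    """
    derived, missing = [], []
    for step, gene in reference_steps:
        if gene in homologues:
            derived.append((step, homologues[gene]))
        else:
            missing.append(step)  # flag for curators rather than guess
    return derived, missing


# Placeholder identifiers, not real genes.
human_pathway = [("step 1", "hGENE_A"), ("step 2", "hGENE_B"), ("step 3", "hGENE_C")]
homologue_pairs = {"hGENE_A": "xGENE_A", "hGENE_C": "xGENE_C"}
derived_steps, unmatched_steps = derive_pathway(human_pathway, homologue_pairs)
```

Reporting the unmatched steps separately keeps gaps visible to the experts who review the computationally derived pathway.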
Beyond the gene homologues and the chromosomal (and subsequently the organismal) information lies the gene product data. Since the genes themselves are only substrings of chromosomes, they are not physically involved in metabolic pathway steps. Instead, the products of the genes are the true physical players in the cell. The central dogma of molecular biology describes two basic processes from which the gene products are derived: transcription and translation. Both RNA molecules and proteins ultimately originate from gene sequences, and thus are often known as gene products, with certain genes encoding different RNA molecules and proteins. Each of the gene products is described in detail below as a separate class of molecular entity.
The result of transcription is the manifestation of a gene apart from its containing chromosome. Instead of existing as a substring of DNA, a copy of the gene is made as an independent RNA molecule. Most often this molecule exists for only a short time as an mRNA molecule, to be translated to a protein, but both tRNAs and rRNAs play functional roles as nucleotide complexes themselves. In fact, the discovery of catalytic RNAs has shown that RNAs play a much larger role than previously thought in basic life functions. RNA molecules are thus stored in the Pathways Database as a class of molecular entity. Associated with each RNA molecule is commonly useful information, such as the specific type of molecule it is (mRNA, tRNA, rRNA), the sequence, external database links, and more. Since RNA molecules come directly from genes through transcription, a (gene product) link is created between each RNA molecule and its originating gene (if known).
If genes are the blueprints of life, proteins are the building blocks. Apart from catalytic RNAs, proteins govern the ultimate functionality of nearly all aspects of life. Proteins are obtained through the other step in the central dogma of molecular biology: translation. Essentially, proteins are strings of molecules attached end to end, much like DNA and RNA strands. The primary difference, of course, is that proteins are built of amino acids, not nucleotides (thus the term polypeptide chains). Since amino acids are primarily limited to this role in an organism, and possess characteristics that are essential in understanding the final protein (such as water attraction due to electrostatic forces), they are stored as their own class of molecular entity by the database, with their own set of unique attributes. While transcription copies the gene sequence nearly directly into RNA, translation is a bit more complicated. Every three RNA nucleotides form a codon, specifying which amino acid is added to the protein chain. So, proteins fall into yet another class of molecular entity, and are often stored with their own sequence information (strings from an alphabet of 20 amino acids). The folding of proteins into their shapes is an active area of research, since it is these shapes that define the protein's role in an organism. So, even though the protein sequences are derived from mRNA sequences, the original genes themselves and the proteins are related, with the protein being a gene product. Most often, the proteins serve as catalysts for reactions between other molecular entities by allowing the reactants to come together in a physical conformation determined by the protein's shape. These proteins (and indeed, most proteins studied are of this type) are called catalytic enzymes, and have been cataloged by the Enzyme Commission[ref] using a hierarchical number system (called the EC number).
Each well-known catalytic protein therefore has its associated EC numbers stored in the database as well. Multiple EC numbers are possible for a single protein, as its particular shape may allow for multiple enzymatic reactions (i.e., different reactants).
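The codon rule mentioned above (every three RNA nucleotides specify one amino acid) can be shown with a short sketch. Only a handful of entries from the standard genetic code are included, so this table is deliberately partial and the function raises KeyError for any codon outside it; the function name and input strings are illustrative.

```python
# Partial standard codon table (one-letter amino acid codes); None = stop codon.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "GCU": "A",  # alanine
    "UAA": None,
    "UAG": None,
    "UGA": None,
}


def translate(mrna: str) -> str:
    """Read an mRNA string codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, translate("AUGUUUGGCUAA") reads the codons AUG, UUU, GGC and then stops at UAA, giving the protein string "MFG".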
The remaining class of molecular entities is the basic molecule. Basic molecules are just that: basic. That is not to say that they are not complex, but that they generally can be found both inside and outside the cell (and indeed the body in some cases). Probably the most prolific basic molecule involved in most pathways is water. Other examples of common basic molecules are magnesium (Mg++), ATP, ADP, oxygen, etc. Stored with this class are some helpful attributes, such as molecular formula and image, if possible.
There is constant research going on in all parts of the world in biochemistry, and very often multiple groups in different parts of a country, or in different countries, will be working on the same topic. With each person, lab, and publication, many different names or terms are assigned to the same molecule. Thus, choosing a single name for any molecular entity is highly restrictive (for example, for a researcher looking for a particular compound under a name that was not the one selected in the system). To prevent this problem, the notion of a name pool was introduced. The Pathways Database stores a large repository of names for all classes of molecular entities, and each entity can have as many names from this pool as is appropriate. An immediate bonus can be seen when considering the naming scheme for many genes and their protein gene products. In many cases, the names of the genes and the proteins are identical. In fact, when using the symbols instead of the full names, the distinction between a gene and a protein in many systems is made only through capitalization of the first letter! This is hardly a reliable method for entering and identifying data, as human error is certain to take its toll: a letter will be capitalized when it shouldn't be, and vice versa. By instead pulling names from a large pool, and identifying the class of the molecule by the class stored internally in the database, shared names are no longer a problem at all. From the computer science and performance point of view, this allows for much speedier text searches. Too often in web-based and older systems, searches must be focused on a single class, requiring multiple indices on each group, and multiple searches. By compiling all the names into one pool, a single fast index can be built over all molecular entities. Duplicate strings are thus also eliminated from the system. This method also allows for more beneficial approximate matching, with full regular expression support for text searches available in the future.
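A minimal in-memory sketch of the name pool follows. It shows one index over all entity classes, with the class taken from the stored record rather than from capitalization. The NamePool class, its methods, the case-insensitive lookup, and the entity ids and synonym are assumptions for illustration; the actual database schema is not shown here.

```python
from collections import defaultdict
from typing import List, Tuple


class NamePool:
    """One shared index of names for all classes of molecular entities."""

    def __init__(self) -> None:
        # Each distinct (lowercased) name string is stored exactly once.
        self._index = defaultdict(set)

    def add(self, name: str, entity_class: str, entity_id: int) -> None:
        self._index[name.lower()].add((entity_class, entity_id))

    def lookup(self, name: str) -> List[Tuple[str, int]]:
        """Single search across genes, proteins, and all other classes."""
        return sorted(self._index.get(name.lower(), set()))


pool = NamePool()
pool.add("Abl2", "gene", 1)        # illustrative ids
pool.add("ABL2", "protein", 2)     # same symbol, different class
pool.add("example synonym", "protein", 2)  # an entity may have many names
```

A lookup of "abl2" returns both the gene and the protein, each labeled with its class, so the caller does not have to rely on the capitalization of the query.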
The process is the database annotation of an interaction between molecular entities. A process is also sometimes seen as a single step (or reaction) in a metabolic pathway. Most of the chemical reactions that are included in metabolic pathways are catalyzed by some particular enzyme (as described above). During these reactions, the enzyme allows molecules that would otherwise not interact to be brought into contact with each other in such a way that a reaction occurs. Once the reaction completes, usually the reactants are released by the enzyme, and the enzyme is entirely unchanged by the process. There are also cases where no catalyzing enzyme (or catalyzing RNA) is present, and the reactants engage in a process entirely on their own. In either sort of process, there is a series of roles that different molecular entities can play. For each process in the Pathways Database, all of the known reactants are stored (each with a quantity attribute, to allow for correct stoichiometry), as well as some other process-specific attributes (such as whether or not the process is reversible, cellular location, etc.).
As mentioned, each molecular entity involved in a process falls into some category of role-players for that process. The major categories of roles are those entities that are consumed (or destroyed), those that are created (or produced), and those that are not modified.
One of the non-modified roles that has already been previously described is the catalyzing protein. Without this component, all of the other reactants can be present in large quantities and the process would still very rarely take place. Once the process is complete, the catalyzing enzyme releases any reactants and moves on, completely unmodified (hence the term catalyzing enzyme).
The next two most frequently seen and discussed roles are the substrates and the products. These are the primary reactants in the process, with the substrates being consumed and the products being produced. If a molecular entity is described as a substrate in a process, it will be consumed during the course of the reaction, and will not be present (at least in the same quantity) after the process has completed. In the same light, a product is not present at the start of the process, but will be at the end. It is through the substrates and products that most pathways are defined (more on this below).
Two other roles that are mathematically the same as substrates and products are the cofactors. The cofactors can either be in, or out (cofactor in, or cofactor out), and they are also consumed or produced during the course of a process. Cofactors fall into their own special category though due to their commonality across many processes. Quite often cofactors are highly abundant molecules (water, ATP, etc.), and are used in exactly the same way in various different reactions. As an example, consider any phosphorylation reaction. Almost always these reactions will involve a substrate and product that are very similar in that the product is basically just the substrate with a phosphate added on (which is usually enough to make the entity active or inactive, or to induce some conformational change). The cofactors for such a reaction will almost always be the pair ATP and ADP (ATP as the cofactor in, and ADP as the cofactor out).
Besides the catalyzing protein, other molecular entities can be involved in the reaction without being modified. They are often called regulators when it is not entirely clear how they affect the process. If the reaction is dependent on a regulator, or if the regulator speeds up the rate of the reaction, the regulator is usually called an activator. On the other hand, inhibitors are regulators that play a negative role in the completion (or initiation) of the process.
Keep in mind that some processes are reversible, and are thus usually governed by concentration equilibria. Since we must store the roles of the process in at least one direction, the choice of direction for reversible reactions is somewhat arbitrary. When a process is reversed, the entire set of substrates becomes the set of products, and vice versa. The same dichotomy exists for cofactors (in becomes out, out becomes in), as they are often functionally the same as products and substrates. For regulators, it is not always clear whether an activator will become an inhibitor and vice versa; since this is not fully understood biochemically, it cannot be modeled yet.
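The roles described above, and the way they swap when a reversible process is read in the opposite direction, can be sketched as follows. This is only an illustrative sketch and not the PathCaseSB object model: the names `Role`, `Process`, and `reverse_direction` are hypothetical. Regulator roles are deliberately left unchanged on reversal, since the text notes that their behavior under reversal cannot yet be modeled.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    """Hypothetical role names following the categories in the text."""
    SUBSTRATE = "substrate"
    PRODUCT = "product"
    COFACTOR_IN = "cofactor in"
    COFACTOR_OUT = "cofactor out"
    CATALYST = "catalyst"
    REGULATOR = "regulator"
    ACTIVATOR = "activator"
    INHIBITOR = "inhibitor"


# Roles that trade places when a reversible process is reversed.
# Catalyst and regulator roles are absent, so they stay as stored.
_SWAPPED = {
    Role.SUBSTRATE: Role.PRODUCT,
    Role.PRODUCT: Role.SUBSTRATE,
    Role.COFACTOR_IN: Role.COFACTOR_OUT,
    Role.COFACTOR_OUT: Role.COFACTOR_IN,
}


@dataclass(frozen=True)
class Process:
    name: str
    roles: tuple  # tuple of (entity_name, Role) pairs
    reversible: bool = False

    def entities(self, role):
        """Return the set of entity names that play the given role."""
        return {entity for entity, r in self.roles if r is role}

    def reverse_direction(self):
        """Return the same process read in the opposite direction."""
        if not self.reversible:
            raise ValueError(f"{self.name} is not stored as reversible")
        swapped = tuple((e, _SWAPPED.get(r, r)) for e, r in self.roles)
        return Process(self.name, swapped, reversible=True)


# Example: the reversible phosphoglycerate kinase step of glycolysis.
pgk = Process(
    name="phosphoglycerate kinase reaction",
    roles=(
        ("1,3-bisphosphoglycerate", Role.SUBSTRATE),
        ("3-phosphoglycerate", Role.PRODUCT),
        ("ADP", Role.COFACTOR_IN),
        ("ATP", Role.COFACTOR_OUT),
        ("phosphoglycerate kinase", Role.CATALYST),
    ),
    reversible=True,
)
pgk_reversed = pgk.reverse_direction()
```

Storing a single direction and deriving the other on demand mirrors the observation that the stored direction of a reversible process is arbitrary.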
The pathway is finally where we can see the processes in action. Described simply, a pathway is nothing more than a collection of processes. A pathway assembled from arbitrary processes would be rather meaningless, however, so processes are put together to describe a particular aspect of life function. The cascading property of pathways becomes readily apparent when one looks at one's first pathway. Usually the pathway has a starting point (if it is not shown as a cycle), with some initial substrate(s). As the pathway progresses from process to process, those substrates are consumed and products are produced, which are often used as substrates in processes downstream. It is at this level that the need for differentiation between substrates/products and cofactors becomes apparent. If these roles were treated as being identical in importance, pathways would be too complicated to view and understand (since every process that shares a common cofactor would be connected to every other such process).
Logically, a pathway becomes a graph of processes, with an edge between two processes if there exists a substrate/product relationship between them (i.e., a product of one process is used as a substrate of another). A disconnected graph should be flagged as a possible error, since the curator should perhaps consider making each connected component a separate pathway. From the database perspective, a pathway is stored very easily with a few attributes (name, type, references, etc.) and a collection of the processes that make it up. Once the process collection for a pathway is available, the graph can be dynamically created by finding the appropriate edges between processes.
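The dynamic construction of the pathway graph, together with the disconnected-graph check, can be sketched as follows. This is a minimal sketch under stated assumptions: each process is reduced to a hypothetical pair (substrates, products), cofactors are omitted on purpose, and the function names `build_pathway_graph` and `connected_components` are illustrative rather than part of any PathCaseSB API.

```python
from collections import defaultdict


def build_pathway_graph(processes):
    """Build a directed graph of processes.

    `processes` maps a process name to (substrates, products), both sets
    of entity names. An edge p -> q exists when a product of p is a
    substrate of q.
    """
    consumers = defaultdict(set)  # entity -> processes that consume it
    for name, (substrates, _) in processes.items():
        for entity in substrates:
            consumers[entity].add(name)

    graph = {name: set() for name in processes}
    for name, (_, products) in processes.items():
        for entity in products:
            graph[name] |= consumers[entity] - {name}
    return graph


def connected_components(graph):
    """Return the connected components, ignoring edge direction."""
    undirected = {node: set(targets) for node, targets in graph.items()}
    for node, targets in graph.items():
        for target in targets:
            undirected[target].add(node)

    seen, components = set(), []
    for start in undirected:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(undirected[node] - component)
        seen |= component
        components.append(component)
    return components


# Example: three consecutive glycolysis steps plus one unrelated process.
processes = {
    "hexokinase": ({"glucose"}, {"glucose 6-phosphate"}),
    "phosphoglucose isomerase": ({"glucose 6-phosphate"}, {"fructose 6-phosphate"}),
    "phosphofructokinase": ({"fructose 6-phosphate"}, {"fructose 1,6-bisphosphate"}),
    "urease": ({"urea"}, {"ammonia"}),
}
graph = build_pathway_graph(processes)
components = connected_components(graph)
needs_review = len(components) > 1  # a disconnected pathway is flagged
```

In this example the urease process shares no substrate or product with the other three, so the collection splits into two components and would be flagged for the curator.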
Since a pathway is simply a collection of processes, one can question the creation of pathways at all, and argue instead that it would be useful to create a single super-pathway of all processes collected for an organism. The problem with this is that the complexity of the graph becomes too high for people to work with. This raises a question: how can one understand the connections that exist between the pathways themselves? The solution lies in the tools. One useful visualization technique is to draw links from processes to special nodes that represent entire pathways, which can then be explored at the user's leisure. Some of these links may be useful and a topic of further study; others may not be. For those that are useful, the database can store information about select pathway links. Resolving some of the complexity of pathway links is therefore still primarily a problem for the user-interface side of a system, as the data is stored basically at the process level. Just remember that pathways have no biological meaning per se; they are human-defined collections of processes.
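Because the data is stored at the process level, candidate pathway links can be derived rather than entered by hand. The sketch below assumes one plausible definition, namely that a process links to another pathway when one of its products is consumed by some process in that pathway; the text does not fix this definition, and the name `find_pathway_links` and the input layout are hypothetical.

```python
def find_pathway_links(pathways, processes):
    """Find candidate links from processes to other pathways.

    `pathways` maps a pathway name to a list of process names.
    `processes` maps a process name to (substrates, products) sets.
    Returns {(process, source_pathway, target_pathway): shared_entities}.
    """
    # Entities consumed anywhere inside each pathway.
    consumed_in = {
        pathway: set().union(*(processes[p][0] for p in members))
        for pathway, members in pathways.items()
    }

    links = {}
    for source, members in pathways.items():
        for process in members:
            products = processes[process][1]
            for target, consumed in consumed_in.items():
                if target == source:
                    continue
                shared = products & consumed
                if shared:
                    links[(process, source, target)] = shared
    return links


# Example: glucose 6-phosphate connects glycolysis to the pentose
# phosphate pathway (cofactors omitted for brevity).
processes = {
    "hexokinase": ({"glucose"}, {"glucose 6-phosphate"}),
    "phosphoglucose isomerase": ({"glucose 6-phosphate"}, {"fructose 6-phosphate"}),
    "glucose-6-phosphate dehydrogenase": (
        {"glucose 6-phosphate"},
        {"6-phosphogluconolactone"},
    ),
}
pathways = {
    "glycolysis": ["hexokinase", "phosphoglucose isomerase"],
    "pentose phosphate pathway": ["glucose-6-phosphate dehydrogenase"],
}
links = find_pathway_links(pathways, processes)
```

A visualization tool could draw each returned link as an edge from the process to a node standing for the target pathway, and a curator could then choose which of these derived links are worth storing.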
With the popularity of the Internet and the increasing speed and technology present in laboratories worldwide, we are now beginning to deal with the problem of information overload in the life sciences. There is too much data to integrate completely in a single system, and thus we rely on other authoritative sources that specialize in certain areas (e.g., GenBank for nucleotide sequence data). These sources are considered external databases, and each entry in these sources should have some identifier. Each item in the Pathways Database can be linked to an appropriate entry in an external database by a local-id-to-external-id relationship. This relationship associates a data item in the Pathways Database (a molecular entity, process, pathway link, chromosome, etc.) with an external database and the appropriate id in that external database.
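The local-id-to-external-id relationship can be illustrated with a small relational sketch. The table and column names below (`external_databases`, `external_links`, `local_type`, and so on) and the local ids are hypothetical and are not taken from the actual PathCase schema; SQLite is used here only so that the example is self-contained, whereas the system described runs on MS SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One row per external source (hypothetical table names).
    CREATE TABLE external_databases (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- Associates a local item with an id in an external database.
    CREATE TABLE external_links (
        local_type           TEXT    NOT NULL,  -- e.g. 'molecular_entity'
        local_id             INTEGER NOT NULL,
        external_database_id INTEGER NOT NULL
            REFERENCES external_databases(id),
        external_id          TEXT    NOT NULL,
        PRIMARY KEY (local_type, local_id, external_database_id, external_id)
    );
    """
)

conn.execute("INSERT INTO external_databases (id, name) VALUES (1, 'KEGG')")
conn.executemany(
    "INSERT INTO external_links VALUES (?, ?, ?, ?)",
    [
        ("molecular_entity", 101, 1, "C00031"),  # D-glucose
        ("molecular_entity", 102, 1, "C00002"),  # ATP
        ("pathway", 7, 1, "map00010"),           # glycolysis / gluconeogenesis
    ],
)


def external_ids(conn, local_type, local_id):
    """Return (database name, external id) pairs for one local item."""
    rows = conn.execute(
        """
        SELECT d.name, l.external_id
        FROM external_links AS l
        JOIN external_databases AS d ON d.id = l.external_database_id
        WHERE l.local_type = ? AND l.local_id = ?
        ORDER BY d.name, l.external_id
        """,
        (local_type, local_id),
    )
    return rows.fetchall()


glucose_refs = external_ids(conn, "molecular_entity", 101)
```

Keeping the link in its own table means that any kind of local item can reference any number of external sources without changes to the tables that hold the items themselves.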
We represent the data classes and the relationships between biological entities in an entity-relationship model (See Figure 2).
Figure 3 illustrates in detail the PathCase data model obtained by mapping the ER model into tables and foreign-key constraints.