case western reserve university

Database and Bioinformatics Lab

Research

 
 
 

PathCase Systems Biology

As computational models of complete pathways, cellular metabolism of specific tissues, and organs are developed and interconnected, new enhanced functionalities and database-enabled software tools are required to (i) link the ever-expanding body of molecular information to an understanding of how intact organisms function via multiscale mechanistic models of the system and (ii) facilitate interactive model development and dynamic analysis of responses in an effective and efficient manner. We propose to build a new set of integrated database-enabled tools for regulatory metabolic networks, called PathCase-SB, with interfaces for model-based querying, visualization, simulation, and model building. The aim of PathCase-SB is to build a database-enabled framework and tools for effective and efficient systems biology model development for multiscale mechanistic models of biological systems. Our approach is to integrate the model database with an existing metabolic network database, PathCase (as well as metabolic network data from other sources), in order to build 'one-stop' querying, visualization, simulation, and modeling capabilities.

People

Ali Cakmak, Xinjian Qi, Sarp Coskun, En Cheng, A. Ercument Cicek, Lei Yang, Rishiraj Jadeja, Nicola Lai, Ranjan Dash, Gultekin Ozsoyoglu, Z. Meral Ozsoyoglu

Project Web Site

 

Information Theory and Metabolic Network Based Metabolic Profile Analysis

Recent improvements in analytical methodology and large sample throughput allow for the creation of large datasets of metabolites that reflect changes in metabolic dynamics due to disease or a perturbation in the metabolic network. However, current methods for comprehensive analysis of large metabolic datasets (metabolomics) are limited, unlike other “omics” approaches where complex techniques for analyzing coexpression/coregulation of multiple variables are applied. We address the shortcomings of current metabolomics data analysis techniques and research new information-theory-based and metabolic-network-aided techniques.

People

A. Ercument Cicek, Gultekin Ozsoyoglu

 

Steady-State Metabolic Network Dynamics Analysis

As an endeavor toward automated analysis of metabolomics data in terms of the dynamic behavior of the metabolic network, we propose, analyze, and empirically evaluate a framework, called Steady-state Metabolic network Dynamics Analysis (SMDA), to reason about the dynamic behavior of the metabolic network at steady state and to locate possible alternatives for active/inactive metabolic subnetworks. Given a user-designated metabolic network, a set of bio-fluid (e.g., blood) metabolite concentration values and, perhaps, a number of tissue-based metabolite concentration values measured at steady state, we apply biochemistry-based rules to generate and output possible alternative scenarios for the steady-state dynamic behavior of the metabolic network, together with the reasoning that leads to each result.

People

Ali Cakmak, Xinjian Qi, A. Ercument Cicek, Gultekin Ozsoyoglu

Project Web Site

 

 

Pedigree Data Management

A pedigree is a “record of ancestry or purity of breed”. Pedigrees are hierarchical hereditary structures and are typically represented as directed acyclic graphs. Stud books (listings of pedigrees for horses, dogs, etc.) and herdbooks (records for cattle, swine, sheep, etc.) are maintained by governmental or private record associations or breed organizations in many countries. In human genetics, pedigree diagrams are used to trace the inheritance of a specific trait, abnormality, or disease, calculate disease risk factors, identify individuals at risk, and facilitate genetic counseling. In addition to medical genetics, pedigrees are also commonly used in animal breeding (e.g., horse racing and pet breeding), plant studies (self-pollinated plant breeding), and genealogical studies. As the volume of this structured pedigree data expands, there is a pressing need for better ways to store, manage, and efficiently query it.
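To make the directed-acyclic-graph view of a pedigree concrete, here is a minimal sketch in Python. The individuals and the PARENTS mapping are hypothetical illustration data, and the traversal is a plain graph walk, not the storage or query method developed in this project.

```python
from typing import Dict, Set, Tuple

# Hypothetical pedigree: each individual maps to its (sire, dam) pair.
# Founders (no recorded parents) are simply absent from the mapping.
PARENTS: Dict[str, Tuple[str, str]] = {
    "foal": ("sire", "dam"),
    "sire": ("grandsire", "granddam_a"),
    "dam": ("grandsire", "granddam_b"),
}


def ancestors(individual: str, parents: Dict[str, Tuple[str, str]]) -> Set[str]:
    """Return every ancestor of `individual` by following parent edges."""
    found: Set[str] = set()
    stack = list(parents.get(individual, ()))
    while stack:
        current = stack.pop()
        if current in found:
            continue  # in a DAG, a shared ancestor is reachable by several paths
        found.add(current)
        stack.extend(parents.get(current, ()))
    return found


def common_ancestors(a: str, b: str, parents: Dict[str, Tuple[str, str]]) -> Set[str]:
    """Ancestors shared by two individuals, e.g. as a first inbreeding check."""
    return ancestors(a, parents) & ancestors(b, parents)
```

Here "grandsire" appears on both the sire's and the dam's side, which is exactly the kind of shared node that makes a pedigree a DAG rather than a tree.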

People

En Cheng, Brendan Elliott, S. Fatih Akgul, Meral Ozsoyoglu

Previous Research

PathCase - Case Pathways Database System

As the blueprints of cellular actions, biological pathways provide scientists with invaluable information in understanding, analyzing and comparing the governing mechanisms in living organisms. The web-based PathCase system is designed for querying, visualizing, and analyzing biochemical pathways, and contains a pathways database, and various visualization, editing, computational analysis and data mining tools, involving hierarchically-arranged graph objects (pathways) and their associated (gene) ontology.

The PathCase architecture involves (i) a web server with extensive graph object (layout) caching to provide a scalable system over the web, (ii) a thin client that caches the graph object layout resulting from a query, so that all subsequent manipulations/re-visualizations of the graph object take place only at the client side, again for scalability, and (iii) extensive web service functions that provide, among others, pathway graph querying capabilities.

People

Meral Ozsoyoglu, Gultekin Ozsoyoglu, S. Fatih Akgul, Ali Cakmak, Brendan Elliott, Mustafa Kirac, Gokhan Yavas, Michael Starke, Greg Strnad

Project Web Site

http://nashua.case.edu/pathways

 

Case Explorer

Case Explorer is a score-guided searching and querying prototype portal for the ACM SIGMOD Anthology, a digital library for the database systems research community containing about 15,000 papers. Case Explorer has a powerful user interface that allows users to pose score-guided ad hoc queries to search the Anthology, automatically computes the scores of query results from the scores of database objects (papers, authors, publication venues), and returns either the top-k results or results with high scores. The Case Explorer database is built by extracting metadata from the Anthology, storing it in a database, and deriving multiple scores for papers, authors, and publication venues. Database scores are propagated to query outputs by a unique score propagation methodology. A rich set of queries is offered through an innovative user interface that allows users to add arbitrarily many conditions to their queries.

People

Nattakarn Ratprasartporn, Sulieman Bani Ahmad, Ali Cakmak, Gultekin Ozsoyoglu

Project Web Site

http://nashua.case.edu/caseexplorer

 

Annotating Genomic Entities with Gene Ontology Terms

With the recent sequencing of complete genomes, the amount of available data characterizing the properties of genes and proteins has increased dramatically. In order to organize the knowledge in this field, Gene Ontology (GO) has been proposed as a controlled term vocabulary which describes the central attributes of genomic entities, i.e., genes and gene products. The impact of such an ontology, on a large scale, depends on its coverage of the genomic entity space, and the accuracy of the existing gene annotations with GO terms. In this project, we explore effective ways of annotating genomic entities with GO terms. In particular, we study GO term annotations of (a) proteins based on protein interaction networks and existing GO term annotations of proteins, and (b) genes using textual patterns extracted from PubMed articles.

ProtANN
The relationship between proteins in a protein interaction network is not only limited to protein pairs (i.e., interaction edges), but also generalizes to functional modules and protein complexes. It is now believed that proteins in the same functional module have the same (or similar) functional annotation. Using protein-protein interaction data, we explore probabilistic relationships between GO annotations of proteins to locate highly correlated GO terms. In particular, this project focuses on probabilistic significance of GO annotation sequences using a protein-protein interaction network. At the later stage, significant GO sequences are utilized to predict protein annotations.

GeneANN
Presently, an extensive portion of the discovered biological knowledge about genes and proteins is only available as unstructured textual data in scientific papers. Manual curation of textual data to annotate genomic entities with GO terms is very costly as it requires significant amounts of human effort. This project focuses on extracting textual patterns from biomedical papers, and using these patterns towards identifying new gene annotations.

People

Mustafa Kirac, Ali Cakmak, Gultekin Ozsoyoglu

 

XML Query Optimization Using Materialized Views

Efficient processing of XPath queries is considered the heart of XML data query processing. Exploiting materialized views for XPath query evaluation can significantly speed up query processing for large XML documents. This project considers the problem of query/view answerability when two types of data are maintained as materialized views. In the first case, we consider views that contain only the XML fragment (copy data) of the result of the query. In the second case, we consider views containing the path data (i.e., a list of ancestor tags for the root node of the resulting XML tree) as well as the copy data. Utilizing the copy data enlarges the set of answerable queries without causing much overhead for the execution of the rewriting of the query.

People

Gokhan Yavas, Meral Ozsoyoglu

 

Labeling Schemes for Tree-Structured Data

The increasing popularity of eXtensible Markup Language (XML) as a data storage format poses new challenges for storing, analyzing, and updating XML data efficiently. A frequently proposed approach for XML documents is to use relational databases, where the XML document is represented as a tree structure and each node of the tree is represented as a tuple of a relation. In order to facilitate queries on this data, this project explores novel labeling schemes with the goal of reducing the number of expensive join operations. More specifically, prefix-based schemes that support subtree insertions into an XML document without requiring any relabeling are the focus of the current research.
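The idea behind prefix-based labeling can be sketched with classic Dewey-style labels, where each node's label is its parent's label plus a sibling ordinal. This is a textbook illustration under that assumption, not the novel scheme explored in this project; the function names are hypothetical.

```python
from typing import List, Tuple

# A label is the path of sibling ordinals from the root, e.g. (1, 2, 1).
Label = Tuple[int, ...]


def is_ancestor(a: Label, d: Label) -> bool:
    """True if the node labeled `a` is a proper ancestor of the node labeled `d`.

    The check compares label prefixes only, so no join against a
    parent/child relation is needed to answer it.
    """
    return len(a) < len(d) and d[:len(a)] == a


def next_child_label(parent: Label, children: List[Label]) -> Label:
    """Label for a new last child of `parent`; existing labels stay unchanged."""
    last = max((c[-1] for c in children), default=0)
    return parent + (last + 1,)
```

Appending a subtree under a node only creates labels that extend that node's label, so no existing node is relabeled. Inserting between two existing siblings is the harder case that plain integer ordinals do not handle without relabeling.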

People

S. Fatih Akgul, Brendan Elliott, Meral Ozsoyoglu

 

Context-based Search

At the present time, ranking functions of literature digital libraries are either ineffective, or simply do not exist at all. For example, PubMed, the largest literature digital library in the world with more than 14 million publications, does not have a paper-scoring system for ranking papers satisfying a keyword search. Also, publication topics in PubMed are diverse; PubMed publications in response to a general keyword-based search routinely fall into multiple topics (i.e., topic diffusion across search results), some of which are not of interest to users. PubMed simply lists publications returned in a search query in descending order of their PubMed ids or publication years, thereby forcing users to scan large numbers of publications, and potentially missing important publications. Our proposal is to assign publications into pre-specified ontology-based contexts, compute relevancy scores for papers with respect to their assigned context(s), perform search within automatically selected contexts, and rank and return selected papers within their contexts. With this new approach, (a) the output is enhanced by a highly useful paper classification (based on contexts), which also eliminates topic diffusion and reduces output size, and (b) only semantically related papers in contexts of interest, as opposed to all papers, are involved in the ranking.

People

Nattakarn Ratprasartporn, Ali Cakmak, Sulieman Bani Ahmad, Gultekin Ozsoyoglu

 

Publication Similarity, Scoring and Ranking Measures


Publication searching based on keywords provided by users is traditional in digital libraries. While useful in many circumstances, the success of locating related publications via the keyword-based searching paradigm depends on how users choose their keywords. Example-based searching, where the user provides an example publication to locate similar publications, is also becoming commonplace in digital libraries. Existing publication similarity measures, needed for example-based searching, fall into two classes, namely, text-based similarity measures from Information Retrieval, and citation-based similarity measures based on bibliographic coupling and/or co-citation. This project explores alternative publication similarity measures, and ranking and scoring mechanisms.
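The two citation-based notions mentioned above can be illustrated with raw counts. The sketch below uses hypothetical paper identifiers and the standard unnormalized definitions; it is not one of the alternative measures this project explores.

```python
from typing import Dict, Set

# Hypothetical citation data: paper -> set of papers it cites.
CITES: Dict[str, Set[str]] = {
    "P1": {"A", "B", "C"},
    "P2": {"B", "C", "D"},
    "P3": {"A"},
}


def bibliographic_coupling(p: str, q: str, cites: Dict[str, Set[str]]) -> int:
    """Number of references that papers p and q have in common."""
    return len(cites.get(p, set()) & cites.get(q, set()))


def co_citation(p: str, q: str, cites: Dict[str, Set[str]]) -> int:
    """Number of papers that cite both p and q."""
    return sum(1 for refs in cites.values() if p in refs and q in refs)
```

Bibliographic coupling is fixed once both papers are published, while a co-citation count can keep growing as new papers cite the pair.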

People

Sulieman Bani Ahmad, Ali Cakmak, Gultekin Ozsoyoglu

 

Evaluation of Publication Scoring Schemes in Context-based Search Environment

Context-based literature digital library search is a new searching paradigm that allows for an effective ranking of query outputs and controls the diversity of query output topics. Contexts are defined by pre-specified ontology-based terms, and the paper set of a context is located based on the semantic properties of the ontology (context) term. In order to provide a comparative assessment of papers in a context and to effectively rank papers returned in search outputs, prestige scores are attached to all papers with respect to their assigned contexts. This project explores the effectiveness of different prestige score functions for a context-based environment, namely, citation-based, text-based, and pattern-based score functions. PubMed publications are used as the test bed for the experiments, and Gene Ontology is employed as the context hierarchy.

People

Nattakarn Ratprasartporn, Sulieman Bani Ahmad, Ali Cakmak, Gultekin Ozsoyoglu

 

Bibliometry-Aware Selection of Publication Ranking Measures in Digital Libraries

Keyword-based searching of digital libraries usually returns a large number of publications, and users usually view only the first few results. Consequently, ranking search results is believed to be useful. Despite their success in web search engines, link-based ranking approaches, like PageRank and HITS, have not found acceptance for ranking publications in digital libraries. Yet a publication's citation count is widely used in academic practice as an indicator of its influence, for example to aid in tenure decisions. Digital libraries, like Google Scholar in Computer Science and PubMed in Medical Sciences, order search results according to either the text-based relevancy score only or the pre-assigned document ID. The reasons for this are the complexity and special characteristics of the literature environment compared to the web environment. For instance, a number of quality indicators of publications need to be considered in the process of ranking publications. Another example is the bibliometric features of the field of study being targeted, which need to be considered when making ranking decisions. This project explores a ranking mechanism that automatically defines paper contexts in the absence of domain-specific ontology terms. Our proposal is based on dynamically (i) discovering author communities and (ii) defining paper context, then ranking papers within that context.

People

Sulieman Bani Ahmad, Ali Cakmak, Gultekin Ozsoyoglu

 

Inferring Disease Models from Integrated Gene Networks

The mouse has been one of the most common model organisms in genetics research. Mouse Genome Informatics (MGI), maintained by The Jackson Laboratory, provides integrated access to data on the genetics, genomics, and biology of the laboratory mouse. MGI has been a rich resource coupling phenotypic analysis of the mouse with orthology information of candidate disease genes in humans (see MGI's phenotype query system). Once candidate orthologous genes have been identified in the mouse, techniques such as targeted mutagenesis can be employed to create mouse models for studying human disease. On the other hand, due to the incompleteness of disease studies, not all genes related to a disease are known. In many cases, the MGI phenotype database lists candidate genes associated with the phenotypic characteristics of a disease for a single species (i.e., human or mouse), so the orthologous genes in the other organism are not known to be related to the disease. In this work, our aim is to compute the likelihood that a gene is related to a disease (i.e., genes of the disease) and to infer the relationship between the gene and the disease (e.g., direct or through regulation of auxiliary genes/gene products).

We view the genetic system of an organism as a gene network in which vertices represent the genes and edges denote physical and genetic interactions between the genes. We build an integrated gene network by combining the following biological networks:

  • (1) Protein interaction networks: Protein complexes are responsible for most biological functions and protein-protein interaction data lists protein pairs with compatible binding sites.
  • (2) Metabolic pathways: Genes coding enzymes that catalyze adjacent metabolic reactions (the product of one reaction is used as a substrate in another) are linked by metabolites. Blocking a prior reaction eliminates the following reaction even if the enzyme of the next reaction is expressed in the cell.
  • (3) Signaling pathways: Similar to metabolic pathways, a change in the functionality of kinases affects the performance of the protein to be phosphorylated.
  • (4) Transcription regulation networks: A protein binding at the promoter site of gene coding region of DNA regulates the expression level of the gene.
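One simple way to picture the combination of these four networks is an edge map that records which source networks support each gene pair. The sketch below is an illustration only: the gene names and source labels are hypothetical, and it treats every edge as undirected, which is a simplification for regulatory and metabolic links that are directional in reality.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Edge = Tuple[str, str]


def integrate(networks: Dict[str, Iterable[Edge]]) -> Dict[FrozenSet[str], Set[str]]:
    """Merge several gene networks into one edge map.

    Each key is an unordered gene pair; each value is the set of source
    networks in which that pair is linked.
    """
    merged: Dict[FrozenSet[str], Set[str]] = defaultdict(set)
    for source, edges in networks.items():
        for gene_a, gene_b in edges:
            merged[frozenset((gene_a, gene_b))].add(source)
    return dict(merged)


# Hypothetical input: two of the four network types, with made-up gene names.
example = integrate({
    "protein_interaction": [("g1", "g2")],
    "metabolic": [("g2", "g1"), ("g2", "g3")],
})
```

A pair supported by several independent networks, such as g1 and g2 here, could then be weighted more heavily than a pair seen in only one.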

People

Mustafa Kirac, Gultekin Ozsoyoglu

 

Estimating Future Paper Scores using Temporal Citation Patterns

Scientific papers often cite other papers to discuss the related work in their field, and also to point out differences/improvements in comparison to other similar papers. Based on the citation information, a literature database can be considered as a graph, called a citation graph, where the nodes are the papers, and there is a directed edge from paper A to paper B if A cites B. The same setting also applies to the web environment, where nodes are individual web pages or sites, and the edges are the hyperlinks from one page to another. Assigning prestige scores to papers or web pages is a common practice. PageRank is currently the most popular ranking algorithm; variations of it are used by almost all search engines to rank web pages and order them accordingly in a search result. PageRank is also used to assign importance scores to papers using the underlying citation graph as input. Once a paper is published, it takes time for the paper to be recognized and cited by other papers. On average, it may take from 5 to 20 years for a paper to reach its peak prestige score. Therefore, for newly published high-quality papers, PageRank may provide relatively low scores because the paper does not have enough citations shortly after it is published. To tackle this bias, this project focuses on mechanisms to characterize the nature of the very first citations that a paper gets, and to use them as an indicator of the final score of the paper. To this end, temporal citation patterns in multiple dimensions are studied.
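The bias against new papers is easy to see on a toy citation graph. The following is a minimal power-iteration sketch of standard PageRank, assuming a damping factor of 0.85 and hypothetical paper names; it does not implement the temporal citation patterns studied in this project.

```python
from typing import Dict, List


def pagerank(cites: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Plain PageRank on a citation graph given as paper -> cited papers."""
    papers = sorted(set(cites) | {p for refs in cites.values() for p in refs})
    n = len(papers)
    score = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        # Papers that cite nothing spread their score evenly over all papers.
        dangling = sum(score[p] for p in papers if not cites.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in papers}
        for p in papers:
            refs = cites.get(p, [])
            for cited in refs:
                new[cited] += damping * score[p] / len(refs)
        score = new
    return score


# Hypothetical graph: "new" was just published, so nothing cites it yet.
scores = pagerank({"new": ["classic", "mid"], "mid": ["classic"], "classic": []})
```

In this graph "new" ends up with the lowest score purely because it has no incoming citations, regardless of its quality.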

People

Sulieman Bani Ahmad, Ali Cakmak, Gultekin Ozsoyoglu

 

Mining Biological Networks for Pathway Fragments

It is well established that genomic entities in different organisms show considerable similarities in terms of sequence and/or functionality. In the course of studying organisms at a coarser, systems level, life scientists recently listed the following questions: (i) To what extent are the genomic pathways conserved among different species? (ii) Is there a minimal set of pathways that are required by all organisms? (iii) How are organisms related in terms of the distance between pathways rather than at the level of DNA sequence similarity? At the core of such questions lies the identification of pathways in different organisms. However, experimental validation of the enormous number of possible candidates in a wet-lab environment requires monumental amounts of time and effort. Thus, there is a need for comparative genomics tools that help scientists predict the pathways in an organism's biological network. This project explores modeling each pathway as a graph of enzyme functions, which we call a pathway functionality template (PFT). More specifically, we focus on moving from the traditional pathway domain of reactions, metabolites, etc. to the pathway GO functionality (PF) domain so as to infer previously unknown pathway fragments in different organisms.

People

Ali Cakmak, Gultekin Ozsoyoglu

 

PopulusLog: People Information Database

The amount of knowledge available on publicly accessible web pages has recently increased dramatically. Among many others, personal information about individuals is one of the most commonly published data types on the web. Politicians, scientists, students, and individuals from other professional fields usually publish information about their background, research, experience, family, and so on. Generic search engines provide no more than links to the web pages that contain an occurrence of the given search phrase, and each result is considered individually. Hence, the generic search scheme misses the useful information that can be obtained only by considering the entities organized as a network. Because entities are considered individually, traditional search engines also do not support complex graph queries such as the virtual distance between people.

In order to organize the publicly available personal information on the web in a more structured way, and to allow for advanced querying of the collected information, we have been developing a knowledgebase called PopulusLog. More specifically, PopulusLog (i) allows grouping of, and provides one-point access to, the information about a person, (ii) provides semantic querying schemes like “Find the colleagues of person A” rather than just simple syntactical keyword search, (iii) evaluates the collected information through social network analysis and provides new knowledge, such as personal impact factors and social cliques, that would otherwise stay implicit, and (iv) visualizes the query results.
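A graph query such as the distance between two people can be sketched as a breadth-first search over a relationship network. This assumes that "virtual distance" means the number of hops on a shortest path, which is an interpretation and not a definition taken from PopulusLog; the people and links are hypothetical.

```python
from collections import deque
from typing import Dict, Optional, Set


def distance(graph: Dict[str, Set[str]], start: str, goal: str) -> Optional[int]:
    """Fewest relationship hops from `start` to `goal`, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        for other in graph.get(person, ()):
            if other == goal:
                return hops + 1
            if other not in seen:
                seen.add(other)
                queue.append((other, hops + 1))
    return None


# Hypothetical network: A knows B, B knows C, and D is isolated.
NETWORK: Dict[str, Set[str]] = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": set()}
```

A keyword search that treats each page individually cannot answer this kind of question, because the answer depends on links between entities rather than on any single page.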

People

Ali Cakmak, Mustafa Kirac, Gultekin Ozsoyoglu

Project Web Site

http://nashua.case.edu/populuslogcase

 

Interactive Picture Database Website Featuring Multiple Organizational Hierarchies

Digital cameras are being increasingly used as a replacement for film cameras. As personal digital photo libraries increase in size, specific pictures may become harder to find, just as it can be difficult to retrieve a print from a shelf full of traditional photo albums. This project began approximately two and a half years ago as an automated website to share and organize my pictures. Since its conception it has grown from about 1,600 images to roughly 30,000 images, including pictures posted by friends and family. The site has been redesigned several times during this period in order to answer the challenges posed by the increasingly large scale. Two of the major concerns are 1) allowing the user to efficiently find a specific image or images that they wish to retrieve, and 2) providing a means for the user to discover new pictures that may be of interest to him or her. The first problem is currently addressed through the use of modern database technology to index the images and provide an efficient framework for querying. There are several approaches that I have been investigating to address the second issue. One of the main approaches was to design the system to allow images to be placed in an arbitrary number of conceptual hierarchies and to allow switching of the ‘context’ hierarchy in which a node is being browsed. Other features of this site include a framework for image similarity search, support for users/content ownership, the ability to post comments, and access to image metadata.

People

Brendan Elliott, Meral Ozsoyoglu

Project Web Site

http://www.risukun.com