| |
|
PathCase - Case Pathways Database
System |
| As the blueprints of cellular actions,
biological pathways provide scientists with invaluable information
in understanding, analyzing and comparing the governing mechanisms
in living organisms. The web-based PathCase system is designed
for querying, visualizing, and analyzing biochemical pathways,
and contains a pathways database, and various visualization,
editing, computational analysis and data mining tools, involving
hierarchically-arranged graph objects (pathways) and their associated
(gene) ontology.
The PathCase architecture involves (i) a web server with extensive
graph object (layout) caching to provide a scalable system over
the web, (ii) a thin client that caches the graph object layout
resulting from a query, so that all subsequent manipulations and re-visualizations
of the graph object take place only on the client side, again
for scalability, and (iii) extensive web service functions that
provide, among others, pathways graph querying capabilities.
People
Meral Ozsoyoglu, Gultekin Ozsoyoglu, S. Fatih Akgul, Ali Cakmak,
Brendan Elliott, Mustafa Kirac, Gokhan Yavas, Michael Starke,
Greg Strnad
Project Web Site
http://nashua.case.edu/pathways
|
Case Explorer |
| Case Explorer is a score-guided searching and querying prototype portal
for the ACM SIGMOD Anthology, a digital library for the database systems research community,
containing about 15,000 papers. Case Explorer has a powerful user interface that allows
users to pose score-guided ad hoc queries to search the Anthology, automatically computes
the scores of query results from the scores of database objects (papers, authors, publication venues), and
returns either the top-k results or results with high scores. The Case Explorer database is built by extracting metadata from the Anthology, storing it in a database, and deriving multiple scores for papers, authors, and publication venues. Database scores are propagated to query outputs by a unique score propagation methodology. A rich set of queries is offered through a powerful and innovative user interface that allows users to add arbitrarily many conditions to their queries.
People
Nattakarn Ratprasartporn, Sulieman Bani Ahmad, Ali Cakmak,
Gultekin Ozsoyoglu
Project Web Site
http://nashua.case.edu/caseexplorer
|
Pedigree Data Management |
| A Pedigree
is a “record of ancestry or purity of breed”. Pedigrees
are hierarchical hereditary structures and are typically represented
as directed acyclic graphs. Stud books (listings of pedigrees
for horses, dogs, etc.) and herdbooks (records for cattle, swine,
sheep, etc.) are maintained by governmental or private record
associations or breed organizations in many countries. In human
genetics, pedigree diagrams are utilized to trace the inheritance
of a specific trait, abnormality, or disease, calculate disease
risk factors, identify individuals at risk, and facilitate genetic
counseling. In addition to medical genetics, pedigrees are also
commonly used in animal breeding (e.g., horse racing and pet
breeding), plant studies (self-pollinated plant breeding), and
genealogical studies. As the volume of this structured pedigree
data expands, there is a pressing need for better ways to manage,
store, and efficiently query this data.
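As a minimal sketch of the directed acyclic graph view described above, a pedigree can be held as a mapping from each individual to its known parents, and ancestry queries become graph traversals. The individuals, the `PARENTS` mapping, and the function names below are hypothetical and only illustrate the idea; they are not the project's data model.

```python
from collections import deque

# Hypothetical toy pedigree: individual -> list of known parents.
# Edges point from child to parent, so the structure is a DAG.
PARENTS = {
    "foal": ["sire", "dam"],
    "sire": ["grandsire_a", "granddam_a"],
    "dam": ["grandsire_a", "granddam_b"],
}


def ancestors(individual, parents):
    """Return the set of all ancestors reachable along parent edges."""
    seen = set()
    queue = deque(parents.get(individual, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # shared ancestors are visited only once
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen


def common_ancestors(a, b, parents):
    """Return the ancestors shared by two individuals."""
    return ancestors(a, parents) & ancestors(b, parents)
```

A query such as `common_ancestors("sire", "dam", PARENTS)` is the kind of question that arises when checking for inbreeding; on large pedigrees, answering it efficiently is where indexing and storage choices start to matter.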
People
Brendan Elliott, S. Fatih Akgul, Meral Ozsoyoglu
|
Annotating Genomic Entities with Gene
Ontology Terms |
| With the recent sequencing of complete genomes,
the amount of available data characterizing the properties of
genes and proteins has increased dramatically. In order to organize
the knowledge in this field, Gene Ontology (GO) has been proposed
as a controlled term vocabulary which describes the central
attributes of genomic entities, i.e., genes and gene products.
The impact of such an ontology, on a large scale, depends on
its coverage of the genomic entity space, and the accuracy of
the existing gene annotations with GO terms. In this project,
we explore effective ways of annotating genomic entities with
GO terms. In particular, we study GO term annotations of (a)
proteins based on protein interaction networks and existing
GO term annotations of proteins, and (b) genes using textual
patterns extracted from PubMed articles.
ProtANN
The relationship between proteins in a protein interaction
network is not only limited to protein pairs (i.e., interaction
edges), but also generalizes to functional modules and protein
complexes. It is now believed that proteins in the same functional
module have the same (or similar) functional annotation. Using
protein-protein interaction data, we explore probabilistic relationships
between GO annotations of proteins to locate highly correlated
GO terms. In particular, this project focuses on probabilistic
significance of GO annotation sequences using a protein-protein
interaction network. At a later stage, significant GO sequences
are utilized to predict protein annotations.
GeneANN
Presently, an extensive portion of the discovered biological
knowledge about genes and proteins is only available as unstructured
textual data in scientific papers. Manual curation of textual
data to annotate genomic entities with GO terms is very costly
as it requires significant amounts of human effort. This project
focuses on extracting textual patterns from biomedical papers,
and using these patterns towards identifying new gene annotations.
People
Mustafa Kirac, Ali Cakmak, Gultekin Ozsoyoglu
|
XML Query Optimization Using Materialized
Views |
| Efficient processing of XPath queries is considered
the heart of XML data query processing. Exploiting materialized
views for XPath query evaluation can significantly speed up
the query processing for large XML documents. This project considers
the problem of query/view answerability when two types of data
are maintained as materialized views. In the first case, we
consider the views which contain only the XML fragment (copy
data) of the result of the query. In the second case, we consider
views containing the path data (i.e., a list of ancestor tags
for the root node of the resulting XML tree) as well as the
copy data. Utilizing the copy data enlarges the set of answerable
queries without causing much overhead for the execution of the
rewriting of the query.
People
Gokhan Yavas, Meral Ozsoyoglu
|
Labeling Schemes for Tree-Structured
Data |
| The increasing popularity of eXtensible Markup
Language (XML) as a data storage format poses new challenges
for storing, analyzing and updating XML data efficiently. A frequently
proposed approach for XML documents is to use relational databases
where the XML document is represented as a tree structure and
each node of the tree is represented as a tuple of a relation.
In order to facilitate querying this data, this project
explores novel labeling schemes with the goal of reducing the
number of expensive join operations. More specifically, prefix-based
schemes that support subtree insertions into an XML document
without requiring any relabeling are the focus of the current
research.
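To illustrate what a prefix-based label buys, the sketch below uses plain Dewey-style labels, where a node's label is its parent's label plus a child ordinal. This is a well-known baseline and not the project's own scheme; the tree encoding and all function names are assumptions made for the example. An ancestor test reduces to a prefix comparison on two labels, with no join, and a subtree appended as the last child leaves existing labels untouched. Plain Dewey labels would still need relabeling for an insertion between two existing siblings, which is the kind of case more elaborate schemes aim to avoid.

```python
def label_tree(tree, label=(1,), out=None):
    """Assign Dewey-style tuple labels to a tree given as (tag, [children])."""
    if out is None:
        out = {}
    tag, children = tree
    out[label] = tag
    for ordinal, child in enumerate(children, start=1):
        label_tree(child, label + (ordinal,), out)
    return out


def is_ancestor(a, b):
    """True if label a is a proper prefix of label b."""
    return len(a) < len(b) and b[:len(a)] == a


def append_subtree(labels, parent_label, subtree):
    """Label a new subtree as the last child of parent_label.

    Existing labels are not modified; only new entries are added.
    """
    used = [
        label[-1]
        for label in labels
        if len(label) == len(parent_label) + 1 and label[:-1] == parent_label
    ]
    new_root = parent_label + (max(used, default=0) + 1,)
    label_tree(subtree, new_root, labels)
    return new_root
```

For a document such as `("book", [("title", []), ("chapter", [("section", [])])])`, the root receives label `(1,)` and the nested section receives `(1, 2, 1)`, so `is_ancestor((1,), (1, 2, 1))` holds by prefix comparison alone.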
People
S. Fatih Akgul, Brendan Elliott, Meral Ozsoyoglu
|
Context-based Search |
| At the present time, ranking functions of literature
digital libraries are either ineffective, or simply do not exist
at all. For example, PubMed, the largest literature digital
library in the world with more than 14 million publications,
does not have a paper-scoring system for ranking papers satisfying
a keyword search. Also, publication topics in PubMed are diverse;
PubMed publications in response to a general keyword-based search
routinely fall into multiple topics (i.e., topic diffusion across
search results), some of which are not of interest to users.
PubMed simply lists publications returned in a search query
in descending order of their PubMed ids or publication years,
thereby forcing users to scan large numbers of publications
and potentially miss important ones. Our proposal
is to assign publications into pre-specified ontology-based
contexts, compute relevancy scores for papers with respect to
their assigned context(s), perform search within automatically
selected contexts, and rank and return selected papers within
their contexts. With this new approach, (a) the output is enhanced
by a highly useful paper classification (based on contexts),
which also eliminates topic diffusion and reduces output size,
and (b) only semantically related papers in contexts of interest,
as opposed to all papers, are involved in the ranking.
People
Nattakarn Ratprasartporn, Ali Cakmak, Sulieman Bani Ahmad,
Gultekin Ozsoyoglu
|
Publication Similarity, Scoring and
Ranking Measures |
|
Publication searching based on keywords provided by users is
traditional in digital libraries. While useful in many circumstances,
the success of locating related publications via the keyword-based
searching paradigm is influenced by how users choose their keywords.
Example-based searching, where the user provides an example publication
to locate similar publications, is also becoming commonplace
in digital libraries. Existing publication similarity measures,
needed for example-based searching, fall into two classes, namely,
text-based similarity measures from Information Retrieval, and
citation-based similarity measures based on bibliographic coupling
and/or co-citation. This project explores alternative publication
similarity measures, ranking and scoring mechanisms.
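The two citation-based measures named above can be sketched in a few lines. Bibliographic coupling counts the references two papers share, while co-citation counts the papers that cite both. The `CITES` mapping and the paper identifiers are hypothetical toy data, and raw counts are used here for simplicity; normalized variants are common in practice.

```python
# Hypothetical toy data: paper -> set of papers it cites.
CITES = {
    "p1": {"a", "b", "c"},
    "p2": {"b", "c", "d"},
    "p3": {"p1", "p2"},
    "p4": {"p1", "p2", "a"},
}


def bibliographic_coupling(x, y, cites):
    """Number of references that papers x and y have in common."""
    return len(cites.get(x, set()) & cites.get(y, set()))


def co_citation(x, y, cites):
    """Number of papers that cite both x and y."""
    return sum(1 for refs in cites.values() if x in refs and y in refs)
```

Bibliographic coupling is fixed once both papers are published, whereas a co-citation count can keep growing as new papers cite the pair.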
People
Sulieman Bani Ahmad, Ali Cakmak, Gultekin Ozsoyoglu
|
Evaluation of Publication Scoring Schemes in Context-based Search Environment |
| Context-based literature digital library search
is a new searching paradigm that allows for an effective ranking
of query outputs, and controls the diversity of query output
topics. Contexts are defined by pre-specified ontology-based
terms, and a paper set of a context is located based on the
semantic properties of the ontology (context) term. In order
to provide a comparative assessment of papers in a context and
to effectively rank papers returned in search outputs, prestige
scores are attached to all papers with respect to their assigned
contexts. This project explores the effectiveness of different
prestige score functions for the context-based environment, namely,
citation-based, text-based, and pattern-based score functions.
PubMed publications are used as the test bed for the experiments,
and Gene Ontology is employed as the context hierarchy.
People
Nattakarn Ratprasartporn, Sulieman Bani Ahmad, Ali Cakmak,
Gultekin Ozsoyoglu
|
Bibliometry-Aware Selection of Publication
Ranking Measures in Digital Libraries |
| Keyword-based searching of digital libraries
usually returns a large number of publications, and users usually
view only the first few results. Consequently, ranking search
results is believed to be useful. Despite their success in
web search engines, link-based ranking approaches, like PageRank
and HITS, have not found acceptance in ranking publications for
digital libraries. Yet a publication's citation count is widely
used in academic practice as an indicator of its influence,
for example to aid in tenure decisions. Digital libraries, like Google Scholar
in Computer Science and PubMed in Medical Sciences, order search
results according to either the text-based relevancy score only
or the pre-assigned document ID. The reasons for that are the
complexity and special characteristics of the literature environment
compared to the web environment. For instance, a number of
quality indicators of publications need to
be considered in the process of ranking publications. Another
example is the bibliometric features of the field of study being
targeted, which need to be considered when making ranking decisions.
This project explores a ranking mechanism that automatically
defines paper contexts in the absence of domain-specific ontology
terms. Our proposal is based on dynamically (i) discovering
author communities and (ii) defining paper context, then ranking
papers within that context.
People
Sulieman Bani Ahmad, Ali Cakmak, Gultekin Ozsoyoglu
|
Inferring Disease Models from Integrated Gene Networks |
| The mouse has been one of the most common model organisms in genetics research. Mouse Genome Informatics (MGI), maintained by The Jackson Laboratory, provides integrated access to data on the genetics, genomics, and biology of the laboratory mouse. MGI has been a rich resource coupling phenotypic analysis of the mouse with orthology information of candidate disease genes in humans (see MGI's phenotype query system). Once candidate orthologous genes have been identified in the mouse, techniques such as targeted mutagenesis can be employed to create mouse models for studying human disease. On the other hand, because disease studies are incomplete, not all genes related to a disease are known. In many cases, the MGI phenotype database lists candidate genes associated with the phenotypic characteristics of a disease for a single species (i.e., human or mouse), leaving it unknown whether the orthologous genes in the other organism are related to the disease. In this work, our aim is to compute the likelihood that a gene is related to a disease (i.e., is one of the genes of the disease) and to infer the relationship between the gene and the disease (e.g., direct or through regulation of auxiliary genes/gene products).
We view the genetic system of an organism as a gene network
where vertices represent the genes and the edges denote
physical and genetic interactions between the genes. We build
an integrated gene network by combining the following biological
networks:
- (1) Protein interaction networks: Protein complexes are responsible for most biological functions, and protein-protein interaction data list protein pairs with compatible binding sites.
- (2) Metabolic pathways: Genes coding enzymes that catalyze adjacent metabolic reactions (the product of one reaction is used as a substrate in another) are linked by metabolites. Blocking a prior reaction eliminates the following reaction even if the enzyme of the next reaction is expressed in the cell.
- (3) Signaling pathways: Similar to metabolic pathways, a change in the functionality of kinases affects the performance of the protein to be phosphorylated.
- (4) Transcription regulation networks: A protein binding at the promoter site of a gene-coding region of DNA regulates the expression level of the gene.
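One simple way to picture the integration step is to merge the edge lists of the four networks into a single graph whose edges remember which source networks support them. The sketch below does that under loud assumptions: the gene identifiers and edges are invented toy data, all interactions are treated as undirected (regulatory and signaling links are directional in reality), and the function names are hypothetical rather than taken from the project.

```python
from collections import defaultdict

# Hypothetical toy edge lists, one per source network.
NETWORKS = {
    "protein_interaction": [("g1", "g2"), ("g2", "g3")],
    "metabolic": [("g2", "g3"), ("g3", "g4")],
    "signaling": [("g4", "g5")],
    "transcription_regulation": [("g5", "g1")],
}


def integrate(networks):
    """Merge edge lists into {gene pair: set of supporting network names}."""
    edges = defaultdict(set)
    for name, pairs in networks.items():
        for u, v in pairs:
            # frozenset makes (u, v) and (v, u) the same undirected edge
            edges[frozenset((u, v))].add(name)
    return dict(edges)


def neighbors(edges, gene):
    """Return the genes linked to the given gene in the integrated network."""
    result = set()
    for pair in edges:
        if gene in pair:
            result |= pair - {gene}
    return result
```

Keeping the supporting network names on each edge makes it possible to tell later whether two genes are linked by physical interaction, by a shared metabolite, or by regulation.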
People
Mustafa Kirac, Gultekin Ozsoyoglu
|
Estimating Future Paper Scores using Temporal Citation Patterns |
| Scientific papers often cite other papers to
discuss the related work in their field, and also to point out
the differences/improvements in comparison to other similar
papers. Based on the citation information, a literature database
can be considered as a graph, called a citation graph, where the
nodes are the papers, and there is a directed edge from paper
A to paper B if A cites B. The same setting also applies to
the web environment where nodes are individual web pages or
sites, and the edges are the hyperlinks from one page to the
other. Assigning prestige scores to papers or web pages is a
common practice. PageRank is currently the most popular ranking
algorithm; variations of it are used by almost all search
engines to rank web pages and order them accordingly in
a search result. PageRank is also used to assign importance
scores to papers using the underlying citation graph as input.
Once a paper is published, it takes time for the paper to be
recognized and cited by other papers. On average,
it may take from 5 to 20 years for a paper to reach its peak
prestige score. Therefore, for newly published high quality
papers, PageRank may provide relatively low scores because
the paper does not have enough citations shortly
after it is published. To tackle this bias, this
project focuses on mechanisms to characterize the nature of
the very first citations that a paper gets, and use it as an indicator
of the final score of a paper. To this end, temporal citation
patterns in multiple dimensions are studied.
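The bias against recent papers can be seen on a toy citation graph with a basic PageRank power iteration. The sketch below is a textbook formulation, not the project's scoring method; the damping factor of 0.85, the iteration count, the even redistribution of score from papers without references, and the three-paper `CITATIONS` graph are all assumptions chosen for illustration.

```python
# Hypothetical toy citation graph: paper -> list of papers it cites.
CITATIONS = {
    "old": [],
    "mid": ["old"],
    "new": ["old", "mid"],
}


def pagerank(cites, damping=0.85, iterations=50):
    """Basic PageRank by power iteration over a citation graph."""
    nodes = set(cites) | {t for targets in cites.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        updated = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = cites.get(node, ())
            if targets:
                # a paper passes its score equally to the papers it cites
                share = damping * rank[node] / len(targets)
                for target in targets:
                    updated[target] += share
            else:
                # a paper with no references spreads its score over all papers
                share = damping * rank[node] / n
                for target in nodes:
                    updated[target] += share
        rank = updated
    return rank
```

In this graph the paper `new` cites the others but has not yet been cited itself, so it receives the lowest score regardless of its quality, which is the situation the early-citation indicators are meant to correct.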
People
Sulieman Bani Ahmad, Ali Cakmak, Gultekin Ozsoyoglu
|
Mining Biological Networks for Pathway Fragments |
| It is well established that genomic entities
in different organisms show considerable similarities in terms
of sequence and/or functionality. In the course of studying
organisms at a coarser, systems level, life scientists recently
listed the following questions: (i) To what extent are genomic
pathways conserved among different species? (ii) Is there
a minimal set of pathways that are required by all organisms?
(iii) How are organisms related in terms of the distance between
pathways rather than at the level of DNA sequence similarity?
At the core of such questions lies the identification of pathways
in different organisms. However, experimental validation of
the enormous number of possible candidates in a wet-lab environment
requires monumental amounts of time and effort. Thus, there
is a need for comparative genomics tools that help scientists
predict the pathways in an organism's biological network. This
project explores modeling each pathway as a graph of enzyme
functions, which we call a pathway functionality template (PFT).
More specifically, we focus on moving from the traditional pathway
domain of reactions, metabolites, etc. to the pathway GO functionality
(PF) domain so as to infer previously unknown pathway fragments
in different organisms.
People
Ali Cakmak, Gultekin Ozsoyoglu
|
PopulusLog: People Information Database |
| The amount of knowledge available on publicly accessible web pages
has recently increased dramatically. Personal information about
individuals is among the most commonly published types of data
on the web. Politicians, scientists, students and individuals
from other professional fields usually publish information about
their background, research, experience, family and so on. Generic
search engines provide no more than links to the web pages that
contain an occurrence of the given search phrase, and each result
is considered individually. Hence, the generic search scheme misses
the useful information that can be obtained only by considering the
entities organized as a network. Because entities are considered
individually, traditional search engines also do not support
complex graph queries such as virtual distance between people.
In order to organize the publicly available personal information
on the web in a more structured way, and allow for advanced querying
of the collected information, we have been developing a knowledgebase
called PopulusLog. More specifically, PopulusLog (i) allows grouping of,
and provides one-point access to, the information about a person,
(ii) provides semantic querying schemes like “Find the colleagues
of person A” rather than just simple syntactical keyword search,
(iii) evaluates the collected information through social network
analysis and provides new knowledge, like personal impact factors and
social cliques, that would otherwise stay implicit, and (iv) visualizes
the query results.
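A graph query such as the distance between two people can be sketched as a breadth-first search over a network of known relationships. The example assumes that "virtual distance" means the number of relationship hops on a shortest path, which the description above does not define; the `KNOWS` graph, the names in it, and the function name are hypothetical.

```python
from collections import deque

# Hypothetical toy network: person -> set of directly related people.
KNOWS = {
    "alice": {"bob"},
    "bob": {"alice", "carol"},
    "carol": {"bob", "dave"},
    "dave": {"carol"},
    "erin": set(),
}


def distance(graph, source, target):
    """Return the number of hops on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        person, hops = queue.popleft()
        for other in graph.get(person, ()):
            if other == target:
                return hops + 1
            if other not in seen:
                seen.add(other)
                queue.append((other, hops + 1))
    return None
```

A keyword search engine that scores each page on its own cannot answer this kind of question, because the answer depends on how the entities connect to one another.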
People
Ali Cakmak, Mustafa Kirac, Gultekin Ozsoyoglu
Project Web Site
http://nashua.case.edu/populuslogcase
|
Interactive Picture Database Website
Featuring Multiple Organizational Hierarchies |
| Digital cameras are being increasingly used
as a replacement for film cameras. As personal digital photo
libraries increase in size, specific pictures may become harder
to find, just as it can be difficult to retrieve a print among
a shelf full of traditional photo albums. This project began
approximately two and a half years ago as an automated website
to share and organize my pictures. Since its conception it has
grown from about 1,600 images to roughly 30,000 images, including
pictures posted by friends and family. The site has been redesigned
several times during this period in order to answer the challenges
posed by the increasingly large scale. Two of the major concerns
are 1) allowing the user to efficiently find a specific image
or images that they wish to retrieve, and 2) providing a means for
the user to discover new pictures that may be of interest to
him or her. The first problem is currently addressed through
the use of modern database technology to index and provide
an efficient framework for querying. There are several approaches
that I have been investigating to address the second issue.
One of the main approaches was to design the system to allow
images to be placed in an arbitrary number of conceptual hierarchies
and allow switching of the ‘context’ hierarchy in
which a node is being browsed. Other features of this site include:
a framework for doing image similarity search, support for users/content
ownership, the ability to post comments, and access to image
metadata.
People
Brendan Elliott, Meral Ozsoyoglu
Project Web Site
http://www.risukun.com
|
|