| |
Areas of research in Computer Science:
The Database Management Systems (DBMS) Research Lab in the School of
Computing has a strong group of faculty members and students working on
many diverse data management issues, ranging from core database
research such as storage structure, query processing and optimisation,
and indexing, to advanced database applications such as XML, data
mining, data warehousing and peer-to-peer computing.
Our research interests in database management systems are:
- Database security
- Distributed
systems for standing subscriptions
- Information
integration
- Privacy and
anonymity in databases
- Semi-structured
data repository and query processing
- Skyline query
processing
- XML query
evaluation and estimation
- Mobile databases
- Peer-to-peer
database management
- Peer-to-peer
technologies for distributed applications
- Query processing
in peer-to-peer (P2P) systems
- Schema integration
and schematic discrepancy
- Sensor network
- Spatio-temporal
data mining
Database Security
We started working in Database Security in 2001. Security is critical
to any computing system. We aim to focus on designing techniques to
protect data/ databases. We have focused on several aspects over the
years, ranging from storage design (steganographic file systems) to
securing data streams. In the past two years, we have studied data
authentication in outsourced databases. Here, we have focused on the
completeness issue which has not been previously addressed in the
literature. Essentially, we have designed a scheme that enables users
to verify that results obtained from third party publishers are
complete without violating user access control. We are also the first
to look into verifying multi-dimensional data.
Distributed Systems for Standing Subscriptions
This project started in January 2006, and is funded by the University
Research Committee (URC). The goal of this project is to study
efficient techniques to disseminate data in a distributed
publish/subscribe environment.
To date, we have achieved the following:
a) We have designed a publish/subscribe system to facilitate
dissemination of streaming data where the subscription has a coherency
constraint. Our proposed method is also adaptive to runtime changes in
the environment, and can adapt the dissemination tree dynamically at
runtime to optimise performance.
b) We have also developed a novel architecture for scaleable
distributed stream processing. Our system provides services to evaluate
a large number of continuous queries for a large number of clients. The
query layer is built over a distributed publish/ subscribe
architecture. We have also proposed a solution to distribute queries in
the query layer.
c) We have developed an efficient approach for filtering and
disseminating fragmented XML documents in a publish/subscribe
environment. The approach consists of novel query scheduling and
optimisation algorithms that exploit the semantics of fragmented data
to minimise unnecessary and redundant processing overhead.
d) We have developed a new optimisation technique that enables
downstream routers to leverage the processing done by upstream routers
to reduce response time. This is achieved by dynamically piggybacking
useful processing hints along with the XML data.
Information Integration
The objective of this research is the design of strategies,
techniques and tools for the integration and management of multi-modal
and multimedia information from distributed, heterogeneous and
autonomous sources. The approach is primarily grounded in database
research. As such, it emphasises various ideas such as declarative
query languages, conceptual data models, adaptive processing and
optimisation of programs and queries, and the consideration of large
numbers of users and large collections of data. Currently, the research
focuses on the following themes: XML data management and processing,
progressive processing of disparate information (stream processing),
business process definitions management and integration, and algorithms
for web information (clustering).
The research findings concern the semantic integration of
disparate information sources, in particular, the integration and
management of XML data. The integration and management of multi-modal
information also concern textual and geographical information indexing,
retrieval and extraction, as well as business processes management.
Our research achievements include:
- The first geographical search engine Global Atlas
- The X007 benchmark
- The iMinMax indexing technique for multidimensional data
Privacy and Anonymity in Databases
Organisations, such as government agencies, insurance
companies and hospitals, need to release detailed data (e.g., medical
records, financial data, etc.) for research and other public benefit
purposes. However, sensitive personal information (e.g., disease of a
specific person) may be revealed in this process, despite the hiding of
identifying attributes, such as name or social security number. This is
due to the existence of quasi-identifiers (QI), which are sets of
attributes (e.g., ZIP, DateOfBirth, Sex) that can be used in
conjunction with information obtained from a variety of sources (e.g.,
public voting registration data) to reveal the identity of individuals
whose data is residing on the records. Indeed, a well-known study has
found that 87% of the US population can be uniquely identified by the
combination of ZIP, DateOfBirth, Sex. As more data is made available
online, greater concern over privacy may be expected. Recent research
has proposed k-Anonymity and l-Diversity to solve this problem. Both
approaches involve generalisation (e.g., showing the area code instead
of the exact phone number) or suppression (completely hiding some
data), which inadvertently causes information loss.
Our research focuses on:
a) Developing a framework for efficient data transformations
(i.e., k-Anonymity and l-Diversity), which minimises information loss
(for the given k or l ). Most existing research employs many-to-one
mappings to compute anonymised versions of original multi-dimensional
data. We focus on more flexible many-to-many multi-dimensional
mappings. Although it has been acknowledged that this method may
achieve lower information loss, it has not received much attention
because it is computationally expensive. To reduce computation cost, we
use dimensionality reduction techniques and study efficient
approximation algorithms in one-dimensional space.
b) Studying the dual problem: Given the maximum acceptable
information loss, we seek to find a transformation that maximises
privacy (i.e., k or l). Although this problem is significant in
practice, it has not been studied before.
c) Extending our framework (i.e., anonymisation via
dimensionality reduction) to related applications. Examples include
location-based services (e.g., online maps), where the location of the
querying user (rather than the data) must be anonymised, or stream data
(e.g., data from sensors, stock market, etc.).
Semi-structured Data Repository and Query Processing
There has been increased interest in semi-structured data
recently with the introduction of XML and related languages and
technologies. Semi-structured data is claimed to be self-describing,
making it important to define a schema for the data. The data models
that have been proposed specifically for semi-structured data, (e.g.
DOM, Dataguides, etc.), capture limited semantics in the data, and do
not model the semantics expressed in the schema.
Although XML documents could have rather complex internal
structures, they can generally be modelled as ordered trees. In most
XML query languages, the structures of XML documents are expressed by
twig patterns while the values of XML elements are used as part of
selection predicates. Efficiently matching all twig patterns in an XML
database is a major concern of XML query processing. Among them, the
holistic twig join approach has been taken as an efficient way to match
twig patterns through reducing intermediate results.
The project has so far achieved the following:
a) We have defined a data model called ORASS that captures
richer semantics. Using this semantically rich data model, we have
investigated and experimented with more efficient storage mechanisms,
identified valid views of the base data, defined efficient view
maintenance algorithms, and proposed methods for efficiently querying a
semistructured data repository.
b) We have extended the existing Dewey node labelling
technique and used it to improve twig pattern query processing.
c) We have developed new node labelling techniques to label
dynamic XML data without re-labelling any node labels when the XML data
is updated. Prior node labelling techniques used in twig pattern
processing are designed for static XML data.
Skyline Query Processing
This project started in January 2005, and is funded by the
Faculty Research Committee (FRC). The goal of this project is to
investigate efficient techniques to process an important class of
qualitative preference queries called skyline queries. A skyline query
returns a set of records such that each returned record is not
dominated by other records with respect to a userspecified set of
attributes.
The project has so far achieved the following:
a) We have developed a novel framework and several scalable
algorithms to efficiently process a generalised class of skyline
queries that support data attributes with partially ordered domains.
b) We have introduced two novel notions of interesting skyline
points (frequent skylines and k-dominant skylines), and developed
efficient algorithms to compute them. These concepts are particularly
important for high-dimensional data points in enabling a meaningful
ranking of large sets of skyline query results.
XML Query Evaluation and Estimation
With the fast-growing use of XML data on the Web, optimising
XML queries has become one of the most active and exciting research
areas. Developments in query processing and selectivity estimations of
XML data are among the major issues since they determine data access
methods and the best possible execution plans for complex XML queries.
Our research in this area aims to develop efficient approaches
for query evaluation and selectivity estimations of XML queries. We
examine how path information in XML data can be utilised to speed up
structural join, which is the core operation in XML query processing.
The proposed solution comprises a path-based node labelling scheme and
a path join algorithm. The path-based approach is also efficient for an
important class of XML queries involving structural join with
not-predicates.
We have also designed several methods for XML selectivity
estimations. These include a compact statistical method, a
histogram-based method for skewed XML data, and a path-based method for
estimating the selectivity of XML queries with or without order-based
axes.
The faculty members involved in database management systems are:
- BRESSAN Stephane
- CHAN Chee Yong
- HSU Wynne
- KALNIS Panagiotis
- LEE Mong Li
- LING Tok Wang
- OOI Beng Chin
- TAN Kian Lee
- TUNG Kum Hoe, Anthony
|