The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for 2000-2001 are below, and more will be listed as we get confirmations. Please send your suggestions to M. Tamer Özsu.
Unless otherwise noted, all talks will be in room DC (Davis Centre) 1304. Coffee will be served 30 minutes before the talk.
We will try to post presentation notes whenever possible. Please click on a presentation title to access these notes (usually in PDF format).
Database Seminar Series is supported by iAnywhere Solutions, A Sybase Company.
23 October 2000, 11:00 AM; MC (Mathematics and Computer Building) 5158
Title: | Courseware-On-Demand: Generating New Course Material From Existing Courses |
Speaker: | Vincent Oria, New Jersey Institute of Technology |
Abstract: |
Creation of appropriate educational content is a major challenge and cost factor in the increasingly important areas of on-line and life-long learning. The aim of the ongoing Courseware-on-Demand project is to enable the creation of course content by recomposing parts extracted from existing annotated course materials. Course materials comprising a variety of learning objects (slides, video, audio, textbook, etc.) are gathered, decomposed into elementary learning fragments, appropriately annotated, indexed, and stored. They are further augmented with course structures and dependencies (prerequisite and precedence). A course designer who has a particular training goal in mind can pose a query to the Courseware-on-Demand system, expressing that goal. The first step in processing the query is to locate course fragments containing material relevant to the goal. Then, taking into account the dependencies, the system can define a minimal path in the metadata network that leads to the goal. The last step is to physically extract the parts of the course materials retrieved during the previous step, and compose the new teaching material.
This is joint work with Silvia Hollfelder and Peter Fankhauser, GMD-IPSI, Darmstadt, Germany. |
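To make the fragment-recomposition pipeline above a little more concrete, here is a small illustrative sketch in Python of the dependency-following step only; the fragment names and prerequisite graph are invented for the example and are not the project's actual data model.

    # Illustrative sketch: assemble a course from annotated fragments with
    # prerequisite dependencies. Fragment names below are invented.

    from graphlib import TopologicalSorter  # Python 3.9+

    # prerequisites[x] = fragments that must be covered before x
    prerequisites = {
        "sql-joins": {"relational-model", "sql-select"},
        "sql-select": {"relational-model"},
        "relational-model": set(),
        "query-optimization": {"sql-joins"},
    }

    def fragments_for_goal(goal):
        """Collect the goal fragment plus everything it transitively depends
        on, returned in an order that respects the prerequisite relation."""
        needed, stack = set(), [goal]
        while stack:
            f = stack.pop()
            if f not in needed:
                needed.add(f)
                stack.extend(prerequisites.get(f, ()))
        subgraph = {f: prerequisites.get(f, set()) & needed for f in needed}
        return list(TopologicalSorter(subgraph).static_order())

    print(fragments_for_goal("query-optimization"))
    # ['relational-model', 'sql-select', 'sql-joins', 'query-optimization']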
Bio: | Vincent Oria received a Diplôme d'Ingénieur in 1989 from the Institut National Polytechnique, Yamoussoukro, Côte d'Ivoire (Ivory Coast), a D.E.A. in 1990 from Université Pierre et Marie Curie (Paris VI), Paris, France, and a Ph.D. in Computer Science in 1994 from the Ecole Nationale Supérieure des Télécommunications (ENST), Paris, France. He worked as a Research Scientist at ENST Paris from 1994 to 1996 before joining Prof. Tamer Özsu's group at the University of Alberta, Canada, as a Post-Doctoral Fellow from 1996 to 1999, where he was the primary researcher on the DISIMA project. He is currently an Assistant Professor in the Department of Computer and Information Science at the New Jersey Institute of Technology (NJIT), USA. |
6 November 2000, 11:00 AM
Title: | How to Crawl the Web |
Speaker: | Hector Garcia-Molina, Stanford University |
Abstract: |
Joint work with Junghoo Cho.
A crawler collects large numbers of web pages, to be used for building an index or for data mining. Crawlers consume significant network and computing resources, both at the visited web servers and at the site(s) collecting the pages, and thus it is critical to make them efficient and well behaved. In this talk I will discuss how to build a "good" crawler, addressing questions such as:
|
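As a point of reference for the resource questions the talk addresses, here is a minimal, hypothetical sketch of a polite breadth-first crawler in Python; it is not the crawler studied in the talk, and the seed URL handling, page limit, and politeness delay are arbitrary choices for the example.

    # Minimal sketch of a polite breadth-first crawler (illustration only).
    import time, urllib.parse, urllib.request
    from collections import deque
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=20, delay=1.0):
        frontier, seen, pages = deque([seed]), {seed}, {}
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue                      # skip unreachable pages
            pages[url] = html
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                absolute = urllib.parse.urljoin(url, href)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
            time.sleep(delay)                 # be polite to visited servers
        return pages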
Bio: | Hector Garcia-Molina is the Leonard Bosack and Sandra Lerner Professor in the Departments of Computer Science and Electrical Engineering at Stanford University, Stanford, California. From August 1994 to December 1997 he was the Director of the Computer Systems Laboratory at Stanford. From 1979 to 1991 he was on the faculty of the Computer Science Department at Princeton University, Princeton, New Jersey. His research interests include distributed computing systems and database systems. He received a BS in electrical engineering from the Instituto Tecnologico de Monterrey, Mexico, in 1974. From Stanford University he received an MS in electrical engineering in 1975 and a PhD in computer science in 1979. Garcia-Molina is a Fellow of the ACM, received the 1999 ACM SIGMOD Innovations Award, and is a member of the President's Information Technology Advisory Committee (PITAC). |
20 November 2000, 11:00 AM
Title: | An Authorization Model for Temporal Data |
Speaker: | Avigdor Gal, Rutgers University |
Abstract: |
The use of temporal data has become wide-spread in recent years, within applications such as data warehouses and spatiotemporal databases. In this paper, we extend the basic authorization model by facilitating it with the capability to express authorizations based on the temporal attributes associated with data, such as transaction time and valid time. In particular, a subject can specify authorizations based on data validity or data update time, using either absolute or relative time references. Such a specification is essential in providing access control for predictive data, or in constraining access to data based on currency considerations. We provide an expressive language for specifying such access control to temporal data, using a variation of temporal logic for specifying complex temporal constraints. We also introduce an easy-to-use access control mechanism for stream data.
This is joint work with Vijay Atluri, Rutgers University. A paper describing this material is available here. |
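The flavor of such temporally conditioned authorizations can be pictured with a small hypothetical sketch; the rule fields below (valid_before, updated_within_days) are invented for the example and are not the paper's actual specification language.

    # Illustration only: an authorization conditioned on valid time and
    # transaction time, with absolute and relative time references.
    from datetime import date, timedelta

    def authorized(rule, tuple_meta, today=None):
        """rule: optional 'valid_after', 'valid_before', 'updated_within_days';
        tuple_meta: 'valid_time' and 'tx_time' of the protected tuple."""
        today = today or date.today()
        if "valid_after" in rule and tuple_meta["valid_time"] < rule["valid_after"]:
            return False
        if "valid_before" in rule and tuple_meta["valid_time"] >= rule["valid_before"]:
            return False
        if "updated_within_days" in rule:
            if today - tuple_meta["tx_time"] > timedelta(days=rule["updated_within_days"]):
                return False
        return True

    # A subject may read forecasts only if they were updated in the last 7
    # days and are not valid beyond the start of 2002.
    rule = {"updated_within_days": 7, "valid_before": date(2002, 1, 1)}
    row  = {"valid_time": date(2001, 6, 1), "tx_time": date(2001, 5, 20)}
    print(authorized(rule, row, today=date(2001, 5, 22)))   # True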
Bio: | Avigdor Gal is a faculty member at the Department of MSIS at Rutgers University. He received his D.Sc. degree from the Technion-Israel Institute of Technology in 1995 in the area of temporal active databases. He has published more than 30 papers in journals, books, and conferences on the topics of information systems architectures, active databases and temporal databases. Together with Dr. John Mylopoulos, Avigdor has chaired the Distributed Heterogeneous Information Services workshop at HICSS98 and he was the guest editor of a special issue of the same name in the International Journal of Cooperative Information Systems. Also, he was the General co-Chair of CoopIS2000. Avigdor has consulted in the area of eCommerce and is a member of the ACM and the IEEE Computer Society. |
4 December 2000, 11:00 AM
Title: | Consistent Query Answers in Inconsistent Databases |
Speaker: | Jan Chomicki, University at Buffalo |
Abstract: |
As the amount of information available in online data sources explodes, there is a growing concern about the consistency and quality of answers to user queries. This talk addresses the issue of using logical integrity constraints to gauge the consistency and quality of query answers. Although it is impractical to enforce global integrity constraints across different data sources and correct integrity violations by updating individual sources, integrity constraints are still useful because they express important semantic properties of data. This talk introduces formal notions of database repair and consistent query answer: a consistent answer is true in every minimal repair of the database. The information about answer consistency serves as an important indication of its quality and reliability.
I will show a general method for transforming first-order queries in such a way that the transformed query computes consistent answers to the original query. I will characterize the scope of this method: the classes of queries and integrity constraints it handles. For functional dependencies, I will provide a complete characterization of the computational complexity of computing consistent query answers to scalar aggregation queries (those involving the SQL operators COUNT, MIN, MAX, SUM, or AVG). I will also show that if the set of functional dependencies is in Boyce-Codd Normal Form, the boundary between tractable and intractable can be pushed further using the tools of perfect graph theory. (Joint work with Marcelo Arenas and Leopoldo Bertossi, with a contribution from Vijay Raghavan and Jeremy Spinrad.) |
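The repair-and-intersect notion of a consistent answer can be pictured with a brute-force sketch over a toy relation violating one functional dependency; this only illustrates the definition, not the query-transformation method presented in the talk.

    # Toy illustration of consistent answers for the dependency name -> salary.
    from itertools import product
    from collections import defaultdict

    emp = [("alice", 50), ("alice", 70), ("bob", 40)]   # violates name -> salary

    def repairs(rows):
        """Maximal consistent subsets: keep exactly one tuple per name."""
        groups = defaultdict(list)
        for row in rows:
            groups[row[0]].append(row)
        return [list(choice) for choice in product(*groups.values())]

    def consistent_answers(query, rows):
        """Answers that hold in every minimal repair."""
        answer_sets = [set(query(r)) for r in repairs(rows)]
        return set.intersection(*answer_sets)

    q = lambda rows: [name for name, salary in rows if salary >= 40]
    print(consistent_answers(q, emp))   # {'alice', 'bob'}: true in every repair
    print(consistent_answers(lambda rows: [n for n, s in rows if s >= 60], emp))
    # set(): 'alice' earns >= 60 in only one repair, so it is not consistent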
Bio: | Jan Chomicki received his M.S. in computer science with honors from Warsaw University, Poland, and his Ph.D., also in computer science, from Rutgers University. He is currently an Associate Professor at the Department of Computer Science and Engineering, University at Buffalo (formerly: SUNY Buffalo). He has held visiting positions at Hewlett-Packard Labs, European Community Research Center, Bellcore, and Lucent Bell Labs. He is the recipient of 3 National Science Foundation research grants, the author of over 40 publications, and an editor of the book "Logics for Databases and Information Systems," Kluwer, 1998. |
22 January 2001, 11:00 AM
Title: | W3C XML Query WG: A Status Report (PDF) |
Speaker: | Paul Cotton, Microsoft Canada |
Abstract: | The World Wide Web Consortium (W3C) XML Query Working Group [1] was chartered in September 1999 to develop a query language for XML [2] documents. The goal of the XML Query Working Group is to produce a formal data model for XML documents with Namespaces [3] based on the XML Infoset [4] and XML Schemas [5, 6, 7], an algebra of query operators on that data model, and then a query language with a concrete canonical syntax based on the proposed operators. The WG produced its first Requirements document [8] in January 2000. Subsequently it has published an XML Query Data Model Working Draft [9] and an XML Query Algebra Working Draft [10]. The XML Query WG is currently working on a syntax for the XQuery language. This talk will provide an update on the current status of the W3C XML Query WG. This update will include current information on the status of the XML Query Requirements, Data Model and Algebra. The talk will also outline the relationship of the work of the XML Query WG to other W3C XML standards especially XML Schema and XPath [11]. |
Bio: | Paul Cotton joined Microsoft in May 2000 and is currently Program Manager of XML Standards. Paul telecommutes to his Redmond job from his home in Ottawa, Canada. Paul has 28 years of experience in the Information Technology industry. He has been involved in computer standards work since 1988, when he began representing Fulcrum Technologies in the SQL Access Group, where he was heavily involved in the development of the SQL Call-Level Interface (CLI), the de jure standard based on ODBC. Paul has represented his employer and Canada on the ISO SQL and SQL/Multi-Media committees since 1992 and was the Editor of the SQL/MM Full-Text and Still Image documents from 1995 until joining Microsoft. Paul was a founding member of the consortium efforts to standardize JDBC and SQLJ, which provide interfaces to SQL for the Java language. Paul has been participating in the W3C XML Activity since mid-1998, when he became IBM's prime representative on the XML Linking and Infoset Working Groups. Paul has been chairperson of the XML Query Working Group and a member of the XML Coordination Group since September 1999. Paul was elected to the W3C Advisory Board in June 2000 soon after joining Microsoft. Paul is also Microsoft's alternate on the XML Protocol Working Group. Paul graduated with an M.Math in Computer Science from the University of Waterloo in 1974. |
5 February 2001, 11:00 AM; DC 1302
Title: | Consistent and Efficient Database Replication based on Group Communication (PDF) |
Speaker: | Bettina Kemme, McGill University |
Abstract: | Data replication is a common mechanism in distributed database systems that provides performance and fault-tolerance. Most early work has focused on eager replication: updates are immediately sent to all copies in order to guarantee consistency. However, database designers often point out that eager replication is not feasible in practice due to the high deadlock rate, the number of messages involved and the poor response times. As a result, the focus has shifted towards faster lazy replication where updates are propagated asynchronously to remote copies. The price paid is poor consistency or severe restrictions to the configuration of the system. This talk suggests that, contrary to what has been claimed, eager replication can be implemented in such a manner as to provide not only clear correctness criteria but also reasonable performance and scalability. This is especially true in environments where communication is fast (e.g., computer clusters). I will present a suite of eager replication protocols based on novel techniques that avoid most of the limitations of current approaches. These protocols take advantage of powerful broadcast primitives to serialize transactions, and to avoid deadlocks and an explicit 2-phase commit. The talk will also discuss the practical implications of integrating these techniques into a real database system, and provide experimental results that prove the feasibility of the approach. |
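The core intuition — that a totally ordered broadcast lets every replica resolve conflicts deterministically without a distributed commit — can be sketched as follows; this toy code is not one of the protocols from the talk, and the conflict rule shown is an invented simplification.

    # Sketch: if every replica receives write sets in the same total order,
    # each can apply them deterministically and locally, with no 2-phase commit.
    class Replica:
        def __init__(self, name):
            self.name = name
            self.data = {}        # key -> value
            self.version = {}     # key -> sequence number of last applied writer

        def deliver(self, seqno, write_set):
            """Called in the same total order at every replica."""
            for key, value in write_set.items():
                # later write in the total order wins; since all replicas see
                # the same order, they all make the same choice
                if self.version.get(key, -1) < seqno:
                    self.data[key] = value
                    self.version[key] = seqno

    replicas = [Replica("r1"), Replica("r2")]
    ordered_writes = [(1, {"x": 10}), (2, {"x": 11, "y": 5})]   # total order
    for seqno, ws in ordered_writes:
        for r in replicas:
            r.deliver(seqno, ws)
    assert replicas[0].data == replicas[1].data == {"x": 11, "y": 5}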
Bio: | Bettina Kemme received her M.S. in Computer Science in 1996 from the Friedrich-Alexander University, Erlangen, Germany, and her Ph.D. in Computer Science in 2000 from the Swiss Federal Institute of Technology (ETH), Zurich, Switzerland. She is currently an Assistant Professor at the School of Computer Science at McGill University, Montreal. |
28 February 2001, 2:30 PM; DC 1302
Title: | The XML Query Algebra: State and Challenges (PDF) |
Speaker: | Peter Fankhauser, GMD-IPSI |
Abstract: | The XML Query Algebra [1] is designed as part of the W3C XML Query Activity [2] to provide a formal basis for an XML query language. It is guided by three principles: full support of well-formed and valid XML, compositionality, and strong typing. This talk will introduce the overall design, the underlying data model, the type system, and the operational semantics of the algebra. Furthermore, the talk will exemplify how to map surface syntaxes such as XPath [3] or XQuery to the algebra, and discuss the relationship between the algebra's type system and XML Schema's [4,5,6] type system. |
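As a rough picture of the kind of compositional iterate/filter/construct operations the algebra formalizes, here is a small sketch over a toy XML document; it uses Python's standard library rather than the algebra's own notation, and the bibliography data is invented.

    # Illustration only: an XQuery-style "for ... where ... return" as a
    # comprehension over a toy XML tree.
    import xml.etree.ElementTree as ET

    doc = ET.fromstring("""
    <bib>
      <book year="1999"><title>Data on the Web</title></book>
      <book year="1994"><title>TCP/IP Illustrated</title></book>
    </bib>
    """)

    # roughly: for $b in $doc//book where $b/@year >= 1995 return $b/title
    recent_titles = [
        b.findtext("title")
        for b in doc.iter("book")
        if int(b.get("year")) >= 1995
    ]
    print(recent_titles)    # ['Data on the Web']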
Bio: | Peter Fankhauser is head of the department OASYS (Open Adaptive Information Systems) and director of the XML Competence Center at the GMD Institute for Integrated Publication and Information Systems (IPSI). His main research and development interests lie in information brokering and transaction brokering. He has been the technical leader in the development of tools for unified access to information retrieval systems, document structure recognition, and the integration of heterogeneous databases. He is a member of the W3C XML Query Working Group, and is one of the editors of the W3C XML Query Requirements and XML Query Algebra working drafts. He received a PhD in computer science from the Technical University of Vienna in 1997. |
12 March 2001, 11:00 AM
Title: | Profile-Driven Data Management (PDF) |
Speaker: | Mitch Cherniack, Brandeis University |
Abstract: |
Large amounts of data populate the Internet, yet this data is largely unmanaged. Access to this data requires a wide array of network resources including server capacity, bandwidth, and cache space. Applications compete for these resources with little overarching support for their intelligent allocation.
The database community has seen this problem before. Data management techniques (e.g., indexing, data organization, buffer management) have long been used in DBMSs to offset the demands that competing applications' requests for data place on limited resources. Such techniques are application-driven; a database administrator (DBA) consults with users to determine their data needs, and tweaks the knobs of data management tools accordingly. But the Internet environment, consisting of autonomous data sources and global in scale, cannot be tamed by a DBA. We take the view that in this environment, the DBA role must be automated, and the specification of application-level data requirements must be made formal by way of processible user profiles. In this talk, I will discuss the Profile-Driven Data Management project: a multi-institution endeavor funded by the National Science Foundation in October 2001. I will describe the goals and scope of the project, present some early results, and pose some of the many challenges that lie ahead. Joint work with Stan Zdonik (Brown University) and Michael J. Franklin (UC Berkeley). |
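Purely as a hypothetical illustration of what a machine-processible user profile might look like and how a system could act on it, consider the sketch below; the field names and the fixed per-feed cost are invented and do not come from the project described above.

    # Hypothetical user profile driving a simple prefetch/cache decision.
    profile = {
        "user": "analyst-17",
        "interests": [
            {"source": "news-feed",   "topic": "semiconductors", "utility": 0.9},
            {"source": "stock-ticks", "topic": "NASDAQ:INTC",    "utility": 0.7},
        ],
        "freshness_seconds": 300,     # tolerate data up to 5 minutes old
        "max_cache_mb": 50,
    }

    def prefetch_plan(profile, available_mb):
        """Rank interests by declared utility and keep what fits in the budget."""
        budget = min(available_mb, profile["max_cache_mb"])
        plan, used = [], 0
        for item in sorted(profile["interests"], key=lambda i: -i["utility"]):
            cost = 10                 # assumed fixed per-feed footprint (MB)
            if used + cost <= budget:
                plan.append(item["topic"])
                used += cost
        return plan

    print(prefetch_plan(profile, available_mb=25))
    # ['semiconductors', 'NASDAQ:INTC']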
Bio: | Mitch Cherniack received his Ph.D. in Computer Science in 1999 from Brown University. He is currently an Assistant Professor at Brandeis University, Boston. |
19 March 2001, 11:00 AM; DC 1302
Title: | Optimizing Queries Using Materialized Views (PDF) (Associated paper (PDF)) |
Speaker: | Per-Ake (Paul) Larson, Microsoft Research |
Abstract: |
Materialized views can provide orders of magnitude improvements in query processing time, especially for aggregation queries over large tables.
To realize this potential, the query optimizer must understand how and when to exploit materialized views. This talk describes a fast and scalable algorithm for determining whether part or all of a query can be computed from materialized views and outlines how the algorithm is integrated into a transformation-based optimizer. The algorithm handles views composed of selections, joins and a final group-by. Multiple rewrites using views may be generated and the optimizer chooses the best alternative in the normal way. Experimental results based on a prototype implementation in Microsoft SQL Server show outstanding performance and scalability. Optimization time increases slowly with the number of views but remains low even up to a thousand views. |
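A deliberately simplified sketch of the view-matching question (can this query be answered from that materialized view?) may help fix ideas; real view matching, including the algorithm in the talk, handles joins, grouping, and predicate containment far more generally than this single-table, equality-filter toy.

    # Toy view matching: columns and equality filters on one table only.
    def can_answer_from_view(query, view):
        """query/view: dicts with 'table', 'columns' (set), 'filters' (dict)."""
        if query["table"] != view["table"]:
            return False
        # the view must not have filtered out rows the query still needs
        if any(query["filters"].get(col) != val
               for col, val in view["filters"].items()):
            return False
        # output columns, and columns the query still filters on, must survive
        extra_filter_cols = set(query["filters"]) - set(view["filters"])
        return (query["columns"] | extra_filter_cols) <= view["columns"]

    view  = {"table": "sales", "columns": {"region", "total"},
             "filters": {"year": 2000}}
    query = {"table": "sales", "columns": {"region", "total"},
             "filters": {"year": 2000, "region": "east"}}
    print(can_answer_from_view(query, view))   # True: apply region filter on top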
Bio: | Paul Larson is a senior researcher in the Database Group at Microsoft Research. Prior to joining Microsoft, he was a Professor of Computer Science at the University of Waterloo. |
26 March 2001, 11:00 AM
Title: | A Configurable Application View Storage System (PDF) |
Speaker: | Sibel Adali, RPI |
Abstract: | Storage management is a well-known method for improving the efficiency of data-intensive and networked applications. Today's data management systems handle many non-traditional data formats, ranging from spatial data to images, video and other hybrid representations. This requires the use of specialized methods to query, extract and transform data from multiple, possibly distributed sources. There is a great need to develop efficient and scalable methodologies for storing and reusing the results of computations in such applications. In this paper, we introduce a configurable storage management system that allows programmers to specify a collection of storage management protocols for managing different types of data requests on top of a shared and possibly distributed pool of resources. In addition, dynamic protocol change rules allow the system to shift from one storage management method to another depending on the availability of system resources. Furthermore, the storage management system can be expanded with application-specific methods to look up and re-use stored items. We show how the storage system can be tuned to specific workload specifications with the help of a simulation model that takes into account both the cost of the storage management protocols and the methods to look up and re-use stored items. |
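The idea of pluggable storage-management protocols with a dynamic protocol-change rule can be sketched hypothetically as follows; the class names, eviction policies, and switching rule are invented for illustration and are not the system described above.

    # Hypothetical sketch: a view store with swappable eviction protocols.
    from collections import OrderedDict

    class OldestFirstProtocol:
        """Evict the oldest entry (a stand-in for LRU in this sketch)."""
        def victim(self, items):              # items: OrderedDict key -> size
            return next(iter(items))

    class LargestFirstProtocol:
        """Evict the largest item first, freeing space quickly under pressure."""
        def victim(self, items):
            return max(items, key=items.get)

    class ViewStore:
        def __init__(self, capacity, protocol):
            self.capacity, self.protocol = capacity, protocol
            self.items = OrderedDict()        # key -> size
            self.used = 0

        def put(self, key, size):
            while self.used + size > self.capacity and self.items:
                victim = self.protocol.victim(self.items)
                self.used -= self.items.pop(victim)
            self.items[key] = size
            self.used += size
            # dynamic protocol-change rule: under heavy pressure, switch to
            # evicting large items instead of strictly old ones
            if self.used > 0.9 * self.capacity:
                self.protocol = LargestFirstProtocol()

    store = ViewStore(capacity=100, protocol=OldestFirstProtocol())
    for key, size in [("mapA", 40), ("clipB", 30), ("mapC", 50)]:
        store.put(key, size)
    print(list(store.items))   # ['clipB', 'mapC']: mapA was evicted for room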
Bio: | Sibel Adali is an Assistant Professor at Rensselaer Polytechnic Institute, which she joined in 1996 after receiving her PhD from the University of Maryland. At RPI, she leads the Multimedia Information Integration Lab. Her research focuses on heterogeneous distributed information systems, database interoperability, query optimization, and multimedia information systems. |
23 April 2001, 11:00 AM; DC 1302
Title: | Data (and Links) on the Web (PDF) |
Speaker: | Alberto Mendelzon, University of Toronto |
Abstract: | In the excitement of extending database technology such as semistructured models and languages to the Web, the importance of links between documents has somehow been left behind. We will present two projects that keep links front and center: the WebSQL/WebOQL query languages and the TOPIC system for computing page reputations. WebSQL and WebOQL are query languages intended to serve as high-level tools for building Web-based information systems in the same way that SQL is used for building traditional information systems. TOPIC is a method for analyzing incoming links to a page in order to determine the topics for which the page is best known on the Web. |
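As a toy illustration only (not the TOPIC method itself), one can imagine scoring a page's reputation on a topic from the anchor text of its incoming links; the link data below is invented.

    # Toy reputation score: fraction of incoming links whose anchor text
    # mentions the topic. Invented data, purely illustrative.
    links = [  # (source_page, target_page, anchor_text)
        ("a.example/reviews", "db.example/paper", "great database survey"),
        ("b.example/blog",    "db.example/paper", "databases explained"),
        ("c.example/misc",    "db.example/paper", "holiday photos"),
        ("a.example/reviews", "other.example",    "database tools"),
    ]

    def reputation(page, topic):
        incoming = [anchor for src, dst, anchor in links if dst == page]
        if not incoming:
            return 0.0
        hits = sum(topic in anchor.lower() for anchor in incoming)
        return hits / len(incoming)

    print(reputation("db.example/paper", "database"))   # 0.666...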
Bio: | Alberto Mendelzon was born in Buenos Aires and received his graduate degrees from Princeton University. His research interests are in databases and knowledge bases, including database design theory, query languages, semi-structured data, global information systems, and OLAP. Alberto has chaired or co-chaired the Program Committees of ACM PODS (1991), VLDB (1992), DOOD (1995), WebDb (1998), DBPL (1999), and WWW8 (1999). He is an Associate Editor of the ACM Transactions on Database Systems and has guest-edited issues of the VLDB Journal, Theory and Practice of Object Systems, and Journal of Computer and System Sciences. He is a member of the ACM PODS Executive Board and Information Director for ACM SIGMOD. |
14 May 2001, 11:00 AM
Title: | Data Provenance (PDF) |
Speaker: | Peter Buneman, University of Pennsylvania |
Abstract: |
When you find some data on the Web, are you concerned about how it got there? Most of us have learned not to put much faith in what we find when we casually browse the Web. But if you are a scientist or any kind of scholar, you would like to have confidence in the accuracy and timeliness of the data that you find. In particular, you would like to know how it got there -- its provenance. In bioinformatics there are literally hundreds of public databases. Most of these are not source data: their contents have been extracted from other databases by a process of filtering, transforming, and manual correction and annotation. Thus, describing the provenance of some piece of data is a complex issue. These "curated" databases have enormous added value, yet they often fail to keep an adequate description of the provenance of the data they contain.
In this talk I shall describe some recent work on a Digital Libraries project to investigate data provenance. I shall try to describe the general problem and then deal with some technical issues that have arisen, including (a) a semistructured data model that may help in characterizing data provenance, (b) tracing provenance through database queries, (c) efficient archiving of scientific databases, and (d) keys (canonical identifiers) for semistructured data and XML. Joint work with Sanjeev Khanna and Wang-Chiew Tan. |
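One simple way to picture provenance tracking through a query is to carry a source annotation along with every derived value, as in the hypothetical sketch below; the source names are invented and this is not the project's data model.

    # Illustration: propagate a provenance annotation through a
    # selection/projection, so each output row can say where it came from.
    rows = [
        {"gene": "BRCA1", "organism": "human", "_source": "SwissProt#P38398"},
        {"gene": "abc-1", "organism": "worm",  "_source": "curated-db/rec42"},
    ]

    def select_project(rows, predicate, columns):
        """Keep qualifying rows, keep only the requested columns, and carry
        the provenance annotation along with every derived row."""
        return [
            {**{col: r[col] for col in columns}, "_source": r["_source"]}
            for r in rows if predicate(r)
        ]

    human_genes = select_project(rows, lambda r: r["organism"] == "human", ["gene"])
    print(human_genes)   # [{'gene': 'BRCA1', '_source': 'SwissProt#P38398'}]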
Bio: |
Peter Buneman is Professor of Computer and Information Science at the University of Pennsylvania. His work in computer science has focussed mainly on databases and programming languages, areas in which he has worked on active databases, database semantics, type systems, approximate information, query languages, and on semistructured data and data formats, an area in which he has recently co-authored a book.
He has served on numerous program committees, editorial boards and working groups, and has been program chair for ACM Sigmod, ACM PODS and the International Conference on Database Theory. He was recently elected fellow of the ACM. |
28 May 2001, 11:00 AM
Title: | Combining Data Independence and Realtime in Media Servers (PDF) |
Speaker: | Klaus Meyer-Wegener, Technical University of Dresden |
Abstract: | In order to be flexible and useful, media servers should provide data independence and realtime support. Extensible database systems today offer just the former, media servers just the latter. Going for data independence makes realtime support even harder, since it requires additional mappings and conversions: to play a video in a format different from the one used for storage, realtime conversion is needed. The presentation will introduce the implementation of data independence using so-called converter graphs. The timing of each step in the graph can be described as a jitter-constrained periodic stream, which makes it possible to calculate some characteristics (e.g., buffer size) and supports scheduling. Probabilities of conflict can be used to determine the need for simultaneous playout of the same media stream. Finally, statistical rate-monotonic scheduling would make it possible to introduce a notion of quality, but an integrated model remains to be developed. This talk gives an overview of ongoing work in a project called "Memo.real," which is part of a Collaborative Research Center funded by the Deutsche Forschungsgemeinschaft. The implementation builds on the KANGAROO prototype and the Dresden Realtime Operating System (DROPS). |
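A back-of-the-envelope sketch of one of the calculations mentioned above (a buffer-size estimate for a periodic stream with bounded jitter) is given below; the formula is a simplification assumed for illustration, not the project's jitter-constrained stream model.

    # Rough buffer-size estimate for a periodic frame stream whose arrivals
    # may deviate from the nominal period by a bounded jitter (illustrative).
    import math

    def buffer_frames(period_ms, jitter_ms):
        """Frames the consumer should hold so that an arrival up to
        `jitter_ms` late never stalls playout: one frame in flight plus
        enough frames to bridge the jitter window."""
        return 1 + math.ceil(jitter_ms / period_ms)

    def buffer_bytes(period_ms, jitter_ms, frame_bytes):
        return buffer_frames(period_ms, jitter_ms) * frame_bytes

    # 25 frames/s (40 ms period), up to 100 ms jitter, 64 KB per frame:
    print(buffer_frames(40, 100))                      # 4 frames
    print(buffer_bytes(40, 100, 64 * 1024) // 1024)    # 256 KB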
Bio: | Klaus Meyer-Wegener is a Professor at Dresden University of Technology, where he heads the Database Group. His research interests include multimedia databases, document and workflow management, and digital libraries. |
11 June 2001, 11:00 AM
Title: | Query Optimization via Data Mining |
Speaker: | Jarek Gryz, York University |
Abstract: |
It is almost a trivial observation that the more we know about the data the faster we can access it. This idea was first explored in the early 1980's in semantic query optimization (SQO), a technique that uses semantic information stored in databases as integrity constraints to improve query performance. Some of the ideas developed for SQO have been used commercially, but to the best of our knowledge, no extensive, commercial implementations of SQO exist today.
But SQO can be extended beyond the use of just integrity constraints. We can use data mining to discover patterns in data that can be useful for optimization in the same way integrity constraints are. Indeed, by focusing on the most useful patterns, we can provide better query performance improvements than is possible with traditional SQO. In the first part of this talk, I present an implementation of two SQO techniques, Predicate Introduction and Join Elimination, in DB2 Universal Database. I present the implemented algorithms and performance results using the TPCD and APB-1 OLAP benchmarks. In the second part of the talk, I will describe my current research on developing data mining techniques for the discovery of check constraints and empty joins, and the use of this information for query optimization. |
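As a toy example of one such rewrite, the sketch below drops a join that only re-checks a "discovered" inclusion constraint; the query representation and the constraint are invented, and real join elimination (as in the DB2 implementation mentioned above) involves many more conditions.

    # Toy join elimination driven by a mined inclusion constraint:
    # every orders.cust_id has a matching customers.id, so a join used only
    # to check existence can be dropped.
    inclusion_constraints = {("orders", "cust_id"): ("customers", "id")}

    def eliminate_redundant_join(query):
        """query: {'select': [...], 'from': [...], 'join_on': ((t1,c1),(t2,c2))}.
        Drop the joined table if nothing is selected from it and the join
        merely re-checks a known inclusion constraint."""
        (t1, c1), (t2, c2) = query["join_on"]
        uses_t2 = any(col.startswith(t2 + ".") for col in query["select"])
        if not uses_t2 and inclusion_constraints.get((t1, c1)) == (t2, c2):
            return {"select": query["select"], "from": [t1], "join_on": None}
        return query

    q = {"select": ["orders.id", "orders.total"],
         "from": ["orders", "customers"],
         "join_on": (("orders", "cust_id"), ("customers", "id"))}
    print(eliminate_redundant_join(q))
    # {'select': ['orders.id', 'orders.total'], 'from': ['orders'], 'join_on': None}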
Bio: | Jarek Gryz is an Assistant Professor at York University. His research interests include query optimization via data mining, semantic query caching, and query optimization in parallel database systems. |