Abstracts of Sigmod 1996 Papers.

If an abstract title is underlined, it is linked to the postscript copy of the paper.



1996 ACM Sigmod Conference
June 4-6, 1996
Le Centre Sheraton, Montreal, Canada


CONCEPT AND EXPERIENCE PAPERS:


Mining Quantitative Association Rules in Large Relational Tables

Ramakrishnan Srikant (IBM Almaden Research Center)
Rakesh Agrawal (IBM Almaden Research Center)

We introduce the problem of mining association rules in large relational tables containing both quantitative and categorical attributes. An example of such an association might be ``10% of married people between age 50 and 60 have at least 2 cars''. We deal with quantitative attributes by fine-partitioning the values of the attribute and then combining adjacent partitions as necessary. We introduce measures of partial completeness which quantify the information lost due to partitioning. A direct application of this technique can generate too many similar rules. We tackle this problem by using a ``greater-than-expected-value'' interest measure to identify the interesting rules in the output. We give an algorithm for mining such quantitative association rules. Finally, we describe the results of using this approach on a real-life dataset.


Data Mining Using Two-Dimensional Optimized Association Rules: Scheme, Algorithms, and Visualization

Takeshi Fukuda (IBM Tokyo Research Laboratory)
Yasuhiko Morimoto (IBM Tokyo Research Laboratory)
Shinichi Morishita (IBM Tokyo Research Laboratory)
Takeshi Tokuyama (IBM Tokyo Research Laboratory)

We discuss data mining based on association rules for two numeric attributes and one Boolean attribute. For example, in a database of bank customers, ``Age'' and ``Balance'' are two numeric attributes, and ``CardLoan'' is a Boolean attribute. Taking the pair (Age, Balance) as a point in two-dimensional space, we consider an association rule of the form "((Age, Balance) in R) => (CardLoan = Yes)," which implies that bank customers whose ages and balances fall in a planar region R tend to use card loan with a high probability. We consider two classes of regions, rectangles and admissible (i.e. connected and x-monotone) regions. For each class, we propose efficient algorithms for computing the regions that give optimal association rules for gain, support, and confidence, respectively. We have implemented the algorithms for admissible regions, and constructed a system for visualizing the rules.


IDEA: Interactive Data Exploration and Analysis

Peter G. Selfridge (AT&T Research)
Divesh Srivastava (AT&T Research)
Lynn O. Wilson (AT&T)

The analysis of business data is often an ill-defined task characterized by large amounts of noisy data. Because of this, business data analysis must combine two kinds of intertwined tasks: exploration and analysis. Exploration is the process of finding the appropriate subset of data to analyze, and analysis is the process of measuring the data to provide the business answer. While there are many tools available both for exploration and for analysis, a single tool or set of tools may not provide full support for these intertwined tasks. We report here on a project that set out to understand a specific business data analysis problem and build an environment to support it. The results of this understanding are, first of all, a detailed list of requirements of this task; second, a set of capabilities that meet these requirements; and third, an implemented client-server solution that addresses many of these requirements and identifies others for future work. Our solution incorporates several novel perspectives on data analysis and combines a history mechanism with a graphical, re-usable representation of the analysis and exploration process. Our approach emphasizes using the database itself to represent as many of these functions as possible.


Rapid Bushy Join-order Optimization with Cartesian Products

Bennet Vance (Oregon Graduate Institute)
David Maier (Oregon Graduate Institute)

Query optimizers often limit the search space for join orderings, for example by excluding Cartesian products in subplans or by restricting plan trees to left-deep vines. Such exclusions are widely assumed to reduce optimization effort while minimally affecting plan quality. However, we show that searching the complete space of plans is more affordable than has been previously recognized, and that the common exclusions may be of little benefit.

We start by presenting a Cartesian product optimizer that requires at most a few seconds of workstation time to search the space of bushy plans for products of up to 15 relations. Building on this result, we present a join-order optimizer that achieves a similar level of performance, and retains the ability to include Cartesian products in subplans wherever appropriate. The main contribution of the paper is in fully separating join-order enumeration from predicate analysis, and in showing that the former problem in particular can be solved swiftly by novel implementation techniques. A secondary contribution is to initiate a systematic approach to the benchmarking of join-order optimization, which we apply to the evaluation of our method.


SQL Query Optimization: Reordering for a General Class of Queries

Piyush Goel (IBM Santa Teresa Lab)
Bala Iyer (IBM Santa Teresa Lab)

The strength of commercial query optimizers like DB2 comes from their ability to select an optimal order by generating all equivalent reorderings of binary operators. However, there are no known methods to generate all equivalent reorderings for an SQL query containing joins, outerjoins, and groupby aggregations. Consequently, some of the reorderings with significantly lower cost may be missed. Using a hypergraph model and a set of novel identities, we propose a method to reorder SQL queries containing joins, outer joins, and groupby aggregations. While these operators are sufficient to capture SQL semantics, it is during their reordering that we identify a powerful primitive needed for a dbms. We report our findings of a simple, yet fundamental operator, "generalized selection", and demonstrate its power to solve the problem of reordering of SQL queries containing joins, outer joins, and groupby aggregations.


Fundamental Techniques for Order Optimization

David E. Simmen (IBM Santa Teresa Laboratory)
Eugene J. Shekita (IBM Almaden Research Center)
Timothy Malkemus (IBM Austin)

Decision support applications are growing in popularity as more business data is kept on-line. Such applications typically include complex SQL queries that can test a query optimizer's ability to produce an efficient access plan. Many access plan strategies exploit the physical ordering of data provided by indexes or sorting. Sorting is an expensive operation, however. Therefore, it is imperative that sorting is optimized in some way or avoided all together. Toward that goal, this paper describes novel optimization techniques for pushing down sorts in joins, minimizing the number of sorting columns, and detecting when sorting can be avoided because of predicates, keys, or indexes. A set of fundamental operations is described that provide the foundation for implementing such techniques. The operations exploit data properties that arise from predicate application, uniqueness, and functional dependencies. These operations and techniques have been implemented in IBM's DB2/CS.


A Teradata Content-Based Multimedia Object Manager for Massively Parallel Architectures

William O'Connell (Bell Laboratories)
Tim Ieong (NCR Corporation)
David Schrader (NCR Corporation)
Cam Watson (NCR Corporation)
Grace Au (NCR Corporation)
Alexandros Biliris (AT&T Research)
Susan Choo (NCR Corporation)
Pierre Colin (NCR Corporation)
Glenn Linderman (NCR Corporation)
Ethimios Panagos (AT&T Research)
June Wang (NCR Corporation)
Todd Walter (NCR Corporation)

The Teradata Multimedia Object Manager is a general-purpose content analysis multimedia server designed for symmetric multiprocessing and massively parallel processing environments. The Multimedia Object Manager defines and manipulates user-defined functions (UDFs), which are invoked in parallel to analyze or manipulate the contents of multimedia objects. Several computationally intensive applications of this technology, which use large persistent datasets, include fingerprint matching, signature verification, face recognition, and speech recognition/translation.


Fault-tolerant Architectures for Continuous Media Servers

Banu Ozden (Bell Laboratories)
Rajeev Rastogi (Bell Laboratories)
Prasant Shenoy (University of Texas at Austin)
Avi Silberschatz (Bell Laboratories)

Continuous media servers that provide support for the storage and retrieval of continuous media data (e.g., video, audio) at guaranteed rates are becoming increasingly important. Such servers, typically, rely on several disks to service a large number of clients, and are thus highly susceptible to disk failure. The challenge is to devise schemes that continue to retrieve data from disks at the required rate even if a certain disk were to fail. In this paper, we present two fault-tolerant approaches that rely on admission control in order to meet rate guarantees for continuous media requests. For both approaches, we present data placement strategies and admission control algorithms. We also present design techniques for maximizing the number of clients that can be supported by a continuous media server. Finally, through extensive simulations, we demonstrate the effectiveness of our schemes.


Optimizing Queries over Multimedia Repositories

Surajit Chaudhuri (HP Laboratories)
Luis Gravano (HP Laboratories and Stanford University)

Repositories of multimedia objects having multiple types of attributes (e.g., image, text) are becoming increasingly common. A selection on these attributes will typically produce not just a set of objects, as in the traditional relational query model (filtering), but also a grade of match associated with each object, indicating how well the object matches the selection condition (ranking). Also, multimedia repositories may allow access to the attributes of each object only through indexes. We investigate how to optimize the processing of queries over multimedia repositories. A key issue is the choice of the indexes used to search the repository. We define an execution space that is search-minimal, i.e., the set of indexes searched is minimal. Although the general problem of picking an optimal plan in the search-minimal execution space is NP-hard, we solve the problem efficiently when the predicates in the query are independent. We also show that the problem of optimizing queries that ask for a few top-ranked objects can be viewed, in many cases, as that of evaluating selection conditions. Thus, both problems can be viewed together as an extended filtering problem.


BIRCH: An Efficient Data Clustering Method for Very Large Databases

Tian Zhang (University of Wisconsin, Madison)
Raghu Ramakrishnan (University of Wisconsin, Madison)
Miron Livny (University of Wisconsin, Madison)

Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle ``noise'' (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparisons of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.


On-line Reorganization of Sparsely-populated B+trees

Chendong Zou (Northeastern University)
Betty Salzberg (Northeastern University)

In this paper, we present an efficient method to do on-line reorganization of sparsely-populated B+trees. It reorganizes the leaves first, compacting in short operations groups of leaves with the same parent. After compacting, optionally, the new leaves may swap locations or be moved into empty pages so that they are in key order on the disk. After the leaves are reorganized, the method shrinks the tree by making a copy of the upper part of the tree while leaving the leaves in place. A new concurrency method is introduced so that only a minimum number of pages are locked during reorganization. During leaf reorganization, Forward Recovery is used to save all work already done while maintaining consistency after system crashes. A heuristic algorithm is developed to reduce the number of swaps needed during leaf reorganization, so that better concurrency and easier recovery can be achieved. A detailed description of switching from the old B+tree to the new B+tree is described for the first time.


Two Techniques for On-Line Index Modification in Shared Nothing Parallel Databases

Kiran J. Achyutuni (Informix Software)
Edward Omiecinski (Georgia Tech)
Shamkant B. Navathe (Georgia Tech)

Whenever data is moved across nodes in the parallel database system, the indexes need to be modified too. Index modification overhead can be quite severe because there can be a large number of indexes on a relation. In this paper, we study two alternatives to index modification, namely OAT (One-At-a-Time page movement) and BULK (bulk page movement). OAT and BULK are two extremes on the spectrum of the granularity of data movement. OAT and BULK differ in two respects: first, OAT uses very little additional disk space (at most one extra page), whereas BULK uses a large amount of disk space. Second, BULK uses sequential prefetch I/O to optimize on the number of I/Os during index modification, while OAT does not. Using an experimental testbed, we show that BULK is an order of magnitude faster than OAT. In terms of the impact on transaction performance during reorganization, BULK and OAT perform differently: when the number of indexes to be modified is either one or two, OAT has a lesser impact on the transaction performance degradation. However, when the number of indexes is greater than two, both techniques have the same impact on transaction performance.


Query Caching and Optimization in Distributed Mediator Systems

Sibel Adali (University of Maryland)
Kasim Selcuk Candan (University of Maryland)
Yannis Papakonstantinou (Stanford University)
V.S. Subrahmanian (University of Maryland)

Query processing and optimization in mediator systems that access distributed non-proprietary sources pose many novel problems. Cost-based query optimization is hard because the mediator does not have access to source statistics information and furthermore it may not be easy to model the source's performance. At the same time, querying remote sources may be very expensive because of high connection overhead, long computation time, financial charges, and temporary unavailability. We propose a cost-based optimization technique that caches statistics of actual calls to the sources and consequently estimates the cost of the possible execution plans based on the statistics cache. We investigate issues pertaining to the design of the statistics cache and experimentally analyze various tradeoffs. We also present a query result caching mechanism that allows us to effectively use results of prior queries when the source is not readily available. We employ the novel ``invariants'' mechanism, which shows how semantic information about data sources may be used to discover cached query results of interest.


Performance Tradeoffs for Client-Server Query Processing

Michael J. Franklin (University of Maryland)
Bjorn Thor Jonsson (University of Maryland)
Donald Kossmann (University of Maryland)

The construction of high-performance database systems that combine the best aspects of the relational and object-oriented approaches requires the design of client-server architectures that can fully exploit client and server resources in a flexible manner. The two predominant paradigms for query execution in client-server database systems are data-shipping and query-shipping. In this paper, we first define these policies in terms of the restrictions they place on operator site selection during query optimization. We then investigate the performance tradeoffs among them for bulk query processing. While each strategy has advantages, neither one on its own is efficient across a wide range of circumstances. We describe and evaluate a more flexible policy called hybrid-shipping, which can execute queries at clients, servers, or any combination of the two. Hybrid-shipping is shown to at least match the best of the two ``pure'' policies, and in some situations, to perform better than both. The implementation of hybrid-shipping raises a number of difficult problems, such as an increase in the search space for query optimization and in the number of system and data distribution parameters on which the quality of a query plan depends. We describe an initial investigation into the use of a 2-step query optimization strategy as a way of addressing these issues.


Data Access for the Masses through OLE DB

Jose A. Blakeley (Microsoft Corporation)

This paper presents an overview of OLE DB, a set of interfaces being developed at Microsoft whose goal is to enable applications to have uniform access to data stored in DBMS and non-DBMS information containers. Applications will be able to take advantage of the benefits of database technology without having to transfer data from its place of origin to a DBMS. Our approach consists of defining an open, extensible collection of interfaces that factor and encapsulate orthogonal, reusable portions of DBMS functionality. These interfaces define the boundaries of DBMS components such as record containers, query processors, and transaction coordinators that enable uniform, transactional access to data among such components. The proposed interfaces extend Microsoft's OLE/COM object services framework with database functionality, hence these interfaces are collectively referred to as OLE DB. The OLE DB functional areas include data access and updates (rowsets), query processing, schema information, notifications, transactions, security, and access to remote data. In a sense, OLE DB represents an effort to bring database technology to the masses. This paper presents an overview of the OLE DB approach and its areas of componentization.


The Dangers of Replication and a Solution

Jim Gray (Microsoft)
Pat Helland (Microsoft)
Patrick O'Neil (University of Massachusetts)
Dennis Shasha (New York University)

Update anywhere-anytime-anyway transactional replication has unstable behavior as the workload scales up: a ten-fold increase in nodes and traffic gives a thousand fold increase in deadlocks or reconciliations. Master copy replication (primary copy) schemes reduce this problem. A simple analytic model demonstrates these results. A new two-tier replication algorithm is proposed that allows mobile (disconnected) applications to propose tentative update transactions that are later applied to a master copy. Commutative update transactions avoid the instability of other replication schemes.


Hot mirroring: A method of hiding parity update penalty and degradation during rebuilds for RAID5

Kazuhiko Mogi
Masaru Kitsuregawa (Univ. of Tokyo)

This paper proposes a storage management scheme for disk arrays, named hot mirroring. In this scheme, storage space is partitioned into two regions. One is the mirrored region, which is characterized by high performance and low storage efficiency. The other is the RAID5 region, which is characterized by low performance and high storage efficiency. Hot data blocks are stored in the former area, while cold blocks are stored in the latter. In addition, mirrored pairs and RAID5 stripes are orthogonally laid out, through which the performance degradation during rebuilding is minimized. Hot block clustering in hot mirroring achieves higher performance than conventional RAID5 arrays. The potential of hot mirroring is examined through extensive simulation.


Random I/O Scheduling in Online Tertiary Storage Systems

Bruce K. Hillyer (Bell Laboratories)
Avi Silberschatz (Bell Laboratories)

New database applications that require the storage and retrieval of many terabytes of data are reaching the limits for disk-based storage systems, in terms of both cost and scalability. These limits provide a strong incentive for the development of databases that augment disk storage with technologies better suited to large volumes of data. In particular, the seamless incorporation of tape storage into database systems would be of great value. Tape storage is two orders of magnitude more efficient than disk in terms of cost per terabyte and physical volume per terabyte; however, a key problem is that the random access latency of tape is three to four orders of magnitude slower than disk. Thus, to incorporate a tape bulk store in an online storage system, the problem of tape access latency must be solved. One approach to reducing the latency is careful I/O scheduling. The focus of this paper is on efficient random I/O scheduling for tape drives that use a serpentine track layout, such as the Quantum DLT and the IBM 3480 and 3590. For serpentine tape, I/O scheduling is problematic because of the complex relationships between logical block numbers, their physical positions on tape, and the time required for tape positioning between these physical positions. The results in this paper show that our scheduling schemes provide a significant improvement in the latency of random access to serpentine tape.


Implementing Data Cubes Efficiently

Venky Harinarayan (Stanford University)
Anand Rajaraman (Stanford University)
Jeffrey D. Ullman (Stanford University)

Decision support applications involve complex queries on very large databases. Since response times should be small, query optimization is critical. Users typically view the data as multidimensional data cubes. Each cell of the data cube is a view consisting of an aggregation of interest, like total sales. The values of many of these cells are dependent on the values of other cells in the data cube. A common and powerful query optimization technique is to materialize some or all of these cells rather than compute them from raw data each time. Commercial systems differ mainly in their approach to materializing the data cube. In this paper, we investigate the issue of which cells (views) to materialize when it is too expensive to materialize all views. A lattice framework is used to express dependencies among views. We present greedy algorithms that work off this lattice and determine a good set of views to materialize. The greedy algorithm performs within a small constant factor of optimal under a variety of models. We then consider the most common case of the hypercube lattice and examine the choice of materialized views for hypercubes in detail, giving some good tradeoffs between the space used and the average time to answer a query.


Providing Better Support for a Class of Decision Support Queries

Sudhir G. Rao (Indiana University)
Antonio Badia (Indiana University)
Dirk Van Gucht (Indiana University)

Relational database systems do not effectively support complex queries containing quantifiers (quantified queries) that are increasingly becoming important in decision support applications. Generalized quantifiers provide an effective way of expressing such queries naturally. In this paper, we consider the problem of processing quantified queries within the generalized quantifier framework. We demonstrate that current relational systems are ill-equipped, both at the language and at the query processing level, to deal with such queries. We also provide insights into the intrinsic difficulties associated with processing such queries. We then describe the implementation of a quantified query processor, Q^2P that is based on multidimensional and boolean matrix structures. We provide results of performance experiments run on Q^2P that demonstrate superior performance on quantified queries. Our results indicate that it is feasible to augment relational systems with query subsystems like Q^2P for significant performance benefits for quantified queries in decision support applications.


A Query Language for Multidimensional Arrays: Design, Implementation, and Optimization Techniques

Leonid Libkin (Bell Laboratories)
Rona Machlin (Univ. of Pennsylvania)
Limsoon Wong (Institute of Systems Science)

While much recent research has focussed on extending databases beyond the traditional relational model, relatively little has been done to develop database tools for querying data organized in (multidimensional) arrays. The scientific computing community has made little use of available database technology. Instead, multidimensional scientific data is typically stored in local files conforming to various data exchange formats and queried via specialized access libraries tied in to general purpose programming languages. To allow such data to be queried using known database techniques, we design and implement a query language for multidimensional arrays. Our main design decision is to treat arrays as functions from index sets to values rather than as collection types. This leads to clean syntax and semantics as well as simple but powerful optimization rules. We present a calculus for arrays that extends standard calculi for complex objects. We derive a higher-level comprehension style query language based on this calculus and describe its implementation, including a data driver for the NetCDF data exchange format. Next, we explore some optimization rules obtained from the equational laws of our core calculus. Finally, we study the expressiveness of our calculus and prove that it essentially corresponds to adding ranking to a query language for complex objects.


A Super Scalar Sort Algorithm for RISC Processors

Ramesh C. Agarwal (IBM Watson Research Center)

The compare and branch sequences required in a traditional sort algorithm can not efficiently exploit multiple execution units present in currently available high performance RISC processors. This is because of the long latency of the compare instructions and the sequential algorithm used in sorting. With the increased level of integration on a chip, this trend is expected to continue. We have developed new sort algorithms which eliminate almost all the compares, provide functional parallelism which can be exploited by multiple execution units, significantly reduce the number of passes through keys, and improve data locality. These new algorithms outperform traditional sort algorithms by a large factor. For the Datamation disk to disk sort benchmark (one million 100-byte records), at SIGMOD'94, Chris Nyberg et al presented several new performance records using DEC alpha processor based systems. We have implemented the Datamation sort benchmark using our new sort algorithm on a desktop IBM RS/6000 model 39H (66.6 MHz) with 8 IBM SSA 7133 disk drives (total cost $73K). The total elapsed time for the 100 MB sort was 5.1 seconds (vs the old uni-processor record of 9.1 seconds). We have also established a new price performance record (0.2› vs the old record of 0.9› as the cost of the sort). The entire sort processing was overlapped with I/O. During the read phase, we achieved a sustained BW of 47 MB/sec and during the write phase, we achieved a sustained BW of 39 MB/sec. Key extraction and sorting of one million 10-byte keys took only 0.6 second of CPU time. The rest of the CPU time was used in moving records, servicing I/O, and other overheads. Algorithmic details leading to this level of performance are described in this paper. A detailed analysis of the CPU time spent during various phases of the sort algorithm and I/O is also provided.


Spatial Hash-Joins

Ming-Ling Lo (University of Michigan)
Chinya V. Ravishankar (University of Michigan)

We examine how to apply the hash-join paradigm to spatial joins, and define a new framework for spatial hash-joins. Our spatial partition functions have two components: a set of bucket extents and an assignment function, which may map a data item into multiple buckets. Furthermore, the partition functions for the two input datasets may be different. We have designed and tested a spatial hash-join method based on this framework. The partition function for the inner dataset is initialized by sampling the dataset, and evolves as data are inserted. The partition function for the outer dataset is immutable, but may replicate a data item from the outer dataset into multiple buckets. The method mirrors relational hash-joins in other aspects. Our method needs no pre-computed indices. It is therefore applicable to a wide range of spatial joins. Our experiments show that our method outperforms current spatial join algorithms based on tree matching by a wide margin. Further, its performance is superior even when the tree-based methods have pre-computed indices. This makes the spatial hash-join method highly competitive both when the input datasets are dynamically generated and when the datasets have pre-computed indices.


Partition Based Spatial-Merge Join

Jignesh M. Patel (University of Wisconsin, Madison)
David J. DeWitt (University of Wisconsin, Madison)

This paper describes PBSM (Partition Based Spatial-Merge), a new algorithm for performing spatial join operation. This algorithm is especially effective when neither of the inputs to the join have an index on the joining attribute. Such a situation could arise if both inputs to the join are intermediate results in a complex query, or in a parallel environment where the inputs must be dynamically redistributed. The PBSM algorithm partitions the inputs into manageable chunks, and joins them using a computational geometry based plane-sweeping technique. This paper also presents a performance study comparing the the traditional indexed nested loops join algorithm, a spatial join algorithm based on joining spatial indices, and the PBSM algorithm. These comparisons are based on complete implementations of these algorithms in Paradise, a database system for handling GIS applications. Using real data sets, the performance study examines the behavior of these spatial join algorithms in a variety of situations, including the cases when both, one, or none of the inputs to the join have an suitable index. The study also examines the effect of clustering the join inputs on the performance of these join algorithms. The performance comparisons demonstrates the feasibility, and applicability of the PBSM join algorithm.


Bifocal Sampling for Skew-Resistant Join Size Estimation

Sumit Ganguly (Rutgers University)
Phillip B. Gibbons (Bell Laboratories)
Yossi Matias (Bell Laboratories)
Avi Silberschatz (Bell Laboratories)

This paper introduces bifocal sampling, a new technique for estimating the size of an equi-join of two relations. Bifocal sampling classifies tuples in each relation into two groups, sparse and dense, based on the number of tuples with the same join value. Distinct estimation procedures are employed that focus on various combinations for joining tuples (e.g., for estimating the number of joining tuples that are dense in both relations). This combination of estimation procedures overcomes some well-known problems in previous schemes, enabling good estimates with no a priori knowledge about the data distribution. The estimate obtained by the bifocal sampling algorithm is proven to lie with high probability within a small constant factor of the actual join size, regardless of the skew, as long as the join size is Omega(n log n), for relations consisting of n tuples. The algorithm requires a sample of size at most O(sqrt{n} log n). By contrast, previous algorithms using a sample of similar size may require the join size to be Omega(n \sqrt{n}) to guarantee an accurate estimate. Experimental results support the theoretical claims and show that bifocal sampling is practical and effective.


Estimating Alphanumeric Selectivity in the Presence of Wildcards

P. Krishnan (Bell Laboratories)
Jeffrey Scott Vitter (Duke University)
Bala Iyer (IBM Santa Teresa Laboratory)

Success of commercial query optimizers and database management systems (object-oriented or relational) depend on accurate cost estimation of various query reorderings [BGI]. Estimating predicate selectivity, or the fraction of rows in a database that satisfy a selection predicate, is key to determining the optimal join order. Previous work has concentrated on estimating selectivity for numeric fields [ASW, HaSa, IoP, LNS, SAC, WVT]. With the popularity of textual data being stored in databases, it has become important to estimate selectivity accurately for alphanumeric fields. A particularly problematic predicate used against alphanumeric fields is the SQL LIKE predicate [Dat]. Techniques used for estimating numeric selectivity are not suited for estimating alphanumeric selectivity. In this paper, we study for the first time the problem of estimating alphanumeric selectivity in the presence of wildcards. Based on the intuition that the model built by a data compressor on an input text encapsulates information about common substrings in the text, we develop a technique based on the suffix tree data structure to estimate alphanumeric selectivity. In a statistics generation pass over the database, we construct a compact suffix tree-based structure from the columns of the database. We then look at three families of methods that utilize this structure to estimate selectivity during query plan costing, when a query with predicates on alphanumeric attributes contains wildcards in the predicate. We evaluate our methods empirically in the context of the TPC-D benchmark. We study our methods experimentally against a variety of query patterns and identify five techniques that hold promise.


Improved Histograms for Selectivity Estimation of Range Predicates

Viswanath Poosala (University of Wisconsin, Madison
Yannis E. Ioannidis (University of Wisconsin, Madison
Peter J. Haas (IBM Almaden Research Center)
Eugene J. Shekita (IBM Almaden Research Center)

Many commercial database systems maintain histograms to summarize the contents of relations and permit efficient estimation of query result sizes and access plan costs. Although several types of histograms have been proposed in the past, there has never been a systematic study of all histogram aspects, the available choices for each aspect, and the impact of such choices on histogram effectiveness. In this paper, we provide a taxonomy of histograms that captures all previously proposed histogram types and indicates many new possibilities. We introduce novel choices for several of the taxonomy dimensions, and derive new histogram types by combining choices in effective ways. We also show how sampling techniques can be used to reduce the cost of histogram construction. Finally, we present results from an empirical study of the proposed histogram types used in selectivity estimation of range predicates and identify the histogram types that have the best overall performance.


Structures for Manipulating Proposed Updates In Object-Oriented Databases

Michael Doherty (University of Colorado)
Richard Hull (University of Colorado)
Mohammed Rupawalla (University of Colorado)

Support for virtual states and deltas between them is useful for a variety of database applications, including hypothetical database access, version management, simulation, and active databases. The Heraclitus paradigm elevates delta values to be "first-class citizens" in database programming languages, so that they can be explicitly created, accessed and manipulated. A fundamental issue concerns the trade-off between the "accuracy" or "robustness" of a form of delta representation, and the ease of access and manipulation of that form. At one end of the spectrum, code-blocks could be used to represent delta values, resulting in a more accurate capture of the intended meaning of a proposed update, at the cost of more expensive access and manipulation. In the context of object-oriented databases, another point on the spectrum is "attribute-granularity" deltas which store the net changes to each modified attribute value of modified objects. This paper introduces a comprehensive framework for specifying a broad array of forms for representing deltas for complex value types (tuple, set, bag, list, o-set and dictionary). In general, the granularity of such deltas can be arbitrarily deep within the complex value structure. Applications of this framework in connection with hypothetical access to, and "merging" of, proposed updates are discussed.


Safe and Efficient Sharing of Persistent Objects in Thor

Barbara Liskov (MIT)
Atul Adya (MIT)
Miguel Castro (MIT)
Mark Day (Lotus)
Sanjay Ghemawat (Digital Systems Research Lab.)
Robert Gruber (AT&T Research)
Umesh Maheshwari (MIT)
Andrew C. Myers (MIT)
Liuba Shrira (MIT)

Thor is an object-oriented database system designed for use in a heterogeneous distributed environment. It provides highly-reliable and highly-available persistent storage for objects, and supports safe sharing of these objects by applications written in different programming languages. Safe heterogeneous sharing of long-lived objects requires encapsulation: the system must guarantee that applications interact with objects only by invoking methods. Although safety concerns are important, most object-oriented databases forgo safety to avoid paying the associated performance costs. This paper gives an overview of Thor's design and implementation. We focus on two areas that set Thor apart from other object-oriented databases. First, we discuss safe sharing and techniques for ensuring it; we also discuss ways of improving application performance without sacrificing safety. Second, we describe our approach to cache management at client machines, including a novel adaptive prefetching strategy. The paper presents performance results for Thor, on several OO7 benchmark traversals. The results show that adaptive prefetching is very effective, improving both the elapsed time of traversals and the amount of space used in the client cache. The results also show that the cost of safe sharing can be negligible; thus it is possible to have both safety and high performance.


An Open Abstract-Object Storage System

Stephen Blott (ETH, Zurich)
Lukas Relly (ETH, Zurich)
Hans-Joerg Schek (ETH, Zurich)

Database systems must become more open to retain their relevance as a technology of choice and necessity. Openness implies not only databases exporting their data, but also exporting their services. This is as true in classical application areas as in non-classical (GIS, multimedia, design, etc). This paper addresses the problem of exporting storage-management services of indexing, replication and basic query processing. We describe an abstract-object storage model which provides the basic mechanism, `likeness', through which these services are applied uniformly to internally-stored, internally-defined data, and to externally-stored, externally-defined data. Managing external data requires the coupling of external operations to the database system. We discuss the interfaces and protocols required of these to achieve correct resource management and admit efficient realisation. Throughout, we demonstrate our solutions in the area of semi-structured file management; in our case, geospatial metadata files.


Static Detection of Security Flaws in Object-Oriented Databases

Keishi Tajima (Kyoto University)

Access control in function granularity is one of the features of many object-oriented databases. In those systems, the users are granted rights to invoke composed functions instead of rights to invoke primitive operations. Although primitive operations are invoked inside composed functions, the users can invoke them only through the granted functions. This achieves access control in abstract operation level. Access control utilizing encapsulated functions, however, easily causes many ``security flaws'' through which malicious users can bypass the encapsulation and can abuse the primitive operations inside the functions. In this paper, we develop a technique to statically detect such security flaws. First, we design a framework to describe security requirements that should be satisfied. Then, we develop an algorithm that syntactically analyzes program code of the functions and determines whether given security requirements are satisfied or not. This algorithm is sound, that is, whenever there is a security flaw, it detects it. ==============


Goal-Oriented Buffer Management Revisited

Kurt P. Brown (64k Inc., San Jose)
Michael J. Carey (IBM Almaden Research Center)
Miron Livny (University of Wisconsin, Madison)

In this paper we revisit the problem of achieving multi-class workload response time goals by automatically adjusting the buffer memory allocations of each workload class. We discuss the virtues and limitations of previous work with respect to a set of criteria we lay out for judging the success of any goal-oriented resource allocation algorithm. We then introduce the concept of hit rate concavity and develop a new goal-oriented buffer allocation algorithm, called Class Fencing, that is based on this concept. Exploiting the notion of hit rate concavity results in an algorithm that not only is as accurate and stable as our previous work, but also more responsive, more robust, and simpler to implement.


Multi-dimensional Resource Scheduling for Parallel Queries

Minos N. Garofalakis (University of Wisconsin, Madison
Yannis E. Ioannidis (University of Wisconsin, Madison

Scheduling query execution plans is an important component of query optimization in parallel database systems. The problem is particularly complex in a shared-nothing execution environment, where each system node represents a collection of time-shareable resources (e.g., CPU(s), disk(s), etc.) and communicates with other nodes only by message-passing. Significant research effort has concentrated on only a subset of the various forms of intra-query parallelism so that scheduling and synchronization is simplified. In addition, most previous work has focused its attention on one-dimensional models of parallel query scheduling, effectively ignoring the potential benefits of resource sharing. In this paper, we develop an approach that is more general in both directions, capturing all forms of intra-query parallelism and exploiting sharing of multi-dimensional resource nodes among concurrent plan operators. This allows scheduling a set of independent query tasks (i.e., operator pipelines) to be seen as an instance of the multi-dimensional bin-design problem. Using a novel quantification of coarse grain parallelism, we present a list scheduling heuristic algorithm that is provably near-optimal in the class of coarse grain parallel executions (with a worst-case performance ratio that depends on the number of resources per node and the granularity parameter). We then extend this algorithm to handle the operator precedence constraints in a bushy query plan by splitting the execution of the plan into synchronized phases. Preliminary performance results confirm the effectiveness of our scheduling algorithm compared both to previous approaches and the optimal solution. Finally, we present a technique that allows us to relax the coarse granularity restriction and obtain a list scheduling method that is provably near-optimal in the space of all possible parallel schedules.


Semi-automatic, Self-adaptive Control of Garbage Collection Rates in Object Databases.

Jonathan E. Cook (University of Colorado)
Artur W. Klauser (University of Colorado)
Alexander L. Wolf (University of Colorado)
Benjamin G. Zorn (University of Colorado)

A fundamental problem in automating object database storage reclamation is determining how often to perform garbage collection. We show that the choice of collection rate can have a significant impact on application performance and that the best rate depends on the dynamic behavior of the application, tempered by the particular performance goals of the user. We describe two semi-automatic, self-adaptive policies for controlling collection rate that we have developed to address the problem. Using trace-driven simulations, we evaluate the performance of the policies on a test database application that demonstrates two distinct reclustering behaviors. Our results show that the policies are effective at achieving user-specified levels of I/O operations and database garbage percentage. We also investigate the sensitivity of the policies over a range of object connectivities. The evaluation demonstrates that semi-automatic, self-adaptive policies are a practical means for flexibly controlling garbage collection rate.


Towards Effective and Efficient Free Space Management

Mark L. McAuliffe (University of Wisconsin, Madison)
Michael J. Carey (IBM Almaden Research Center)
Marvin H. Solomon (University of Wisconsin, Madison)

An important problem faced by many database management systems is the "online object placement problem" - the problem of choosing a disk page to hold a newly allocated object. In the absence of clustering criteria, the goal is to maximize storage utilization. For main-memory based systems, simple heuristics exist that provide reasonable space utilization in the worst case and excellent utilization in typical cases. However, the storage management problem for databases includes significant additional challenges, such as minimizing I/O traffic, coping with crash recovery, and gracefully integrating space management with locking and logging. We survey several object placement algorithms, including techniques that can be found in commercial and research database systems. We then present a new object placement algorithm that we have designed for use in Shore, an object-oriented database system under development at the University of Wisconsin---Madison. Finally, we present results from a series of experiments involving actual Shore implementations of some of these algorithms. Our results show that while current object placement algorithms have serious performance deficiencies, including excessive CPU or main memory overhead, I/O traffic, or poor disk utilization, our new algorithm consistently demonstrates excellent performance in all of these areas.


Rule Languages and Internal Algebras for Rule-Based Optimizers

Mitch Cherniack (Brown University)
Stanley B. Zdonik (Brown University)

Rule-based optimizers and optimizer generators use rules to specify query transformations. Rules act directly on query representations, which typically are based on query algebras. But most algebras complicate rule formulation, and rules over these algebras must often resort to calling to externally defined bodies of code. Code makes rules difficult to formulate, prove correct and reason about, and therefore compromises the effectiveness of rule-based systems. In this paper we present KOLA; a combinator-based algebra designed to simplify rule formulation. KOLA is not a user language, and KOLA's variable-free queries are difficult for humans to read. But KOLA is an effective internal algebra because its combinator-style makes queries manipulable and structurally revealing. As a result, rules over KOLA queries are easily expressed without the need for supplemental code. We illustrate this point, first by showing some transformations that despite their simplicity, require head and body routines when expressed over algebras that include variables. We show that these transformations are expressible without supplemental routines in KOLA. We then show complex transformations of a class of nested queries expressed over KOLA. Nested query optimizations, while having been studied before, have seriously challenged the rule-based paradigm.


Evaluating Queries with Generalized Path Expressions

Vassilis Christophides (INRIA)
Sophie Cluet (INRIA)
Guido Moerkotte (Aachen University)

In the past few years, query languages featuring generalized path expressions have been proposed. These languages allow the interrogation of both schema and data. They are powerful and essential for a number of applications. However, until now, their evaluation relied on a rather naive and inefficient algorithm. In this paper, we extend an object algebra with two new operators and present some interesting rewriting techniques for queries featuring generalized path expressions. We also show how a query optimizer can integrate the new techniques.


Query Processing Techniques for Caching Expensive Methods.

Joseph M. Hellerstein (University of California, Berkeley)
Jeffrey F. Naughton (University of Wisconsin, Madison)

Object-Relational and Object-Oriented DBMSs allow users to invoke time-consuming (``expensive'') methods in their queries. When queries containing these expensive methods are run on data with duplicate values, time is wasted redundantly computing methods on the same value. This problem has been studied in the context of programming languages, where ``memoization'' is the standard solution. In the database literature, sorting has been proposed to deal with this problem. We compare these approaches along with a third solution, a variant of unary hybrid hashing which we call Hybrid Cache. We demonstrate that Hybrid Cache always dominates memoization, and significantly outperforms sorting in many instances. This provides new insights into the tradeoff between hashing and sorting for unary operations. Additionally, our Hybrid Cache algorithm includes some new optimizations for unary hybrid hashing, which can be used for other applications such as grouping and duplicate elimination. We conclude with a discussion of techniques for caching multiple expensive methods in a single query, and raise some new optimization problems in choosing caching techniques.


Cost-Based Optimization for Magic: Algebra and Implementation

Praveen Seshadri (University of Wisconsin, Madison)
Joseph M. Hellerstein (University of California, Berkeley)
Hamid Pirahesh (IBM Almaden Research Center)
T.Y. Cliff Leung (IBM Santa Teresa Laboratory)
Raghu Ramakrishnan (University of Wisconsin, Madison)
Divesh Srivastava (AT&T Research)
Peter J. Stuckey (University of Melbourne)
S. Sudarshan (Indian Institute of Technology)

Magic sets rewriting is a well-known optimization heuristic for complex decision-support queries. There can be many variants of this rewriting even for a single query, which differ greatly in execution performance. We propose cost-based techniques for selecting an efficient variant from the many choices. Our first contribution is a practical scheme that models magic sets rewriting as a special join method that can be added to any cost-based query optimizer. We derive cost formulas that allow an optimizer to choose the best variant of the rewriting and to decide whether it is beneficial. The order of complexity of the optimization process is preserved by limiting the search space in a reasonable manner. We have implemented this technique in IBM's DB2 C/S V2 database system. Our performance measurements demonstrate that the cost-based magic optimization technique performs well, and that without it, several poor decisions could be made. Our second contribution is a formal algebraic model of magic sets rewriting, based on an extension of the multiset relational algebra, which cleanly defines the search space and can be used in a rule-based optimizer. We introduce the multiset theta-semijoin operator, and derive equivalence rules involving this operator. We demonstrate that magic sets rewriting for non-recursive SQL queries can be modeled as a sequential composition of these equivalence rules.


Materialized View Maintenance and Integrity Constraint Checking: Trading Space for Time

Kenneth A. Ross (Columbia University)
Divesh Srivastava (AT&T Research)
S. Sudarshan (Indian Institute of Technology)

We investigate the problem of incremental maintenance of an SQL view in the face of database updates, and show that it is possible to reduce the total time cost of view maintenance by materializing (and maintaining) additional views. We formulate the problem of determining the optimal set of additional views to materialize as an optimization problem over the space of possible view sets (which includes the empty set). The optimization problem is harder than query optimization since it has to deal with multiple view sets, updates of multiple relations, and multiple ways of maintaining each view set for each updated relation. We develop a memoing solution for the problem; the solution can be implemented using the expression DAG representation used in rule-based optimizers such as Volcano. We demonstrate that global optimization cannot, in general, be achieved by locally optimizing each materialized subview, because common subexpressions between different materialized subviews can allow nonoptimal local plans to be combined into an optimal global plan. We identify conditions on materialized subviews in the expression DAG when local optimization is possible. Finally, we suggest heuristics that can be used to efficiently determine a useful set of additional views to materialize. Our results are particularly important for the efficient checking of assertions (complex integrity constraints) in the SQL-92 standard, since the incremental checking of such integrity constraints is known to be essentially equivalent to the view maintenance problem.


Maintaining Database Consistency in Presence of Value Dependencies in Multidatabase Systems

Claire Morpain (LSI Laboratory)
Michele Cart (LSI Laboratory)
Jean Ferrie (LSI Laboratory)
Jean-Francois Pons (LSI Laboratory)

The emergence of new criteria specifically adapted to multidatabase systems, in response to constraints imposed by global serializability, leads to restrictive hypotheses in order to ensure correctness of executions. This is the case with the two level serializability presented in [6], that ensures strongly correct executions if transaction programs are Local Database Preserving (LDP). The main drawback of the LDP hypothesis is that it relies on rigorous programming. The principal objective of this paper has been to suppress this drawback while conserving the strong correctness of 2LSR executions. We propose defining precisely the notion of value dependencies, and managing them so as not to impose the LDP property.


Algorithms for Deferred View Maintenance

Latha Colby (Bell Laboratories)
Timothy Griffin(Bell Laboratories)
Leonid Libkin(Bell Laboratories)
Inderpal Singh Mumick(Bell Laboratories)
Howard Trickey (Bell Laboratories)

Materialized views and view maintenance are important for data warehouses, retailing, banking, and billing applications. We consider two related view maintenance problems: 1) how to maintain views after the base tables have already been modified, and 2) how to minimize the time for which the view is inaccessible during maintenance.

Typically, a view is maintained immediately, as a part of the transaction that updates the base tables. Immediate maintenance imposes a significant overhead on update transactions that cannot be tolerated in many applications. In contrast, deferred maintenance allows a view to become inconsistent with its definition. A refresh operation is used to reestablish consistency. We present new algorithms to incrementally refresh a view during deferred maintenance. Our algorithms avoid a state bug that has artificially limited techniques previously used for deferred maintenance.

Incremental deferred view maintenance requires auxiliary tables that contain information recorded since the last view refresh. We present three scenarios for the use of auxiliary tables and show how these impact per-transaction overhead and view refresh time. Each scenario is described by an invariant that is required to hold in all database states. We then show that, with the proper choice of auxiliary tables, it is possible to lower both per-transaction overhead and view refresh time.


A framework for supporting data integration using the materialized and virtual approaches

Richard Hull (University of Colorado)
Gang Zhou (University of Colorado)

This paper presents a framework for data integration currently under development in the Squirrel project. The framework is based on a special class of mediators, called "Squirrel integration mediators". These mediators can support the traditional virtual and materialized approaches, and also hybrids of them. In the Squirrel mediators, a relation in the integrated view can be supported as (a) fully materialized, (b) fully virtual, or (c) partially materialized (i.e., with some attributes materialized and other attributes virtual). In general, (partially) materialized relations of the integrated view are maintained by incremental updates from the source databases. Squirrel mediators provide two approaches for doing this: (1) materialize all needed auxiliary data, so that data sources do not have to be queried when processing the incremental updates; or (2) leave some or all of the auxiliary data virtual, and query selected source databases when processing incremental updates. The paper presents formal notions of consistency and ``freshness" for integrated views defined over multiple autonomous source databases. It is shown that Squirrel mediators satisfy these properties.


Change Detection in Hierarchically Structured Information

Sudarshan S. Chawathe (Stanford University)
Anand Rajaraman (Stanford University)
Hector Garcia-Molina (Stanford University)
Jennifer Widom (Stanford University)

Detecting and representing changes to data is important for active databases, data warehousing, view maintenance, and version and configuration management. Most previous work in change management has dealt with flat-file and relational data; we focus on hierarchically structured data. Since in many cases changes must be computed from old and new versions of the data, we define the hierarchical change detection problem as the problem of finding a ``minimum-cost edit script'' that transforms one data tree to another, and we present efficient algorithms for computing such an edit script. Our algorithms make use of some key domain characteristics to achieve substantially better performance than previous, general-purpose algorithms. We study the performance of our algorithms both analytically and empirically, and we describe the application of our techniques to hierarchically structured documents.


A Query Language and Optimization Techniques for Unstructured Data

Peter Buneman (University of Pennsylvania)
Susan Davidson (University of Pennsylvania)
Gerd Hillebrand (University of Pennsylvania and Universitaet Karlsruhe)
Dan Suciu (AT&T Research)

For certain database tasks that require great flexibility, some people are turning to a new kind of data representation in which the database is not constrained by a schema. Instead, each component of the database carries its own description which is independent of other components. Systems like AceDB, which has become very popular with biologists, and the recent Tsimmis proposal for data integration organize data in tree-like structures whose components can be used equally well to represent sets and tuples. What query language is appropriate for such structures? Here we propose a simple language UNQL for querying data organized as a rooted, edge-labeled graph. In this model, relational data may be represented as fixed-depth trees, and on such trees UNQL is equivalent to the relational algebra. The novelty of UNQL consists in its programming constructs for arbitrarily deep data and for cyclic structures. While strictly more powerful than query languages with path expressions like XSQL, UNQL can still be efficiently evaluated. We describe new optimization techniques for the deep, or ``vertical'' dimension of UNQL queries. Furthermore, we show that known optimization techniques for operators on flat relations apply to the ``horizontal'' dimension of UNQL.


Is GUI Programming a Database Research Problem?

Nita Goyal (HP Laboratories)
Charles Hoch (HP Laboratories)
Ravi Krishnamurthy (HP Laboratories)
Brian Meckler (HP Laboratories)
Michael Suckow (HP Laboratories)

Programming nontrivial GUI applications is currently an arduous task. Just as the use of a declarative language simplified the programming of database applications, we ask whether we can do the same for GUI programming? Can we then import a large body of knowledge from database research? We answer these questions by describing our experience in building nontrivial GUI applications initially using C++ programming and subsequently using Logic++, a higher order Horn clause logic language on complex objects with object-oriented features. We abstract a GUI application as a set of event handlers. Each event handler can be conceptualized as a transition from the old screen/program state to a new screen/program state. We use a data centric view of the screen/program state (i.e., every entity on the screen corresponds to proxy datum in the program) and express each event handler as a query dependent update, albeit a complicated one. To express such complicated updates we use Logic++. The proxy data are expressed as derived views that are materialized on the screen. Therefore, the system must be active in maintaining these materialized views. Consequently, each event handler is conceptually an update followed by a fixpoint computation of the proxy data. Based on our experience in building the GUI system, we observe that many database techniques such as view maintenance, active DB, concurrency control, recovery, optimization as well as language concepts such as higher order logic are useful in the context of GUI programming.


Accessing Relational Databases from the World Wide Web

Tam Nguyen (IBM Santa Teresa Laboratory)
V. Srinivasan (IBM Santa Teresa Laboratory)

With the growing popularity of the internet and the World Wide Web (Web), there is a fast growing demand for access to database management systems (DBMS) from the Web. We describe here techniques that we invented to bridge the gap between HTML, the standard markup language of the Web, and SQL, the standard query language used to access relational DBMS. We propose a flexible general purpose variable substitution mechanism that provides cross-language variable substitution between HTML input and SQL query strings as well as between SQL result rows and HTML output thus enabling the application developer to use the full capabilities of HTML for creation of query forms and reports, and SQL for queries and updates. The cross-language variable substitution mechanism has been used in the design and implementation of a system called DB2 WWW Connection that enables quick and easy construction of applications that access relational DBMS data from the Web. An end user of these DB2 WWW applications sees only the forms for his or her requests and resulting reports. A user fills out the forms, points and clicks to navigate the forms and to access the database as determined by the application.


TUTORIALS:



Repository System Engineering

Philip A. Bernstein (Microsoft Corp.)

A repository manager is a generic database application that manages metadata. It supports functions for managing changes in the structure of objects: types, relationships, versions and configurations. But a repository system is much more than a repository manager. The system must include a rich tool set and an object model (i.e. information model). It's the tools that make the system useful to people. To understand repository technology, it's essential to understand this system context. We therefore discuss tool features that repositories support, the repository marketplace, usage scenarios, and how to integrate tools into a repository system. Of course, the base technology is important too, so we discuss the underlying repository manager architecture, including object models, type extensibility, relationship semantics, and version and configuration management. We close with research problems worth pursuing.

Click here for the full tutorial notes.


Databases and Visualization

Daniel A. Keim (University of Munich)

Over the last decades, the areas of database systems and data visualization developed almost independently. While research on databases deals with the efficient storage and retrieval of large amounts of data, research on data visualization deals with the effective portrayal of data with the goal towards insight about the data. There is, however, a great need for combining the benefits from both areas. On one side, visualization systems deal with very large amounts of data and therefore need database systems as backends; and on the other side, there are very large databases containing a treasure of undiscovered information which could be unveiled using visualization technology. Until recently, however, there has not been much interaction between the two areas. With the exceptions of visualizing database schemas for database design (cf. semantic data models such as the ER-model) and visual querying of the database (cf. visual query systems such as QBE), visualization has been only of minor importance in the database area, especially the aspect of visualizing the data itself. Since size and dimensionality of databases is growing rapidly, the visualization of the data itself however is becoming increasingly important.

The goal of the tutorial is to show the potentials of visualization technology for databases. The tutorial provides an overview of the state-of-the-art in data visualization, classifying the existing data visualization techniques into eight groups: Geometric, Icon-based, Graph-based, Pixel- oriented, Hierarchical, 3D, Dynamic, and Hybrid Techniques. Besides describing each of these classes, the tutorial focuses on new developments in data visualization, which are relevant to the area of database systems. In particular, we describe a wide range of recently developed techniques for visualizing large amounts of arbitrary multi-attribute data which does not have any two- or three-dimensional semantics and therefore does not lend itself to an easy display. A detailed comparison shows the strengths and weaknesses of the existing techniques and reveals potentials for further improvements. Several examples demonstrate the benefits of visualization techniques for real databases applications. The tutorial concludes with an overview of existing database visualization systems, including research prototypes as well as commercial products.


Click here for the full tutorial notes.


Data Mining Techniques

Jiawei Han (Simon Fraser University)

Data mining, or knowledge discovery in databases, has been popularly recognized as an important research issue with broad applications. We provide a comprehensive survey, in the database perspective, on the data mining techniques developed recently. Several major kinds of data mining methods, including generalization, characterization, classification, clustering, association, evolution, pattern matching, data visualization, and meta-rule guided mining, will be reviewed. Techniques for mining knowledge in different kinds of databases, including relational, transaction, object-oriented, spatial, and active databases, as well as global information systems, will be examined. Potential data mining applications and some research issues will also be discussed.

Click here for the full tutorial notes.

DEMOS:


Thinksheet: a tool for tailoring complex documents

Peter Piatko (New York University)
Roman Yangarber (New York University)
Daoi Lin (New York University)
Dennis Shasha (New York University)

Imagine that you are a ``knowledge worker'' in the coming millenium. That means you must synthesize information and make decisions such as ``Which benefits plan to use?'' ``What do the regulations say about this course of action?'' ``How does my job fit into the corporate business plan?'' ``What should I be careful about when I approach this client?'' or even ``How does this program work?'' If the dream of digital libraries is to bring you all material relevant to your task, you may find yourself drowning before long. Reading is harder than talking to people who know the relevant documents and can tell you what you're interested in. That is what many current knowledge workers do, giving rise to professions such as insurance consultant, lawyer, benefits specialist, and so on.

Imagine by contrast that the documents you retrieve could be tailored precisely to your needs. That is, imagine that the document might ask you questions and produce a document filtered and organized according to those you have answered. To choose a concrete example, suppose you are interested in social security. Instead of reading the law or the 200 page Mercer Guide to Social Security to figure out your benefits, you answer the 30 or so questions necessary to determine your benefits (incomes, marital status, ages of kids, etc.). When you click on the proper node you read a four page document that is the law as if it were written for you. If you could avoid reading long complicated documents in favor of short ones tailored to your situation, which would you choose?

We have been developing software to achieve such tailoring. We call it Thinksheet.

Thinksheet documents currently must be created by hand. A very careful reader has to understand the document(s) thoroughly and then program the rules into a thinksheet to generate the tailored document. The reader could potentially use a conventional expert system, but programming such a system is challenging and "features" like conflict resolution render the conceptual model difficult.

Instead, we combine the technologies of spreadsheets, functional expert systems, keyword searching (in preparation for more advanced information retrieval), and database query processing. The conceptual model is only slightly more complex than a spreadsheet.

Our demonstration will show Thinksheet in action for applications from law, tourism, telephone billing plans, strategies, and others. We will show users how to create their own Thinksheets and how to integrate them with document and conventional databases as well as the internet.


HyperStorM - Administering Structured Documents Using Object-Oriented Database Technology

Klemens Boehm (GMD - Integrated Publication and Information Systems Institute)
Karl Aberer (GMD - Integrated Publication and Information Systems Institute)

In the demonstration, an OODBMS application framework for structured document storage, i.e., storage of SGML-/HyTime-documents, is presented. The database application allows to administer documents of arbitrary type. Further, documents' database-internal representation is configurable. Documents in the database can be modified concurrently with other access operations. The document base can be accessed via the WWW and be queried according to documents' internal structure, relationships between documents and document components' content. Metadata is used for layouting, hyperlink rendition, and query-form generation.


DBSim: A Simulation Tool for Predicting Database Performance

Mike Lefler (PRC Inc.)
Mark Stokrp (PRC Inc.)
Craig Wong (PRC Inc.)

Relational databases are the overwhelmingly dominant foundation for information systems development in the 1990's, and have been for at least a decade. However, despite the plethora of CASE tools that exist today to support information modeling, there is currently a shortfall of tools available for relational database designers and administrators to reliably predict performance prior to implementation. Typically, therefore, performance questions are pushed off until the physical schema has been designed, the database has been loaded with test (or operational) data, and acceptance testing has begun. As an unfortunate consequence, it is all too common for the new information system to initially perform worse than the system it is replacing. This can lead to customer dissatisfaction with the system - and with the system integrator.

Commercial CASE tools which support relational database design focus on the use of entity-relationship modeling to move a database design forward from a blank sheet of paper (or, today, an empty screen) through conceptual and logical schemas to an installed physical database schema, and it is our judgment that these "upper CASE" aspects of relational design are reasonably well-supported by the available toolkits. However there is a vital need for a tool that can aid in resolving performance questions prior to the new information system going online. The Database Performance Simulator (DBSim) is a tool which applies simulation to address he problem of fine tuning the physical database design prior to the time when the system is installed and the database loaded.

DBSim consists of four layers, each layer itself a stochastic model. The layers are: (1) the SQL Compilation Submodel, which maps received transactions into a sequence of internal DBMS operations (e.g., full table scan, indexed scan) and sequentially submits these operations to the Query Execution Submodel; (2) the Query Execution Submodel, which models the execution of each received operation as a sequence of page requests; (3) the Buffer Pool Submodel, which emulates the DBMS's disk cache manager, including shared and exclusive locking, deadlock detection, and LRU replacement strategy; and (4) the I/O Submodel, which uses a queuing model to estimate response times for received I/O requests.

Each submodel calculates an execution time by examining information visible only to that submodel, and/or by executing a sequence of requests against the next lower submodel.

The heart of DBSim is the Query Execution Submodel. It compares each primitive operation it receives to the physical schema, then stochastically generates a sequence of page requests which are passed to the Buffer Pool Submodel. The Query Execution Submodel models the following database operations:

DBSim presumes a B+ tree index. Parameters allow DBSim to model both unique (primary key) indexes and nonunique (secondary) indexes for both clustered and unclustered indexes. Index page requests for a path down the index are generated according to the 80-20 rule, and data page requests for range searches of unclustered indexes are randomly generated, again according to the 80-20 rule.


LORE: A Lightweight Object REpository for Semistructured Data

Dallan Quass (Stanford University)
Jennifer Widom (Stanford University)
Roy Goldman (Stanford University)
Kevin Haas (Stanford University)
Qingshan Luo (Stanford University)
Jason McHugh (Stanford University)
Svetlozar Nestorov (Stanford University)
Anand Rajaraman (Stanford University)
Hugo Rivero (Stanford University)
Serge Abiteboul (Stanford University)
Jeffrey D. Ullman (Stanford University)
Janet Wiener (Stanford University)

We will demonstrate a system called LORE (for Lightweight Object REpository), and its query language LOREL (for LORE Language), which are aimed specifically at handling semistructured data. We consider data to be semistructured when there is no schema fixed or known in advance and when the data may be incomplete or irregular. For example, HTML files on the World-Wide Web usually contain some structure, but often the data is irregular or incomplete. In addition, data integrated from multiple, heterogeneous information sources often is semistructured.

The data model used in Lore is a "lightweight" object model called OEM (for Object Exchange Model); OEM is a simple, self-describing model with object nesting and identity. Lore itself also is "lightweight," in the sense that it is a repository and a query engine but not a full-feature database management system. The query language Lorel is a compatible extension to OQL, with new features designed specifically for querying semistructured data: partially specified path expressions, wildcards, automatic type coercion in comparisons, and a special semantics for disjunction. Unlike OQL, Lorel does not enforce strong typing, thus allowing similar objects to be compared and retrieved despite minor differences in their structure. Finally, Lorel allows querying and schema browsing when the object structure is unknown or only partially known.

In addition to its special features for querying over semistructured data, Lore includes enhancements for managing data in a heterogeneous environment. As well as the usual base types (integer, string, etc.), Lore supports "multimedia" types such as GIF images, URLs, Java applets, audio, and text; additional base types can be incorporated easily. Lore supports seamless access to "external objects"---objects fetched on demand from arbitrary information sources during query execution and cached for later use. Any object in Lore may be a placeholder for an external object, allowing Lore to serve both as a storage repository for semistructured data and as a query-driven integration engine.


DBMiner: Interactive Mining of Multiple-Level Knowledge in Relational Databases

Jiawei Han (Simon Fraser University)
Yongjian Fu (Simon Fraser University)
Wei Wang (Simon Fraser University)
Jenny Chiang (Simon Fraser University)
Osmar R. Zaiane (Simon Fraser University)
Krzysztof Koperski (Simon Fraser University)

Based on our years-of-research, a data mining system, DBMiner, has been developed for interactive mining of multiple-level knowledge in large relational databases. The system implements a wide spectrum of data mining functions, including generalization, characterization, association, classification, and prediction. By incorporation of several interesting data mining techniques, including attribute-oriented induction, progressive deepening for mining multiple-level rules, and meta-rule guided knowledge mining, the system provides a user-friendly, interactive data mining environment with good performance.


Prospector: A Content-Based Multimedia Server for Massively Parallel Architectures.

Grace Au (NCR)
Alexandros Biliris (AT&T Research)
Homer Chen (Bell Laboratories)
Susan Choo (NCR)
Kicha Ganapathy (Bell Laboratories)
Glenn Linderman (NCR)
William O'Connell (Bell Laboratories)
Euthimios Panagos (AT&T Research)
David Schrader (NCR)


METU Interoperable Database System

Asuman Dogac (Middle East Technical University, Turkey)
Ugur Halici (Middle East Technical University, Turkey)
Ebru Kilic (Middle East Technical University, Turkey)
Gokhan Ozhan (Middle East Technical University, Turkey)
Fatma Ozcan (Middle East Technical University, Turkey)
Sena Nural (Middle East Technical University, Turkey)
Cevdet Dengi (Middle East Technical University, Turkey)
Sema Mancuhan (Middle East Technical University, Turkey)
Budak Arpinar (Middle East Technical University, Turkey)
Pinar Koksal (Middle East Technical University, Turkey)
Cem Evrendilek (Middle East Technical University, Turkey)

METU INteroperable DBMS (MIND) is a multidatabase system based on OMG's distributed object management architecture. It is implemented on top of a CORBA compliant ORB, namely, DEC's ObjectBroker. In MIND all local databases are encapsulated in a generic database object. The interface of the generic database object is defined in CORBA IDL and multiple implementations of this interface, one for each component DBMSs, namely, Oracle7, Sybase, Adabas D and MOOD are provided. MIND provides its users a common data model and a single global query language based on SQL. The main components of MIND are a global query manager, a global transaction manager, a schema integrator, interfaces to supported database systems and a graphical user interface.


System for Optimized Numeric Association Rules

Takeshi Fukuda (IBM Tokyo Research Laboratory)
Yasuhiko Morimoto (IBM Tokyo Research Laboratory)
Shinichi Morishita (IBM Tokyo Research Laboratory)
Takeshi Tokuyama (IBM Tokyo Research Laboratory)

In this demonstration, we introduce SONAR, a system for mining optimized association rules from databases with numeric data as well as Boolean data. An example of an association rule is

(Balance in J) => (Service X = yes),

which implies that bank customers whose balances fall in a range J are likely to use Service X with a certain probability. The above rule is interesting only if the number of customers whose balances are contained in J (called the support of J) is sufficient, and also if the probability (called the confidence ratio) is much higher than the average probability of the condition being met. SONAR focuses on computing an optimized support (confidence) range that maximizes the support (confidence) on the condition that the confidence (support) ratio is at or above a given threshold. SONAR also generates a two-dimensional version of association rules such as

((Age, Balance) in P) => (Service X = yes),

which implies that a bank customer whose age and balance fall in a planar region P tends to use Service X with a certain probability.


CapBasED-AMS: A Capability-based and Event-driven Activity Management System

Patrick C. K. Hung (Hong Kong University of Science and Technology)
Helen P. Yeung (Hong Kong University of Science and Technology)
Kamalakar Karlapalem (Hong Kong University of Science and Technology)

An activity management system deals with management and execution of activities. A problem solving agent is a human, or a hardware system, or a software system having an ability to execute activities. An activity consists of multiple inter-dependent tasks that need to be coordinated, scheduled and executed by a set of problem solving agents, where each task is an atomic activity executed by exactly one problem solving agent. In this demonstration, we present an implementation of a capability-based and event-driven activity management system.


The MultiView Project: Object-Oriented View Technology and Applications

Elke Rundensteiner (University of Michigan, Ann Arbor)
Harumi A. Kuno (University of Michigan, Ann Arbor)
Young-Gook Ra (University of Michigan, Ann Arbor)
Viviane Crestana-Taube (University of Michigan, Ann Arbor)
Matthew Jones (University of Michigan, Ann Arbor)
Pedro Marron (University of Michigan, Ann Arbor)

The MultiView Project is an on-going 5-year NFS-funded effort at the University of Michigan to develop and apply object-oriented view technology to address the needs of recently emerging applications such as data warehousing and workflow management systems that require the sharing, virtual restructuring, and caching of data. MultiView, which is fully implemented on top of the GemStone OODB, is one of the first working systems to provide updatable and incrementally maintained materialized object-oriented views. Unique features of the MultiView system include the incorporation of virtual classes into the global schema as first-class database citizens and the support for capacity-augmenting views. The system exploits optimization strategies for the efficient incremental maintenance of materialized views, including distributed property registration and subsumption-based branch termination. Our experimental studies confirm the performance benefit of our proposed optimizations.

In addition, MultiView applies and augments view technology to address important problems such as achieving transparent schema evolution, developing complex customized view types beyond simple SQL-type of set transformations, and dynamic query processing across heterogeneous collections in digital libraries. For example, we utilize OO view technology to solve the software legacy problem of dealing with continuously evolving database requirements. Our transparent schema evolution (TSE) system, which is built on top of MultiView, integrates schema evolution and view support functionalities into one system, thereby supporting the on-line modification of a shared object repository without disturbing existing applications. One unique contribution of TSE is its extensibility framework, which allows the dynamic addition of new change operations into our system (unlike the predetermined static sets of change operations provided by other OODB systems). To the best of our knowledge, ours is the very first effort to provide an open architecture that supports the dynamic addition of new schema change operations.


BeSS: Storage Support for Interactive Visualization Systems

Alexandros Biliris (AT&T Research)
Thomas Funkhouser (Bell Laboratories)
William O'Connell (Bell Laboratories)
Euthimios Panagos (AT&T Research)

Interactive computer graphics systems for visualization of realistic-looking, three-dimensional models are useful for evaluation, design and training in virtual environments (e.g., architectural and mechanical CAD, flight simulation, and virtual reality). The demonstration shows how the storage and retrieval facilities of the BeSS storage system are being utilized in the UC Berkeley Building Walkthrough System, a computer graphics program for interactive visualization of large, furnished architectural models.


The Garlic Project

Mary Tork Roth (IBM Almaden Research Center)
Manish Arya (IBM Almaden Research Center)
Laura M. Haas (IBM Almaden Research Center)
Michael J. Carey (IBM Almaden Research Center)
William Cody (IBM Almaden Research Center)
Ron Fagin (IBM Almaden Research Center)
Peter M. Schwarz (IBM Almaden Research Center)
John Thomas (IBM Almaden Research Center)
Edward L. Wimmers (IBM Almaden Research Center)

The goal of the Garlic project is to build a multimedia information system capable of integrating data that resides in different database systems as well as in a variety of non-database data servers. This integration must be enabled while maintaining the independence of the data servers, and without creating copies of their data. The Garlic middleware provides an object-oriented schema to end user applications, interprets object queries, dispatches pieces of queries to the appropriate data servers, and assembles query results for delivery back to the applications. In addition to a C++ API, Garlic supports a novel query/browser interface called PESTO, which provides end users of the system with a friendly, graphical interface that supports interactive browsing, navigation, and querying of the contents of Garlic databases. Our demo exploits the unique features of PESTO to explore and query the unified object-oriented view of heterogeneous data presented by Garlic. We show that the Garlic system is capable of coordinating a broad range of object-oriented, relational and multimedia queries across several repositories with very different data models and levels of sophistication in query processing.


My Photo Inderpal Singh Mumick, mumick@research.att.com
Last updated 4/18/96 Copyright ©AT&T 1996. All rights reserved.