|
Abstract:
Histograms are frequently used to represent the distribution of data values in an attribute of a relation. This distribution can be used for data mining as well as for various estimated statistics. Most previous work has focused on identifying the optimal histogram (given a limited number of buckets) for a single attribute, independent of other attributes/histograms. We propose the idea of global optimization of histograms, i.e., single-attribute histograms for a set of attributes are optimized collectively so as to minimize the overall error in using the histograms. The idea is to allocate more buckets to histograms whose attributes are more frequently used and/or distributions are highly skewed. We developed several alternative algorithms for this purpose, and experimentally demonstrated their value. |
Nina Mishra, Dan Oblinger, and
Leonard Pitt. Sublinear Time Approximate Clustering. In
12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA),
Washington, D.C., Jan 2001.
|
Abstract:
We present a sampling-based algorithm for approximate k-median clustering that improves upon previous approximation results in the case that the number of points n being clustered is the dominant problem parameter, assuming a distance metric on [0,M]^d. Unlike previous approximate clustering results, our work has time complexity independent of n. This independence is very attractive for the large clustering problems typically associated with large data sources like the web, phone records or transactional data. We also provide a generalization of the algorithm for use when no bound M is known for the points being clustered, and we describe how a previously developed dimension reduction technique can be applied to eliminate our dependence on the dimension when d > log n. We also apply our sampling approach to the related problem of property testing, i.e., given A and k, determining whether there exists a k-median clustering with average distance from points to centers A. A new definition of clustering similar in spirit to the PAC model of learning is developed that allows for clustering (possibly infinite) probability distributions and naturally allows for conceptual descriptions of clusters. We show that the k-median problem is PAC-clusterable, and conclude with an algorithm for PAC-clustering disjoint k-term DNF expressions. |
S. Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, and Ting Yu.
Approximate XML Joins.
In Proc. ACM-SIGMOD Intl. Conf. Management of Data, Madison, WI,
June 2002.
|
Abstract:
A feature, and a challenge, of semi-structured data is that the same information can easily be represented in multiple ways that each differ slightly. There is often a desire to match two pieces of data if they are similar to each other, even if not identical. For instance, two occurrences of a customer record, one with a telephone number and another without, should match. There has been work on matching tree-structured (XML) data towards this end. In this paper we consider the question of performing not just one single such match, but rather finding all similar pairs drawn from two sets of objects to be "joined". Towards this end, we develop a sampling-based technique to select a small number of cluster representatives drawn from either data set. We then compute similarity distances between each element in the data set and use these to bound the number of element pair-wise comparisons that are required. |
Andrew Nierman and H.V. Jagadish. Evaluating Structural Similarity
in XML Documents.
In Proc. ACM-SIGMOD Intl. Workshop on Web and Databases, Madison, WI,
June 2002.
|
Abstract:
Given a set of XML documents, it is often valuable to be able cluster them into classes of structurally similar documents. (For instance, one could then induce a scheme for the set of documents in a clusster). A basic requirement is a metric for structural similarity, and an efficient algorithm to evaluate the structural distance between two XML documents. In this paper, we propose such a metric and show it to be very effective at categorizing documents by their DTD, kept hidden for the purposes of the experiment. We also develop a dynamic programming algorithm to evaluate the structural distance between two XML documents in time that is proportional to the document sizes. |
Andrew Nierman and H. V. Jagadish.
ProTDB: Probabilistic Data in XML
In Proc. 28th Intl. Conf. on Very Large Databases, Hong Kong,
Aug. 2002.
|
Abstract:
Whereas traditional databases manage only deterministic information, many applications that use databases involve uncertain data. This paper presents a Probabilistic Tree Data Base (ProTDB) to manage probabilistic data, represented in XML. Our approach differs from previous efforts to develop probabilistic relational systems in that we build a probabilistic XML database. This design is driven by application needs that involve data not readily amenable to a relational representation. XML data poses several modeling challenges: due to its structure, due to the possibility of uncertainty association at multiple granularities, and due to the possibility of missing and repeated sub-elements. We present a probabilistic XML model that addresses all of these challenges. We devise an implementation of XML query operations using our probability model, and demonstrate the efficiency of our implementation experimentally. |
J. Elble, C. Heeren, and L. Pitt.
Sampling for Optimized Association Rules. In preparation.
This project has been funded in part by the National Science Foundation, under grant numbers IIS-9907483 and IIS-0002356. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the project participants, and do not necessarily reflect the views of the National Science Foundation.