FiST: Scalable XML Document Filtering by Sequencing
Twig Patterns ∗
Joonho Kwon∗ Praveen Rao† Bongki Moon† Sukho Lee∗
∗School of Electrical Engineering and Computer Science
Seoul National University
Seoul 151-742, Korea
†Department of Computer Science
University of Arizona
Tucson, AZ 85721, USA
{rpraveen, bkmoon}@cs.arizona.edu
Abstract
In recent years, publish-subscribe (pub-sub) sys-
tems based on XML document filtering have
received much attention. In a typical pub-
sub system, subscribed users specify their in-
terest in profiles expressed in the XPath lan-
guage, and each new content is matched against
the user profiles so that the content is deliv-
ered to only the interested subscribers. As the
number of subscribed users and their profiles
can grow very large, the scalability of the sys-
tem is critical to the success of pub-sub ser-
vices. In this paper, we propose a novel scal-
able filtering system called FiST (Filtering by
Sequencing Twigs) that transforms twig pat-
terns expressed in XPath and XML documents
into sequences using Pr¨ufer’s method. As a
consequence, instead of matching linear paths
of twig patterns individually and merging the
matches during post-processing, FiST performs
holistic matching of twig patterns with incoming
documents. FiST organizes the sequences into a
dynamic hash based index for efficient filtering.
We demonstrate that our holistic matching ap-
proach yields lower filtering cost and good scal-
ability under various situations.
∗This work was done while Joonho Kwon visited the Univer-
sity of Arizona, whose visit was supported by the Brain Korea 21
Project and the Information Technology Research Center(ITRC)
Supprot Program. This work was also sponsored in part by NSF
Grant No. IIS-0100436, and NSF Research Infrastructure program
EIA-0080123, and the ACIST Fund from the State of Arizona. The
authors assume all responsibility for the contents of the paper.
Permission to copy without fee all or part of this material is
granted provided that the copies are not made or distributed for
direct commercial advantage, the VLDB copyright notice and the
title of the publication and its date appear, and notice is given that
copying is by permission of the Very Large Data Base Endowment.
To copy otherwise, or to republish, requires a fee and/or special
permission from the Endowment.
Proceedings of the 31st VLDB Conference,
Trondheim, Norway, 2005
1 Introduction
The publish-subscribe (pub-sub) systems play an impor-
tant role in e-commerce and Internet applications by en-
abling selective dissemination of information. In a typi-
cal pub-sub system, whenever new content is produced,
it is selectively delivered to interested subscribers. They
have enabled new services such as alerting and notifi-
cation services for users interested in knowing about the
latest products in the market, current affairs, stock price
changes etc. on a variety of devices like mobile phones,
PDAs and desktops. Such services necessitate the de-
velopment of software systems that enable scalable and
efficient matching of a large number of content against
millions of user subscriptions.
Today we come across e-commerce sites such as
Priceline.comand Hotwire.comthat provide email no-
tifications to subscribers about price changes and hot
deals. A recent service by Google.com called Google
Alerts provides email updates to users regarding relevant
Google results. Users can choose to receive notifications
by selecting a topic and providing a list of search key-
words. Another interesting example is the stock quote
tracking service provided by Yahoo.com. There is a
growing use and demand for large-scale information dis-
semination systems.
The popularity of extensible markup language XML
as a standardfor information exchangehas triggeredsev-
eral research efforts to build scalable XML filtering sys-
tems for information dissemination. In such a system,
user profiles are typically expressed in the XPath lan-
guage [4]. In this paper, we consider user profiles that
can be represented as twig patterns. These twig patterns
contain the child and descendant XPath axes. For ex-
ample, a path expression given in XPath syntax
book[author//name="John"]/title
qualifies XML documents by specifying a twig pattern
composed of four elements book, author, name and
title in an XML document, and a value-based selection
predicate name="John". In the filtering system, each in-
coming XML document is examined against user profiles
represented by XPath expressions. The XML document
is sent to users whose profiles are matched. One of the
key challenges in building such a system is to effectively
217
organize a large number of profiles in order to minimize
the filtering cost and achieve good scalability.
It should be noted that the XML filtering problem
is different from the problem of finding all occurrences
of a twig pattern in an XML document. This is due
to the reversal in the roles of twig patterns and XML
documents. Essentially, the filtering problem that we
address in this paper is stated as follows.
Given a set of XPath expressions, identify those
XPath expressions that appear in a given XML
document.
XFilter [1] was one of the early work in XML filtering.
XFilter handles simple XPath expressions by transform-
ing each path expression into a single finite state ma-
chine. Subsequently, the YFilter system [19] was pro-
posed that focused on shared path matching to improve
the scalability of the filtering system. YFilter constructs
a single non-deterministic finite automaton (NFA) for
all the XPath expressions. YFilter supports twig pat-
terns by first matching individual linear paths from root-
to-leaf and then by performing post-processing to iden-
tify matching twig patterns. Consider a nested XPath
expression book[author//name]/title. YFilter splits
the pattern and indexes two linear path expressions in
its NFA, namely book/title and book/author//name.
The individual linear path matches are used during post-
processing to identify twig pattern matches.
In this paper, we propose a novel filtering system
called FiST (Filtering by Sequencing Twigs) that per-
forms holistic matching of twig patterns with each in-
coming XML document. The matching is holistic since
FiST does not break a twig pattern into root-to-leaf
paths. Rather the twig pattern is matched as a whole
due to sequence transformation. Our system focuses on
ordered twig pattern matching, which is essential for ap-
plications where the nodes in a twig pattern follow the
document order in XML. For example, an XML data
model was proposed by Bow et al. for representing in-
terlinear text for linguistic applications, which is used
to demonstrate various linguistic principles in different
languages[7]. Bow’sXML model providesa four-levelhi-
erarchical representation for the interlinear text, namely,
text level, phrase level, word level and morpheme level.
For the purpose of linguistic analysis, it is essential to
preserve linear order between the words in the text [18].
Thus, there is a compelling need for ordered twig pat-
tern matching. In addition to interlinear text, language
treebanks have been widely used in computational lin-
guistics. Treebanks capture syntactic structure of text
data and provide a hierarchical representation of text by
breaking them into syntactic units such as noun clauses,
verbs, adjectives and so on. A recent paper by M¨uller et
al. used ordered pattern matching over treebanks for
question answering systems [15].
Our FiST system matches twig patterns holistically
using the idea of encoding XML documents and twig
patterns into Pr¨ufersequences[17]. The Pr¨ufer’s method
constructs a one-to-one correspondence between labeled
trees and sequences. (Refer to Section 3). It was shown
in the PRIX system [17] that the above encoding sup-
ports ordered twig pattern matching efficiently. A collec-
tion of sequences for twig patterns are organized into a