Journals:
We study a constrained version of the knapsack problem in which dependencies between items are given by the adjacencies of a graph. In the 1-neighbour knapsack problem, an item can be selected only if at least one of its neighbours is also selected. In the all-neighbours knapsack problem, an item can be selected only if all of its neighbours are also selected (both constraints are sketched below). We give approximation algorithms and hardness results for both uniform and arbitrary weight and profit functions, and for both directed and undirected dependency graphs.
An earlier version of this paper appeared at IWOCA 2011.
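To make the two dependency constraints concrete, here is a minimal feasibility check. It assumes `adj` maps each item to its (out-)neighbours; treating neighbourless items as freely selectable is a convention of this sketch, not necessarily the paper's.

```python
def feasible_1_neighbour(selected, adj):
    """1-neighbour constraint: every selected item has at least one
    selected neighbour.  Items with no neighbours at all are treated
    as freely selectable (an assumption of this sketch)."""
    return all(not adj[u] or any(v in selected for v in adj[u])
               for u in selected)

def feasible_all_neighbours(selected, adj):
    """All-neighbours constraint: every neighbour of a selected item
    must also be selected."""
    return all(all(v in selected for v in adj[u]) for u in selected)
```

For a directed graph, `adj[u]` would hold u's out-neighbours; for an undirected graph, all of u's neighbours.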
We give a (ln n + 1)-approximation for the decision tree (DT) problem. An instance of DT is a set of m binary tests T = {T1, ..., Tm} and a set of n items X = {X1, ..., Xn}. The goal is to output a binary tree where each internal node is a test, each leaf is an item, and the total external path length of the tree is minimized. Total external path length is the sum of the depths of all the leaves in the tree. DT has a long history in computer science with applications ranging from medical diagnosis to experiment design. It also generalizes the problem of finding optimal average-case search strategies in partially ordered sets, which includes several alphabetic tree problems. Our work decreases the previous upper bound on the approximation ratio by a constant factor. We provide a new analysis of the greedy algorithm that uses a simple accounting scheme to spread the cost of a tree among pairs of items split at a particular node (the greedy rule is sketched below). We conclude by showing that our upper bound also holds for the DT problem with weighted tests.
An earlier version of this paper appeared at APPROX 2008.
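A minimal sketch of the greedy rule under assumed conventions: each item is a distinct 0/1 tuple indexed by test, and the greedy step picks the test that separates the most still-undistinguished pairs, which for a split of sizes (a, b) means maximizing a·b, i.e., taking the most balanced split. The representation and tie-breaking are illustrative, not the paper's exact formulation.

```python
def greedy_dt_cost(items, tests, depth=0):
    """Total external path length (sum of leaf depths) of the greedily
    built decision tree.  `items` is a list of distinct 0/1 tuples;
    `tests` is a list of usable coordinate indices.  Assumes every pair
    of items is distinguished by some test."""
    if len(items) <= 1:
        return depth * len(items)

    # A split of sizes (a, b) separates a * b pairs, so maximizing pairs
    # separated is the same as maximizing min(a, b).
    def split_size(t):
        ones = sum(item[t] for item in items)
        return min(ones, len(items) - ones)

    best = max(tests, key=split_size)
    if split_size(best) == 0:
        raise ValueError("items not distinguishable by remaining tests")
    left = [item for item in items if item[best] == 0]
    right = [item for item in items if item[best] == 1]
    rest = [t for t in tests if t != best]
    return (greedy_dt_cost(left, rest, depth + 1) +
            greedy_dt_cost(right, rest, depth + 1))
```

For example, greedy_dt_cost([(0, 0), (0, 1), (1, 0)], [0, 1]) returns 5, from leaf depths 2, 2, and 1.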
We describe a statistical signature of chunks and an algorithm for finding chunks. While there is no formal definition of chunks, they may be reliably identified as configurations with low internal entropy or unpredictability and high entropy at their boundaries. We show that the log frequency of a chunk is a measure of its internal entropy. The Voting-Experts algorithm exploits the signature of chunks to find word boundaries in text from four languages and episode boundaries in the activities of a mobile robot.
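As a concrete illustration, both halves of the signature can be computed directly from corpus statistics. This sketch is an illustrative reading of the abstract, not the paper's implementation: the internal score of an n-gram is its log frequency and the boundary score is the entropy of the symbol that follows it; chunk-like n-grams score high on both.

```python
import math
from collections import Counter, defaultdict

def chunk_signature(corpus, max_n=5):
    """Score every n-gram of `corpus` (a string) by the two halves of
    the chunk signature: log frequency (a proxy for low internal
    entropy) and the entropy of the symbol immediately following the
    n-gram (boundary entropy).  Occurrences at the very end of the
    corpus are ignored for simplicity."""
    freq = Counter()
    followers = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(corpus) - n):
            gram = corpus[i:i + n]
            freq[gram] += 1
            followers[gram][corpus[i + n]] += 1
    scores = {}
    for gram, f in freq.items():
        total = sum(followers[gram].values())
        boundary = -sum((c / total) * math.log2(c / total)
                        for c in followers[gram].values())
        scores[gram] = (math.log2(f), boundary)
    return scores
```

On English text, n-grams that score high on both measures tend to align with words, which is exactly the signature the abstract describes.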
Peer Reviewed Proceedings:
We propose a novel batch active learning method that leverages the availability of high-quality and efficient sequential active-learning policies by approximating their behavior when applied for k steps. Specifically, our algorithm uses Monte-Carlo simulation to estimate the distribution of unlabeled examples selected by a sequential policy over k steps. The algorithm then selects the k examples that best match this distribution, leading to a combinatorial optimization problem that we term “bounded coordinated matching.” While we show this problem is NP-hard, we give an efficient greedy solution, which inherits approximation bounds from supermodular minimization theory. Experiments on eight benchmark datasets show that the proposed approach is highly effective.
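A minimal sketch of the simulate-then-match pipeline, with assumed interfaces: `seq_policy(available, chosen)` stands in for the sequential active-learning policy (internally hallucinating labels for the examples chosen so far in that run), and the final matching step is reduced to taking the k most frequently simulated picks; the paper's bounded coordinated matching objective is richer than this.

```python
from collections import Counter

def batch_by_simulation(pool, seq_policy, k, trials=100):
    """Monte-Carlo estimate of which unlabeled examples a sequential
    policy would select over k steps, followed by a crude matching
    step.  `seq_policy` is an assumed callback, not a real library API."""
    counts = Counter()
    for _ in range(trials):
        available = list(pool)
        chosen = []
        for _ in range(k):
            x = seq_policy(available, chosen)  # one simulated sequential step
            chosen.append(x)
            available.remove(x)
        counts.update(chosen)
    # Crude stand-in for bounded coordinated matching: keep the k
    # examples the simulations selected most often.
    return [x for x, _ in counts.most_common(k)]
```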
We study the fundamental algorithmic rigidity problems for generic frameworks periodic with respect to a fixed lattice or a finite-order rotation in the plane. For fixed-lattice frameworks we give an O(n^2) algorithm for deciding generic rigidity and an O(n^3) algorithm for computing rigid components. If the order of rotation is part of the input, we give an O(n^4) algorithm for deciding rigidity; in the case where the rotation's order is 3, a more specialized algorithm solves all the fundamental algorithmic rigidity problems in O(n^2) time.
Let us call a sequence of numbers heapable if its elements can be inserted sequentially to form a binary tree with the heap property, where each insertion after the first is placed beneath a previously placed number. In this paper we consider a variety of problems related to heapable sequences and subsequences that do not appear to have been studied previously. Our motivation for introducing these concepts is two-fold. First, such problems correspond to natural extensions of the well-known secretary problem to hiring for an organization with a hierarchical structure. Second, from a purely combinatorial perspective, our problems are interesting variations on classic longest-increasing-subsequence problems, a problem paradigm that has led to many deep mathematical connections. We provide several basic results. We obtain an efficient algorithm for determining the heapability of a sequence (sketched below), and also prove that the question of whether a sequence can be arranged in a complete binary heap is NP-hard. Regarding subsequences we show that, with high probability, the longest heapable subsequence of a random permutation of n numbers has length (1 − o(1))n, and a subsequence of length (1 − o(1))n can in fact be found online with high probability. We similarly show that for a random permutation a subsequence that yields a perfect heap of size αn, for some constant α, can be found with high probability. Our work highlights the interesting structure underlying this class of subsequence problems, and we leave many further interesting variations open for future work.
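A minimal sketch of one natural greedy strategy for the heapability test (min-heap convention; the details and correctness argument are the paper's): keep a sorted multiset of open child slots and attach each new item beneath the largest earlier value that can accept it.

```python
import bisect

def is_heapable(seq):
    """Decide whether seq can be inserted, in order, into a binary
    min-heap in which every item after the first becomes the child of
    some earlier item.  Greedy rule: attach each item beneath the
    largest earlier value that is <= it and still has an open slot."""
    if not seq:
        return True
    slots = [seq[0], seq[0]]      # sorted multiset: the root's two open slots
    for x in seq[1:]:
        i = bisect.bisect_right(slots, x)  # slots[:i] hold values <= x
        if i == 0:
            return False          # no earlier item can accept x as a child
        slots.pop(i - 1)          # consume the largest feasible slot
        bisect.insort(slots, x)   # x contributes two open slots of its own
        bisect.insort(slots, x)
    return True
```

For example, is_heapable([1, 3, 2, 4]) is True, while is_heapable([2, 1]) is False, since 1 cannot sit below 2 in a min-heap.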
We consider the problem of finding minimum reset sequences in synchronizing automata. The well-known Černý conjecture states that every n-state synchronizing automaton has a reset sequence of length at most (n-1)^2. While this conjecture gives an upper bound on the length of every reset sequence, it does not directly address the problem of finding the shortest reset sequence. We call this the Minimum Reset Sequence (MRS) problem. We give an O(km n^k + n^4/k)-time (n-1)/(k-1)-approximation for the MRS problem for any k >= 2, and we show that our analysis is tight. When k = 2 our algorithm reduces to Eppstein's algorithm and yields an (n-1)-approximation (sketched below). When k = n our algorithm is the familiar exponential-time, exact algorithm. We also define a non-trivial class of MRS instances that naturally generalizes two classic optimization problems, including MIN SET COVER, which is NP-hard to approximate to within a factor of c · log n for some c > 0. Thus, the MRS problem is at least as hard to approximate as MIN SET COVER. This improves the previous best lower bound, which showed that it is NP-hard to approximate the MRS problem on binary alphabets to within any constant factor. Our result requires an alphabet of arbitrary size.
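For intuition, here is a minimal sketch of the k = 2 strategy (pairwise merging, as in Eppstein's algorithm): repeatedly run a breadth-first search over pairs of live states for a shortest word that merges some pair, then apply that word to every live state. The dict-of-dicts transition table `delta` is an assumed representation.

```python
from collections import deque

def run(delta, state, word):
    """Apply a word (list of letters) to a state; delta[q][a] is the
    assumed transition table."""
    for a in word:
        state = delta[state][a]
    return state

def merge_word(delta, s, t, alphabet):
    """Shortest word sending states s and t to a common state, found by
    BFS over ordered pairs; returns None if no such word exists."""
    seen = {(s, t): None}
    q = deque([(s, t)])
    while q:
        pair = q.popleft()
        if pair[0] == pair[1]:
            word = []
            while seen[pair] is not None:   # walk parent links back to (s, t)
                pair, a = seen[pair]
                word.append(a)
            return word[::-1]
        for a in alphabet:
            nxt = (delta[pair[0]][a], delta[pair[1]][a])
            if nxt not in seen:
                seen[nxt] = (pair, a)
                q.append(nxt)
    return None

def greedy_reset(delta, states, alphabet):
    """Pairwise-merging heuristic: while more than one state is live,
    merge two of them with a shortest word and apply it to all."""
    live, word = set(states), []
    while len(live) > 1:
        s, t = sorted(live)[:2]
        w = merge_word(delta, s, t, alphabet)
        if w is None:
            return None                     # not a synchronizing automaton
        word.extend(w)
        live = {run(delta, q, w) for q in live}
    return word
```

On any synchronizing automaton this returns some reset word, though not necessarily a shortest one.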
We study a constrained version of the knapsack problem in which dependencies between items are given by the adjacencies of a graph. In the 1-neighbour knapsack problem, an item can be selected only if at least one of its neighbours is also selected. We give approximation algorithms and hardness results for both uniform and arbitrary weight and profit functions, and for both directed and undirected dependency graphs.
arXiv version
We give the first data structure for the problem of maintaining a dynamic set of n elements drawn from a partially ordered universe described by a tree. We define the Line-Leaf Tree, a linear-sized data structure that supports the operations insert, delete, test membership, and predecessor. The performance of our data structure is within an O(log w) factor of optimal. Here w <= n is the width of the partial order, a natural obstacle in searching a partial order.
arXiv version
We give a (ln n + 1)-approximation for the decision tree (DT) problem. An instance of DT is a set of m binary tests T = {T1, ..., Tm} and a set of n items X = {X1, ..., Xn}. The goal is to output a binary tree where each internal node is a test, each leaf is an item, and the total external path length of the tree is minimized. Total external path length is the sum of the depths of all the leaves in the tree. DT has a long history in computer science with applications ranging from medical diagnosis to experiment design. It also generalizes the problem of finding optimal average-case search strategies in partially ordered sets, which includes several alphabetic tree problems. Our work decreases the previous upper bound on the approximation ratio by a constant factor. We provide a new analysis of the greedy algorithm that uses a simple accounting scheme to spread the cost of a tree among pairs of items split at a particular node. We conclude by showing that our upper bound also holds for the DT problem with weighted tests.
The vast number of applications featuring multimedia and geometric data has made the R-tree a ubiquitous data structure in databases. A popular and fundamental operation on R-trees is nearest neighbor search. While nearest neighbor on R-trees has received considerable experimental attention, it has received somewhat less theoretical consideration. We study pruning heuristics for nearest neighbor queries on R-trees. Our primary result is the construction of non-trivial families of R-trees where k-nearest neighbor queries based on pessimistic (i.e. min-max) distance estimates provide exponential speedup over queries based solely on optimistic (i.e. min) distance estimates. The exponential speedup holds even when k=1. This result provides strong theoretical evidence that min-max distance heuristics are an essential component of depth-first nearest-neighbor queries. In light of this, we also consider the time-space tradeoffs of depth-first versus best-first nearest neighbor queries and construct a family of R-trees where best-first search performs exponentially better than depth-first search even when depth-first employs min-max distance heuristics.
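To illustrate the two families of estimates, here is a minimal depth-first 1-NN sketch over an assumed node shape (`is_leaf`, `entries`, `children`, `rect`). For the pessimistic bound it uses the simpler farthest-corner distance rather than the min-max distance the abstract analyzes; both upper-bound the distance to some object in a nonempty subtree, with min-max being tighter.

```python
import math

def mindist(q, rect):
    """Optimistic estimate: distance from query point q to the nearest
    point of the bounding rectangle rect = (lo, hi)."""
    lo, hi = rect
    return math.dist(q, [min(max(c, l), h) for c, l, h in zip(q, lo, hi)])

def maxdist(q, rect):
    """Pessimistic estimate: distance to the farthest corner of rect.
    (A weaker stand-in for the min-max distance; any object inside a
    nonempty rectangle is at most this far away.)"""
    lo, hi = rect
    return math.dist(q, [l if abs(c - l) >= abs(c - h) else h
                         for c, l, h in zip(q, lo, hi)])

def dfs_nn(node, q, best=None):
    """Depth-first 1-NN: visit children in order of optimistic distance,
    pruning any whose optimistic distance exceeds the best pessimistic
    bound seen so far.  Returns [distance, object]."""
    if best is None:
        best = [math.inf, None]
    if node.is_leaf:
        for obj, pt in node.entries:
            d = math.dist(q, pt)
            if d < best[0]:
                best[0], best[1] = d, obj
    else:
        kids = sorted(node.children, key=lambda c: mindist(q, c.rect))
        bound = min([best[0]] + [maxdist(q, c.rect) for c in kids])
        for c in kids:
            if mindist(q, c.rect) <= bound:
                dfs_nn(c, q, best)
                bound = min(bound, best[0])
    return best
```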
We introduce the Constrained Subtree Selection (CSS) problem as a model for the optimal design of websites. Given a hierarchy of topics represented as a DAG G and a probability distribution over the topics, we select a subtree of the transitive closure of G which minimizes the expected path cost. We define path cost as the sum of the page costs along a path from the root to a leaf (the objective is sketched below). Page cost, gamma, is a function of the number of links on a page. We give a sufficient condition for gamma which makes CSS NP-complete. This result holds even for the uniform probability distribution. We give a polynomial time algorithm for instances of CSS where G does not constrain the choice of subtrees and gamma favors pages with at most k links. We show that CSS remains NP-hard for constant degree DAGs and a class of degree costs, and give an O(log(k)·gamma(d+1)) approximation for any gamma favoring pages with at most k links and for any G with maximum degree d. We give complete characterizations of the optimal trees for the linear degree cost in unconstrained graphs and uniform probability distributions, and the logarithmic degree cost in arbitrary DAGs and uniform probability distributions.
A complete version of this paper is available as University of Massachusetts Technical Report 04-09
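For reference, the objective being minimized can be written down directly. This sketch assumes a dict-of-lists tree representation and a hypothetical "root" key; it computes the expected path cost of a candidate subtree.

```python
def expected_path_cost(tree, gamma, prob, root="root"):
    """Expected root-to-leaf path cost.  `tree` maps each node to its
    list of children (leaves map to []), `gamma(d)` is the page cost of
    a page with d links, and `prob[leaf]` is the leaf's probability."""
    def walk(node, cost):
        children = tree.get(node, [])
        if not children:                # a leaf: pay the accumulated cost
            return prob[node] * cost
        page = gamma(len(children))     # cost of the page at this node
        return sum(walk(c, cost + page) for c in children)
    return walk(root, 0.0)
```

With the linear degree cost gamma(d) = d and uniform leaf probabilities, this is exactly the quantity whose optimal trees the final result characterizes.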
This paper describes an unsupervised algorithm for segmenting categorical time series into episodes. The Voting-Experts algorithm first collects statistics about the frequency and boundary entropy of n-grams, then passes a window over the series and has two “expert methods” decide where in the window boundaries should be drawn (the voting step is sketched below). The algorithm successfully segments text into words in four languages. The algorithm also segments time series of robot sensor data into subsequences that represent episodes in the life of the robot. We claim that Voting-Experts finds meaningful episodes in categorical time series because it exploits two statistical characteristics of meaningful episodes.
An earlier version is available in the Working Notes of the 2002 ESF Exploratory Workshop on Pattern Detection and Discovery in Data Mining
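A minimal sketch of the windowed voting step; `score_freq` and `score_entropy` are assumed callbacks built from the n-gram statistics (for instance, standardized versions of the two scores in the signature sketch above), and the window size and vote threshold are illustrative.

```python
def vote_boundaries(series, score_freq, score_entropy,
                    window=7, threshold=4):
    """Slide a window over the series; in each position the frequency
    expert votes for the split that makes both halves familiar chunks,
    and the entropy expert votes for the split with the least
    predictable boundary.  Local maxima of the vote totals that clear
    the threshold become episode boundaries."""
    votes = [0] * (len(series) + 1)
    for start in range(len(series) - window + 1):
        w = series[start:start + window]
        f = max(range(1, window),
                key=lambda i: score_freq(w[:i]) + score_freq(w[i:]))
        e = max(range(1, window),
                key=lambda i: score_entropy(w[:i]))
        votes[start + f] += 1
        votes[start + e] += 1
    return [i for i in range(1, len(series))
            if votes[i] >= threshold
            and votes[i] >= votes[i - 1] and votes[i] >= votes[i + 1]]
```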
Estimating the parameters of stochastic context-free grammars (SCFGs) from data is an important, well-studied problem. Almost without exception, existing approaches make repeated passes over the training data. The memory requirements of such algorithms are ill-suited for embedded agents exposed to large amounts of training data over long periods of time. We present a novel algorithm, called HOLA, for estimating the parameters of SCFGs that computes summary statistics for each string as it is observed and then discards the string. The memory used by HOLA is bounded by the size of the grammar, not by the amount of training data. Empirical results show that HOLA performs as well as the Inside-Outside algorithm on a variety of standard problems, despite the fact that it has access to much less information.
Coalition operations are increasingly effects-based, which means they apply force only as necessary to achieve political and psychological effects. Capture the Flag is a war gaming environment that includes intelligent, autonomous adversaries. Our planner previously focused on two goals: occupying objectives and attrition. But attrition is a means to the defeat of one's enemies, not an end in itself. For Capture the Flag to plan for defeat, it needs a model of defeat. We model the “capacity for conflict” as a leaky bucket: when a unit's bucket is full, it has no more capacity for conflict and it capitulates. Flow into and out of the bucket is modulated by several factors including attrition and heroism. The model is inherently dynamical, so it exhibits the time-dependent behaviors one observes in real conflicts; for example, identical attacks will have different effects on capitulation as a function of their timing (see the sketch below).
An earlier version appears in the Proceedings of the Second International Conference on Knowledge Systems for Coalition Operations. Pages 85-90.
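A minimal sketch of the leaky-bucket dynamic described above; the capacity, leak rate, and event magnitudes here are illustrative, not the model's calibrated parameters.

```python
def capitulates(events, capacity=100.0, leak=0.5):
    """Leaky-bucket model of the capacity for conflict.  Each time step
    contributes a net inflow (positive for attrition and other
    stressors, negative for heroism and similar mitigating factors);
    the bucket also leaks a fixed amount per step as the unit recovers.
    The unit capitulates when the bucket fills."""
    level = 0.0
    for inflow in events:
        level = max(0.0, level + inflow - leak)
        if level >= capacity:
            return True     # capacity for conflict exhausted
    return False
```

Because the level decays between shocks, identical attacks can have different effects depending on their timing, which is the time-dependent behavior the abstract highlights.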
The Hierarchical Agent Control Architecture (HAC) is a general toolkit for specifying an agent's behavior. HAC supports action abstraction, resource management, and sensor integration, and is well suited to controlling large numbers of agents in dynamic environments. It relies on three hierarchies: action, sensor, and context. The action hierarchy controls the agent's behavior. It is organized around tasks to be accomplished, not the agents themselves. This facilitates the integration of multi-agent actions and planning into the architecture. The sensor hierarchy provides a principled means for structuring the complexity of reading and transforming sensor information. Each level of the hierarchy integrates the data coming in from the environment into conceptual chunks appropriate for use by actions at that level. Actions and sensors are written using the same formalism. The context hierarchy is a hierarchy of goals. In addition to their primary goals, most actions are simultaneously trying to achieve a set of implicit assumptions. These assumptions are made explicit through the context hierarchy and provide the context under which the agent operates. We have developed a planner, GRASP, implemented within HAC, which is capable of resolving multiple goals in real time. HAC was intended to have wide applicability. It has been used to control agents in commercial computer games and physical robots. Our primary application domain is a simulator of land-based military engagements called “Capture the Flag.” HAC's simulation substrate models physics at an abstract level. HAC supports any domain in which behaviors can be reduced to a small set of primitive effectors such as MOVE and APPLY-FORCE. At this time defining agent behavior requires Lisp programming skills; we are moving towards more graphical programming languages.
Defeat mechanisms are strategies for achieving victory over an opponent. Although defeat mechanisms often rely on influencing the opponent psychologically and emotionally, most simulations of warfare do not model these “soft” factors; they model only victory by attrition. To create more accurate, adaptable, and believable systems, we must be able to model a variety of defeat mechanisms. We propose a model where parameters and attributes that affect emotional and physical fatigue are combined to produce an overall measure of fatigue called effective fatigue. Effective fatigue, along with an agent's state, is combined by a defeat model to produce probabilities of surrender. We create warfare scenarios involving catastrophe and surprise, and then examine the model's behavior under these scenarios. We conclude with a discussion of how the model is related to our own Capture the Flag war gaming system.
An earlier version appears in Working Notes of the 2000 AAAI Fall Symposium on Simulating Human Agents. Pages 39-45.
Peer Reviewed Workshops and Symposia:
Stochastic context-free grammars (SCFGs) are often used to represent the syntax of natural languages. Most algorithms for learning them require storage and repeated processing of a sentence corpus. The memory and computational demands of such algorithms are ill-suited for embedded agents such as a mobile robot. Two algorithms are presented that incrementally learn the parameters of stochastic context-free grammars as sentences are observed. Both algorithms require a fixed amount of space regardless of the number of sentence observations. Despite using less information than the inside-outside algorithm, the algorithms perform almost as well.
Another version of this paper is available as University of Massachusetts Technical Report 01-21
Erlang is a modern functional programming language with additional features that support explicit concurrency through lightweight processes and message passing. Erlang's clean semantics and lack of side effects make it possible to develop high quality software that can easily take advantage of the availability of multiple processors. Like most functional programming languages, Erlang provides a number of library routines (e.g. map) for processing and manipulating lists. These routines are an excellent target for the introduction of concurrent processing techniques. In this work we explore various methods for introducing concurrency into list processing routines such as map. We report on the results of our experiments using the Erlang Development Environment, and discuss which approaches to concurrency are most beneficial given the number of processing nodes available and the properties of the computation, such as the type of function being applied and the size of the input list (the chunking tradeoff is sketched below).
Winner of the Best Student Paper award.
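A minimal sketch of the chunking tradeoff discussed above. The paper's setting is Erlang processes and message passing; this Python stand-in (with a hypothetical `chunked_pmap` helper) only illustrates the idea: larger chunks amortize per-task overhead, while smaller chunks balance load when the applied function's cost varies across elements.

```python
from multiprocessing import Pool

def chunked_pmap(f, xs, workers=4, chunk=None):
    """Parallel map over a list, splitting the work into chunks.
    Note: f must be a top-level (picklable) function, and callers
    should run under an `if __name__ == "__main__":` guard."""
    if chunk is None:
        chunk = max(1, len(xs) // (4 * workers))  # a common default heuristic
    with Pool(workers) as pool:
        return pool.map(f, xs, chunksize=chunk)
```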
Technical Reports and Unpublished Documents:
We introduce several new models and methods for improving access to organized information. The first model, Constrained Subtree Selection (CSS), has applications in web site design and the reorganization of directory structures. Given a hierarchy represented as a rooted DAG G with n weighted leaves, one selects a subtree of the transitive closure of G that minimizes the expected path cost. Path cost is the sum of the degree costs along a path from the root to a leaf. Degree cost, γ, is a function of the out-degree of a node. We give a sufficient condition for γ that makes CSS NP-complete. This result holds even when the leaves have equal weight. Turning to algorithms, we give a polynomial time solution for instances of CSS where G does not constrain the choice of subtrees and γ favors nodes with at most k links. Even though CSS remains NP-hard for constant degree DAGs, we give an O(log(k)·γ(d+1)) approximation for any G with maximum degree d, provided that γ favors nodes with at most k links. Finally, we give a complete characterization of the optimal trees for two special cases: (1) linear degree cost in unconstrained graphs and uniform probability distributions, and (2) logarithmic degree cost in arbitrary DAGs and uniform probability distributions.

The second problem, Category Tree (CT), seeks a decision tree for categorical data where internal nodes are categories, edges are appropriate values for the categories, and leaves are data items. CT generalizes the well-studied Decision Tree (DT) problem. Our results resolve two open problems: we give a (ln n + 1)-approximation for DT and show that DT does not have a polynomial time approximation scheme unless P = NP. Our work, while providing the first non-trivial upper and lower bounds on approximating DT, also demonstrates that DT and a subtly different problem which also bears the name decision tree have fundamentally different approximation complexity.

We complement the above models with a new pruning method for k-nearest-neighbor queries on R-trees. We show that an extension to a popular depth-first 1-nearest-neighbor query results in a theoretically better search. We call this extension Promise-Pruning and construct a class of R-trees where its application reduces the search space exponentially.
We introduce the Constrained Subtree Selection (CSS) problem as a model for the optimal design of websites. Given a hierarchy of topics represented as a DAG G and a probability distribution over the topics, we select a subtree of the transitive closure of G which minimizes the expected path cost. We define path cost as the sum of the page costs along a path from the root to a leaf. Page cost, gamma, is a function of the number of links on a page. We give a sufficient condition for gamma which makes CSS NP-complete. This result holds even for the uniform probability distribution. We give a polynomial time algorithm for instances of CSS where G does not constrain the choice of subtrees and gamma favors pages with at most k links. We show that CSS remains NP-hard for constant degree DAGs and a class of degree costs, and give an O(log(k)·gamma(d+1)) approximation for any gamma favoring pages with at most k links and for any G with maximum degree d. We give complete characterizations of the optimal trees for the linear degree cost in unconstrained graphs and uniform probability distributions, and the logarithmic degree cost in arbitrary DAGs and uniform probability distributions.
We are interested in how robots might learn language from exposure to utterances and sensory information about the physical contexts in which they occur. Although stochastic context free grammars are often used to represent the syntactic structure of natural languages, most algorithms for learning them from data require repeated processing of a corpus of sentences. The memory and computational demands of such algorithms are ill-suited for embedded agents. We present an online algorithm that uses summary statistics computed from sentences in combination with repeated sampling techniques to learn the parameters of stochastic context free grammars. Empirical results demonstrate that the algorithm performs just as well as the Inside-Outside algorithm despite the fact that it has access to less information about the training data.