This software package provides a toolbox for testing pool-based
active-learning algorithms in MATLAB.
Active Learning
Specifically, we consider the following scenario. There is a pool of
datapoints $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$. We may
successively select a set of points $X \subseteq \mathcal{X}$ to
observe. Each observation reveals a discrete, integer-valued label
$y \in \{1, 2, \ldots, C\}$ for the chosen point $x$. This labeling
process might be nondeterministic; we might choose the same point $x$
twice and observe different labels each time. In active learning, we
typically assume we have a budget $B$ that limits the number of
points we may observe.
Our goal is to iteratively build a set of observations
$\mathcal{D} = \{(x_i, y_i)\}$ that achieves some goal in an
efficient manner. One typical goal is that this training set allows
us to accurately predict the labels on the unobserved points. Assume
we have a probabilistic model $p(y \mid x, \mathcal{D})$ and let
$\mathcal{U}$ represent the set of unobserved points. We might wish
to minimize either the 0/1 loss on the unlabeled points

$$\sum_{x \in \mathcal{U}} \mathbf{1}\bigl(\hat{y}(x) \neq y(x)\bigr),$$

where $\hat{y}(x) = \operatorname{arg\,max}_y \, p(y \mid x, \mathcal{D})$,
or the log loss:

$$-\sum_{x \in \mathcal{U}} \log p\bigl(y(x) \mid x, \mathcal{D}\bigr).$$
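Both losses are straightforward to compute given a model's
predictions. In the following sketch, `probs` is a hypothetical
(u x C) matrix holding $p(y \mid x, \mathcal{D})$ for each of the u
unobserved points, and `y` is a (u x 1) vector of their true labels;
both names are assumptions for illustration only:

```matlab
% illustrative sketch; probs (u x C) and y (u x 1) are assumed inputs
[~, y_hat]    = max(probs, [], 2);                       % MAP predictions
zero_one_loss = nnz(y_hat ~= y);                         % number of mistakes
true_label_p  = probs(sub2ind(size(probs), (1:numel(y))', y));
log_loss      = -sum(log(true_label_p));                 % log loss
```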
We could sample a random set of points, but by careful
consideration of our observation locations, we hope we can do
significantly better than this. One common active learning strategy,
known as uncertainty sampling, iteratively chooses to make an
observation at the point with the largest marginal entropy given the
current data:

$$x^* = \operatorname{arg\,max}_{x \in \mathcal{U}} \, H[y \mid x, \mathcal{D}],$$

with the hope that these queries can better map out the boundaries
between classes.
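For intuition, that selection rule is a two-liner; here `probs` is
again a hypothetical matrix of marginal class probabilities for the
currently unobserved points:

```matlab
% illustrative sketch; probs (u x C) is an assumed input
H = -sum(probs .* log(max(probs, realmin)), 2);  % marginal entropy per point
[~, query] = max(H);                             % most-uncertain point
```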
Of course, there are countless goals besides minimizing generalization
error and numerous other strategies besides the highly myopic
uncertainty sampling. Indeed, many active learning scenarios might not
involve probability models at all. Providing a highly adaptable and
extensible toolbox for conducting arbitrary pool-based active learning
experiments is the goal of this project.
Using this Toolbox
The most important function is `active_learning`, which simulates an
active learning experiment using the following procedure:
```
Given: initially labeled points X,
       corresponding labels Y,
       budget B

for i = 1:B
  % find points available for labeling
  eligible_points = selector(X, Y)

  % decide on point(s) to observe
  x_star = query_strategy(X, Y, eligible_points)

  % observe point(s)
  y_star = label_oracle(x_star)

  % add observation(s) to training set
  X = [X, x_star]
  Y = [Y, y_star]
end
```
The implementation supports user-specified:

* Selectors, which, given the current training set, return a set of
  points currently eligible for labeling. See selectors.m for usage
  and available implementations.
* Query strategies, which, given a training set and the selected
  eligible points, decide which point(s) to observe next. Note that a
  query strategy can return multiple points, allowing for batch
  observations. See query_strategies.m for usage and available
  implementations.
* Label oracles, which, given a set of points, return a set of
  corresponding labels. Label oracles may optionally be
  nondeterministic (see, for example, bernoulli_oracle). See
  label_oracles.m for usage and available implementations.
Each of these is provided as a function handle satisfying a desired
API, described below.
This function also supports arbitrary user-specified callbacks
called after each round of the experiment. This can be useful, for
example, for plotting the progress of the algorithm and/or printing
statistics such as test error online.
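As a hypothetical example (the exact callback signature is an
assumption here, chosen to mirror the inputs passed to the other
components), a callback that simply reports progress might look like:

```matlab
% hypothetical callback; the argument list is an assumption
callback = @(problem, train_ind, observed_labels) ...
    fprintf('%d points labeled so far\n', numel(train_ind));
```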
Selectors
A selector considers the current labeled dataset and indicates which
of the unlabeled points should be considered for observation at this
time.
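As a minimal sketch, a selector that marks every not-yet-labeled
point as eligible might look like the following; the argument list is
an assumption chosen to mirror the inputs documented below (see
selectors.m for the authoritative interface):

```matlab
% minimal sketch; the interface is assumed, not authoritative
selector = @(problem, train_ind, observed_labels) ...
    setdiff((1:size(problem.points, 1))', train_ind);
```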
Query Strategies

A query strategy considers the current labeled dataset and the
eligible points chosen by the selector and decides which point(s) to
observe next. Query strategies must satisfy the following interface:

```matlab
query_ind = query_strategy(problem, train_ind, observed_labels, test_ind)
```

Inputs:

* problem: a struct describing the problem, containing fields:
  * `points`: an (n x d) data matrix for the available points
  * num_classes: the number of classes
  * num_queries: the number of queries to make
* train_ind: a list of indices into problem.points indicating the
  thus-far observed points
* observed_labels: a list of labels corresponding to the
  observations in train_ind
* test_ind: a list of indices into problem.points indicating the
  points eligible for observation

Output:

* query_ind: an index into problem.points indicating the point(s)
  to query next (every entry in query_ind will always be a member
  of the set of points in test_ind)
The following query strategies are provided in this toolbox:

* argmax: samples the point(s) maximizing a given score function
* argmin: samples the point(s) minimizing a given score function
* expected_error_reduction: samples the point giving lowest
  expected loss on unlabeled points
* margin_sampling: samples the point with the smallest margin
* query_by_committee: samples the point with the highest disagreement
  between models
* uncertainty_sampling: samples the most uncertain point
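To make the interface concrete, here is a hedged sketch of an
uncertainty-sampling-style query strategy. Here `my_model` stands in
for any probabilistic classifier returning one row of class
probabilities per eligible point; it is a hypothetical helper, not
part of this toolbox:

```matlab
function query_ind = toy_uncertainty_sampling(problem, train_ind, ...
          observed_labels, test_ind)

  % my_model is a hypothetical stand-in returning a
  % (numel(test_ind) x problem.num_classes) probability matrix
  probs = my_model(problem, train_ind, observed_labels, test_ind);

  % marginal entropy of each eligible point (guarding against log(0))
  H = -sum(probs .* log(max(probs, realmin)), 2);

  % query the most-uncertain eligible point
  [~, best] = max(H);
  query_ind = test_ind(best);

end
```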
Label Oracles
Label oracles are functions that, given a set of points chosen to be
queried, return a set of corresponding labels. In general, they need
not be deterministic, which is especially interesting when points can
be queried multiple times.

Label oracles must satisfy the following interface:

```matlab
label = label_oracle(problem, query_ind)
```
Inputs:

* problem: a struct describing the problem, containing fields:
  * `points`: an (n x d) data matrix for the available points
  * num_classes: the number of classes
* query_ind: an index into problem.points specifying the point(s) to be
  queried

Output:

* label: a list of integers between 1 and problem.num_classes
  indicating the observed label(s)
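For example, a minimal deterministic oracle satisfying this interface
can simply index into a fixed vector of ground-truth labels; here
`y_true` is an assumed (n x 1) vector aligned with the rows of
problem.points. The provided lookup_oracle below packages this
pattern:

```matlab
% minimal sketch; y_true is an assumed (n x 1) ground-truth vector
label_oracle = @(problem, query_ind) y_true(query_ind);
```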
The following general-purpose label oracles are provided in this
toolbox:

* lookup_oracle: a trivial lookup-table label oracle given a fixed
  list of ground-truth labels
* bernoulli_oracle: a label oracle that, conditioned on the queried
  point(s), samples labels independently from a Bernoulli distribution
  with given success probability
* multinomial_oracle: a label oracle that, conditioned on the
  queried point(s), samples labels independently from a multinomial
  distribution with given success probabilities
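Putting the pieces together, a complete experiment might be wired up
as below. This is a hedged sketch: the handles follow the interfaces
described above, but the exact argument order of active_learning is
an assumption here and should be checked against the documentation in
active_learning.m.

```matlab
% assumed synthetic problem for illustration
problem             = struct();
problem.points      = randn(100, 2);          % 100 two-dimensional points
problem.num_classes = 2;
problem.num_queries = 10;                     % budget B

y_true = randi(problem.num_classes, 100, 1);  % synthetic ground truth

% handles satisfying the interfaces described above
selector       = @(problem, train_ind, observed_labels) ...
    setdiff((1:size(problem.points, 1))', train_ind);
query_strategy = @(problem, train_ind, observed_labels, test_ind) ...
    test_ind(randi(numel(test_ind)));         % random-sampling baseline
label_oracle   = @(problem, query_ind) y_true(query_ind);

% seed the experiment with one labeled point
train_ind       = 1;
observed_labels = y_true(train_ind);

% assumed calling convention; consult active_learning.m before use
[chosen_ind, chosen_labels] = active_learning(problem, train_ind, ...
    observed_labels, label_oracle, selector, query_strategy);
```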