Self-Training and Co-Training - JAIST Repository: A Study of Classifier Combination and Semi-Su

5.2.1 Co-Training

The co-training paradigm applies when accurate classification hypotheses for a task can be learned from either of two sets of features of the data, each called a view. The intuition behind Blum and Mitchell’s co-training algorithm [Blum & Mitchell (1998)] is that two views of the data can be used to train two classifiers that can help each other. Each classifier is trained using one view of the labeled data, and then used to predicts labels for unlabeled examples. The most confident predictions of each classifier are selected and adding the corresponding examples with their predicted labels to the other’s available training data. According to Blum and Mitchell [Blum & Mitchell (1998)], the condition for co-training algorithm works exactly is that the two views have to satisfy the two assumptions: firstly each view itself is sufficient to label task (or is sufficient to build a good classifier); the second is independent assumption of the two views. In the original definition of co-training, Blum and Mitchell [Blum & Mitchell (1998)] state conditional independence of the views as a required criterion for co-training to work.

In particularly, suppose that each example is represented by a feature vector x drawn from a set of possible values (an instance space) X. The task is to learn a classification function f :X →Y where Y is a set of possible labels. The characteristics of co-training can be described as follows.

• The features can be separated into two types: X = X₁ ×X₂ where X₁ and X₂ correspond to two different views of an example. In the named entity task, X₁ might be the instance space for the spelling features, X2 might be the instance space for the contextual features. By this assumption, each elementx∈X can also be represented as (x_l, x₂)∈X₁×X₂.

• Each view of the example is sufficient for classification. That is, there exist functions f_l and f₂ such that for any example x= (x_l, x₂),f(x) =f_l(X_l) =f₂(x₂). We never see an examplex= (x_l, x₂) in training or test data such that f_l(x_l)6=f₂(x₂).

Thus the method makes the fairly strong assumption that the features can be par-titioned into two types such that each type alone is sufficient for classification (x_l and x₂ are not correlated too tightly). Now assume we have n pairs (x_l,i, x_2,i) drawn from X₁xX₂, where the first m pairs have labels y_i, whereas for i=m+ 1, . . . , n the pairs are unlabeled. In a fully supervised setting, the task is to learn a function f such that for all i = 1, . . . , m, f(x_l,i, x_2,i) = y_i. In the co-training case, [Blum & Mitchell (1998)] argues that the task should be to induce functions f_l and f₂ such that:

1. f₁(x_1,i) = f₂(x_2,i) = y_i, for i= 1, . . . , m 2. f1(x1,i) = f2(x2,i), for i=m+ 1, . . . , n

So f_l and f₂ must (1) correctly classify the labeled examples, and (2) must agree with each other on the unlabeled examples. The key point is that the second con-straint can be remarkably powerful in reducing the complexity of the learning problem.

[Blum & Mitchell (1998)] gives an example that illustrates just how powerful the second constraint can be. Consider the case where |X₁|=|X₂|=N and N is a “medium” sized

Unlabeled Data

Labeled Data

Classifier 1 Classifier 2 New labeled

examples

New labeled examples

Figure 5.5: A Scheme for Co-Training Algorithm

number so that it is feasible to collect O(N) unlabeled examples. Assume that the two classifiers are “rote learners”: that is,f_landf₂ are defined through look-up tables that list a label for each member ofX1 orX2. The problem is a binary classification problem. The problem can be represented as a graph with 2N vertices corresponding to the members of X₁ and X₂. Each unlabeled pair (x_l,i, x_2,i) is represented as an edge between nodes corresponding to xl,i and x2,i in the graph. An edge indicates that the two features must have the same label. Given a sufficient number of randomly drawn unlabeled examples (i.e., edges), we will induce two completely connected components that together span the entire graph. Each vertex within a connected component must have the same label – in the binary classification case, we need a single labeled example to identify which component should get which label.

The original co-training algorithm [Blum & Mitchell (1998)] is presented in Algorithm 4, and a corresponding scheme is presented in Fig. 5.5.

Algorithm 4– The Original Co-Training Algorithm

Input: a set L of labeled training examples; a set U of unlabeled examples; K is the number of iteration;

1: Create a pool U⁰ of examples by choosingu examples at random from U

2: k ←0

3: repeat

4: k ←k+ 1

5: use L to train a classifierh₁ that considers only the x₁ portion of x

6: use L to train a classifierh₂ that considers only the x₂ portion of x

7: allow h₁ to label ppositive and n negative examples fromU⁰

8: allow h₂ to label ppositive and n negative examples fromU⁰

9: add these self-labeled examples to L

10: randomly choose 2p+ 2n examples from U to replenishU⁰

11: until k > K

5.2.2 Self-Training

Self-training algorithm, also called Yarowsky algorithm [Abney (2004)] is one of the first bootstrapping algorithms to become widely known in computational linguistics. This algorithm, in brief, consists of two loops. The “inner loop” orbase learner is a supervised learning algorithm. Specifically, [Yarowsky (1995)] uses a simple decision list learner that considers rules of the form, “If instance x contains feature f, then predict label j”, and selects those rules whose precision on the training data is highest. The “outer loop” is given a seed set of rules to start with. In each iteration, it uses the current set of rules to assign labels to unlabeled data. It selects those instances on which the base learners predictions are most confident, and constructs a labeled training set from them. It then calls the inner loop to construct a new classifier (that is, a new set of rules), and the cycle repeats. The Algorithm presented in 3 can be considered as the conventional self-training algorithm.

5.2.3 Comparison between Self-training and Co-training

As discussed in [Abney (2004)], co-training [Blum & Mitchell (1998)] has subsequently become more popular, perhaps in part because it has proven amenable to theoretical analysis, in contrast to the Yarowsky algorithm, which is as yet mathematically poorly understood. The Yarowsky algorithm does have the advantage of placing less restriction on the data sets it can be applied to. Co-training requires data attributes to be separable into two views that are conditionally independent given the target label; the Yarowsky algorithm makes no such assumption about its data.

Both self-training and co-training are bootstrapping algorithms and their mission is to extend the original set of labeled examples. Under our opinion, the effect of a boot-strapping algorithm is depend on the quantity of added labeled examples which can be evaluated via two criteria: the accuracy of added examples; and how much of informa-tion will be achieved from new labeled examples. In our considerainforma-tion, we regard the difference between self-training and co-training at the important point: self-training uses only one view while co-training using two views for representing data. The easy way to understand these views is that each view can be considered as a set of features. The view of self-training is the set containing all features and this set will split into two distinct sets which represent for the two views of co-training. Now, we discuss about these two criteria under the implementation of co-training and self-training algorithms.

• For the first criterion, it is natural that the more prior information the classifier contains, the more confidence of its detection is. Co-training algorithm uses two views with one of assumptions that each view its self is sufficient for detecting labels for new examples. But this assumption is impractical, and it is easy to see that even the set of all features is also lack of information for the labelling task.

Therefore, for this criterion, self-training seems to more confident than co-training.

Note that this problem concerns the problem P₂ as presented before.

• For the second criterion, at first we have to answer the question which information can be brought from new labeled examples? One kind of added information which is easy to see is information of new features. Under this consideration, if the assump-tions of co-training is satisfied then it seems to be more useful than self-training

because the two classifier trained on the two views will provide information for each other.

ドキュメント内 JAIST Repository: A Study of Classifier Combination and Semi-Supervised Learning for Word Sense Disambiguation (ページ 84-87)