5.2.1 Co-Training
The co-training paradigm applies when accurate classification hypotheses for a task can be learned from either of two sets of features of the data, each called a view. The intuition behind Blum and Mitchell’s co-training algorithm [Blum & Mitchell (1998)] is that two views of the data can be used to train two classifiers that can help each other. Each classifier is trained using one view of the labeled data, and then used to predicts labels for unlabeled examples. The most confident predictions of each classifier are selected and adding the corresponding examples with their predicted labels to the other’s available training data. According to Blum and Mitchell [Blum & Mitchell (1998)], the condition for co-training algorithm works exactly is that the two views have to satisfy the two assumptions: firstly each view itself is sufficient to label task (or is sufficient to build a good classifier); the second is independent assumption of the two views. In the original definition of co-training, Blum and Mitchell [Blum & Mitchell (1998)] state conditional independence of the views as a required criterion for co-training to work.
In particularly, suppose that each example is represented by a feature vector x drawn from a set of possible values (an instance space) X. The task is to learn a classification function f :X →Y where Y is a set of possible labels. The characteristics of co-training can be described as follows.
• The features can be separated into two types: X = X1 ×X2 where X1 and X2 correspond to two different views of an example. In the named entity task, X1 might be the instance space for the spelling features, X2 might be the instance space for the contextual features. By this assumption, each elementx∈X can also be represented as (xl, x2)∈X1×X2.
• Each view of the example is sufficient for classification. That is, there exist functions fl and f2 such that for any example x= (xl, x2),f(x) =fl(Xl) =f2(x2). We never see an examplex= (xl, x2) in training or test data such that fl(xl)6=f2(x2).
Thus the method makes the fairly strong assumption that the features can be par-titioned into two types such that each type alone is sufficient for classification (xl and x2 are not correlated too tightly). Now assume we have n pairs (xl,i, x2,i) drawn from X1xX2, where the first m pairs have labels yi, whereas for i=m+ 1, . . . , n the pairs are unlabeled. In a fully supervised setting, the task is to learn a function f such that for all i = 1, . . . , m, f(xl,i, x2,i) = yi. In the co-training case, [Blum & Mitchell (1998)] argues that the task should be to induce functions fl and f2 such that:
1. f1(x1,i) = f2(x2,i) = yi, for i= 1, . . . , m 2. f1(x1,i) = f2(x2,i), for i=m+ 1, . . . , n
So fl and f2 must (1) correctly classify the labeled examples, and (2) must agree with each other on the unlabeled examples. The key point is that the second con-straint can be remarkably powerful in reducing the complexity of the learning problem.
[Blum & Mitchell (1998)] gives an example that illustrates just how powerful the second constraint can be. Consider the case where |X1|=|X2|=N and N is a “medium” sized
Unlabeled Data
Labeled Data
Classifier 1 Classifier 2 New labeled
examples
New labeled examples
Figure 5.5: A Scheme for Co-Training Algorithm
number so that it is feasible to collect O(N) unlabeled examples. Assume that the two classifiers are “rote learners”: that is,flandf2 are defined through look-up tables that list a label for each member ofX1 orX2. The problem is a binary classification problem. The problem can be represented as a graph with 2N vertices corresponding to the members of X1 and X2. Each unlabeled pair (xl,i, x2,i) is represented as an edge between nodes corresponding to xl,i and x2,i in the graph. An edge indicates that the two features must have the same label. Given a sufficient number of randomly drawn unlabeled examples (i.e., edges), we will induce two completely connected components that together span the entire graph. Each vertex within a connected component must have the same label – in the binary classification case, we need a single labeled example to identify which component should get which label.
The original co-training algorithm [Blum & Mitchell (1998)] is presented in Algorithm 4, and a corresponding scheme is presented in Fig. 5.5.
Algorithm 4– The Original Co-Training Algorithm
Input: a set L of labeled training examples; a set U of unlabeled examples; K is the number of iteration;
1: Create a pool U0 of examples by choosingu examples at random from U
2: k ←0
3: repeat
4: k ←k+ 1
5: use L to train a classifierh1 that considers only the x1 portion of x
6: use L to train a classifierh2 that considers only the x2 portion of x
7: allow h1 to label ppositive and n negative examples fromU0
8: allow h2 to label ppositive and n negative examples fromU0
9: add these self-labeled examples to L
10: randomly choose 2p+ 2n examples from U to replenishU0
11: until k > K
5.2.2 Self-Training
Self-training algorithm, also called Yarowsky algorithm [Abney (2004)] is one of the first bootstrapping algorithms to become widely known in computational linguistics. This algorithm, in brief, consists of two loops. The “inner loop” orbase learner is a supervised learning algorithm. Specifically, [Yarowsky (1995)] uses a simple decision list learner that considers rules of the form, “If instance x contains feature f, then predict label j”, and selects those rules whose precision on the training data is highest. The “outer loop” is given a seed set of rules to start with. In each iteration, it uses the current set of rules to assign labels to unlabeled data. It selects those instances on which the base learners predictions are most confident, and constructs a labeled training set from them. It then calls the inner loop to construct a new classifier (that is, a new set of rules), and the cycle repeats. The Algorithm presented in 3 can be considered as the conventional self-training algorithm.
5.2.3 Comparison between Self-training and Co-training
As discussed in [Abney (2004)], co-training [Blum & Mitchell (1998)] has subsequently become more popular, perhaps in part because it has proven amenable to theoretical analysis, in contrast to the Yarowsky algorithm, which is as yet mathematically poorly understood. The Yarowsky algorithm does have the advantage of placing less restriction on the data sets it can be applied to. Co-training requires data attributes to be separable into two views that are conditionally independent given the target label; the Yarowsky algorithm makes no such assumption about its data.
Both self-training and co-training are bootstrapping algorithms and their mission is to extend the original set of labeled examples. Under our opinion, the effect of a boot-strapping algorithm is depend on the quantity of added labeled examples which can be evaluated via two criteria: the accuracy of added examples; and how much of informa-tion will be achieved from new labeled examples. In our considerainforma-tion, we regard the difference between self-training and co-training at the important point: self-training uses only one view while co-training using two views for representing data. The easy way to understand these views is that each view can be considered as a set of features. The view of self-training is the set containing all features and this set will split into two distinct sets which represent for the two views of co-training. Now, we discuss about these two criteria under the implementation of co-training and self-training algorithms.
• For the first criterion, it is natural that the more prior information the classifier contains, the more confidence of its detection is. Co-training algorithm uses two views with one of assumptions that each view its self is sufficient for detecting labels for new examples. But this assumption is impractical, and it is easy to see that even the set of all features is also lack of information for the labelling task.
Therefore, for this criterion, self-training seems to more confident than co-training.
Note that this problem concerns the problem P2 as presented before.
• For the second criterion, at first we have to answer the question which information can be brought from new labeled examples? One kind of added information which is easy to see is information of new features. Under this consideration, if the assump-tions of co-training is satisfied then it seems to be more useful than self-training
because the two classifier trained on the two views will provide information for each other.