Learning from Insufficient Data: Active Learning


Ya Zhang and Wenbin Cai, Shanghai Jiao Tong University


In many real-world applications, unlabeled data are usually abundant whereas labeled data are scarce due to the high cost of data labeling. Data collection thus becomes increasingly important, especially in the era of big data. In practice, the widely adopted approach to data collection is passive learning, where training examples are randomly and independently drawn from an underlying distribution and manually annotated by human editors. However, because of the high effort associated with data annotation, there are often too few labeled examples to train a model of satisfactory quality. To address this problem, Active Learning (AL) has recently drawn a great deal of attention in the machine learning community. In this talk, we will discuss our work on active learning in three machine learning applications: (i) active learning for search ranking based on the idea of noise injection, (ii) active learning for classification based on the idea of maximum model change, and (iii) active learning for regression, covering both linear and nonlinear models. The talk will also introduce a general stopping criterion for active learning, an important problem in practical AL applications.