8.2 Stronger Baseline DNN
8.2.2 Combining ELMo
ELMo [10] is one of the strongest SSL approaches in the research field. We therefore conducted an experiment with a baseline that uses ELMo. Specifically, we combined an LSTM with the ELMo embeddings; we refer to this model as ELMo-LSTM (see footnote 6). The error rate of this network on the IMDB test set was 8.67%, which is worse than that of LM-LSTM reported in Table 3. This result suggests that, at least in this task setting, pre-training the RNN language model for initialization is more effective than using the ELMo embeddings.
Footnote 6: We used the implementation available in AllenNLP [38].
9 Conclusion
In this paper, we proposed a novel method for SSL, which we named Mixture of Expert/Imitator Networks (MEIN). The MEIN framework consists of a baseline DNN, i.e., an EXN, and several auxiliary networks, the IMNs. The unique property of our method is that the IMNs learn to "imitate" the estimated label distribution of the EXN over the unlabeled data while seeing only a limited view of the given input. In this way, the IMNs effectively learn a set of features that potentially contribute to improving the classification performance of the EXN.
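The imitation objective described above can be sketched numerically. The following is a minimal illustration, not the thesis implementation: it assumes the IMN is trained to minimize the KL divergence from the EXN's predicted label distribution to the IMN's own, with the EXN output treated as a fixed target. The function names (`softmax`, `kl_divergence`, `imitation_loss`), the direction of the divergence, and the toy logits are all assumptions made for this example.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for each row of two arrays of class distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)


def imitation_loss(exn_logits, imn_logits):
    """Mean KL(p_EXN || p_IMN) over a batch of unlabeled examples.

    Assumption: the EXN distribution is a fixed target, so in a real
    training loop no gradient would flow back into the EXN.
    """
    p_exn = softmax(exn_logits)
    p_imn = softmax(imn_logits)
    return float(np.mean(kl_divergence(p_exn, p_imn)))


# Toy batch: two unlabeled examples, two classes (hypothetical values).
exn_logits = np.array([[2.0, -1.0], [0.5, 0.5]])
imn_logits = np.array([[-1.0, 2.0], [0.5, 0.5]])

loss_same = imitation_loss(exn_logits, exn_logits)  # perfect imitation
loss_diff = imitation_loss(exn_logits, imn_logits)  # IMN disagrees on example 1
print(loss_same, loss_diff)
```

The loss is zero when the IMN reproduces the EXN's distribution exactly and grows as the two distributions diverge, which is the sense in which the IMN "imitates" the EXN.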
Experiments on text classification datasets demonstrated that the MEIN framework consistently improved the performance of three distinct settings of the EXN. We also trained the IMNs with extra large-scale unlabeled data and achieved a new state-of-the-art result. This result indicates that our method has the "more data, better performance" property. Furthermore, our method operates eight times faster than the current strongest SSL method (VAT), and thus it scales promisingly with the amount of unlabeled data.
Acknowledgements
I am deeply grateful to Dr. Kentaro Inui and Dr. Jun Suzuki for able guidance and generous support. I would like to show my greatest appreciation to Dr. Sho Takase for insightful comments and constructive suggestions. Discussions with my academic colleagues in our laboratory have been quite illuminating.
References
[1] Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial Training Methods For Semi-Supervised Text Classification. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.
[2] Motoki Sato, Jun Suzuki, Hiroyuki Shindo, and Yuji Matsumoto. Interpretable Adversarial Perturbation in Input Embedding Space for Text. In Proceedings of the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence (IJCAI-ECAI 2018), pages 4323–4330, 2018.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), pages 770–778, 2016.
[4] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pages 173–182, 2016.
[5] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144, 2016.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pages 248–255, 2009.
[7] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 3111–3119, 2013.
[8] Andrew M Dai and Quoc V Le. Semi-supervised Sequence Learning. In Advances in Neural Information Processing Systems 28 (NIPS 2015), pages 3079–3087, 2015.
[9] Kevin Clark, Thang Luong, and Quoc V. Le. Cross-View Training for Semi-Supervised Learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), 2018.
[10] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2018), pages 2227–2237, 2018.
[11] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1532–1543, 2014.
[12] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the Limits of Language Modeling. arXiv preprint arXiv:1602.02410, 2016.
[13] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive Mixtures of Local Experts. Neural Computation, 3(1):79–87, 1991.
[14] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.
[15] Jimmy Ba and Rich Caruana. Do Deep Nets Really Need to be Deep? In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 2654–2662, 2014.
[16] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the Knowledge in a Neural Network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
[17] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in Translation: Contextualized Word Vectors. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 6294–6305, 2017.
[18] Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised Sequence Tagging with Bidirectional Language Models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pages 1756–1765, 2017.
[19] Jun Suzuki and Hideki Isozaki. Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL 2008), pages 665–673, 2008.
[20] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. In INTERSPEECH, pages 2635–2639, 2014.
[21] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
[22] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), pages 315–323, 2011.
[23] Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), 2015.
[24] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional Sequence to Sequence Learning. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pages 1243–1252, 2017.
[25] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), pages 655–665, 2014.
[26] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In ICML Workshop on Deep Learning for Audio, Speech, and Language Processing, 2013.
[27] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011), pages 142–150, 2011.
[28] Rie Johnson and Tong Zhang. Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding. In Advances in Neural Information Processing Systems 28 (NIPS 2015), pages 919–927, 2015.
[29] Bo Pang and Lillian Lee. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pages 115–124, 2005.
[30] Julian McAuley and Jure Leskovec. Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text. In Proceedings of the 7th ACM conference on Recommender systems (RecSys 2013), pages 165–172, 2013.
[31] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361–397, 2004.
[32] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015.
[33] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
[34] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and Harnessing Adversarial Examples. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), 2015.
[35] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 1715–1725, 2016.
[36] Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pages 66–71, 2018.
[37] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-Recurrent Neural Networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.
[38] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. AllenNLP: A Deep Semantic Natural Language Processing Platform. 2017.
Appendix
A Notation Rules and Tables
This paper uses the following notation rules:
1. A calligraphic letter represents a mathematical set (e.g., $\mathcal{D}_s$ denotes a set of labeled training data).
2. A bold capital letter represents a (two-dimensional) matrix (e.g., $\mathbf{W}$ denotes a trainable matrix).
3. A bold lowercase letter represents a (one-dimensional) vector (e.g., $\mathbf{x}$ is a one-hot vector).
4. A non-bold capital letter represents a fixed scalar value (e.g., $H$ denotes the LSTM hidden state dimension).
5. A non-bold lowercase letter represents a scalar variable (e.g., $t$ denotes a scalar time step).
6. A bold capital Greek letter represents a set of (trainable) parameters (e.g., $\boldsymbol{\Theta}$ denotes the set of parameters of the EXN).
7. A non-bold Greek letter represents a scalar (trainable) parameter (e.g., $\lambda_i$ denotes a scalar trainable parameter).
8. $(\mathbf{x}_t)_{t=1}^{T}$ is a short notation for $(\mathbf{x}_1, \dots, \mathbf{x}_T)$.
9. $\mathbf{x}[i]$ represents the $i$-th element of the vector $\mathbf{x}$.
10. $\mathbf{X}_{a:b}$ represents an operation that slices a sequence of vectors $(\mathbf{x}_a, \mathbf{x}_{a+1}, \dots, \mathbf{x}_{b-1}, \mathbf{x}_b)$ from the matrix $\mathbf{X}$.
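Rules 8 to 10 map directly onto array indexing and slicing. The sketch below is only an illustration under two assumptions that the rules do not state explicitly: indices are 1-based and inclusive (as the sequence x_1, ..., x_T suggests), and the one-hot vectors are stacked as the rows of X. The helper `slice_inclusive`, the vocabulary size, and the token IDs are hypothetical names and values introduced for this example.

```python
import numpy as np


def slice_inclusive(X, a, b):
    """Return the rows x_a, ..., x_b of X using 1-based, inclusive indices."""
    if not (1 <= a <= b <= X.shape[0]):
        raise ValueError(f"invalid span ({a}, {b}) for {X.shape[0]} rows")
    # Python slices are 0-based and half-open, so shift only the start index.
    return X[a - 1 : b]


# A toy sequence of T = 5 tokens over a vocabulary of 4 types.
vocab_size = 4
token_ids = np.array([2, 0, 3, 1, 2])

# Rule 8: (x_t) for t = 1..T, stored as the rows of X (one-hot vectors).
X = np.eye(vocab_size, dtype=int)[token_ids]

# Rule 9: x[i] is a single element of one vector (here: first token, third entry).
first_token = X[0]
element = first_token[2]

# Rule 10: X_{2:4} selects x_2, x_3, x_4.
window = slice_inclusive(X, 2, 4)
print(window.shape, window.argmax(axis=1))
```

Keeping the index conversion inside one helper avoids off-by-one errors when the paper's inclusive spans are translated into half-open Python slices.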
A set of notations used in this paper is summarized in Table 6.
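| Symbol | Description | Symbol | Description |
|---|---|---|---|
| $\mathbf{X}$ | sequence of one-hot vectors $\mathbf{x}$ | $p$ | probability |
| $\mathbf{x}$ | a one-hot vector representing a single token (word) | $\mathbf{W}_h$ | weight matrix for the LSTM hidden state $\mathbf{h}_T$ |
| $\mathcal{Y}$ | set of output classes | $\alpha_i$ | $i$-th IMN logit |
| $y$ | scalar class ID of the output | $L$ | loss function |
| $\lambda_i$ | coefficient of the $i$-th IMN's logit | KL | KL-divergence function |
| $\sigma$ | sigmoid function | $\mathbf{w}_y$ | weight vector of the softmax classifier for class $y$ |
| $t$ | variable for time step | $\boldsymbol{\Phi}$ | set of parameters of the IMN |
| $T$ | time (typically denotes the sequence length) | $\boldsymbol{\Theta}$ | set of parameters of the EXN |
| $\mathcal{V}$ | vocabulary of the baseline DNN | $\boldsymbol{\Lambda}$ | set of parameters for combining the EXN and IMN(s) |
| $\mathcal{V}'$ | vocabulary of the IMN | $J$ | number of outputs from a single IMN |
| $\mathcal{D}_s$ | set of labeled training data | $c_i$ | window size of the $i$-th IMN |
| $\mathcal{D}_u$ | set of unlabeled training data | $\mathbf{0}$ | concatenation of zero vector |
| $I$ | number of IMNs | $\mathbf{s}$ | MLP final hidden state |
| $i$ | general variable | $\mathbf{o}_j$ | hidden state of the IMN |
| $j$ | general variable | $a$ | start index of the sliding window |
| $\mathbf{E}$ | word embedding matrix | $b$ | end index of the sliding window |
| $\mathbf{h}_i$ | LSTM hidden state | $H$ | LSTM hidden state dimension |
| $D$ | word embedding dimension of expert | $N$ | CNN kernel dimension |
| $M$ | MLP final hidden state dimension | $z_y$ | the EXN logit |
| $\mathbf{b}_h$ | a vector bias term for the LSTM hidden state $\mathbf{h}_T$ | $z'_y$ | EXN+IMN logit |
| $b_y$ | a scalar bias term for class $y$ | - | - |

Table 6: Notation Table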
[Figure 5: bar charts in four panels (Elec, IMDB, Rotten Tomatoes, RCV1); x-axis: Base, A, B, C, D; y-axis: Error Rate (%). The bar labels are listed below.]

| Dataset | Base | A | B | C | D |
|---|---|---|---|---|---|
| Elec | 10.09 | 9.82 | 9.39 | 9.11 | 8.83 |
| IMDB | 10.98 | 10.67 | 10.58 | 10.31 | 10.04 |
| Rotten Tomatoes | 26.47 | 26.80 | 25.81 | 25.72 | 24.93 |
| RCV1 | 14.14 | 13.84 | 13.23 | 12.66 | 12.31 |

Figure 5: Effect of the IMN with different window sizes $c_i$ on the final error rate (%) of LSTM. A lower error rate indicates better performance. Base: EXN (LSTM) without the IMN, A: $c_i$ = 1, B: $c_i$ = 1, 2, C: $c_i$ = 1, 2, 3, D: $c_i$ = 1, 2, 3, 4.
B Effect of Window Size of the IMN
Following Section 7.3, we investigated the effect of combining the IMNs with different window sizes ($c_i$) on the final error rate (%) of the EXN. We carried out experiments for both LSTM+IMN (Figure 5) and LM-LSTM+IMN (Figure 6). The results are consistent with those of ADV-LM-LSTM+IMN (Figure 4): a greater window size improves the performance.
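To make the size of this effect concrete, the following sketch computes the absolute and relative error reduction from Base to setting D for LSTM+IMN. The error rates are the bar labels of Figure 5, read in the order Base, A, B, C, D, which is an assumption about the figure layout; the variable and function names are illustrative only.

```python
# Error rates (%) for LSTM+IMN, taken from the bar labels of Figure 5.
# Assumption: each list is ordered Base, A, B, C, D.
FIGURE5_ERROR_RATES = {
    "Elec": [10.09, 9.82, 9.39, 9.11, 8.83],
    "IMDB": [10.98, 10.67, 10.58, 10.31, 10.04],
    "Rotten Tomatoes": [26.47, 26.80, 25.81, 25.72, 24.93],
    "RCV1": [14.14, 13.84, 13.23, 12.66, 12.31],
}


def error_reduction(rates):
    """Return (absolute, relative %) reduction from Base to setting D."""
    base, setting_d = rates[0], rates[-1]
    absolute = base - setting_d
    relative = 100.0 * absolute / base
    return absolute, relative


reductions = {
    name: error_reduction(rates) for name, rates in FIGURE5_ERROR_RATES.items()
}

for name, (absolute, relative) in reductions.items():
    print(f"{name}: {absolute:.2f} points lower ({relative:.1f}% relative)")
```

Under this reading of the figure, setting D lowers the error rate relative to Base on all four datasets.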
[Figure 6: bar charts in four panels (Elec, IMDB, Rotten Tomatoes, RCV1); x-axis: Base, A, B, C, D; y-axis: Error Rate (%). The bar labels are listed below.]

| Dataset | Base | A | B | C | D |
|---|---|---|---|---|---|
| Elec | 5.72 | 5.64 | 5.60 | 5.57 | 5.48 |
| IMDB | 7.25 | 6.91 | 6.75 | 6.55 | 6.51 |
| Rotten Tomatoes | 16.80 | 16.76 | 16.21 | 16.14 | 15.91 |
| RCV1 | 8.37 | 7.71 | 7.64 | 7.56 | 7.53 |

Figure 6: Effect of the IMN with different window sizes $c_i$ on the final error rate (%) of LM-LSTM. A lower error rate indicates better performance. Base: EXN (LM-LSTM) without the IMN, A: $c_i$ = 1, B: $c_i$ = 1, 2, C: $c_i$ = 1, 2, 3, D: $c_i$ = 1, 2, 3, 4.
List of Publications
Awards
1. The 24th Annual Meeting of the Association for Natural Language Processing (NLP2018), Excellence Award (優秀賞)
International Conference Papers
1. Shun Kiyono, Jun Suzuki, and Kentaro Inui. 2019. Mixture of Expert/Imitator Network: Scalable Semi-supervised Learning Framework (to appear). In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019). January.
2. Shun Kiyono, Sho Takase, Jun Suzuki, Naoaki Okazaki, Kentaro Inui, and Masaaki Nagata. 2018. Reducing Odd Generation from Neural Headline Generation (to appear). In 32nd Pacific Asia Conference on Language, Information and Computation (PACLIC 32). December.
3. Shun Kiyono, Sho Takase, Jun Suzuki, Naoaki Okazaki, Kentaro Inui, and Masaaki Nagata. 2018. Unsupervised Token-wise Alignment to Improve Interpretation of Encoder-Decoder Models. In Analyzing and Interpreting Neural Networks for NLP (EMNLP 2018 Workshop), pages 74–81. November.
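Domestic Conference Papers
1. Ryo Fujii, Shun Kiyono, Jun Suzuki, and Kentaro Inui. 2019. Selective Use of Contextual Information in Neural Machine Translation (ニューラル機械翻訳における文脈情報の選択的利用) (to appear). In Proceedings of the 25th Annual Meeting of the Association for Natural Language Processing. March.
2. Ryuto Konno, Yuichiroh Matsubayashi, Hiroki Ouchi, Shun Kiyono, and Kentaro Inui. 2019. Japanese Predicate-Argument Structure Analysis Using Embeddings of the Preceding Context (前方文脈の埋め込みを利用した日本語述語項構造解析) (to appear). In Proceedings of the 25th Annual Meeting of the Association for Natural Language Processing. March.
3. Kotaro Kitayama, Shun Kiyono, Jun Suzuki, and Kentaro Inui. 2019. A Comparative Study of Embedding Granularity Toward Building a Joint Image-Language Embedding Vector Space (画像言語同時埋め込みベクトル空間の構築に向けた埋め込み粒度の比較検討) (to appear). In Proceedings of the 25th Annual Meeting of the Association for Natural Language Processing. March.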