Combining ELMo

8.2 Stronger Baseline DNN

8.2.2 Combining ELMo

ELMo [10] is one of the strongest SSL approaches in the research field. We therefore conducted an experiment with a baseline that utilizes ELMo. Specifically, we combined the LSTM with the ELMo embeddings and refer to the resulting network as ELMo-LSTM⁶. The error rate of this network on the IMDB test set was 8.67%, which is worse than that of LM-LSTM reported in Table 3. This result suggests that, at least in this task setting, pre-training the RNN language model for initialization is more effective than using the ELMo embeddings.

⁶ We used the implementation available in AllenNLP [38].

9 Conclusion

In this paper, we proposed a novel method for SSL, which we named Mixture of Expert/Imitator Networks (MEIN). The MEIN framework consists of a baseline DNN, i.e., an EXN, and several auxiliary networks, i.e., IMNs. The unique property of our method is that the IMNs learn to “imitate” the estimated label distribution of the EXN over the unlabeled data with only a limited view of the given input. In this way, the IMNs effectively learn a set of features that potentially contributes to improving the classification performance of the EXN.
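The imitation objective summarized above can be illustrated numerically. The following is a minimal sketch, not the implementation used in this thesis: it assumes that an IMN is trained by minimizing the KL divergence between the label distribution estimated by the EXN and the IMN's own prediction on a single unlabeled example. The function names `softmax` and `kl_divergence` and all logit values are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical logits for one unlabeled example with two classes.
exn_logits = [2.0, -1.0]  # the EXN sees the whole input
imn_logits = [0.5, 0.1]   # the IMN sees only a limited window of the input

p_exn = softmax(exn_logits)  # target distribution (treated as fixed)
p_imn = softmax(imn_logits)  # distribution the IMN is trained to match

# The loss is zero only when the IMN reproduces the EXN's distribution.
loss = kl_divergence(p_exn, p_imn)
print(f"imitation loss = {loss:.4f}")
```

In this sketch the EXN's distribution is treated as a fixed target, so minimizing the loss would update only the IMN.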

Experiments on text classification datasets demonstrated that the MEIN framework consistently improved the performance of three distinct settings of the EXN. We also trained the IMNs with extra large-scale unlabeled data and achieved a new state-of-the-art result. This result indicates that our method has the “more data, better performance” property. Furthermore, our method operates eight times faster than the current strongest SSL method (VAT); thus, it scales promisingly with the amount of unlabeled data.

Acknowledgements

I am deeply grateful to Dr. Kentaro Inui and Dr. Jun Suzuki for their able guidance and generous support. I would like to express my greatest appreciation to Dr. Sho Takase for his insightful comments and constructive suggestions. Discussions with my academic colleagues in our laboratory have been quite illuminating.

References

[1] Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial Training Methods For Semi-Supervised Text Classification. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.

[2] Motoki Sato, Jun Suzuki, Hiroyuki Shindo, and Yuji Matsumoto. Interpretable Adversarial Perturbation in Input Embedding Space for Text. In Proceedings of the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence (IJCAI-ECAI 2018), pages 4323–4330, 2018.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), pages 770–778, 2016.

[4] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pages 173–182, 2016.

[5] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144, 2016.

[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pages 248–255, 2009.

[7] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 3111–3119, 2013.

[8] Andrew M Dai and Quoc V Le. Semi-supervised Sequence Learning. In Advances in Neural Information Processing Systems 28 (NIPS 2015), pages 3079–3087, 2015.

[9] Kevin Clark, Thang Luong, and Quoc V. Le. Cross-View Training for Semi-Supervised Learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), 2018.

[10] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2018), pages 2227–2237, 2018.

[11] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1532–1543, 2014.

[12] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the Limits of Language Modeling. arXiv preprint arXiv:1602.02410, 2016.

[13] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive Mixtures of Local Experts. Neural Computation, 3(1):79–87, 1991.

[14] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.

[15] Jimmy Ba and Rich Caruana. Do Deep Nets Really Need to be Deep? In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 2654–2662, 2014.

[16] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the Knowledge in a Neural Network. In NIPS Deep Learning and Representation Learning Workshop, 2015.

[17] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in Translation: Contextualized Word Vectors. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 6294–6305, 2017.

[18] Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised Sequence Tagging with Bidirectional Language Models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pages 1756–1765, 2017.

[19] Jun Suzuki and Hideki Isozaki. Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL 2008), pages 665–673, 2008.

[20] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. In INTERSPEECH, pages 2635–2639, 2014.

[21] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.

[22] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), pages 315–323, 2011.

[23] Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), 2015.

[24] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional Sequence to Sequence Learning. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pages 1243–1252, 2017.

[25] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), pages 655–665, 2014.

[26] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In ICML Workshop on Deep Learning for Audio, Speech, and Language Processing, 2013.

[27] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011), pages 142–150, 2011.

[28] Rie Johnson and Tong Zhang. Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding. In Advances in Neural Information Processing Systems 28 (NIPS 2015), pages 919–927, 2015.

[29] Bo Pang and Lillian Lee. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pages 115–124, 2005.

[30] Julian McAuley and Jure Leskovec. Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text. In Proceedings of the 7th ACM conference on Recommender systems (RecSys 2013), pages 165–172, 2013.

[31] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361–397, 2004.

[32] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. DBpedia–a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015.

[33] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.

[34] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and Harnessing Adversarial Examples. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), 2015.

[35] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 1715–1725, 2016.

[36] Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pages 66–71, 2018.

[37] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-Recurrent Neural Networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.

[38] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. AllenNLP: A Deep Semantic Natural Language Processing Platform. 2017.

Appendix

A Notation Rules and Tables

This paper uses the following notation rules:

1. A calligraphic letter represents a mathematical set (e.g., D_s denotes a set of labeled training data).

2. A bold capital letter represents a (two-dimensional) matrix (e.g., W denotes a trainable matrix).

3. A bold lowercase letter represents a (one-dimensional) vector (e.g., x is a one-hot vector).

4. A non-bold capital letter represents a fixed scalar value (e.g., H denotes the LSTM hidden state dimension).

5. A non-bold lowercase letter represents a scalar variable (e.g., t denotes a scalar time step).

6. A bold Greek capital letter represents a set of (trainable) parameters (e.g., Θ denotes the set of parameters of the EXN).

7. A non-bold Greek letter represents a scalar (trainable) parameter (e.g., λ_i denotes a scalar trainable parameter).

8. (x_t)_{t=1}^{T} is a short notation for (x_1, ..., x_T).

9. x[i] represents the i-th element of the vector x.

10. X_a:b represents an operation that slices a sequence of vectors (x_a, x_{a+1}, ..., x_{b-1}, x_b) from the matrix X.
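Because the slicing notation X_a:b includes both endpoints, it differs from the half-open slices of most programming languages. The following is a minimal sketch of that difference, assuming that sequence positions are 1-based as in (x_1, ..., x_T); the helper name `slice_inclusive` and the toy data are hypothetical.

```python
def slice_inclusive(X, a, b):
    """Return (x_a, ..., x_b) from X, using 1-based inclusive indices."""
    if not (1 <= a <= b <= len(X)):
        raise IndexError("expected 1 <= a <= b <= T")
    return X[a - 1:b]  # Python slices are 0-based and exclude the end

# A toy sequence of T = 4 one-hot vectors over a vocabulary of size 3.
X = [
    [1, 0, 0],  # x_1
    [0, 1, 0],  # x_2
    [0, 0, 1],  # x_3
    [1, 0, 0],  # x_4
]

window = slice_inclusive(X, 2, 3)  # X_2:3 = (x_2, x_3)
print(window)
```

Here X_2:3 selects the second and third vectors, whereas the Python expression `X[2:3]` alone would select only the third.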

A set of notations used in this paper is summarized in Table 6.

| Symbol | Description | Symbol | Description |
|---|---|---|---|
| X | sequence of one-hot vectors x | p | probability |
| x | a one-hot vector representing single token (word) | W_h | weight matrix for the LSTM hidden state h_T |
| Y | set of output classes | α_i | i-th IMN logit |
| y | scalar class ID of the output | L | loss function |
| λ_i | coefficient of i-th IMN’s logit | KL | KL-divergence function |
| σ | sigmoid function | w_y | weight vector of softmax classifier for class y |
| t | variable for time step | Φ | set of parameters of the IMN |
| T | time (typically denotes the sequence length) | Θ | set of parameters of the EXN |
| V | vocabulary of the baseline DNN | Λ | set of parameters for combining the EXN and IMN(s) |
| V′ | vocabulary of the IMN | J | number of outputs from a single IMN |
| D_s | set of labeled training data | c_i | window size of the i-th IMN |
| D_u | set of unlabeled training data | 0 | concatenation of zero vector |
| I | number of IMNs | s | MLP final hidden state |
| i | general variable | o_j | hidden state of the IMN |
| j | general variable | a | start-index of sliding window |
| E | word embedding matrix | b | end-index of sliding window |
| h_i | LSTM hidden state | H | LSTM hidden state dimension |
| D | word embedding dimension of expert | N | CNN kernel dimension |
| M | MLP final hidden state dimension | z_y | the EXN logit |
| b_h | a vector bias term for LSTM hidden state h_T | z′_y | EXN+IMN logit |
| b_y | a scalar bias term for class y | - | - |

Table 6: Notation Table

[Figure 5: four bar-chart panels (Elec, IMDB, Rotten Tomatoes, RCV1); x-axis: Base, A, B, C, D; y-axis: Error Rate (%). Bar labels as extracted, listed left to right:]

Elec: 10.09, 9.82, 9.39, 9.11, 8.83

IMDB: 10.98, 10.67, 10.58, 10.31, 10.04

Rotten Tomatoes: 26.47, 26.80, 25.81, 25.72, 24.93

RCV1: 14.14, 13.84, 13.23, 12.66, 12.31

Figure 5: Effect of the IMN with different window size c_i on the final error rate (%) of LSTM. A lower error rate indicates better performance. Base: EXN (LSTM) without the IMN, A: c_i = 1, B: c_i = 1,2, C: c_i = 1,2,3, D: c_i = 1,2,3,4

B Effect of Window Size of the IMN

Following Section 7.3, we investigated the effectiveness of combining the IMNs with different window sizes (c_i) on the final error rate (%) of the EXN. We carried out experiments for both LSTM+IMN (Figure 5) and LM-LSTM+IMN (Figure 6). The results are consistent with those of ADV-LM-LSTM+IMN (Figure 4): a greater window size improves the performance.

[Figure 6: four bar-chart panels (Elec, IMDB, Rotten Tomatoes, RCV1); x-axis: Base, A, B, C, D; y-axis: Error Rate (%). Bar labels as extracted, listed left to right:]

Elec: 5.72, 5.64, 5.60, 5.57, 5.48

IMDB: 7.25, 6.91, 6.75, 6.55, 6.51

Rotten Tomatoes: 16.80, 16.76, 16.21, 16.14, 15.91

RCV1: 8.37, 7.71, 7.64, 7.56, 7.53

Figure 6: Effect of the IMN with different window size c_i on the final error rate (%) of LM-LSTM. A lower error rate indicates better performance. Base: EXN (LM-LSTM) without the IMN, A: c_i = 1, B: c_i = 1,2, C: c_i = 1,2,3, D: c_i = 1,2,3,4
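To make the trend in Figures 5 and 6 easier to compare across datasets, the error rates can be converted into relative error reductions. The following is a minimal sketch under one assumption: the bar labels of each panel are taken to correspond, from left to right, to the settings Base, A, B, C, and D. The function name `relative_reduction` and the dictionary layout are hypothetical.

```python
# Bar labels read from Figures 5 and 6, assumed to be in the
# order Base, A, B, C, D for each panel.
error_rates = {
    "LSTM": {
        "Elec": [10.09, 9.82, 9.39, 9.11, 8.83],
        "IMDB": [10.98, 10.67, 10.58, 10.31, 10.04],
        "Rotten Tomatoes": [26.47, 26.80, 25.81, 25.72, 24.93],
        "RCV1": [14.14, 13.84, 13.23, 12.66, 12.31],
    },
    "LM-LSTM": {
        "Elec": [5.72, 5.64, 5.60, 5.57, 5.48],
        "IMDB": [7.25, 6.91, 6.75, 6.55, 6.51],
        "Rotten Tomatoes": [16.80, 16.76, 16.21, 16.14, 15.91],
        "RCV1": [8.37, 7.71, 7.64, 7.56, 7.53],
    },
}

def relative_reduction(base, value):
    """Relative error reduction (%) of `value` with respect to `base`."""
    return 100.0 * (base - value) / base

# Compare the EXN alone (Base) with the largest IMN setting (D).
for model, panels in error_rates.items():
    for dataset, rates in panels.items():
        base, best = rates[0], rates[-1]
        reduction = relative_reduction(base, best)
        print(f"{model:8s} {dataset:16s} Base={base:5.2f} "
              f"D={best:5.2f} reduction={reduction:.1f}%")
```

Under this reading, setting D has a lower error rate than Base in every panel, while individual steps are not always monotone (for LSTM on Rotten Tomatoes, A is slightly worse than Base).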

List of Publications

Awards

1. Excellence Award, The 24th Annual Meeting of the Association for Natural Language Processing (NLP2018)

International Conferences Papers

1. Shun Kiyono, Jun Suzuki, and Kentaro Inui. 2019. Mixture of Expert/Imitator Network: Scalable Semi-supervised Learning Framework (to appear). In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019). January.

2. Shun Kiyono, Sho Takase, Jun Suzuki, Naoaki Okazaki, Kentaro Inui, and Masaaki Nagata. 2018. Reducing Odd Generation from Neural Headline Generation (to appear). In 32nd Pacific Asia Conference on Language, Information and Computation (PACLIC 32). December.

3. Shun Kiyono, Sho Takase, Jun Suzuki, Naoaki Okazaki, Kentaro Inui, and Masaaki Nagata. 2018. Unsupervised Token-wise Alignment to Improve Interpretation of Encoder-Decoder Models. In Analyzing and Interpreting Neural Networks for NLP (EMNLP 2018 Workshop), pages 74–81. November.

Domestic Conferences Papers

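1. 藤井諒, 清野舜, 鈴木潤, and 乾健太郎. 2019. Selective Use of Contextual Information in Neural Machine Translation (to appear). In Proceedings of the 25th Annual Meeting of the Association for Natural Language Processing. March.

2. 今野颯人, 松林優一郎, 大内啓樹, 清野舜, and 乾健太郎. 2019. Japanese Predicate-Argument Structure Analysis Using Embeddings of the Preceding Context (to appear). In Proceedings of the 25th Annual Meeting of the Association for Natural Language Processing. March.

3. 北山晃太郎, 清野舜, 鈴木潤, and 乾健太郎. 2019. A Comparative Study of Embedding Granularity Toward Constructing a Joint Image-Language Embedding Vector Space (to appear). In Proceedings of the 25th Annual Meeting of the Association for Natural Language Processing. March.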

Shun Kiyono (B7IM2020). Mixture of Expert/Imitator Networks for Large-scale Semi-supervised Learning. Master’s Thesis, Graduate School of Information Sciences, Tohoku University, February 5, 2019 (pages 31-44).
