• 検索結果がありません。

4 1 次元曲線のあてはめ

4.2 実データ

次に,実データセット上での実験結果を示す.実デー タセットには Reuters-215781, RCV1 [9], news20を用 いた.

Reuters-21578に対しては,部分データセット mod-ified Apte (“ModApte”) を用いた.このデータセット は,トピックによってラベルづけされた 10170個の事 例からなる.我々は, “acq” という主要なトピックを 正例,他のトピックを負例とし,2値分類問題を構成し た.仮説として,30,839の単語に対応した決定株を用 意した.ここで,各決定株は1つの単語に対応し,与え られた文書にその単語が含まれていれば+1,そうでな ければ 1 を返す.

RCV1データセット,news20には,LIBSVM tools [4]

で提供されているものを用いた.データ数と仮説の総数 はそれぞれ,m = 20,242, n = 47,236, m = 19,996, n= 1,355,193である.

各データセットに対して,さらに定数項に対応する仮 説 −1 を用いる.LPBoost and Sparse LPBoostのパ ラメータについては,ε= 104 とし,Reuters-21578,

RCV1, news20データセットそれぞれに対してν= 0.2m とした.

実験結果を表2にまとめる.Sparse LPBoostは他の 手法よりも数倍以上高速に動作していることがわかる.

人工データでの結果と同様,Sparse LPBoostはより少 ない事例(約0.6mから0.8m程度)で最適解を近似的 に求めている.また,Sparse LPBoostは仮説間の疎性 も利用しているように見える.実際,両データセットに おいて,Sparse LPBoost は30,000以上の仮説の中か ら数パーセントの仮説のみを用いている.

5 結論

本稿では,与えられた任意の ε >0に対して,1 ソ フトマージン最適化問題のε近似を求める手法を提案し た.本手法は最適解における仮説間と事例間の疎性を利 用することにより,標準的な線形計画ソルバやLPBoost よりも高速に動作する.

今後の課題として,計算時間の理論的保証が得られる

ようにSparse LPBoost を改良することが挙げられる.

また,実用的な観点では,仮説と事例を選択するための より良いヒューリスティクスの開発も重要である.

1http://www.daviddlewis.com/resources/testcollections/reuters21578.

ν= 0 (ノイズなし) ν= 0.2m(ノイズあり)

m 手法 計算時間(秒) #(di>0)(%) #(αj>0)(%) 計算時間(秒) #(di>0)(%) #(αj >0) (%)

104 LP 5.99 0.25 18.8 2.8 20.0 9.9

LPB 11.64 0.19 17.8(69.3) 3.16 20.0 9.9(9.9)

SLPB 6.15 0.25(23.0) 25(98) 2.99 20.0(47) 9.9(10.9)

105 LP 82.4 0.1 53.5 33.78 20 9.9

LPB 120 0.02 15.8(61.4) 43.9 20 9.9(10.9)

SLPB 63.9 0.03(18.9) 20.8(80) 53.1 20(58.7) 10.9(59.4)

106 LP メモリ不足 n/a n/a メモリ不足 n/a n/a LPB メモリ不足  n/a n/a メモリ不足 n/a n/a

SLPB 684 0.002(25.2) 15.8(71.3) 1679 20(46.5) 90(91)

表1: ノイズなし(ν= 1),ノイズあり(ν = 0.2m)データに対する実験結果.#(di>0),および#(αj >0)はそ れぞれ,d,αにおける非ゼロ要素数の割合を表す.LPBoost (LPB)とSparse LPBoost (SLPB)に対しては,選択 された仮説や事例の個数の割合をかっこ内に示す.

データ 手法 計算時間(秒) #(di>0)(%) #(αj>0)(%)

Reuters-21578 LP 21.0 22.1 1.55

(m=10,170,n=30,839) LPB 52.5 22.1 1.53(1.67)

SLPB 8.5 22.2(68) 1.52(1.96)

RCV1 LP 2135 25.4 5.1

(m=20,242,n=47,237) LPB 4154 24.3 3.7(4.0)

SLPB 690 24.6(63.4) 3.9(4.6)

news20 LP 121525 25.4 0.16

(m=19,996,n=1,355,193) LPB 7251 22.7 0.080(0.090)

SLPB 1223 23.1(68.1) 0.088(0.117)

表2: 実データセットに対する実験結果.#(di>0),および#(αj >0) はそれぞれ,d, αにおける非ゼロ要素数 の割合を表す.LPBoost (LPB)と Sparse LPBoost (SLPB)に対しては,選択された仮説や事例の個数をかっこ内 に示す.

謝辞

本研究は科研費若手研究(B) 21700171の援助によっ てなされた.また,本稿に有益なコメントを下さった豊 橋技術科学大学の山本悠二氏に感謝します.

参考文献

[1] N. Balcan, A. Blum, and N. Srebro. A theory of learning with similarity functions.Machine Learn-ing, 72(1-2):89–112, 2008.

[2] P. S. Bradley and O. L.Mangasarian. Massive data discrimination via linear support vector machines.

Optimization Methods and Software, 13(1):1–10, 2000.

[3] Y. Censor and S. A. Zenios. Parallel Optimiza-tion: Theory, Algorithms, and Applications. Ox-ford University Press, 1998.

[4] C. C. Chang and C. J. Lin. Libsvm: a li-brary for support vector machines. Software avail-able at http://www.csie.ntu.edu.tw/~cjlin/

libsvm, 2001.

[5] A. Demiriz, K. P. Bennett, and J. Shawe-Taylor.

Linear programming boosting via column genera-tion. Machine Learning, 46(1-3):225–254, 2002.

[6] T. Graepel, R. Herbrich, B. Sch¨olkopf, A. Smola, P. Bartlett, K. M¨uller, K. Obermayer, and R. Williamson. Classification on proximity data

with LP-machines. In International Conference on Artificial Neural Networks, pages 304–309, 1999.

[7] A. J. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned en-sembles. In Proceedings of the fifteenth National Conference on Artificial Intelligence (AAAI-98), pages 692–698, 1998.

[8] M. Hein, O. Bousquet, and B. Sch¨olkopf. Maximal margin classification for metric spaces.Journal of Computer ans System Sciences, 71:333–359, 2005.

[9] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li.

Rcv1: A new benchmark collection for text cate-gorization research.Journal of Machine Learning Research, 5:361–397, 2004.

[10] O. Mangasarian. Exact 1-norm support vector machines via unconstrained convex differentiable minimization. Journal of Machine Learning Re-search, 7:1517–1530, 2006.

[11] S. Nash and A. Sofer. Linear and Nonlinear Pro-gramming. Macgraw-Hill, 1996.

[12] R. E. Schapire, Y. Freund, P. Bartlett, and W. S.

Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.

[13] S. Sra. Efficient large scale linear progamming support vector machines. In Machine Learning:

ECML 2006, pages 767–774, 2006.

[14] L. Wang, M. Sugiyama, C. Yang, K. Hatano, and J. Fung. Theory and algorithms for learning with dissimilarity functions.Neural Computation, 21(5):1459–1484, 2009.

[15] M. Warmuth, K. Glocer, and G. R¨atsch. Boosting algorithms for maximizing the soft margin. In Ad-vances in Neural Information Processing Systems 20, pages 1585–1592, 2008.

[16] M. Warmuth, K. Glocer, and S. V. N. Vish-wanathan. Entropy regularized LPBoost. In Proceedings of the 19th International Conference on Algorithmic Learning Theory, pages 256–271, 2008.

情報論的学習理論テクニカルレポート2009

Technical Report on Information-Based Induction Sciences 2009 (IBIS2009)

Multiple Kernel Learning for Object Classification

Shinichi Nakajima

Alexander Binder

Christina M¨uller

Wojciech Wojcikiewicz

§

Marius Kloft

Ulf Brefeld

k

Klaus-Robert M¨uller

∗∗

Motoaki Kawanabe

††

Abstract: Combining information from various image descriptors has become a standard technique for image classification tasks. Multiple kernel learning (MKL) approaches allow to determine the optimal combination of such similarity matrices and the optimal classifier simultaneously. Most MKL approaches employ an`1-regularization on the mixing coefficients to promote sparse solu-tions; an assumption that is often violated in image applications where descriptors hardly encode orthogonal pieces of information. In this paper, we compare`1-MKL with a recently developed non-sparse MKL in object classification tasks. We show that the non-sparse MKL outperforms both the standard MKL and SVMs with average kernel mixtures on the PASCAL VOC data sets.

Keywords: multiple kernel learning, support vector machine, image classification, sparsity.

1 Introduction

Data fusion is an important topic in computer vision. Im-ages can be represented by a multiplicity of features cap-turing certain aspects, including color, textures, and shapes.

Unfortunately, the importance of different types of features varies with the tasks; color information, for instance, sub-stantially increases the detection of stop signs while color-ing is almost irrelevant for findcolor-ing cars in images. Tech-niques for appropriately combining relevant features for a task at hand are therefore crucial for state-of-the-art object recognition systems.

From a machine learning view, different representations give rise to different kernel functions. Kernels define (pos-sibly nonlinear) similarities between data points and allow to abstract learning algorithms from data. Thus, kernel ma-chines have been successfully applied to many practical problems in various fields [19]. Given a task at hand, de-signing an appropriate kernel is essential for achieving good generalizations, for instance by incorporating prior assump-tions and domain knowledge [9, 28]. However, in the ab-sence of prior knowledge one has to resort to alternatives.

For object recognition tasks, combining information from

Nikon Corporation, [email protected],

Fraunhofer Institute FIRST, [email protected],

Technische Universit¨at Berlin, [email protected],

§Technische Universit¨at Berlin, [email protected],

Technische Universit¨at Berlin, [email protected],

kTechnische Universit¨at Berlin, [email protected],

∗∗Technische Universit¨at Berlin, [email protected],

††Fraunhofer Institute FIRST, [email protected],

various image descriptors into several kernelsK1, . . . , Km

has become a standard technique. Unfortunately, the choice of the right kernel mixture is often a matter of trial and er-ror. As a remedy, uniform mixtures of normalized kernels [14, 26] or brute-force approaches [2] are employed fre-quently. However, the former approach may lead to sub-optimal kernels and the latter is computationally infeasible if many kernels are to be combined.

Recently, multiple kernel learning (MKL) [13, 1, 20, 18, 27] was applied to object classification tasks involving vari-ous image descriptors [24]. Compared to uniform mixtures and brute-force approaches, MKL has the appealing prop-erty of always finding the optimal kernel combination and converges quickly as it can be wrapped around a regular support vector machine (SVM) [20]. MKL aims at learning the optimal kernel mixture and the model parameters simul-taneously. More specifically, MKL approaches find a linear mixture of the kernels, that isK =P

jβjKj. To support the interpretability of the solution, many MKL approaches promote sparse mixtures by incorporating an`1-norm con-straint on the mixing coefficients. However, it has often been observed that`1-norm MKL is outperformed by the average-sum kernel K = P

jKj. An explanation is that enforcing sparse mixtures may lead to degenerate models if the optimal kernel mixture is non-sparse. A remedy might be recently developed non-sparse variants of MKL promot-ing non-sparse kernel mixtures [10].

In this contribution, we empirically compare sparse and

non-sparse MKL approaches to object classification tasks.

We employ candidate kernels obtained from many different image descriptors including the 30 color SIFT features by the VOC2008 winner [22]. Our empirical results on im-age data sets from the PASCAL visual object classification (VOC) challenge 2007 and 2008 [8] show that the non-sparse MKL significantly outperforms the uniform mixture and`1-norm MKL.

This paper is organized as follows. In Section 2, we briefly review the underlying techniques, including sparse and non-sparse MKL. Section 3 discusses similarities be-tween the prepared kernels. Based on this analysis, we pre-compute averages of similar kernels and apply MKL with a substantially reduced sets of kernels. We discuss our em-pirical results in Section 4 and Section 5 concludes.