( 色 ) ( 輝度 ) 1. 視覚特徴の検出と Gaussian pyramid の生成 ( 方位 ) 処理の流れ 2. Pyramid 画像間の差分計算 Feature map の生成 (Center- Surround) 3. 各 Feature map のスケール画像の統合と正規化 Cons

(1)

1254 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 20, NO. 11, NOVEMBER 1998

Short Papers

A Model of Saliency-Based Visual Attention

for Rapid Scene Analysis

Laurent Itti, Christof Koch, and Ernst Niebur

Abstract—A visual attention system, inspired by the behavior and the

neuronal architecture of the early primate visual system, is presented. Multiscale image features are combined into a single topographical saliency map. A dynamical neural network then selects attended locations in order of decreasing saliency. The system breaks down the complex problem of scene understanding by rapidly selecting, in a computationally efficient manner, conspicuous locations to be analyzed in detail.

Index Terms—Visual attention, scene analysis, feature extraction,

target detection, visual search.

———————— !_{————————} 1 INTRODUCTION

PRIMATES have a remarkable ability to interpret complex scenes in

real time, despite the limited speed of the neuronal hardware avail-able for such tasks. Intermediate and higher visual processes appear to select a subset of the available sensory information before further processing [1], most likely to reduce the complexity of scene analysis [2]. This selection appears to be implemented in the form of a spa-tially circumscribed region of the visual field, the so-called “focus of attention,” which scans the scene both in a rapid, bottom-up, sali-ency-driven, and task-independent manner as well as in a slower, top-down, volition-controlled, and task-dependent manner [2].

Models of attention include “dynamic routing” models, in which information from only a small region of the visual field can progress through the cortical visual hierarchy. The attended region is selected through dynamic modifications of cortical connectivity or through the establishment of specific temporal patterns of ac-tivity, under both top-down (task-dependent) and bottom-up (scene-dependent) control [3], [2], [1].

The model used here (Fig. 1) builds on a second biologically-plausible architecture, proposed by Koch and Ullman [4] and at the basis of several models [5], [6]. It is related to the so-called “feature integration theory,” explaining human visual search strategies [7]. Visual input is first decomposed into a set of topo-graphic feature maps. Different spatial locations then compete for saliency within each map, such that only locations which locally stand out from their surround can persist. All feature maps feed, in a purely bottom-up manner, into a master “saliency map,” which topographically codes for local conspicuity over the entire visual scene. In primates, such a map is believed to be located in the posterior parietal cortex [8] as well as in the various visual maps in the pulvinar nuclei of the thalamus [9]. The model’s saliency map is endowed with internal dynamics which generate attentional shifts. This model consequently represents a complete account of

bottom-up saliency and does not require any top-down guidance to shift attention. This framework provides a massively parallel method for the fast selection of a small number of interesting im-age locations to be analyzed by more complex and time-consuming object-recognition processes. Extending this approach in “guided-search,” feedback from higher cortical areas (e.g., knowledge about targets to be found) was used to weight the im-portance of different features [10], such that only those with high weights could reach higher processing levels.

2 MODEL

Input is provided in the form of static color images, usually digit-ized at 640 ¥ 480 resolution. Nine spatial scales are created using dyadic Gaussian pyramids [11], which progressively low-pass filter and subsample the input image, yielding horizontal and ver-tical image-reduction factors ranging from 1:1 (scale zero) to 1:256 (scale eight) in eight octaves.

Each feature is computed by a set of linear “center-surround” operations akin to visual receptive fields (Fig. 1): Typical visual neurons are most sensitive in a small region of the visual space (the center), while stimuli presented in a broader, weaker antago-nistic region concentric with the center (the surround) inhibit the neuronal response. Such an architecture, sensitive to local spatial discontinuities, is particularly well-suited to detecting locations which stand out from their surround and is a general computa-tional principle in the retina, lateral geniculate nucleus, and pri-mary visual cortex [12]. Center-surround is implemented in the model as the difference between fine and coarse scales: The center is a pixel at scale c Œ {2, 3, 4}, and the surround is the corresponding pixel at scale s = c + d, with d Œ {3, 4}. The across-scale difference between two maps, denoted “!” below, is obtained by interpolation to the finer scale and point-by-point subtraction. Using several scales not only for c but also for d = s - c yields truly multiscale feature extraction, by including different size ratios between the center and surround regions (contrary to previously used fixed ratios [5]).

2.1 Extraction of Early Visual Features

With r, g, and b being the red, green, and blue channels of the in-put image, an intensity image I is obtained as I = (r + g + b)/3. I is

•!L. Itti and C. Koch are with the Computation and Neural Systems Pro-gram, California Institute of Technology—139-74, Pasadena, CA 91125.

!

E-mail: {itti, koch}@klab.caltech.edu.

•!E. Niebur is with the Johns Hopkins University, Krieger Mind/Brain Insti-tute, Baltimore, MD 21218. E-mail: [email protected].

Manuscript received 5 Feb. 1997; revised 10 Aug. 1998. Recommended for accep-tance by D. Geiger.

For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 107349.

(2)

1254

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 20, NO. 11, NOVEMBER 1998

Short Papers

A Model of Saliency-Based Visual Attention

for Rapid Scene Analysis

Laurent Itti, Christof Koch, and Ernst Niebur

Abstract—A visual attention system, inspired by the behavior and the

neuronal architecture of the early primate visual system, is presented.

Multiscale image features are combined into a single topographical

saliency map. A dynamical neural network then selects attended

locations in order of decreasing saliency. The system breaks down the

complex problem of scene understanding by rapidly selecting, in a

computationally efficient manner, conspicuous locations to be analyzed

in detail.

Index Terms—Visual attention, scene analysis, feature extraction,

target detection, visual search.

————————

!

_{————————}

1 I

NTRODUCTION

P

RIMATES

have a remarkable ability to interpret complex scenes in

real time, despite the limited speed of the neuronal hardware

avail-able for such tasks. Intermediate and higher visual processes appear

to select a subset of the available sensory information before further

processing [1], most likely to reduce the complexity of scene analysis

[2]. This selection appears to be implemented in the form of a

spa-tially circumscribed region of the visual field, the so-called “focus of

attention,” which scans the scene both in a rapid, bottom-up,

sali-ency-driven, and task-independent manner as well as in a slower,

top-down, volition-controlled, and task-dependent manner [2].

Models of attention include “dynamic routing” models, in

which information from only a small region of the visual field can

progress through the cortical visual hierarchy. The attended region

is selected through dynamic modifications of cortical connectivity

or through the establishment of specific temporal patterns of

ac-tivity, under both top-down (task-dependent) and bottom-up

(scene-dependent) control [3], [2], [1].

The model used here (Fig. 1) builds on a second

biologically-plausible architecture, proposed by Koch and Ullman [4] and at

the basis of several models [5], [6]. It is related to the so-called

“feature integration theory,” explaining human visual search

strategies [7]. Visual input is first decomposed into a set of

topo-graphic feature maps. Different spatial locations then compete for

saliency within each map, such that only locations which locally

stand out from their surround can persist. All feature maps feed, in

a purely bottom-up manner, into a master “saliency map,” which

topographically codes for local conspicuity over the entire visual

scene. In primates, such a map is believed to be located in the

posterior parietal cortex [8] as well as in the various visual maps in

the pulvinar nuclei of the thalamus [9]. The model’s saliency map

is endowed with internal dynamics which generate attentional

shifts. This model consequently represents a complete account of

bottom-up saliency and does not require any top-down guidance

to shift attention. This framework provides a massively parallel

method for the fast selection of a small number of interesting

im-age locations to be analyzed by more complex and

time-consuming object-recognition processes. Extending this approach

in “guided-search,” feedback from higher cortical areas (e.g.,

knowledge about targets to be found) was used to weight the

im-portance of different features [10], such that only those with high

weights could reach higher processing levels.

2 M

ODEL

Input is provided in the form of static color images, usually

digit-ized at 640

¥ 480 resolution. Nine spatial scales are created using

dyadic Gaussian pyramids [11], which progressively low-pass

filter and subsample the input image, yielding horizontal and

ver-tical image-reduction factors ranging from 1:1 (scale zero) to 1:256

(scale eight) in eight octaves.

Each feature is computed by a set of linear “center-surround”

operations akin to visual receptive fields (Fig. 1): Typical visual

neurons are most sensitive in a small region of the visual space

(the center), while stimuli presented in a broader, weaker

antago-nistic region concentric with the center (the surround) inhibit the

neuronal response. Such an architecture, sensitive to local spatial

discontinuities, is particularly well-suited to detecting locations

which stand out from their surround and is a general

computa-tional principle in the retina, lateral geniculate nucleus, and

pri-mary visual cortex [12]. Center-surround is implemented in the

model as the difference between fine and coarse scales: The center

is a pixel at scale c Œ {2, 3, 4}, and the surround is the corresponding

pixel at scale s = c +

d, with d Œ {3, 4}. The across-scale difference

between two maps, denoted “!” below, is obtained by interpolation

to the finer scale and point-by-point subtraction. Using several scales

not only for c but also for

d = s - c yields truly multiscale feature

extraction, by including different size ratios between the center and

surround regions (contrary to previously used fixed ratios [5]).

2.1 Extraction of Early Visual Features

With r, g, and b being the red, green, and blue channels of the

in-put image, an intensity image I is obtained as I = (r + g + b)/3. I is

0162-8828/98/$10.00 © 1998 IEEE

!!!!!!!!!!!!!!!!

•!

L. Itti and C. Koch are with the Computation and Neural Systems

Pro-gram, California Institute of Technology—139-74, Pasadena, CA 91125.

!

E-mail:

{itti, koch}@klab.caltech.edu

.

•!

E. Niebur is with the Johns Hopkins University, Krieger Mind/Brain

Insti-tute, Baltimore, MD 21218. E-mail: [email protected].

Manuscript received 5 Feb. 1997; revised 10 Aug. 1998. Recommended for

accep-tance by D. Geiger.

For information on obtaining reprints of this article, please send e-mail to:

[email protected], and reference IEEECS Log Number 107349.

Fig. 1. General architecture of the model.

処理の流れ

1.  視覚特徴の検出と

Gaussian pyramidの生成

2. Pyramid 画像間の

差分計算・Feature

mapの生成

(Center-‐Surround)

3. 各Feature mapの

スケール画像の

統合と正規化・

Conspicuity map

の生成

(レアな特徴の強調)

（色）

（輝度）

（方位）

4. Conspicuity mapを

統合・

Saliency mapの生成

(3)

Short Papers

A Model of Saliency-Based Visual Attention

for Rapid Scene Analysis

Abstract—A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented. Multiscale image features are combined into a single topographical saliency map. A dynamical neural network then selects attended locations in order of decreasing saliency. The system breaks down the complex problem of scene understanding by rapidly selecting, in a computationally efficient manner, conspicuous locations to be analyzed in detail.

Index Terms—Visual attention, scene analysis, feature extraction, target detection, visual search.

———————— !_{————————}

1 I

NTRODUCTION

PRIMATES have a remarkable ability to interpret complex scenes in real time, despite the limited speed of the neuronal hardware avail-able for such tasks. Intermediate and higher visual processes appear to select a subset of the available sensory information before further processing [1], most likely to reduce the complexity of scene analysis [2]. This selection appears to be implemented in the form of a spa-tially circumscribed region of the visual field, the so-called “focus of attention,” which scans the scene both in a rapid, bottom-up, sali-ency-driven, and task-independent manner as well as in a slower, top-down, volition-controlled, and task-dependent manner [2].

2 M

ODEL

Input is provided in the form of static color images, usually

digit-ized at 640 ¥ 480 resolution. Nine spatial scales are created using

dyadic Gaussian pyramids [11], which progressively low-pass filter and subsample the input image, yielding horizontal and ver-tical image-reduction factors ranging from 1:1 (scale zero) to 1:256 (scale eight) in eight octaves.

!

•!E. Niebur is with the Johns Hopkins University, Krieger Mind/Brain

Insti-tute, Baltimore, MD 21218. E-mail: [email protected].

Fig. 1. General architecture of the model.

1. 視覚特徴の検出とGaussian pyramidの生成

刺激画像を、

1/1, ½., ¼, 1/8,

1/16, 1/32, 1/64, 1/128, 1/256

のスケールに変更

(pyramid)

Pyramid

Gaussianフィルタを画像に掛ける。

(hFp://squeak.qp.land.to:8080/wiki/index.php?Squeak

%2FSampleCode%2FTinyImageProcessing%2FGaussianより)

Gaussianフィルタによる画像処理→

輪郭や境界がぼかされる

(4)

Short Papers

A Model of Saliency-Based Visual Attention

for Rapid Scene Analysis

———————— !_{————————}

1 I

NTRODUCTION

2 M

ODEL

!

1. 視覚特徴の検出とGaussian pyramidの生成

Pyramid

また、この段階で入力画像を画像

特徴ごとに分離する。このモデル

では、

色

(Color)、輝度(intensity)、方位

(orientaSon、線分の傾き)

に分離。これらは実際に大脳視覚

野でそれぞれ別に検出されている。

Color :

赤

(R), 緑(G),青(B),

黄

(Y)

Intensity :

(R+G+B)/3

OrientaSon :

0°, 45°,90°,135°など

(5)

Short Papers

A Model of Saliency-Based Visual Attention

for Rapid Scene Analysis

———————— !_{————————}

1 I

NTRODUCTION

2 M

ODEL

!

2. Pyramid 画像間の差分計算・Feature mapの生成(Center-‐Surround)

網膜

や

初期視覚野

の情報処理

をモデル化。ここでは、同じ特

徴の

Pyramid画像間の差分を取

る。

I(c, s) = I(c)(−)I(s)

例

)輝度(Intensity)の場合

ここで、

c と s はPyramid画像の

スケールを意味する。

c ∈ {2, 3, 4}

s = c +

δ

,

δ

∈ {3, 4}

上式の

(ー)はスケールの異なるpyramid画像間の差を取る事を意味

する。

(6)

2. Pyramid 画像間の差分計算・Feature mapの生成(Center-‐Surround)

スケールの異なる

L. Itti, C. Koch/Vision Research

pyramid画像間の差を取る事の意味は？

40 (2000) 1489–1506

1494

Fig. 2. (a) Gaussian pixel widths for the nine scales used in the model. Scale !=0 corresponds to the original image, and each subsequent scale is coarser by a factor 2. Two examples of the six center-surround receptive field types are shown, for scale pairs 2–5 and 4–8. (b) Illustration of the spatial competition for salience implemented within each of the 42 feature maps. Each map receives input from the linear filtering and center-surround stages. At each step of the process, the convolution of the map by a large Difference-of-Gaussians (DoG) kernel is added to the current contents of the map. This additional input coarsely models short-range excitatory processes and long-range inhibitory interactions between neighboring visual locations. The map is half-wave rectified, such that negative values are eliminated, hence making the iterative process non-linear. Ten iterations of the process are carried out before the output of each feature map is used in building the saliency map.

Each feature map is subjected to ten iterations of the

process described in Eq. (2). The choice of the number

of iterations is somewhat arbitrary: In the limit of an

infinite number of iterations, any non-empty map will

converge towards a single peak (except for a few

unre-alistic, singular configurations), hence constituting only

a poor representation of the scene. With few iterations

however, spatial competition is weak and inefficient.

Two examples of the time evolution of this process are

shown in Fig. 3, and illustrate that using on the order

of ten iterations yields adequate distinction between the

two example images shown. As expected, feature

maps with initially numerous peaks of similar

ampli-tude are suppressed by the interactions, while maps

with one or a few initially stronger peaks become

enhanced. It is interesting to note that this

within-fea-ture spatial competition scheme resembles a

‘winner-take-all’ network with localized inhibitory spread,

which allows for a sparse distribution of winners across

the visual scene (see Horiuchi, Morris, Koch &

De-Weerth, 1997 for a 1-D real-time implementation in

Analog-VLSI).

After normalization, the feature maps for intensity,

color, and orientation are summed across scales into

three separate ‘conspicuity maps’, one for intensity, one

for color and one for orientation (Fig. 1b). Each

con-spicuity map is then subjected to another ten iterations

of Eq. (2). The motivation for the creation of three

separate channels and their individual normalization is

the hypothesis that similar features compete strongly

スケール0(源画に対する Gaussianフィルタのサイズスケール_{1
(1/2)} スケール2 (1/4) スケール3 (1/8) スケール4 (1/16) スケール5 (1/32) スケール6 (1/64) スケール7 (1/128) スケール_{8
(1/256)}

同じ

Gaussianフィルタを使用しても、

画像の大きさによって影響が与えら

れる範囲が大きく異なる。

大きいサイズの画像

(スケール0や1)

に適用すると、フィルタリング効果は

局所的。

小さいサイズの画像

(スケール6な

ど

)には、その効果が広い範囲に及

ぶ。

(7)

2. Pyramid 画像間の差分計算・Feature mapの生成(Center-‐Surround)

スケールの異なる

L. Itti, C. Koch/Vision Research

pyramid画像間の差を取る事の意味は？

40 (2000) 1489–1506

1494

Each feature map is subjected to ten iterations of the

process described in Eq. (2). The choice of the number

of iterations is somewhat arbitrary: In the limit of an

infinite number of iterations, any non-empty map will

converge towards a single peak (except for a few

unre-alistic, singular configurations), hence constituting only

a poor representation of the scene. With few iterations

however, spatial competition is weak and inefficient.

Two examples of the time evolution of this process are

shown in Fig. 3, and illustrate that using on the order

of ten iterations yields adequate distinction between the

two example images shown. As expected, feature

maps with initially numerous peaks of similar

ampli-tude are suppressed by the interactions, while maps

with one or a few initially stronger peaks become

enhanced. It is interesting to note that this

within-fea-ture spatial competition scheme resembles a

‘winner-take-all’ network with localized inhibitory spread,

which allows for a sparse distribution of winners across

the visual scene (see Horiuchi, Morris, Koch &

De-Weerth, 1997 for a 1-D real-time implementation in

Analog-VLSI).

After normalization, the feature maps for intensity,

color, and orientation are summed across scales into

three separate ‘conspicuity maps’, one for intensity, one

for color and one for orientation (Fig. 1b). Each

con-spicuity map is then subjected to another ten iterations

of Eq. (2). The motivation for the creation of three

separate channels and their individual normalization is

the hypothesis that similar features compete strongly

スケール0(源画に対する Gaussianフィルタのサイズスケール_{1
(1/2)} スケール2 (1/4) スケール3 (1/8) スケール4 (1/16) スケール5 (1/32) スケール6 (1/64) スケール7 (1/128) スケール_{8
(1/256)}

(I\ and Koch, Vision Research (2000)より)

L. Itti, C. Koch / Vision Research 40 (2000) 1489–1506 1494

Each feature map is subjected to ten iterations of the

process described in Eq. (2). The choice of the number

of iterations is somewhat arbitrary: In the limit of an

infinite number of iterations, any non-empty map will

converge towards a single peak (except for a few

unre-alistic, singular configurations), hence constituting only

a poor representation of the scene. With few iterations

however, spatial competition is weak and inefficient.

Two examples of the time evolution of this process are

shown in Fig. 3, and illustrate that using on the order

of ten iterations yields adequate distinction between the

two example images shown. As expected, feature

maps with initially numerous peaks of similar

ampli-tude are suppressed by the interactions, while maps

with one or a few initially stronger peaks become

enhanced. It is interesting to note that this

within-fea-ture spatial competition scheme resembles a

‘winner-take-all’ network with localized inhibitory spread,

which allows for a sparse distribution of winners across

the visual scene (see Horiuchi, Morris, Koch &

De-Weerth, 1997 for a 1-D real-time implementation in

Analog-VLSI).

After normalization, the feature maps for intensity,

color, and orientation are summed across scales into

three separate ‘conspicuity maps’, one for intensity, one

for color and one for orientation (Fig. 1b). Each

con-spicuity map is then subjected to another ten iterations

of Eq. (2). The motivation for the creation of three

separate channels and their individual normalization is

the hypothesis that similar features compete strongly

L. Itti, C. Koch/Vision Research 40 (2000) 1489–1506

1494

Each feature map is subjected to ten iterations of the process described in Eq. (2). The choice of the number of iterations is somewhat arbitrary: In the limit of an infinite number of iterations, any non-empty map will converge towards a single peak (except for a few unre-alistic, singular configurations), hence constituting only a poor representation of the scene. With few iterations however, spatial competition is weak and inefficient. Two examples of the time evolution of this process are shown in Fig. 3, and illustrate that using on the order of ten iterations yields adequate distinction between the two example images shown. As expected, feature maps with initially numerous peaks of similar ampli-tude are suppressed by the interactions, while maps with one or a few initially stronger peaks become

enhanced. It is interesting to note that this within-fea-ture spatial competition scheme resembles a ‘winner-take-all’ network with localized inhibitory spread, which allows for a sparse distribution of winners across the visual scene (see Horiuchi, Morris, Koch & De-Weerth, 1997 for a 1-D real-time implementation in Analog-VLSI).

After normalization, the feature maps for intensity, color, and orientation are summed across scales into three separate ‘conspicuity maps’, one for intensity, one for color and one for orientation (Fig. 1b). Each con-spicuity map is then subjected to another ten iterations of Eq. (2). The motivation for the creation of three separate channels and their individual normalization is the hypothesis that similar features compete strongly

スケール2(-‐)スケール5

スケール4(-‐)スケール8

スケールの異なる

Pyramid画像間の差を取る事で、若干語弊はある

が「フィルタの中心を強調し、周囲から抑制される」状況となる。

この処理を

Diﬀerence of Gaussian (DOG)と呼ぶ。また、この処理は

ラ

プラシアンフィルタ

に良く似ている。

(8)

2. Pyramid 画像間の差分計算・Feature mapの生成(Center-‐Surround)

スケールの異なる

pyramid画像間の差を取る事の意味は？

◯ラプラシアンフィルタによる画像処理

hFp://hir.skr.jp/mt/cat10/cat11/より

つまり、

Pyramid画像間の差分を取る事で、画像の

輪郭

や

コントラスト

の検出処理を行っている。

この輪郭・コントラスト検出機能は、網膜細胞の反応特性と良く似て

いる。

(9)

ちょっと脱線

ラプラシアンフィルタを使う事で、

Hermann Grid錯視を説明できる。

物理的に描かれていない黒い点

らしき物が、縦線と横線の交差点

に見える。

ラプラシアンフィルタを使うと、より多くの抑

制効果が交差点にある事が分かる。

(hFp://www.michaelbach.de/ot/lum_herGrid/より)

(10)

Short Papers

A Model of Saliency-Based Visual Attention

for Rapid Scene Analysis

———————— !_{————————}

1 I

NTRODUCTION

2 M

ODEL

!

2. Pyramid 画像間の差分計算・Feature mapの生成(Center-‐Surround)

I(c, s) = I(c)(−)I(s)

ここで、

c と s はPyramid画像の

スケールを意味する。

c ∈ {2, 3, 4}

s = c +

δ

,

δ

∈ {3, 4}

Pyramid画像の組み合わせは、

6通り。

Intensityに対しては6 maps

Colorに対してはred-‐green, blue-‐yellowの2種類とPyramidの組み合

わせで

12通り。

OrientaSonに対しては0°, 45°, 90°, 135°の4種類とPyramidの組み合

わせで

24通り。

(11)

Short Papers

A Model of Saliency-Based Visual Attention

for Rapid Scene Analysis

———————— !_{————————}

1 I

NTRODUCTION

2 M

ODEL

!

3. 各Feature mapのスケール画像の統合と正規化・Conspicuity map

の生成(レアな特徴の強調)

各特徴マップに対して正規化処

理を行い、スケールの異なる

Pyramid画像を統合。

(Conspicuity mapの生成)

1. マップ上の値が固定範囲[0…M]とな

るように正規化する。

2. 最大値Mを取る位置を検出。それ

以外の局所的な最大値の平均値

_mを

計算する。

3. 全ての値に(M-‐m)

2

_{を掛ける。}

モデルの正規処理

(12)

3. 各Feature mapのスケール画像の統合と正規化・Conspicuity map

の生成(レアな特徴の強調)

モデルの正規化処理の意味は？

◯特定の特徴マップ上の

1点だけが大きな値を持つとき：

正規化により、その

1点だけが大きな値とな

り、その他は０に近い値となる。

◯最大値に近い値が各特徴マップに分散しているとき：

極大値の平均値

m は M に非常に近くなり、

最大値を含む全てのピクセルの値は非常

に小さな値となる。

L. Itti, C. Koch/Vision Research40 (2000) 1489–1506 1495

for salience, while different modalities contribute inde-pendently to the saliency map. Although we are not aware of any supporting experimental evidence for this hypothesis, this additional step has the computational advantage of further enforcing that only a spatially sparse distribution of strong activity peaks is present within each visual feature type, before combination of all three types into the scalar saliency map.

2.3. The saliency map

After the within-feature competitive process has taken place in each conspicuity map, these maps are linearly summed into the unique saliency map, which resides at scale 4 (reduction factor 1:16 compared to the original image). At any given time, the maximum of the saliency map corresponds to the most salient stimulus

Fig. 3. (a) Iterative spatial competition for salience in a single feature map with one strongly activated location surrounded by several weaker ones. After a few iterations, the initial maximum has gained further strength while at the same time suppressing weaker activation regions. (b) Iterative spatial competition for salience in a single feature map containing numerous strongly activated locations. All peaks inhibit each other more-or-less equally, resulting in the entire map being suppressed.

2.3. The saliency map

L. Itti, C. Koch/Vision Research40 (2000) 1489–1506 1495

2.3. The saliency map