Figure 6.2: General architecture of an autoencoder (AE) network.

their dimension, respectively. Each $\tilde{\mathbf{x}}_i \in \mathbb{R}^d$ is the reconstructed object of the corresponding input. The AE encodes every input into a different representation, $\mathbf{z}_i = (z_{i1}, \ldots, z_{il}) \in \mathbb{R}^l$, in its encoding part, where $l$ denotes the dimension of the encoded feature. The decoding part then generates the reconstructed input $\tilde{\mathbf{x}}_i$. Let us define the network parameters as $W = (\mathbf{w}_1, \ldots, \mathbf{w}_l)^T$ and $\mathbf{b} = (b_1, \ldots, b_l)$, which are a weight matrix of size $l \times d$ and a bias vector of dimension $l$ for the encoding part. The parameters in the decoding part, $\widetilde{W} = (\tilde{\mathbf{w}}_1, \ldots, \tilde{\mathbf{w}}_d)^T$ and $\tilde{\mathbf{b}} = (\tilde{b}_1, \ldots, \tilde{b}_d)$, are similarly defined. The outputs of the hidden and reconstruction layers can be derived respectively as

$$z_{\cdot j} = a(\mathbf{w}_j^T \mathbf{x} + b_j) \quad (j = 1, \ldots, l), \qquad \tilde{x}_{\cdot j} = a(\tilde{\mathbf{w}}_j^T \mathbf{z} + \tilde{b}_j) \quad (j = 1, \ldots, d),$$

where $\mathbf{w}_j$ and $\tilde{\mathbf{w}}_j$ are the $j$-th rows of the respective weight matrices, $W$ and $\widetilde{W}$; $b_j$ and $\tilde{b}_j$ are the $j$-th elements of the respective bias vectors, $\mathbf{b}$ and $\tilde{\mathbf{b}}$; and $a: \mathbb{R} \to \mathbb{R}$ is a nonlinear activation function.
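To make the notation concrete, here is a minimal NumPy sketch of this forward pass. The sigmoid activation and the random initialization are illustrative choices, not prescribed by the text:

```python
import numpy as np

# Minimal sketch of the forward pass above: W is the l x d encoder
# weight matrix, b its bias vector; W_tilde and b_tilde belong to the
# decoder. Sigmoid activation and random init are illustrative.
rng = np.random.default_rng(0)
d, l = 8, 3                                   # input / encoded dimensions
W, b = rng.normal(size=(l, d)), np.zeros(l)
W_tilde, b_tilde = rng.normal(size=(d, l)), np.zeros(d)

def a(u):
    """Nonlinear activation a: R -> R (here a sigmoid)."""
    return 1.0 / (1.0 + np.exp(-u))

def encode(x):
    # z_j = a(w_j^T x + b_j), j = 1, ..., l
    return a(W @ x + b)

def decode(z):
    # x_tilde_j = a(w_tilde_j^T z + b_tilde_j), j = 1, ..., d
    return a(W_tilde @ z + b_tilde)

x = rng.normal(size=d)
z = encode(x)          # encoded representation in R^l
x_tilde = decode(z)    # reconstruction in R^d
```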

The network parameters can be derived by minimizing the empirical error between original
inputs and their reconstructions. For the sake of simplicity, let us define a combined network
parameter as $\boldsymbol{\theta}$ and the corresponding space as $\boldsymbol{\Theta}$, such that $\boldsymbol{\theta} \in \boldsymbol{\Theta}$. Also, let us denote the combined encoder-decoder function as $f: \mathbb{R}^d \to \mathbb{R}^d$. Then, the optimal network parameters can be derived as

$$\boldsymbol{\theta}^{*} = \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \sum_{i=1}^{n} \left\| \tilde{\mathbf{x}}_i - \mathbf{x}_i \right\|_2^2 = \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \sum_{i=1}^{n} \left\| f(\mathbf{x}_i; \boldsymbol{\theta}) - \mathbf{x}_i \right\|_2^2.$$

The training is done by *back propagation* in the same way as for a standard NN. We could, of
course, adopt a deeper network architecture by concatenating many hidden layers. In this case,
greedy layer-wise training would be applied to efficiently train the network.
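As a sketch of this minimization, the following trains a linear AE (identity activation, biases omitted so each gradient stays a one-liner) by plain gradient descent on the reconstruction error; all sizes and the learning rate are illustrative:

```python
import numpy as np

# Sketch of AE training by gradient descent on the reconstruction
# error sum_i ||f(x_i; theta) - x_i||^2. The activation is the
# identity and biases are omitted to keep the gradients compact; a
# full back-propagation implementation handles the general case.
rng = np.random.default_rng(1)
n, d, l = 50, 8, 3
X = rng.normal(size=(n, d))              # rows are inputs x_i
W = 0.1 * rng.normal(size=(l, d))        # encoder weights
W_tilde = 0.1 * rng.normal(size=(d, l))  # decoder weights
lr = 1e-3

def loss(W, W_tilde):
    R = X @ W.T @ W_tilde.T - X          # residuals f(x_i) - x_i
    return float(np.sum(R ** 2))

initial = loss(W, W_tilde)
for _ in range(500):
    Z = X @ W.T                          # encoded features, n x l
    R = Z @ W_tilde.T - X                # residuals, n x d
    grad_W_tilde = 2 * R.T @ Z           # d(loss)/d(W_tilde)
    grad_W = 2 * W_tilde.T @ R.T @ X     # chain rule through the encoder
    W_tilde -= lr * grad_W_tilde
    W -= lr * grad_W
assert loss(W, W_tilde) < initial        # reconstruction error decreased
```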

In the context of feature learning on remotely sensed data, the simplest approach assumes a
network input of spectral information from each pixel. The AE network is thus trained to
reconstruct inputs pixel by pixel. The new feature representation is then given by the encoded features, i.e., $\mathbf{z} = (z_1, \ldots, z_l) \in \mathbb{R}^l$. Patch-based approaches are often used to learn spatial context, in which case the inputs are given as cuboids or flattened spectral vectors.
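The pixel-wise scheme amounts to a simple reshape of the image cube. The sketch below uses a stand-in encoder (random weights, tanh) with illustrative dimensions:

```python
import numpy as np

# Sketch of pixel-wise feature extraction: each pixel's spectrum is
# one network input, and the hidden-layer outputs become its new
# features. The encoder is a stand-in (random weights, tanh) and the
# dimensions are illustrative.
rng = np.random.default_rng(2)
H, Wd, d, l = 4, 5, 8, 3             # image height/width, bands, code size
image = rng.normal(size=(H, Wd, d))  # spectral cube
W_enc = rng.normal(size=(l, d))      # stand-in trained encoder weights

pixels = image.reshape(-1, d)        # n = H*W spectral vectors
Z = np.tanh(pixels @ W_enc.T)        # encoded features, one row per pixel
feature_map = Z.reshape(H, Wd, l)    # back to image layout for analysis
```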

6.2.2 Spatial Embedding into Autoencoder

Even though patch-based training can capture spatial context to some extent, the encoded features do not always satisfy the spatial property that neighboring pixels have similar representations.

The desired behavior is for the translated feature set of the inputs to hold both spatial context information and similarity with the geometric neighborhood. The following optimization framework for AEs was designed to achieve that goal:

$$\boldsymbol{\theta}^{*} = \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \sum_{i=1}^{n} \sum_{j \in N_i} \left\| \tilde{\mathbf{x}}_i - \mathbf{x}_j \right\|_2^2 = \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \sum_{i,j=1}^{n} B_{ij} \left\| f(\mathbf{x}_i; \boldsymbol{\theta}) - \mathbf{x}_j \right\|_2^2, \tag{6.1}$$

where $N_i$ is the set of indexes of the pixels neighboring pixel $i$ (including $i$ itself), and $B$ is an adjacency matrix. This formulation requires that the reconstructed input of $\mathbf{x}_i$, $\tilde{\mathbf{x}}_i$, should approximate not only its original input $\mathbf{x}_i$ but also its neighbors $\mathbf{x}_j$ ($j \in N_i$). The embedded features of input $\mathbf{x}_i$ should also represent its neighborhood well, meaning that some of its features are shared among its neighbors. As a result, adjacent pixels are expressed by a feature representation resembling their neighborhood in a compressed space. As another effect, we can extract a potential representation by considering reconstruction of neighbors as well as original inputs.
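For a grid image, the adjacency matrix and the loss of Eq. (6.1) can be sketched as follows; a trained AE would supply the reconstructions, so a noisy copy of the inputs stands in for $f(\mathbf{x}_i; \boldsymbol{\theta})$ here, and all sizes are illustrative:

```python
import numpy as np

# Sketch of the loss in Eq. (6.1) on a grid: B_ij = 1 when pixel j
# lies in the 4-neighborhood of pixel i (including i itself). A noisy
# copy of the inputs stands in for the AE reconstructions.
rng = np.random.default_rng(3)
H, Wd, d = 3, 4, 5
n = H * Wd
X = rng.normal(size=(n, d))                  # original pixel spectra x_i
X_rec = X + 0.01 * rng.normal(size=(n, d))   # stand-in reconstructions

B = np.zeros((n, n))
for r in range(H):
    for c in range(Wd):
        i = r * Wd + c
        for dr, dc in [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]:
            rr, cc = r + dr, c + dc
            if 0 <= rr < H and 0 <= cc < Wd:
                B[i, rr * Wd + cc] = 1.0     # j = rr*Wd + cc neighbors i

# sum_{i,j} B_ij * ||x_rec_i - x_j||^2
diffs = X_rec[:, None, :] - X[None, :, :]    # n x n x d pairwise residuals
spatial_loss = float(np.sum(B * np.sum(diffs ** 2, axis=2)))
```

Note that a grid neighborhood makes $B$ symmetric, a property used later when relating this loss to graph embedding.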

Figure 6.3 graphically summarizes the new AE's workflow. In the generalized AE, the spectral intensities of pixels or their flattened patches are included in the network as inputs. They are forward-propagated through the network, and the encoded features reconstruct the inputs at the output layer. The reconstruction error is evaluated in terms of the original input and its neighboring inputs. The error is then back-propagated through the network, and the gradient updates the network parameters. After network training terminates, we extract the encoded features of the samples given as outputs of the hidden layer and use those features as explanatory variables in various kinds of analysis, such as image classification, segmentation, and restoration.

This formulation is not limited to simple AEs: it can be applied even to deep networks and convolutional AEs. Moreover, combining it with patch-based training can yield contextual features shared by neighbors. In this case, the formulation becomes

$$\boldsymbol{\theta}^{*} = \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \sum_{i,j=1}^{n} B_{ij} \left\| F(X_i; \boldsymbol{\theta}) - X_j \right\|_2^2,$$

Figure 6.3: AE workflow on spectral imagery.

where each $X_i$ denotes an input cuboid or vectorized patch, and $F(\cdot)$ is an encoder-decoder function defined on the corresponding space.
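Building the patch inputs $X_i$ is a windowing operation over the image. A sketch, where reflective padding at the borders is one possible choice and all sizes are illustrative:

```python
import numpy as np

# Sketch of building the patch inputs X_i: every pixel contributes
# the p x p x d cuboid centered on it. Reflective border padding is
# one possible choice; sizes are illustrative.
rng = np.random.default_rng(4)
H, Wd, d, p = 4, 4, 3, 3
image = rng.normal(size=(H, Wd, d))
pad = p // 2
padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")

patches = np.stack([
    padded[r:r + p, c:c + p, :]          # cuboid X_i around pixel (r, c)
    for r in range(H) for c in range(Wd)
])                                       # shape: (H*W, p, p, d)
flattened = patches.reshape(H * Wd, -1)  # vectorized patches, (H*W, p*p*d)
```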

Relationships with Other Methodologies

The approach here was inspired by Wang et al. (2014), which termed the formulation a *generalized autoencoder*. The original work sought to address issues in the computer vision field such as manifold learning and image processing. Their AE network explicitly minimized the residuals among neighbors through a framework like Eq. (6.1), with the neighbors defined according to the feature distance in the original input space. That is, the neighborhood of the $i$-th sample was basically determined according to $\|\mathbf{x}_i - \mathbf{x}_j\|_2^2$ ($j = 1, \ldots, n$). For remote sensing images, on the other hand, we can define the neighborhood more simply because pixels or patches naturally maintain a geographical spatial dependency defined by grids.

*Graph embedding* has a close relationship with the generalized AE. It constructs a
low-dimensional manifold of the original input space or a graph by minimizing the following:

$$\sum_{i,j=1}^{n} B_{ij} \left\| \mathbf{z}_i - \mathbf{z}_j \right\|_2^2, \tag{6.2}$$

under a constraint to avoid a trivial solution. We can interpret this as direct control of sample relations described not in the original space but in the embedded space (Yan et al., 2007). In the

generalized AE, assume that $\widetilde{W} = W^T$, the dimension of the embedded feature $l = 1$, and the activation is an identity. Then, the optimal solution $\mathbf{w}^{*}$ of the generalized AE can be derived as

$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \sum_{i,j} B_{ij} \left\| \mathbf{w}\mathbf{w}^T \mathbf{x}_i - \mathbf{x}_j \right\|_2^2. \tag{6.3}$$

Letting $\mathbf{w}^T \mathbf{w} = c$ and $z_i = \mathbf{w}^T \mathbf{x}_i$, the equation becomes

$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}^T \mathbf{w} = c} \sum_{i,j} B_{ij} \left\{ \left\| \mathbf{w}^T \mathbf{x}_i - \mathbf{w}^T \mathbf{x}_j \right\|_2^2 + (c - 2)\, z_i^2 \right\}. \tag{6.4}$$
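To sketch how Eq. (6.4) relates to Eq. (6.3) (assuming, as is natural for grid neighborhoods, that $B$ is symmetric), expand the residual of Eq. (6.3) under the constraint $\mathbf{w}^T \mathbf{w} = c$:

$$\left\| \mathbf{w}\mathbf{w}^T \mathbf{x}_i - \mathbf{x}_j \right\|_2^2 = \left\| \mathbf{w} z_i - \mathbf{x}_j \right\|_2^2 = c\, z_i^2 - 2 z_i z_j + \left\| \mathbf{x}_j \right\|_2^2,$$

while the summand of Eq. (6.4) expands to

$$\left\| \mathbf{w}^T \mathbf{x}_i - \mathbf{w}^T \mathbf{x}_j \right\|_2^2 + (c - 2)\, z_i^2 = (c - 1)\, z_i^2 - 2 z_i z_j + z_j^2.$$

Summed against a symmetric $B$, $\sum_{i,j} B_{ij} z_j^2 = \sum_{i,j} B_{ij} z_i^2$, so the two objectives differ only by $\sum_{i,j} B_{ij} \|\mathbf{x}_j\|_2^2$, which does not depend on $\mathbf{w}$; hence they share the same minimizer.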

When $c = 2$, i.e., the second term inside the sum operation vanishes, the generalized AE behaves
similarly to graph embedding. When $c > 2$, the additional term prevents the encoded
representation from being too large.

Furthermore, we can jointly minimize the trade-off between the reconstruction loss and the residuals of embedded features (see, e.g., Wang et al. (2016); Ma et al. (2016)). The formulation of Eq. (6.1), however, implicitly preserves sample relationships in the input space even in the embedded space. Wang et al. (2014) showed in their experiments with the MNIST handwritten image dataset that the embedded features obtained by the generalized AE constituted a more separable feature space than those obtained by normal AEs. These embedded features made the intra- and inter-class variability among samples smaller and larger, respectively.

The spectral feature pattern of a pixel is generally close to those of its neighbors; that is, it can be regarded as the feature of an adjacent pixel with small noise added. In this sense, we can expect our AE to play a role similar to that of denoising AEs (Vincent et al., 2008; Xing et al., 2016). In denoising AEs, inputs are intentionally corrupted, which makes the trained network robust to distortion of inputs and generates more generalized features.