Figure 6.2: General architecture of an autoencoder (AE) network.
their dimension, respectively. Each $\tilde{x}_i \in \mathbb{R}^d$ is the reconstructed object of the corresponding input. The AE encodes every input into a different representation, $z_i = (z_{i1}, \ldots, z_{il}) \in \mathbb{R}^l$, in its encoding part, where $l$ denotes the dimension of the encoded feature. The decoding part then generates the reconstructed input $\tilde{x}_i$. Let us define the network parameters as $W = (w_1, \ldots, w_l)^T$ and $b = (b_1, \ldots, b_l)$, which are a weight matrix of size $l \times d$ and a bias vector of dimension $l$ for the encoding part. The parameters in the decoding part, $\tilde{W} = (\tilde{w}_1, \ldots, \tilde{w}_d)^T$ and $\tilde{b} = (\tilde{b}_1, \ldots, \tilde{b}_d)$, are similarly defined. The outputs of the hidden and reconstruction layers can be derived respectively as
\[
z_j = a(w_j^T x + b_j) \quad (j = 1, \ldots, l),
\]
\[
\tilde{x}_j = a(\tilde{w}_j^T z + \tilde{b}_j) \quad (j = 1, \ldots, d),
\]
where $w_j$ and $\tilde{w}_j$ are the $j$-th rows of the respective weight matrices $W$ and $\tilde{W}$; $b_j$ and $\tilde{b}_j$ are the $j$-th elements of the respective bias vectors $b$ and $\tilde{b}$; and $a: \mathbb{R} \to \mathbb{R}$ is a nonlinear activation function.
The network parameters can be derived by minimizing the empirical error between the original inputs and their reconstructions. For the sake of simplicity, let us define a combined network parameter as $\theta$ and the corresponding space as $\Theta$, such that $\theta \in \Theta$. Also, let us denote the combined encoder-decoder function as $f: \mathbb{R}^d \to \mathbb{R}^d$. Then, the optimal network parameters can be derived as
\[
\theta^* = \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} \|\tilde{x}_i - x_i\|_2^2 = \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} \|f(x_i; \theta) - x_i\|_2^2.
\]
The training is done by back-propagation in the same way as for a standard NN. We could, of course, adopt a deeper network architecture by stacking many hidden layers; in this case, greedy layer-wise training would be applied to train the network efficiently.
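As a concrete illustration, the following sketch implements a single-hidden-layer AE of this form in PyTorch. The framework choice, layer sizes, sigmoid activation, and optimizer settings are our own assumptions, not prescribed by the text:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Single-hidden-layer AE: z = a(Wx + b), x_tilde = a(W~ z + b~)."""
    def __init__(self, d: int, l: int):
        super().__init__()
        self.encoder = nn.Linear(d, l)   # W (l x d) and bias b
        self.decoder = nn.Linear(l, d)   # W~ (d x l) and bias b~
        self.act = nn.Sigmoid()          # nonlinear activation a(.)

    def forward(self, x):
        z = self.act(self.encoder(x))        # encoded feature z in R^l
        x_tilde = self.act(self.decoder(z))  # reconstruction x~ in R^d
        return x_tilde, z

def train(model, x, epochs=100, lr=1e-3):
    """Minimize the empirical reconstruction error over theta = (W, b, W~, b~)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # squared l2 reconstruction error (up to scaling)
    for _ in range(epochs):
        opt.zero_grad()
        x_tilde, _ = model(x)
        loss = loss_fn(x_tilde, x)  # back-propagated to update theta
        loss.backward()
        opt.step()
    return model
```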
In the context of feature learning on remotely sensed data, the simplest approach assumes a network input of spectral information from each pixel. The AE network is thus trained to reconstruct inputs pixel by pixel. The new feature representation is then given by the encoded features, i.e., $z = (z_1, \ldots, z_l) \in \mathbb{R}^l$. Patch-based approaches are often used to learn spatial context, in which case the inputs are given as cuboids or flattened spectral vectors.
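For instance, a hyperspectral cube of shape $H \times W \times d$ can be flattened into $n = HW$ pixel spectra and fed to the model above; the array shapes and random stand-in data here are illustrative assumptions:

```python
import torch

H, W, d, l = 64, 64, 103, 16          # illustrative image size and dimensions
cube = torch.rand(H, W, d)            # stand-in for a hyperspectral image
pixels = cube.reshape(-1, d)          # n = H*W spectral vectors x_i in R^d

model = train(Autoencoder(d, l), pixels)
with torch.no_grad():
    _, z = model(pixels)              # encoded features, one z_i in R^l per pixel
features = z.reshape(H, W, l)         # new l-dimensional feature map
```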
6.2.2 Spatial Embedding into Autoencoder
Even though patch-based training can capture spatial context to some extent, the encoded features do not always satisfy the spatial property that neighboring pixels have similar representations.
The desired feature representation is one in which the transformed features of the inputs hold both spatial context information and similarity within the geometric neighborhood. The following optimization framework for AEs was designed to achieve that goal:
\[
\theta^* = \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} \sum_{j \in N_i} \|\tilde{x}_i - x_j\|_2^2 = \arg\min_{\theta \in \Theta} \sum_{i,j=1}^{n} B_{ij} \|f(x_i; \theta) - x_j\|_2^2, \quad (6.1)
\]
where $N_i$ is the set of indexes of the pixels neighboring pixel $i$ (including $i$ itself), and $B$ is an adjacency matrix with $B_{ij} = 1$ if $j \in N_i$ and $B_{ij} = 0$ otherwise. This formulation requires that the reconstruction $\tilde{x}_i$ of input $x_i$ should approximate not only the original input $x_i$ but also its neighbors $x_j$ ($j \in N_i$). The embedded features of input $x_i$ should also represent its neighborhood well, meaning that some of its features are shared among its neighbors. As a result, adjacent pixels are expressed by a feature representation resembling their neighborhood in a compressed space. As another effect, we can extract a latent representation by considering the reconstruction of neighbors as well as the original inputs.
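A minimal sketch of this loss, assuming a 4-neighborhood on the image grid (the neighborhood shape and helper names are our own choices; a vectorized version via the adjacency matrix $B$ appears in a later sketch):

```python
import torch

def neighbors(i, H, W):
    """Indexes of the 4-neighborhood of flattened pixel i, including i itself."""
    r, c = divmod(i, W)
    idx = [(r, c), (r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [rr * W + cc for rr, cc in idx if 0 <= rr < H and 0 <= cc < W]

def spatial_recon_loss(model, pixels, H, W):
    """Eq. (6.1): each reconstruction must approximate the input AND its neighbors.
    Written as explicit loops for clarity, not speed."""
    x_tilde, _ = model(pixels)
    loss = pixels.new_zeros(())
    for i in range(H * W):
        for j in neighbors(i, H, W):
            loss = loss + ((x_tilde[i] - pixels[j]) ** 2).sum()
    return loss
```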
Figure 6.3 graphically summarizes the new AE's workflow. In the generalized AE, the spectral intensities of pixels or their flattened patches are fed into the network as inputs. They are forward-propagated through the network, and the encoded features reconstruct the inputs at the output layer. The reconstruction error is evaluated in terms of the original input and its neighboring inputs. The error is then back-propagated through the network, and the gradient updates the network parameters. After network training terminates, we extract the encoded features of the samples as outputs of the hidden layer and use those features as explanatory variables in various kinds of analysis, such as image classification, segmentation, and restoration.
This formulation is not limited to simple AEs: it can be applied even to deep networks and convolutional AEs. Moreover, combining it with patch-based training can yield contextual features shared by neighbors. In this case, the formulation becomes
\[
\theta^* = \arg\min_{\theta \in \Theta} \sum_{i,j=1}^{n} B_{ij} \|F(X_i; \theta) - X_j\|_2^2,
\]
where each $X_i$ denotes an input cuboid or vectorized patch, and $F(\cdot)$ is an encoder-decoder function defined on the corresponding space.

Figure 6.3: AE workflow on spectral imagery.
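Equivalently, the neighbor sums in Eq. (6.1) and in the patch formulation above can be computed through the adjacency matrix $B$ directly; the same code applies to patches once each $X_i$ is flattened to a vector. A dense sketch for small images, reusing the neighbors() helper from the earlier sketch (the dense matrix is our illustrative choice; a sparse $B$ would be preferred at scale):

```python
import torch

def build_adjacency(H, W):
    """Dense B with B_ij = 1 when j is in the 4-neighborhood of i (self included)."""
    n = H * W
    B = torch.zeros(n, n)
    for i in range(n):
        B[i, neighbors(i, H, W)] = 1.0
    return B

def generalized_ae_loss(model, x, B):
    """sum_{i,j} B_ij ||f(x_i) - x_j||^2, vectorized over all pixel pairs."""
    x_tilde, _ = model(x)
    d2 = torch.cdist(x_tilde, x) ** 2   # (n, n) squared l2 distances
    return (B * d2).sum()               # zero entries of B mask non-neighbors
```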
Relationships with Other Methodologies
The approach here was inspired by Wang et al. (2014), which termed the formulation a generalized autoencoder. The original work sought to address issues in the computer vision field such as manifold learning and image processing. Their AE network explicitly minimized the residuals among neighbors through a framework like Eq. (6.1), with the neighbors defined according to the feature distance in the original input space. That is, the neighborhood of the $i$-th sample was basically determined according to $\|x_i - x_j\|_2^2$ ($j = 1, \ldots, n$). For remote sensing images, on the other hand, we can define the neighborhood more simply, because pixels or patches naturally maintain a geographical spatial dependency defined by grids.
Graph embedding has a close relationship with the generalized AE. It constructs a low-dimensional manifold of the original input space or a graph by minimizing
\[
\sum_{i,j=1}^{n} B_{ij} \|z_i - z_j\|_2^2, \quad (6.2)
\]
under a constraint to avoid a trivial solution. We can interpret this as direct control of sample relations described not in the original space but in the embedded space (Yan et al., 2007). In the generalized AE, assume that $\tilde{W} = W^T$, that the dimension of the embedded feature is $l = 1$, and that the activation is the identity. Then, the optimal solution $w^*$ of the generalized AE can be derived as
\[
w^* = \arg\min_{w} \sum_{i,j} B_{ij} \|w w^T x_i - x_j\|_2^2. \quad (6.3)
\]
Letting $w^T w = c$ and $z_i = w^T x_i$, the equation becomes
\[
w^* = \arg\min_{w^T w = c} \sum_{i,j} B_{ij} \left\{ \|w^T x_i - w^T x_j\|_2^2 + (c - 2) z_i^2 \right\}. \quad (6.4)
\]
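To see this, expand the squared norm (our own intermediate step, assuming a symmetric adjacency, $B_{ij} = B_{ji}$):
\[
\begin{aligned}
\|w w^T x_i - x_j\|_2^2 &= z_i^2\, w^T w - 2 z_i\, w^T x_j + \|x_j\|_2^2 \\
&= c\, z_i^2 - 2 z_i z_j + \|x_j\|_2^2 \\
&= (z_i - z_j)^2 + (c - 1) z_i^2 - z_j^2 + \|x_j\|_2^2.
\end{aligned}
\]
Summing $B_{ij} z_j^2$ over a symmetric $B$ gives the same value as summing $B_{ij} z_i^2$, so the terms $(c-1) z_i^2 - z_j^2$ contribute $(c-2) z_i^2$ to the objective, while $\|x_j\|_2^2$ does not depend on $w$ and can be dropped.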
When $c = 2$, i.e., when the second term inside the sum vanishes, the generalized AE behaves similarly to graph embedding. When $c > 2$, the additional term prevents the encoded representation from becoming too large.
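For reference, the graph-embedding objective of Eq. (6.2) takes one line with the same dense adjacency used earlier (illustrative only):

```python
def graph_embedding_loss(z, B):
    """Eq. (6.2): sum_ij B_ij ||z_i - z_j||^2 over encoded features z."""
    return (B * torch.cdist(z, z) ** 2).sum()
```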
Furthermore, we can jointly minimize a trade-off between the reconstruction loss and the residuals of embedded features (see, e.g., Wang et al. (2016); Ma et al. (2016)). The formulation of Eq. (6.1), however, implicitly preserves sample relationships from the input space even in the embedded space. Wang et al. (2014) showed in their experiments with the MNIST handwritten digit dataset that the embedded features obtained by the generalized AE constituted a more separable feature space than those obtained by normal AEs: these embedded features made the intra- and inter-class variability among samples smaller and larger, respectively.
A pixel's spectral feature pattern is generally close to those of its neighbors; that is, it can be regarded as the feature of an adjacent pixel with small added noise. In this sense, we can expect our AE to behave in the same way as denoising AEs (Vincent et al., 2008; Xing et al., 2016). In denoising AEs, inputs are intentionally corrupted, which makes the trained network robust to input distortions and generates more generalized features.
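For comparison, a denoising AE corrupts each input before encoding but still targets the clean input at the output; a minimal sketch reusing the model above (the Gaussian corruption and noise level are illustrative assumptions):

```python
import torch

def denoising_step(model, x, opt, sigma=0.1):
    """One training step of a denoising AE: encode a corrupted input,
    reconstruct the clean one."""
    opt.zero_grad()
    x_noisy = x + sigma * torch.randn_like(x)  # intentional corruption
    x_tilde, _ = model(x_noisy)
    loss = ((x_tilde - x) ** 2).sum()          # error against the CLEAN input
    loss.backward()
    opt.step()
    return loss.item()
```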