Correction Confirmation Report


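Date of approval of corrections: November 19, 2018
Date of application for corrections: November 1, 2018
Title: Research on Visual Features based Content Adaptive Strategies for Video Coding
Author: Minghui WANG
Field and type: Integrated Systems field, doctoral dissertation
Reporter and confirmer (in the order listed): Head of the Correction Working Group, Shinji Kimura (木村晋二); Kohei Tatsumi (巽宏平)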

In light of Article 23, Paragraph 1 of the Degree Regulations, this dissertation does not warrant revocation of the degree. However, passages requiring correction were identified, and we report below the results of confirming the corrections that the author made in response.

1. Corrected passages and details of the corrections

(1) Location: Page 3, Lines 1-14
Type of correction: Revision of text
Details:
Original text: The encoding process of a H.264/AVC encoder is shown in Fig.2 which can be divided into forward path and reconstruction path. In the forward path, each macro block (MB) is predicted to remove redundant information in spatial domain. Motion estimation and motion compensation, known as inter prediction uses the relation between different frames to perform the prediction. Intra prediction uses the relation inside one single frame to perform the prediction. A rate-distortion-optimization (RDO) based mode decision is performed after inter and intra prediction and decide which mode the current MB should apply. The prediction is subtracted from the current block to produce a residual block Dn. It is then transformed by a block transform and quantized to X, a set of quantized transform coefficients which are reordered and entropy encoded. The entropy encoded coefficients, together with side information required to decode each block within the MB form the compressed bit stream. In the reconstruction path, the encoder decodes it to provide reference frames for further inter prediction.
Corrected text: There are two major path of the standard H.264/AVC encoding process, as shown in Fig.2, forward path and reconstruction path. The forward path encode the current frame/current Macro-Block (MB) by Intra and Inter prediction, which reduce the spacial redundancy in a frame significantly. In order to decide which mode the current MB should apply, after inter and intra prediction, the rate-distortion-optimization (RDO) based mode decision is performed to find the optimized mode to process the current MB. A residual block Dn is produced by a subtraction between current block and reconstructed block, which is the output of the reconstruction path. Block transform and quantization are taken to further compress the MB.
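The forward-path steps quoted in item (1), residual, block transform and quantization, can be illustrated with a minimal Python sketch. It is not taken from the dissertation. The 4x4 matrix is the H.264/AVC forward core transform, but the flat quantizer step, the sample blocks and all function names are assumptions chosen for illustration; a real encoder uses QP-dependent scaling, and reordering and entropy coding are omitted.

```python
# Illustrative sketch of the forward path described in item (1).
# Function names, sample blocks, and the flat quantizer step are hypothetical.

# H.264/AVC 4x4 forward core transform (integer approximation of the DCT).
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def residual_block(current, prediction):
    """Dn = current block minus its intra/inter prediction."""
    return [[c - p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(current, prediction)]


def core_transform(block):
    """Y = CF * block * CF^T for a 4x4 residual block."""
    return matmul(matmul(CF, block), transpose(CF))


def quantize(coeffs, step):
    """Uniform scalar quantizer (truncates toward zero).

    Assumption: a single flat step stands in for the standard's
    QP-dependent scaling.
    """
    return [[int(c / step) for c in row] for row in coeffs]


current = [[52] * 4 for _ in range(4)]      # hypothetical current block
prediction = [[50] * 4 for _ in range(4)]   # hypothetical prediction
dn = residual_block(current, prediction)    # residual block Dn
x = quantize(core_transform(dn), step=8)    # quantized coefficients X
```

For a flat residual, all of the energy ends up in the single DC coefficient at position (0, 0), which is why transforming before quantizing compresses smooth blocks well.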
(2) Location: Page 11, Lines 9-21
Type of correction: Revision of text
Details:
Original text: The spatial resolution of the human visual field drops greatly from the center to the edges [14]. This is the motivation of ROI detection. Each eye has approximately six million retinal cone cells. They are packed much more tightly in the center of our visual field - a small region called the fovea - than they are at the edges of the retina (see Fig.11). The fovea is only about 1% of the retina, but the brain’s visual cortex devotes about 50% of its area to input from the fovea. Furthermore, fovea cone cells connect 1:1 to the ganglia neuron cells that begin the processing and transmission of visual data, while elsewhere on the retina, multiple photoreceptor cells (cones and rods) connect to each ganglion cell. In technical terms, information from the visual periphery is compressed (with data loss) before transmission to the brain, while information from the fovea is not. All of this causes our vision to have much, much greater resolution in the center of our visual field than elsewhere [15][16].
Corrected text: The center of the human visual field has much larger spatial resolution than the edges [14], that leads to the motivation of ROI detection [75]. In each of human eye ball, as much as six million retinal cone cells are gathering more tightly in the center of our visual field so called fovea, than than they are at the edges of the retina (see Fig.11). Although the fovea only take 1% area in the retina, it takes 50% area of the input of brain’s visual cortex. Finally, foveal cone cells have 1-to-1 connection with the ganglia neuron cells. The Information from fovea delivers with no transmission lost. As a result, in the center of our visual field, the resolution is much greater than it in other area [15][16].
Reference added (subsequent reference numbers are shifted accordingly): [75] Jeff Johnson, Chapter 6 - Our Peripheral Vision is Poor, Editor(s): Jeff Johnson, Designing with the Mind in Mind, Morgan Kaufmann, 2010, Pages 65-77, ISBN 9780123750303, https://doi.org/10.1016/B978-0-12-375030-3.00006-5.

(3) Location: Page 12, Lines 3-8
Type of correction: Revision of text
Details:
Original text: Our visual system is much more sensitive to differences in color and brightness [14]. To see this, compare the two green circles in Fig.12. They are the same exact shade of green - the circle on the right was copied from the one on the left - but the different backgrounds make the one on the left appear darker to our contrast-sensitive visual system. Brain researcher Edward H. Adelson at the Massachusetts Institute of Technology developed an outstanding illustration of our visual system’s insensitivity to absolute brightness and its sensitivity to contrast (see Fig.13). As difficult as it may be to believe, square A on the checkerboard is exactly the same shade as square B. Square B only appears white because it is depicted as being in the cylinder’s shadow.
Corrected text: In Fig.12., It shows that differences in color and brightness can make much more sensitive to the visual system. Although the circle on the right is exactly same with the one on the left, the different backgrounds make the one on the right appear brighter to our contrast-sensitive visual system. To test the insensitivity to absolute brightness and its sensitivity to contrast, Brain researcher Edward H. Adelson at MIT developed an outstanding illustration (see Fig.13). Although square A and B has exactly the same shade, Square B appears whiter than A.

(4) Location: Page 13, Lines 5-10
Type of correction: Revision of text
Details:
Original text: Our peripheral vision exists mainly to provide low-resolution cues to guide our eye movements so that our fovea visits all the interesting and crucial parts of our visual field [14]. Our eyes don’t scan our environment randomly. They move so as to focus our fovea on important things, the most important ones (usually) first. The fuzzy cues on the outskirts of our visual field provide the data that helps our brain plan where to move our eyes, in what order.
Corrected text: In human visual system, eyes don’t move randomly. Peripheral vision provides low-resolution cues, which guides the eye ball movements. It indicates that all the interesting and crucial parts of our visual field are through the fovea [14].

(5) Location: Page 13, Lines 11-16
Type of correction: Revision of text
Details:
Original text: Our peripheral vision is good at detecting motion. Anything that moves in our visual periphery, even slightly, is likely to draw our attention - and hence our fovea - toward it. The reason for this phenomenon is that our ancestors - including pre-human ones - were selected for their ability to spot food and avoid predators. As a result, even though we can move our eyes under conscious, intentional control, some of the mechanisms that control where they look are preconscious, involuntary, and very fast.
Corrected text: The peripheral vision is more sensitive at detecting motion. This is a result of the evolution of human beings. This feature helps our ancestors find food and avoid predators. As a result, some of the mechanisms make our visual system can track moving objects without conscious.

(6) Location: Page 15, Lines 16-21
Type of correction: Revision of text
Details:
Original text: In order to maximize the visual quality, some works are proposed to avoid the distortion during the coding process and focus on “protecting” the sharp edges in depth map. One approach is depth down-sampling before classical MVC encoding and special up-sampling after decoding to recover some of the original depth edge information [22]. Also wavelet coding was applied in order to perform better edge preservation in depth compression [23]. A computer graphics-based approach was taken in [24], where depth maps were converted into meshes and coded with mesh-based compression methods.
Corrected text: In order to maximize the visual quality, some works are proposed to avoid the distortion during the coding process and focus on “protecting” the sharp edges in depth map [76]. In [22], it proposed a method that taking depth down-sampling before classical MVC encoding and special up-sampling after decoding, which can somehow recover some information of the original depth edges. In [23], wavelet coding was applied for better performance of edge preservation in depth compression. In [24], mesh-based compression methods are adopted, that depth maps are converted into meshes then coded.
Reference added (subsequent reference numbers are shifted accordingly): [76] Muller, Karsten, P. Merkle, and T. Wiegand. "3-D Video Representation Using Depth Maps." Proceedings of the IEEE 99.4(2011):643-656.

(7) Location: Page 17, Lines 19-24
Type of correction: Revision of text
Details:
Original text: In Gautier’s work [63], the edges at object boundaries are first detected using a Sobel operator and their positions are encoded using the JBIG. The luminance values of the pixels along the edges are then encoded using an optimized path encoder. The decoder runs a fast diffusion-based inpainting algorithm which fills in the unknown pixels within the objects by starting from their boundaries.
Corrected text: In Gautier’s work [63], first a Sobel operator detects object boundaries. Then JBIG is applied to code the positions of edges. Pixels along the edges are coded with path encoder. Finally, diffusion-based inpainting algorithm is adopted to fill the other non-edge parts, starting from their boundaries.

(8) Location: Page 20, Lines 17-21
Type of correction: Revision of text
Details:
Original text: Michele A. Saad and Alan C. Bovik’s work [47] explores four distinct approaches to extracting ROI from still images, and compared their performance. The four approaches are: 1) a block based discrete wavelet transform (DWT) algorithm, 2) a color saliency approach, 3) a wavelet coefficients variance saliency approach, and 4) an approach based on mean-shift clustering of image pixels
Corrected text: M. A. Saad and A. C. Bovik introduce four approaches of defining ROI in a still image [47], which are: 1) discrete wavelet transform (DWT) algorithm on a block structure, 2) color saliency analysis, 3) an approach based on wavelet coefficients variance saliency and 4) algorithm of mean-shift clustering image pixels.

(9) Location: Pages 16, 18, 20, 21, 33: Fig. 1, Fig. 3, Fig. 4, Fig. 5, Fig. 11, Fig. 15
Type of correction: Deletion of figures
Details: Figs. 1, 3, 4, 5, 11 and 15 are deleted, and the remaining figures are renumbered consecutively. References to these figures in the related text (e.g., "Fig. 1") are also deleted.
Deleted figures:
Fig. 1. “Face-time” on iOS devices.
Fig. 3. Progress of capturing and display.
Fig. 4. Camera array and the capturing layout for the multiview test sequence (MPEG-FTV).
Fig. 5. The Stanford Multi-Camera Array.
Fig. 11. Distribution of photoreceptor cells (cones and rods) across the retina.
Fig. 15. A face and smile detection algorithm by OMRON.

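(10) Location: Page 22, Fig. 6
Type of correction: Revision of figure
Details: The style of Fig. 6 was changed to match the other figures. The revised figure is Fig. 2; the figure number changed because of the figure deletions.
Revised figure: Fig. 2. MVC system with depth map included.

(11) Location: Page 26, Fig. 12
Type of correction: Revision of figure
Details: Fig. 12 was revised. The revised figure is Fig. 7; the figure number changed because of the figure deletions.
Original figure: Fig. 12. Same green circles in different background.
Revised figure: Fig. 7. Same green circles in different background.

(12) Location: Page 27, Fig. 13
Type of correction: Revision of figure
Details: Fig. 13 was revised. The revised figure is Fig. 8; the figure number changed because of the figure deletions.
Original figure: Fig. 13. Absolute brightness and contrast.
Revised figure: Fig. 8. Absolute brightness and contrast.

(13) Location: Page 39, Fig. 18
Type of correction: Revision of figure
Details: Fig. 18 was revised. The revised figure is Fig. 12; the figure number changed because of the figure deletions.
Original figure: Fig. 18. Experimental results of the combined filter.
Revised figure: Fig. 12. Experimental results of the combined filter.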

(14) Location: Page 88, Lines 1-3
Type of correction: Revision of text
Details:
Original text: Furthermore, in order to evaluate the visual quality “objectively”, we adopt the structural similarity (SSIM) index. It is a method for measuring the similarity between two images. The SSIM index is a full reference metric, in other words, the measuring of image quality based on an initial uncompressed or distortion-free image as reference. SSIM is designed to improve on traditional methods like peak signal-to-noise ratio (PSNR) and mean squared error (MSE), which have proved to be inconsistent with human eye perception. The difference with respect to other techniques mentioned previously such as MSE or PSNR, is that these approaches estimate perceived errors on the other hand SSIM considers image degradation as perceived change in structural information. Structural information is the idea that the pixels have strong inter-dependencies especially when they are spatially close. These dependencies carry important information about the structure of the objects in the visual scene.
Corrected text: Furthermore, structural similarity (SSIM) index is applied as an objectively visual quality evaluation [77]. It measures the similarity between images. Uncompressed or distortion-free image is taken as reference to measure the image quality. In contrast to the traditional methods like “peak signal-to-noise ratio” (PSNR) and “mean squared error” (MSE), SSIM is more consistent with human visual perception. Image degradation is considered as perceived change in structural information in SSIM. It indicates that spatially-close pixels have strong inter-dependencies, which carrying large amount of structure information in a certain scene.
Reference added (subsequent reference numbers are shifted accordingly): [77] Zhou Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," in IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, April 2004.
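The contrast drawn in item (14) between error-based metrics (MSE, PSNR) and SSIM can be made concrete with a short Python sketch. It is a simplification, not the dissertation's evaluation code: SSIM is computed once over the whole image, using population statistics, instead of over the local sliding windows of [77]. The function names and sample pixel values are hypothetical, and K1 = 0.01 and K2 = 0.03 are the commonly used default constants.

```python
import math

# Simplified, single-window versions of the metrics named in item (14).
# Images are flat lists of 8-bit luminance values; all names are illustrative.


def mean(values):
    return sum(values) / len(values)


def mse(ref, test):
    """Mean squared error between a reference and a test image."""
    return sum((a - b) * (a - b) for a, b in zip(ref, test)) / len(ref)


def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB (infinite for identical images)."""
    err = mse(ref, test)
    return math.inf if err == 0 else 10.0 * math.log10(peak * peak / err)


def global_ssim(ref, test, peak=255.0, k1=0.01, k2=0.03):
    """SSIM over one global window (assumption: no local windowing)."""
    mx, my = mean(ref), mean(test)
    var_x = mean([(a - mx) * (a - mx) for a in ref])
    var_y = mean([(b - my) * (b - my) for b in test])
    cov = mean([(a - mx) * (b - my) for a, b in zip(ref, test)])
    c1 = (k1 * peak) ** 2  # stabilizes the luminance term
    c2 = (k2 * peak) ** 2  # stabilizes the contrast/structure term
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (var_x + var_y + c2))


reference = [0, 50, 100, 150]           # hypothetical reference pixels
brighter = [p + 10 for p in reference]  # same structure, uniform offset

shift_mse = mse(reference, brighter)
shift_psnr = psnr(reference, brighter)
shift_ssim = global_ssim(reference, brighter)
```

In this toy case the uniform brightness offset gives a nonzero MSE, while the global SSIM stays close to 1 because the structure (variance and covariance) of the two images is unchanged.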

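2. Reason for the corrections

Inappropriate citation was found in the introductory chapter and in the introductory parts of individual chapters, so corrections were directed at 14 locations. Because inappropriate citation was especially frequent for figures and photographs, the author was required to make corrections that included deletions. Specifically, items (1)-(8) and (14) are revisions of text, item (9) is the deletion of figures, and items (10)-(13) are revisions of figures.

3. Reason for approving the corrections

All corrected passages are in the introductory chapter or in the introductory parts of individual chapters, and the corrections do not affect the essential results of the doctoral dissertation. The corrections are therefore judged to be appropriate. The deleted figures were supplementary explanatory figures, and removing them has no bearing on the technical content of the dissertation, so the deletions are likewise judged to be appropriate.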
