Correction Confirmation Report


Correction approval date: November 19, 2018

Correction request date: November 12, 2018

Title: Research on Transcoding of MPEG-2/H.264 Video Compression

Author: Xianghui WEI

Reported by: 木村晋二, Chair of the Doctoral Dissertation Correction Working Group, Integrated Systems Field

Confirmed by: 巽宏平


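In light of Article 23, Paragraph 1 of the Degree Regulations, this dissertation does not warrant revocation of the degree; however, passages requiring correction were identified. The results of confirming the corrections made by the author are reported below.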

1. Corrected passages and corrections

(1) Corrected passage: Abstract, Page 1, Paragraph 1; Correction: revision of text

Details:

Video transcoding performs one or more operations, such as bit-rate and format conversions, to transform one compressed video stream to another. Transcoding can enable multimedia devices of diverse capabilities and formats to exchange video content on heterogeneous network platforms such as the Internet. One application is delivering a high-quality multimedia source (such as a DVD or HDTV) to various receivers (such as PDA, Pocket PC, and fast desktop PC) on wireless and wireline networks. This application requires function of adjusting bit-rate of video bitstream. Another application is a video conferencing system on the Internet in which the participants may be using different terminals. Thus transcoder in this situation must offer two functionalities: provide video format conversion to enable content exchange, and perform dynamic bit rate adjustment to facilitate proper scheduling of network resources.

Video transcoding transforms a source video into a target video by changing the video syntax, resolution, bit-rate, and so on. It provides fundamental facilities for multimedia systems and aims to meet the requirements of end equipment with diverse processing abilities and of heterogeneous network conditions, such as bandwidth.

Video syntax alteration, which means transforming from one video standard to another, is a primary function of a transcoder. Resolution, bit-rate, and frame-rate are also adjusted based on the ability of the target equipment or the network condition. One obvious advantage of a transcoder is that it requires no extra change to the decoder.

(2) Corrected passage: Abstract, Page 1, Paragraph 3; Correction: revision of text

Details:

This dissertation focuses on transcoding from MPEG-2 to H.264/AVC for HDTV application. The MPEG-2 video coding standard (also known as ITU-T H.262), which was developed about ten years ago primarily as an extension of prior MPEG-1 video capability with support of interlaced video coding, was an enabling technology for digital television systems worldwide. It is widely used for the transmission of standard definition (SD) and HDTV signals over satellite, cable, and terrestrial emission and the storage of high-quality SD video signals onto DVDs.

ITU-T Recommendation H.264 and ISO/IEC MPEG-4 (Part 10) Advanced Video Coding (or referred to in short as H.264/AVC) is the powerful and state-of-the-art video compression standard developed by the ITU-T/ISO/IEC Joint Video Team (JVT) consisting of experts from ITU-T's Video Coding Experts Group (VCEG) and ISO/IEC's Moving Picture Experts Group (MPEG). H.264/AVC represents a delicate balance between coding gain, implementation complexity, and costs based on state of VLSI (ASICs and Microprocessors) design technology. H.264/AVC design emerged with an improvement in coding efficiency typically by a factor of two over MPEG-2--the most widely used video coding standard today--while keeping the cost within the acceptable range.

This dissertation focuses on transcoding from MPEG-2 to H.264/AVC for HDTV applications. The MPEG-2 video coding standard (ITU-T H.262), which was first publicly released in 1996, is an extension of MPEG-1 that provides coding tools for interlaced video to make digital television available around the world.

MPEG-2 is used to transmit video signals in both standard definition (SD) and high definition (HD) over satellite, cable, and other channels.

Another application scenario is the storage of video on DVD. ITU-T Recommendation H.264 and ISO/IEC MPEG-4 (Part 10) Advanced Video Coding (H.264/AVC) was first publicly released in 2003, aiming to provide the same video quality as MPEG-2 at half the bitrate. It was developed by video experts from the ITU-T/ISO/IEC Joint Video Team (JVT). H.264/AVC strikes a good balance between coding efficiency and implementation complexity in both software and hardware. It achieves about half the bitrate of MPEG-2 at the same picture quality while keeping the implementation cost acceptable given the equipment processing ability available at that time.

(3) Corrected passage: Abstract, Page 4, Paragraph 1, Line 3; Correction: revision of text

Details:

Furthermore, from a systematic point of view, memory traffic introduced by other components such as variable-length-codec, DCT/IDCT, and so on must also be considered. Additionally, the motion-estimation process only uses the luminance pixel data, while the other components also use chrominance data thus increasing the importance of strong data-reuse level.

In addition, the bandwidth consumed by other parts of a video codec, such as variable length coding (VLC) and discrete cosine transform/inverse discrete cosine transform (DCT/IDCT), should be taken into consideration. Another factor is the chrominance components of the video signal, which must be processed in addition to the luminance part and consume extra bandwidth.

(4) Corrected passage: Abstract, Page 4, Paragraph 2; Correction: revision of text

Details:

In transcoding, motion estimation is usually not performed in the transcoder because of its computational complexity. Instead, motion vectors extracted from the incoming bit-stream are reused. In many existing works, motion vector is improved by a procedure called motion vector refinement. This method is based on observation that the motion vector deviation in most macroblocks is within a small range and the position of the optimal motion vector will be near that of the incoming motion vector.

Motion estimation is often avoided in transcoding to reduce computation complexity. Instead, the motion vector from the incoming video stream is often reused. In many existing works, the motion vector is improved by a procedure called motion vector refinement. This method is based on the observation that the motion vector deviation in most macroblocks is very small and the final motion vector from motion estimation is very close to the input motion vector from MPEG-2.
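The refinement idea described above can be illustrated with a minimal sketch. It assumes a sum-of-absolute-differences (SAD) cost and a small square window around the incoming vector; the function names `sad` and `refine_mv`, the block size, and the window radius are hypothetical choices for illustration and are not taken from the dissertation.

```python
def sad(cur, ref, bx, by, mvx, mvy, n):
    """SAD between the n x n current block at (bx, by) and the
    reference block displaced by the candidate vector (mvx, mvy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + mvy + y][bx + mvx + x])
    return total


def refine_mv(cur, ref, bx, by, mv_in, n=4, radius=1):
    """Search only a (2*radius+1)^2 window centred on the incoming vector
    (assumed here to come from the MPEG-2 stream) instead of a full search."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            mvx, mvy = mv_in[0] + dx, mv_in[1] + dy
            # Skip candidates whose reference block leaves the frame.
            if bx + mvx < 0 or by + mvy < 0:
                continue
            if bx + mvx + n > width or by + mvy + n > height:
                continue
            cost = sad(cur, ref, bx, by, mvx, mvy, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (mvx, mvy), cost
    return best_mv, best_cost


# Toy frames: the current frame is the reference shifted by the vector (2, 1).
ref_frame = [[7 * x + 13 * y + 3 * x * y for x in range(12)] for y in range(12)]
cur_frame = [[ref_frame[min(y + 1, 11)][min(x + 2, 11)] for x in range(12)]
             for y in range(12)]
print(refine_mv(cur_frame, ref_frame, 4, 4, (1, 1)))
```

With an incoming vector of (1, 1), only nine candidates are evaluated, and the true displacement (2, 1) lies inside that window.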

(5) Corrected passage: Page 9, Paragraph 1 – Page 12, Paragraph 1; Correction: revision of text

Details:

Video transcoding performs one or more operations, such as bit-rate and format conversions, to transform one compressed video stream to another. Transcoding can enable multimedia devices of diverse capabilities and formats to exchange video content on heterogeneous network platforms such as the Internet. One application is delivering a high-quality multimedia source (such as a DVD or

…(omitted)…

rate is set high, the base-layer video may not get through the network completely.

In general, the achievable quality of scalable coding is significantly lower than that of non-scalable coding. In addition, scalable video coding demands additional complexities at both encoders and decoders. Thus, pre-encoded video is used, scalable coding is inflexible since the number of different predefined layers is limited and the bit-rate of the target video cannot be reduced lower than the bit-rate of the base layer. Scalability alone does not solve the bit-rate adjustment problem.

Video transcoding transforms a source video into a target video by changing the syntax, resolution, or bit-rate of the video stream. Transcoding provides one possible way to communicate video signals among end equipment with diverse processing abilities over various network protocols. One application scenario is playing back video content on various end equipment, such as a mobile phone, laptop, or desktop PC. Alteration of the bitrate, resolution, or video syntax is a primary requirement for serving such varied end equipment. Another application scenario is video conferencing over the Internet, which also requires dynamic adjustment of video properties to fit heterogeneous networks and various end users.

Many other application scenarios of transcoding exist. One is video multiplexing [1], which is defined as transporting several compressed video streams through a single channel. [1] proposes a complexity estimation method for the input video streams to jointly allocate the bitrate of the output video in a constant bitrate (CBR) channel. Transcoding is also used to embed extra information, for example, watermarking. This usually occurs in complete transcoding, which is composed of full decoding, image processing, and encoding. Transcoding is also used in video playback operations, such as fast forward/reverse. [2, 3, 4] propose a video transcoding method to facilitate fast forward/reverse playback of online video.

Transcoding is also applied to other forms of video signal; for example, [5] proposes a video transcoding framework for object-based video signals.

Transcoding is analysed in different theoretical frameworks. A utility-based transcoding framework is discussed in [6]. It treats the adaptation of transcoding to diverse channels and terminals as a constrained utility maximization problem. [7] proposes a utility function prediction approach to reduce the complexity of MPEG-4 real-time video transcoding. Rate-distortion models are a fundamental tool in video coding. [8] proposes a rate-distortion model for transcoding to make an optimal selection among several options that satisfy the delivery constraint. Due to the importance of transcoding, video standards also take it into consideration. The MPEG-7 standard [9], which is a multimedia content description standard, provides a coding tool called transcoding hints to facilitate transcoding [10, 11].

There are several video coding standards available on the market, each developed at a different time for different applications. H.261 [12] and H.263 [13] were developed by the International Telecommunication Union (ITU) for video phone and video conferencing applications, which require a comparatively low bitrate.

MPEG-2 [14] is designed for high video quality applications such as TV broadcasting and DVD storage, which require a high bitrate. MPEG-4 [15] was developed for video streaming on mobile phones. MPEG-2 and MPEG-4 were developed by the ISO/IEC Moving Picture Experts Group (MPEG). H.264/MPEG-4 AVC was jointly developed by ITU and ISO, aiming to provide better video quality at a lower bitrate than previous video standards. Since diverse video coding standards are available, video syntax alteration between standards becomes a primary requirement for exchanging and delivering video content over heterogeneous networks. Transcoding between video signals of the same standard involves changes to resolution, bitrate, frame rate, etc., as shown in Fig. 1.2. Transcoding also serves applications such as inserting extra information, for example a company logo or error-prevention features.

Scalable video coding (SVC) provides another way of adapting to diverse end equipment and heterogeneous networks. A scalable video stream is composed of several subset streams, each of which carries a specified spatial resolution, temporal resolution, or video quality. Subset streams can be dropped to achieve a certain degree of adaptivity. Scalable video coding has several forms. Temporal scalability uses a specified motion compensation structure so that frames can be dropped to reach the target frame rate while maintaining the integrity of the subset stream. Spatial scalability contains several layers of different spatial resolutions to fit the display resolution of the end user. Quality scalability keeps the same resolution at different video qualities. All the subset streams must be combined to form a complete SVC stream. The overall bitrate is larger than that of each subset stream, and the coding efficiency is lower than that of transcoding, which produces only one output stream.

In video conferencing, the end equipment or an intermediate network node must decide which subset streams to keep based on the available network bandwidth. Bandwidth efficiency before the intermediate node is lower than with transcoding. Transcoding is a more flexible approach for these scenarios, but it requires more processing power at the intermediate node.

Details:

The definitions of spatial-domain and frequency-domain are shown in Fig. 1.3.


Spatial-domain architecture (SDTA) can perform dynamic bit-rate adaptation via the rate-control at the encoder side. This architecture is flexible since the decoder-loop and the encoder-loop can be totally independent of each other. This architecture is drift-free, but its computational complexity is high for

…(omitted)…

Thus, DCT/IDCT in all B frames can be omitted, and the transcoding of B frames can be directly done in the DCT domain. The transcoding delay can be further reduced without degrading the video quality in this architecture. P frames with frequent scene changes and rapid motion may contain a large number of INTRA blocks.

One can further omit the IDCT/DCT and MC operation of these INTRA blocks in P frames. In other words, blocks of I and B pictures and INTRA blocks of P pictures are transcoded in frequency-domain, the spatial-domain motion compensation is done only when the block is inter block in P frames. This transcoding architecture is defined as hybrid domain transcoding architecture (HDTA).

The spatial-domain and frequency-domain architectures are defined in Fig. 1.3.

Spatial-domain architecture (SDTA) is composed of a complete decoding process and a complete encoding process, which are independent. SDTA introduces no drift error since the video stream is completely restored to the pixel domain by decoding and then encoded into the target video format. The computation complexity of SDTA is high because of its full decoding and encoding operations. The complexity can be reduced by reusing information from the input video stream of the decoder. Motion information and mode decision information can be reused to alleviate the computation of encoding by providing a more accurate search center or reducing the range of candidate modes. In [17], it is reported that 60%–70% of the computation of motion estimation, which is the most computation-intensive part of the encoder, is saved with negligible video quality degradation.

The compressed source video stream is fully decoded into the pixel domain by performing variable length decoding (VLD), inverse quantization (Q_S^{-1}), inverse discrete cosine transform (IDCT), and motion compensation (MC). The decoded frames are then passed to the encoder, which performs mode decision, discrete cosine transform (DCT), quantization, and entropy coding. The MV reuse and mode reuse approaches are effective in reducing the complexity of the encoding part of transcoding. Extra MC can be applied when scaling operations exist between decoding and encoding to reduce drift error.

Frequency-domain transcoding architecture (FDTA) provides another approach to transcoding. DCT/IDCT are omitted by exploiting the linearity of DCT/IDCT [18], which yields a simpler architecture than SDTA. On the decoding side, only VLD and inverse quantization are applied to obtain the DCT coefficients of each block in a video frame. On the encoding side, MC is performed in the frequency domain. The motion-compensated residual signals are quantized and coded by VLC to produce the final compressed video stream. The reference frame buffer in the encoder stores the DCT coefficients from decoding, which are then fed to the MC module operating in the frequency domain [19, 20]. FDTA requires less computation than SDTA because DCT/IDCT are omitted. However, drift error cannot be avoided because of non-linear operations such as sub-pixel MC and DCT coefficient clipping. FDTA also provides less flexibility in transcoding functions than SDTA because operations such as resolution/timing scaling are more feasible in the pixel domain. Further simplified architectures are proposed in [19, 21, 22].

The transcoder structure can be customized to achieve a balance between computation complexity and video quality. In video coding, there are three frame types: I, B, and P. For an I frame, only intra coded blocks are allowed. For a P frame, both intra coded blocks and inter coded blocks with forward prediction are allowed. For a B frame, intra coded blocks, inter coded blocks with forward prediction, and inter coded blocks with forward/backward prediction are allowed. The transcoding algorithm can be customized at the frame level or block level based on frame/block types. [23] performs MC only in P frames. I frames skip ME and MC because blocks in an I frame are intra coded. IDCT/DCT are also not performed in I frames. Since an I frame acts as a reference frame for P and B frames, IDCT at the decoding end and inverse quantization and IDCT at the encoding end are performed for I frames to reconstruct the reference frame. MC, DCT, and IDCT must be performed for P frames since they also act as references. A B frame cannot be used as a reference, so drift error generated in it is acceptable because it cannot propagate to other frames. Based on this, MC and DCT/IDCT are omitted without significant video quality damage.

Since some of the main computation-intensive modules are omitted, transcoding delay is also reduced. A P frame can contain intra coded blocks when sharp motion or scene changes occur. Thus MC and DCT/IDCT are omitted for intra coded blocks in P frames. In summary, transcoding can be performed in a hybrid architecture, in which blocks in I and B frames and intra coded blocks in P frames are transcoded by the frequency-domain approach, while inter coded blocks in P frames perform MC in the spatial domain as in an ordinary encoder.

(7) Corrected passage: Page 15, Paragraph 2, Lines 2-6; Correction: revision of text

Details:

H.264 is a joint effort of MPEG and ITU with the first version finalized in May 2003 [24]. It aims to deliver a compression efficiency as twice as MPEG-2. The MPEG-2 standard [14] is widely used for the transmission of SD and HD TV signals over satellite, cable, and terrestrial emission and the storage of high-quality SD video signals onto DVDs.

H.264 was jointly developed by MPEG and ITU, with the first version completed in May 2003 [24]. It aims to deliver twice the compression efficiency of MPEG-2.

The MPEG-2 standard [14] is widely applied in the transmission of SD and HD TV signals over satellite and cable channels.

Details:

Motion estimation has been widely employed in the H.264/AVC, MPEG-1, -2, and -4 video standards to exploit the temporal redundancies inherent among video frames. Block matching is the most popular method for ME. Pixel-by-pixel difference is central to the block matching algorithms and result in high computation complexity and huge memory bandwidth. Benefitting from the rapid progress of VLSI technology, computation complexity requirements can easily be satisfied by multiple PEs architecture, even for large frame sizes and frame rate. However, without enough data, these PEs can hardly be fully utilized and

…(omitted)…

unrealistic without exploiting the locality or tricky designs. For example, the MPEG2 MP@ML format requires a memory bandwidth of tens of gigabytes per second, while the HDTV format with a large search range requires a terabytes per second bandwidth. This work only considers uni-directional ME, and bandwidth becomes even higher for bi-directional predictions. Redundancy relief is the solution to the huge bandwidth problem because many redundant accesses exist in the memory traffic of ME.

Motion estimation (ME), which computes the pixel difference between the current block and a reference block, has been widely applied in many video coding standards, such as H.264/AVC and MPEG-1/2/4. For inter-frame mode decision, ME provides the block difference metric for rate-distortion optimization (RDO). ME is done by computing the block difference at each candidate position in the reference frame.

The pixel-by-pixel difference is the core computation, which requires a huge amount of computation and memory traffic between the processing entity (PE) and memory storage.

Computation complexity can be addressed by a parallel PE architecture in which multiple PEs work concurrently. Such an architecture introduces no considerable hardware cost thanks to rapidly developing VLSI technology. The number and utilization ratio of parallel PEs are limited by the available data because of the limited bandwidth of the memory system that current technology can implement. Most of the time, ME processing ability is limited by the available bandwidth, not by the processing ability provided by the PEs. The memory bandwidth issue in ME can be addressed by data scheduling that avoids unnecessary redundant data traffic, together with an appropriate memory size. An optimized ME hardware architecture can provide the necessary processing power together with appropriate bandwidth under these limitations.

Many algorithms and hardware architectures have been proposed to implement ME. The full-search block-matching (FSBM) algorithm is the most widely applied one; it computes the block difference at each reference position to obtain the optimal match.

Various fast ME algorithms [25, 26, 27, 28, 29] have been proposed to alleviate the computation complexity introduced by FSBM; they search fewer reference positions, mostly based on empirical rules. Fast algorithms break the regular search path of FSBM, which is why they are rarely applied in hardware. FSBM computes the block difference along a pre-defined search path, which favors exploiting data locality or redundancy to address the bandwidth problem. Many architectures have been proposed to address the processing power problem, such as systolic arrays [30, 31, 32, 33],

Details:

Locality in Current Frame. In FSBM, each MB is independent, i.e., pixels of one MB do not overlap with other ones. Consequently, the lifetime of one MB is just the time period of ME of this MB. Each MB pixel is used srh × srv times during this period, showing that pixels of current MB have good locality compared with pixels of reference frame. This characteristic allows the simple approach of keeping all N × N MB pixels locally, allowing the design to result in a 1/(srh × srv) reduction in memory accesses of the current frame. Therefore, the additional on-chip memory for current MB reduces the access count of the current frame to just W × H for each frame, which is also the maximum possible saving.

…(omitted)…

Level B, except that it applies to the search area strips. Level D repeatedly reuses data already loaded from former search area strips for latter search area strips. Applying this level, one-access is achieved.

Locality in Current Frame. Each current block in FSBM remains unchanged for its search window because neighbouring current blocks have no overlapping area. The pixels of the current block are used in srh × srv block difference computations, which reveals the locality of the current block during ME. Thus keeping all N × N current block pixels in on-chip memory alleviates the bandwidth problem by reducing the memory traffic of the current block to 1/(srh × srv). For each frame, the current block data traffic is reduced to W × H, which is the maximum saving for current block data. This idea is employed in my ME architecture design to keep the current data bandwidth as low as possible while keeping the on-chip memory cost low.
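The saving stated in this paragraph can be worked through numerically. The sketch below only restates the arithmetic from the text (W × H × srh × srv reads without a local buffer, W × H reads with one); the function name `current_frame_accesses` and the example frame size and search range are assumptions chosen for illustration.

```python
def current_frame_accesses(width, height, srh, srv, keep_block_on_chip):
    """External-memory pixel reads of the current frame for one frame of FSBM."""
    pixels = width * height
    if keep_block_on_chip:
        # Each current-block pixel is loaded once and reused locally.
        return pixels
    # Otherwise each pixel is re-read for every one of the srh x srv candidates.
    return pixels * srh * srv


# Assumed example values: a 1920 x 1080 frame and a 32 x 32 search range.
without_buffer = current_frame_accesses(1920, 1080, 32, 32, False)
with_buffer = current_frame_accesses(1920, 1080, 32, 32, True)
print(without_buffer, with_buffer, without_buffer // with_buffer)
```

For these assumed values the ratio equals srh × srv = 1024, which matches the 1/(srh × srv) reduction described in the text.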

Locality in Reference Frame. Each search window in the reference frame is a rectangular area of (N + srh − 1) × (N + srv − 1) centred on the current block. Neighbouring search windows overlap along the processing order of the current blocks. Two kinds of locality exist: local and global. Local locality addresses overlapping within the search window of the same current block. Global locality exploits data overlapping among the search windows of a group of current blocks that are connected both horizontally and vertically. Five data reuse levels are defined, from Level A to Level D. The data reuse level defines the degree of bandwidth reduction of ME: the higher the reuse level, the lower the required memory bandwidth. The five reuse levels are defined below.

1. Level A–Locality within Reference Block Row. Local locality is defined as locality within the same row and locality between neighbouring rows. A row of reference blocks is defined as a reference block row, as shown in the upper part of Fig. 1.4. For two neighbouring reference blocks, part of the reference pixels overlap and remain unchanged. Without a data reuse scheme, the data traffic of two neighbouring search positions is 2 × N × N; with reuse, moving to the second position requires only N × 1 new pixels because only this part is refreshed. Reference blocks 1 and 2 in Fig. 1.4 illustrate this kind of data locality. The unchanged part of reference block 1 is reused without introducing any extra reference data traffic when ME is processed in reference block 2. This kind of data locality can be applied to the entire reference block row, in which reference pixels are repeatedly reused for each search position until the end of the row is reached.

2. Level B–Local Locality Between Neighbouring Reference Block Rows. Vertically neighbouring candidate block rows overlap in most parts, as shown in the lower part of Fig. 1.4 by reference block rows 1 and 2. The area of the overlapped part of two reference block rows is (N + srh − 1)(N − 1). The pixels in the overlapped part can be reused until ME processing reaches the next reference block row. Only one line of reference pixels needs to be refreshed. This idea is applied to all reference block rows within the search area for each ME processing.

3. Level C–Global Locality Within Search Window Row. The idea behind Level C is similar to Level A. The global locality of a search window row defines reference data reuse between neighbouring search windows corresponding to their current blocks along the ME processing order. The search window row is an area whose height is equal to the height of the search window and whose width is equal to the width of the reference frame. Only one column of reference pixels needs to be refreshed between neighbouring search positions, which leads to a significant reduction in data traffic.

4. Level C+–Global Locality Between Search Windows of MB Group. Search window reuse can be extended to the search windows of a group of current blocks that are connected vertically [37]. The search windows of a current block group are loaded together, which yields horizontal and vertical search window reuse simultaneously: the vertically overlapped area between vertically connected current blocks is loaded only once, and the overlapped area of search windows between neighbouring current block groups remains unchanged. This scenario is shown in Fig. 1.5.

The Level C+ method can achieve a greater reduction of external memory access with less on-chip memory. At the same time, the Level C+ method combined with the scanning order of vertically connected blocks conflicts with coding tools such as mode decision and motion vector prediction, which use neighboring block information. In general, information from neighboring blocks helps to compress the current block; for example, in motion vector prediction, motion vectors from neighboring blocks act as a predictor that is subtracted from the current motion vector. A practical hardware implementation of the Level C+ method should take these issues into consideration to avoid performance degradation.

The processing order of blocks in video coding is called raster scan, in which blocks are processed from left to right and from top to bottom. The bottom block of a vertically connected group cannot use the information from its left block because that block has not yet been processed in raster scan. This information is available if the vertically connected blocks to the left are processed as a unit. Thus both bandwidth efficiency and algorithm efficiency can be satisfied.

The scan order mentioned above makes data from the left blocks available, but data from the top and top-right blocks are not available. This problem is solved by first processing a group of blocks connected horizontally in a row, and then processing the blocks connected vertically. The first pass provides side information from the top and top-right blocks. The second pass provides information from the left blocks and guarantees the bandwidth efficiency of the Level C+ method.


5. Level D–Global Locality Among Neighbouring Search Area Rows. The idea behind Level D is similar to Level B, except that it is applied to reference area rows. Level D exploits the data overlap between neighbouring reference area rows. The maximum data reuse is achieved with the largest on-chip memory.
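As an illustration of the simplest of the levels listed above, the following sketch counts reference pixel loads along one reference block row with and without Level A reuse. It assumes that candidate blocks slide one pixel at a time and that the on-chip buffer holds exactly one N × N candidate block; the function names are hypothetical, and the counts only restate the 2 × N × N versus N × 1 reasoning given for Level A.

```python
def row_loads_without_reuse(n, positions):
    """Every candidate position reloads its full n x n reference block."""
    return positions * n * n


def row_loads_level_a(n, positions):
    """Slide an n x n candidate block across one reference block row while
    keeping the columns shared with the previous position on chip."""
    on_chip = set()  # column indices currently buffered
    loads = 0
    for pos in range(positions):
        needed = set(range(pos, pos + n))
        new_columns = needed - on_chip
        loads += len(new_columns) * n  # each column holds n pixels
        on_chip = needed  # columns that slid out of the block are dropped
    return loads


# Assumed example values: N = 16 and 32 horizontal candidate positions.
print(row_loads_without_reuse(16, 32), row_loads_level_a(16, 32))
```

For two neighbouring positions the counts are 2 × N × N without reuse and N × N + N with reuse, so only one N × 1 column is fetched for the second position.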

(10) Corrected passage: Page 22, Paragraph 1; Correction: revision of text

Details:

In the evaluation of bandwidth reduction schemes, two factors can be used to evaluate the performance of data reuse schemes: on-chip memory size for the reference frame and redundancy access factor Ra. The on-chip memory size represents the required memory size to buffer the data of candidate blocks for data reuse. Ra is used to evaluate the external memory bandwidth and defined as

Ra = (Total memory bandwidth for reference frame) / (Minimum memory bandwidth)

In the evaluation of bandwidth reduction schemes, two criteria are used to measure the performance of data reuse methods: on-chip memory capacity and the redundant data access score Ra. The on-chip memory capacity measures the on-chip memory size needed to hold the data of the data reuse method. Ra is used to evaluate unnecessary data accesses and is defined as follows:

Ra = (Overall memory bandwidth for reference data) / (Minimum memory bandwidth)

(11) Corrected passage: Page 22, Paragraph 3, Lines 1-4; Correction: revision of text

Details:

To realize data reuse, on-chip memory holding already loaded data for further access is required. The local memory required for the current frame, i.e., for storing one current block is N × N. The local memory size for the previous frame equals the size of the overlapped region in each data-reuse level.

The implementation of data reuse schemes involves on-chip memory that holds loaded data for further processing. The on-chip memory size for one current block is N × N. The on-chip memory size for reference data is equal to the size of the reference data that needs to be kept.

(12) Corrected passage: Page 23, Caption of Figure 1.4; Correction: revision of text

Details:

A literature reference was added to the figure caption.

Figure 1.4: Level A: local locality within reference block strip. Level B: local locality between adjacent reference strips.

Figure 1.4: Level A: local locality within reference block strip. Level B: local locality between adjacent reference strips[36].


(13) Corrected passage: Page 24, Caption of Figure 1.5; Correction: revision of text

Details:

A literature reference was added to the figure caption.

Figure 1.5: Level C+ searching region data reuse for FSBMA, 2 MB are stitched vertically.

Figure 1.5: Level C+ searching region data reuse for FSBMA, 2 MB are stitched vertically[37].

Details:

The key to reduce the complexity for inter transcoding is the motion re-estimation and mode decision, which typically accounts for more than 80% of a full H.264/AVC encoder.

The most critical part of reducing the computation complexity of inter transcoding is motion estimation and mode decision, which normally account for more than 80% of the computation of an H.264/AVC encoder.

(15) Corrected passage: Page 27, Paragraph 2 - Page 28, Paragraph 1; Correction: revision of text

Details:

[38–40] introduce an HDTV recording system that employs MPEG-2 to H.264/AVC transcoding to achieve efficient storage of broadcast streams. A transform-domain MPEG-2 to H.264/AVC intra video transcoder architecture is developed. Input DCT coefficients are first converted to H.264/AVC transform (HT) coefficients in the transform-domain. For intra coding, a fast rate-distortion optimized macroblock mode decision based on a simple cost function calculated in the HT-domain is then performed. Finally, the HT coefficients are coded using the selected modes to generate the output H.264 bitstream. This transcoder architecture reduces the complexity requirement about 50%, while maintaining virtually the same coding efficiency. The inter coding is performed in transform domain based on transform-domain cost function.

Compared to spatial domain method, frequency domain method is more efficient but will introduce more video quality reduction (about 1 dB in [38]). [41] proposed compensation method for requantization errors and achieved about 5 dB quality improvement over the open-loop transform-domain-based transcoding.

A transcoding system from MPEG-2 to H.264/AVC for efficient HD video storage of broadcast signals is introduced in [38, 39, 40]. The main idea is to apply intra mode decision in the transform domain when converting from MPEG-2 to H.264/AVC. DCT coefficients from MPEG-2 are converted to H.264/AVC DCT coefficients in the transform domain. An intra mode decision based on a fast rate-distortion cost function calculated in the transform domain is then applied. Transform coefficients are coded using several pre-selected modes to produce the final H.264/AVC bitstream. This transcoding approach reduces computational complexity by about 50%. The coding efficiency is almost the same as that of the original system. The inter coding is performed in the transform domain based on a transform-domain cost function.

Compared to the spatial domain method, the frequency domain method is more efficient but introduces more video quality reduction (about 1 dB in [38]). [41] proposed a compensation method for quantization errors and achieved about 5 dB quality enhancement over open-loop transform-domain transcoding.

Details:

In [43], MV re-estimation for the 16×16 MB in H.264 encoder end follows the following procedure. The motion costs are compared for the MVPs from the MPEG-2 stream and for the H.264 MVP from the MBs neighbors, and the best one is selected as the search-center. Then the motion costs at the search center and the 4 pixel locations around the search-center with 1 pixel distance (top, bottom, left, and right) are compared to refine the search center. The search range is reduced because the MVPs from MPEG-2 MB are more accurate compared to the H.264 MVPs, especially when the MV field is not consistent. Furthermore, they are available at the picture boundary where some neighbor MBs are not available.

In [43], motion estimation for the 16×16 block in the H.264 encoder proceeds in the following steps. The rd-costs of the MV predictors from the MPEG-2 video stream and of the H.264 MV predictors from neighboring blocks are compared. The MV predictor with the minimum rd-cost is chosen as the search center position. The rd-costs of the search center and of the four search positions at one pixel offset are then compared to decide on a better search center. The MV predictor from the input MPEG-2 stream provides an accurate indication of the search center, so the search range is greatly decreased. At the frame boundary, MV predictors that are unavailable in the H.264 stream can be provided by the MPEG-2 stream, which further improves coding efficiency.
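The two steps described for [43] can be sketched as follows. This is only an illustration under stated assumptions: `choose_search_center`, `refine_center`, and the `cost` callable are hypothetical names, and the cost function is left abstract rather than being the rd-cost used in [43].

```python
def choose_search_center(candidates, cost):
    """Pick the MV predictor with the smallest cost.

    candidates: iterable of (x, y) predictors, e.g. MPEG-2 MVs and
                H.264 neighbour predictors (hypothetical inputs).
    cost:       callable mapping an (x, y) MV to a number.
    """
    return min(candidates, key=cost)


def refine_center(center, cost):
    """Compare the center with its four one-pixel neighbours.

    The center is listed first, so it is kept when costs tie.
    """
    cx, cy = center
    positions = [
        (cx, cy),      # center
        (cx, cy - 1),  # top
        (cx, cy + 1),  # bottom
        (cx - 1, cy),  # left
        (cx + 1, cy),  # right
    ]
    return min(positions, key=cost)
```

Any cost measure can be passed in, for example a SAD-based motion cost evaluated on the current block.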

Details:

Furthermore, from a systematic point of view, memory traffic introduced by other components such as variable-length-codec, DCT/IDCT, and so on must also be considered. Additionally, the motion-estimation process only uses the luminance pixel data, while the other components also use chrominance data thus increasing the importance of strong data-reuse level. Thus we can make a tradeoff between on-chip memory and bandwidth to meet the system requirement of ultra-low bandwidth.

In addition, the memory traffic caused by other parts of the video coding system, such as entropy coding and DCT/IDCT, should be taken into consideration. The chrominance components in a video coding system introduce extra memory traffic. These parts put a further burden on memory bandwidth. Therefore, a carefully designed data reuse scheme must strike a good balance between on-chip memory size and the limited available bandwidth to meet the requirement of an ultra-low-bandwidth system.

Details:

For video coding systems, motion estimation (ME) can remove most of temporal redundancy, so a high compression ratio can be achieved. Among various ME algorithms, a full-search block matching algorithm (FSBMA) is usually adopted because of its good quality and regular computation. In FSBMA, the current frame is partitioned into many small macroblocks (MBs) of size N ×N. For each MB in the current frame (current MB), one reference block that is the most similar

…(text omitted)…

array generates N distortions of a searching candidate and accumulates these distortions with N partial-column SADs in the vertical propagation. After accumulation in the vertical direction, N-column SADs are accumulated in the top adder tree in one cycle. The longer latency for loading reference pixels and large propagation registers are the penalties for the reduction of memory bandwidth and memory bit width.

In a video coding system, motion estimation (ME) reduces the temporal redundancy between adjacent frames to reach high compression efficiency. The full-search block matching algorithm (FSBMA) is the most widely applied approach in hardware because of its regular computation order and good compression ratio. In FSBMA, the current frame to be compressed is divided into small blocks, called macroblocks (MBs), of size N × N. For each current MB, that is, each MB in the current frame, the best-matching MB in the reference frame is searched for under a similarity criterion within a search range [−P, P). The sum of absolute differences (SAD) is the most widely applied criterion, which is defined as follows:

$$\mathrm{SAD}(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} D(i, j, m, n), \qquad -P \le m, n < P$$

$$D(i, j, m, n) = \left| \mathrm{curPixel}(i, j) - \mathrm{refPixel}(m+i, n+j) \right|$$

where curPixel(i, j) is the pixel value in the current frame at position (i, j), refPixel(m+i, n+j) is the pixel value in the reference frame corresponding to position (i, j) with displacement (m, n) within the search range, D(i, j, m, n) is the distortion measure between the current pixel and the corresponding reference pixel, and SAD(m, n) is the sum of the distortions of the current block for the search position (m, n). After all search positions in the search range have been examined, the position with the smallest SAD is chosen as the motion vector (MV) of the current block. By exhaustively searching every position in the search range, FSBMA achieves the best coding efficiency among ME algorithms. It also introduces the greatest computational complexity, which accounts for 50% to 90% of a video encoder in most cases. A carefully designed hardware architecture for ME is therefore required.
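As a software illustration of the full search described above, the following sketch evaluates the SAD at every displacement in [−P, P) and keeps the smallest one. It assumes NumPy arrays, treats the first index as the row, and uses hypothetical names (`sad`, `full_search`, `ref_window`); it is not the hardware design discussed in the dissertation.

```python
import numpy as np


def sad(cur_block, ref_block):
    """Sum of absolute differences between two equally sized blocks."""
    diff = cur_block.astype(np.int32) - ref_block.astype(np.int32)
    return int(np.abs(diff).sum())


def full_search(cur_block, ref_window, p):
    """Exhaustive block matching over displacements m, k in [-p, p).

    cur_block:  N x N current macroblock.
    ref_window: (N + 2p - 1) x (N + 2p - 1) reference pixels, laid out
                so that displacement (0, 0) starts at index (p, p)
                (an assumed layout for this sketch).
    Returns the best displacement and its SAD; the first minimum in
    raster order wins ties.
    """
    n = cur_block.shape[0]
    assert ref_window.shape == (n + 2 * p - 1, n + 2 * p - 1)
    best_mv, best_sad = None, None
    for m in range(-p, p):
        for k in range(-p, p):
            cand = ref_window[m + p:m + p + n, k + p:k + p + n]
            cost = sad(cur_block, cand)
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (m, k), cost
    return best_mv, best_sad
```

The double loop visits (2P)² candidates, each costing N² absolute differences, which is the source of the high computational load noted in the text.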

Variable block-size motion estimation (VBSME) adapts better to the picture signal than fixed block-size motion estimation (FBSME) by dividing the current block into multiple combinations of prediction units. The best combination of prediction units is chosen by rate-distortion optimization (RDO) with the minimum rd-cost. Thus it can achieve better coding efficiency. VBSME was first proposed in H.263 [13], then applied in MPEG-4 [15], WMV9.0 [50], and H.264/AVC [24]. In H.264/AVC, each MB is divided into prediction units of size 4 × 4, 4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8, and 16 × 16. One obvious disadvantage is that it introduces more computational complexity and more design difficulty in the hardware architecture of ME.

The hardware architecture of FBSME can be categorized into two different kinds:

• [Inter ME Architecture] The processing element (PE) computes SAD for a specific searching position.

• [Intra ME Architecture] The PE computes the distortion of a current pixel in current block for all searching positions.

A one-dimension inter ME architecture is proposed in [51] as shown in Fig. 4.1. The number of PEs is the same as the number of search positions in the horizontal direction of each search window, which is equal to 2ph. Each reference pixel is broadcast to every PE. A reference pixel is input into a specific PE under control of the selecting signal. The pixels of the current block are propagated into each PE through propagating registers. The partial SAD is stored in each PE. Each PE calculates the distortion value of each pixel, which is added to the partial SAD to produce the final accumulated SAD. The key design feature of this architecture is reference pixel broadcasting, which reduces the bandwidth of reference data significantly, although the global routing complexity also increases.

The two-dimension inter ME architecture is proposed by [52] as shown in Fig. 4.2. It consists of a matrix of PEs of size 2ph × 2pv. The design idea behind this work is similar to [51]: reference pixels are broadcast to each PE, and current pixels are propagated using propagating registers. The partial SAD is stored and summed up in each PE. The search window is divided into (2ph/N) × (2pv/N) reference data areas, each of which is computed by N × N PEs. The number of PEs is equal to the current block size. The key design feature is reference data broadcasting both horizontally and vertically to further exploit data reuse.

A two-dimension inter ME architecture with two reference data interleaving arrays is proposed in [53] as shown in Fig. 4.3. The hardware architecture utilizes a one-dimension PE array. Two features differ from previous work: current pixels are broadcast into each PE, and reference pixels are propagated by propagating registers. The partial SAD is stored and summed up in each PE as in previous work. The reference pixels must be loaded before SAD calculation. Partitioning the search window into small areas can help reduce the latency of reference data loading. In most cases, the search window can be divided into (2ph/N) × (2pv/N) partitions.

A two-dimension intra ME architecture is proposed in [54] as shown in Fig. 4.4, in which the number of PEs is equal to the size of the current block, for computing the distortion of each current pixel. The propagation of reference pixels is done by a matrix of propagating registers, in which reference data can be shifted upward, downward, and rightward. This architecture implements the snake searching of reference data. The reduction of on-chip memory comes at the cost of propagating registers and long loading latency. In this design, the distortion value is computed in each PE, and partial row SADs are propagated and summed up in the horizontal direction. The row SADs are summed up by an adder tree to get the final SAD. Thus no partial SAD needs to be stored in each PE because of the role of the propagating registers.

A two-dimension intra ME architecture is proposed in [55] as shown in Fig. 4.5. This work is characterized by a systolic mapping procedure on the dependence graph (DG). The reference pixels are propagated in the horizontal direction from one PE to another. The partial column SADs are summed up and propagated vertically and then horizontally. The pixels of the current block are stored in each PE as in previous work. The distortion of each current pixel is computed and added to the partial SAD in each PE, while the partial SAD is propagated from top to bottom. The column SADs are then summed up by adders and registers in the horizontal propagation.

A two-dimension intra ME architecture is proposed in [56] as shown in Fig. 4.6. This architecture is composed of PE groups in the vertical direction, each of which is arranged as a horizontal row of PEs. Current pixels are stored in each PE. Reference pixels are propagated by propagating registers, leading to serial data input and high memory efficiency. Partial column SADs are propagated vertically and finally summed up by an adder tree to produce the final SAD. The memory efficiency and the reduction in memory pin count are achieved at the cost of a long latency for loading reference data and a large number of propagating registers.

(19) Location of correction: Page 63, Paragraph 2 - Page 65, Paragraph 2. Type of correction: revision of text

Details:

The text was revised in conjunction with moving part of the content of Section 4.2.2.

For VBSME, [57] proposed two architectures. One is a 2-D intra-level architecture called the Propagate Partial SAD. The architecture is composed of N PE arrays with a 1-D adder tree in the vertical direction. Current pixels are stored in each PE, and two sets of N continuous reference pixels in a row are broadcasted to N PE arrays at the same time. In each PE array with a 1-D adder tree, N distortions are computed and summed by a 1-D adder tree to generate one-row SAD.

The row SADs are accumulated and propagated with propagation registers in the vertical direction. The other is SAD Tree architecture. The proposed SAD Tree is a 2-D intra-level architecture and consists of a 2-D PE array and one 2-D adder tree with propagation registers. Current pixels are stored in each PE, and reference pixels are stored in propagation registers for data reuse. In each cycle, N × N current and reference pixels are inputted to PEs. Simultaneously, N continuous reference pixels in a row are inputted into propagation registers to update reference pixels. In propagation registers, reference pixels are propagated in the vertical direction row by row. In SAD Tree architecture, all


distortions of a searching candidate are generated in the same cycle, and by an adder tree, N ×N distortions are accumulated to derive the SAD in one cycle.

In this dissertation, an IME module architecture is proposed for MPEG-2 to H.264/AVC transcoding. A modified ping-pong memory control scheme combined with Partial SAD VBSME is applied to realize the proposed Level C bandwidth reduction scheme for transcoding. The detailed contents are shown in the following sections.

In H.264/AVC, VBSME is adopted. One MB has 4 block modes (16 × 16, 16 × 8, 8 × 16, 8 × 8). The 8 × 8 mode is further divided into 4 modes, namely 8 × 8, 8 × 4, 4 × 8, and 4 × 4. For each MB, the coding costs of all 7 block modes are calculated, and the block mode with the smallest cost is chosen as the MB mode. Compared with a fixed block size ME algorithm, VBSME provides a higher compression ratio, but it also puts a heavy burden on the ME module. In hardware design, a partial-SAD reuse methodology is adopted to reduce computation complexity, which means the SADs of smaller blocks are stored and accumulated to get the SADs of bigger ones.
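The partial-SAD reuse idea can be illustrated in software as follows: the sixteen 4 × 4 SADs of one 16 × 16 MB are computed once, and the SADs of all larger block modes are obtained by adding them up. This is a minimal sketch assuming NumPy; the function names and the width × height key convention are illustrative and not taken from the dissertation.

```python
import numpy as np


def sad_4x4_grid(cur_mb, ref_mb):
    """Return a 4 x 4 array holding the SAD of each 4 x 4 sub-block."""
    diff = np.abs(cur_mb.astype(np.int32) - ref_mb.astype(np.int32))
    # Axes after reshape: (block row, row in block, block col, col in block).
    return diff.reshape(4, 4, 4, 4).sum(axis=(1, 3))


def merge_sads(s4):
    """Build the SADs of larger blocks by adding smaller ones.

    Keys are written as width x height (an assumed naming convention).
    """
    s8x4 = s4[:, 0::2] + s4[:, 1::2]      # pairs of horizontal neighbours
    s4x8 = s4[0::2, :] + s4[1::2, :]      # pairs of vertical neighbours
    s8x8 = s4x8[:, 0::2] + s4x8[:, 1::2]  # four 4x4 blocks each
    s16x8 = s8x8[:, 0] + s8x8[:, 1]       # top half, bottom half
    s8x16 = s8x8[0, :] + s8x8[1, :]       # left half, right half
    s16x16 = s16x8.sum(keepdims=True)     # whole macroblock
    return {
        "4x4": s4, "8x4": s8x4, "4x8": s4x8, "8x8": s8x8,
        "16x8": s16x8, "8x16": s8x16, "16x16": s16x16,
    }
```

Only the 4 × 4 level touches pixel data; every larger block SAD is a handful of additions, which is what makes the reuse attractive in hardware.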

The SAD Tree and Partial SAD architectures are proposed by [57] to implement VBSME. The SAD Tree architecture is a two-dimension intra ME design composed of a two-dimension PE matrix and a two-dimension adder tree, together with propagating registers for partial SADs and reference pixels. Reference data reuse is achieved by storing reference pixels in the propagating registers, while current pixels are held in each PE. The partial SAD is computed by the PEs from N × N current and reference pixels. The reference pixels in the propagating registers are refreshed with a row of N successive reference pixels. Reference pixels are transferred from the top row to the bottom row of the propagating registers. The distortion values of a search position are calculated by the N × N PE group and an adder tree in one cycle. These distortion values are summed up to get the SAD of each sub-block.

The Partial SAD architecture is composed of PE arrays whose size corresponds to a row of current pixels. Each PE array is connected to a one-dimension adder tree along the vertical direction. The PEs hold the current pixels, and two rows of successive reference pixels are broadcast into the PEs simultaneously. As shown in Fig. 4.9, the distortion values from the PEs are summed up by the one-dimension adder tree to get the SAD of one row. The row SADs are stored and transferred in propagating registers along the vertical direction, as shown in Fig. 4.10. The reference pixels of a search position are fed in an interleaved way, as indicated by Ref. Pixels 0 and Ref. Pixels 1. The SAD of the first column in this search position is calculated in the first cycle, while the SADs of the following search positions are generated in the next few cycles. The reference pixels of the next column are transferred in when the last search position is reached. The hardware can be fully utilized except during the initial cycles. A shorter critical path and reference data reuse are achieved by transferring reference pixels and partial row SADs in propagating registers.

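(20) Location of correction: Page 65, Paragraph 3. Type of correction: revision of text

Details:

The overview part was revised in conjunction with moving part of the content of Section 4.2.2.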

4.2 The Top-Level Architecture

Fig. 4.7 shows top-level architecture of the proposed transcoder IME module for HDTV720p application. There exist four reference pixel memories, each of which is a 47 × 16 bytes single-port SRAM. Actually the search window is 47 × 47 bytes. The applied memories size is for an easy hardware implementation. The memory update and output is controlled by memory input and output control unit, which is controlled by MV smoothness decision unit. The IME module is implemented with Partial SAD architecture.

In this dissertation, an IME architecture is proposed for MPEG-2 to H.264/AVC transcoding. A modified ping-pong memory control scheme combined with Partial SAD VBSME is applied to realize the proposed Level C bandwidth reduction scheme for transcoding. The detailed contents are shown in the following sections.

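(21) Location of correction: Pages 65 - 70, Sections 4.2.1 and 4.2.2. Type of correction: revision of text

Details:

The order of Section 4.2.1 Performance Analysis and Section 4.2.2 IME Architecture was swapped. At the same time, the text of the new Section 4.2.1 IME Architecture was revised.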

4.2.2 IME Architecture

In H.264/AVC, VBSME is adopted. For one MB, it has 4 block modes (16 × 16, 16 × 8, 8 × 16, 8 × 8). For 8 × 8 mode, it is further divided into 4 modes, namely 8 × 8, 8 × 4, 4 × 8, 4 × 4. For each MB, coding costs of all 7 block modes are calculated, and the block mode with the smallest cost is chosen as the MB mode. Compared with fixed block size ME algorithm, VBSME provides higher compression ratio, but it also puts heavy burden on the ME module. In hardware design, partial-SAD reuse methodology is adopted to reduce computation complexity, which means the SAD of smaller blocks are stored and accumulated to get the SAD of bigger ones.

The SAD Tree and Partial SAD architecture are proposed by [57] to realize VBSME.

The SAD Tree is a 2-D intra-level architecture and consists of a 2-D PE array and one 2-D adder tree with propagation registers. Current pixels are stored in each PE, and reference pixels are stored in propagation registers for data reuse. In each cycle, N × N current and reference pixels are inputted to PEs.

Simultaneously, N continuous reference pixels in a row are inputted into propagation registers to update reference pixels. In propagation registers, reference pixels are propagated in the vertical direction row by row. In SAD Tree architecture, all distortions of a searching candidate are generated in the same cycle, and by an adder tree, N × N distortions are accumulated to derive the SAD in one cycle.

The SAD tree architecture is suitable for highly parallelized application and can share reference buffer between parallel PEG. But it has long critical path delay, which is 14.1 ns based on our implementation. This delay cannot meet the performance requirement according to Section 4.2.1. To reduce delay, 16 12-bit registers can be inserted between SAD4×4 and larger block’s SAD addition to form a 2-stage pipeline. When applying snake-scan processing order, one PEG (256 PE) of SAD Tree architecture needs 16×12+16×17×8 = 2368 bit register.

The Partial SAD architecture is composed of N PE arrays with a 1-D adder tree in the vertical direction. Current pixels are stored in each PE, and two sets of N continuous reference pixels in a row are broadcasted to N PE arrays at the same time. In each PE array with a 1-D adder tree, N distortions are computed and summed by a 1-D adder tree to generate one-row SAD, as shown in Fig. 4.9.

The row SADs are accumulated and propagated with propagation registers in the vertical direction, as shown in the right-hand side of Fig. 4.10. The reference data of searching candidates in the even and odd columns are inputted by Ref. Pixels 0 and Ref. Pixels 1, respectively. After initial cycles, the SAD of the first searching candidate in the zeroth column is generated, and the SADs of the other searching candidates are sequentially generated in the following cycles. When computing the last searching candidates in each column, the reference data of searching candidates in the next columns begin to be inputted through another reference input. Then, the hardware utilization is 100% except the initial latency. In Propagate Partial SAD, by broadcasting reference pixel rows and propagating partial-row SADs in the vertical direction, it provides the advantages of fewer reference pixel registers and a shorter critical path.

The Partial SAD architecture has smaller gate count and suitable for medium and small resolution videos. Another advantage is that it has shorter critical path delay compared with SAD Tree because partial SAD is stored and propagated by propagation and delay register. If one PEG (256 PE) of Partial SAD is used, it needs 1872 bit register. Therefore 496 bit register can be saved compared to SAD Tree architecture. In this paper, the Partial SAD architecture is chosen to implement IME architecture as shown in Fig. 4.8.
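The register counts quoted above can be checked with a few lines of arithmetic. The variable names are illustrative; only the numbers come from the text.

```python
# Register bits for one PEG (256 PEs), using the figures quoted in the text.
sad_tree_bits = 16 * 12 + 16 * 17 * 8   # 192 + 2176
partial_sad_bits = 1872                 # value stated in the text, not derived here
saved_bits = sad_tree_bits - partial_sad_bits

print(sad_tree_bits, saved_bits)
```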

4.2.1 IME Architecture

Fig. 4.7 shows top-level architecture of the proposed transcoder IME for HDTV720p application. There exist four reference pixel memories, each of which is a 47 × 16 bytes single-port SRAM. Actually the search window is 47 × 47 bytes. The applied memories size is for an easy hardware implementation. The memory update and output is controlled by memory input and output control unit, which is controlled by MV smoothness decision unit. The IME block in Fig. 4.7 is implemented with Partial SAD architecture.

The SAD tree architecture is suitable for highly parallelized application and can share reference buffer between parallel PEG. But it has long critical path delay, which is 14.1ns based on our implementation. This delay cannot meet the performance requirement according to Section 4.2.2. To reduce delay, 16 12-bit registers can be inserted between SAD4×4 and larger block’s SAD addition to form a 2-stage pipeline. When applying snake-scan processing order, one PEG (256 PE) of SAD Tree architecture needs 16×12+16×17×8 = 2368 bit register.


The Partial SAD architecture has smaller gate count and suitable for medium and small resolution videos. Another advantage is that it has shorter critical path delay compared with SAD Tree because partial SAD is stored and propagated by propagation and delay register. If one PEG (256 PE) of Partial SAD is used, it needs 1872 bit register. Therefore 496 bit register can be saved compared to SAD Tree architecture. In this paper, the Partial SAD architecture is chosen to implement IME architecture as shown in Fig. 4.8.

Details:

In transcoding, motion estimation is usually not fully performed in the encoder end of transcoder because of its computational complexity. Instead, motion vectors extracted from the incoming bit stream are reused. However, this simple motion-vector reuse scheme may introduce considerable quality degradation in many applications [59, 60]. Although an optimized motion vector can be obtained

…(text omitted)…

The new search window SR can be set much smaller than the full-scale window S (e.g., a search range of ±2 pixels instead of 15 pixels or larger) and still produce almost the same video quality as the full-scale motion estimation.

Motion estimation (ME) at the encoder end in transcoding is not applied in full because of its huge computational complexity. An alternative approach is to directly reuse the motion vectors (MVs) from the input bit stream. The direct MV reuse approach suffers great video quality degradation in many application scenarios [59, 60].

An improved approach is to reuse the MV as the search center with a comparatively smaller search range in the encoder. A complete ME is not desirable in the encoder because of its huge computational complexity, although it gives the best compression efficiency.

Video coding technologies achieve high video compression efficiency by exploiting two categories of redundancy: spatial redundancy and temporal redundancy [61]. Frame pixels are transformed into the frequency domain by transforms such as the DCT. Only significant coefficients are preserved, by utilizing the energy compaction property of the DCT. In this way, spatial redundancy is reduced within a frame. Temporal redundancy is defined as the similarity between adjacent frames, which is reduced by motion-compensated prediction of current blocks. The best-matched reference block is subtracted from the current block. The residual signal is fed into the DCT to reduce spatial redundancy. The best-matched reference block is found by searching a range of positions in the reference frame under a certain criterion. The most widely applied criterion is the sum of absolute differences (SAD), under which the search position with the minimum SAD is the best-matched position. The displacement of the best-matched position in the reference frame from the current block position is defined as the motion vector (MV). The search process in the reference frame is defined as ME.

where (m, n) is a candidate searching position in search window, C(i, j) and R(m+i, n+j) are current pixel and its corresponding reference pixel, (Ix, Iy) is the final MV after ME, SAD(m, n) is SAD at search position (m, n).
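A form of the SAD criterion and MV selection consistent with the symbols defined above can be written as follows. This is a sketch rather than the dissertation's exact notation: the block size N and the search-window symbol S are assumptions.

```latex
\mathrm{SAD}(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| C(i, j) - R(m+i, n+j) \right|,
\qquad
(I_x, I_y) = \arg\min_{(m, n) \in S} \mathrm{SAD}(m, n)
```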


Fig. 5.2 shows the structure of a transcoder which is composed of an encoder and a decoder connected together. In this structure, ME module in encoder is reduced to keep the whole structure simple.

A complete ME can be applied to get the best matched MV for the output bit stream from transcoder which will introduce huge computation. A direct reusing incoming MV will introduce considerable drift error because of lossy video reconstruction procedure. The drift error can be expressed as

In Eq. 5.1.3, the drift error is composed of two parts: f(i, j) and s(m+i, n+j). f(i, j) is the reconstruction error introduced by the quantization step of the front encoder, which lies in front of the transcoder. s(m+i, n+j) is the reconstruction error introduced by the encoder end of the transcoder with its specific quantization step. In most cases, the drift error f(i, j) − s(m+i, n+j) is not zero and cannot be neglected, which introduces considerable loss of performance when the incoming MV is reused directly.

The full-scale ME to obtain the optimal MV is not applied in the encoder end of the transcoder because of its huge computation requirement. The incoming MV from the decoder end of the transcoder is often reused because its compression efficiency is almost as good as that of a full-scale ME [62, 63, 64]. There are two kinds of MV reuse schemes. One is directly reusing the MV in the encoder end of the transcoder, which suffers from drift error coming from the reconstruction inconsistency of lossy video compression. The compression efficiency, together with the outgoing MV, can deviate greatly from the optimal one. The other reuse scheme is performing a full-scale ME within a smaller search range centered on the incoming MV [59, 60]. In this way, the drift error is corrected and refined with comparatively small computational complexity. The refining ME can be defined as follows, in which (Ex, Ey) is the outgoing MV from the encoder end of the transcoder, SADR represents the SAD of the refining ME, and (inx, iny) is the incoming MV from the decoder end. The search window SR can be set much smaller than that of a normal full-scale ME because of the precision of the incoming MV. In this way, compression efficiency comparable to that of full-scale ME in a normal encoder can be achieved.
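The refinement search described above can be sketched as follows, assuming integer-pel MVs and a SAD criterion. `refine_mv`, its parameters, and the default window of ±2 pixels are illustrative choices for this sketch, not the dissertation's implementation.

```python
import numpy as np


def refine_mv(cur_block, ref_frame, top, left, incoming_mv, r=2):
    """Search a (2r+1) x (2r+1) window centered on the incoming MV.

    cur_block:   N x N block located at (top, left) in the current frame.
    ref_frame:   2-D array of reference pixels.
    incoming_mv: (dy, dx) taken from the input stream (hypothetical input).
    Returns the refined MV and its SAD; candidates that fall outside the
    reference frame are skipped.
    """
    n = cur_block.shape[0]
    h, w = ref_frame.shape
    cur = cur_block.astype(np.int32)
    best_mv, best_sad = None, None
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            my, mx = incoming_mv[0] + dy, incoming_mv[1] + dx
            y, x = top + my, left + mx
            if y < 0 or x < 0 or y + n > h or x + n > w:
                continue
            cand = ref_frame[y:y + n, x:x + n].astype(np.int32)
            cost = int(np.abs(cur - cand).sum())
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (my, mx), cost
    return best_mv, best_sad
```

With r = 2 the search examines at most 25 candidates, compared with (2P)² for a full-scale search over [−P, P).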


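2. Reasons for correction

In the introduction and in the introductory parts of individual chapters, inappropriate quotation from other authors' papers or writings was found, and correction was therefore instructed. Corrections (1) to (18) and (22) correspond to this. In particular, (12) and (13) concern figures that were quoted inappropriately, and the author was instructed to add literature references to the captions.

Some inappropriate quotation was also found in the part explaining the proposal. Specifically, inappropriate quotation from another author's paper was found in Chapter 4, Section 4.2.2. That part explains the other authors' method on which the proposed method is based and does not affect the novelty of the proposed method itself. However, to make the novelty clearer, the author was instructed to move part of the text of Section 4.2.2 to Section 4.1.4 so that the proposed part becomes clear, and at the same time to swap the order of the revised Sections 4.2.2 and 4.2.1. Corrections (19) to (21) correspond to this.

3. Reasons for approving the corrections

Corrections (1) to (18) and (22) are in introductory parts that present existing methods. The corrections do not affect the core research results of this doctoral dissertation, and they are judged to be appropriate.

Corrections (19) to (21) are in the part explaining existing methods used in the proposed method. The revision of the text, including the change in section order, is considered to have made the novelty of the proposed method clearer, and the corrections are judged to be appropriate.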
