Frame-Wise Video Summarization - A Study on Interaction between Human and Digital Content

6.3. FRAME-WISE VIDEO SUMMARIZATION

[Figure 6.1: Transition of SSD value in a video clip. Horizontal axis: frame; vertical axis: SSD (standardized); annotated sections: "Person walking" and "Stable".]

frame. For example, if the object captured by a camera is moving fast and appears in only a few frames, each frame plays an important role, and removing such a frame destroys the continuity of the object's motion. In contrast, if an object is moving very slowly, the distance it moves during one frame is sufficiently short that removing the frame causes a smaller loss of continuity than removing a frame containing a fast-moving object.

Consequently, we focus on transitions that can handle object motion comprehensively. In particular, we use the sum of squared differences (SSD) between adjacent video frames. The SSD value s(t) at frame t can be calculated as

$$s(t) = \sum_{y=1}^{\mathrm{height}} \sum_{x=1}^{\mathrm{width}} \bigl(f_t(x, y) - f_{t-1}(x, y)\bigr)^2, \tag{6.1}$$

where f_t(x, y) is the pixel value at the coordinates (x, y) in frame t. Figure 6.1 shows the SSD transition in a video of a person walking past a stable camera twice. In this graph, the SSD value increases sharply at an important event in the video (viz., a person walking by).

In this type of video, frames containing only stable objects are removed preferentially when frames with low SSD values are thinned out. Thus, thinning out frames with low SSD values shortens a video clip while preserving its content.
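As an illustration of Eq. (6.1), the following is a minimal NumPy sketch, not the implementation used in this study. It assumes grayscale frames stored in an array of shape (T, height, width); the function name `ssd_transition` and the toy clip are hypothetical.

```python
import numpy as np

def ssd_transition(frames):
    """Return s(t) for t = 1..T-1: the SSD between frame t and frame t-1.

    `frames` is assumed to have shape (T, height, width) and to hold
    grayscale pixel values.
    """
    # Convert to float so that differences of uint8 pixels cannot wrap around.
    frames = np.asarray(frames, dtype=np.float64)
    diff = frames[1:] - frames[:-1]
    # Sum the squared differences over both spatial axes, as in Eq. (6.1).
    return (diff ** 2).sum(axis=(1, 2))

# Toy clip: three identical frames, then a 2x2 bright "object" appears.
clip = np.zeros((4, 4, 4))
clip[3, 1:3, 1:3] = 10.0
s = ssd_transition(clip)
```

In this toy clip the first two SSD values are zero (stable section) and the last one is large, which mirrors the sharp increase described for Figure 6.1.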

To some extent, this method is effective. However, thinning out only the frames with low SSD values sometimes creates new discontinuities. Therefore, the new SSD values that arise after thinning must also be considered. To take them into account, we set the thinning cost C_video as the sum of the SSD value of the frame and the new SSD value between the frames immediately before and after it. Here the cost C_video is normalized so that it can be combined with an audio thinning cost later. Our method thins out video frames according to this cost, preserving the continuity of the entire video clip as much as possible. To achieve this aim, SSD values of removed frames must be recalculated after the first round of thinning. At this point, the new SSD value of thinned-out frames is not normalized, which creates a difference from the normalized costs. To avoid this difference, we retain the parameters of the normalizing process and apply the same parameters to rescale the new SSD value. We prohibit thinning out the first or last frame of a video clip to ensure that the thinning cost can always be calculated.
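The thinning procedure described above can be sketched as follows. This is one possible reading, not the implementation used in this study: it assumes the unnormalized cost of removing frame t is s(t) plus the SSD between frames t-1 and t+1, and that normalization means subtracting a mean and dividing by a standard deviation computed once and reused in later rounds. The names `ssd`, `raw_costs`, and `thin_video` are hypothetical.

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences between two frames."""
    d = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    return float((d ** 2).sum())

def raw_costs(frames):
    """Unnormalized cost of removing each interior frame t.

    Assumed form: s(t) plus the new SSD between frames t-1 and t+1,
    which become adjacent once frame t is removed.
    """
    return np.array([
        ssd(frames[t], frames[t - 1]) + ssd(frames[t + 1], frames[t - 1])
        for t in range(1, len(frames) - 1)
    ])

def thin_video(frames, n_remove):
    """Return the original indices of the frames kept after thinning."""
    kept = list(range(len(frames)))
    if len(kept) <= 2:
        return kept  # only the protected first and last frames exist
    # Normalization parameters are computed once and retained, so that
    # costs recalculated in later rounds stay on the same scale.
    first = raw_costs(frames)
    mean, std = first.mean(), first.std()
    std = std if std > 0 else 1.0
    for _ in range(n_remove):
        if len(kept) <= 2:
            break
        current = [frames[i] for i in kept]
        cost = (raw_costs(current) - mean) / std
        # +1 because cost[0] belongs to the second frame; the first and
        # last frames are never candidates for removal.
        del kept[int(np.argmin(cost)) + 1]
    return kept

# Toy clip of 1x1 frames: a stable section followed by a change.
clip = [np.array([[v]]) for v in (0, 0, 0, 5, 9, 9)]
kept = thin_video(clip, n_remove=2)
```

Reusing the retained mean and standard deviation does not change which video frame has the minimum cost on its own; it matters once the video cost is combined with an audio cost on a common scale.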

Avidan et al. [5] proposed an image resizing (retargeting) method called seam carving. This method searches for a path that is not important and thins out pixels in a row- or column-wise manner. Rubinstein et al. [77] expanded this method and proposed an improved seam carving that considers insertion cost in addition to thinning cost. The improved seam carving can be applied to video retargeting. Our method can be considered as a form of seam carving applied to entire video frames for video summarization instead of video retargeting.

6.3.2 Frame-wise Thinning out based on an Audio Transition

The sampling rate of audio is generally 44,100 or 48,000 samples per second, which is very different from the frame rate of video. To align the time steps of audio and video, we define an audio frame for analysis. When the video frame rate is r, we set the audio frame length to 2/r seconds and the step length between audio frames to 1/r seconds. Hence, the audio and video time steps are synchronized.
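The framing rule above can be made concrete with a short sketch. It is an illustration under stated assumptions: the function name `audio_frame_bounds` is hypothetical, sample positions are rounded to the nearest integer, and only audio frames that fit entirely inside the signal are kept.

```python
def audio_frame_bounds(n_samples, sample_rate, video_fps):
    """Return (start, end) sample indices of each analysis frame.

    Frame length is 2/r seconds and the step is 1/r seconds, where r is
    the video frame rate, so consecutive audio frames overlap by half.
    """
    step = sample_rate / video_fps   # 1/r seconds, in samples
    length = 2 * step                # 2/r seconds, in samples
    bounds = []
    k = 0
    while round(k * step + length) <= n_samples:
        bounds.append((round(k * step), round(k * step + length)))
        k += 1
    return bounds

# One second of 48 kHz audio accompanying 30 fps video.
bounds = audio_frame_bounds(n_samples=48000, sample_rate=48000, video_fps=30)
```

With these values each audio frame spans 3,200 samples and advances by 1,600 samples, so audio frame k starts at the same time as video frame k.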

As an audio feature expressing audio continuity, we use spectral flux, which represents the local temporal transition of the audio spectrum. It takes a high value at the point where an audio transition occurs (e.g., sound onset or offset). We extract spectral flux from the audio part of a video clip using MIRtoolbox 1.5, an audio analysis tool developed by Lartillot et al. [55].
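To show what spectral flux measures, the following NumPy sketch computes the Euclidean distance between the magnitude spectra of successive audio frames. It is not MIRtoolbox and makes no claim to reproduce its output; the Hann window, the absence of any normalization, and the name `spectral_flux` are assumptions made for illustration.

```python
import numpy as np

def spectral_flux(signal, frame_len, hop):
    """Distance between magnitude spectra of successive frames.

    Returns one value per pair of adjacent frames. Assumes the signal
    is at least `frame_len` samples long.
    """
    signal = np.asarray(signal, dtype=np.float64)
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    mags = np.array([
        np.abs(np.fft.rfft(window * signal[k * hop:k * hop + frame_len]))
        for k in range(n_frames)
    ])
    # Euclidean distance between each spectrum and the previous one.
    return np.sqrt((np.diff(mags, axis=0) ** 2).sum(axis=1))

# Half a second of silence followed by half a second of a 440 Hz tone.
sr = 8000
t = np.arange(4000) / sr
sig = np.concatenate([np.zeros(4000), np.sin(2 * np.pi * 440 * t)])
flux = spectral_flux(sig, frame_len=512, hop=256)
```

In this toy signal the flux is zero during silence and peaks around the sound onset. In the setting of this section, `frame_len` and `hop` would correspond to 2/r and 1/r seconds, respectively.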

Figure 6.2 shows the transition of spectral flux for an audio sample that includes speech and hand claps. The audio sample is recorded in an indoor environment where particular sounds are not observed other than the speech and claps. In the graph in Figure 6.2, the spectral flux value reflects the auditory events. Therefore, we can detect a section in which an audio event occurs by focusing on the spectral flux value.

[Figure 6.2: Transition of spectral flux value. Horizontal axis: frame; vertical axis: Spectral flux (standardized); annotated sections: "no sound", "speech", and "hand clap".]

Thinning out audio frames with low spectral flux values shortens the audio without losing the content of audio events. Here we use an audio frame thinning cost C_audio and thin out frames exactly as in Section 6.3.1, considering the new insertion cost in addition to the spectral flux value of a frame. The audio is shortened by calculating the spectral flux value in each thinning round.

6.3.3 Frame-wise Thinning out based on an Audio-Visual Transition

We combine video and audio frame thinning. Video and audio frame thinning consider visual and audio continuity, respectively. To consider audio-visual continuity, we design the audio-visual thinning cost C(t) for removing frame t as

$$C(t) = \alpha C_{\mathrm{video}}(t) + (1 - \alpha) C_{\mathrm{audio}}(t). \tag{6.2}$$

Here the parameter α is a weight for audio-visual balance. When α = 0.5, audio and visual continuity are considered equally. Both C_video and C_audio are normalized in advance to have mean zero and variance one so that they are on a common scale. Frame-wise video summarization can be achieved by removing the video frame with the minimum cost given by Eq. (6.2). Figure 6.3 shows the transition of the audio-visual thinning cost calculated for the video of a person walking by that was used in Figure 6.1 in Section 6.3.1. In this video, the visual event is a human walking across the video twice, and the audio event is the sound of footsteps. Before and after the walking, there are sections in which only the footsteps can be heard. In such a section, the cost remains high, indicating that the cost reflects not only visual but also audio events.

[Figure 6.3: Transition of audio-visual thinning out cost. Horizontal axis: frame; vertical axis: Cost; annotated sections: "No specific event", "Only audio event", and "Visual event".]

The parameter α weights audio against video, and its appropriate value depends heavily on the content. For example, auditory cost is not as important in action scenes as it is in conversation scenes. In the current implementation, this parameter is set by the user. Automatic optimization of α is left for future work.
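A small numerical sketch of Eq. (6.2) is given below. It is illustrative only: the function names `standardize` and `audio_visual_cost` and the cost values are hypothetical, and each array entry is assumed to correspond to one removable (interior) frame.

```python
import numpy as np

def standardize(x):
    """Rescale to mean zero and variance one (zeros if x is constant)."""
    x = np.asarray(x, dtype=np.float64)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)

def audio_visual_cost(c_video, c_audio, alpha=0.5):
    """Combined cost of Eq. (6.2) for each candidate frame."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * standardize(c_video) + (1 - alpha) * standardize(c_audio)

# Hypothetical per-frame costs: frame 2 holds a visual event,
# frame 0 holds an audio-only event, and frame 1 holds neither.
c_video = [0.5, 0.1, 5.0, 0.6]
c_audio = [3.0, 0.1, 0.3, 0.5]
cost = audio_visual_cost(c_video, c_audio, alpha=0.5)
frame_to_remove = int(np.argmin(cost))
```

In this example the frame without any event is selected for removal, while the audio-only frame keeps a high cost as long as α is below one, which corresponds to the behavior described for Figure 6.3.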
