
our method enables shrinking and stretching of the video frames between those points. Because synchronizing audio and video is difficult, supporting users with audio-visual synchronization at the frame level is effective.

Our video summarization method is also effective for content-based video retrieval. Since the method shortens a video while preserving its content, content-based video retrieval can become faster and more efficient by skipping verbose video frames. In addition, data storage can be used more efficiently if the summarization is reversible. The reversibility of our summarization method can be discussed by referring to the theory of information compression; our method is related to this theory, and we will further explore the relation in future research.

6.7 Conclusion

In this chapter, we have presented a novel method for summarizing and stretching video clips by focusing on transition and continuity. Our main contributions are frame-wise removal and insertion and a cost that preserves continuity. The subjective experimental results demonstrated the effectiveness of the method.

Our method thins out video frames regardless of the meaning of silence or stability. Therefore, if a silence is meaningful, our method may have a negative effect. The method summarizes a video in a content-aware manner; however, if the silence or stable state itself is the content, our method eliminates that content. A limitation of the current implementation is that it cannot consider the semantics of the content: most, but not all, of the semantics conveyed by audio-visual transitions can be handled by our method. A further limitation is that our method is not well suited to videos that include music. Because every audio frame in music is meaningful, the method should not remove any frames in such cases. Instead, resizing the music using the method proposed by Wenner et al. [99] may be effective, or simply thinning out the frames uniformly might produce a better result.

There are limits to the number of frames that can be removed while still preserving the content. We will explore this limit further and enable adaptive, automatic setting of parameters such as α in Eq. (6.2) by quantitatively defining the amount of information in a video clip. The idea behind our method is similar to that of video compression, and we plan to apply theories used in video compression techniques to achieve more effective frame thinning. In addition, we will evaluate the effectiveness of our method further.

The time we can assign to watching video content is limited. We can watch video content efficiently by scheduling what to watch and when.

However, we would like to enjoy videos freely, without letting such a schedule constrain our daily lives. Instead of adjusting our schedules to the video content, the video content should be adjusted to our schedules. Our method has the potential to enable a user to watch a video regardless of how much time he or she has available for video watching, and we will further explore that possibility.

Chapter 7

Automatic DJ System Considering Beat and Latent Topic Similarity

This chapter presents MusicMixer, a computer-aided DJ system that automatically mixes songs in a seamless manner. MusicMixer mixes songs based on audio similarity calculated via beat analysis and latent topic analysis of the chromatic signal in the audio. The topics represent latent semantics about how chromatic sounds are generated. Given a list of songs, a DJ selects a song whose beat and sound are similar to those at a specific point of the currently playing song in order to transition between songs seamlessly. By calculating the similarity of all existing pairs of songs, the proposed system can retrieve the best mixing point from innumerable possibilities. Although it is comparatively easy to calculate beat similarity from audio signals, it has been difficult to consider the semantics of songs as a human DJ does. To consider such semantics, we propose a method of representing audio signals for constructing topic models that acquire the latent semantics of audio. The results of a subjective experiment demonstrate the effectiveness of the proposed latent semantic analysis method.

7.1 Introduction

Many people enjoy listening to music. The digitalization of music content has made it possible for many people to carry their favorite songs on a digital music player. Opportunities to play such songs are frequent, e.g., at a house party or while driving a car. At some parties, an exclusive DJ performs for the attendees. DJs never stop playing music until the party ends, and they control the atmosphere of the event by seamlessly mixing songs¹. However, it is not always realistic to personally hire a DJ. Thus, we present MusicMixer, an automatic DJ system that can mix songs for a user.

One of the most important things in a DJ's performance is to mix songs as naturally as possible. Given a list of songs, a DJ selects a song with beats and sounds that are similar to those at a specific point in the currently playing song so that the song transition is seamless. Consequently, the songs are mixed into one continuous song. The beats are particularly important and should be considered carefully: maintaining stable beats during the song transition is the key to realizing a seamless mix. The time available to select the next song is limited and the candidate songs are numerous; therefore, many DJs select the next song intuitively. However, this might not be the best choice. The innumerable possibilities for mixing songs make the performance difficult for the DJ.
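To make the beat-matching idea concrete, the following minimal sketch estimates a song's tempo and inter-beat intervals and scores how well two songs' beats would line up during a transition. It assumes librosa; the similarity measure and function names are illustrative assumptions, not the MusicMixer implementation.

```python
import numpy as np
import librosa

def beat_profile(path):
    """Estimate the tempo and inter-beat intervals (in seconds) of a song."""
    y, sr = librosa.load(path, sr=22050)
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    return float(np.atleast_1d(tempo)[0]), np.diff(beat_times)

def beat_similarity(profile_a, profile_b):
    """Higher when tempos match and both songs keep similarly stable beats."""
    tempo_a, ibi_a = profile_a
    tempo_b, ibi_b = profile_b
    tempo_gap = abs(tempo_a - tempo_b) / max(tempo_a, tempo_b)
    stability_gap = abs(np.std(ibi_a) - np.std(ibi_b))
    return 1.0 / (1.0 + tempo_gap + stability_gap)
```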

Computers are good at searching for the best pairs of beats among innumerable possibilities. It is possible to solve this problem by using a signal processing technique to extract beats and relying on a computer to retrieve a similar beat for effective mixing. However, computers handle audio signals numerically without considering the underlying song semantics; thus, the resulting mix will sound mechanical if the system only considers beat similarity. The latent semantics of songs must be considered in addition to the beats, because a DJ attempts to switch to a new song when the two songs sound similar. To consider the latent semantics, we propose a method to analyze the latent topics of a song from its polyphonic audio signal. These topics represent latent semantics about how chromatic sounds are generated. In addition to beat similarity, the proposed system considers the similarity of the latent topics of songs. Specifically, by employing a machine learning method called latent Dirichlet allocation (LDA) [9], the proposed system infers the latent topics that generate chromatic audio signals. This process corresponds to considering how sound is generated from latent topics in a given song. By measuring the similarity among song topics, higher-level song information can be considered.
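As an illustration of this latent-topic pipeline, the sketch below quantizes chroma frames into discrete "audio words", turns each song into a word histogram, fits an LDA model over the collection, and compares two songs by the cosine similarity of their topic mixtures. This is a hedged sketch assuming librosa and scikit-learn; the feature settings, vocabulary size, and helper names are assumptions for illustration and not the thesis implementation, which retrieves mixing points within songs rather than comparing whole songs.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

def chroma_frames(path):
    """12-dimensional chroma vectors, one per analysis frame."""
    y, sr = librosa.load(path, sr=22050)
    return librosa.feature.chroma_stft(y=y, sr=sr).T            # (frames, 12)

def topic_mixtures(paths, n_words=64, n_topics=20, seed=0):
    """Per-song topic distributions over a shared audio-word vocabulary."""
    frames = [chroma_frames(p) for p in paths]
    codebook = KMeans(n_clusters=n_words, random_state=seed).fit(
        np.vstack(frames))                                       # shared vocabulary
    hists = np.array([np.bincount(codebook.predict(f), minlength=n_words)
                      for f in frames])                          # (songs, n_words)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    return lda.fit_transform(hists)                              # (songs, n_topics)

def topic_similarity(theta_a, theta_b):
    """Cosine similarity between two songs' topic mixtures."""
    return float(np.dot(theta_a, theta_b) /
                 (np.linalg.norm(theta_a) * np.linalg.norm(theta_b)))
```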

MusicMixer takes advantage of computational power to retrieve a good mix of songs. To make the mixing as seamless as a DJ's mix, the system focuses on the similarity of latent topics in addition to beat similarity, thereby realizing natural song mixing (Figure 7.1).

¹The word "mix" means gradually changing from one song to another.

Figure 7.1: Conceptual image of mixing songs with similar latent topics and beats using MusicMixer.
