Learning-based Adaptive Video Streaming

Academic year: 2023


This observation allows us to apply learning-based schemes to the challenges of live and 360-degree video streaming. We then briefly introduce our proposed learning-based solutions to these challenges.

Background for Adaptive Video Streaming

Live Video Streaming

Tile-based adaptive viewport streaming therefore enables flexible control of tile quality levels for 360-degree videos, thereby increasing users' QoE. Clearly, how well the viewport prediction (VPP) and tile bitrate selection (TBS) problems are handled plays a vital role in the success of tile-based adaptive streaming for 360-degree videos.

Figure 1.2: The mapping of a spherical surface into a plane.

QoE Metrics for Video Streaming

Subjective QoE Metric

The International Telecommunication Union (ITU) defines the opinion score as "the value on a predefined scale that a subject assigns to his opinion of the performance of a system" [13]. Moreover, the mean opinion score (MOS) cannot provide precise criteria to guide the design of a better ABR scheme.

Objective QoE Metric

The most direct way to assess QoE during a video streaming session is to ask a limited number of human subjects to watch videos in a controlled environment and provide opinion scores based on their experience, but this does not scale; objective QoE metrics instead compute a score from measurable factors of the session. In the following, we introduce common QoE metrics for two typical video streaming scenarios (live and 360-degree video streaming).
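A widely used objective QoE formulation in the ABR literature (e.g., Pensieve [20]) scores a session as total video quality minus penalties for rebuffering time and quality switches. The sketch below is illustrative only; the coefficient values are placeholders, not the ones used in this thesis.

```python
def qoe(bitrates, rebuffer_times, alpha=1.0, beta=4.3, gamma=1.0):
    """Linear objective QoE model: quality reward minus rebuffering and
    smoothness penalties. Coefficients are illustrative placeholders."""
    quality = alpha * sum(bitrates)
    rebuffer_penalty = beta * sum(rebuffer_times)
    # penalize bitrate switches between consecutive segments
    smoothness_penalty = gamma * sum(
        abs(bitrates[i] - bitrates[i - 1]) for i in range(1, len(bitrates))
    )
    return quality - rebuffer_penalty - smoothness_penalty
```

For instance, a session at a steady 3 Mbps with no stalls scores strictly higher than one that oscillates between 1 and 3 Mbps, reflecting the smoothness term.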

Challenges

Universal Challenges for Video Streaming

For example, on cellular connections, where bandwidth can fluctuate abruptly, future bandwidth cannot be predicted accurately, leading either to underutilized networks (lower video quality) or to high transfer delays (rebuffering). Likewise, in bandwidth-constrained networks, keeping the rebuffering rate low means consistently requesting segments encoded at a lower bitrate, which sacrifices quality; conversely, maximizing quality by transmitting the highest possible bitrate inevitably causes more rebuffering events.
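As a concrete illustration of the bandwidth-prediction difficulty, many ABR schemes fall back to simple robust estimators such as the harmonic mean of recent throughput samples, which discounts short-lived spikes. This is a common heuristic from the literature (e.g., MPC-style controllers), not a method proposed in this thesis.

```python
def harmonic_mean_bw(samples):
    """Harmonic-mean estimate of future bandwidth (e.g., in Mbps) from
    recent throughput samples; dominated by the slow samples, so a single
    spike barely raises the prediction."""
    return len(samples) / sum(1.0 / s for s in samples)
```

With samples [1, 4, 4] Mbps the harmonic mean is 2 Mbps, far below the arithmetic mean of 3 Mbps, which is exactly the conservatism that reduces rebuffering risk at the cost of some quality.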

New Challenges for Live Video Streaming

Unknown information about future videos. Unlike video-on-demand streaming services, where videos are pre-recorded and video information (e.g., the sizes of all video segments) can be fetched from the server in advance, in live streaming scenarios videos are generated in real time. It is therefore challenging to estimate future video segment sizes at any bitrate level, and inaccurate segment size predictions can lead to poor video quality or high latency.
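One simple way to cope with unknown future segment sizes is a per-bitrate moving average over recently observed segments, falling back to the nominal size implied by the encoding bitrate when no history exists. This is a hypothetical sketch of such an estimator, not the predictor used in this thesis.

```python
from collections import defaultdict, deque

class SegmentSizeEstimator:
    """Hypothetical moving-average estimator of future segment/GOP sizes
    per bitrate level, for live streams whose sizes are unknown in advance."""

    def __init__(self, window=5):
        # keep the last `window` observed sizes for each bitrate level
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, bitrate_bps, size_bytes):
        self.history[bitrate_bps].append(size_bytes)

    def predict(self, bitrate_bps, segment_duration_s=1.0):
        past = self.history[bitrate_bps]
        if past:
            return sum(past) / len(past)
        # no history yet: nominal size implied by the bitrate (bits -> bytes)
        return bitrate_bps * segment_duration_s / 8
```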

New Challenges for 360-degree Video Streaming

Learning as a Solution for Video Streaming

Existing learning-based [20] or heuristic-based [21,22] work on VoD streaming scenarios cannot take hybrid actions and lacks a component for predicting future video information. The key contribution of this thesis is a series of learning-based systems to optimize users' QoE in both live and 360-degree video streaming scenarios (a quick overview can be found in Figure 1.3).

Organization

ABR algorithms for VoD services

Buffer-based methods [22,38] make bitrate decisions for future segments based only on the client's playback buffer level, while quality-based schemes [20,21] attempt to optimize users' QoE by making bitrate decisions based on both estimated future bandwidth and the current buffer level.
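The buffer-based idea can be sketched as a piecewise mapping from buffer occupancy to bitrate, in the style of BBA [22]: stay at the lowest rate below a reservoir, ramp up linearly through a cushion, and use the highest rate above it. Reservoir and cushion values here are illustrative.

```python
def buffer_based_bitrate(buffer_s, bitrates, reservoir=5.0, cushion=10.0):
    """BBA-style mapping from playback buffer level (seconds) to a bitrate:
    lowest rate below the reservoir, highest above reservoir + cushion,
    and a linear ramp in between. Parameter values are illustrative."""
    rates = sorted(bitrates)
    if buffer_s <= reservoir:
        return rates[0]
    if buffer_s >= reservoir + cushion:
        return rates[-1]
    frac = (buffer_s - reservoir) / cushion       # position inside the cushion
    return rates[min(int(frac * len(rates)), len(rates) - 1)]
```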

ABR algorithms for Live streaming

Pensieve [20] uses DRL to train a neural ABR agent to optimize QoE using multiple useful input signals, including past measured bandwidths and buffer level. These algorithms first determine a total bitrate budget for the next segment based on available information (e.g., bandwidth estimate, current buffer level), and then assign bitrates for each tile in the next segment to optimize QoE.
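The two-step tile scheme described above (first fix a total bitrate budget, then distribute it over tiles) can be sketched as a greedy allocation that upgrades the tiles most likely to be viewed first. This is an illustrative sketch of the general idea, not the exact published algorithm.

```python
def allocate_tile_bitrates(budget, view_probs, levels):
    """Greedy two-step allocation: start every tile at the lowest encoding
    level, then repeatedly upgrade the highest-viewing-probability tile
    while the total stays within the bitrate budget. Illustrative only."""
    levels = sorted(levels)
    alloc = [levels[0]] * len(view_probs)
    spent = sum(alloc)
    # visit tiles from most to least likely to be viewed
    for i in sorted(range(len(view_probs)), key=lambda i: -view_probs[i]):
        for lv in levels[1:]:
            if spent - alloc[i] + lv <= budget:
                spent += lv - alloc[i]
                alloc[i] = lv
    return alloc
```

For example, with a budget of 10, viewing probabilities [0.6, 0.3, 0.1], and levels [1, 2, 4], the two likeliest tiles get the top level and the last tile gets a middle level.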

Table 2.1: Summary of Related Work on Live Video Streaming Work Algorithm Latency Control Action Adaptivity

Viewport Prediction

Jiang et al. [51] apply a long short-term memory (LSTM) based model to predict future head rotations. Zhang et al. [36] use an ensemble of three LSTM models to further improve prediction accuracy.

Table 2.3: Summary of Related Work on Viewport Prediction Work Algorithm Available on

System Design

Reinforcement Learning for Live Streaming

However, the size of the incoming video may not be directly observable from the CDN buffer, so it must be estimated. Action: in this task, we need to take hybrid actions, including both discrete and continuous actions.

Figure 3.1: The Workflows of RL.

HD2: Dueling DQN with Hybrid Action Space

First, the FC layer 𝐶𝑄(𝑠;𝜔) generates the value 𝑐𝑎 as the continuous action; then two further FC layers, 𝐴(𝑠, 𝑐𝑎, 𝑎;𝜃) (denoting the advantage function of state–action pairs) and 𝑉(𝑠, 𝑐𝑎;𝜃) (denoting the state value function), take both the state 𝑠 and the continuous action value 𝑐𝑎 as input. The Q-value is aggregated in the dueling style as

𝑄(𝑠, 𝑐𝑎, 𝑎;𝜃) = 𝑉(𝑠, 𝑐𝑎;𝜃) + (𝐴(𝑠, 𝑐𝑎, 𝑎;𝜃) − (1/|𝒜|) Σ_{𝑎′} 𝐴(𝑠, 𝑐𝑎, 𝑎′;𝜃)). (3.3)

Since 𝜔 is a subset of 𝜃, updating 𝜃 also updates 𝜔 to generate a better continuous action. The design of HD2 is inspired by [64], but HD2 constitutes a single network architecture (instead of two as in [64]), which incurs much less computational load and can perform better in live streaming tasks.
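The dueling aggregation in Eq. (3.3) can be checked with a few lines of plain Python; this is a numerical sketch of the aggregation step only, assuming mean-subtracted advantages, not the full HD2 network.

```python
def dueling_q(v, advantages):
    """Dueling aggregation: Q(s,ca,a) = V(s,ca) + (A(s,ca,a) - mean over
    a' of A(s,ca,a')), computed for every discrete action a."""
    mean_a = sum(advantages) / len(advantages)
    return [v + a - mean_a for a in advantages]
```

Subtracting the mean advantage makes the decomposition identifiable: shifting all advantages by a constant leaves the resulting Q-values unchanged.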

HD3: Distributed HD2

Following the same flow as in HD2, each worker obtains the transition (𝑠𝑡, 𝑐𝑎𝑡, 𝑑𝑎𝑡, 𝑟𝑡, 𝑠𝑡+1) and stores it in its local buffer 𝐿𝐵.

Evaluation

  • QoE Definition
  • Dataset Analysis
  • Comparison Schemes
  • Implementation
  • Results Analysis

To better understand how the granularity of latency-limit discretization affects final performance, we conducted extensive experiments in which the range of the latency limit value was uniformly discretized. Figure 3.7, Figure 3.8, Figure 3.9, and Figure 3.10 show the results of HD3 and other schemes at different discretization granularities in high, medium, low, and oscillatory bandwidth scenarios. The reason for this phenomenon is that a finer granularity (more candidate latency limit values) enlarges the total number of possible action combinations, which makes it harder to train a model that converges to an optimal, or even a good, point.
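The action-space growth behind this effect is easy to quantify: with a hybrid action of one bitrate choice and one discretized latency limit, the number of combinations is the product of the two cardinalities. The numbers below are illustrative, not the experiment's exact configuration.

```python
def hybrid_action_count(num_bitrates, latency_levels):
    """Number of discrete action combinations when the latency limit is
    uniformly discretized into `latency_levels` candidate values."""
    return num_bitrates * latency_levels

# e.g., 6 bitrate levels with increasingly fine latency discretization
sizes = [hybrid_action_count(6, g) for g in (5, 10, 100)]
```

Going from 5 to 100 latency candidates inflates the action space from 30 to 600 combinations, which is what slows convergence at fine granularities.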

Figure 3.3: CDF of Bandwidth in Four Network Conditions.

Summary

In this chapter, we investigate tile-based adaptive viewport streaming for 360-degree videos and propose the Plato system, which exploits machine learning to solve the inherent VPP and TBS problems. Since user orientation prediction errors can be detrimental to QoE, we opportunistically extend coverage to areas slightly outside the predicted viewport. In Plato, VPP and TBS agents are trained to predict the viewport areas and select tile bitrates, respectively.

System Design

VPP – The Prediction of Viewport

In this section, we apply the LSTM-based model to predict the orientation of the user's head and, in the meantime, discuss how we handle potential prediction errors, which we address by practically extending the viewport range.
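Extending the viewport range can be sketched as widening the predicted yaw span by a margin before mapping it to tile columns on an equirectangular layout, with wraparound at 360 degrees. The tiling and margin below are hypothetical, chosen only to illustrate how a margin absorbs prediction error.

```python
def extend_viewport(center_yaw_deg, half_width_deg, margin_deg, num_tiles=12):
    """Select the tile columns covered by the extended yaw range
    [center - half - margin, center + half + margin] on a hypothetical
    360-degree layout of `num_tiles` equal columns, wrapping at 360."""
    tile_w = 360.0 / num_tiles
    lo = center_yaw_deg - half_width_deg - margin_deg
    hi = center_yaw_deg + half_width_deg + margin_deg
    first = int(lo // tile_w)     # floor division handles negative angles
    last = int(hi // tile_w)
    return sorted({i % num_tiles for i in range(first, last + 1)})
```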

Figure 4.2: LSTM based predictor.

TBS – The Bitrate Selection of Tiles

In this section, our goal is to show how we train a TBS agent using RL. Policy: based on the state 𝑠𝑡, the TBS agent takes action 𝑎𝑡 (i.e., bitrate selections for the three tile regions) for the next video segment. With the resulting reward information, the TBS agent can then train and improve its policy neural network.

Figure 4.4: The workflows of RL.

Evaluation

  • Parameter Settings
  • Datasets
  • Comparison Metrics
  • QoE Metrics
  • Reward function design
  • Result Analysis

The logs in the datasets record the HMD device's quaternion (𝑞0, 𝑞1, 𝑞2, 𝑞3) at each timestamp. According to their algorithm, the lowest quality is first assigned to all tiles in the video. This initial allocation guarantees that all tiles of the video are streamed to the user.
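To use the logged quaternions for viewport prediction, they are typically converted to yaw and pitch angles. The following is a standard quaternion-to-Euler conversion, assuming a (w, x, y, z) component ordering; the dataset's actual axis convention may differ.

```python
import math

def quat_to_yaw_pitch(q0, q1, q2, q3):
    """Convert a unit quaternion (w, x, y, z) to (yaw, pitch) in degrees,
    using the standard aerospace conversion. Axis convention is assumed."""
    yaw = math.atan2(2 * (q0 * q3 + q1 * q2), 1 - 2 * (q2 ** 2 + q3 ** 2))
    # clamp the asin argument to guard against floating-point drift
    pitch = math.asin(max(-1.0, min(1.0, 2 * (q0 * q2 - q3 * q1))))
    return math.degrees(yaw), math.degrees(pitch)
```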

Figure 4.6: CDF of bandwidth

Summary

The ever-increasing demand for immersive multimedia experiences has stimulated content providers (e.g., YouTube and Vimeo) to roll out 360-degree video streaming. Streaming 360-degree videos requires unprecedentedly high bitrates compared to video streaming with fixed viewing directions. Therefore, the ability to adaptively stream 360-degree videos according to the dynamics of network bandwidth is essential for their widespread adoption.

Figure 5.1: An illustration of tile-based streaming for 360-degree videos in VoD and live scenarios.

Sinusoids versus Prediction Accuracy

Prediction Error Analysis

Conversion of Degrees to Sinusoid

Why is the sinusoid better?

Table 5.1 shows the smoothness and linearity of normalized degrees, cosine, and sine on three head movement datasets.
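The degree-to-sinusoid conversion itself is a small transform: represent a yaw angle by its (sin, cos) pair so that 359° and 1° become numerically close, and recover the angle with atan2. This sketch shows the conversion in isolation, not the full SVP pipeline.

```python
import math

def deg_to_sincos(deg):
    """Represent a yaw angle by its (sin, cos) pair, removing the 0/360
    discontinuity that plagues degree-valued regression targets."""
    rad = math.radians(deg)
    return math.sin(rad), math.cos(rad)

def sincos_to_deg(s, c):
    """Recover an angle in [0, 360) from a predicted (sin, cos) pair."""
    return math.degrees(math.atan2(s, c)) % 360
```

In degree space, 359 and 1 are 358 apart; in sinusoid space their representations differ only slightly, which is why models trained on sinusoids avoid large wraparound errors.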

Table 5.1: Smoothness and linearity of degree, cosine and sine on yaw direction of the head motion datasets (see Sec

System Design

Orientation Prediction

Although this method can predict with relatively high accuracy, prediction errors may also occur, especially when the prediction time window is too long or when the user's head movement is too fast. Therefore, in the following, we will present how we handle possible prediction errors in both the yaw and pitch directions.

Error Handling

Since we are trying to assign higher quality to the viewport and adjacent areas and lower quality to the outer area, we set the per-frame viewing probability for tiles in each area as follows: 𝑝1 for VP tiles, 𝑝2 for AD tiles, and 𝑝3 for outer-area tiles, where 𝑝1 > 𝑝2 > 𝑝3. Next, we introduce how to obtain the normalized probability of seeing tiles in a video segment containing 𝑀 frames.
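The normalization over a segment of 𝑀 frames can be sketched as averaging each tile's per-frame probability and rescaling so the results sum to one. This assumes a simple average-then-normalize scheme for illustration; the thesis's exact weighting may differ.

```python
def normalized_tile_probs(per_frame_probs):
    """Average each tile's per-frame viewing probability over the M frames
    of a segment, then normalize so the tile probabilities sum to 1.
    `per_frame_probs` is an M x num_tiles list of lists."""
    m = len(per_frame_probs)
    num_tiles = len(per_frame_probs[0])
    avg = [sum(frame[t] for frame in per_frame_probs) / m
           for t in range(num_tiles)]
    total = sum(avg)
    return [p / total for p in avg]
```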

Tile Probability Normalization

Evaluation

Simulation Settings

In the experiments with the CLS comparison scheme, for each video, 90% of the users are randomly selected for training, while the remaining 10% are used for testing. CLS [30]: group users with density-based spatial clustering of applications with noise (DBSCAN) on the server; then, on the client side, classify the user into the corresponding cluster with a support vector machine (SVM) classifier and finally obtain that cluster's viewing probability. LC [54]: first cluster users based on their quaternion rotations, then classify the target user into the corresponding cluster and estimate the future fixation as the cluster center.

Viewport Prediction Accuracy

Note that the prediction errors in the yaw direction are more pronounced (than in the pitch direction), which in practice requires more accurate viewport prediction designs. In larger time windows (i.e., 4~5 sec), the tile prediction accuracy of SVP decreases due to the accumulated orientation prediction errors. This is because inaccurate viewport predictions of the previous frames inevitably lead to larger prediction errors of the next frame.

Video Quality Assessment

Since LR-G, LSTM, and MLP yield much larger prediction errors than LR, we compare SVP only with CLS, CUB, LC, and TJ in the following. SVP achieves a higher effective bitrate than CLS, TJ, LC, and CUB under all bandwidth settings because it attains higher tile prediction accuracy.

Figure 5.7: The prediction errors on yaw direction in 1-, 3-, 5-sec time window.

Summary

Our primary contribution is to develop learning-based systems to address challenges posed by live and 360-degree video streaming services. To address the challenges revealed in 360-degree video streaming scenarios, we develop two learning-based systems for tile bitrate selection and viewport prediction, respectively. Tackling the problem of large action spaces in 360-degree video streaming, we divide the tiles into three classes and then apply a DRL algorithm to teach a neural agent to choose bit rates for each tile class.

Discussion

Limitations

To more accurately predict the future viewport, we propose the sinusoidal viewport prediction (SVP) system to overcome the periodicity problem in the yaw direction. In the current work, we heuristically tuned the values of the reward function coefficients for 360-degree video streaming. Although DRL-trained neural agents can achieve better performance than heuristic-based schemes, deep neural models typically consume much more computing resources and power, leading to higher costs.

Lessons Learned

Future Perspective

New networking challenges solvable by learning

More specifically, in volumetric video streaming, both future rotation and position values must be predicted, and rate control must consider additional information such as the distance to each point cloud [99]. Beyond these challenges, an agreed-upon QoE formulation is still lacking: volumetric video streaming researchers need to find appropriate QoE metrics to represent user-perceived quality while watching volumetric videos. To save bandwidth and reduce transmission delay, applications must resize (change video resolution) and sample video frames [107].

Integrating heuristics with reinforcement learning

Adaptive Video Streaming for Analytics. Unlike our previous work, which streamed videos for human viewing, videos can also be streamed for machine analysis.

