This allows us to apply learning-based schemes to address the challenges in live and 360-degree video streaming scenarios. We then briefly introduce our proposed learning-based solutions to these challenges.
Background for Adaptive Video Streaming
Live Video Streaming
Therefore, tile-based adaptive viewport streaming enables elastic control of tile quality levels for 360-degree videos, thereby increasing QoE for users. Clearly, how the VPP and TBS problems are handled plays a vital role in the success of tile-based adaptive streaming for 360-degree videos.

QoE Metrics for Video Streaming
Subjective QoE Metric
The International Telecommunication Union (ITU) defines an opinion score as the "value on a predefined scale that a subject assigns to his opinion of the performance of a system" [13]; the mean opinion score (MOS) averages these scores across subjects. However, MOS cannot provide precise criteria to guide the design of a better ABR scheme.
Objective QoE Metric
The most direct way to assess QoE during a video streaming session is to ask a limited number of human subjects to watch videos in a controlled environment and provide opinion scores based on their experience. Such subjective studies, however, are costly and hard to scale, so objective QoE metrics computed from measurable playback statistics are commonly used instead. We introduce common QoE metrics for two typical video streaming scenarios (live and 360-degree video streaming) in the following.
Challenges
Universal Challenges for Video Streaming
For example, on cellular connections, where sudden bandwidth fluctuations often occur, accurately predicting future bandwidth is impossible, leading to either an underutilized network (lower video quality) or high transfer delays (rebuffering). Similarly, in bandwidth-constrained networks, keeping the rebuffering rate low means consistently requesting segments encoded at a lower bitrate, which sacrifices quality; on the other hand, maximizing quality by transmitting at the highest possible bitrate will inevitably cause more rebuffering events.
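The tension above can be made concrete with a minimal throughput-based rule: pick the highest bitrate that fits under a (possibly wrong) bandwidth estimate. The encoding ladder, the safety discount, and all numbers below are illustrative assumptions, not the thesis's values.

```python
# Illustrative bitrate ladder (kbps); not from any particular dataset.
BITRATES_KBPS = [300, 750, 1200, 1850, 2850, 4300]

def pick_bitrate(estimated_bw_kbps, safety=0.9):
    """Choose the highest bitrate below a safety-discounted bandwidth estimate."""
    budget = estimated_bw_kbps * safety
    feasible = [r for r in BITRATES_KBPS if r <= budget]
    return max(feasible) if feasible else BITRATES_KBPS[0]

# If the estimate overshoots the true bandwidth, the chosen bitrate stalls
# playback; a more conservative safety factor trades quality for fewer stalls.
print(pick_bitrate(2000))              # -> 1200
print(pick_bitrate(2000, safety=0.5))  # -> 750
```

The `safety` parameter is exactly the quality/rebuffering knob described above: no fixed value is right for all network conditions, which is what motivates learned policies.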
New Challenges for Live Video Streaming
Unknown information of future videos. Unlike video-on-demand streaming services, where videos are pre-recorded and video information (e.g., the sizes of all video segments) can be fetched from the server in advance, in live streaming scenarios videos are created in real time. It is challenging to estimate future video segment sizes at any bitrate level, and inaccurate segment size predictions can lead to poor video quality or high latency.
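One simple way to estimate an unseen segment's size, sketched here under assumptions of our own (this is not the thesis's estimator), is to scale the nominal size (bitrate × duration) by the running ratio between observed and nominal sizes at that bitrate:

```python
def estimate_next_size(past_sizes_bits, bitrate_bps, duration_s):
    """Estimate the next segment's size (bits) at a given bitrate.

    past_sizes_bits: observed sizes of earlier segments at this bitrate.
    Falls back to the nominal size when no history is available yet.
    """
    nominal = bitrate_bps * duration_s
    if not past_sizes_bits:
        return nominal
    # Mean observed/nominal ratio captures how far the encoder's actual
    # output deviates from the target bitrate for this content.
    ratio = sum(s / nominal for s in past_sizes_bits) / len(past_sizes_bits)
    return nominal * ratio
```

Even such a corrected estimate can be badly wrong when scene complexity changes abruptly, which is why the thesis treats segment size prediction as a learning problem.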
New Challenges for 360-degree Video Streaming
Learning as a Solution for Video Streaming
Existing learning-based [20] or heuristic-based work [21, 22] on VoD streaming scenarios cannot take hybrid actions and lacks a component for predicting future video information. The key contribution of this thesis is a series of learning-based systems that optimize users' QoE in both live and 360-degree video streaming scenarios (a quick overview can be found in Figure 1.3).
Organization
ABR algorithms for VoD services
Buffer-based methods [22, 38] make bitrate decisions for future segments based only on the client's playback buffer level. Quality-based schemes [20, 21], in contrast, attempt to optimize users' QoE by making bitrate decisions based on both the estimated future bandwidth and the current buffer level.
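The buffer-based family can be sketched with a minimal rule in the spirit of such schemes (not any cited paper's exact algorithm): map the buffer level linearly onto the bitrate ladder between a low and a high reservoir. Thresholds and ladder values are assumptions.

```python
BITRATES = [300, 750, 1200, 1850, 2850, 4300]  # kbps, illustrative ladder

def buffer_based_bitrate(buffer_s, low=5.0, high=20.0):
    """Pick a bitrate from the buffer level alone (no bandwidth estimate)."""
    if buffer_s <= low:          # reservoir: protect against rebuffering
        return BITRATES[0]
    if buffer_s >= high:         # buffer is ample: stream top quality
        return BITRATES[-1]
    # Linear interpolation across the ladder between the two reservoirs.
    frac = (buffer_s - low) / (high - low)
    return BITRATES[int(frac * (len(BITRATES) - 1))]
```

Note that no throughput signal appears anywhere, which is precisely what distinguishes this family from rate- and quality-based schemes.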
ABR algorithms for Live streaming
Pensieve [20] uses DRL to train a neural ABR agent that optimizes QoE from multiple useful input signals, including past measured bandwidths and the buffer level. These tile-based algorithms first determine a total bitrate budget for the next segment based on the available information (e.g., bandwidth estimate, current buffer level), and then assign a bitrate to each tile in the next segment to optimize QoE.
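The two-step budget-then-assign idea can be sketched as a greedy allocation: start every tile at the lowest level and spend the remaining budget on the tiles most likely to be viewed. The per-tile ladder and probabilities below are illustrative assumptions, not a cited algorithm.

```python
TILE_LADDER = [1, 2, 4]  # per-tile bitrate levels (Mbps), assumed

def allocate_tiles(view_probs, budget):
    """Greedily upgrade tiles, highest viewing probability first, within budget."""
    alloc = [TILE_LADDER[0]] * len(view_probs)   # everyone starts at the floor
    spent = sum(alloc)
    for i in sorted(range(len(view_probs)), key=lambda i: -view_probs[i]):
        while alloc[i] != TILE_LADDER[-1]:
            nxt = TILE_LADDER[TILE_LADDER.index(alloc[i]) + 1]
            if spent + (nxt - alloc[i]) > budget:
                break                            # next upgrade would bust the budget
            spent += nxt - alloc[i]
            alloc[i] = nxt
    return alloc

print(allocate_tiles([0.9, 0.5, 0.1], budget=8))  # -> [4, 2, 2]
```

With three tiles and a budget of 8, the most-viewed tile reaches the top level while the others get modest upgrades, mirroring the viewport-weighted allocation described above.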

Viewport Prediction
Jiang et al. [51] apply a long short-term memory (LSTM)-based model to predict future head rotations. Zhang et al. [36] use an ensemble of three LSTM models to further improve prediction accuracy.

System Design
Reinforcement Learning for Live Streaming
However, the size of the incoming video segment cannot be directly observed at the CDN buffer and therefore needs to be estimated after downloading each GOP segment of bitrate 𝑟. Action: in this task, we need to take hybrid actions that include both discrete and continuous components.

HD2: Dueling DQN with Hybrid Action Space
First, the FC layer 𝐶𝑄(𝑠;𝜔) generates the value 𝑐𝑎 as the continuous action; then the other two FC layers, 𝐴(𝑠, 𝑐𝑎, 𝑎;𝜃) (denoting the advantage function of state–action pairs) and 𝑉(𝑠, 𝑐𝑎;𝜃) (denoting the state value function), take both the state 𝑠 and the continuous action value 𝑐𝑎 as input. The two streams are aggregated in the standard dueling fashion:

𝑄(𝑠, 𝑐𝑎, 𝑎;𝜃) = 𝑉(𝑠, 𝑐𝑎;𝜃) + (𝐴(𝑠, 𝑐𝑎, 𝑎;𝜃) − (1/|𝒜|) ∑_{𝑎′∈𝒜} 𝐴(𝑠, 𝑐𝑎, 𝑎′;𝜃)) (3.3)

Since 𝜔 is a subset of 𝜃, updating 𝜃 also updates 𝜔 to generate a better continuous action. The design of HD2 is inspired by [64], but HD2 constitutes a single network architecture (instead of two as in [64]), which incurs much less computational load and can perform better in live streaming tasks.
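The aggregation step of Eq. (3.3) can be shown in isolation with plain numbers. In the real HD2, 𝑉 and 𝐴 are FC-layer outputs conditioned on (𝑠, 𝑐𝑎); here they are given values purely to illustrate the mean-centred combination:

```python
def dueling_q(v, advantages):
    """Combine a shared state value with mean-centred advantages (Eq. 3.3)."""
    mean_a = sum(advantages) / len(advantages)
    return [v + a - mean_a for a in advantages]

# Centring makes the advantages identifiable: their mean contributes nothing,
# so V alone carries the state's overall value.
print(dueling_q(1.0, [0.5, -0.5, 0.0]))  # -> [1.5, 0.5, 1.0]
```

Subtracting the mean advantage is what keeps the V and A streams from trading mass arbitrarily, which stabilizes training of the shared layers.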
HD3: Distributed HD2
Following the same flow as in HD2, obtain the transition (𝑠𝑡, 𝑐𝑎𝑡, 𝑑𝑎𝑡, 𝑟𝑡, 𝑠𝑡+1) and store it in the local buffer 𝐿𝐵.
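The worker-side bookkeeping can be sketched as follows; the flush threshold and class shape are assumptions for illustration, not HD3's actual implementation:

```python
class Worker:
    """Actor that buffers transitions locally, then flushes to a shared buffer."""

    def __init__(self, shared_buffer, flush_every=4):
        self.lb = []                       # local buffer LB
        self.shared = shared_buffer        # global replay buffer (shared list here)
        self.flush_every = flush_every

    def store(self, transition):
        """Store one (s, ca, da, r, s') transition; flush in batches."""
        self.lb.append(transition)
        if len(self.lb) >= self.flush_every:
            self.shared.extend(self.lb)    # push the batch to the global buffer
            self.lb.clear()
```

Batching the pushes amortizes synchronization cost across workers, which is the usual reason distributed actors keep a local buffer at all.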
Evaluation
- QoE Definition
- Dataset Analysis
- Comparison Schemes
- Implementation
- Results Analysis
To better understand how different granularities of latency-limit discretization affect the final performance, we conducted extensive experiments by uniformly discretizing the range of the latency limit. Figure 3.7, Figure 3.8, Figure 3.9 and Figure 3.10 show the results of HD3 and the other schemes at different discretization granularities in the high, medium, low and oscillatory bandwidth scenarios. The reason for this phenomenon is that, when the number of discretization levels of the latency limit is large, the total number of candidate action combinations also becomes large, which makes it harder to train a model to converge to an optimal, or even a good, point.
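The growth is easy to quantify: with a hybrid action discretized into a joint choice, candidate combinations scale with the product of the per-dimension level counts. The numbers below are illustrative, not the experiment's settings:

```python
def action_space_size(num_bitrates, num_latency_levels):
    """Joint discrete action count = bitrate choices x latency-limit candidates."""
    return num_bitrates * num_latency_levels

# Finer latency discretization multiplies the action space the agent must
# explore, e.g. with 6 bitrate levels:
for levels in (5, 10, 50, 100):
    print(levels, action_space_size(6, levels))
```

This linear blow-up per dimension is why the thesis instead keeps the latency limit continuous in HD2/HD3 rather than discretizing it finely.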

Summary
In this chapter, we investigate tile-based adaptive viewport streaming for 360-degree videos and propose the Plato system, which exploits machine learning to solve the inherent VPP and TBS problems: in Plato, a VPP agent and a TBS agent are trained to predict the viewport areas and to select tile bitrates, respectively. Since viewport prediction errors can be detrimental to QoE, we opportunistically extend the predicted viewport to cover actual viewing areas slightly outside it.
System Design
VPP – The Prediction of Viewport
In this section, we apply the LSTM-based model to predict the orientation of the user's head and discuss how we handle potential prediction errors, which we address by practically extending the viewport range.
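The extension idea can be sketched as widening the predicted yaw span by a margin before mapping it to tiles. The field-of-view and margin values below are assumptions for illustration, not the thesis's parameters:

```python
def extended_viewport(center_yaw_deg, fov_deg=90.0, margin_deg=15.0):
    """Return the (start, end) yaw range, widened on both sides of the
    predicted center and wrapped into [0, 360)."""
    half = fov_deg / 2 + margin_deg
    return ((center_yaw_deg - half) % 360, (center_yaw_deg + half) % 360)

# A viewport predicted at yaw 0 wraps across the 0/360 boundary:
print(extended_viewport(0.0))    # -> (300.0, 60.0)
print(extended_viewport(180.0))  # -> (120.0, 240.0)
```

A larger margin absorbs bigger prediction errors at the cost of streaming more tiles at high quality, so the margin is itself a quality/bandwidth trade-off.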

TBS – The Bitrate Selection of Tiles
In this section, our goal is to show how we train the TBS agent using RL. Policy: based on the state 𝑠𝑡, the TBS agent takes action 𝑎𝑡 (i.e., bitrate selections for the three tile regions) for the next video segment. With such reward information, the TBS agent can then train and improve its policy neural network.
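The action step can be sketched as sampling a joint bitrate triple from the policy's output distribution. The bitrate levels and uniform probabilities below are assumptions; a trained policy network would supply the real probabilities:

```python
import random

BITRATE_LEVELS = [1, 5, 8, 16]  # Mbps per tile region, illustrative

def sample_action(policy_probs, rng=random):
    """Sample one (viewport, adjacent, outside) bitrate triple.

    policy_probs: one probability per joint triple, as a policy head over
    the enumerated action space would emit.
    """
    triples = [(a, b, c) for a in BITRATE_LEVELS
                         for b in BITRATE_LEVELS
                         for c in BITRATE_LEVELS]
    return rng.choices(triples, weights=policy_probs, k=1)[0]
```

Grouping tiles into three regions is what keeps this enumeration small (4³ = 64 triples here) compared with choosing a bitrate per individual tile.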

Evaluation
- Parameter Settings
- Datasets
- Comparison Metrics
- QoE Metrics
- Reward function design
- Result Analysis
The logs in the datasets record the device quaternion (𝑞0, 𝑞1, 𝑞2, 𝑞3) of the HMD at each timestamp. According to their algorithm, the lowest quality is first assigned to all tiles in the video; this initial allocation guarantees that every tile of the video is streamed to the user.
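Recovering yaw and pitch from the logged quaternion can be sketched with the standard conversion below, assuming 𝑞0 is the scalar part and a ZYX (aerospace) axis convention; the datasets may use a different convention, so the formulas are illustrative:

```python
import math

def quat_to_yaw_pitch(q0, q1, q2, q3):
    """Convert a unit quaternion (scalar-first) to yaw/pitch in degrees."""
    yaw = math.atan2(2 * (q0 * q3 + q1 * q2), 1 - 2 * (q2 * q2 + q3 * q3))
    # Clamp guards against tiny numerical overshoot outside asin's domain.
    pitch = math.asin(max(-1.0, min(1.0, 2 * (q0 * q2 - q3 * q1))))
    return math.degrees(yaw), math.degrees(pitch)

print(quat_to_yaw_pitch(1, 0, 0, 0))  # identity rotation -> (0.0, 0.0)
```

Working in yaw/pitch rather than raw quaternions is what makes the wrap-around problem in the yaw direction visible, which the sinusoidal representation later addresses.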

Summary
The ever-increasing demand for immersive multimedia experiences has stimulated content providers (e.g., YouTube and Vimeo) to roll out 360-degree video streaming. Streaming 360-degree videos requires unprecedentedly high bitrates compared to conventional video streaming with a fixed viewing direction. Therefore, the ability to adaptively stream 360-degree videos according to the dynamics of network bandwidth is essential for their widespread adoption.

Sinusoids versus Prediction Accuracy
Prediction Error Analysis
Conversion of Degrees to Sinusoid
Why sinusoid is better?
Table 5.1 shows the smoothness and linearity of normalized degrees, cosine, and sine on three head movement datasets.
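The conversion itself is simple enough to show directly: an angle is mapped to its (sin, cos) pair, which removes the 359° → 0° discontinuity, and atan2 maps predictions back to degrees. This illustrates the representation, not the prediction model:

```python
import math

def deg_to_sincos(deg):
    """Represent a wrap-around angle as a point on the unit circle."""
    rad = math.radians(deg)
    return math.sin(rad), math.cos(rad)

def sincos_to_deg(s, c):
    """Map a (sin, cos) pair back to a degree in [0, 360)."""
    return math.degrees(math.atan2(s, c)) % 360

# 359 and 1 degree are numerically far apart but adjacent on the circle;
# in (sin, cos) space they are close, so a regressor sees no jump.
s, c = deg_to_sincos(359)
print(round(sincos_to_deg(s, c), 6))  # -> 359.0
```

Because sine and cosine are continuous in the angle, a model trained on them never has to fit the artificial cliff at the 0/360 boundary.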

System Design
Orientation Prediction
Although this method can predict with relatively high accuracy, prediction errors may still occur, especially when the prediction time window is long or when the user's head moves fast. Therefore, in the following, we present how we handle possible prediction errors in both the yaw and pitch directions.
Error Handling
Since we are trying to assign higher quality to the viewport and adjacent areas and lower quality to the outer area, we set the per-frame viewing probability for tiles in each area as follows: 𝑝1 for VP tiles, 𝑝2 for AD tiles, and 𝑝3 for tiles in the outer area, with 𝑝1 > 𝑝2 > 𝑝3. Next, we introduce how to obtain the normalized probability of viewing each tile in a video segment containing 𝑀 frames.
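One natural normalization, sketched here under our own assumptions (the thesis may weight frames differently), averages each tile's per-frame probability over the 𝑀 frames of the segment and rescales so the probabilities sum to one:

```python
def segment_tile_probs(per_frame_probs):
    """Per-segment tile probabilities from M per-frame probability lists.

    per_frame_probs: M lists, each giving one viewing probability per tile
    (e.g. p1/p2/p3 depending on the area the tile falls in that frame).
    """
    m = len(per_frame_probs)
    n_tiles = len(per_frame_probs[0])
    # Average across frames: a tile inside the viewport in every frame keeps
    # a high probability; one that drifts out is discounted proportionally.
    avg = [sum(frame[t] for frame in per_frame_probs) / m for t in range(n_tiles)]
    total = sum(avg)
    return [p / total for p in avg]
```

A tile's final probability thus reflects how long it stays inside the (extended) viewport over the segment, not just its position in a single frame.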
Tile Probability Normalization
Evaluation
Simulation Settings
In the experiments with the CLS comparison scheme, for each video, 90% of the users are randomly selected for training, while the remaining 10% are used for testing. CLS [30]: group users with density-based spatial clustering of applications with noise (DBSCAN) on the server; then, on the client side, classify the user into the corresponding cluster with a support vector machine (SVM) classifier, and finally obtain the viewing probability of that cluster. LC [54]: first cluster users based on their quaternion rotations, then classify the target user into the corresponding cluster and estimate the future fixation as the cluster center.
Viewport Prediction Accuracy
Note that the prediction errors in the yaw direction are more pronounced than those in the pitch direction, which in practice calls for more accurate viewport prediction in the yaw direction. For larger time windows (i.e., 4–5 s), the tile prediction accuracy of SVP decreases due to accumulated orientation prediction errors: inaccurate viewport predictions for previous frames inevitably lead to larger prediction errors for subsequent frames.
Video Quality Assessment
Since LR-G, LSTM, and MLP yield much larger prediction errors than LR, we only compare SVP with CLS, CUB, LC, and TJ in the following. SVP achieves a higher effective bitrate than CLS, TJ, LC, and CUB under all bandwidth settings because it attains higher tile prediction accuracy.

Summary
Our primary contribution is to develop learning-based systems to address challenges posed by live and 360-degree video streaming services. To address the challenges revealed in 360-degree video streaming scenarios, we develop two learning-based systems for tile bitrate selection and viewport prediction, respectively. Tackling the problem of large action spaces in 360-degree video streaming, we divide the tiles into three classes and then apply a DRL algorithm to teach a neural agent to choose bit rates for each tile class.
Discussion
Limitations
To more accurately predict the future viewport, we propose the sinusoidal viewport prediction (SVP) system to overcome the periodicity problem in the yaw direction. In the current work, we heuristically tuned the values of the reward-function coefficients for 360-degree video streaming. Although DRL-trained neural agents can achieve better performance than heuristic-based schemes, deep neural models typically consume much more computing resources and power, leading to higher costs.
Lessons Learned
Future Perspective
New networking challenges solvable by learning
More specifically, in volumetric video streaming, both future rotation and position must be predicted, and rate control must consider additional information such as the distance to each point cloud [99]. In addition to these challenges, there is no agreed QoE definition for volumetric video streaming; researchers still need to find appropriate QoE metrics to represent user-perceived quality while watching volumetric videos. To save bandwidth and reduce transmission delay, applications must resize (change the video resolution) and sample video frames [107].
Integrating heuristics with reinforcement learning
Adaptive video streaming for analysis. Unlike our previous work, which streamed videos for human viewing, videos can also be streamed for machine analysis.
- An overview of HTTP adaptive video streaming
- The mapping of a spherical surface into a plane
- The main contribution of this dissertation is to develop three learning-
- The workflows of RL
- HD2 architecture
- CDF of bandwidth in four network conditions
- GOP sizes of different bitrates from the same video
- Ratios of GOP sizes between adjacent bitrates from the same video
- Comparing HD3 with several state-of-the-art DRL algorithms on four
- Analyzing performance of different DRL algorithms on different granular-
- Analyzing performance of different DRL algorithms on different granular-
- Analyzing performance of different DRL algorithms on different granu-
- The architecture of Plato
- LSTM-based predictor
- Division of tile areas
- The workflows of RL
- An illustration of the A3C algorithm
- CDF of bandwidth
- Average QoE
- Comparing Plato with existing algorithms by analyzing their performance
- An illustration of tile-based streaming for 360-degree videos in VoD and
- The prediction errors with respect to various time window lengths on the
- Head movement trace on yaw direction. The angle is represented by
- An illustration of predicted values of models trained with both the degrees