Video Compression using Neural Weight Step and Huffman Coding Techniques



INTRODUCTION
Over the last decade, video exchange across the World Wide Web (WWW) has grown substantially. According to the Cisco forecast, about 70-80% of data traffic in mobile applications is currently consumed by video exchange [1]. This consumption is increasingly driven by high-quality (high-resolution) video. To transmit high-quality video efficiently over the bandwidth-limited Internet, videos must be compressed with high-performance compression techniques. Over recent decades, researchers have proposed many video compression standards, such as H.264 [3] and H.265 [2], among others [4]. Yet these codecs are handcrafted, which limits the opportunity to optimize them in an end-to-end manner.
Recent image and video compression studies adopting Deep Learning (DL) have shown considerable potential of DL to improve rate-distortion performance [5] [6]. Consequently, interest in DL-based video compression techniques is increasing [7][8] [9][10] [11]. For example, optical flow was adopted for motion compensation, with auto-encoders compressing the flow and the residual [10]. Later, a video compression technique was proposed that uses a 3D auto-encoder combined with an autoregressive prior [11]. Such methods train their models with a single loss function applied to all frames. As a result, they do not exploit hierarchical quality layers, in which high-quality frames are used both as references during compression and as support when post-processing the other frames.
In this paper, three hierarchical quality layers are combined with a recurrent enhancement network to build the proposed Hierarchical Video Compression Scheme (HVCS). As Figure 1 illustrates, frames in layers 1, 2 and 3 are handled with high, medium and low quality, respectively. Hierarchical-quality compression provides two benefits. First, high-quality frames serve as high-quality references, which improves the compression of the other frames at the coding stage. Second, because adjacent frames are highly correlated, high-quality frames provide information that enhances the decoding of low-quality frames.
These improvements raise quality without increasing the bit-rate, which improves rate-distortion performance. As an example in Figure 1, frames 3 and 8 (from layer 3) are compressed with low quality and low bit-rate. The proposed recurrent enhancement network then significantly improves their quality by utilizing high-quality frames such as frames 0 and 5. As a result, frames 3 and 8 reach a quality comparable to that of frame 5 from layer 2, but with lower bit-rate consumption. This indicates the efficiency of the proposed video compression scheme.
In this work, an image compression method is used to compress layer 1. For layer 2, a predictive video compression technique, Bi-Directional Compression (BDC), is proposed, in which the compressed frames of layer 1 serve as references for bi-directional compression. Then, exploiting the correlation between adjacent frames, this work compresses layer 3 frames using the proposed Single Motion Compression (SMC) network. This network uses a single motion map to describe the motion within a set of frames, which reduces the bit-rate needed to encode motion maps. Finally, this work enhances the proposed algorithm by developing the weighted Recurrent Quality Enhancement (RQEN) network based on [12], whose authors weight recurrent cells by quality features so that information from multiple frames can be applied in recurrent enhancement. Experimental results show that the proposed HVCS scheme achieves considerable performance relative to the published state of the art in video compression, outperforming the other techniques.

1992-0652

I. Related works
In recent decades, many researchers have proposed image compression standards such as Joint Photographic Experts Group (JPEG) [2], JPEG 2000 [14], and Better Portable Graphics (BPG) [13]. Deep Neural Networks (DNNs) have also been widely adopted to enhance the performance of compression techniques [5] [23]. Several end-to-end DNN schemes were proposed for deep image compression in [5] [6], which apply the factorized-prior and hyper-prior models, respectively, for entropy estimation. Other researchers proposed recurrent structures for image compression [16][17] [22]. Entropy models such as the hierarchical prior model [21] and the context-adaptive model [23] were then proposed to further improve rate-distortion performance. At the time of their publication, these schemes outperformed previously published compression schemes.
Building on image compression standards, different video compression standards were proposed, such as H.264 [3], H.265 [2], MPEG [24] and many others [4]. Recently, DNNs have attracted increasing attention for building efficient video compression algorithms. Several researchers replaced individual components of traditional video codecs with DNN-based ones [25][26] [27][28] [29]. For example, Jiaying et al. [26] modified motion compensation by using a DNN to enhance fractional interpolation, while DNNs were also used to improve in-loop filters [29][8] [25]. Yet such methods improve the performance of each module individually and cannot optimize the compression framework jointly.
Using end-to-end concepts, DNNs were also adopted in several video compression techniques [1][7] [9][11] [8]. Specifically, Wu et al. [7] predicted frames by interpolating from reference frames and adopted the compression scheme of George et al. [16] for residual compression. Guo et al. [10] proposed a DNN compression method that uses optical flow for temporal motion estimation and two auto-encoders for motion and residual compression. Meanwhile, Zhengxue et al. [9] added spatial-temporal energy compaction to the loss function to enhance compression performance. In [11], a rate-distortion auto-encoder was proposed that handles video entropy coding using an autoregressive prior. To the best of our knowledge, only one published work adopts hierarchical prediction [7]. Yet we did not find any learned video compression work that considers hierarchical quality. As a result, existing methods do not provide high-quality references for compressing the other frames, and may not utilize the significant information that multi-frame post-processing can provide.
Any lossy compression algorithm inevitably degrades video quality and introduces artifacts; therefore, different researchers focused on improving compressed-video quality [12] [30][31][32] [33][34] [35]. Among these works, the authors of [33][34] [35] built their schemes on single-frame enhancement, taking one frame as input at a time. Then, multi-frame schemes with quality enhancement were proposed by Yang et al. [36][12], which utilize the correlation between frames. In addition, Guo et al. [35] proposed using a Deep Kalman (DK) filter for compression-artifact reduction.
Nevertheless, all the aforementioned techniques were designed as post-processing for video coding. As a result, in the multi-frame techniques [36][12], the frame quality could not be obtained with high accuracy and had to be estimated with some prediction error. In the proposed HVCS scheme, the compression quality of each frame is encoded into the resulting bit stream, which is fed, together with the compressed frames, to the enhancement network. In this way, the enhancement is supported by accurate frame-quality information and is performed by the deep decoder of the compression scheme.

Subjects
a. System Framework
The proposed HVCS framework is shown in Figure 2, applied to the first Picture Group (PG); it is applied in the same way to every other PG. The HVCS scheme compresses the frames in three hierarchical quality layers (layer 1, layer 2 and layer 3), in descending order of quality, as shown in Figure 2.
Layer 1. Frames in the first layer (single-line gray blocks) are encoded by an image compression method (BPG [13] for PSNR and Jooyoung et al. [22] for MS-SSIM), and y_i^c denotes the compressed frame of index i. As with "I-frames" in traditional coding methods [3][2], the compressed frames in layer 1 consume the highest bit-rates and provide the highest quality. They therefore stop error propagation while encoding and decoding the video. Furthermore, the high-quality information provided by these frames supports the enhancement and compression of adjacent frames.
Layer 2. Frames in the second layer (double-line light-brown blocks) lie between two layer 1 frames. A BDC network is proposed to compress layer 2 frames, using the previous and next compressed layer 1 frames as bi-directional references. Layer 2 frames are compressed at medium quality and provide supporting information for enhancing and compressing the low-quality frames of layer 3. Section 3.2 describes BDC in more detail.
Layer 3. The remaining frames, not compressed in layer 1 or layer 2, form layer 3 (triple-line green blocks) and are compressed at low quality with the lowest bit-rate. In recent DNN video compression techniques, at least one motion map was required to compensate the motion of each frame [7] [22]. Because the motions of successive frames are correlated, encoding one motion map per frame introduces redundancy. Therefore, an SMC network is proposed in this work, in which a single motion map describes the motions between successive frames, which reduces the bit-rate. Note that frames (y6 to yn) are compressed in the same manner as frames (y1 to y4), so they are omitted in Figure 2. Section 3.3 introduces the SMC network.
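The layer layout described above can be sketched in a few lines. This is a minimal illustration, assuming a layer 1 frame every 10 frames (frames 0 and 10 in the first PG) with the layer 2 frame midway at frame 5; the function name `assign_layers` and the `gop_size` parameter are hypothetical and not taken from the paper.

```python
def assign_layers(num_frames, gop_size=10):
    """Return the quality layer (1, 2 or 3) of every frame index.

    Assumption: layer 1 frames sit at multiples of gop_size, the layer 2
    frame sits midway between two layer 1 frames, and all remaining
    frames belong to layer 3.
    """
    layers = []
    for index in range(num_frames):
        position = index % gop_size
        if position == 0:
            layers.append(1)          # highest quality, image codec
        elif position == gop_size // 2:
            layers.append(2)          # medium quality, BDC
        else:
            layers.append(3)          # lowest quality, SMC
    return layers


if __name__ == "__main__":
    print(assign_layers(11))
```

Under these assumptions, eight of every ten frames fall in layer 3, which is why savings on layer 3 dominate the overall bit-rate.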
Because successive video frames are highly correlated [33], a weighted Recurrent Quality Enhancement (RQEN) network was developed and adopted in this work. It weights recurrent cells with quality features so that multi-frame information is leveraged reasonably. In particular, the quality of layer 3 frames can be significantly enhanced by utilizing information from the high-quality frames in layer 1 and layer 2. This enhancement requires no additional information to be stored, which means less bit-rate, especially for low-quality frames. RQEN is part of the proposed decoder: the quality information and the compressed frames, both encoded in the bit stream, are the inputs on the decoder side. Section 3.4 discusses RQEN in more detail.

b. Bi-directional Deep Compression (BDC)
The BDC network used to compress layer 2 frames is illustrated in Figure 3, again using the first PG as an example. First, the Motion Estimation (ME) subnet is used in BDC to detect temporal motions between the target and reference frames. Since there is a long interval between layer 1 and layer 2 frames (5 frames, as in Figure 3), the hierarchical (pyramid) network of Anurag and Michael [35] is followed. Such a pyramid handles large motions because it provides a wide receptive field. Note that this approach uses backward warping, and therefore backward motions are estimated. In Figure 3, for example, the outputs of the ME subnet are the motions from frame 5 to frame 0 (denoted f5→0) and from frame 5 to frame 10 (denoted f5→10).
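Backward warping means that every pixel of the target frame is filled by sampling the reference frame at the position the backward motion points to. The sketch below illustrates the idea on small 2-D lists; integer displacements, nearest-pixel sampling and border clamping are simplifying assumptions (learned codecs typically use bilinear sampling), and `backward_warp` is a hypothetical helper name.

```python
def backward_warp(reference, flow):
    """Warp `reference` toward the target frame.

    flow[n][m] = (dn, dm) is the backward motion of target pixel (n, m),
    so the output is target[n][m] = reference[n + dn][m + dm].
    Out-of-range positions are clamped to the frame border (assumption).
    """
    height, width = len(reference), len(reference[0])
    warped = [[0] * width for _ in range(height)]
    for n in range(height):
        for m in range(width):
            dn, dm = flow[n][m]
            src_n = min(max(n + dn, 0), height - 1)
            src_m = min(max(m + dm, 0), width - 1)
            warped[n][m] = reference[src_n][src_m]
    return warped


if __name__ == "__main__":
    ref = [[1, 2, 3], [4, 5, 6]]
    flow = [[(0, 1)] * 3 for _ in range(2)]   # every pixel samples its right neighbour
    print(backward_warp(ref, flow))
```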
Given the estimated motions, an auto-encoder is used for Motion Compression (MC), which exploits the similarities between video frames. Because the motions in different frames are correlated, this work concatenates the bi-directional motions (concatenation is denoted [·, ·, ...]) as the input to the encoder Enj, which transforms the input into the latent representation qj. After that, qj is quantized to φj, and φj is fed into the decoder Dj to obtain the compressed motions, while Huffman coding [40] is used to encode φj into bits. The compressed motions are denoted £5→0 and £5→10. Next, the compressed motions are used to warp the reference frames 0 and 10 to the target frame, and the Motion Post-processing (MP) subnet then merges the warped frames for motion compensation, where the backward warping operation is denoted BKW.
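To make the entropy-coding step concrete, the following is a textbook Huffman coder applied to a short list of integers standing in for a flattened quantized latent φj. It is a minimal sketch, not the authors' implementation; the function names and the sample values are hypothetical.

```python
import heapq
from collections import Counter


def build_huffman_code(symbols):
    """Build a prefix code {symbol: bit string} from symbol frequencies."""
    counts = Counter(symbols)
    if len(counts) == 1:                      # degenerate case: one distinct symbol
        return {next(iter(counts)): "0"}
    # Heap entries are (weight, tie_breaker, partial code table).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)   # two least frequent subtrees
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, code):
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:                # prefix-free, so first match is correct
            out.append(inverse[current])
            current = ""
    return out


if __name__ == "__main__":
    latent = [0, 0, 0, 0, 1, 1, -1, 2]        # hypothetical quantized values
    code = build_huffman_code(latent)
    bits = encode(latent, code)
    print(len(bits), decode(bits, code) == latent)
```

Frequent values (here 0) receive the shortest codewords, so a latent that is concentrated around a few values costs few bits.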

Here, γ5 denotes the motion-compensated frame. Finally, the Residual Compression (RC) subnet compresses the residual between the raw frame y5 and the compensated frame γ5. As in the MC subnet, RC contains an encoder (Enr) and a decoder (Dcr), and φr denotes its quantized latent representation; the output of the RC subnet is the compressed frame y_5^c. In RC, standard Huffman coding is adopted to encode φr into a bit stream, which, combined with the encoded φj, forms the bit stream of layer 2. The compression quality, denoted Q5, is computed and included in the bit stream, as explained further in Section 3.4 with the proposed deep decoder. This paper evaluates quality using Peak Signal-to-Noise Ratio (PSNR) and Multi-Scale Structural Similarity (MS-SSIM), the latter as proposed by Zhou et al. [37].
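As a compact summary, the MC, motion-compensation and RC steps described in this subsection might be written as follows. This is a sketch in the notation of the text; the rounding operator used for quantization, the latent symbol q_r, and the exact argument list of MP are assumptions rather than formulas taken from the source.

```latex
\begin{aligned}
q_j &= En_j\big([f_{5\to 0},\, f_{5\to 10}]\big), \qquad
\varphi_j = \lfloor q_j \rceil, \qquad
[\pounds_{5\to 0},\, \pounds_{5\to 10}] = D_j(\varphi_j) \\
\gamma_5 &= MP\big(BKW(y_0^c,\, \pounds_{5\to 0}),\; BKW(y_{10}^c,\, \pounds_{5\to 10})\big) \\
q_r &= En_r(y_5 - \gamma_5), \qquad
\varphi_r = \lfloor q_r \rceil, \qquad
y_5^c = Dc_r(\varphi_r) + \gamma_5
\end{aligned}
```

Only φj and φr are entropy coded, so the layer 2 bit-rate is the sum of the bits spent on these two quantized latents.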

c. Single Motion Compression (SMC)
The SMC network is proposed to compress the remaining frames, which form layer 3, using the nearest compressed frames in layer 1 and layer 2 as references. The proposed SMC network uses a single motion map to compress two frames: as in Figure 2, a single motion map is used to compress frames y1 and y2 with compressed frame 0 as the reference, and compressed frame 5 is used as the reference to compress y3 and y4.
The architecture of the SMC network is shown in Figure 4, using y1 and y2 as examples. In this figure, a DNN first compresses frame y2 with an architecture similar to BDC, containing the four subnets (ME, MC, MP and RC), which yields the compressed frame 2. As mentioned above, because adjacent frames are correlated, the motions between y1 and the compressed frames 0 and 2 can be predicted from the motion between frames 0 and 2. Accordingly, the compressed frames 0 and 2 are used as reference frames to compress y1 without consuming bits for a motion map, which improves rate-distortion performance.
In the proposed SMC, the inverse motion is used to predict the motion. Practically, for a motion map f, the inverse motion f-1 can be defined as:
$$f^{-1}\big(n + \nabla n(n, m),\; m + \nabla m(n, m)\big) = -f(n, m) \qquad (9)$$
In (9), f(n, m) = (∇n(n, m), ∇m(n, m)) indicates that the pixel at (n, m) moves to the new position (n+∇n(n, m), m+∇m(n, m)), and the value −f(n, m) is assigned to this new position in f-1.
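The inverse operation can be illustrated on a small grid. The sketch below assumes integer displacements and leaves positions that no pixel moves to at zero motion; both choices, and the name `inverse_motion`, are illustrative assumptions rather than details given in the paper.

```python
def inverse_motion(flow):
    """Invert a motion map given as flow[n][m] = (dn, dm).

    The pixel at (n, m) moves to (n + dn, m + dm), and the inverse map
    stores (-dn, -dm) at that new position. Positions that receive no
    pixel keep zero motion, and moves leaving the frame are dropped
    (both are assumptions of this sketch).
    """
    height, width = len(flow), len(flow[0])
    inverse = [[(0, 0) for _ in range(width)] for _ in range(height)]
    for n in range(height):
        for m in range(width):
            dn, dm = flow[n][m]
            new_n, new_m = n + dn, m + dm
            if 0 <= new_n < height and 0 <= new_m < width:
                inverse[new_n][new_m] = (-dn, -dm)
    return inverse


if __name__ == "__main__":
    flow = [[(0, 1)] * 3 for _ in range(2)]   # every pixel moves one column right
    print(inverse_motion(flow))
```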
As noted above, this approach adopts backward warping, so y2 is compressed using the motion from y2 to frame 0, denoted f2→0. Consequently, y1 is compressed using the motion from y1 to frame 0 (£1→0) and the reference frames 0 and 2. Since the raw frame y2 is unavailable at the decoding stage, f2→0 cannot be recovered there. Therefore, the compressed motion £2→0 is used to predict £1→0 and £1→2: £1→0 is predicted from £2→0 with the f-1 operation, and £1→2 is obtained similarly. Then, as in (3) and (4), the reference frames are warped with the predicted motions and fed into the MP subnet to generate the motion-compensated frame γ1. Finally, the compressed frame 1 is obtained by compressing the residual (y1 − γ1) with the RC subnet. The resulting bit streams include the compression qualities Q1 and Q2, which are used by the proposed RQEN network.
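One way to write these two predictions is sketched below. It assumes roughly uniform motion across frames 0, 1 and 2, so that the motion over one frame interval is half the motion over two; the factor 0.5 and the operator name Inv (standing for the f-1 operation of Equation (9)) are assumptions, not formulas recovered from the source.

```latex
\begin{aligned}
\pounds_{1\to 0} &= \mathrm{Inv}\big(0.5 \cdot \mathrm{Inv}(\pounds_{2\to 0})\big) \\
\pounds_{1\to 2} &= \mathrm{Inv}\big(0.5 \cdot \pounds_{2\to 0}\big)
\end{aligned}
```

Reading the first line from the inside out: inverting the motion from frame 2 to frame 0 gives the motion from frame 0 to frame 2, halving it approximates the motion from frame 0 to frame 1, and inverting again gives the motion from frame 1 to frame 0. In the second line, halving the motion from frame 2 to frame 0 approximates the motion from frame 2 to frame 1, whose inverse is the motion from frame 1 to frame 2.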

d. Weighted Recurrent Enhancement (RQEN)
Finally, at the decoding stage, RQEN is used to enhance quality. It is designed on the basis of the QG-ConvLSTM method [12], which uses a quality-gated cell and a spatial-temporal structure to exploit the correlations between adjacent frames.
Unlike [12], this work uses residual blocks in the spatial feature extraction and reconstruction networks, and employs skip connections to enhance performance. Furthermore, as illustrated in [33][12], the significance of a frame for enhancing the quality of other frames depends strongly on its quality relative to the others. In [33][12], however, the actual quality of each frame is not available at the decoding stage. In contrast, since the compression quality is encoded in the bit stream, the proposed scheme can access Qi directly. Refreshing the previous memory with the current information is accomplished by learning weights that reasonably control the memory Mi. Specifically, for high-quality frames, the memory weight (w_i^M) is expected to be small, so that previous low-quality information is refreshed, while the update weight (w_i^S) is expected to be large, so that the high-quality information of the frame enters the memory and helps enhance other frames. Conversely, a large memory weight and a small update weight are expected for low-quality frames. In addition, because the weights are outputs of a sigmoid function (so they are less than 1), the information of previous frames in the memory decreases as the frame distance increases. This matches the fact that farther frames are less correlated and therefore provide less quality enhancement. Thus, in the quality-gated cell, frames of different quality contribute to the memory Mi with different levels of significance. This enables the proposed RQEN network to make reasonable use of multi-frame information in quality enhancement.
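The gating behaviour described above can be illustrated with a scalar toy model. This is a deliberately simplified sketch: the real cell operates on convolutional feature maps and learns its weights from the quality features, whereas here the memory is a single number and the gate inputs (`memory_logit`, `update_logit`) are passed in by hand. All names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def update_memory(prev_memory, frame_info, memory_logit, update_logit):
    """One quality-gated update: M_i = w_M * M_{i-1} + w_S * S_i.

    Both weights come from a sigmoid, so they lie strictly between 0 and 1.
    """
    w_memory = sigmoid(memory_logit)   # how much old memory is kept
    w_update = sigmoid(update_logit)   # how much current information is written
    return w_memory * prev_memory + w_update * frame_info


def remaining_share(memory_logits):
    """Fraction of an old memory entry that survives a sequence of updates."""
    share = 1.0
    for logit in memory_logits:
        share *= sigmoid(logit)
    return share


if __name__ == "__main__":
    # High-quality frame: small memory weight, large update weight.
    print(update_memory(1.0, 5.0, memory_logit=-4.0, update_logit=4.0))
    # Low-quality frame: large memory weight, small update weight.
    print(update_memory(1.0, 5.0, memory_logit=4.0, update_logit=-4.0))
    # Contribution of a frame three steps back when every memory weight is 0.5.
    print(remaining_share([0.0, 0.0, 0.0]))
```

Because every memory weight is below 1, the surviving share shrinks with each step, which mirrors the statement that farther frames contribute less.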

e. Training strategy
In the training stage, the density model of [5] is used to estimate the bit-rate for encoding φj and φr in (1) and (6), and R(·) denotes the estimated bit-rate. This work then follows [10] [23] in formulating the loss L:
$$L = \hat{h} \cdot D + R \qquad (12)$$
where ĥ is the hyper-parameter that controls the trade-off between distortion D and bit-rate R. Equation (12) shows that, for trained models, the compression quality depends on ĥ: a smaller ĥ yields lower quality and lower bit-rate. Therefore, for hierarchical-quality compression in the proposed HVCS scheme, different ĥ values are applied in the BDC and SMC networks, which compress layer 2 and layer 3, respectively. Specifically, given Equation (12) with the estimated R, the loss function of the proposed BDC network is:
$$L_{BD} = \hat{h}_{BD} \cdot D(y_5, y_5^c) + R(\varphi_j) + R(\varphi_r) \qquad (13)$$
while for SMC the loss is:
$$L_{SM} = \hat{h}_{SM} \cdot \big(D(y_1, y_1^c) + D(y_2, y_2^c)\big) + R(\varphi_j) + R(\varphi_{r1}) + R(\varphi_{r2}) \qquad (14)$$
In Equation (14), φr1 and φr2 denote the quantized latent representations in the RC subnets of frames y1 and y2, respectively. In Equations (13) and (14), the Mean Square Error (MSE) is used as the distortion measure, i.e., D(j, k) = MSE(j, k), when training the proposed HVCS scheme for Peak Signal-to-Noise Ratio (PSNR). When optimizing for MS-SSIM, D is replaced by an MS-SSIM-based distortion. Most importantly, ĥBD is set larger than ĥSM in Equations (13) and (14), so that the scheme learns to compress layer 2 frames with higher quality than layer 3 frames, which realizes the hierarchical-quality compression. The proposed RQEN network is trained by minimizing its loss function, in which N denotes the RQEN step length. Because of the bi-directional structure of RQEN, a larger N causes more delay in decoding and longer training time. Therefore, N is restricted to 11, which corresponds to the frame interval of layer 1, in both training and inference.
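The two rate-distortion losses can be evaluated numerically as in the sketch below. It assumes that frames are flattened lists of pixel values and that the bit-rate terms have already been estimated by a separate entropy model; the function names and the sample numbers are hypothetical.

```python
def mse(raw, compressed):
    """Mean square error between two flattened frames."""
    return sum((a - b) ** 2 for a, b in zip(raw, compressed)) / len(raw)


def bdc_loss(lam_bd, raw5, comp5, rate_motion, rate_residual):
    """Loss of the BDC network: lambda * D + R, with R split into two parts."""
    return lam_bd * mse(raw5, comp5) + rate_motion + rate_residual


def smc_loss(lam_sm, raw1, comp1, raw2, comp2,
             rate_motion, rate_residual1, rate_residual2):
    """Loss of the SMC network: one motion map, two residuals, two distortions."""
    distortion = mse(raw1, comp1) + mse(raw2, comp2)
    return lam_sm * distortion + rate_motion + rate_residual1 + rate_residual2


if __name__ == "__main__":
    lam_sm = 256
    lam_bd = 4 * lam_sm   # larger weight on distortion gives higher layer 2 quality
    print(bdc_loss(lam_bd, [0, 0], [1, 1], 0.01, 0.05))
    print(smc_loss(lam_sm, [0], [1], [0], [2], 0.01, 0.02, 0.03))
```

A larger trade-off weight makes the same distortion more expensive, which pushes the trained network toward higher quality at the cost of more bits.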

f. Experimental Settings
The proposed BDC and SMC networks were trained on the standard Vimeo-90k dataset [38], while 142 videos collected from the VQEG [38] and Xiph [39] datasets were used to train the proposed RQEN. For testing, the UVG [40] dataset was adopted together with Classes B, C and D of the standard JCT-VC dataset [41]. The training and testing datasets are separate. Different resolutions are covered: 1920 × 1080 (UVG and JCT-VC Class B), which is high resolution, as well as the lower resolutions 832 × 480 and 416 × 240 (JCT-VC Classes C and D, respectively). Following [1] for comparison, all frames of the UVG videos and 100 frames of each JCT-VC video were used for testing. PSNR and MS-SSIM were used for quality evaluation. For model training, ĥSM was set to 256, 512, 1024 and 2048 for PSNR, and to 8, 16, 32 and 64 for MS-SSIM. For hierarchical quality, ĥBD was set to 4 × ĥSM. For benchmarking, HVCS was compared with recent ANN-based compression methods. Furthermore, the standard H.264 [3] and H.265 [2] coding techniques were also included in the benchmarking.
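For reference, PSNR and bits per pixel can be computed as in the following sketch. It assumes 8-bit samples (peak value 255) and frames given as flattened lists of pixel values; the helper names are hypothetical.

```python
import math


def psnr(raw, compressed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two flattened frames."""
    error = sum((a - b) ** 2 for a, b in zip(raw, compressed)) / len(raw)
    if error == 0:
        return math.inf                     # identical frames
    return 10.0 * math.log10(peak ** 2 / error)


def bits_per_pixel(total_bits, width, height, num_frames):
    """Average number of coded bits per pixel over a sequence."""
    return total_bits / (width * height * num_frames)


if __name__ == "__main__":
    print(round(psnr([0, 0, 0, 0], [1, 1, 1, 1]), 2))
    print(bits_per_pixel(207360, 1920, 1080, 1))
```

Normalizing by the pixel count makes bit-rates comparable across the 1920 × 1080, 832 × 480 and 416 × 240 test sequences.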

a. Results
Tables 1 and 2 report the rate-distortion results on both video datasets. As mentioned above, PSNR and MS-SSIM are used for quality evaluation, and bit-rates are measured in bits per pixel (bpp). Table 1 reports the PSNR performance, where the proposed compression model performs better than other methods such as Chao et al. [7] and the optimized method of [1]. In addition, it outperforms H.265 on the standard JCT-VC dataset. On UVG, the proposed scheme yields better bit-rate performance than H.265. As shown in Table 2, in the MS-SSIM evaluation the proposed scheme performs better than all other learned approaches, and also better than H.264 and H.265. Regarding bit-rate performance on UVG, Lee et al. [11] has comparable performance.
Furthermore, the Bjøntegaard Delta Bit-Rate (BDBR) [42] is computed with H.265 as the anchor. BDBR measures the average bit-rate difference relative to the anchor, where lower values indicate better performance [43]. Table 3 reports the BDBR performance in terms of PSNR and MS-SSIM, in which negative numbers indicate a bit-rate reduction relative to the anchor, i.e., performance better than H.265; bold numbers mark the best results among the learned methods. Table 3 provides a fair comparison with the (PSNR- and MS-SSIM-) optimized DVC [10], with H.265 as the anchor. As shown in Table 3, the PSNR model of the proposed scheme also performs well in terms of MS-SSIM (average = −6.04). On Class C, the proposed PSNR model even shows clearly better MS-SSIM performance than the MS-SSIM-optimized method of Cheng et al. [9]. Furthermore, the MS-SSIM model in this work provides better results than all other learned techniques on MS-SSIM, with about 36% less bit-rate than H.265 on average.
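The idea behind BDBR can be sketched as follows: fit log10(bit-rate) as a cubic function of quality for each codec, average both fits over the shared quality range, and convert the difference into a percentage. The sketch assumes exactly four rate-distortion points per curve, so that the cubic passes through all points and Simpson's rule integrates it exactly; reference implementations usually use a least-squares cubic fit instead. All function names and sample values are hypothetical.

```python
import math


def _interpolate(xs, ys, x):
    """Evaluate the Lagrange polynomial through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def _mean_log_rate(quality, rates, low, high):
    """Mean of the fitted log10(rate) over [low, high] via Simpson's rule."""
    logs = [math.log10(r) for r in rates]
    mid = (low + high) / 2.0
    return (_interpolate(quality, logs, low)
            + 4.0 * _interpolate(quality, logs, mid)
            + _interpolate(quality, logs, high)) / 6.0


def bd_rate(anchor_rates, anchor_quality, test_rates, test_quality):
    """Average bit-rate difference (in percent) of the test codec vs. the anchor."""
    low = max(min(anchor_quality), min(test_quality))
    high = min(max(anchor_quality), max(test_quality))
    diff = (_mean_log_rate(test_quality, test_rates, low, high)
            - _mean_log_rate(anchor_quality, anchor_rates, low, high))
    return (10.0 ** diff - 1.0) * 100.0


if __name__ == "__main__":
    quality = [30.0, 33.0, 36.0, 39.0]            # e.g. PSNR in dB
    anchor = [0.1, 0.2, 0.4, 0.8]                 # anchor bit-rates in bpp
    test = [r / 2.0 for r in anchor]              # half the bits at equal quality
    print(bd_rate(anchor, quality, test, quality))
```

A codec that needs half the bits of the anchor at every quality level obtains a value of about −50, matching the convention that negative BDBR means bit-rate savings.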
Notably, the proposed MS-SSIM model without RQEN still achieves BDBR = −28%, which is significantly better than all other methods. Table 3 also reports the BDBR computed in terms of PSNR. On average, the bit-rate is 4.46% lower than that of H.265, with the best result (7.83% lower bit-rate) on JCT-VC Class C. Over the studied video dataset (20 videos), HVCS yields better PSNR performance than H.265 on 15 videos. Although RQEN provides considerable enhancement, Table 3 shows that the PSNR model without RQEN still performs better than the latest PSNR-optimized technique (DVC [10]). In summary, the proposed HVCS scheme outperforms all compared techniques in terms of PSNR, and it also yields better results than H.265.

b. Ablation studies
Ablation studies are generally conducted to assess the contribution of the inner components of a proposed system [44], and such a study was conducted to evaluate the proposed HVCS scheme. In this study, the baseline (bl) model is defined as the proposed scheme without its principal enhancements: i. hierarchical frame quality, referred to as HQ (the baseline trains the models with a single ĥ value for every frame); ii. the Single Motion approach, referred to as SM (using the same motion map for all frames); iii. the proposed enhancement network, referred to as RQEN.
The performance is then analyzed for the baseline model with the enhancements added successively: i. bl+HQ, ii. bl+HQ+SM, iii. the full scheme (bl+HQ+SM+RQEN). In addition, the enhanced non-hierarchical model (bl+RQEN) is also discussed. Table 4 reports the results of the ablation study.
bl+HQ: In Table 4, combining the baseline with HQ yields clearly better results than the baseline alone, which shows the gain that HQ brings to compression performance. Furthermore, Table 4 compares bl and bl+HQ in terms of the bit-rate and PSNR of the low-quality frames (layer 3) and the high-quality frames (layer 1 and layer 2). On layer 3 frames, bl+HQ yields better PSNR than bl even at a smaller bit-rate. This is expected, because layer 1 and layer 2 provide high-quality references for compressing layer 3 frames in bl+HQ. Since layer 3 contains most of the video frames, adding HQ to bl significantly improves the performance of the scheme.
bl+HQ+SM: As shown in Table 4, the SMC strategy further improves compression performance. Compared with bl+HQ, bl+HQ+SM reduces the bits used for motion maps. For example, at ĥ = 256, bl+HQ spends 0.018 bpp on average for motion information, out of a total bit-rate of 0.097 bpp. Using SMC reduces the bits consumed for motion to 0.014 bpp, which is about 23% lower than in bl+HQ, while the total bit-rate is reported as 0.097 bpp. At the same time, bl+HQ+SM improves PSNR (from 29.26 dB to 29.47 dB), because more bits are allocated to residual coding. This validates that the proposed SMC reduces video-motion redundancy and enhances compression performance.
bl+HQ+SM+RQEN: Table 4 shows that adding the proposed RQEN network significantly enhances the quality of bl+HQ+SM.
The most significant enhancements are obtained on low-quality frames; for example, the PSNR enhancements are about 1 dB for frame 3 and frame 9. The proposed RQEN learns a larger w_i^S and a smaller w_i^M on high-quality frames, which reduces the previous memory and updates it with the significant information, and vice versa for low-quality frames. Moreover, Table 4 visually shows that frame 3 suffers from considerable distortion since its bit-rate is low, while the higher-quality frame 6 is highly correlated with frame 3. Thus, in RQEN, the high w_i^S of frame 6 writes a large proportion of its information into the memory. Therefore, the lost information of frame 3 can be recovered, which significantly enhances its quality.
Advantages of Hierarchical Quality: According to Table 4, the quality enhancement from bl to bl+RQEN is considerably smaller than that from bl+HQ+SM to the proposed HVCS scheme. A major cause of this difference is that frame qualities are similar in the non-hierarchical model, whereas the hierarchical scheme uses high-quality references to enhance the low-quality frames. In summary, all components of the proposed compression scheme yield significant enhancements, and the compression framework achieves considerable results compared with other state-of-the-art video compression methods.