Digital video enables long-distance visual communication over the Internet and has given rise to a number of new-media services such as HD broadcast, on-demand content streaming, teleconferencing, and cloud gaming. H.264/AVC is the current state-of-the-art and the most widely used video compression standard; it was published back in 2003 and has enjoyed huge commercial success for more than 10 years.
In contrast to analog video, digital video exploits perceptual, spatial, temporal and statistical redundancy to achieve bit rate reduction at the cost of latency and computation. So, the trade-off among
- low bit rate
- high video quality
- low latency
- low computation cost
is indeed one of the most fundamental problems for a video engineer to understand, study, resolve, and optimize when designing an encoder solution for a given application.
Let’s use live TV broadcast vs. on-demand content streaming (e.g., YouTube) as an example to explain how different applications give rise to different encoder design objectives.
- Latency: Live TV has a typical overall delay of 3-10 seconds, whose upper limit is mainly set by live sports broadcast (e.g., TV can’t lag too far behind radio); YouTube viewers don’t care about delay as long as the video plays out smoothly (buffering, for instance, sacrifices delay for smoothness).
- Computation: A live TV program needs to be encoded in real time on a dedicated hardware encoder, and low computation cost (which translates to operating cost) is a major design requirement for broadcast engineers; on-demand content is encoded off-line (but streamed in real time) by software encoders running on a server farm, so computation cost is usually not a concern.
- Low bit rate vs. high video quality: Live TV is typically broadcast at a bit rate that yields the lowest acceptable video quality (in order to squeeze as many channels as possible into a given bandwidth); YouTube/Netflix streams at the highest possible bit rate, i.e., the largest reliable bandwidth between the streaming server and the client, to achieve the highest possible video quality. However, this doesn’t mean TV broadcast has lower video quality than on-demand streaming; on the contrary, live TV is often encoded by the most sophisticated ASIC-based professional encoders, which deliver the best video quality at any given bit rate.
Cloud gaming, in turn, is very different from both live TV and YouTube. For cloud gaming, ultra-low latency (<100 ms) is the number-one must-have that can’t be compromised in any way: if the game is not responsive, it is simply not playable. Also, the game content is encoded in real time and one encoder is dedicated to each client, so computation/operating cost is another critical requirement for making a cloud-gaming business profitable. Although video quality is a relatively lower priority, it is becoming one of the major differentiators in today’s competition.

Introduction to HEVC/H.265
The High Efficiency Video Coding (HEVC) standard is the latest video compression standard, the successor to H.264/MPEG-4 AVC (Advanced Video Coding). Like MPEG-2 and H.264, HEVC was developed by a joint team from the two major international standardization organizations: the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG). For this reason, HEVC is often known by another name, H.265.
Standardization was a long process: a dozen meetings were held after the first meeting of the MPEG & VCEG Joint Collaborative Team on Video Coding (JCT-VC) in April 2010, and the standard was finally approved and published by both ISO/IEC and ITU-T in 2013 [1][2]. In January 2014, MPEG LA announced an HEVC patent portfolio license that is currently supported by 25 patent holders. However, quite a number of prominent H.264 licensors are still missing from the HEVC list, including Panasonic, Sony, Dolby Laboratories, Mitsubishi, Toshiba, Sharp, and Samsung. This means codec vendors (or cloud-gaming service providers who plan to develop their own HEVC implementations) might have to negotiate multiple, complicated HEVC licensing agreements.
HEVC is designed to double the coding efficiency of H.264 and to make significantly greater use of parallel processing architectures. According to the JCT-VC’s published evaluation results [5], HEVC can achieve an average bit rate reduction of around 35% at equal objective video quality (measured by Peak Signal-to-Noise Ratio, PSNR), or even 50% at equal subjective video quality (measured per ITU-R Rec. BT.500: Methodology for the Subjective Assessment of the Quality of Television Pictures). This is very attractive for broadcasters, on-demand content providers, cloud gaming companies, etc.: imagine that only half the bandwidth is required to achieve similar video quality, or that video quality becomes much better at the same bandwidth. However, there is no such thing as a free lunch, as we will discuss at the end of this article.
HEVC’s New High-level Syntax Feature for Low-Latency Applications
H.264 allows a picture to be divided into multiple consecutive regions in raster scan order. Such a region is called a slice, and a slice is self-decodable (it doesn’t reference macroblocks in other slices of the picture). Slices were designed for robustness: part of a picture is still reconstructable even if some slices are lost in transmission. However, encoding a frame in multiple slices comes with a bit cost overhead: each slice has its own slice header and restarts the CABAC context.
The Slice Segment (a.k.a. “Dependent Slice”) is a great HEVC feature designed specifically for low-latency applications. A dependent slice segment inherits the slice header and the CABAC context from the preceding slice segment, which makes it possible to packetize/stream part of a picture without incurring a bit cost penalty, as the sketch below illustrates.
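To make the overhead difference concrete, here is a minimal back-of-the-envelope sketch in Python. The header size and CABAC-restart figures are made-up placeholders (real overheads depend on the encoder and the content), and the tiny segment header that dependent slice segments still carry is ignored.

```python
# Back-of-the-envelope per-frame overhead, in bits (illustrative numbers only).
N_SEGMENTS = 8                      # regions per picture
HEADER_BITS = 80                    # hypothetical full slice-header size
CABAC_RESTART_PENALTY_BITS = 300    # hypothetical cost of resetting CABAC stats

def independent_slices_overhead(n):
    # H.264-style: every slice carries its own header and restarts CABAC.
    return n * (HEADER_BITS + CABAC_RESTART_PENALTY_BITS)

def dependent_segments_overhead(n):
    # HEVC dependent slice segments: only the first segment pays the full
    # header and a fresh CABAC context; the rest inherit both (their own
    # tiny segment headers are ignored in this sketch).
    return HEADER_BITS + CABAC_RESTART_PENALTY_BITS

print(independent_slices_overhead(N_SEGMENTS))   # 3040
print(dependent_segments_overhead(N_SEGMENTS))   # 380
```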
Why is the ability to stream a partial picture useful for reducing latency? Assume the network channel is inelastic (fixed bandwidth); then the overall latency from the moment the first pixel enters the encoder to the moment the last pixel is received by the decoder is
encode time + wait-to-transmit time + network latency = 2 × frame time + network latency
where the wait-to-transmit time is the time it takes for one frame’s worth of bitstream to fully enter the network channel.
Now suppose we can transmit part of a picture while the rest of it is still being encoded (the second case in the diagram below). The overall latency becomes
encode time (of one slice segment) + wait-to-transmit time + network latency = 1.x × frame time + network latency
where the fractional part 0.x can be as low as 1/(the number of slice segments in a picture), though it is often higher in practice. That is because the encoding time of a slice segment is proportional to its number of blocks (and hence fixed), whereas the wait-to-transmit time varies with the number of encoded bits (the complexities of different regions of a picture usually differ). Nevertheless, 0.x is a serious reduction from 1.
At 60 fps, even saving half a frame time (8.33 ms) is very significant, given that the overall latency budget is only 100 ms.
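The arithmetic above is easy to sanity-check in code. Here is a minimal sketch with a hypothetical network latency, assuming the encode time of a segment scales with its share of the picture and the channel drains exactly one frame’s worth of bits per frame time.

```python
FPS = 60
FRAME_TIME_MS = 1000.0 / FPS       # 16.67 ms per frame
NETWORK_LATENCY_MS = 30.0          # hypothetical one-way network latency

def whole_frame_latency():
    # Encode the full frame (1 frame time), then push its bits into the
    # inelastic channel (1 more frame time), then cross the network.
    return FRAME_TIME_MS + FRAME_TIME_MS + NETWORK_LATENCY_MS

def slice_segment_latency(n_segments):
    # Transmission starts as soon as the first segment is encoded
    # (1/n frame time); the channel still needs one full frame time
    # to drain the whole picture's bits.
    return FRAME_TIME_MS / n_segments + FRAME_TIME_MS + NETWORK_LATENCY_MS

print(round(whole_frame_latency(), 1))      # 63.3 -> 2 frame times + network
print(round(slice_segment_latency(8), 1))   # 48.8 -> 1.125 frame times + network
```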

One more thing worth noting here is HEVC’s support for parallel processing: tiles (rectangles of CTBs) and Wavefront Parallel Processing (WPP) units (rows of CTBs). Tiles and WPP can be exploited when implementing a codec on a multi-core hardware platform, so as to shorten encoding/decoding time. This may be important for realizing real-time encoding/decoding of, for example, Ultra HD, but it is more or less transparent to cloud gaming companies that don’t implement their own HEVC codec.
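For a rough sense of why WPP shortens the critical path: each CTB row can run on its own thread, as long as it stays at least two CTBs behind the row above it. A minimal sketch, assuming a uniform per-CTB processing time, computes the resulting finish time.

```python
def wpp_finish_time(rows, cols):
    """Finish time of the last CTB (in uniform per-CTB time units) when every
    CTB row runs on its own thread and each row must stay at least two CTBs
    behind the row above it (the WPP dependency)."""
    finish = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = [finish[r][c - 1]] if c > 0 else []   # left neighbor
            if r > 0:
                deps.append(finish[r - 1][min(c + 1, cols - 1)])  # 2-CTB lag
            finish[r][c] = (max(deps) if deps else 0) + 1
    return finish[-1][-1]

rows, cols = 17, 30                  # e.g., 1080p with 64x64 CTBs
print(rows * cols)                   # 510 units on a single core
print(wpp_finish_time(rows, cols))   # 62 units with one thread per row
```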
HEVC’s New Techniques for Higher Compression Efficiency
The biggest selling point of HEVC is that it doubles the compression ratio relative to H.264, which is very appealing for HD (1280×720 or 1920×1080) and Ultra HD (3840×2160) content. A large number of new techniques contribute to HEVC’s superior coding efficiency; the most useful technical features, in my understanding, are the following.
1. Coding Tree Block (CTB)
A CTB is analogous to MPEG-2’s or H.264’s 16×16 macroblock, but it can be 16×16, 32×32, or 64×64. A CTB contains a quadtree of smaller Coding Blocks (CBs), as illustrated below. Each CB can be independently divided into multiple square or rectangular Prediction Units (PUs), either Intra or Inter, and into a quadtree of Transform Units (TUs). The maximum PU and TU sizes are 64×64 and 32×32 respectively, significantly larger than H.264’s 16×16 and 8×8. Large block sizes are particularly beneficial for compressing HD or Ultra HD content.
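As a rough illustration of the quadtree decision, here is a minimal sketch in Python; `rd_cost` is a toy stand-in (block variance plus a made-up signaling cost) for the full rate-distortion evaluation a real encoder performs.

```python
import random
random.seed(0)

# Toy 64x64 "picture"; a real encoder would measure actual prediction error.
SIZE = 64
img = [[random.randint(0, 255) for _ in range(SIZE)] for _ in range(SIZE)]
MIN_CB = 8                     # smallest coding block size in HEVC

def rd_cost(x, y, size):
    # Hypothetical stand-in for rate-distortion cost: block variance plus a
    # made-up per-CB signaling cost. A real encoder tries prediction modes
    # and transforms here and weighs distortion against bits.
    px = [img[y + j][x + i] for j in range(size) for i in range(size)]
    mean = sum(px) / len(px)
    return sum((p - mean) ** 2 for p in px) / len(px) + 50.0

def split_cb(x, y, size):
    # Recursively decide the quadtree: keep this block as one CB, or split
    # it into four half-size CBs if that lowers the (toy) cost.
    node = {"x": x, "y": y, "size": size, "cost": rd_cost(x, y, size)}
    if size > MIN_CB:
        half = size // 2
        children = [split_cb(x + dx, y + dy, half)
                    for dy in (0, half) for dx in (0, half)]
        split_cost = sum(c["cost"] for c in children)
        if split_cost < node["cost"]:
            node.update(cost=split_cost, children=children)
    return node

tree = split_cb(0, 0, SIZE)    # quadtree for one 64x64 CTB
print(tree["size"], "children" in tree)
```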

2. 35 modes of Intra prediction
HEVC has 35 intra prediction modes in total: DC and Planar, which are similar to H.264’s, plus 33 directional (angular) modes that support the larger PU sizes. H.264 has only 8 directional intra prediction modes.
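For intuition, here is a minimal sketch of three of the 35 modes. Real HEVC angular prediction extrapolates along 33 directions, interpolating the reference samples at 1/32-sample accuracy, which is omitted here.

```python
def intra_predict(top, left, mode, n):
    """Minimal sketch of intra prediction for an n x n PU, given the row of
    n reconstructed neighbors above (`top`) and the column of n neighbors to
    the left (`left`). Only three of HEVC's 35 modes are shown."""
    if mode == "DC":                # average of the neighboring samples
        dc = (sum(top) + sum(left) + n) // (2 * n)
        return [[dc] * n for _ in range(n)]
    if mode == "vertical":          # copy the row above straight down
        return [top[:] for _ in range(n)]
    if mode == "horizontal":        # copy the left column straight across
        return [[left[r]] * n for r in range(n)]
    raise ValueError("mode not sketched here")

print(intra_predict([10, 20, 30, 40], [12, 14, 16, 18], "vertical", 4))
```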

3. Fractional sample interpolation
Both HEVC and H.264 use quarter-pixel motion vectors. However, fractional sample interpolation for luma samples in HEVC applies, separably, an 8-tap filter for the half-sample positions and a 7-tap filter for the quarter-sample positions. This contrasts with H.264, which uses a two-stage process: it first generates the values of one or two neighboring half-sample positions using 6-tap filtering and rounds the intermediate results, then averages two values at integer or half-sample positions.
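A minimal sketch of the single-stage HEVC approach follows. The tap values are the standard’s luma coefficients; the sample row and the border handling are simplified for illustration, and clipping to the sample range is omitted.

```python
# Luma interpolation filters from the HEVC standard (both sum to 64).
HALF_PEL_TAPS = [-1, 4, -11, 40, 40, -11, 4, -1]     # 8-tap, half-sample
QUARTER_PEL_TAPS = [-1, 4, -10, 58, 17, -5, 1]       # 7-tap, quarter-sample

def interpolate_row(samples, taps):
    # Separable filtering of one row of integer samples; `samples` is assumed
    # to already include the border extension the filter needs. Rounding
    # happens once at the end -- unlike H.264's two-stage process with
    # intermediate rounding.
    n = len(taps)
    return [(sum(t * s for t, s in zip(taps, samples[i:i + n])) + 32) >> 6
            for i in range(len(samples) - n + 1)]

row = [100, 102, 101, 99, 98, 100, 103, 105, 104, 102]
print(interpolate_row(row, HALF_PEL_TAPS))      # half-pel values
print(interpolate_row(row, QUARTER_PEL_TAPS))   # quarter-pel values
```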

4. Advanced Motion Vector Prediction (AMVP) and Merge mode
Unlike H.264, which has a single Motion Vector Predictor (MVP), an HEVC inter PU has a list of MV candidates, and an index encoded in the bitstream selects the final MVP. The candidate list is constructed by either AMVP or Merge mode. In AMVP, the MV candidates are derived from neighboring PUs, co-located PUs in reference pictures, and the zero MV. Merge mode is very similar to H.264’s Skip and Direct modes (spatial and temporal): the MV candidates are likewise inherited from neighboring PUs, co-located PUs in reference pictures, and the zero MV.
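The candidate-list idea can be sketched in a few lines. The strict neighbor positions, pruning rules, and MV scaling of the real AMVP/Merge derivation are simplified away here, and the example MVs and index are made up.

```python
def build_candidate_list(spatial_mvs, temporal_mv, list_size=2):
    """Minimal sketch of MV-candidate-list construction: spatial neighbors
    first, then the temporal (co-located) candidate, with duplicates pruned
    and zero-MV padding. Merge mode uses a larger list size than AMVP's 2."""
    candidates = []
    for mv in spatial_mvs + [temporal_mv]:
        if mv is not None and mv not in candidates:
            candidates.append(mv)
        if len(candidates) == list_size:
            return candidates
    while len(candidates) < list_size:
        candidates.append((0, 0))        # pad with the zero MV
    return candidates

# Hypothetical neighbors; the decoder reads only the index from the bitstream.
cands = build_candidate_list([(4, -2), (4, -2), (8, 0)], temporal_mv=(6, 1))
mvp_idx = 1                              # parsed from the bitstream
print(cands, "->", cands[mvp_idx])       # [(4, -2), (8, 0)] -> (8, 0)
```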
5. Sample Adaptive Offset (SAO)
After the Deblocking Filter, which in HEVC operates on an 8×8 grid (H.264 also has a deblocking filter, but on a 4×4 grid), an additional filter called Sample Adaptive Offset is applied on a per-CTB basis. For each CTB, the bitstream codes a filter type (band or edge) and four offset values. The purpose of SAO is to make the reconstructed CTB match the source image more closely, via additional filtering whose parameters can be determined by histogram analysis at the encoder side.
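Here is a minimal sketch of the band-offset variant; the band position, the four offsets, and the sample values are made-up examples, and the edge-offset variant (which classifies each sample by comparing it with two neighbors) is omitted.

```python
def sao_band_offset(samples, band_pos, offsets, bit_depth=8):
    """Minimal sketch of SAO's band-offset type for one CTB: the sample range
    is split into 32 equal bands, and the four consecutive bands starting at
    `band_pos` (signaled in the bitstream along with the four offsets) each
    get their offset added."""
    shift = bit_depth - 5              # 32 bands -> band index = value >> shift
    max_val = (1 << bit_depth) - 1
    out = []
    for v in samples:
        band = v >> shift
        if band_pos <= band < band_pos + 4:
            v = min(max(v + offsets[band - band_pos], 0), max_val)
        out.append(v)
    return out

# Hypothetical offsets nudging samples in bands 18..21 (values 144..175).
print(sao_band_offset([100, 150, 160, 200], band_pos=18, offsets=[2, 2, 1, 1]))
# -> [100, 152, 161, 200]
```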
Commercial Reality of Computation Cost and Bit-rate Reduction
Thanks to HEVC’s claimed 50% bit rate reduction, the majority of broadcast technology companies have been investing in HEVC for several years. In fact, many have stopped developing H.264 and are focusing their R&D entirely on HEVC.
Since 2012, I have seen more and more HEVC codec products and demos at the National Association of Broadcasters (NAB) show, the world’s largest trade show for broadcast technologies. At the recent NAB 2014 in April, most HEVC encoders were still software-based, and engineering effort was being spent on building real-time Ultra HD encoders (using GPU acceleration, multi-core parallel processing, smart search algorithms, and so on).
However, HEVC’s video quality has not yet reached the level the standard promises. From my conversations with a number of vendors, the typical HEVC bit rate reduction today is around 25-40% for 1080p and 2160p, and, more importantly, it comes at 5-10x the computation cost of H.264. Such a huge computation cost requires dedicating a server with a top-tier CPU to a single client, which makes the server and operating costs far too high to be acceptable. Given that monster computation budget, even a software H.264 encoder could deliver much better video quality than it does right now. Therefore, in its current state, HEVC is not yet useful for cloud gaming services.
Conclusion
HEVC of course has a number of interesting features, and I believe it will eventually become a technology that every cloud gaming company must commit engineering resources to experimenting with and prototyping. In particular, once mainstream high-end games move to Ultra HD, HEVC will be the only compression standard that can deliver the bitstream at today’s home Internet speeds (HEVC can encode 2160p30 at under 15 Mbps, whereas H.264 needs more than 30 Mbps). On the other hand, HEVC still has a long way to go to mature, and at this stage it is not very meaningful for the cloud gaming business. The two current major killers are the way-too-high computation cost and the lack of low-power ASIC implementations. However, judging from our historical experience with MPEG-2 and H.264, I feel practical HEVC encoding solutions will emerge and will eventually dominate the cloud-gaming industry, probably within a few years.