Future Compression Technologies

  

Over the last couple of years there has been a great increase in the use of video in digital form, driven by the popularity of the Internet. Video segments appear in Web pages, DVDs store video, and HDTV will use a digital video format for broadcast. To understand these video formats, we need to understand the characteristics of video and how they are used in defining a format.

Video is a sequence of images which are displayed in order. Each of these images is called a frame. We cannot notice small changes in the frames, such as a slight difference of colour, so video compression standards do not encode all the details in the video and some of the details are lost. This is called lossy compression. It is possible to get very high compression ratios when lossy compression is used.

Typically 24 to 30 frames are displayed on the screen every second, so a lot of information is repeated in consecutive frames. If a tree is displayed for one second, then up to 30 frames contain that tree. This redundancy can be used in compression by defining frames in terms of previous frames, so a frame can carry information like "move this part of the tree to this place". Frames can be compressed using only the information in that frame (intraframe) or using information in other frames as well (interframe). Intraframe coding allows random access operations like fast forward and provides fault tolerance. If a part of a frame is lost, the next intraframe and the frames after it can still be displayed, because they depend only on that intraframe.
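The difference between intraframe and interframe coding can be sketched in a few lines of Python. This is only an illustration, not the scheme of any standard: frames are assumed to be flat lists of pixel values, the names encode_sequence, decode_sequence and gop are hypothetical, and the coding is lossless so that the dependency structure stays easy to see.

```python
def encode_sequence(frames, gop):
    """Store every gop-th frame whole ("I") and the rest as pixel changes ("P")."""
    encoded = []
    for i, frame in enumerate(frames):
        if i % gop == 0:
            # Intraframe: uses only the information in this frame.
            encoded.append(("I", list(frame)))
        else:
            # Interframe: record only the pixels that differ from the previous frame.
            prev = frames[i - 1]
            changes = [(k, v) for k, (p, v) in enumerate(zip(prev, frame)) if p != v]
            encoded.append(("P", changes))
    return encoded


def decode_sequence(encoded):
    """Rebuild the frames; a "P" frame needs the previously decoded frame."""
    frames = []
    for kind, payload in encoded:
        if kind == "I":
            frame = list(payload)
        else:
            frame = list(frames[-1])
            for k, v in payload:
                frame[k] = v
        frames.append(frame)
    return frames
```

Because every "I" entry is self-contained, a decoder in this sketch could start at any "I" entry, which is the random access property described above.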

Every colour can be represented as a combination of red, green and blue, and images can be represented in this colour space. However, this colour space, called RGB, is not well suited to compression because it does not take human perception into account. In the YUV colour space, the Y component alone gives the greyscale image, while U and V carry the colour information. The human eye is more sensitive to changes in Y, and this is exploited in compression. YUV is also used by the NTSC, PAL and SECAM composite colour TV standards.
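As a rough sketch of how this is used, the Python below converts an RGB pixel to YUV and then halves the resolution of a colour plane. The weights are the commonly quoted ITU-R BT.601 values. The function names and the 2x2 averaging are assumptions made for illustration; the text above does not prescribe a particular subsampling scheme.

```python
def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel (components in the range 0..1) to YUV."""
    y = 0.299 * r + 0.587 * g + 0.114 * b  # luminance: the greyscale image
    u = 0.492 * (b - y)                    # blue colour difference
    v = 0.877 * (r - y)                    # red colour difference
    return y, u, v


def subsample_2x2(plane):
    """Average each 2x2 neighbourhood of a plane with even width and height."""
    height, width = len(plane), len(plane[0])
    return [
        [(plane[i][j] + plane[i][j + 1] + plane[i + 1][j] + plane[i + 1][j + 1]) / 4.0
         for j in range(0, width, 2)]
        for i in range(0, height, 2)
    ]
```

Keeping Y at full resolution while applying something like subsample_2x2 to the U and V planes reduces the data to be coded with little visible loss, since the eye is less sensitive to the colour components.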

Compression ratio is the ratio of the size of the original video to the size of the compressed video. To get better compression ratios, pixels are predicted from other pixels. In spatial prediction, a pixel is predicted from pixels of the same image; in temporal prediction, a pixel is predicted from a previously transmitted image. Hybrid coding consists of a prediction in the temporal dimension combined with a suitable decorrelation technique in the spatial domain. Motion compensation establishes a correspondence between elements of nearby images in the video sequence. The main application of motion compensation is providing a useful prediction for a given image from a reference image.
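One common way to find such a correspondence is exhaustive block matching, sketched below in Python. This is an illustrative assumption rather than the method of any particular standard: images are lists of rows of grey values, the cost measure is the sum of absolute differences, and all function and parameter names are hypothetical.

```python
def compression_ratio(original_size, compressed_size):
    """Size of the original video divided by the size of the compressed video."""
    return original_size / compressed_size


def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of cur at (bx, by)
    and the block of ref displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def best_motion_vector(cur, ref, bx, by, n, search):
    """Try every displacement within +/- search pixels and keep the cheapest."""
    height, width = len(ref), len(ref[0])
    best, best_cost = (0, 0), sad(cur, ref, bx, by, 0, 0, n)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidate blocks that fall outside the reference image.
            if by + dy < 0 or by + dy + n > height:
                continue
            if bx + dx < 0 or bx + dx + n > width:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost
```

The returned vector says where in the reference image the block came from. An encoder built this way would transmit the vector plus the (usually small) prediction error instead of the block itself.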

The DCT (Discrete Cosine Transform) is used in almost all of the standardised video coding algorithms. It is typically applied to each 8x8 block. A direct 8-point 1-D DCT requires 64 multiplications, and for an 8x8 block the row-column method applies 1-D DCTs to the 8 rows and then to the 8 columns. A fast 2-D DCT can be computed with 54 multiplications and 468 additions and shifts. The 2-D DCT is used in MPEG, and hardware is also available to perform it. After the DCT, the top left corner of the block typically holds the largest coefficients and the bottom right the smallest, which makes compression easier. The coefficients are numbered in a zig-zag order from the top left to the bottom right so that there will be many small coefficients at the end. The DCT coefficients are then divided by an integer quantisation value to reduce precision. After this division, the smaller coefficients can be lost if they are much smaller than the quantisation value. The coefficients are multiplied by the quantisation value again before the IDCT (inverse DCT).
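The steps above can be sketched in Python as follows. This is a minimal, unoptimised illustration: it uses the direct row-column DCT rather than a fast algorithm, an orthonormal scaling, and a single quantisation value for the whole block, all of which are assumptions made for brevity. The function names are hypothetical.

```python
import math

N = 8  # block size used in the text


def dct_1d(v):
    """Direct 8-point DCT-II with orthonormal scaling (N * N = 64 multiplications)."""
    out = []
    for k in range(N):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / N)
        out.append(scale * sum(v[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                               for n in range(N)))
    return out


def dct_2d(block):
    """Row-column 2-D DCT: transform the 8 rows, then the 8 columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d([rows[i][j] for i in range(N)]) for j in range(N)]
    return [[cols[j][i] for j in range(N)] for i in range(N)]


def zigzag_order():
    """(row, column) positions from the top left to the bottom right in zig-zag order."""
    return sorted(((i, j) for i in range(N) for j in range(N)),
                  key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]))


def quantise(coeffs, q):
    """Divide by the quantisation value and round; small coefficients become 0."""
    return [[round(c / q) for c in row] for row in coeffs]


def dequantise(levels, q):
    """Multiply by the quantisation value again, as done before the inverse DCT."""
    return [[level * q for level in row] for row in levels]
```

Reading the quantised block in zigzag_order() tends to leave a long run of zeros at the end of the scan, which is what the later entropy coding stage takes advantage of.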

 

MPEG-2


MPEG-2 is designed for diverse applications which require a bit rate of up to 100 Mbps. Digital high-definition TV (HDTV), DVD, interactive storage media (ISM) and cable TV (CATV) are sample applications. Multiple video formats can be used in MPEG-2 coding to support these diverse applications. MPEG-2 has bitstream scalability: it is possible to extract a lower-rate bitstream to get a lower resolution or frame rate. Decoding MPEG-2 is a costly process, and bitstream scalability allows flexibility in the processing power required for decoding. MPEG-2 is upward, downward, forward and backward compatible. Upward compatibility means the decoder can decode the pictures generated by a lower resolution encoder. Downward compatibility means that a decoder can decode the pictures generated by a higher resolution encoder. In a forward compatible system, a new generation decoder can decode the pictures generated by an existing encoder, and in a backward compatible system, existing decoders can decode the pictures generated by new encoders.

In MPEG-2 the input data is interlaced, since the standard is more oriented towards television applications. The video sequence layers are similar to those of MPEG-1; the only improvements are field/frame motion compensation, field/frame DCT processing, and scalability. Macroblocks in MPEG-2 have 2 additional chrominance blocks when the 4:2:2 input format is used. The 8x8 block size is retained in MPEG-2; in scaled formats, blocks can be 1x1, 2x2 or 4x4 for resolution enhancement. P and B frames have frame and field motion vectors.

MPEG-4


The success of digital television, interactive graphics applications and interactive multimedia encouraged the MPEG group to design MPEG-4, which allows the user to interact with the objects in the scene within the limits set by the author. It also brings multimedia to low bitrate networks.

MPEG-4 uses media objects to represent aural, visual or audiovisual content. Media objects can be synthetic, as in interactive graphics applications, or natural, as in digital television. These media objects can be combined to form compound media objects. MPEG-4 multiplexes and synchronises the media objects before transmission to provide quality of service, and it allows interaction with the constructed scene at the receiver's machine.

MPEG-4 organises the media objects in a hierarchical fashion where the lowest level has primitive media objects such as still images, video objects and audio objects. MPEG-4 has a number of primitive media objects which can be used to represent 2 or 3-dimensional media objects. MPEG-4 also defines a coded representation of objects for text, graphics, synthetic sound and talking synthetic heads.

MPEG-4 provides a standardised way to describe a scene. Media objects can be placed anywhere in the coordinate system. Transformations can be used to change the geometrical or acoustical appearance of a media object. Primitive media objects can be grouped to form compound media objects. Streamed data can be applied to media objects to modify their attributes, and the user's viewing and listening points can be changed to anywhere in the scene.

The visual part of the MPEG-4 standard describes methods for compression of images and video, compression of textures for texture mapping of 2-D and 3-D meshes, compression of implicit 2-D meshes, and compression of time-varying geometry streams that animate meshes. It also provides algorithms for random access to all types of visual objects, as well as algorithms for spatial, temporal and quality scalability and for content-based scalability of textures, images and video. Algorithms for error robustness and resilience in error prone environments are also part of the standard.

For synthetic objects, MPEG-4 has parametric descriptions of the human face and body, and parametric descriptions for animation streams of the face and body. MPEG-4 also describes static and dynamic mesh coding with texture mapping, and texture coding for view dependent applications.

MPEG-4 supports coding of video objects with spatial and temporal scalability. Scalability allows a decoder to decode only a part of a stream and construct images with reduced decoder complexity (and therefore reduced quality), reduced spatial resolution, reduced temporal resolution, or equal temporal and spatial resolution but reduced quality. Scalability is desirable when video is sent over heterogeneous networks, or when the receiver cannot display the video at full resolution.

Robustness in error prone environments is an important issue for mobile communications. MPEG-4 has 3 groups of tools for this. Resynchronisation tools enable the resynchronisation of the bitstream and the decoder when an error has been detected. After resynchronisation, data recovery tools are used to recover the lost data. These tools are techniques that encode the data in an error resilient way. Error concealment tools are used to conceal the lost data. Efficient resynchronisation is key to good data recovery and error concealment.

Fractal-based Coding


Fractal coding is a new and promising technique. In an image, the values of pixels that are close together are correlated, and transform coding takes advantage of this observation. Fractal compression takes advantage of a different observation: some image features, like straight edges and constant regions, are invariant when rescaled. Representing straight edges and constant regions efficiently using fractal coding is important because transform coders cannot take advantage of these types of spatial structures. Fractal coding tries to reconstruct the image by representing its regions as geometrically transformed versions of other regions in the same image.
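The core search step can be sketched in one dimension. The Python below is a simplified illustration under stated assumptions, not a complete fractal coder: the "image" is a single row of pixels, a larger "domain" block is shrunk to the size of the "range" block by averaging pairs, and the transformation is limited to a contrast scale s and a brightness offset o fitted by least squares. All names are hypothetical.

```python
def shrink(domain):
    """Halve the length of a domain block by averaging neighbouring pairs."""
    return [(domain[i] + domain[i + 1]) / 2.0 for i in range(0, len(domain), 2)]


def fit_affine(domain, rng):
    """Least-squares s and o such that s * domain + o approximates rng."""
    n = len(rng)
    mean_d = sum(domain) / n
    mean_r = sum(rng) / n
    var = sum((d - mean_d) ** 2 for d in domain)
    if var == 0:
        s = 0.0  # a flat domain block can only contribute a brightness offset
    else:
        s = sum((d - mean_d) * (r - mean_r) for d, r in zip(domain, rng)) / var
    o = mean_r - s * mean_d
    err = sum((s * d + o - r) ** 2 for d, r in zip(domain, rng))
    return s, o, err


def best_domain(signal, rng):
    """Search every double-sized block of signal for the best match to rng."""
    size = 2 * len(rng)
    best = None
    for pos in range(len(signal) - size + 1):
        s, o, err = fit_affine(shrink(signal[pos:pos + size]), rng)
        if best is None or err < best[3]:
            best = (pos, s, o, err)
    return best
```

In this sketch the encoder would store only the tuple (position, s, o) for each range block. Practical coders also restrict s so that repeatedly applying the stored transformations converges during decoding.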

See Fractal Compression

 

Model-based Video Coding


Model based schemes define three dimensional structural models of the scene. The coder and decoder use the same object model: the coder uses it to analyse the image, and the decoder uses it to generate the image. Traditionally, research in model based video coding has focused on head modeling, head tracking, local motion tracking, and expression analysis and synthesis. Model based video coding has been used mainly for video conferencing and video telephony, since it is mostly the human head that is modeled. The work has concentrated on modeling images like the head and shoulders because it is impossible to model every object that may be in the scene. There is a lot of interest in applications such as speech driven image animation of talking heads and virtual space teleconferencing.

In model-based approaches a parametrised model is used for each object in the scene. Coding and transmission are done using the parameters associated with the objects. Tools from image analysis and computer vision are used to analyse the images and find the parameters. This analysis provides information on several parameters such as the size, location and motion of the objects in the scene. Results have shown that it is possible to get good visual quality at rates as low as 16 kbps.

Scalable Video Coding

 

Multimedia communication systems may have nodes with limited computation power available for decoding, and may run over heterogeneous networks such as a combination of wired and wireless networks. In these cases we need the ability to decode at a variety of bit rates. Scalable coders have this property. Layered multicast has been proposed as a way to provide scalability in video communication systems.

MPEG-2 has basic mechanisms to achieve scalability, but they are limited. Spatiotemporal resolution pyramids are a promising approach to scalable video coding. Open loop and closed loop pyramid coders both provide efficient video coding and allow the inclusion of multiscale motion compensation. Simple filters can be used for the spatial downsampling and interpolation operations, so fast and efficient codecs can be implemented. Morphological filters can also be used to improve image quality.

Pyramid coders have a multistage quantisation scheme. Allocating bits to the various quantisers according to the image is important for efficient compression. Optimal bit allocation is computationally infeasible when pyramids with more than two layers are used. Closed loop pyramid coders are better suited to practical applications than open loop pyramid coders, since they are less sensitive to suboptimal bit allocations and simple heuristics can be used.
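A two-layer closed loop spatial pyramid can be sketched in Python as follows. The filters are deliberately simple (2x2 averaging for downsampling, pixel replication for interpolation), the quantisers are uniform, and the function names and step sizes are assumptions made for illustration. The point of the closed loop is that the enhancement layer is computed from the quantised base layer, so it also corrects the base layer's quantisation error.

```python
def downsample(img):
    """Halve the resolution by averaging each 2x2 neighbourhood."""
    return [[(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
             for j in range(0, len(img[0]), 2)]
            for i in range(0, len(img), 2)]


def upsample(img):
    """Double the resolution by repeating every pixel in a 2x2 square."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def quantise(img, step):
    """Uniform quantiser: snap every value to the nearest multiple of step."""
    return [[round(p / step) * step for p in row] for row in img]


def encode_closed_loop(img, base_step, enh_step):
    """Return (base layer, enhancement layer) for a two-layer pyramid."""
    base = quantise(downsample(img), base_step)
    prediction = upsample(base)  # what a base-only decoder would display
    residual = [[p - q for p, q in zip(row, prow)]
                for row, prow in zip(img, prediction)]
    return base, quantise(residual, enh_step)


def decode(base, enh=None):
    """Decode the base layer alone, or refine it with the enhancement layer."""
    prediction = upsample(base)
    if enh is None:
        return prediction
    return [[p + e for p, e in zip(prow, erow)]
            for prow, erow in zip(prediction, enh)]
```

A receiver with limited bandwidth or processing power can call decode(base) and get a coarse full-size picture; one that also receives the enhancement layer calls decode(base, enh) for the refined picture.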

There are several ways to utilise multistage motion compensation. One way is to compute motion vectors efficiently and then encode them by hierarchical group estimation. When video is sent over heterogeneous networks, scalability offers a way to reduce the bit rate of the video data in case of congestion. By using priorities, the network layer can reduce the bitrate without knowing the content of the packets or informing the sender.

 

Wavelet-based Coding


Wavelet transform techniques have been investigated for low bit rate coding. Wavelet based coding can perform better than traditional DCT based coding: much lower bit rates with reasonable performance have been reported when these techniques are applied to still images. A combination of the wavelet transform and vector quantisation gives better performance still. The wavelet transform decomposes the image into a multi frequency channel representation, each component of which has its own frequency characteristics and spatial orientation features that can be used efficiently for coding. Wavelet based coding has two main advantages: it is highly scalable, and a fully embedded bitstream can be easily generated. The main advantage over standard techniques such as MPEG is that video construction is achieved in a fully embedded fashion, so the encoding and decoding process can stop at a predetermined bit rate. The encoded stream can be scaled to produce the desired spatial resolution and frame rate as well as the required bit rate. Vector quantisation makes use of the correlation and the redundancy between nearby pixels or between frequency bands. The wavelet transform with vector quantisation exploits the residual correlation among different layers of the wavelet transform domain, using block rearrangement to improve the coding efficiency. Further improvements can be made by developing adaptive threshold techniques for classification based on the contrast sensitivity characteristics of the human visual system. Joint coding of the wavelet transform with trellis coded quantisation as a form of joint source channel coding is an area to be considered.
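The multi-resolution property can be illustrated with the simplest wavelet, the Haar wavelet, applied to a one-dimensional signal. This Python sketch is an assumption made for illustration: the text above does not name a particular wavelet, and real coders work on two-dimensional images and add quantisation. The signal length is assumed to be a power of two and the function names are hypothetical.

```python
def haar_step(signal):
    """Split a signal into a half-length average band and a half-length detail band."""
    approx = [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Apply haar_step repeatedly; detail bands are returned coarsest first."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.insert(0, detail)
    return approx, details


def haar_reconstruct(approx, details):
    """Rebuild the signal from the bands supplied, coarsest detail band first."""
    signal = list(approx)
    for detail in details:
        rebuilt = []
        for a, d in zip(signal, detail):
            rebuilt.extend([a + d, a - d])
        signal = rebuilt
    return signal
```

If the bands are transmitted coarsest first, the decoder can stop after any band: passing only the first few detail bands to haar_reconstruct yields a lower resolution version of the signal, and passing all of them restores it exactly.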

Additional video coding research has applied the wavelet transform to very low bit rate communication channels. The efficiency of motion compensated prediction can be improved by overlapped motion compensation, in which the candidate regions from the previous frame are windowed to obtain a pixel value in the predicted frame. Since the wavelet transform generates multiple frequency bands, multifrequency motion estimation is available for the transformed frame. It also provides a representation of the global motion structure. In addition, the motion vectors in lower frequency bands are predicted with the more specific details of higher frequency bands. This hierarchical motion estimation can also be implemented with a segmentation technique that utilises edge boundaries from the zero crossing points in the wavelet transform domain. Each frequency band can be classified into temporal activity macroblocks or no temporal activity macroblocks. The lowest band may be coded using DCT, and the other bands may be coded using vector quantisation or trellis coded quantisation.
