Future Compression Technologies
Over the last couple of years the use of video in digital form has increased greatly, driven by the popularity of the Internet. Video segments appear in Web pages, DVDs store video, and HDTV will use a video format for broadcast. To understand the video formats, we need to understand the characteristics of video and how they are used in defining a format.
Video is a sequence of images displayed in order. Each of these images is called a frame. Viewers cannot notice small changes in a frame, such as a slight difference in colour, so video compression standards do not encode every detail of the video; some of the detail is lost. This is called lossy compression, and it makes very high compression ratios possible.
Typically 24 to 30 frames are displayed on the screen every second, so a lot of information is repeated in consecutive frames. If a tree is displayed for one second at 30 frames per second, then 30 frames contain that tree. Compression exploits this redundancy by defining frames in terms of previous frames, so a frame can carry information like "move this part of the tree to this place". Frames can be compressed using only the information in that frame (intraframe coding) or using information in other frames as well (interframe coding). Intraframe coding allows random access operations like fast forward and provides fault tolerance: if part of a frame is lost, the next intraframe and the frames after it can still be displayed, because those frames depend only on that intraframe.
Every colour can be represented as a combination of red, green and blue, and images can be represented in this colour space as well. However, this colour space, called RGB, is not well suited to compression because it does not take human perception into account. The YUV colour space separates brightness from colour: the Y component alone gives the greyscale image, while U and V carry the colour information. The human eye is more sensitive to changes in Y, and compression makes use of this.
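The separation of brightness from colour described above can be illustrated with a small conversion routine. This is only a sketch: the text does not give conversion coefficients, so the weights below are an assumption (the commonly quoted ITU-R BT.601 luma weights with the classic analogue U and V scale factors), and the function names are hypothetical.

```python
def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel to YUV.

    Assumed coefficients: BT.601 luma weights, with U and V as scaled
    colour differences (B - Y) and (R - Y).
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b   # greyscale (luminance) value
    u = 0.492 * (b - y)                     # blue colour difference
    v = 0.877 * (r - y)                     # red colour difference
    return y, u, v


def yuv_to_rgb(y, u, v):
    """Invert rgb_to_yuv by solving the three equations for R, G and B."""
    r = y + v / 0.877
    b = y + u / 0.492
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b


if __name__ == "__main__":
    # A grey pixel carries all of its information in Y; U and V are zero.
    print(rgb_to_yuv(128, 128, 128))
    # A saturated colour has non-zero U and V.
    print(rgb_to_yuv(255, 0, 0))
```

Because a grey pixel has U and V equal to zero, dropping U and V from any image leaves exactly the greyscale picture, which is the property the text relies on.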
YUV is also used by the NTSC, PAL and SECAM composite colour TV standards.
Compression ratio is the ratio of the size of the original video to the size of the compressed video. To get better compression ratios, pixels are predicted from other pixels. In spatial prediction a pixel is predicted from pixels of the same image; in temporal prediction it is predicted from a previously transmitted image. Hybrid coding consists of prediction in the temporal dimension combined with a suitable decorrelation technique in the spatial domain. Motion compensation establishes a correspondence between elements of nearby images in the video sequence. Its main application is providing a useful prediction for a given image from a reference image.
The DCT (Discrete Cosine Transform) is used in almost all of the standardised video coding algorithms, typically on each 8x8 block. A direct 8-point 1-D DCT requires 64 multiplications, and the row-column method for an 8x8 block needs 16 of them (one per row, then one per column). A fast algorithm that works directly on the 2-D block requires 54 multiplications and 468 additions and shifts. The 2-D DCT is used in MPEG, and hardware is also available to compute it. After the DCT, the top left corner of the block usually holds the largest coefficients (the lowest spatial frequencies) and the bottom right the smallest, which makes compression easier. The coefficients are numbered in zig-zag order from the top left to the bottom right, so that many small coefficients end up at the end of the sequence. The DCT coefficients are then divided by an integer quantisation value to reduce their precision. This division can lose the smaller coefficients entirely if they are much smaller than the quantisation value. Before the IDCT (inverse DCT), the coefficients are multiplied by the quantisation value again.
MPEG-2
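The 8x8 DCT, quantisation and zig-zag scan described above, which MPEG-2 retains, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the procedure of any particular standard: it uses a direct (slow) DCT, a single quantisation value `Q = 16` chosen arbitrarily, and a made-up smooth test block. All function names are hypothetical.

```python
import math

N = 8  # block size


def dct_1d(x):
    """Direct N-point DCT-II with orthonormal scaling (no fast algorithm)."""
    out = []
    for k in range(N):
        scale = math.sqrt((1 if k == 0 else 2) / N)
        out.append(scale * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                               for n in range(N)))
    return out


def idct_1d(c):
    """Inverse of dct_1d."""
    out = []
    for n in range(N):
        out.append(sum(math.sqrt((1 if k == 0 else 2) / N) * c[k]
                       * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                       for k in range(N)))
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def row_column(block, transform):
    """Apply a 1-D transform along every row, then along every column."""
    rows = [transform(r) for r in block]
    return transpose([transform(c) for c in transpose(rows)])


def dct_2d(block):
    return row_column(block, dct_1d)


def idct_2d(coeffs):
    return row_column(coeffs, idct_1d)


def zigzag_positions():
    """(row, col) pairs from top left to bottom right along anti-diagonals."""
    order = []
    for s in range(2 * N - 1):
        diag = [(r, s - r) for r in range(N) if 0 <= s - r < N]
        order.extend(diag if s % 2 else reversed(diag))
    return order


def quantise(coeffs, q):
    """Divide by the quantisation value and round: small coefficients become 0."""
    return [[round(c / q) for c in row] for row in coeffs]


def dequantise(levels, q):
    """Multiply back by the quantisation value before the inverse DCT."""
    return [[v * q for v in row] for row in levels]


# Hypothetical smooth test block (a gentle brightness gradient).
block = [[100 + 2 * r + 3 * c for c in range(N)] for r in range(N)]
Q = 16  # assumed quantisation value, for illustration only

levels = quantise(dct_2d(block), Q)
scan = [levels[r][c] for r, c in zigzag_positions()]
restored = idct_2d(dequantise(levels, Q))
max_error = max(abs(restored[r][c] - block[r][c])
                for r in range(N) for c in range(N))

print("zig-zag scan:", scan)
print("largest pixel error after quantisation:", max_error)
```

For a smooth block like this one, the non-zero values cluster at the start of the zig-zag scan and the rest of the scan is zeros, which is what makes the subsequent coding step efficient. The restored block differs slightly from the original, showing where the loss in lossy compression comes from.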
In MPEG-2 the input data is interlaced, since the standard is more oriented towards television applications. The video sequence layers are similar to those of MPEG-1; the only improvements are field/frame motion compensation and DCT processing, and scalability. Macroblocks in MPEG-2 have 2 additional chrominance blocks when the 4:2:2 input format is used. The 8x8 block size is retained in MPEG-2; in scaled format, blocks can be 1x1, 2x2 or 4x4 for resolution enhancement. P and B frames have frame and field motion vectors.
MPEG-4
MPEG-4 uses media objects to represent aural, visual or audiovisual content. Media objects can be synthetic, as in interactive graphics applications, or natural, as in digital television. These media objects can be combined to form compound media objects. MPEG-4 multiplexes and synchronises the media objects before transmission to provide quality of service, and it allows interaction with the constructed scene at the receiver's machine. MPEG-4 organises the media objects in a hierarchical fashion, where the lowest level has primitive media objects like still images, video objects and audio objects. MPEG-4 has a number of primitive media objects which can be used to represent 2- or 3-dimensional media objects. MPEG-4 also defines a coded representation of objects for text, graphics, synthetic sound and talking synthetic heads.
MPEG-4 provides a standardised way to describe a scene. Media objects can be placed anywhere in the coordinate system. Transformations can be used to change the geometrical or acoustical appearance of a media object. Primitive media objects can be grouped to form compound media objects. Streamed data can be applied to media objects to modify their attributes, and the user's viewing and listening points can be changed to anywhere in the scene.
The visual part of the MPEG-4 standard describes methods for compression of images and video, compression of textures for texture mapping of 2-D and 3-D meshes, compression of implicit 2-D meshes, and compression of time-varying geometry streams that animate meshes. It also provides algorithms for random access to all types of visual objects, as well as algorithms for spatial, temporal and quality scalability and for content-based scalability of textures, images and video. Algorithms for error robustness and resilience in error-prone environments are also part of the standard. For synthetic objects, MPEG-4 has parametric descriptions of the human face and body, and parametric descriptions for animation streams of the face and body.
MPEG-4 also describes static and dynamic mesh coding with texture mapping, and texture coding for view-dependent applications. MPEG-4 supports coding of video objects with spatial and temporal scalability. Scalability allows decoding a part of a stream and constructing images with reduced decoder complexity (reduced quality), reduced spatial resolution, reduced temporal resolution, or with equal temporal and spatial resolution but reduced quality. Scalability is desirable when video is sent over heterogeneous networks, or when the receiver cannot display the video at full resolution.
Robustness in error-prone environments is an important issue for mobile communications. MPEG-4 has 3 groups of tools for this. Resynchronisation tools enable the resynchronisation of the bitstream and the decoder when an error has been detected. After resynchronisation, data recovery tools are used to recover the lost data; these tools are techniques that encode the data in an error-resilient way. Error concealment tools are used to conceal the lost data. Efficient resynchronisation is key to good data recovery and error concealment.
Fractal-based Coding
Model-based Video Coding
In model-based approaches a parametrised model is used for each object in the scene. Coding and transmission are done using the parameters associated with the objects. Tools from image analysis and computer vision are used to analyse the images and find the parameters. This analysis provides information on several parameters, such as the size, location and motion of the objects in the scene. Results have shown that it is possible to get good visual quality at rates as low as 16 kbps.
Scalable Video Coding
Multimedia communication systems may have nodes with limited computation power available for decoding, and heterogeneous networks such as combinations of wired and wireless networks. In these cases we need the ability to decode at a variety of bit rates. Scalable coders have this property. Layered multicast has been proposed as a way to provide scalability in video communication systems. MPEG-2 has basic mechanisms to achieve scalability, but they are limited.
Spatiotemporal resolution pyramids are a promising approach to scalable video coding. Both open-loop and closed-loop pyramid coders provide efficient video coding and can include multiscale motion compensation. Simple filters can be used for the spatial downsampling and interpolation operations, so fast and efficient codecs can be implemented. Morphological filters can also be used to improve image quality. Pyramid coders use a multistage quantisation scheme. Allocating bits to the various quantisers according to the image content is important for efficient compression. Optimal bit allocation is computationally infeasible when pyramids with more than two layers are used. Closed-loop pyramid coders are better suited to practical applications than open-loop pyramid coders, since they are less sensitive to suboptimal bit allocations and simple heuristics can be used. There are several ways to utilise multistage motion compensation. Efficiently computing motion vectors and then encoding them by hierarchical group estimation is one way.
When video is sent over heterogeneous networks, scalability offers a way to reduce the bit rate of the video data in case of congestion. By using priorities, the network layer can reduce the bit rate without knowing the content of the packets or informing the sender.
Wavelet-based Coding
Additional video coding research applies the wavelet transform to very low bit rate communication channels. The efficiency of motion compensated prediction can be improved by overlapped motion compensation, in which the candidate regions from the previous frame are windowed to obtain a pixel value in the predicted frame. Since the wavelet transform generates multiple frequency bands, multifrequency motion estimation is available for the transformed frame. It also provides a representation of the global motion structure. In addition, the motion vectors in lower frequency bands are predicted with the more specific details of higher frequency bands. This hierarchical motion estimation can also be implemented with a segmentation technique that utilises edge boundaries from the zero crossing points in the wavelet transform domain. The macroblocks in each frequency band can be classified as having temporal activity or no temporal activity. The lowest band may be coded using DCT, and the other bands may be coded using vector quantisation or trellis coded quantisation.
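To make the idea of "multiple frequency bands" concrete, the following sketch performs one level of a 2-D wavelet decomposition. The text does not say which wavelet is used, so the Haar wavelet (pairwise averages and differences) is assumed here because it is the simplest. The function names and the tiny test image are hypothetical, and the image sides are assumed to be even.

```python
def haar_1d(x):
    """One Haar analysis step on a sequence of even length.

    Returns (low, high): pairwise averages and pairwise half-differences.
    """
    low = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return low, high


def split_columns(m):
    """Apply haar_1d down every column of m; return (low, high) matrices."""
    cols = [haar_1d(list(c)) for c in zip(*m)]
    low = [list(r) for r in zip(*(c[0] for c in cols))]
    high = [list(r) for r in zip(*(c[1] for c in cols))]
    return low, high


def haar_2d(img):
    """One decomposition level, returning four bands (ll, lh, hl, hh).

    First letter: filter applied along rows; second: along columns.
    ll is a half-resolution version of the image, the others hold detail.
    """
    row_low, row_high = [], []
    for row in img:
        lo, hi = haar_1d(row)
        row_low.append(lo)
        row_high.append(hi)
    ll, lh = split_columns(row_low)
    hl, hh = split_columns(row_high)
    return ll, lh, hl, hh


# Hypothetical 4x4 image with a vertical edge in the last column.
image = [[10, 10, 10, 50] for _ in range(4)]
ll, lh, hl, hh = haar_2d(image)
print("ll:", ll)
print("lh:", lh)
print("hl:", hl)
print("hh:", hh)
```

Applying `haar_2d` again to the `ll` band gives the next, coarser level. Repeating this builds the hierarchy of bands over which motion estimation can proceed from coarse to fine.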