Carriage of AV1 in MPEG-2 TS

AOM Working Group Draft,

This version:
https://AOMediaCodec.github.io/av1-mpeg2-ts
Issue Tracking:
GitHub
Editors:
Kieran Kunhya
Thibaud Biatek
Warning

This specification is still at draft stage and should not be referenced other than as a working draft.

Copyright 2021, AOM

Licensing information is available at http://aomedia.org/license/

The MATERIALS ARE PROVIDED “AS IS.” The Alliance for Open Media, its members, and its contributors expressly disclaim any warranties (express, implied, or otherwise), including implied warranties of merchantability, non-infringement, fitness for a particular purpose, or title, related to the materials. The entire risk as to implementing or otherwise using the materials is assumed by the implementer and user. IN NO EVENT WILL THE ALLIANCE FOR OPEN MEDIA, ITS MEMBERS, OR CONTRIBUTORS BE LIABLE TO ANY OTHER PARTY FOR LOST PROFITS OR ANY FORM OF INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY CHARACTER FROM ANY CAUSES OF ACTION OF ANY KIND WITH RESPECT TO THIS DELIVERABLE OR ITS GOVERNING AGREEMENT, WHETHER BASED ON BREACH OF CONTRACT, TORT (INCLUDING NEGLIGENCE), OR OTHERWISE, AND WHETHER OR NOT THE OTHER MEMBER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


Abstract

This document specifies how to carry [AV1] video elementary streams in the MPEG-2 Transport Stream format.

1. Introduction

This document specifies how to carry [AV1] video elementary streams in the MPEG-2 Transport Stream format [MPEG-2-TS]. It does not specify the presentation of AV1 streams in the context of a program stream.

This document defines the carriage of AV1 in a single PID, assuming buffer model info from the first operating point. It may not be optimal for layered streams or streams with multiple operating points. Future versions may incorporate this capability.

In the present document "shall", "shall not", "should", "should not", "may", "need not", "will", "will not", "can" and "cannot" are to be interpreted as described in clause 3.2 of the ETSI Drafting Rules (Verbal forms for the expression of provisions).

1.2. Definition of mnemonics and syntax function

In the present document the mnemonics, the syntax functions, and the syntax descriptors are to be interpreted as described in [MPEG-2-TS]. The uimsbf and bslbf mnemonics are defined in Section 2.2.6 of [MPEG-2-TS]. The nextbits() function is interpreted as in [MPEG-2-TS].

2. Identifying AV1 streams in MPEG-2 TS

2.1. AV1 registration descriptor

The presence of a Registration Descriptor, as defined in [MPEG-2-TS], is mandatory with the format_identifier field set to 'AV01' (A-V-0-1). The Registration Descriptor shall be the first in the PMT loop and included before the AV1 video descriptor.

2.1.1. Syntax

Syntax                  No. of bits  Mnemonic
registration_descriptor() {
    descriptor_tag      8            uimsbf
    descriptor_length   8            uimsbf
    format_identifier   32           uimsbf
}

2.1.2. Semantics

descriptor_tag - This value shall be set to 0x05.

descriptor_length - This value shall be set to 4.

format_identifier - This value shall be set to 'AV01' (A-V-0-1).
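As an informal illustration of the semantics above, the following Python sketch serializes and recognizes the six-byte registration descriptor. The function names are hypothetical and chosen only for this example; they are not part of this specification.

```python
import struct

# format_identifier is the four ASCII characters 'A', 'V', '0', '1'.
AV1_FORMAT_IDENTIFIER = b"AV01"


def build_registration_descriptor() -> bytes:
    """Serialize the registration descriptor for an AV1 stream (hypothetical helper)."""
    descriptor_tag = 0x05
    descriptor_length = 4  # only the 32-bit format_identifier follows
    return struct.pack(">BB4s", descriptor_tag, descriptor_length,
                       AV1_FORMAT_IDENTIFIER)


def is_av1_registration_descriptor(data: bytes) -> bool:
    """Return True if data starts with an AV1 registration descriptor."""
    if len(data) < 6:
        return False
    tag, length, fmt = struct.unpack(">BB4s", data[:6])
    return tag == 0x05 and length == 4 and fmt == AV1_FORMAT_IDENTIFIER


if __name__ == "__main__":
    print(build_registration_descriptor().hex())
```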

2.2. AV1 video descriptor

The AV1 video descriptor provides basic information for identifying coding parameters, such as profile and level parameters of an AV1 video stream. The same data structure as AV1CodecConfigurationRecord in ISOBMFF is used to aid conversion between the two formats, except that two of the reserved bits are used for HDR/WCG identification. The syntax and semantics for this descriptor appear in the table below and in the subsequent text.

If an AV1 video descriptor is associated with an AV1 video stream, then this descriptor shall be conveyed in the descriptor loop for the respective elementary stream entry in the program map table.

2.2.1. Syntax

Syntax                                          No. of bits  Mnemonic
AV1_video_descriptor() {
    descriptor_tag                              8            uimsbf
    descriptor_length                           8            uimsbf
    marker                                      1            bslbf
    version                                     7            uimsbf
    seq_profile                                 3            uimsbf
    seq_level_idx_0                             5            uimsbf
    seq_tier_0                                  1            bslbf
    high_bitdepth                               1            bslbf
    twelve_bit                                  1            bslbf
    monochrome                                  1            bslbf
    chroma_subsampling_x                        1            bslbf
    chroma_subsampling_y                        1            bslbf
    chroma_sample_position                      2            uimsbf
    hdr_wcg_idc                                 2            uimsbf
    reserved_zeros                              1            bslbf
    initial_presentation_delay_present          1            bslbf
    if (initial_presentation_delay_present) {
        initial_presentation_delay_minus_one    4            uimsbf
    } else {
        reserved_zeros                          4            uimsbf
    }
}

2.2.2. Semantics

descriptor_tag - This value shall be set to 0x80.

descriptor_length - This value shall be set to 4.

marker - This value shall be set to 1.

version - This field indicates the version of the AV1_video_descriptor. This value shall be set to 1.

seq_profile, seq_level_idx_0 and high_bitdepth - These fields shall be coded according to the semantics defined in [AV1]. If these fields are not coded in the Sequence Header OBU in the AV1 video stream, the inferred values are coded in the descriptor.

seq_tier_0, twelve_bit, monochrome, chroma_subsampling_x, chroma_subsampling_y, chroma_sample_position - These fields shall be coded according to the same semantics when they are present. If they are not present, they will be coded using the value inferred by the semantics.

hdr_wcg_idc - The value of this syntax element indicates the presence or absence of high dynamic range (HDR) and/or wide color gamut (WCG) video components in the associated PID according to the table below. HDR is defined as video whose stream EOTF is higher than the reference EOTF defined in [BT-1886]. WCG is defined as video that is coded using colour primaries with a colour gamut not contained within [BT-709].

hdr_wcg_idc  Description
0            SDR, i.e., video is based on the reference EOTF defined in [BT-1886] with a color gamut that is contained within [BT-709] with a [BT-709] container
1            WCG only, i.e., video color gamut in a [BT-2020] container that exceeds [BT-709]
2            Both HDR and WCG are to be indicated in the stream
3            No indication made regarding HDR/WCG or SDR characteristics of the stream

reserved_zeros - These bits shall be set to zero.

initial_presentation_delay_present - Indicates whether the initial_presentation_delay_minus_one field is present.

initial_presentation_delay_minus_one - Ignored for [MPEG-2-TS] use, included only to aid conversion to/from ISOBMFF.
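The bit layout of the syntax table can be sketched as follows. This is a minimal, non-normative Python illustration: the class name, default values and methods are assumptions made for the example, and no range checking is performed on the field values.

```python
from dataclasses import dataclass


@dataclass
class Av1VideoDescriptor:
    """Hypothetical container for the AV1_video_descriptor() fields."""
    seq_profile: int
    seq_level_idx_0: int
    seq_tier_0: int = 0
    high_bitdepth: int = 0
    twelve_bit: int = 0
    monochrome: int = 0
    chroma_subsampling_x: int = 1
    chroma_subsampling_y: int = 1
    chroma_sample_position: int = 0
    hdr_wcg_idc: int = 0
    initial_presentation_delay_present: int = 0
    initial_presentation_delay_minus_one: int = 0

    def to_bytes(self) -> bytes:
        b0 = (1 << 7) | 1  # marker = 1, version = 1
        b1 = (self.seq_profile << 5) | self.seq_level_idx_0
        b2 = ((self.seq_tier_0 << 7) | (self.high_bitdepth << 6)
              | (self.twelve_bit << 5) | (self.monochrome << 4)
              | (self.chroma_subsampling_x << 3)
              | (self.chroma_subsampling_y << 2)
              | self.chroma_sample_position)
        present = self.initial_presentation_delay_present
        # The low four bits are reserved zeros when the delay is absent.
        delay = self.initial_presentation_delay_minus_one if present else 0
        # Bit 5 of the last byte is the single reserved_zeros bit.
        b3 = (self.hdr_wcg_idc << 6) | (present << 4) | delay
        return bytes([0x80, 4, b0, b1, b2, b3])

    @classmethod
    def from_bytes(cls, data: bytes) -> "Av1VideoDescriptor":
        tag, length, b0, b1, b2, b3 = data[:6]
        if tag != 0x80 or length != 4 or b0 != 0x81:
            raise ValueError("not a version-1 AV1 video descriptor")
        present = (b3 >> 4) & 1
        return cls(
            seq_profile=b1 >> 5,
            seq_level_idx_0=b1 & 0x1F,
            seq_tier_0=(b2 >> 7) & 1,
            high_bitdepth=(b2 >> 6) & 1,
            twelve_bit=(b2 >> 5) & 1,
            monochrome=(b2 >> 4) & 1,
            chroma_subsampling_x=(b2 >> 3) & 1,
            chroma_subsampling_y=(b2 >> 2) & 1,
            chroma_sample_position=b2 & 0x3,
            hdr_wcg_idc=b3 >> 6,
            initial_presentation_delay_present=present,
            initial_presentation_delay_minus_one=(b3 & 0x0F) if present else 0,
        )
```

For instance, under these assumptions an 8-bit 4:2:0 stream with seq_profile 0 and seq_level_idx_0 8 serializes to the bytes 80 04 81 08 0C 00.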

3. Constraints on AV1 streams in MPEG-2 TS

3.1. General constraints

For AV1 video streams, the following constraints apply:

In addition, a start code insertion and emulation prevention process shall be performed on the AV1 Bitstream prior to its PES encapsulation. This process is described in section 3.2.

3.2. Start-code based format

Prior to carriage into PES, the AV1 open_bitstream_unit() is encapsulated into ts_open_bitstream_unit(). This is required to provide direct access to OBU through a start-code mechanism inserted prior to each OBU. The following syntax describes how to retrieve the open_bitstream_unit() from the ts_open_bitstream_unit() (tsOBU).

Syntax                                                            No. of bits  Mnemonic
ts_open_bitstream_unit(NumBytesInTsObu) {
    obu_start_code /* equal to 0x000001 */                        24           uimsbf
    NumBytesInObu = 0
    for( i = 2; i < NumBytesInTsObu; i++ ) {
        if( i + 2 < NumBytesInTsObu && nextbits(24) == 0x000003 ) {
            open_bitstream_unit[NumBytesInObu++]                  8            uimsbf
            open_bitstream_unit[NumBytesInObu++]                  8            uimsbf
            i += 2
            emulation_prevention_three_byte /* equal to 0x03 */   8            uimsbf
        } else
            open_bitstream_unit[NumBytesInObu++]                  8            uimsbf
    }
}

obu_start_code - This value shall be set to 0x000001.

open_bitstream_unit[i] - i-th byte of the AV1 open bitstream unit (as defined in section 5.3 of [AV1]).
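The extraction described by the syntax above can be sketched in Python as follows. This is a non-normative illustration: the function name is hypothetical, and the input is assumed to be exactly one ts_open_bitstream_unit(), already delimited by the caller and starting with its obu_start_code.

```python
OBU_START_CODE = b"\x00\x00\x01"
ESCAPED_ZEROS = b"\x00\x00\x03"  # two zero bytes followed by emulation_prevention_three_byte


def unescape_ts_obu(ts_obu: bytes) -> bytes:
    """Recover the open_bitstream_unit() bytes from one ts_open_bitstream_unit()."""
    if not ts_obu.startswith(OBU_START_CODE):
        raise ValueError("missing obu_start_code")
    payload = ts_obu[len(OBU_START_CODE):]
    obu = bytearray()
    i = 0
    while i < len(payload):
        if payload[i:i + 3] == ESCAPED_ZEROS:
            obu += payload[i:i + 2]  # keep the two zero bytes
            i += 3                   # drop the emulation prevention byte
        else:
            obu.append(payload[i])
            i += 1
    return bytes(obu)


if __name__ == "__main__":
    escaped = OBU_START_CODE + bytes.fromhex("0a0000030105")
    print(unescape_ts_obu(escaped).hex())
```

The slice comparison only matches when three bytes remain, which mirrors the `i + 2 < NumBytesInTsObu` guard in the syntax.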

It is the responsibility of the TS muxer to prevent start code emulation by escaping all the forbidden three-byte sequences using the emulation_prevention_three_byte (always equal to 0x03). The forbidden sequences are defined below.

Within the ts_open_bitstream_unit() payload, the following three-byte sequences shall not occur at any byte-aligned position:

Within the ts_open_bitstream_unit() payload, any four-byte sequence that starts with 0x000003 other than the following sequences shall not occur at any byte-aligned position:

3.3. The AV1 Access Unit

An AV1 Access Unit consists of all OBUs, including headers, between the end of the last OBU associated with the previous frame, and the end of the last OBU associated with the current frame. With this definition, an Access Unit sometimes maps to a Decodable Frame Group (DFG) as defined in Annex E of [AV1], sometimes to a Temporal Unit (TU) as defined in [AV1], and sometimes to both. The figure below illustrates this for a group of pictures with frames predicted as follows:

Practical example of an AV1 Access Unit split

3.4. Use of PES packets

AV1 video encapsulated as defined in section 3.2 is carried in PES packets as PES_packet_data_bytes, using the stream_id 0xBD (private_stream_1).

A PES packet shall encapsulate one, and only one, AV1 access unit as defined in section 3.3. All PES packets shall have data_alignment_indicator set to 1. Usage of data_stream_alignment_descriptor is not specified and the only allowed alignment_type is 1 (Access unit level).

The highest level that may occur in an AV1 video stream, as well as a profile and tier that the entire stream conforms to, shall be signalled using the AV1 video descriptor.

3.5. Assignment of DTS and PTS

For an AV1 video stream multiplexed into [MPEG-2-TS], decoder_model_info need not be present. If decoder_model_info is present, then the STD model shall match the decoder model defined in Annex E of [AV1].

For synchronization and STD management, PTSs and, when appropriate, DTSs are encoded in the header of the PES packet that carries the AV1 video stream data, setting PTS_DTS_flags to '10' or '11'. For PTS and DTS encoding, the constraints and semantics apply as defined in the PES Header and associated constraints on timestamp intervals.

There are cases in AV1 bitstreams where information about a frame is sent multiple times, for example first to be decoded and subsequently to be displayed. A frame that is decoded but not displayed needs a valid DTS but has no need for a PTS. However, [MPEG-2-TS] prevents a DTS from being transmitted without a PTS. Hence, a PTS is always assigned for AV1 access units, and its value is not relevant for frames that are decoded but not displayed.

To achieve consistency between the STD model and the buffer model defined in Annex E of [AV1], the following PTS and DTS assignment rules shall be applied:

show_existing_frame  show_frame  showable_frame  PTS                          DTS
0                    0           0               ScheduledRemovalTiming[dfg]  ScheduledRemovalTiming[dfg]
0                    0           1               ScheduledRemovalTiming[dfg]  ScheduledRemovalTiming[dfg]
0                    1           n/a             PresentationTime[frame]      ScheduledRemovalTiming[dfg]
1                    n/a         n/a             PresentationTime[frame]      ScheduledRemovalTiming[dfg]

Note: ScheduledRemovalTiming[] and PresentationTime[] are defined in Annex E of [AV1].
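The assignment rules in the table above can be expressed as a small selection function. The Python sketch below is illustrative only: the function names are hypothetical, times are assumed to be given in seconds, and the conversion helper relies on PES timestamps being 33-bit counts of a 90 kHz clock.

```python
from typing import Optional, Tuple


def assign_pts_dts(show_existing_frame: int,
                   show_frame: int,
                   scheduled_removal_timing: float,
                   presentation_time: Optional[float] = None
                   ) -> Tuple[float, float]:
    """Return (PTS, DTS) in seconds for one AV1 access unit.

    showable_frame is not a parameter because it does not change the
    outcome in the table: both rows with show_frame = 0 are identical.
    """
    dts = scheduled_removal_timing
    if show_existing_frame or show_frame:
        if presentation_time is None:
            raise ValueError("a displayed frame needs a presentation time")
        pts = presentation_time
    else:
        # Decoded but not displayed: the PTS is a placeholder equal to the DTS.
        pts = scheduled_removal_timing
    return pts, dts


def to_90khz_ticks(seconds: float) -> int:
    """Convert seconds to a 33-bit timestamp in 90 kHz units."""
    return round(seconds * 90000) % (1 << 33)


if __name__ == "__main__":
    pts, dts = assign_pts_dts(0, 1, scheduled_removal_timing=1.0,
                              presentation_time=1.5)
    print(to_90khz_ticks(pts), to_90khz_ticks(dts))
```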

3.6. Buffer considerations

3.6.1. Buffer pool management

Carriage of an AV1 video stream over [MPEG-2-TS] does not impact the size of the Buffer Pool.

For decoding of an AV1 video stream in the STD, the size of the Buffer Pool is as defined in [AV1]. The Buffer Pool shall be managed as specified in Annex E of [AV1].

A decoded AV1 access unit enters the Buffer Pool instantaneously upon decoding the AV1 access unit, hence at the Scheduled Removal Timing of the AV1 access unit. A decoded AV1 access unit is presented at the Presentation Time.

If the AV1 video stream provides insufficient information to determine the Scheduled Removal Timing and the Presentation Time of AV1 access units, then these time instants shall be determined in the STD model from PTS and DTS timestamps as follows:

  1. The Scheduled Removal Timing of AV1 access unit n is the instant in time indicated by DTS(n) where DTS(n) is the DTS value of AV1 access unit n.

  2. The Presentation Time of AV1 access unit n is the instant in time indicated by PTS(n) where PTS(n) is the PTS value of AV1 access unit n.

3.6.2. T-STD Extensions for AV1

When there is an AV1 video stream in an [MPEG-2-TS] program, the T-STD model as described in the section "Transport stream system target decoder" is extended as specified below.

T-STD Extensions for AV1

3.6.2.1. TBn, MBn, EBn buffer management

The following additional notations are used to describe the T-STD extensions and are illustrated in the figure above.

Notation  Definition
t(i)      indicates the time in seconds at which the i-th byte of the transport stream enters the system target decoder
TBn       is the transport buffer for elementary stream n
TBSn      is the size of the transport buffer TBn, measured in bytes
MBn       is the multiplexing buffer for elementary stream n
MBSn      is the size of the multiplexing buffer MBn, measured in bytes
EBn       is the elementary stream buffer for the AV1 video stream
EBSn      is the size of the elementary stream buffer EBn, measured in bytes
j         is an index to the AV1 access unit of the AV1 video stream
An(j)     is the j-th access unit of the AV1 video bitstream
tdn(j)    is the decoding time of An(j), measured in seconds, in the system target decoder
Rxn       is the transfer rate from the transport buffer TBn to the multiplexing buffer MBn as specified below
Rbxn      is the transfer rate from the multiplexing buffer MBn to the elementary stream buffer EBn as specified below

The following apply:

If there is PES packet payload data in MBn, and buffer EBn is not full, the PES packet payload is transferred from MBn to EBn at a rate equal to Rbxn. If EBn is full, data are not removed from MBn. When a byte of data is transferred from MBn to EBn, all PES packet header bytes that are in MBn and precede that byte are instantaneously removed and discarded. When there is no PES packet payload data present in MBn, no data is removed from MBn. All data that enters MBn leaves it. All PES packet payload data bytes enter EBn instantaneously upon leaving MBn.

3.6.2.2. STD delay

The STD delay of any AV1 video through the system target decoder buffers TBn, MBn, and EBn shall be constrained by tdn(j) – t(i) ≤ 10 seconds for all j, and all bytes i in access unit An(j).
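A muxer or analyzer could verify this constraint per access unit along the following lines. This Python sketch is only an illustration: the function name and the way arrival times are collected are assumptions, and all times are in seconds.

```python
from typing import Iterable

MAX_STD_DELAY_SECONDS = 10.0


def std_delay_ok(byte_arrival_times: Iterable[float],
                 decoding_time: float) -> bool:
    """Check tdn(j) - t(i) <= 10 s for every byte i of access unit An(j).

    byte_arrival_times holds t(i) for each byte of the access unit and
    decoding_time is tdn(j).
    """
    return all(decoding_time - t <= MAX_STD_DELAY_SECONDS
               for t in byte_arrival_times)


if __name__ == "__main__":
    print(std_delay_ok([0.0, 0.5, 1.0], decoding_time=9.9))
    print(std_delay_ok([0.0, 5.0], decoding_time=10.5))
```

Since the earliest byte of an access unit has the largest delay, checking only the minimum arrival time would be an equivalent and cheaper test.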

3.6.2.3. Buffer management conditions

Transport streams shall be constructed so that the following conditions for buffer management are satisfied:

4. Acknowledgements and previous authors

A previous draft of this specification has been produced by VideoLAN, with inputs from different authors (Jean Baptiste Kempf, Kieran Kunhya, Adrien Maglo, Christophe Massiot, Mathieu Monnier and Mickael Raulet) from the following companies: ATEME, OpenHeadend, Open Broadcast Systems, Videolabs under the direction of VideoLAN.

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

References

Normative References

[AV1]
Peter de Rivaz; Jack Haughton. AV1 Bitstream & Decoding Process Specification. 8 January 2019. Standard. URL: https://aomediacodec.github.io/av1-spec/av1-spec.pdf
[BT-1886]
Recommendation BT.1886 : Reference electro-optical transfer function for flat panel displays used in HDTV studio production. Recommendation. URL: https://www.itu.int/rec/R-REC-BT.1886
[BT-2020]
Recommendation BT.2020 : Parameter values for ultra-high definition television systems for production and international programme exchange. Recommendation. URL: https://www.itu.int/rec/R-REC-BT.2020
[BT-709]
Recommendation BT.709 : Parameter values for the HDTV standards for production and international programme exchange. Recommendation. URL: https://www.itu.int/rec/R-REC-BT.709
[MPEG-2-TS]
Information technology — Generic coding of moving pictures and associated audio information — Part 1: Systems. Standard. URL: https://www.iso.org/standard/83239.html
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119