Immersive Audio Container

AOM Working Group Draft,

This version:
https://aomediacodec.github.io/iac/
Issue Tracking:
GitHub
Editors:
(Samsung)
(Google)
Warning

This specification is still at draft stage and should not be referenced other than as a working draft.

Copyright 2022, AOM

Licensing information is available at http://aomedia.org/license/

The MATERIALS ARE PROVIDED “AS IS.” The Alliance for Open Media, its members, and its contributors expressly disclaim any warranties (express, implied, or otherwise), including implied warranties of merchantability, non-infringement, fitness for a particular purpose, or title, related to the materials. The entire risk as to implementing or otherwise using the materials is assumed by the implementer and user. IN NO EVENT WILL THE ALLIANCE FOR OPEN MEDIA, ITS MEMBERS, OR CONTRIBUTORS BE LIABLE TO ANY OTHER PARTY FOR LOST PROFITS OR ANY FORM OF INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY CHARACTER FROM ANY CAUSES OF ACTION OF ANY KIND WITH RESPECT TO THIS DELIVERABLE OR ITS GOVERNING AGREEMENT, WHETHER BASED ON BREACH OF CONTRACT, TORT (INCLUDING NEGLIGENCE), OR OTHERWISE, AND WHETHER OR NOT THE OTHER MEMBER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


Abstract

This document specifies the immersive audio (IA) bitstream format and the container format for the IA bitstream in one single [ISOBMFF] track.

1. Convention

1.1. Syntax Description

All OBU syntax is described using classes, a structure of the C++ programming language.

1.2. Syntax Elements

Each syntax element is composed of a type followed by a syntaxName.

leb128() syntaxName

leb128() indicates the type of an unsigned integer coded with the variable-length code of [LEB128]. Its size in bits is 8 x the number of little-endian bytes used by the encoding.

syntaxName is an unsigned integer which shall be encoded by [LEB128].
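
As an illustration of the leb128() type, the following is a minimal Python sketch of unsigned LEB128 coding as commonly defined by [LEB128]: each byte carries 7 value bits, least significant group first, and the top bit signals that more bytes follow. The function names encode_leb128 and decode_leb128 are hypothetical and are not defined by this specification.

```python
from typing import Tuple


def encode_leb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("leb128() carries unsigned values only")
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits of the remaining value
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)         # last byte has the top bit clear
            return bytes(out)


def decode_leb128(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next_pos) for the unsigned LEB128 value at data[pos]."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return value, pos
        shift += 7
```

For example, the value 128 does not fit in 7 bits, so it occupies two bytes (0x80 0x01), while any value up to 127 occupies one byte.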

sleb128() syntaxName

sleb128() indicates the type of a signed integer coded with the variable-length code of [LEB128]. Its size in bits is 8 x the number of little-endian bytes used by the encoding.

syntaxName is a signed integer which shall be encoded by [LEB128].
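
The signed variant can be sketched in the same way. The Python code below assumes the usual signed LEB128 convention of [LEB128], where bit 6 of the final byte is the sign bit and is extended on decoding. The function names are hypothetical.

```python
from typing import Tuple


def encode_sleb128(value: int) -> bytes:
    """Encode a signed integer as signed LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7                  # arithmetic shift keeps the sign in Python
        sign_bit_set = bool(byte & 0x40)
        # Stop once the remaining value is pure sign extension of this byte.
        if (value == 0 and not sign_bit_set) or (value == -1 and sign_bit_set):
            out.append(byte)
            return bytes(out)
        out.append(byte | 0x80)      # continuation bit: more bytes follow


def decode_sleb128(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next_pos) for the signed LEB128 value at data[pos]."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):
            break
    if byte & 0x40:                  # sign bit of the final 7-bit group
        value -= 1 << shift          # sign-extend to a negative number
    return value, pos
```

Note that 64 needs two bytes in the signed form (0xC0 0x00), because a single byte 0x40 would be read back as -64.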

string syntaxName

string indicates the type of a string whose size in bits is 8 x the number of bytes in the byte representation of syntaxName.

syntaxName is a human-readable label whose byte representation shall consist of a primary language subtag and a region subtag connected by a hyphen ("-"), followed by the label. The language tag shall conform to [BCP47].

1.3. Mathematical Functions

Clip3(x, y, z)

x if z < x, y if z > y, z otherwise.
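
Read as a clamp of z to the inclusive range [x, y], with x taken as the lower bound returned in the first case, Clip3 can be sketched in Python as follows. The function name clip3 is only an illustrative spelling.

```python
def clip3(x, y, z):
    """Clamp z to the inclusive range [x, y]."""
    if z < x:
        return x
    if z > y:
        return y
    return z
```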

2. Introduction

The IA bitstream is designed to represent immersive audio for presentation on a wide range of devices in both dynamic streaming and offline applications. These applications include internet audio streaming, multicasting/broadcasting services, file download, gaming, communication, virtual and augmented reality, and others. In these applications, audio may be played back on a wide range of devices, e.g. headsets, mobile phones, tablets, TVs, sound bars, home theater systems and big screens.

The bitstream comprises a number of coded audio substreams and the metadata that describes how to decode, render and mix the substreams to generate an audio signal for playback. The bitstream format itself is codec-agnostic; any supported audio codec may be used to code the audio substreams.

The immersive audio container (IAC) is the storage format for immersive audio (IA) bitstream in one single [ISOBMFF] track.

The figure below shows the conceptual IAC architecture.

Conceptual IAC Architecture

For a given input 3D audio,

The rest of this specification is formulated as follows:

3. Overview

3.1. IA Bitstream Components

The IA bitstream includes one or more audio elements, each of which consists of one or more audio substreams. The IA bitstream further includes mix presentations and parameters.

The figure below shows the relationship between the audio substreams, audio elements and mix presentations and the processing flow to obtain the immersive audio playback.

Processing flow to decode, reconstruct, render and mix the audio signals for immersive audio playback.

3.2. Use of OBU Syntax

3.2.1. Descriptors

The descriptor OBUs contain all the information that is required to set up and configure the decoders, reconstruction algorithms, renderers and mixers.

3.2.2. Data

The data OBUs contain the actual time-varying data that is required in the generation of the final audio output.

3.2.3. Logistics

The IA bitstream supports the description of multiple audio substreams and algorithms, which may have metadata update rates that differ from each other. The update rate for the audio substreams and audio elements is governed by the frame rates of the audio codec used. Since a single bitstream may support multiple codecs, this may lead to multiple different frame rates. The algorithms for rendering and mixing may have parameters that update at rates different from each other and from the audio frame rates.

Therefore, the IA bitstream contains information to facilitate the synchronization of the different IA metadata. The synchronizing information in each metadata indicates the offset from a reference point and the duration for which it is valid.

4. Open Bitstream Unit (OBU) Syntax and Semantics

4.1. Top Level OBU Syntax and Semantics

The IA bitstream uses the OBU syntax. The IA bitstream shall be composed of descriptor OBUs followed by one or more temporal units. Sync OBUs may be present between two adjacent temporal units or between the descriptor OBUs and the following temporal unit.

This section specifies the top-level OBU syntax elements and their semantics.

4.1.1. Audio OBU Syntax and Semantics

Syntax

class audio_open_bitstream_unit() {
  obu_header();

  if (obu_type == OBU_IA_Start_Code)
    start_code_obu();
  else if (obu_type == OBU_IA_Codec_Config)
    codec_config_obu();
  else if (obu_type == OBU_IA_Audio_Element)
    audio_element_obu();
  else if (obu_type == OBU_IA_Mix_Presentation)
    mix_presentation_obu();
  else if (obu_type == OBU_IA_Parameter_Block)
    parameter_block_obu();
  else if (obu_type == OBU_IA_Audio_Frame)
    audio_frame_obu();
  else if (obu_type == OBU_IA_Temporal_Delimiter)
    temporal_delimiter_obu();
  else if (obu_type == OBU_IA_Sync)
    sync_obu();
  else
    reserved_obu();

  byte_alignment();
}

Semantics

If the syntax element obu_type is equal to OBU_IA_Start_Code, an ordered series of OBUs is presented to the decoding process as a string of bytes.

OBU data shall start on the first (most significant) bit and shall end on the last bit of the given bytes. The payload of an OBU shall lie between the first bit of the given bytes and the last bit before the first zero bit of the byte_alignment().

4.1.2. OBU Header Syntax and Semantics

Syntax

class obu_header() {
  unsigned int (4) obu_type;
  unsigned int (1) obu_id_flag;
  unsigned int (1) obu_sync_flag;
  unsigned int (1) obu_duration_flag;
  unsigned int (1) obu_counter_flag;
  unsigned int (2) obu_trimming_status;
  unsigned int (1) obu_extension_flag;
  unsigned int (5) obu_reserved_5bit;

  leb128() obu_size;

  if (obu_id_flag == 1)
    leb128() obu_id;
  if (obu_sync_flag == 1)
    sleb128() obu_sync;
  if (obu_duration_flag == 1)
    leb128() obu_duration;
  if (obu_counter_flag == 1)
    leb128() obu_counter;
  if (obu_trimming_status == 1)
    leb128() num_samples_to_trim_at_end;
  if (obu_trimming_status == 2)
    leb128() num_samples_to_trim_at_start;
  if (obu_extension_flag == 1)
    leb128() extension_header_size;
}

Semantics

OBUs are structured with a header and a payload.

obu_type specifies the type of data structure contained in the OBU payload.

obu_type: Name of obu_type
   0    : OBU_IA_Codec_Config
   1    : OBU_IA_Audio_Element
   2    : OBU_IA_Mix_Presentation
   3    : OBU_IA_Parameter_Block
   4    : OBU_IA_Audio_Frame
   5    : OBU_IA_Temporal_Delimiter
   6    : OBU_IA_Sync
  7~14  : Reserved
   15   : OBU_IA_Start_Code

obu_id_flag indicates whether the obu_id field is present. If it is set to 0, the obu_id field shall not be present. Otherwise, the obu_id field shall be present.

obu_sync_flag indicates whether the obu_sync field is present. If it is set to 0, the obu_sync field shall not be present. Otherwise, the obu_sync field shall be present.

obu_duration_flag indicates whether the obu_duration field is present. If it is set to 0, the obu_duration field shall not be present. Otherwise, the obu_duration field shall be present.

obu_counter_flag indicates whether the obu_counter field is present. If it is set to 0, the obu_counter field shall not be present. Otherwise, the obu_counter field shall be present.

obu_trimming_status indicates whether this OBU has audio samples to be trimmed or not.

obu_extension_flag indicates whether the extension_header_size field is present. If it is set to 0, the extension_header_size field shall not be present. Otherwise, the extension_header_size field shall be present.

This flag shall be set to 0 for the current version of the specification (i.e. version = 0). An IAC-OBU parser that conforms to the current version of the specification shall be able to parse this flag and extension_header_size.

NOTE: A future version of the specification may use this flag to add an extension header field by setting obu_extension_flag = 1 and setting extension_header_size to the size of the extended header.

obu_size shall indicate the size in bytes of the OBU, not including the bytes of the preceding fields within obu_header, i.e. obu_type, the various OBU flags and obu_reserved_5bit.

obu_id indicates a unique ID according to the obu_type.

The below figure shows the linking scheme among obu_ids in obu_header and ids in obu payload.

ID Linking Scheme

In the above figure,

obu_sync shall indicate the offset from a reference point in the IA bitstream for which the OBU is valid and applicable. The reference point used depends on the IA Profile (See Profiles Section).

obu_duration shall indicate the duration for which the OBU is valid and applicable. This field shall only be valid when obu_type = OBU_IA_Parameter_Block.

obu_counter shall increment when its payload is different to the previous OBU of the same obu_type. If the payload is identical to the previous OBU of the same obu_type, i.e. it was redundantly copied or repeated in the bitstream, the value of obu_counter shall remain unchanged.

num_samples_to_trim_at_start shall indicate the number of samples that need to be trimmed from the start of the samples in this Audio Frame OBU.

num_samples_to_trim_at_end shall indicate the number of samples that need to be trimmed from the end of the samples in this Audio Frame OBU.

extension_header_size shall indicate the size in bytes of the extension header including this field.

obu_reserved_5bit shall be set to 0. Reserved units are for future use and shall be ignored by an IAC-OBU parser.
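
The header layout above can be read with a small parser. The following Python sketch assumes that the fixed-width fields are packed most significant bit first into the first two bytes (4 + 1 + 1 + 1 + 1 + 2 + 1 + 5 = 16 bits), followed by the variable-length fields in the order given by the syntax. The function names and the dictionary result are hypothetical, and the contents of any extension header are not interpreted.

```python
def read_leb128(data, pos):
    """Unsigned LEB128: return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def read_sleb128(data, pos):
    """Signed LEB128: return (value, next_pos)."""
    value, end = read_leb128(data, pos)
    if data[end - 1] & 0x40:                # sign bit of the last 7-bit group
        value -= 1 << (7 * (end - pos))     # sign-extend
    return value, end


def parse_obu_header(data, pos=0):
    """Return (header_fields, next_pos) for the obu_header() at data[pos]."""
    b0, b1 = data[pos], data[pos + 1]
    hdr = {
        "obu_type": b0 >> 4,
        "obu_id_flag": (b0 >> 3) & 1,
        "obu_sync_flag": (b0 >> 2) & 1,
        "obu_duration_flag": (b0 >> 1) & 1,
        "obu_counter_flag": b0 & 1,
        "obu_trimming_status": b1 >> 6,
        "obu_extension_flag": (b1 >> 5) & 1,
        "obu_reserved_5bit": b1 & 0x1F,
    }
    pos += 2
    hdr["obu_size"], pos = read_leb128(data, pos)
    if hdr["obu_id_flag"]:
        hdr["obu_id"], pos = read_leb128(data, pos)
    if hdr["obu_sync_flag"]:
        hdr["obu_sync"], pos = read_sleb128(data, pos)
    if hdr["obu_duration_flag"]:
        hdr["obu_duration"], pos = read_leb128(data, pos)
    if hdr["obu_counter_flag"]:
        hdr["obu_counter"], pos = read_leb128(data, pos)
    if hdr["obu_trimming_status"] == 1:
        hdr["num_samples_to_trim_at_end"], pos = read_leb128(data, pos)
    if hdr["obu_trimming_status"] == 2:
        hdr["num_samples_to_trim_at_start"], pos = read_leb128(data, pos)
    if hdr["obu_extension_flag"]:
        hdr["extension_header_size"], pos = read_leb128(data, pos)
    return hdr, pos
```

For example, the bytes 0x48 0x00 0x0A 0x03 would be read as an OBU_IA_Audio_Frame header with obu_id_flag = 1, obu_size = 10 and obu_id = 3.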

4.1.3. Byte Alignment Syntax and Semantics

Syntax

class byte_alignment() {
  while (get_position() & 7)
    unsigned int (1) zero_bit;
}

Semantics

zero_bit shall be equal to 0 and shall be inserted into the bitstream to align the bit position to a multiple of 8 bits.
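
The loop in byte_alignment() emits zero bits until the bit position is a multiple of 8. The Python sketch below mirrors that loop and also gives the equivalent closed form. Both function names are hypothetical, and bit_position plays the role of get_position().

```python
def count_alignment_bits(bit_position: int) -> int:
    """Mirror the byte_alignment() loop and count the zero_bit entries."""
    count = 0
    while bit_position & 7:      # not yet on a byte boundary
        bit_position += 1        # one zero_bit is written
        count += 1
    return count


def alignment_bits_closed_form(bit_position: int) -> int:
    """Same result without a loop: distance to the next multiple of 8."""
    return (-bit_position) & 7
```

A position that is already byte aligned needs no padding, and a position of 13 needs 3 zero bits to reach bit 16.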

4.1.4. Reserved OBU Syntax and Semantics

The reserved OBU allows the extension of this specification with additional OBU types in a way that allows IAC-OBU parsers compliant with this version of the specification to ignore them.

4.1.5. Start Code OBU Syntax and Semantics

This section specifies obu payload of OBU_IA_Start_Code.

For this obu, the obu header (3 bytes) shall be set to 0xF07F06.

Syntax

class start_code_obu() {
  unsigned int (32) ia_code;
  unsigned int (8) version;
  unsigned int (8) profile_version;
}

Semantics

ia_code shall be a ‘four-character code’ (4CC) to identify the start of the IA bitstream. It shall be aiac.

version shall indicate the version of an IA bitstream. It shall be set to 0 for this version of the specification. Implementations should treat IA bitstreams where the MSB four bits of the version number match that of a recognized specification as backwards compatible with that specification. That is, the version number can be split into "major" and "minor" version sub-fields, with changes to the minor sub-field (in the LSB four bits) signaling compatible changes. For example, an implementation of this specification should accept any stream with a version number of ’15’ or less, and should assume any stream with a version number ’16’ or greater is incompatible.

profile_version shall indicate the profile of an IA sequence. The MSB four bits shall indicate the profile of an IA sequence. Implementations should treat IA bitstreams where the MSB four bits of profile_version match those of a recognized profile as backwards compatible with that profile. That is, profile_version can be split into "profile major" and "profile minor" version sub-fields, with changes to the minor sub-field (in the LSB four bits) signaling changes compatible with the profile major version. The semantics of this field shall only be valid when the MSB four bits of version = 0.
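
The major/minor split described for version and profile_version can be sketched as follows. This Python example assumes that the 6-byte payload is laid out exactly as in start_code_obu(), with ia_code carried as the four ASCII bytes "aiac". The function names and the returned dictionary keys are hypothetical.

```python
import struct

IA_CODE = b"aiac"  # four-character code given in the ia_code semantics


def parse_start_code_payload(payload: bytes) -> dict:
    """Split the start code payload into its major/minor sub-fields."""
    ia_code, version, profile_version = struct.unpack(">4sBB", payload[:6])
    if ia_code != IA_CODE:
        raise ValueError("not the start of an IA bitstream")
    return {
        "version_major": version >> 4,            # MSB four bits
        "version_minor": version & 0x0F,          # LSB four bits
        "profile_major": profile_version >> 4,
        "profile_minor": profile_version & 0x0F,
    }


def is_version_compatible(version: int, recognized_major: int = 0) -> bool:
    """True if the MSB four bits match the recognized major version."""
    return (version >> 4) == recognized_major
```

With recognized_major = 0, version values 0 to 15 are accepted and 16 or greater are rejected, which matches the example given for version.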

4.1.6. Codec Config OBU Syntax and Semantics

NOTE: This section is updated to specify one Codec Config OBU per codec config.

This section specifies obu payload of OBU_IA_Codec_Config.

For this obu, all of obu_sync_flag, obu_duration_flag and obu_counter_flag shall be set to 0.

Syntax

class codec_config_obu() {
  codec_config();
  leb128() num_audio_elements;
  for (i = 0; i < num_audio_elements; i++) {
    leb128() audio_element_id;
  }
}

class codec_config() {
  unsigned int (32) codec_id;
  decoder_config(codec_id);
  leb128() num_samples_per_frame;
  signed int (16) roll_distance;
}

Semantics

codec_id shall be a ‘four-character code’ (4CC) to identify the codec used to generate the audio substreams. It shall be opus for IAC-OPUS, mp4a for IAC-AAC-LC, fLaC for IAC-FLAC and lpcm for IAC-LPCM.

For ISOBMFF encapsulation, it shall be the same as the boxtype of its AudioSampleEntry, if it exists.

decoder_config() specifies the set of codec parameters required to decode an audio substream for the given codec_id. It shall be byte aligned.

num_samples_per_frame shall indicate the frame length, in samples, of the raw coded audio provided by audio_frame_obu().

roll_distance is a signed integer that gives the number of frames that need to be decoded in order for a frame to be decoded correctly. A negative value indicates the number of frames before the frame to be decoded correctly.

num_audio_elements shall specify the number of audio elements that are applying the codec_config() in this OBU.

audio_element_id shall specify the unique ID of the audio element that is applying the codec_config() in this OBU.

4.1.7. Audio Element OBU Syntax and Semantics

This section specifies obu payload of OBU_IA_Audio_Element.

For this obu, both obu_sync_flag and obu_duration_flag shall be set to 0.

Syntax

class audio_element_obu() {
  unsigned int (3) audio_element_type;
  unsigned int (5) reserved;

  leb128() num_substreams;
  for (i = 0; i < num_substreams; i++) {
    leb128() audio_substream_id;
  }
  
  leb128() num_parameters;
  for (i = 0; i < num_parameters; i++) {
    leb128() parameter_id;
    leb128() parameter_name;
  }

  if (audio_element_type == CHANNEL_BASED) {
    scalable_channel_layout_config();
  } else if (audio_element_type == SCENE_BASED) {
    ambisonics_config();
  }
  
  
}

Semantics

audio_element_type shall specify the audio representation of this audio element which is constructed from one or more audio substreams.

audio_element_type: The type of audio representation.
   0    : CHANNEL_BASED
   1    : SCENE_BASED
  2~7   : Reserved

num_substreams shall specify the number of audio substreams that are used to reconstruct this audio element.

audio_substream_id shall specify the unique ID of the audio substream that is used to reconstruct this audio element.

num_parameters shall specify the number of parameters that are used by the algorithms specified in this audio element.

parameter_id shall be a unique ID in the IA bitstream for a parameter that is used by the algorithm specified in this audio element. It shall be the same as the obu_id of the parameter_block_obu for the following parameter_name.

parameter_name shall specify the name of the parameter.

parameter_name : Parameter name.
       0       : SCALABLE_CHANNEL_LAYOUT_DEMIXING_INFO
       1       : SCALABLE_CHANNEL_LAYOUT_RECON_GAIN_INFO
   the others  : reserved

scalable_channel_layout_config() is a class that provides the metadata required for combining the substreams identified here in order to reconstruct a scalable channel layout.

ambisonics_config() is a class that provides the metadata required for combining the substreams identified here in order to reconstruct an Ambisonics layout.

4.1.8. Mix Presentation OBU Syntax and Semantics

This section specifies obu payload of OBU_IA_Mix_Presentation.

For this obu, both obu_sync_flag and obu_duration_flag shall be set to 0.

The metadata in mix_presentation() specifies how to render and process one or more audio elements. The processed audio elements shall then be summed to generate a single mixed audio signal. Finally, any additional processing specified in the mix_bus_config() metadata shall be applied to the single mixed audio signal in order to generate the final output audio for playback.

Syntax

class mix_presentation_obu() {
  string mix_presentation_friendly_label;
  unsigned int (4) mix_target_layout;
  unsigned int (4) reserved;

  leb128() num_audio_elements;
  for (i = 0; i < num_audio_elements; i++) {
    string audio_element_friendly_label;
    leb128() audio_element_id_ref;
    rendering_config();
    element_mix_config();
  }

  mix_loudness_info();
  mix_bus_config();
}

Semantics

mix_presentation_friendly_label shall specify a human-friendly label to describe this mix presentation.

mix_target_layout shall specify the target playback layout that all referenced audio elements shall be rendered for.

Mix Target Layout (4 bits) :  Channel Layout  : Loudspeaker Location Ordering
            0000           :       Mono       : C
            0001           :      Stereo      : L/R
            0010           :      5.1ch       : L/C/R/Ls/Rs/LFE
            0011           :     5.1.2ch      : L/C/R/Ls/Rs/Ltf/Rtf/LFE
            0100           :     5.1.4ch      : L/C/R/Ls/Rs/Ltf/Rtf/Ltr/Rtr/LFE
            0101           :      7.1ch       : L/C/R/Lss/Rss/Lrs/Rrs/LFE
            0110           :     7.1.2ch      : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/LFE
            0111           :     7.1.4ch      : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/Ltb/Rtb/LFE
            1000           :     3.1.2ch      : L/C/R/Ltf/Rtf/LFE
           others          :     reserved     :
Where, C: Center, L: Left, R: Right, Ls: Left Surround, Lss: Left Side Surround, 
Rs: Right Surround, Rss: Right Side Surround, 
Ltf: Left Top Front, Rtf: Right Top Front, Ltr: Left Top Rear, Rtr: Right Top Rear, 
Ltb: Left Top Back, Rtb: Right Top Back, LFE: Low-Frequency Effects

An IA bitstream may have one or more mix_presentation() specified, each with different mix_target_layout values. In this case, the IA decoder shall select the mix presentation that matches the physical playback layout. If there is no match, the IA decoder should select the closest specified layout and apply up or down-mixing appropriately. Sections § 10.2.2 Down-mix Mechanism and § 9.5 Down-mix Matrix provide example dynamic and static down-mixing matrices for some common layouts that may be used by the IA decoder.
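
The selection rule above can be sketched as a lookup followed by a fallback. In the Python example below, the table is transcribed from the Mix Target Layout table (with the 3.1.2ch ordering written with single separators). The specification does not define what "closest" means, so the fallback used here, the smallest difference in channel count with ties broken by the lower code, is purely an assumption for illustration. The function names are hypothetical.

```python
# mix_target_layout code -> (channel layout, loudspeaker location ordering)
MIX_TARGET_LAYOUTS = {
    0b0000: ("Mono", "C"),
    0b0001: ("Stereo", "L/R"),
    0b0010: ("5.1ch", "L/C/R/Ls/Rs/LFE"),
    0b0011: ("5.1.2ch", "L/C/R/Ls/Rs/Ltf/Rtf/LFE"),
    0b0100: ("5.1.4ch", "L/C/R/Ls/Rs/Ltf/Rtf/Ltr/Rtr/LFE"),
    0b0101: ("7.1ch", "L/C/R/Lss/Rss/Lrs/Rrs/LFE"),
    0b0110: ("7.1.2ch", "L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/LFE"),
    0b0111: ("7.1.4ch", "L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/Ltb/Rtb/LFE"),
    0b1000: ("3.1.2ch", "L/C/R/Ltf/Rtf/LFE"),
}


def channel_count(layout_code: int) -> int:
    """Number of loudspeaker channels listed for a layout code."""
    return len(MIX_TARGET_LAYOUTS[layout_code][1].split("/"))


def select_mix_presentation(available_layouts, playback_layout):
    """Pick the mix_target_layout to use for the physical playback layout."""
    if playback_layout in available_layouts:
        return playback_layout                      # exact match is required
    target = channel_count(playback_layout)
    # ASSUMPTION: "closest" = smallest channel-count difference.
    return min(available_layouts,
               key=lambda code: (abs(channel_count(code) - target), code))
```

When the fallback is taken, the decoder would still need to up-mix or down-mix from the selected layout to the physical one.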

num_audio_elements shall specify the number of audio elements that are used in this mix presentation to generate the final output audio signal for playback.

audio_element_friendly_label shall specify a human-friendly label to describe the referenced audio element.

audio_element_id_ref shall be the obu_id specified in obu header of audio element obu specifying the audio element that is used in this mix presentation.

rendering_config() is a class that provides the metadata required for rendering the referenced audio element.

element_mix_config() is a class that provides the metadata required for applying any processing to the referenced and rendered audio element before being summed with other processed audio elements.

mix_loudness_info() is a class that provides the loudness information and statistics for the final output audio signal.

mix_bus_config() is a class that provides the metadata required for applying any post-processing to the mixed audio signal to generate the final output audio signal for playback.

4.1.9. Parameter Block OBU Syntax and Semantics

This section specifies obu payload of OBU_IA_Parameter_Block.

The metadata specified in this OBU defines the parameters for an algorithm for an indicated duration, including any animation of the parameter values over this duration.

Syntax

class parameter_block_obu() {
  leb128() parameter_type;

  if (parameter_type == CONSTANT) {
  }

  if (parameter_type == STEP) {
    leb128() parameter_sample_rate;
  }

  if (parameter_type == LINEAR || parameter_type == EXPONENTIAL) {
    leb128() parameter_sample_rate;
    leb128() smoothing_duration;
  }

  if (parameter_type == PROCEDURAL) {
  }

  param_config(obu_id);
}

Semantics

parameter_type shall specify the type of parameter.

parameter_type : Parameter type.
       0       : CONSTANT
       1       : STEP
       2       : LINEAR
       3       : EXPONENTIAL
       4       : PROCEDURAL

If parameter_type is equal to CONSTANT, this shall indicate that a single parameter value is provided, and is intended to be applied to the audio samples.

If parameter_type is equal to STEP, LINEAR or EXPONENTIAL, this shall indicate that a series of parameter values will be provided as a 1D signal, and are intended to be applied to the audio samples. The rate at which these values are provided does not need to match the audio sample rate.

If parameter_type is equal to PROCEDURAL, this shall indicate that the parameter values provided are intended to parameterize some function, which also governs how the resulting values are applied to the audio samples.

parameter_sample_rate shall specify the rate at which the parameters are provided. This value may be different from the audio sample rate.

smoothing_duration shall specify the duration over which the parameter is interpolated from its previous value.

param_config() is a class that provides the actual parameter values and any additional metadata that may be required by the algorithm to specify how the parameter values are applied to the audio samples. This will be different for each algorithm.
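
To make the STEP and LINEAR types more concrete, the Python sketch below shows one way a decoder might evaluate a parameter at a given audio sample. The specification does not define the exact mapping, so several assumptions are made: a STEP series is held between updates, parameter_sample_rate and the audio sample rate are both in Hz, and smoothing_duration is expressed in audio samples. All function and argument names are hypothetical.

```python
def step_value_at(values, parameter_sample_rate, audio_sample_rate, n):
    """STEP (assumed sample-and-hold): value in effect at audio sample n."""
    index = (n * parameter_sample_rate) // audio_sample_rate
    return values[min(index, len(values) - 1)]   # hold the last value at the end


def linear_value_at(previous, target, smoothing_duration, n):
    """LINEAR (assumed ramp): interpolate from previous to target.

    n is the number of audio samples since the new value arrived and
    smoothing_duration is assumed to be in audio samples.
    """
    if smoothing_duration <= 0 or n >= smoothing_duration:
        return target
    return previous + (target - previous) * n / smoothing_duration
```

For a parameter provided at 100 Hz with 48000 Hz audio, each value would apply to 480 consecutive audio samples under these assumptions.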

4.1.10. Audio Frame OBU Syntax and Semantics

This section specifies obu payload of OBU_IA_Audio_Frame.

For this obu, both obu_duration_flag and obu_counter_flag shall be set to 0.

Syntax

class audio_frame_obu() {
  unsigned int (8*coded_frame_size) audio_frame();
}

Semantics

coded_frame_size is the size of audio_frame() in byte units.

audio_frame() is the raw coded audio data for the frame. It shall be an Opus packet of [RFC6716] for IAC-OPUS, raw_data_block() of [AAC] for IAC-AAC-LC, FRAME of [FLAC] for IAC-FLAC and audio samples for IAC-LPCM.

4.1.11. Temporal Delimiter OBU Syntax and Semantics

This section specifies temporal delimiter obu.

For this obu, all of obu_sync_flag, obu_duration_flag, obu_counter_flag and obu_trimming_status shall be set to 0.

Syntax

class temporal_delimiter_obu() {
}

NOTE: Temporal delimiter obu has an empty payload.

4.1.12. Sync OBU Syntax and Semantics

This section specifies obu payload of OBU_IA_Sync.

For this obu, obu_sync_flag shall be set to 1 and all of obu_id_flag, obu_duration_flag and obu_counter_flag shall be set to 0.

Syntax

class sync_obu() {
}

NOTE: sync_obu() has an empty payload.

4.2. Detailed OBU Syntax and Semantics

4.2.1. Scalable Channel Layout Config Syntax and Semantics

scalable_channel_layout_config() contains information regarding the configuration of scalable channel audio.

Syntax

class scalable_channel_layout_config() {
  unsigned int (3) num_layers;
  unsigned int (5) reserved;
  for (i = 1; i <= num_layers; i++) {
    channel_audio_layer_config(i);
  }
}

class channel_audio_layer_config(i) {
  unsigned int (4) loudspeaker_layout(i);
  unsigned int (1) output_gain_is_present_flag(i);
  unsigned int (1) recon_gain_is_present_flag(i);
  unsigned int (2) reserved;
  unsigned int (8) substream_count(i);
  unsigned int (8) coupled_substream_count(i);
  signed int (16) loudness(i);
  if (output_gain_is_present_flag(i) == 1) {
    unsigned int (6) output_gain_flags(i);
    unsigned int (2) reserved;
    signed int (16) output_gain(i);
  }
}

When an audio element is composed of G(r) substreams, the scalable channel audio for the audio element shall be layered into num_layers = r ChannelGroups.

Immersive Audio Bitstream with scalable channel audio (before OBU packing)

Semantics

num_layers shall indicate the number of ChannelGroups for scalable channel audio. It shall not be set to zero and its maximum number shall be limited to 6.

channel_audio_layer_config() is a class that provides the information regarding the configuration of a ChannelGroup for scalable channel audio. channel_audio_layer_config(i) shall provide information regarding the configuration of ChannelGroup #i.

loudspeaker_layout shall indicate the channel layout for the channels to be reconstructed from the precedent ChannelGroups and the current ChannelGroup among ChannelGroups for scalable channel audio.

In the current version of the specification, loudspeaker_layout shall indicate one of 9 channel layouts including Mono, Stereo, 5.1ch, 5.1.2ch, 5.1.4ch, 7.1ch, 7.1.2ch, 7.1.4ch and 3.1.2ch. Where,

Loudspeaker Layout (4 bits) :  Channel Layout  : Loudspeaker Location Ordering
             0000           :       Mono       : C
             0001           :      Stereo      : L/R
             0010           :      5.1ch       : L/C/R/Ls/Rs/LFE
             0011           :     5.1.2ch      : L/C/R/Ls/Rs/Ltf/Rtf/LFE
             0100           :     5.1.4ch      : L/C/R/Ls/Rs/Ltf/Rtf/Ltr/Rtr/LFE
             0101           :      7.1ch       : L/C/R/Lss/Rss/Lrs/Rrs/LFE
             0110           :     7.1.2ch      : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/LFE
             0111           :     7.1.4ch      : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/Ltb/Rtb/LFE
             1000           :     3.1.2ch      : L/C/R/Ltf/Rtf/LFE
            others          :     reserved     :
Where, C: Center, L: Left, R: Right, Ls: Left Surround, Lss: Left Side Surround, 
Rs: Right Surround, Rss: Right Side Surround, 
Ltf: Left Top Front, Rtf: Right Top Front, Ltr: Left Top Rear, Rtr: Right Top Rear, 
Ltb: Left Top Back, Rtb: Right Top Back, LFE: Low-Frequency Effects

output_gain_is_present_flag shall indicate whether the output_gain information fields for the ChannelGroup are present.

recon_gain_is_present_flag shall indicate whether the recon_gain information fields for the ChannelGroup are present in recon_gain_info().

loudness shall indicate the loudness value of the downmixed channels, for the channel layout indicated by loudspeaker_layout, from the original channel audio. It shall be stored as a fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format]) and should be expressed in LKFS based on [ITU1770-4], so it shall represent zero or a negative value.

output_gain_flags shall indicate the channels to which output_gain is applied. If a bit is set to 1, output_gain shall be applied to the corresponding channel. Otherwise, output_gain shall not be applied to the channel.

Bit position : Channel Name
    b5(MSB)  : Left channel (L1, L2, L3)
      b4     : Right channel (R2, R3)
      b3     : Left Surround channel (Ls5)
      b2     : Right Surround channel (Rs5)
      b1     : Left Top Front channel (Ltf)
      b0     : Right Top Front channel (Rtf)
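
The bit assignment in the table above can be decoded with a simple mask test. In this Python sketch the 6-bit field is treated as an integer whose bit 5 is the MSB. The function and table names are hypothetical.

```python
# (bit position, channel name) pairs, transcribed from the table above
OUTPUT_GAIN_CHANNELS = [
    (5, "Left channel (L1, L2, L3)"),
    (4, "Right channel (R2, R3)"),
    (3, "Left Surround channel (Ls5)"),
    (2, "Right Surround channel (Rs5)"),
    (1, "Left Top Front channel (Ltf)"),
    (0, "Right Top Front channel (Rtf)"),
]


def channels_with_output_gain(output_gain_flags: int):
    """List the channels whose flag bit is set in the 6-bit field."""
    return [name for bit, name in OUTPUT_GAIN_CHANNELS
            if (output_gain_flags >> bit) & 1]
```

For instance, a value of 0b110000 selects the Left and Right channels only.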

output_gain shall indicate the gain value to be applied to the mixed channels which are indicated by output_gain_flags. It is 20*log10 of the factor by which to scale the mixed channels. It is stored in a 16-bit, signed, two’s complement fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format]). Where, each mixed channel is generated by downmixing two or more input channels.
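
Since output_gain is 20*log10 of the scale factor and is stored as Q7.8, the linear factor can be recovered in two steps. The Python sketch below shows the conversion. The function names are hypothetical, and raw is the 16-bit field read as an unsigned integer.

```python
def q7_8_to_float(raw: int) -> float:
    """Interpret a 16-bit field as signed two's complement Q7.8."""
    raw &= 0xFFFF
    if raw & 0x8000:          # sign bit set: negative value
        raw -= 0x10000
    return raw / 256.0        # 8 fractional bits


def db_to_linear(gain_db: float) -> float:
    """Invert gain_db = 20*log10(factor)."""
    return 10.0 ** (gain_db / 20.0)
```

A raw value of 0x0600 represents 6.0 dB, which corresponds to a scale factor of roughly 2, and 0xFA00 represents -6.0 dB, roughly 0.5.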

4.2.2. Ambisonics Config Syntax and Semantics

ambisonics_config() contains information regarding the configuration of Ambisonics.

Syntax

class ambisonics_config() {
  leb128() ambisonics_mode;
  if (ambisonics_mode == MONO) {
    ambisonics_mono_config();
  } else if (ambisonics_mode == PROJECTION) {
    ambisonics_projection_config();
  }
}

class ambisonics_mono_config() {
  unsigned int (8) output_channel_count (C);
  unsigned int (8) substream_count (N);
  unsigned int (8 * C) channel_mapping;
}

class ambisonics_projection_config() {
  unsigned int (8) output_channel_count (C);
  unsigned int (8) substream_count (N);
  unsigned int (8) coupled_substream_count (M);
  unsigned int (16 * (N + M) * C) demixing_matrix;
}

Semantics

ambisonics_mode shall specify the method of coding Ambisonics.

ambisonics_mode: Method of coding Ambisonics.
   0    : MONO
   1    : PROJECTION

If ambisonics_mode is equal to MONO, this shall indicate that the Ambisonics channels are coded as individual mono substreams.

If ambisonics_mode is equal to PROJECTION, this shall indicate that the Ambisonics channels are first linearly projected onto another subspace before coding as a mix of coupled stereo and mono substreams.

output_channel_count shall be the same as the channel count in [RFC8486].

substream_count shall specify the number of audio substreams. It must be the same as num_substreams in its corresponding audio_element().

channel_mapping shall be the same as the one for ChannelMappingFamily = 2 in [RFC8486].

coupled_substream_count shall specify the number of referenced substreams that are coded as coupled stereo channels, where M <= N.

demixing_matrix shall be the same as the one for ChannelMappingFamily = 3 in [RFC8486].
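
The field sizes in the two Ambisonics configurations follow directly from C, N and M. The Python sketch below reads ambisonics_mono_config() from a byte string and computes the byte size of demixing_matrix, assuming the fields are byte aligned and that channel_mapping holds one byte per output channel, as the 8 * C bit width suggests. The function names are hypothetical.

```python
def parse_ambisonics_mono_config(data: bytes, pos: int = 0):
    """Return (config, next_pos) for ambisonics_mono_config() at data[pos]."""
    output_channel_count = data[pos]          # C
    substream_count = data[pos + 1]           # N
    start = pos + 2
    end = start + output_channel_count
    channel_mapping = list(data[start:end])   # 8 * C bits = C bytes
    if len(channel_mapping) != output_channel_count:
        raise ValueError("truncated channel_mapping")
    config = {
        "output_channel_count": output_channel_count,
        "substream_count": substream_count,
        "channel_mapping": channel_mapping,
    }
    return config, end


def demixing_matrix_size_bytes(n: int, m: int, c: int) -> int:
    """Size of demixing_matrix: 16 * (N + M) * C bits, expressed in bytes."""
    return 2 * (n + m) * c
```

For first-order Ambisonics coded as four mono substreams (C = 4, N = 4), the mono configuration occupies 6 bytes.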

4.2.3. Demixing Info Syntax and Semantics

demixing_info() specifies demixing parameter mode to be used to reconstruct output channel audio according to its loudspeaker_layout.

Syntax

class demixing_info() {
  unsigned int (3) dmixp_mode;
  unsigned int (5) reserved;
}

Semantics

dmixp_mode shall indicate a mode of pre-defined combinations of five demix parameters.

alpha and beta shall be gain values used for S7to5 down-mixer, gamma for T4to2 down-mixer, delta for S5to3 down-mixer and w_idx_offset shall be the offset to generate a gain value w used for T2toTF2 down-mixer.

IA Down-mix Mechanism

4.2.4. Recon Gain Info Syntax and Semantics

recon_gain_info() contains recon gain values for demixed channels.

Syntax

class recon_gain_info() {
  for (i=0; i< channel_audio_layer; i++) {
    if (recon_gain_is_present_flag(i) == 1) {
      leb128() recon_gain_flags(i);
      for (j=0; j< n(i); j++) {
        if (recon_gain_flag(i)(j) == 1)
          unsigned int (8) recon_gain;
      }
    }
  }
}

Semantics

recon_gain_flags shall indicate the channels which recon_gain is applied to.

recon_gain shall indicate the gain value to be applied to the channel, which is indicated by recon_gain_flags, after decoding of the following associated frames.

4.2.5. Rendering Config Syntax and Semantics

Need to review the matrix-based rendering mechanisms.

rendering_config() provides a rendering matrix to be applied to the audio element.

The matrix shall be composed of targetChannelCount x baseChannelCount.

Syntax

class rendering_config() {
  unsigned int (2) rendering_mode;
  unsigned int (6) reserved;
  if (rendering_mode == 0) {
    //No rendering
  }
  else if (rendering_mode == 1){
    for (i = 1; i <= targetChannelCount; i++) {
      for (j = 1; j <= baseChannelCount; j++) {
        signed int (16) rendering_coefficient;
      }
    }
  }
  else if (rendering_mode == 2){
    leb128() num_nonzero_rows;
    for (i = 1; i <= num_nonzero_rows; i++) {
      leb128() row_index(i);
      leb128() num_nonzero_coefficients;
      for (j = 1; j <= num_nonzero_coefficients; j++) {
        leb128() column_index(j);
        signed int (16) rendering_coefficient;
      }
    }
  }

}

Semantics

rendering_mode shall indicate the rendering mode which is applied to the audio element.

rendering_coefficient is a value in dB. It shall be stored as a 16-bit, signed, two’s complement fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format]).

The value of -128 dB shall indicate the real number 0.

num_nonzero_rows is the number of rows that have at least one non-zero coefficient.

row_index(i) is the row number of the ith row which has at least one non-zero coefficient, in the targetChannelCount x baseChannelCount matrix.

num_nonzero_coefficients is the number of non-zero coefficients in the row indicated by row_index(i).

column_index(j) is the column number of the jth non-zero coefficient of the given row_index(i).
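The Q7.8 interpretation of rendering_coefficient can be illustrated with a short sketch. It assumes that the dB value is an amplitude gain (20*log10), in line with the Output_Gain formula used elsewhere in this specification; the helper names are hypothetical.

```python
def q7_8_to_db(raw: int) -> float:
    """Interpret a signed 16-bit integer as a Q7.8 fixed-point value in dB."""
    if not -32768 <= raw <= 32767:
        raise ValueError("raw value does not fit in a signed 16-bit integer")
    return raw / 256.0  # 8 fractional bits


def coefficient_to_linear(raw: int) -> float:
    """Convert a raw rendering_coefficient to a linear factor.

    -128 dB (raw -32768) is the special value for the real number 0.
    Assumption: the dB value is an amplitude gain, hence the division by 20.
    """
    db = q7_8_to_db(raw)
    if db == -128.0:
        return 0.0
    return 10.0 ** (db / 20.0)
```

For instance, a raw value of -1536 is -6 dB, which corresponds to a linear factor of roughly 0.5.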

4.2.6. Element Mix Config Syntax and Semantics

Need to review the mix mechanisms.

element_mix_config() provides a gain value to be applied to the audio element.

Syntax

class element_mix_config() {
  signed int (16) mix_gain;
}

Semantics

mix_gain is a value in dB. For mixing of the audio element, this gain shall be applied to all channels of the audio element.

It shall be stored in a 16-bit, signed, two’s complement fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format]).

4.2.7. Mix Loudness Info Syntax and Semantics

Syntax

class mix_loudness_info() {
  signed int (16) mix_loudness;
}

Semantics

mix_loudness shall indicate the loudness value of the mixed channels, for the mix_target_layout, from the audio elements which are specified in the mix_presentation_obu(). It is stored as a fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format]). The value should be in LKFS based on [ITU1770-4], so it shall represent zero or a negative value.

4.2.8. Mix Bus Config Syntax and Semantics

Syntax

class mix_bus_config() {
  drc_config();
}

class drc_config() { }

Semantics

NOTE: drc_config() has an empty payload.

4.3. Codec Specific

This section defines codec specific information for Codec_Specific_Info and Substream.

For legacy codecs, Decoder_Config() shall carry exactly the same information that a conventional file parser feeds to the codec decoder to decode the substream. For future codecs, Decoder_Config() shall include all decoding parameters that are required to decode Substreams.

4.3.1. IAC-OPUS Specific

Codec_Specific_Info for IAC-OPUS shall conform to ID Header with ChannelMappingFamily = 0 of [RFC7845] with following constraints:

Substream format shall be opus packet of [RFC6716] which contains only one single frame of mono or stereo channels and which has non-delimiting frame structure.

4.3.2. IAC-AAC-LC Specific

Codec_ID shall be mp4a.

Decoder_Config() for IAC-AAC-LC shall be DecoderConfigDescriptor() of [MP4-Systems], which is a subset of ESDBox for [MP4-Audio], with following constraints:

Substream format shall be one single raw_data_block() of [AAC] which contains only one single frame of mono or stereo channels.

4.3.3. IAC-FLAC Specific

Codec_ID shall be fLaC, the FLAC stream marker in ASCII, meaning byte 0 of the stream is 0x66, followed by 0x4C 0x61 0x43.
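The byte values above can be checked against the ASCII marker with a few lines. This is only an illustration; the names FLAC_CODEC_ID and is_flac_codec_id are hypothetical.

```python
# The four Codec_ID bytes listed above, in stream order.
FLAC_CODEC_ID = bytes([0x66, 0x4C, 0x61, 0x43])


def is_flac_codec_id(codec_id: bytes) -> bool:
    """Return True if the 4-byte Codec_ID is the FLAC stream marker."""
    return codec_id == FLAC_CODEC_ID
```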

Decoder_Config() for IAC-FLAC shall be METADATA_BLOCK of [FLAC].

Substream format shall be FRAME of [FLAC], which is composed of FRAME_HEADER, followed by SUBFRAME(s) (one SUBFRAME per channel) and followed by FRAME_FOOTER.

4.3.4. IAC-LPCM Specific

Codec_ID shall be lpcm.

Decoder_Config() for IAC-LPCM shall be as follows:

class decoder_config(lpcm) {
  unsigned int (32) sample_rate;
}

sample_rate shall indicate the sample rate of the input audio in Hz. It shall be little endian.
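A minimal sketch of reading this field, assuming the payload passed in starts at the first byte of decoder_config(lpcm). The function name is hypothetical.

```python
import struct


def parse_lpcm_decoder_config(payload: bytes) -> int:
    """Return sample_rate in Hz from a decoder_config(lpcm) payload."""
    if len(payload) < 4:
        raise ValueError("decoder_config(lpcm) needs 4 bytes")
    # "<I" is an unsigned 32-bit integer in little-endian byte order.
    (sample_rate,) = struct.unpack_from("<I", payload, 0)
    return sample_rate
```

For example, the bytes 0x80 0xBB 0x00 0x00 decode to 48000 Hz.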

Substream format shall be the audio samples for the frame size.

5. Profiles

The IA Profiles define a set of capabilities that are required to decode the corresponding IA bitstream.

Additionally, the following restrictions shall apply to all profiles:

  1. IA bitstream shall be composed of a series of OBUs, starting with the descriptor OBUs that make up the global descriptor (namely one single OBU_Start_Code, followed by one or more OBU_IA_Codec_Configs, followed by one or more OBU_IA_Audio_Elements and followed by zero or more OBU_Mix_Presentations), followed by zero or more OBU_IA_Parameter_Blocks and one or more OBU_IA_Audio_Frames.

    • The number of Codec Config OBUs shall be less than or equal to the number of Audio Element OBUs.

  2. The descriptor OBUs shall identify the configuration of the one or more audio elements that follow.

  3. If OBU_IA_Parameter_Block is present, the global descriptor shall be followed by one or more parameter blocks. The one or more parameter blocks may redundantly contain information specified by previous parameter blocks that are still valid for the time after the descriptor OBUs.

  4. Substream IDs may be omitted if the audio frames from the various audio substreams are ordered in the bitstream in a well-defined way that remains unchanged, such that an implicit ordering can be reliably inferred. In a bitstream, the substream IDs shall be either included for all substreams or omitted for all substreams; the bitstream shall not have a combination of both.

    • There may be OBU_IA_Temporal_Delimiter present.

    • If every audio frame OBU has its own obu_id representing its substream_id, then the temporal delimiter OBU shall not be present.

  5. The table below indicates the usage of flags in the OBU header according to OBU type.

OBU Type / Flags  : obu_id_flag : obu_sync_flag : obu_duration_flag : obu_counter_flag
--------------------------------------------------------------------------------------
Start Code OBU    : No Use      : No Use        : No Use            : No Use
Codec Config OBU  : No Use      : No Use        : No Use            : Use
Audio Elem. OBU   : Use         : No Use        : No Use            : No Use
Mix Present. OBU  : Use         : No Use        : No Use            : No Use
Audio Frame OBU   : Use         : Use           : No Use            : No Use
Para. Block OBU   : Use         : Use           : Use               : Use
Temp. Del. OBU    : Use         : No Use        : No Use            : No Use
Sync OBU          : Use         : Use           : No Use            : No Use
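The ordering restriction in item 1 can be sketched as a simple check over a list of OBU type names. The type names and one-letter codes are hypothetical labels used only for this sketch, and temporal delimiter and sync OBUs are deliberately not modelled, so any sequence containing them is rejected here.

```python
import re

# Hypothetical one-letter codes for the OBU types named in item 1.
CODES = {
    "start_code": "S",
    "codec_config": "C",
    "audio_element": "E",
    "mix_presentation": "M",
    "parameter_block": "P",
    "audio_frame": "F",
}

# One start code, 1+ codec configs, 1+ audio elements, 0+ mix presentations,
# then any interleaving of parameter blocks and audio frames.
ORDER = re.compile(r"SC+E+M*[PF]*")


def is_valid_obu_order(obu_types) -> bool:
    """Check the descriptor order and the codec-config count restriction."""
    coded = "".join(CODES.get(t, "?") for t in obu_types)
    if ORDER.fullmatch(coded) is None:
        return False
    if "F" not in coded:  # at least one audio frame is required
        return False
    # Codec Config OBUs shall not outnumber Audio Element OBUs.
    return coded.count("C") <= coded.count("E")
```

A sequence such as start_code, codec_config, audio_element, audio_frame passes, whereas swapping codec_config and audio_element fails the check.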

5.1. IA Simple Profile

This section defines the simple profile for IA bitstream and IA decoder and specifies the conformance points of this profile.

Restrictions on IA bitstream:

OBU Type / Flags  : obu_id_flag : obu_sync_flag : obu_duration_flag : obu_counter_flag
--------------------------------------------------------------------------------------
Start Code OBU    : No Use      : No Use        : No Use            : No Use
Codec Config OBU  : No Use      : No Use        : No Use            : No Use
Audio Elem. OBU   : No Use      : No Use        : No Use            : No Use
Mix Present. OBU  : Use         : No Use        : No Use            : No Use
Audio Frame OBU   : No Use      : No Use        : No Use            : No Use
Para. Block OBU   : Use         : No Use        : No Use            : No Use

Capabilities of IA decoder:

5.2. IA Base Profile

This section defines the base profile for IA bitstream and IA decoder and specifies the conformance points of this profile.

Restrictions on IA bitstream:

OBU Type / Flags  : obu_id_flag : obu_sync_flag : obu_duration_flag : obu_counter_flag
--------------------------------------------------------------------------------------
Start Code OBU    : No Use      : No Use        : No Use            : No Use
Codec Config OBU  : No Use      : No Use        : No Use            : No Use
Audio Elem. OBU   : Use         : No Use        : No Use            : No Use
Mix Present. OBU  : Use         : No Use        : No Use            : No Use
Audio Frame OBU   : No Use      : No Use        : No Use            : No Use
Para. Block OBU   : Use         : No Use        : No Use            : Use
Temp. Del. OBU    : Use         : No Use        : No Use            : No Use

Capabilities of IA decoder:

5.3. IA Enhanced Profile

This section defines the enhanced profile for IA sequence and IA decoder and specifies the conformance points of this profile.

Restrictions on IA sequence:

OBU Type / Flags  : obu_id_flag : obu_sync_flag : obu_duration_flag : obu_counter_flag
--------------------------------------------------------------------------------------
Start Code OBU    : No Use      : No Use        : No Use            : No Use
Codec Config OBU  : No Use      : No Use        : No Use            : Use
Audio Elem. OBU   : Use         : No Use        : No Use            : No Use
Mix Present. OBU  : Use         : No Use        : No Use            : No Use
Audio Frame OBU   : Use         : Use           : No Use            : No Use
Para. Block OBU   : Use         : Use           : Use               : Use
Temp. Del. OBU    : Use         : No Use        : No Use            : No Use
Sync OBU          : Use         : Use           : No Use            : No Use

Capabilities of IA decoder:

6. Standalone IAC Representation

Needs a lot more details.

Global descriptors are to be repeated as frequently as needed to enable joining mid-stream. Each repetition must be followed by parameter OBUs that redundantly copy previous parameter OBUs that are still valid for the time after the global descriptors. That is, a decoder joining mid-stream that encounters a Start Code OBU knows that the next OBUs will give it complete information to start decoding the following audio frames.

Sync OBUs may be placed as frequently as needed in the bitstream.

Parameter blocks may be placed as frequently as needed in the bitstream.

TODO: need to include more information about packing order of OBUs, timing and synchronizing the OBUs.

Below is copy-pasted from old version ("Immersive Audio Bitstream Definition"). TODO: refactor and update.

6.1. Immersive Audio Bitstream Definition

An immersive audio (IA) sequence shall include one or more audio elements, each of which shall consist of one or more audio substreams. An IA sequence shall start with IA_Stream_Indicator and be followed by a sequence of IA bitstreams.

Each IA bitstream shall be self-decodable and shall be composed of one single global descriptor followed by one or more temporal units.

There can be two types of IA bitstreams. The first is an IA bitstream with one single frame size and the second is an IA bitstream with two or more different frame sizes. The conceptual diagrams are shown in the two figures below.

Immersive Audio Bitstream with one single frame size (before OBU packing).

In the first type of IA bitstream with one single frame size, all audio substreams in the same IA bitstream shall have the same audio_substream_config().

After OBU packing, temporal_delimiter_obu may be present at the front of every temporal unit to indicate the start of the temporal unit. If present, it shall be present at the front of every temporal unit.

In this case, the audio frames depicted in the diagram above are a set of audio substream frames with the same sync offsets.

Immersive Audio Bitstream with two frame sizes (before OBU packing).

In the second type of IA bitstream with two or more frame sizes, the audio substreams in the same IA bitstream may have different audio_substream_config() with different num_samples_per_frame.

The diagram above depicts an example case where there are two frame sizes in the same IA bitstream. In this case, the audio frames are a set of audio substream frames with the same sync offsets and the same frame size.

6.2. Bitstream Packing

All metadata within an IA bitstream contains synchronization information that includes the sync offset and the duration for which it is valid. This is used when determining the order in which the metadata is packed in the bitstream.

As an illustrative example, consider a bitstream that contains the following:

  1. Two substreams, one coded with Codec A and the second coded with Codec B.

    • Codec A has a frame size of 20 ms.

    • Codec B has a frame size of 30 ms.

  2. Three parameters with different update rates.

    • Parameter A has an update rate of 10 ms.

    • Parameter B has an update rate of 40 ms.

    • Parameter C has a variable update rate.

The figure below shows the metadata and substreams, and the bitstream packing for this example.

Example of how substreams and parameter metadata with different update rates should be packed in the bitstream.
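One possible reading of this example is sketched below: every frame and parameter update is placed on a timeline by its start offset, and the bitstream order follows that timeline. The tie-break rule (parameters before audio frames that start at the same time) is an assumption of this sketch and is not stated by the text above. Parameter C is left out because its update rate is variable. The function name packing_order is hypothetical.

```python
def packing_order(streams, total_ms):
    """Return (start_ms, name) pairs in a candidate packing order.

    streams: list of (name, period_ms, is_parameter) tuples.
    Assumption: items are ordered by start offset, and at equal offsets
    parameter metadata is packed before audio frames.
    """
    items = []
    for name, period_ms, is_parameter in streams:
        for start in range(0, total_ms, period_ms):
            items.append((start, 0 if is_parameter else 1, name))
    items.sort()  # tuple order: start offset, then tie-break, then name
    return [(start, name) for start, _, name in items]


example = [
    ("Codec A", 20, False),
    ("Codec B", 30, False),
    ("Parameter A", 10, True),
    ("Parameter B", 40, True),
]
order = packing_order(example, total_ms=60)
```

Over the first 60 ms this yields 13 items, starting with both parameters and both codec frames at offset 0.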

7. ISOBMFF IAC Encapsulation

7.1. General Requirements & Brands

A file conformant to this specification satisfies the following:

Parsers shall support the structures required by the iso6 brand and MAY support structures required by further ISOBMFF structural brands.

7.2. ISOBMFF IAC Encapsulation with single track

This section describes the basic data structures used to signal encapsulation of IA sequence in [ISOBMFF] containers.

7.2.1. Requirement of IA bitstream

IA bitstream shall comply with the bitstream which is specified in § 5.1 IA Simple Profile or § 5.2 IA Base Profile for encapsulation into ISOBMFF with a single track.

7.2.2. Encapsulation Scheme

During the encapsulation process, OBUs of the IA bitstream are encapsulated into [ISOBMFF] as follows:

IAC Encapsulation Scheme

7.2.3. IA Sample Entry

Sample Entry Type: aiac
Container:         Sample Description Box ('stsd')
Mandatory:         Yes
Quantity:          One or more.

The IASampleEntry identifies that the track contains IA Samples, and uses one single codec specific box.

Syntax

class IASampleEntry extends AudioSampleEntry('aiac') {
  unsigned int (8) version;
  unsigned int (8) profile_version;
  CodecSpecificBox config;
}

No optional boxes of AudioSampleEntry shall be present.

Semantics

Both channelcount and samplerate fields of AudioSampleEntry shall be ignored.

version and profile_version shall be the same as the ones in start_code_obu.

7.2.4. Codec Specific Box

This section describes a codec specific box for the decoding parameters, which is defined by codec_id of audio_substream_config(), to decode one single substream of IA bitstream. aiac shall contain only one single codec specific box regardless of the number of substreams in IA bitstream. So, the codec specific box is applied to all of substreams in sample data.

7.2.4.1. OPUS Specific Box

This shall be OpusSpecificBox (dOps) for the Opus AudioSampleEntry which is specified in [OPUS-IN-ISOBMFF].

Box Type:  dOps
Container: IA Sample Entry ('aiac')
Mandatory: Yes
Quantity:  One

This box shall be for one single substream.

Syntax

It shall be the same as the dOps box for Opus, except that ChannelMappingFamily shall be set to 0.

Semantics

It shall be the same as the semantics in [OPUS-IN-ISOBMFF] except for the following:

7.2.4.2. MP4A Specific Box

This shall be ESDBox (esds) for mp4a which is specified in [MP4].

Box Type:  esds
Container: IA Sample Entry ('aiac')
Mandatory: Yes
Quantity:  One or more

This box shall be for one single Substream.

Syntax

It shall be the same as esds box for Low Complexity Profile of [AAC] (AAC-LC).

Semantics

It shall be the same as the semantics in esds except for the following:

We need to add specific boxes for FLAC and LPCM.

7.2.5. IA Sample Format

For tracks using the IASampleEntry, an IA Sample has the following constraints:

7.2.6. IA Sample Group

7.2.6.1. Global Descriptor Sample Group

During the encapsulation process, the global descriptor shall be discarded from the IA bitstream. A new sample group for the global descriptor shall be defined by using sgpd and sbgp boxes with the following requirements:

7.2.6.2. Demixing Info Sample Group

During the encapsulation process, parameter_block_obu for demixing_info shall be discarded from the IA bitstream. A new sample group for demixing_info() shall be defined by using sgpd and sbgp boxes with the following requirements:

7.3. Common Encryption

TBA

7.4. Codecs Parameter String

DASH and other applications require defined values for the Codecs parameter specified in [RFC6381] for ISO Media tracks. The codecs parameter string for the AOM IA codec shall be:
aiac.IAC-specific-needs.Opus
aiac.IAC-specific-needs.mp4a.40.2
aiac.IAC-specific-needs.fLaC
aiac.IAC-specific-needs.lpcm

IAC-specific-needs shall be V.PV as follows:

For example, for this version of the specification

aiac.0000.0000.Opus
aiac.0000.0100.mp4a.40.2
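A minimal sketch of splitting such a codecs parameter string into its parts is given below. It only relies on the dot-separated layout shown in the examples (aiac, then V, then PV, then the codec-specific remainder) and does not interpret the V and PV values. The function name is hypothetical.

```python
def parse_iac_codecs_string(value: str) -> dict:
    """Split 'aiac.V.PV.<codec...>' into its dot-separated parts."""
    parts = value.split(".")
    if len(parts) < 4 or parts[0] != "aiac":
        raise ValueError("not an aiac codecs parameter string")
    return {
        "sample_entry": parts[0],
        "V": parts[1],
        "PV": parts[2],
        # The remainder may itself contain dots, e.g. "mp4a.40.2".
        "codec": ".".join(parts[3:]),
    }
```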

8. ISOBMFF IAC Decapsulation

8.1. ISOBMFF IAC Decapsulation with single track

This section provides a guideline for an IAC parser to reconstruct IA bitstreams from an IAC file.

When the IAC parser feeds the reconstructed IA bitstreams to the IAC-OBU parser, the descriptor OBUs shall be placed first, followed by Temporal Units.

Below figure shows the mirroring process of the encapsulation scheme of IA bitstream specified in § 7 ISOBMFF IAC Encapsulation.

IAC Decapsulation Guideline

During decapsulation process, IAC file is decapsulated into IA bitstreams which conform to § 4 Open Bitstream Unit (OBU) Syntax and Semantics as follows:

codec_id and decoder_config() for IAC-OPUS is generated as follows:

codec_id and decoder_config() for IAC-AAC-LC is generated as follows:

9. IAC processing

This section provides a guideline for IA decoding for a given IA bitstream.

IA decoding can be done by using the combination of following decoding processing.

Ambisonics decoding: it shall conform to [RFC8486] except for codec specific processing and shall output Ambisonics channels in ACN (Ambisonics Channel Number) order.

Scalable Channel Audio decoding: it shall output the channel audio (e.g. 3.1.2ch or 7.1.4ch) for the target channel layout.

IA decoder is composed of OBU parser, Codec decoder, Audio Element Renderer and Post-processor as depicted in below figure.

IA Decoder Configuration

9.1. Ambisonics decoding

This section describes the decoding of Ambisonics.

Below figure shows the decoding flowchart of Ambisonics decoding.

Ambisonics Decoding Flowchart

9.2. Scalable Channel Audio decoding

This section describes the decoding of Scalable Channel Audio.

Below figure shows the decoding flowchart of the decoding for Scalable Channel Audio.

Scalable Channel Audio Decoding Flowchart

For a given loudspeaker layout (i.e. CL #i) among the list of loudspeaker_layout in scalable channel layout config,

Following sections, § 9.2.1 Gain, § 9.2.2 De-mixer and § 9.2.3 Recon Gain are only needed for decoding of scalable audio with num_layers > 1.

9.2.1. Gain

Gain module is the mirror process of Attenuation module. It recovers the reduced sample values using Output_Gain when its flag for ChannelGroup #j is on. When its flag is off, then this module shall be bypassed for ChannelGroup #j. Output_Gain(j) for ChannelGroup #j shall be applied to all samples of the mixed channels in the ChannelGroup #j. Where, mixed channels means the mixed channels from an input channel audio (i.e. a channel audio for CL #n).

To apply the gain, an implementation MUST use the following:

Sample *= pow(10, Output_Gain(j) / (20.0*256))

Where, Output_Gain(j) is the raw 16-bit value for jth layer which is specified in channel audio layer config.
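The formula above can be sketched as follows, assuming the samples of the mixed channels are available as a list of floating-point values. The function name apply_output_gain is hypothetical.

```python
def apply_output_gain(samples, output_gain_raw):
    """Scale samples by the gain encoded in the raw 16-bit Output_Gain(j).

    output_gain_raw / 256 is the gain in dB (8 fractional bits), and
    dividing by 20 converts dB to a base-10 exponent for amplitude.
    """
    scale = pow(10, output_gain_raw / (20.0 * 256))
    return [sample * scale for sample in samples]
```

A raw value of 0 leaves the samples unchanged, and a raw value of 5120 (20 dB) multiplies every sample by 10.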

9.2.2. De-mixer

For scalable channel audio with num_layers > 1, some channels of down-mixed audio for CL #i are delivered as is but the rest are mixed with other channels for CL #i-1.

De-mixer module reconstructs the rest of the down-mixed audio for CL #i from the mixed channels, which is passed by Gain module, and its relevant non-mixed channels using its relevant demixing parameters.

De-mixing for down-mixed audio for CL #i shall comply with the result by the combination of following surround and top de-mixers:

Initially, wIdx(0) = 0 and the value of wIdx(k) shall be derived as follows:

Mapping of wIdx(k) to w(k) should be as follows:

wIdx(k) :   w(k)
   0    :    0
   1    :  0.0179
   2    :  0.0391
   3    :  0.0658
   4    :  0.1038
   5    :  0.25
   6    :  0.3962
   7    :  0.4342
   8    :  0.4609
   9    :  0.4821
   10   :  0.5

When D_set = { x | S1 < x ≤ Si and x is an integer},

When Ti = 2,

When Ti = 4,

For example, when CL #1 = 2ch, CL #2 = 3.1.2ch, CL #3 = 5.1.2ch and CL #4 = 7.1.4ch, the rest (i.e. Ls5/Rs5/Ltf/Rtf) of the down-mixed 5.1.2ch is reconstructed as follows:

Ls5 = 1/δ(k) × (L2 - 0.707 × C - L5) and Rs5 = 1/δ(k) × (R2 - 0.707 × C - R5).
Ltf = Ltf3 - w(k) × (L2 - 0.707 × C - L5) and Rtf = Rtf3 - w(k) × (R2 - 0.707 × C - R5).
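The example equations can be written out per sample as in the sketch below. The w(k) values are taken from the wIdx(k) table above; the derivation of wIdx(k) itself is not covered, so w_idx is passed in directly. The function name and the dictionary layout are hypothetical.

```python
# w(k) values indexed by wIdx(k), as listed in the table above.
W_TABLE = [0.0, 0.0179, 0.0391, 0.0658, 0.1038, 0.25,
           0.3962, 0.4342, 0.4609, 0.4821, 0.5]


def demix_3_1_2_to_5_1_2(L2, R2, C, L5, R5, Ltf3, Rtf3, delta, w_idx):
    """Reconstruct Ls5/Rs5/Ltf/Rtf for one sample of the 5.1.2ch layout."""
    w = W_TABLE[w_idx]
    # (L2 - 0.707 x C - L5) equals delta x Ls5 in the down-mix equations.
    left_residual = L2 - 0.707 * C - L5
    right_residual = R2 - 0.707 * C - R5
    return {
        "Ls5": left_residual / delta,
        "Rs5": right_residual / delta,
        "Ltf": Ltf3 - w * left_residual,
        "Rtf": Rtf3 - w * right_residual,
    }
```

Feeding in values produced by the S5to3, S3to2 and T2toTF2 down-mix equations returns the original Ls5 and Ltf2 values up to rounding error.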

9.2.3. Recon Gain

Recon_Gain shall be applied only to the audio samples of the de-mixed channels from the De-mixer module.

Below figure shows the smoothing scheme of Recon_Gain.

Smoothing Scheme of Recon Gain

Recommended values for specific codecs are as follows:

9.3. Mix Presentation

//To Do: Fill in the text

9.3.1. Rendering for Audio Element

This section provides a guideline for rendering based on the rendering_config() which is specified in the mix presentation OBU.

//To Do: Fill in rendering method for scene-based audio element if any

//To Do: Fill in rendering method for channel-based audio element if any

9.3.2. Mixing for Audio Elements

This section provides a guideline for mixing based on the element_mix_config() which is specified in the mix presentation OBU.

When the output channel audio of a scene-based audio element or channel-based audio element does not match the loudspeaker layout which is indicated by mix_target_layout in the mix presentation OBU:

Down-mixing matrices, which are specified in § 9.5 Down-mix Matrix, are recommended for down-mixing of the output channel audio.

When multiple audio elements are mixed into one channel audio:

After relevant processing, multiple audio elements are mixed into one channel audio according to the target loudspeaker layout with the target sampling rate by considering the synchronization in audio sample by audio sample among them.

//To Do: Fill in the text based on element_mix_config()

9.4. Post Processing

9.4.1. Loudness Normalization

Loudness normalization is done by adjusting a loudness level to -24 LKFS based on the loudness value of the target channel layout (i.e. CL #i) which is signaled in Channel_Audio_Layer_Config() or the loudness value in mix presentation OBU.

Real implementations for § 9.4.1 Loudness Normalization, § 9.4.2 DRC Control and § 9.4.3 Limiter are solely dependent on implementers (i.e., out of scope of this specification) unless the mix presentation OBU provides algorithms for those. This specification only recommends the principles for the former.

9.4.2. DRC Control

In this specification, DRC control can be guided by a pre-defined DRC or by the algorithm specified in mix presentation OBU.

For the pre-defined DRC, an input loudness of -24 LKFS and a target output loudness of -16 LKFS are assumed. The DRC control module applies the pre-defined DRC compression, assuming that the target loudness is adjusted to -16 LKFS, as follows:

Below figure shows the schematic diagram of the pre-defined DRC compression.

Pre-defined DRC Compression Scheme

The below is the equation that represents the pre-defined DRC compression scheme.

Y = D_T(i) + (X - T(i)) / R(i). Where,
X ∈ Seg(i) and D_T (i) = T(0) + ∑ ((T(k+1) - T(k)) / R(k)) (k = 0 to i-1).
Seg(i): ith Segment
 T(i) : Threshold value in dBFS for Seg(i), where T(0) = -96.33
 R(i) : Ratio value for Seg(i)
D_T(i): Threshold value in dBFS for Seg(i) after DRC compression, where D_T(0) = T(0)
  X   : Input sample value in dBFS
  Y   : Output sample value in dBFS
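The equation can be evaluated as in the following sketch. The segment thresholds and ratios of the pre-defined DRC are not listed in this section, so they are passed in as parameters; passing input below T(0) through unchanged is an assumption of this sketch. The function name is hypothetical.

```python
def drc_output_level(x_dbfs, thresholds, ratios):
    """Map an input level X (dBFS) to the output level Y (dBFS).

    thresholds[i] is T(i) in ascending order and ratios[i] is R(i).
    Assumption: input below T(0) is returned unchanged.
    """
    if not thresholds or len(thresholds) != len(ratios):
        raise ValueError("need one ratio per threshold")
    if x_dbfs < thresholds[0]:
        return x_dbfs
    # Seg(i) is the last segment whose threshold does not exceed X.
    i = 0
    while i + 1 < len(thresholds) and x_dbfs >= thresholds[i + 1]:
        i += 1
    # D_T(i) = T(0) + sum over k = 0..i-1 of (T(k+1) - T(k)) / R(k)
    d_t = thresholds[0] + sum(
        (thresholds[k + 1] - thresholds[k]) / ratios[k] for k in range(i)
    )
    return d_t + (x_dbfs - thresholds[i]) / ratios[i]
```

With purely illustrative values T = (-96.33, -40, -10) and R = (1, 2, 4), an input of -20 dBFS falls in Seg(1) and maps to about -30 dBFS.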

9.4.3. Limiter

This module limits the true peak of the input signal at -1 dB. The definition of the true peak is based on [ITU1770-4].

Below is a recommended loudness normalization and DRC control principle according to application.

NOTE: The definitions of AV, TV and Mobile applications are as follows:

    • AV application: Sound devices with external speakers such as soundbars, AV receivers and HiFi speakers.

    • TV application: Televisions with built-in speakers such as LCD/OLED slim TVs.

    • Mobile application: Handheld devices with built-in speakers such as smartphones and tablets.

9.5. Down-mix Matrix

9.5.1. Static Down-mix Matrix

This section recommends static down-mix matrices.

IAC players need to support any valid channel layout, even if the number of channels does not match the physically connected audio hardware. Players need to perform channel mixing to increase or reduce the number of channels as needed.

Implementations can use the matrices below to implement down-mixing from the output channel audio, which are known to give acceptable results for stereo, 5.1ch, 7.1ch and 3.1.2ch.

Down-mixing can be done directly by using one of the matrices below or a combination of them. For example, stereo down-mixing for 7.1.4ch can be done by the combination of the 7.1ch down-mix matrix for 7.1.4ch, 5.1ch down-mix matrix for 7.1ch and stereo down-mix matrix for 5.1ch.

The figures below show recommended static down-mix matrices to stereo, 5.1ch and 7.1ch.

7.1ch Down-mix matrix for 7.1.4ch
7.1ch Down-mix matrix for 7.1.2ch
5.1ch Down-mix matrix for 5.1.4ch
5.1ch Down-mix matrix for 5.1.2ch
5.1ch Down-mix matrix for 7.1ch
Stereo Down-mix matrix for 5.1ch
Stereo Down-mix matrix for 3.1.2ch

The figures below show static down-mix matrices to 3.1.2ch.

3.1.2ch Down-mix matrix for 5.1.2ch
3.1.2ch Down-mix matrix for 5.1.4ch
3.1.2ch Down-mix matrix for 7.1.2ch
3.1.2ch Down-mix matrix for 7.1.4ch

Where, p1 = 0.707 and p2 = 0.3535. Implementations may use limiter defined in § 9.4.3 Limiter to preserve energy of audio signals instead of normalization factors.

9.5.2. Dynamic Down-mix Matrix

This section recommends dynamic down-mixing matrices.

The dynamic down-mixing matrices shall comply with the down-mixing mechanism which is specified in § 10.2.2 Down-mix Mechanism.

10. IAC Generation Process

This section provides a guideline for IA encoding for a given input audio format.

Recommended input audio format for IA encoding is as follows:

For a given input audio and user inputs, IA encoder shall output IA bitstream which conforms to § 4 Open Bitstream Unit (OBU) Syntax and Semantics.

Input audio shall be one of followings:

User inputs are:

IA encoding can be done by using the combination of following generation processing.

The below figure shows IA encoder configuration for one single audio element.

The IA encoder is composed of Pre-processor, Codec encoder and OBU packetizer.

IA Encoder Configuration

The order of substreams in each ChannelGroup shall be as follows:

Where, non-coupled substream is a coded substream from one of non-coupled channels.

10.1. Ambisonics Encoding

For Ambisonics encoding:

10.2. Scalable Channel Audio Encoding

For Scalable Channel Audio encoding:

Below figure shows IA encoding flowchart for Scalable Channel Audio.

IA Encoding Flowchart for Scalable Channel Audio

The following sections, § 10.2.1 Down-mix parameter and Loudness, § 10.2.2 Down-mix Mechanism, § 10.2.3 Channel Layout Generation Rule, § 10.2.4 Recon Gain Generation and § 10.2.5 ChannelGroup Generation Rule, are not needed for non-scalable channel audio (i.e., when Channel_Audio_Layer of IA_Static_Meta is set to 1).

10.2.1. Down-mix parameter and Loudness

This section describes how to generate down-mix parameters and loudness level for a given channel audio and a given list of channel layouts for scalability.

Below figure shows a block diagram for down-mix parameter and loudness module including down-mixer.

IA Down-mix Parameter and Loudness

For a given Channel Audio (e.g. 7.1.4ch) and a given list of channel layouts based on the Channel Audio,

10.2.2. Down-mix Mechanism

This section specifies the down-mixing mechanism to generate down-mixed audio for scalable channel audio.

For a given Channel Audio which conforms to loudspeaker_layout, the surround channels and the top channels (if any) are down-mixed separately, step by step, until the target channels are obtained.

Implementers may use another method to get the down-mixed audio from the given channel audio, but the resulting down-mixed audio shall comply with the result of the mechanism in this section.

Therefore, a down-mixer based on the down-mix mechanism is a combination of the following surround down-mixer(s) and top down-mixer(s), as depicted in the figure below.

S7to5 enc.: Ls5 = α(k) x Lss7 + β(k) x Lrs7 and Rs5 = α(k) x Rss7 + β(k) x Rrs7.
S5to3 enc.: L3 = L5 + δ(k) x Ls5 and R3 = R5 + δ(k) x Rs5
S3to2 enc.: L2 = L3 + 0.707 x C and R2 = R3 + 0.707 x C
S2to1 enc.: Mono = 0.5 x (L2 + R2)
T4to2 enc.: Ltf2 = Ltf4 + γ(k) x Ltb4  and Rtf2 = Rtf4 + γ(k) x Rtb4.
T2toTF2 enc.: Ltf3 = Ltf2 + w(k) x δ(k) x Ls5 and Rtf3 = Rtf2 + w(k) x δ(k) x Rs5.
IA Down-mix Mechanism
For example, to get down-mixed 3.1.2ch from 7.1.4ch:
- S3 of 3.1.2ch is generated by using S7to5 and S5to3 encs.
- TF2 of 3.1.2ch is generated by using T4to2 and T2toTF2 encs.
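The 7.1.4ch to 3.1.2ch example can be sketched per sample by chaining the equations above. The sketch assumes that the front channels (L, R, C, LFE) of 7.1.4ch pass through S7to5 unchanged, so that L5 = L7 and R5 = R7, which the equations above imply but do not state. The function name and channel keys are hypothetical.

```python
def downmix_7_1_4_to_3_1_2(ch, alpha, beta, gamma, delta, w):
    """Down-mix one 7.1.4ch sample (dict of channel values) to 3.1.2ch."""
    # S7to5: fold the side and rear surrounds into Ls5/Rs5.
    Ls5 = alpha * ch["Lss7"] + beta * ch["Lrs7"]
    Rs5 = alpha * ch["Rss7"] + beta * ch["Rrs7"]
    # S5to3: fold Ls5/Rs5 into the front pair (assumes L5 = L7, R5 = R7).
    L3 = ch["L"] + delta * Ls5
    R3 = ch["R"] + delta * Rs5
    # T4to2: fold the top back channels into the top front channels.
    Ltf2 = ch["Ltf4"] + gamma * ch["Ltb4"]
    Rtf2 = ch["Rtf4"] + gamma * ch["Rtb4"]
    # T2toTF2: add the weighted surround contribution to the top fronts.
    Ltf3 = Ltf2 + w * delta * Ls5
    Rtf3 = Rtf2 + w * delta * Rs5
    return {"L3": L3, "R3": R3, "C": ch["C"], "LFE": ch["LFE"],
            "Ltf3": Ltf3, "Rtf3": Rtf3}
```

The parameter values (alpha, beta, gamma, delta, w) come from the demixing parameter mode and are not fixed by this sketch.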

10.2.3. Channel Layout Generation Rule

This section describes the generation rule for channel layouts for scalable channel audio.

For a given channel layout (CL #n) of input Channel Audio, any list of CLs ({CL #i: i = 1, 2, ..., n}) for a scalable channel audio shall conform to the following rules:

Down-mix paths, which conform to the above rule, shall be only allowed for scalable channel audio with num_layers > 1 as depicted in below figure.

IA Down-mix Path

10.2.4. Recon Gain Generation

This section describes how to generate Recon_Gain.

Recon_Gain needs to be applied to de-mixed channels. For this, the IA encoder needs to deliver it to IA decoders.

Let us define the following:

If 10*log10(level Ok / maxL^2) is less than the first threshold value (e.g. -80dB), Recon_Gain (k, i) = 0. Where, maxL = 32767 for 16bits.

If 10*log10(level Ok / level Mk ) is less than the second threshold value (e.g. -6dB), Recon_Gain (k, i) is set to the value which makes level Ok = Recon_Gain (k, i)^2 x level Dk. Otherwise, Recon_Gain (k, i) = 1. Actual value to be delivered is floor(255*Recon_Gain).

For example, if we assume CL #i = 7.1.4ch and CL #i-1 = 5.1.2ch, then de-mixed channels are D_Lrs7, D_Rrs7, D_Ltb4 and D_Rtb4.
- D_Lrs7 and D_Rrs7 are de-mixed from Ls5 and Rs5 in the (i-1)th ChannelGroup by using Lss7 and Rss7 in the ith ChannelGroup and its relevant demixing parameters (i.e., α(k) and β(k)), respectively.
- D_Ltb4 and D_Rtb4 are de-mixed from Ltf2 and Rtf2 in the (i-1)th ChannelGroup by using Ltf4 and Rtf4 in the ith ChannelGroup and its relevant demixing parameter (i.e., γ(k)), respectively.

Recon_Gain for D_Lrs7:
- Level Ok is the signal power for the frame #k of Lrs7 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Ls5 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Lrs7.
Recon_Gain for D_Rrs7:
- Level Ok is the signal power for the frame #k of Rrs7 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Rs5 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Rrs7.
Recon_Gain for D_Ltb4:
- Level Ok is the signal power for the frame #k of Ltf4 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Ltf2 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Ltb4.
Recon_Gain for D_Rtb4:
- Level Ok is the signal power for the frame #k of Rtf4 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Rtf2 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Rtb4.
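The two threshold rules can be sketched as follows, with Level Ok, Level Mk and Level Dk passed in as signal powers for one frame. The sketch assumes that the second rule is only evaluated when the first one does not apply, and it clamps the gain to 1 so that the delivered value fits in 8 bits; both are assumptions, as is the handling of zero-valued levels. The function name and default thresholds (the example values from the text) are illustrative.

```python
import math


def recon_gain_value(level_o, level_m, level_d, max_l=32767,
                     first_threshold_db=-80.0, second_threshold_db=-6.0):
    """Return floor(255 * Recon_Gain) for one de-mixed channel and frame."""
    if level_o <= 0:
        return 0  # assumption: a silent original channel gets gain 0
    if 10 * math.log10(level_o / max_l ** 2) < first_threshold_db:
        gain = 0.0
    elif (level_m > 0 and level_d > 0
          and 10 * math.log10(level_o / level_m) < second_threshold_db):
        # Choose the gain so that level_o = gain^2 * level_d.
        gain = min(math.sqrt(level_o / level_d), 1.0)  # clamp is an assumption
    else:
        gain = 1.0
    return math.floor(255 * gain)
```

For example, with Level Ok = 1e8, Level Mk = 1e9 and Level Dk = 4e8 the ratio Ok/Mk is -10 dB, so the gain is sqrt(0.25) = 0.5 and the delivered value is 127.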

10.2.5. ChannelGroup Generation Rule

This section describes the generation rule for ChannelGroup.

For a given Channel Audio and the list of CLs ({CL #i: i = 1, 2, ..., n}), CG Generation module outputs the transformed audio (i.e. ChannelGroups) which shall conform to following rules:

The figure below shows an example of a transformation matrix with 4 CGs (2ch/3.1.2ch/5.1.2ch/7.1.4ch).

Example of Transformation Matrix with 4 CGs

10.2.6. Mix Presentation Encoding

//To Do: Fill in the text

10.2.6.1. Rendering Config

This section provides a guideline for generating rendering_config().

//To Do: Fill in how to generate rendering_config() for scene-based audio element

//To Do: Fill in how to generate rendering_config() for channel-based audio element

10.2.6.2. Element Mix Config

This section provides a guideline for generating element_mix_config().

//To Do: Fill in how to generate element_mix_config() for scene-based audio element

//To Do: Fill in how to generate element_mix_config() for channel-based audio element

10.2.7. Multiple Audio Elements Encoding

This section provides a guideline for generating an IA bitstream that has multiple audio elements.

10.2.7.1. Multiple Audio Elements with One Codec Config

This section describes how to generate an IA bitstream that has multiple audio elements sharing the same codec config OBU. The result shall comply with the base profile of the IA bitstream.

Step1: Descriptor OBUs are generated as follows:

Step2: ith Frame is generated as follows:

Step3: Generate the IA bitstream, which starts with the descriptor OBUs, followed by the temporal units in order.
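
The ordering in Step3 can be sketched as below. This is only an illustration under stated assumptions: each OBU is treated as an already-serialized, opaque byte string, each temporal unit is represented as a list of such OBUs, and the function name assemble_ia_bitstream is hypothetical rather than part of this specification.

```python
from typing import Iterable, Sequence


def assemble_ia_bitstream(descriptor_obus: Sequence[bytes],
                          temporal_units: Iterable[Sequence[bytes]]) -> bytes:
    """Concatenate descriptor OBUs first, then the temporal units in order.

    Assumption: every OBU is already serialized; this sketch does not
    build or validate OBU headers or payloads.
    """
    out = bytearray()
    # The bitstream starts with the descriptor OBUs.
    for obu in descriptor_obus:
        out += obu
    # The temporal units follow, in their given order.
    for unit in temporal_units:
        for obu in unit:
            out += obu
    return bytes(out)
```

Keeping the temporal units as separate lists until the final concatenation makes it easy to check that no data OBU is written ahead of the descriptor OBUs.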

10.2.7.2. Multiple Audio Elements with Multiple Codec Config

Step1: Descriptor OBUs are generated as follows:

Step2: Data OBUs are generated as follows:

Step3: Generate the IA bitstream, which starts with the descriptor OBUs, followed by the Temporal Units in order.

10.2.8. Post Processing

This section provides a guideline to generate algorithms for post processing.

10.2.8.1. Loudness Config

This section provides a guideline for generating loudness_config().

//To Do: Fill in how to generate loudness_config()

10.2.8.2. DRC Config

This section provides a guideline for generating drc_config().

//To Do: Fill in how to generate drc_config()

11. Consumption of IAC bitstream

TODO. Fill in example workflows.

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

References

Normative References

[AAC]
Information technology — Generic coding of moving pictures and associated audio information — Part 7: Advanced Audio Coding (AAC). Standard. URL: https://www.iso.org/standard/43345.html
[BCP47]
BCP 47. Best Practice. URL: https://www.rfc-editor.org/info/bcp47
[FLAC]
Free Lossless Audio Codec. Best Practice. URL: https://xiph.org/flac/format.html
[ISOBMFF]
Information technology — Coding of audio-visual objects — Part 12: ISO Base Media File Format. December 2015. International Standard. URL: http://standards.iso.org/ittf/PubliclyAvailableStandards/c068960_ISO_IEC_14496-12_2015.zip
[ITU1770-4]
Algorithms to measure audio programme loudness and true-peak audio level. Standard. URL: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1770-4-201510-I!!PDF-E.pdf
[ITU2051-3]
Advanced sound system for programme production. Standard. URL: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2051-3-202205-I!!PDF-E.pdf
[LEB128]
Little Endian Base 128. Best Practice. URL: https://en.wikipedia.org/wiki/LEB128
[MP4]
Information technology — Coding of audio-visual objects — Part 14: MP4 file format. January 2020. Published. URL: https://www.iso.org/standard/79110.html
[MP4-Audio]
Information technology — Coding of audio-visual objects — Part 3: Audio. Standard. URL: https://www.iso.org/standard/76383.html
[MP4-Systems]
Information technology — Coding of audio-visual objects — Part 1: Systems. Standard. URL: https://www.iso.org/standard/55688.html
[OPUS-IN-ISOBMFF]
Encapsulation of Opus in ISO Base Media File Format. Best Practice. URL: https://opus-codec.org/docs/opus_in_isobmff.html
[Q-Format]
Q (number format). Best Practice. URL: https://en.wikipedia.org/wiki/Q_(number_format)
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119
[RFC6381]
R. Gellens; D. Singer; P. Frojdh. The 'Codecs' and 'Profiles' Parameters for "Bucket" Media Types. August 2011. Proposed Standard. URL: https://www.rfc-editor.org/rfc/rfc6381
[RFC6716]
JM. Valin; K. Vos; T. Terriberry. Definition of the Opus Audio Codec. September 2012. Proposed Standard. URL: https://www.rfc-editor.org/rfc/rfc6716
[RFC7845]
T. Terriberry; R. Lee; R. Giles. Ogg Encapsulation for the Opus Audio Codec. April 2016. Proposed Standard. URL: https://www.rfc-editor.org/rfc/rfc7845
[RFC8486]
J. Skoglund; M. Graczyk. Ambisonics in an Ogg Opus Container. October 2018. Proposed Standard. URL: https://www.rfc-editor.org/rfc/rfc8486

Informative References

[AI-CAD-Mixing]
AI 3D immersive audio codec based on content-adaptive dynamic down-mixing and up-mixing framework. Paper. URL: https://www.aes.org/e-lib/browse.cfm?elib=21489