AirTunes v2 UDP streaming protocol

The information in this document was obtained by reverse-engineering the protocol from network dumps captured on a Mac OS X host running iTunes, supplemented by trial and error.

Some things described hereafter may be slightly inaccurate, and some details remain unknown at the time of writing. If you happen to uncover some of these unknowns, please let me know and I’ll update this document with your findings.

Please maintain proper attribution on all copies, partial or otherwise, of this document.

First published: 2010-05-02.

Revised:

  • 2010-07-30: reports of AppleTV using unencrypted audio packets

Julien BLACHE


Overview

AirTunes v2 is a UDP-based streaming protocol with built-in synchronization mechanisms to ensure playback on all ApEx devices and iTunes itself remains as closely synchronized as possible.

A dialect of RTSP is still used as the overall control protocol; not much has changed here compared to AirTunes v1.

Synchronization between iTunes and the ApEx devices is maintained using a shared clock, iTunes being the master and the devices being slaves. Devices periodically resync their clock to the master.

In addition, iTunes periodically sends a control message to the devices, all at the same time, indicating the current position, the current time on the master clock and the position of the next audio packet being sent to them.

All communication happens through UDP, on 3 distinct ports:

  • server_port:   audio stream
  • control_port:  playback sync
  • timing_port:   clock sync

Port numbers can be different on the host and on the devices; both sides advertise their port numbers during the RTSP SETUP exchange (the host doesn’t advertise server_port, it’s of no interest to the device). On the device side, server_port is 6000, control_port is 6001 and timing_port is 6002. On the host side, iTunes uses a randomly allocated port for server_port and uses 6001 for control_port and 6002 for timing_port just like the devices.

Data is big endian.

Time synchronization

About every 2 seconds, the devices will query the master clock to sync up to it.

The master clock is a monotonic clock maintained by iTunes; it is not the wall clock. Chances are it’s a monotonic system clock, like the one available through clock_gettime(CLOCK_MONOTONIC, …).

On the device side, similarly, the clock is maintained by the AirTunes software and is not related to the system clock or wall time.

Timestamps exchanged are 64-bit fixed-point values, with 32 bits for the integer part and 32 bits for the fraction. They’re actually NTP timestamps, as described in RFC 1305. Despite not being used to exchange wall time, they include the NTP Epoch to Unix Epoch delta, so when the master clock reads 0 seconds, the timestamp reads 2208988800.
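The timestamp format described above can be illustrated with a minimal Python sketch. The function names and the NTP_EPOCH_DELTA constant name are made up for this example; only the 32.32 fixed-point layout, the big-endian byte order and the 2208988800 delta come from the observations above. A reading from something like time.monotonic() could serve as the clock value, but that is an assumption about a host implementation, not something seen on the wire.

```python
import struct

# Seconds between the NTP epoch (1900) and the Unix epoch (1970).
NTP_EPOCH_DELTA = 2208988800


def to_ntp_timestamp(clock_seconds: float) -> bytes:
    """Encode a clock reading (in seconds) as 8 big-endian bytes, 32.32 fixed point."""
    whole = int(clock_seconds)
    # Clamp the fraction so floating-point rounding can never overflow 32 bits.
    frac = min(int((clock_seconds - whole) * (1 << 32)), 0xFFFFFFFF)
    return struct.pack(">II", (whole + NTP_EPOCH_DELTA) & 0xFFFFFFFF, frac)


def from_ntp_timestamp(raw: bytes) -> float:
    """Decode 8 big-endian bytes back into a clock reading in seconds."""
    secs, frac = struct.unpack(">II", raw)
    return (secs - NTP_EPOCH_DELTA) + frac / (1 << 32)
```

Encoding a clock reading of 0 seconds yields an integer part of 0x83aa7e80, which is 2208988800.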

Time sync packets are 32 bytes long and exchanged to and from timing_port on both ends. The example below is the first time sync exchange between an ApEx device that has just booted up and the iTunes host that has been up for a couple of hours; note the delta between their clocks.

Time sync query

0000-0007: 0x80 0xd2 0x00 0x07  0x00 0x00 0x00 0x00
0008-000f: 0x00 0x00 0x00 0x00  0x00 0x00 0x00 0x00
0010-0017: 0x00 0x00 0x00 0x00  0x00 0x00 0x00 0x00
0018-001f: 0x83 0xaa 0x7e 0x80  0xa9 0x85 0x61 0x56
  • 0000-0001: packet type (0x80 0xd2)
  • 0002-0003: always 0x00 0x07
  • 0004-0007: all 0
  • 0008-000f: all 0 (see Time sync reply below)
  • 0010-0017: all 0 (see Time sync reply below)
  • 0018-001f: device transmit timestamp
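The query layout above maps directly onto a fixed 32-byte structure. The following Python sketch builds one; the function name is hypothetical, and the transmit timestamp is passed in as 8 raw bytes already in the NTP format described earlier.

```python
import struct


def build_time_sync_query(transmit_ts: bytes) -> bytes:
    """Build a 32-byte time sync query around an 8-byte device transmit timestamp."""
    if len(transmit_ts) != 8:
        raise ValueError("timestamp must be 8 bytes")
    # Layout: type (2) | 0x0007 (2) | zeros (4) | zeros (8) | zeros (8) | timestamp (8)
    return struct.pack(">2sH4x8x8x8s", b"\x80\xd2", 0x0007, transmit_ts)
```

Feeding it the timestamp from the dump above (0x83 0xaa 0x7e 0x80 0xa9 0x85 0x61 0x56) reproduces the sample query byte for byte.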

Time sync reply

0000-0007: 0x80 0xd3 0x00 0x07  0x00 0x00 0x00 0x00
0008-000f: 0x83 0xaa 0x7e 0x80  0xa9 0x85 0x61 0x56
0010-0017: 0x83 0xaa 0x93 0x1f  0xe9 0x7e 0x4a 0x19
0018-001f: 0x83 0xaa 0x93 0x1f  0xe9 0x7f 0x93 0x9a
  • 0000-0001: packet type (0x80 0xd3)
  • 0002-0003: always 0x00 0x07
  • 0004-0007: all 0
  • 0008-000f: device transmit timestamp (copied from query bytes 0018-001f, see above)
  • 0010-0017: master receive timestamp
  • 0018-001f: master transmit timestamp
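As an illustration, here is a Python sketch that parses a reply and derives a clock offset from it. The parsing follows the layout above. The offset calculation is the standard NTP formula from RFC 1305; whether the devices actually compute their offset this way is an assumption, and the function names and the arrival parameter (the device clock reading when the reply came in) are hypothetical. Timestamps are handled as 64-bit integers in 32.32 fixed point.

```python
import struct

NTP_FRAC = 1 << 32  # one second in 32.32 fixed-point units


def parse_time_sync_reply(packet: bytes):
    """Return (device transmit, master receive, master transmit) as 64-bit integers."""
    if len(packet) != 32 or packet[:2] != b"\x80\xd3":
        raise ValueError("not a time sync reply")
    # Skip the 8-byte header, then read three big-endian 64-bit timestamps.
    return struct.unpack(">8xQQQ", packet)


def clock_offset(origin: int, receive: int, transmit: int, arrival: int) -> float:
    """Standard NTP offset in seconds: ((T2 - T1) + (T3 - T4)) / 2.

    Assumption: this is the textbook calculation, not an observed device behaviour.
    """
    return ((receive - origin) + (transmit - arrival)) / 2 / NTP_FRAC
```

With the sample reply above and an arrival time a couple of milliseconds after the query was sent, the offset comes out a little above 5279 seconds, consistent with a freshly booted device talking to a host that has been up for well over an hour.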

Audio stream

Audio is streamed on the server_port in packets of 352 (0x160) samples, with a 12-byte header. At 44.1 kHz, packet duration is about 7.982 ms.

Packets are sent out regularly at this interval, to all devices at the same time. All devices receive the exact same packet. iTunes uses a separate socket for each client for the audio stream.

The audio data itself is AES-encrypted in 16-byte blocks, while the 12-byte header is left unencrypted.

For compressed ALAC audio, the total packet length varies depending on the compression ratio.

In the uncompressed case, the total packet length is 12 + 3 + 1408 bytes, corresponding to the AirTunes v2 packet header, ALAC header and audio data.
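The figures above can be checked with a few lines of Python. The stereo, 16-bit sample format is an assumption made here because it is what makes 352 samples come out at 1408 bytes; the constant names are illustrative only.

```python
SAMPLES_PER_PACKET = 352
SAMPLE_RATE = 44100       # Hz
CHANNELS = 2              # assumption: stereo
BYTES_PER_SAMPLE = 2      # assumption: 16-bit samples

PACKET_HEADER_LEN = 12    # AirTunes v2 packet header
ALAC_HEADER_LEN = 3       # ALAC header in the uncompressed case

# Payload size for one packet of uncompressed audio.
audio_len = SAMPLES_PER_PACKET * CHANNELS * BYTES_PER_SAMPLE

# Total size on the wire for the uncompressed case.
packet_len = PACKET_HEADER_LEN + ALAC_HEADER_LEN + audio_len

# Playback time covered by one packet, in milliseconds.
packet_ms = SAMPLES_PER_PACKET / SAMPLE_RATE * 1000

print(audio_len, packet_len, round(packet_ms, 3))
```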

Note that the AppleTV uses unencrypted audio packets, contrary to the AirPort Express.

Audio packet

0000-0007: 0x80 0x60 0xb5 0xe3  0xbd 0x81 0xd7 0x1c
0008-000f: 0x30 0x9f 0xdc 0x88  [Audio data starts]
...
  • 0000-0001: packet type; 0x80 0x60 during streaming, 0x80 0xe0 on first packet
  • 0002-0003: RTP sequence number
  • 0004-0007: RTP time
  • 0008-000b: unknown; constant throughout the session, same for all devices
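A small Python sketch of a parser for the header laid out above. The AudioHeader type, its field names and the function name are invented for this example; the unknown 4-byte field is kept as raw bytes since its meaning has not been determined.

```python
import struct
from typing import NamedTuple


class AudioHeader(NamedTuple):
    first: bool      # True for 0x80 0xe0 (first packet), False for 0x80 0x60
    seq: int         # RTP sequence number
    rtptime: int     # RTP timestamp
    unknown: bytes   # bytes 0008-000b, meaning unknown


def parse_audio_header(packet: bytes) -> AudioHeader:
    """Parse the 12-byte unencrypted header of an audio packet."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the 12-byte header")
    ptype, seq, rtptime, unknown = struct.unpack(">2sHI4s", packet[:12])
    if ptype not in (b"\x80\x60", b"\x80\xe0"):
        raise ValueError("not an audio packet")
    return AudioHeader(ptype == b"\x80\xe0", seq, rtptime, unknown)
```

Applied to the sample packet, this gives sequence number 46563 and RTP timestamp 3179403036.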

The initial RTP sequence number and RTP timestamp are sent to the device in the RTP-Info field of the RTSP RECORD request:

RTP-Info: seq=46562;rtptime=3179402684

The first packet in the stream bears the RTP sequence number and RTP timestamp given in the RTP-Info field. Subsequent packets increment the RTP sequence number by one and the RTP timestamp by 352 (the number of samples contained in one packet) at every packet.
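The progression described above can be sketched in Python as follows. The function names are hypothetical. The wrap-around at 16 and 32 bits is an assumption based on the field widths and on usual RTP practice; it has not been observed in the dumps.

```python
SAMPLES_PER_PACKET = 352


def parse_rtp_info(value: str):
    """Parse an RTP-Info value such as 'seq=46562;rtptime=3179402684'."""
    fields = dict(item.split("=", 1) for item in value.split(";"))
    return int(fields["seq"]), int(fields["rtptime"])


def packet_counters(seq: int, rtptime: int):
    """Yield (sequence number, RTP timestamp) for each successive audio packet."""
    while True:
        yield seq, rtptime
        seq = (seq + 1) & 0xFFFF                                 # assumed 16-bit wrap
        rtptime = (rtptime + SAMPLES_PER_PACKET) & 0xFFFFFFFF    # assumed 32-bit wrap
```

Starting from the RTP-Info example above, the second pair generated is (46563, 3179403036), which matches the header of the sample audio packet shown earlier.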

Playback synchronization

A playback synchronization packet is sent by the host to the devices, all at the same time, after every 126 audio packets, and immediately precedes the next audio packet. The packet is the same for all devices.

126×352 = 44352, so playback sync packets are effectively sent out roughly every second for 44.1 kHz audio.

Playback sync packets are 20 bytes long and exchanged to and from control_port on both ends.

Playback sync packet

0000-0007: 0x80 0xd4 0x00 0x07  0xbd 0x81 0x2a 0x74
0008-000f: 0x83 0xaa 0x93 0x20  0xed 0x82 0x72 0xa5
0010-0013: 0xbd 0x82 0x84 0x5c
  • 0000-0001: packet type; 0x80 0xd4 during streaming, 0x90 0xd4 on first sync packet
  • 0002-0003: always 0x00 0x07
  • 0004-0007: current RTP timestamp (playback position)
  • 0008-000f: current time
  • 0010-0013: next packet RTP timestamp
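The layout above can be decoded with a short Python sketch. The function name and the dictionary keys are made up for this example.

```python
import struct


def parse_playback_sync(packet: bytes) -> dict:
    """Decode a 20-byte playback sync packet into its fields."""
    if len(packet) != 20:
        raise ValueError("playback sync packets are 20 bytes long")
    ptype, marker, current, ntp_secs, ntp_frac, next_ts = struct.unpack(">2sHIIII", packet)
    if ptype not in (b"\x80\xd4", b"\x90\xd4"):
        raise ValueError("not a playback sync packet")
    return {
        "first": ptype == b"\x90\xd4",   # 0x90 0xd4 marks the first sync packet
        "marker": marker,                # always 0x0007 in the dumps
        "current_rtptime": current,      # playback position
        "ntp_secs": ntp_secs,            # current time, integer part
        "ntp_frac": ntp_frac,            # current time, fractional part
        "next_rtptime": next_ts,         # RTP timestamp of the next audio packet
    }
```

In the sample packet, the next packet RTP timestamp is 88552 samples ahead of the current position, that is 88200 + 352.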

The audio packet immediately following the playback sync packet bears the RTP timestamp given in bytes 0010-0013.

Unknown control packet

I’ve observed the following control packet while sending bad audio packets to an ApEx device. I’m not sure what the message means: the ApEx may be asking for a packet retransmission, merely informing the host that something isn’t right with the audio stream, or something else entirely.

I’ve not been able to reproduce this with iTunes yet, so what reply or action is needed on the host side is unknown at the moment.

The packet is sent by the ApEx from its control port to the host’s control port; it is 18 bytes long.

0000-0007: 0x80 0xd5 0x00 0x01  0x01 0xbc 0x00 0x01
0008-000f: 0x00 0x00 0x00 0x00  0x00 0x00 0x00 0x00
0010-0011: 0x00 0x00
  • 0000-0001: packet type (0x80 0xd5)
  • 0002-0003: observed 0x00 0x01
  • 0004-0005: RTP sequence number of a previous audio packet
  • 0006-0007: observed 0x00 0x01
  • 0008-0011: all 0

Given the null bytes, it probably calls for a reply (similar to time sync requests).
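For anyone wanting to log these packets while investigating, here is a minimal Python sketch that pulls out the fields listed above. The function name and the field_a / field_b key names are placeholders, since the meaning of those two fields is unknown.

```python
import struct


def parse_unknown_control(packet: bytes) -> dict:
    """Extract the known fields from the 18-byte 0x80 0xd5 control packet."""
    if len(packet) != 18 or packet[:2] != b"\x80\xd5":
        raise ValueError("not a 0x80 0xd5 control packet")
    # Skip the type (2 bytes), read three 16-bit fields, ignore the 10 trailing zeros.
    field_a, seq, field_b = struct.unpack(">2xHHH10x", packet)
    return {"field_a": field_a, "seq": seq, "field_b": field_b}
```

For the sample packet this reports sequence number 444 (0x01bc), with both unidentified fields equal to 1.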

Playback startup sequence

This section documents the playback startup sequence, starting with the RTSP RECORD request that initiates the streaming session.

1. RTSP RECORD request

The host sends the RTSP RECORD requests to start the streaming sessions.

2. Time synchronization

The devices perform 3 time synchronizations back to back after they receive the RTSP RECORD request from the host.

3. RTSP RECORD reply

The devices reply to the RTSP RECORD request, including the Audio-Latency header.

4. Playback synchronization

The host sends out a first playback synchronization packet (header 0x90 0xd4).

In this packet, the next packet RTP timestamp is equal to the RTP timestamp given in the RECORD request. The current RTP timestamp (playback position) is equal to the RTP timestamp given in the RECORD request minus 88200.

That leaves us with a 2-second buffer in the ApEx devices for an audio stream at 44.1 kHz.

5. Audio stream

The host starts the audio stream. The first packet RTP sequence number and RTP timestamp are equal to those given in the RECORD request.

For an ALAC stream, the header on the first packet is 0x80 0xe0.
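Step 4 of the startup sequence can be sketched in Python as follows. The function name and the BUFFER_SAMPLES constant name are hypothetical; the current master clock reading is passed in as 8 raw bytes in the NTP format described under Time synchronization. The 32-bit wrap on the subtraction is an assumption based on the field width.

```python
import struct

BUFFER_SAMPLES = 88200  # 2 seconds at 44.1 kHz


def build_first_sync(record_rtptime: int, ntp_now: bytes) -> bytes:
    """Build the first playback sync packet (header 0x90 0xd4) for a session."""
    if len(ntp_now) != 8:
        raise ValueError("timestamp must be 8 bytes")
    # Playback position trails the first audio packet by the buffer depth.
    current = (record_rtptime - BUFFER_SAMPLES) & 0xFFFFFFFF
    # Layout: type (2) | 0x0007 (2) | current (4) | time (8) | next rtptime (4)
    return struct.pack(">2sHI8sI", b"\x90\xd4", 0x0007, current, ntp_now, record_rtptime)
```

With the RTP-Info example given earlier (rtptime=3179402684), the current position in the first sync packet would be 3179314484.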

Playback shutdown sequence

Wait until playback reaches the last sample and send an RTSP TEARDOWN request to all devices.

Pausing playback, jumping forward/backward, next/previous file

When playback needs to be interrupted for pausing, seeking in the current file or jumping to another file, the audio stream is interrupted. Devices are then instructed to stop playing.

1. RTSP FLUSH request

The host sends RTSP FLUSH requests. The RTP-Info field sent with this request contains the RTP sequence number and RTP timestamp of the packet that will resume playback.

2. RTSP FLUSH reply

The devices reply to the request; the RTP-Info field contains the RTP timestamp of the last sample played by the device.

3. Resume audio stream

The stream resumes with an initial sync packet (0x90 0xd4), followed by the audio packet bearing the RTP sequence number and RTP timestamp given in the RTSP FLUSH request that interrupted the output. The current position given in the initial sync packet is the RTP timestamp of the first packet minus 88200.

See Playback startup sequence above for the details.

Note that the content of the devices’ buffers is lost. If pausing, playback needs to seek back by 2 seconds to pick up at the right place.

Adding a device to the set

To add a device to the set of active devices while streaming, a standard RTSP startup sequence is performed for this device.

In the RTSP RECORD request, the RTP sequence number and RTP timestamp values given to the device are the RTP sequence number and RTP timestamp current at the time the RTSP ANNOUNCE request is sent.

After the device has synced up to the master clock and replied to the RTSP RECORD request, it starts receiving the same audio stream all other devices receive.

There is no initial playback sync and no particular procedure to pre-fill the device’s internal buffer.

Removing a device from the set

To remove a device from the set while streaming, an RTSP TEARDOWN request is issued and the host stops sending audio data to the device.

There is no RTSP FLUSH request issued prior to the TEARDOWN in this case.