In the first post of this series, Networking Basics for WebRTC: Delivery and Addresses, we introduced networking protocols and ports, learned about LANs, WANs and NAT, and explained the difference between TCP and UDP. Today, we will talk about two key moments in WebRTC traffic: signaling and media exchange.
Memories of Young Love: Signaling
When I was young, before the social media and instant messaging revolution (I’m getting old, I know), the infallible method for talking to that brown-eyed girl I liked was exchanging notes. This usually happened through our mutual friend Sigmund.
“Tell her I can offer her ice cream,” I would tell Sigmund while also passing a note where I suggested the canteen as the candidate for our encounter. After some time he was back with her answer: “You’re lucky, I really love ice cream.”
And after that brief negotiation we finally met. A short, but passionate teenage love began. However, you’re not here to read about my dating history. You’re here to learn about WebRTC networking, right? Actually, I told you the story to help explain Signaling.
Offer and Answer Mechanisms
As in my story about the brown-eyed girl, when two peers want to connect through WebRTC, they first need to set up a negotiation through a “mutual friend.” In other words, they need a third node that is able to reach both of them, in the same way Sigmund was known by both of us.
Both peers will exchange information through this third node, a.k.a. the Signaling Server.
The peer who wants to initiate the connection uses the Session Description Protocol (SDP) to create an Offer. The Offer is sent through the Signaling channel.
Instead of an invitation to ice cream, the Offer in a WebRTC connection will include information about the supported audio and video codecs, ICE candidates (we will talk more about this in a second) and encryption details in a text-based format as defined by the SDP.
Once the other peer receives the Offer, it creates an Answer that contains its own information. The Answer is sent back to the first peer, again through the Signaling channel.
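To make the Offer a little less abstract, here is a minimal Python sketch that pulls the media sections and codec names out of an SDP text. The sample Offer is made up and heavily trimmed (the fingerprint value is a placeholder), and `summarize_sdp` is a hypothetical helper name, not part of any WebRTC API.

```python
# Hypothetical, heavily trimmed SDP Offer; a real one has many more lines.
SAMPLE_OFFER = """\
v=0
o=- 4611731400430051336 2 IN IP4 127.0.0.1
s=-
t=0 0
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=rtpmap:111 opus/48000/2
a=fingerprint:sha-256 AA:BB:CC
m=video 9 UDP/TLS/RTP/SAVPF 96
a=rtpmap:96 VP8/90000
"""


def summarize_sdp(sdp: str) -> dict[str, list[str]]:
    """Map each media kind (audio, video, ...) to the codec names it lists."""
    codecs: dict[str, list[str]] = {}
    current = None
    for line in sdp.splitlines():
        if line.startswith("m="):
            # "m=<media> <port> <proto> <payload types...>"
            current = line[2:].split()[0]
            codecs[current] = []
        elif line.startswith("a=rtpmap:") and current is not None:
            # "a=rtpmap:<payload type> <codec>/<clock rate>[/<channels>]"
            encoding = line.split(" ", 1)[1]
            codecs[current].append(encoding.split("/")[0])
    return codecs


print(summarize_sdp(SAMPLE_OFFER))  # {'audio': ['opus'], 'video': ['VP8']}
```

The Answer has the same shape, so the same helper could be pointed at it to see which codecs the other peer accepted.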
WebRTC doesn’t define, nor enforce, a signaling mechanism. Any technology that allows exchanging information can be used, provided that such a channel is accessible to both parties.
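Because the signaling mechanism is left open, even a toy relay is enough to illustrate the idea. The sketch below is an in-memory stand-in for Sigmund: it forwards messages between two named peers and never looks inside them. The class name, the peer names and the JSON message shape are all assumptions made for this example; a real deployment would typically use something like WebSockets or HTTP.

```python
import json
from collections import defaultdict, deque
from typing import Optional


class SignalingServer:
    """Toy in-memory relay: it only forwards messages, it never inspects SDP."""

    def __init__(self) -> None:
        # One FIFO inbox per peer name.
        self._inboxes = defaultdict(deque)

    def send(self, sender: str, recipient: str, kind: str, payload: str) -> None:
        message = json.dumps({"from": sender, "type": kind, "payload": payload})
        self._inboxes[recipient].append(message)

    def receive(self, peer: str) -> Optional[dict]:
        inbox = self._inboxes[peer]
        return json.loads(inbox.popleft()) if inbox else None


sigmund = SignalingServer()
sigmund.send("alice", "bob", "offer", "v=0 ...offer SDP...")
offer = sigmund.receive("bob")
sigmund.send("bob", offer["from"], "answer", "v=0 ...answer SDP...")
answer = sigmund.receive("alice")
print(offer["type"], "->", answer["type"])  # offer -> answer
```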
After both peers have all the required information, they can establish a connection and media traffic begins to flow. Before getting into the details of media traffic, let’s also explain an important topic for Network Address Translation (NAT) traversal: the Interactive Connectivity Establishment (ICE) framework.
Traversing NATs Using ICE
In a typical real-world scenario, peers connecting through WebRTC live on different Local Area Networks (LANs). The connection takes place across a collection of LANs called a Wide Area Network (WAN). This requires media traffic to go through one or more NAT devices, which is what we call NAT traversal.
As we discussed in the previous post, NAT is what lets devices with private addresses reach the internet. At the same time, NAT represents a major challenge when it comes to connecting two peers directly. ICE solves that problem by enabling peers to gather and exchange candidates.
In the same way that I suggested the canteen for my encounter with the brown-eyed girl, a candidate lets the other peer know “where they can meet.” More precisely, it provides an IP address and a port where a peer can be reached.
These candidates are added to Offers and Answers. Then, each peer’s ICE agent pairs its local candidates with the remote ones and runs connectivity checks on the resulting candidate pairs in order to determine the best route and establish the connection.
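Inside the SDP, each ICE candidate travels as a single line of text. The sketch below parses the leading fields of such a line in Python. The example line is made up, and `IceCandidate` and `parse_candidate` are hypothetical names chosen for this post; optional trailing fields are simply ignored.

```python
from dataclasses import dataclass


@dataclass
class IceCandidate:
    foundation: str
    component: int   # 1 = RTP, 2 = RTCP
    transport: str   # "udp" or "tcp"
    priority: int
    address: str
    port: int
    kind: str        # candidate type, e.g. "host"


def parse_candidate(line: str) -> IceCandidate:
    """Parse the fixed leading fields of an SDP 'candidate:' attribute."""
    fields = line.removeprefix("a=").removeprefix("candidate:").split()
    if len(fields) < 8:
        raise ValueError("candidate line is too short")
    foundation, component, transport, priority, address, port, typ, kind = fields[:8]
    if typ != "typ":
        raise ValueError(f"unexpected token {typ!r}, expected 'typ'")
    return IceCandidate(foundation, int(component), transport.lower(),
                        int(priority), address, int(port), kind)


# Made-up example line describing an address on a private LAN.
line = "a=candidate:1 1 udp 2130706431 192.168.1.10 54321 typ host"
candidate = parse_candidate(line)
print(candidate.address, candidate.port, candidate.kind)  # 192.168.1.10 54321 host
```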
ICE starts by gathering host candidates. These use the IP address assigned to the network interface of the user’s device, which is usually a private address that only makes sense within the same LAN. Host candidates are therefore mostly useful for connections inside a LAN.
Next, ICE looks for the public IP address and port that the NAT device assigned to the peer for reaching the internet. This is done by making a request to an external server known as a STUN server.
STUN stands for Session Traversal Utilities for NAT. It’s a protocol that lets a peer learn its public address. The peer sends a STUN request to a STUN server, and the server replies with the public IP address and port the request came from.
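On the wire, this exchange is small. The following Python sketch builds a STUN Binding request and decodes the XOR-MAPPED-ADDRESS attribute of a success response, following the message layout in RFC 5389. To stay runnable offline, it fabricates the server reply instead of contacting a real server; the address 203.0.113.7:46154 is a made-up documentation address, the function names are hypothetical, and only IPv4 is handled.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442       # fixed value defined by the STUN spec
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id: bytes) -> bytes:
    """20-byte STUN header, no attributes: type, length, cookie, transaction ID."""
    return struct.pack("!HHI12s", BINDING_REQUEST, 0, MAGIC_COOKIE, transaction_id)


def parse_xor_mapped_address(response: bytes) -> tuple[str, int]:
    """Return (public IP, public port) from a Binding success response (IPv4 only)."""
    msg_type, length, cookie = struct.unpack_from("!HHI", response, 0)
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN Binding success response")
    offset, end = 20, 20 + length
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack_from("!HH", response, offset)
        value = response[offset + 4: offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and value[1] == 0x01:
            # Port and address are XOR-ed with the magic cookie on the wire.
            port = struct.unpack("!H", value[2:4])[0] ^ (MAGIC_COOKIE >> 16)
            addr = struct.unpack("!I", value[4:8])[0] ^ MAGIC_COOKIE
            return socket.inet_ntoa(struct.pack("!I", addr)), port
        offset += 4 + attr_len + (-attr_len % 4)   # attributes are padded to 4 bytes
    raise ValueError("no IPv4 XOR-MAPPED-ADDRESS attribute found")


# Offline demo: fake the reply a server would send for a NAT mapping of 203.0.113.7:46154.
txid = os.urandom(12)
request = build_binding_request(txid)
xport = 46154 ^ (MAGIC_COOKIE >> 16)
xaddr = struct.unpack("!I", socket.inet_aton("203.0.113.7"))[0] ^ MAGIC_COOKIE
attribute = struct.pack("!HHBBHI", XOR_MAPPED_ADDRESS, 8, 0, 0x01, xport, xaddr)
response = struct.pack("!HHI12s", BINDING_SUCCESS, len(attribute), MAGIC_COOKIE, txid) + attribute
print(parse_xor_mapped_address(response))  # ('203.0.113.7', 46154)
```

In a real client, `request` would be sent over a UDP socket to the STUN server and `response` would be whatever comes back on that socket.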
Candidates can be gathered over both UDP and TCP, although UDP is preferred. A candidate learned from a STUN server is of type server reflexive (srflx). A closely related type, peer reflexive (prflx), is discovered later, during the connectivity checks, when a check arrives from an address that was not signaled in advance.
There is a third type of ICE candidate that is used when a direct connection is not possible, for instance when the type of NAT doesn’t allow it or when there are firewall restrictions. These are the scenarios where relay candidates are useful.
A relay candidate is an IP address and port allocated on a Traversal Using Relays around NAT (TURN) server. The TURN server relays the media traffic from one peer to the other.
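How does ICE decide which kind of candidate to try first? Each candidate carries a priority, and the ICE specification (RFC 8445) recommends a formula in which the candidate type dominates. The Python sketch below applies that formula with the type preferences recommended in the RFC; the default local preference of 65535 is an assumption that fits a device with a single network interface.

```python
# Type preferences recommended by the ICE spec (RFC 8445); higher wins.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(kind: str, local_preference: int = 65535, component: int = 1) -> int:
    """priority = 2^24 * type preference + 2^8 * local preference + (256 - component ID)."""
    return (TYPE_PREFERENCE[kind] << 24) + (local_preference << 8) + (256 - component)


ranked = sorted(TYPE_PREFERENCE, key=candidate_priority, reverse=True)
print(ranked)                      # ['host', 'prflx', 'srflx', 'relay']
print(candidate_priority("host"))  # 2130706431
```

Relay candidates end up at the bottom of the list, which matches their role as a fallback: relaying through TURN adds latency and server cost, so it is tried only after the direct options.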
ICE gathering and connectivity checks take some time, which leaves the two peers waiting for a connection to be established. Fortunately, there is a faster approach known as Trickle ICE.
Love in the Time of Instant Messaging: Trickle ICE
Sigmund is still popular with people and even now I rely on him to find my Fermina. However, times have changed and now there are better ways to communicate. While I could still ask him to give me a hand setting a place to meet with the people he introduced me to, I can now propose candidate places independently via instant messaging without waiting on him.
Similarly, instead of waiting to collect all of the ICE candidates before adding them to the SDP Offer and Answer, we can trickle them through the signaling channel as they become available. This reduces the time peers spend waiting to connect.
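The difference is easiest to see on a timeline. The Python sketch below compares when signaling messages would leave the peer with and without trickling. The gathering times are invented for illustration (real values depend on the network and on the STUN and TURN servers), and both function names are hypothetical.

```python
# Hypothetical gathering times in milliseconds; real values depend on the network.
GATHERING_MS = {"host": 0, "srflx": 150, "relay": 400}


def vanilla_timeline(gathering: dict[str, int]) -> list[tuple[int, str]]:
    """Wait for every candidate, then send one Offer that contains them all."""
    done = max(gathering.values())
    return [(done, "offer with " + ", ".join(gathering))]


def trickle_timeline(gathering: dict[str, int]) -> list[tuple[int, str]]:
    """Send the Offer right away, then each candidate as soon as it is found."""
    events = [(0, "offer without candidates")]
    in_order = sorted(gathering.items(), key=lambda item: item[1])
    events += [(ms, f"candidate {kind}") for kind, ms in in_order]
    return events


for label, timeline in (("vanilla", vanilla_timeline), ("trickle", trickle_timeline)):
    for ms, message in timeline(GATHERING_MS):
        print(f"{label:8}{ms:>4} ms  {message}")
```

With these made-up numbers, the non-trickled Offer cannot leave before 400 ms, while the trickled one leaves immediately and the other peer can begin connectivity checks on the host candidate while the remaining candidates are still being gathered.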
Enough Negotiation, It’s Media Time: Media & Data Exchange
Once signaling is over and both peers have enough information to connect directly (or relay traffic via TURN), they can start sending media and data packets. WebRTC mandates that media traffic be encrypted. To do so, it uses DTLS.
DTLS (Datagram Transport Layer Security) is a protocol that secures traffic over UDP, in the same way that TLS (Transport Layer Security) does for TCP connections.
Having DTLS in the mix ensures that data is not eavesdropped on or tampered with by an unauthorized party.
The audio and video data is transmitted using Secure RTP (SRTP), the secured version of the Real-time Transport Protocol (RTP). RTP is the protocol used to transmit media in real time, and it is widely used in communication systems such as VoIP. SRTP relies on DTLS for key management: the encryption keys are derived from the DTLS handshake.
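Every RTP packet, secured or not, starts with a small fixed header that tells the receiver how to order and play the media. SRTP encrypts the payload but leaves this header readable. The Python sketch below decodes the 12-byte fixed header defined in RFC 3550; the packet is fabricated for the example, its payload is just placeholder bytes, and `parse_rtp_header` is a hypothetical helper name.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict[str, int]:
    """Decode the 12-byte fixed RTP header (RFC 3550)."""
    if len(packet) < 12:
        raise ValueError("RTP header needs at least 12 bytes")
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": first >> 6,           # always 2 for RTP
        "marker": second >> 7,
        "payload_type": second & 0x7F,   # identifies the codec negotiated in the SDP
        "sequence": sequence,            # lets the receiver detect loss and reordering
        "timestamp": timestamp,          # media clock used for playout timing
        "ssrc": ssrc,                    # identifies the media source
    }


# Made-up packet: version 2, payload type 111, sequence 1, timestamp 960.
packet = struct.pack("!BBHII", 0x80, 111, 1, 960, 0x1234ABCD) + b"placeholder payload"
header = parse_rtp_header(packet)
print(header["version"], header["payload_type"], header["sequence"])  # 2 111 1
```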
A sometimes overlooked but great WebRTC capability is data channels. Data channels allow sending arbitrary data between peers. The protocol used for this is the Stream Control Transmission Protocol (SCTP), which runs on top of DTLS for security, which in turn runs over UDP.
In this post, we covered the most important protocols that are used throughout the process of establishing a WebRTC connection, from formatting the Offer & Answer and gathering and checking ICE candidates to optimizing ICE and sending voice, video and data packets. The next step is to see everything in practice. Stay tuned for the next part.
If the theory is enough for you and you want to rely on the experts for putting it into practice, contact us and take advantage of WebRTC capabilities for your specific use case. Send us your inquiry!
Posts in this series: