Speeding up HTTPS API calls with SNI

A while back Google Chrome started to treat all non-HTTPS websites as unsafe. Well, they are unsafe. Other browsers joined in. Let's Encrypt gained popularity. And everything on the web was now on HTTPS.

API calls between HTTP-based services were also moved over to HTTPS. Granted, most of the popular services have always been on HTTPS. Easiest upgrade ever. Just add an "s" to some URL somewhere and be done. Not so fast!

This got me thinking. The Engineering Triangle[1] concept surely applies here. We are giving something up for all this security, right? Usually, we trade convenience for security but in this case we are trading performance. Read on and I will explain how. And then show you how to minimise this loss.

HTTP vs HTTPS

Before we proceed I will assume you at least know that HTTPS is a secure form of HTTP and that this security is provided by an SSL/TLS[2] layer underneath HTTP. The details of how these protocols work are beyond the scope of this blog post. The thing I want to highlight is that TLS is the successor to SSL in providing the "Secure" in HTTPS. Henceforth I will be discussing TLS. When given a choice between HTTP and HTTPS always go with HTTPS. A safer web for everyone, yay!

TLS

In a client-server scheme, the oversimplified explanation is that TLS gives the client and the server certificates to identify themselves. The auth process can be one way, where only the client verifies the server, or two way, where the server also verifies the client. In most schemes on the web we go with one way. When you connect to the Facebook graph API your client verifies that you are connected to the real Facebook but Facebook does not verify your client.

For TLS to work, Facebook gets a TLS certificate for graph.facebook.com and puts it on the server. Your client then asks the server for this certificate and uses a third party to verify that the certificate can be trusted. This third party varies from client to client; in most cases it is the OS hosting your client.

The above explanation is a gross simplification of a somewhat complex process called the TLS handshake. Perhaps in a later post we can dive into this.
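To make the one way case a little more concrete, here is a minimal sketch using only Python's standard ssl module. It does not perform a handshake. It only inspects the default client settings, which already demand a certificate from the server and check it against the host name.

```python
import ssl

# Default client-side settings: the client verifies the server and,
# unless a client certificate is loaded explicitly, presents none itself.
context = ssl.create_default_context()

# The server must present a certificate that chains to a trusted root.
print(context.verify_mode == ssl.CERT_REQUIRED)

# The certificate must also match the host name the client asked for.
print(context.check_hostname)
```

Two way verification would additionally need the client to load its own certificate into the context, which is not shown here.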

Performance Issue

There is a catch! TLS certificates are created against domain names, not IP addresses. So does that mean https://216.58.223.46 as a replacement for https://google.com is not a thing? Does this mean caching DNS lookup results is thrown away with TLS? Maybe, read on.

Please note that all of the following code is written in Python 3.

As a Python developer I assumed that on a Linux/Unix OS if you run the following code, the first call could be slow but the subsequent calls will be faster. I mean, on Chrome it sure seems faster.

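# simple.py
import requests
import time

def call_google():
  # No Session is used, so every call opens a fresh connection.
  requests.get('https://www.google.com')

# Time ten identical calls and print the seconds each one took.
for i in range(10):
  start = time.time()
  call_google()
  print(time.time() - start)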

I was surprised and shocked to learn that this is not strictly the case.

[Figure: Results from running simple.py]

Wireshark revealed that every iteration of the for loop causes a DNS lookup. Why is Linux not caching this? I found out that there is no OS level DNS caching on Linux unless a caching service is set up. Caching is a really hard problem after all. So are we doomed to secure but super slow communications on TLS?
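If you want to check the lookup cost on your own machine without Wireshark, a rough sketch like the one below times the name resolution step on its own. The function name time_lookup and its resolve parameter are my own for illustration; the default resolver is the standard library's socket.getaddrinfo.

```python
import socket
import time

def time_lookup(host, resolve=socket.getaddrinfo):
    """Return (addresses, seconds) for one name resolution of `host`."""
    start = time.perf_counter()
    results = resolve(host, 443)
    elapsed = time.perf_counter() - start
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr)
    # and sockaddr[0] is the IP address.
    addresses = sorted({entry[4][0] for entry in results})
    return addresses, elapsed
```

Calling time_lookup('www.google.com') a few times in a row shows whether repeated lookups get any cheaper on your setup.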

SNI

Okay, if you were attentive you probably called BS on the above because computers always use IP addresses to talk to each other. https://google.com always gets translated to https://216.58.223.46. So then how does TLS work in that case, since the certificate is against google.com and not the IP?

Enter SNI (Server Name Indication). This extension to TLS lets the client connect to an IP address while naming, in the first message of the TLS handshake, the server/domain name it is trying to reach. So calls can be made to https://216.58.223.46 with google.com passed along as the server name, so that the responding server knows which certificate to respond with. In better.py the name is supplied through the HTTP Host header, and the requests-toolbelt adapter uses that header for certificate verification. The rest of the process can then proceed as if the call was made to https://google.com. Good news! This also means that even though Linux does not, we can still cache our DNS lookup results, use SNI, and hopefully speed things up.
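To see where the server name travels, the sketch below uses only Python's standard ssl module to build the first TLS handshake message (the ClientHello) in memory, with no network connection, and checks that the name is inside it. The function name client_hello_for is made up for this example.

```python
import ssl

def client_hello_for(server_name):
    """Build a TLS ClientHello for `server_name` without touching the network."""
    context = ssl.create_default_context()
    incoming = ssl.MemoryBIO()   # bytes from the server would be written here
    outgoing = ssl.MemoryBIO()   # bytes destined for the server collect here
    tls = context.wrap_bio(incoming, outgoing, server_hostname=server_name)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # Expected: the client has written its ClientHello and now waits
        # for a reply that never arrives in this offline sketch.
        pass
    return outgoing.read()

hello = client_hello_for('www.google.com')
# SNI puts the name in the handshake itself, before any HTTP is sent.
print(b'www.google.com' in hello)
```

On a real connection the same server_hostname argument is passed to context.wrap_socket() on a socket that was connected to the IP address.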

Here is how in Python:

# better.py
import random
import time

import dns.resolver
import requests
from requests_toolbelt.adapters import host_header_ssl


class GoogleClient:
  def __init__(self):
    self.cache_ttl = 60     # seconds to keep a DNS answer
    self.cache_expiry = 0   # timestamp after which the cache is stale
    self.cached_ips = []
    self.domain_name = 'www.google.com'

    self.session = requests.Session()
    # The adapter checks the certificate against the Host header
    # rather than against the IP address in the URL.
    self.session.mount('https://', host_header_ssl.HostHeaderSSLAdapter())
    self.headers = {'Host': self.domain_name}

  def refresh_dns(self):
    # dnspython 2.x; on 1.x the equivalent call is dns.resolver.query().
    answer = dns.resolver.resolve(self.domain_name, 'A')
    self.cached_ips = [str(ip) for ip in answer]
    self.cache_expiry = time.time() + self.cache_ttl

  def get_endpoint(self):
    # Look up the name again only when the cached answer has expired.
    if self.cache_expiry < time.time():
      self.refresh_dns()
    return 'https://{}'.format(random.choice(self.cached_ips))

  def call_google(self):
    self.session.get(self.get_endpoint(), headers=self.headers)

client = GoogleClient()
for i in range(10):
  start = time.time()
  client.call_google()
  print(time.time() - start)

Okay, okay, I know. That is a big change. I will not dissect the code line by line. I will just highlight a few important things.

The first call causes a DNS lookup and the results are cached with a TTL of 60 seconds. Then a random IP from the list of possible IPs is used to make the call while passing the domain name in the Host header of HTTP. Subsequent calls keep reusing this list of possible IPs as long as the cache is less than 60 seconds old. Once a call falls outside this 60 second window, a second DNS lookup is needed to refresh the list of IPs.
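The expiry behaviour is easier to see with the network taken out of the picture. The sketch below is not the code from better.py. It is a stripped down TTL cache with a fake resolver and a fake clock (make_cached_resolver, fake_resolve and now are all names invented for this example) that counts how many real lookups happen.

```python
import time

def make_cached_resolver(resolve, ttl=60, clock=time.time):
    """Wrap `resolve(name) -> list of IPs` with a per-name TTL cache."""
    cache = {}  # name -> (expiry timestamp, list of IPs)

    def cached(name):
        expiry, ips = cache.get(name, (0, []))
        if expiry <= clock():
            ips = resolve(name)
            cache[name] = (clock() + ttl, ips)
        return ips

    return cached

# Demo with a fake resolver and a fake clock, so nothing touches the network.
lookups = []
now = [1000.0]

def fake_resolve(name):
    lookups.append(name)
    return ['192.0.2.1', '192.0.2.2']

resolver = make_cached_resolver(fake_resolve, ttl=60, clock=lambda: now[0])
resolver('www.google.com')   # first call: a real lookup
now[0] += 59
resolver('www.google.com')   # still inside the 60 seconds: served from cache
now[0] += 2
resolver('www.google.com')   # 61 seconds after the first call: lookup again
print(len(lookups))
```

Three calls result in two lookups, one at the start and one after the cache has expired.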

[Figure: Results from running better.py]

As the results show, the calls are much faster.

For the better code to work note that I had to install the dnspython and requests-toolbelt packages into my environment:

pip install dnspython requests-toolbelt

Also note that the first call of the better code is relatively slower than the average call of the simpler code. This is because of the DNS lookup and caching step.

Be Warned

The above code is a simplified version of code I use in a production system. Caching is not a simple thing in systems. If Google changed IPs 5 seconds after your DNS lookup, you would be stuck with breaking calls for 55 seconds. So a better strategy would be to refresh DNS after every connection failure or any other unexpected response. I will not go into further details on this; I will save that for a later post.
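As a rough idea of what refreshing on failure could look like, here is a minimal sketch. It is not my production code. The helper call_with_dns_refresh and the three callables it takes are hypothetical, and the demo uses fake functions instead of real network calls. With requests you would likely pass requests.exceptions.ConnectionError as retry_on.

```python
def call_with_dns_refresh(pick_ip, refresh_dns, do_call,
                          retries=1, retry_on=(OSError,)):
    """Call `do_call(ip)`; on a listed error refresh DNS and try again."""
    for attempt in range(retries + 1):
        ip = pick_ip()
        try:
            return do_call(ip)
        except retry_on:
            if attempt == retries:
                raise          # out of retries: let the caller see the error
            refresh_dns()      # the cached IPs may be stale, look them up again

# Demo with fakes: the cached IP is stale and a refresh yields a working one.
ips = ['192.0.2.1']

def pick_ip():
    return ips[0]

def refresh_dns():
    ips[:] = ['192.0.2.2']

def do_call(ip):
    if ip == '192.0.2.1':
        raise ConnectionError('old address no longer answers')
    return 'ok from ' + ip

result = call_with_dns_refresh(pick_ip, refresh_dns, do_call)
print(result)
```

The TTL stays in place as the normal refresh path. The retry only covers the window where the cached answer has gone bad early.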

Echo

If you, like me, thought that the OS has your back in caching DNS lookups, think again. I also tested this on Windows and found similar results, which suggests that Windows was not caching these DNS lookups at the OS level either.

Your language might have a client that does this for you automatically. It is worth a check. If you have any questions or comments, ping me on Twitter for a chat.

Ciao for now!

[1]: The Engineering Triangle is the idea that most engineering problems are solved by trading off three desirable properties against each other, e.g. Security, Convenience and Performance are always at odds with one another in systems.

[2]: SSL - Secure Sockets Layer. TLS - Transport Layer Security. Protocols for providing secure communications over otherwise insecure protocols like IP and HTTP.