あどけない話

Internet technologies

QUIC and Linux capabilities

For security reasons, the typical boot process of HTTPS servers is as follows:

  • The server is executed as root.
  • It reads a TLS private key and opens a listen socket on TCP port 443.
  • It switches from root to nobody (or a similar unprivileged user).

Since accept() can create connected sockets bound to TCP port 443 even without root privileges, the server can keep accepting connections.
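
As a rough illustration, here is a minimal sketch of this boot sequence using the network and unix packages (TLS handling omitted, IPv4 only); it is not taken from any particular server:

import Control.Monad (forever)
import Network.Socket
import System.Posix.User (getUserEntryForName, setUserID, userID)

-- A sketch: start as root, bind TCP port 443, then drop to "nobody".
main :: IO ()
main = do
    -- Still root: read the TLS private key here, then open the listen socket.
    s <- socket AF_INET Stream defaultProtocol
    setSocketOption s ReuseAddr 1
    bind s (SockAddrInet 443 (tupleToHostAddress (0, 0, 0, 0)))
    listen s 1024
    -- Switch to an unprivileged user.
    nobody <- getUserEntryForName "nobody"
    setUserID (userID nobody)
    -- accept() still creates connected sockets bound to TCP port 443.
    forever $ do
        (conn, _peer) <- accept s
        close conn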

Now let's consider QUIC servers, which use UDP. Chrome does not allow Alt-Svc to cross the privileged-port boundary (i.e. 1024). For instance, `Alt-Svc: h3=":4433"` provided on TCP port 443 does not work. QUIC servers should therefore provide QUIC on UDP port 443.

Some QUIC servers make use of connected UDP sockets. As I described in "Implementation status of QUIC in Haskell", the following procedure can be used to create a connected UDP socket when a packet arrives on a listen-like socket bound to 192.0.2.1:443 (a sketch follows the list):

  1. Create a new UDP socket with SO_REUSEADDR
  2. Bind it to *:443
  3. Connect it to 203.0.113.0:3456. This also binds the local address to 192.0.2.1.

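The following is a minimal sketch of these three steps with the network package (IPv4 only, error handling omitted); the actual code in the quic library is more involved:

import Network.Socket

-- peer is the client's address, e.g. 203.0.113.0:3456 in the example above.
newConnectedUDPSocket :: SockAddr -> IO Socket
newConnectedUDPSocket peer = do
    s <- socket AF_INET Datagram defaultProtocol
    setSocketOption s ReuseAddr 1                                -- 1. SO_REUSEADDR
    bind s (SockAddrInet 443 (tupleToHostAddress (0, 0, 0, 0)))  -- 2. bind to *:443
    connect s peer  -- 3. this also fixes the local address (192.0.2.1 above)
    return s
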
For step 2, root privilege or the CAP_NET_BIND_SERVICE capability is necessary on Linux. The easiest way to implement a secure QUIC server is to use the setcap command:

% sudo setcap "CAP_DAC_READ_SEARCH,CAP_NET_BIND_SERVICE+epi" quic-server
% sudo -u nobody -g nobody ./quic-server

CAP_DAC_READ_SEARCH is necessary to read the TLS private key. Since this capability is never dropped, the server can read any file. Yes, that is still insecure.

To keep only CAP_NET_BIND_SERVICE, the following code should be run after reading the private key:

  /* root */
  /* inherits all capabilities */
  prctl(PR_SET_SECUREBITS, SECBIT_KEEP_CAPS, 0L, 0L, 0L);

  setuid(99);
  /* nobody */

  /* drop capabilities except CAP_NET_BIND_SERVICE */
  cap_t caps = cap_from_text("cap_net_bind_service=ipe");
  cap_set_proc(caps);
  cap_free(caps);

This probably works for most QUIC servers. However, it does not work for Haskell. Linux capabilities are per-thread, and GHC's threaded RTS spawns several native threads before running Haskell code. If I understand correctly, there is no way to set SECBIT_KEEP_CAPS for all native threads.

The manual page of capabilities says:

Neither glibc, nor the Linux kernel honors POSIX semantics for setting capabilities and securebits in the presence of pthreads. That is, changing capability sets, by default, only affect the running thread. To be meaningfully secure, however, the capability sets should be mirrored by all threads within a common program because threads are not memory isolated. As a workaround for this, libcap is packaged with a separate POSIX semantics system call library: libpsx. If your program uses POSIX threads, to achieve meaningful POSIX semantics capability manipulation, you should link your program with:

ld ... -lcap -lpsx -lpthread --wrap=pthread_create

or,

gcc ... -lcap -lpsx -lpthread -Wl,-wrap,pthread_create

This workaround cannot be applied to Haskell. In my opinion, Linux capabilities and securebits should be per-process.

Testing QUIC servers with h3spec

h2spec is an excellent test tool for checking whether HTTP/2 servers handle error cases correctly. When I was developing the HTTP/2 server library in Haskell, I used Firefox and Chrome for normal cases and h2spec for error cases. h2spec helped me a lot in improving the quality of the library. What is surprising is that the author, Moto Ishizawa, is still enhancing h2spec!

Since the QUIC library in Haskell supports both the client side and the server side, normal cases can be tested by the library itself. I wanted to test error cases, too. After two weeks of consideration, it turned out that QUIC error cases can be generated easily. The key idea is providing hooks to convert data structures:

data Hooks = Hooks {
    onPlainCreated  :: EncryptionLevel -> Plain -> Plain
  , onTransportParametersCreated :: Parameters -> Parameters
  , onTLSExtensionCreated :: [ExtensionRaw] -> [ExtensionRaw]
  }

If we want to test an error case where an unknown frame is included in a QUIC packet, we can provide the following function as onPlainCreated:

unknownFrame :: EncryptionLevel -> Plain -> Plain
unknownFrame lvl plain
  | lvl == RTT1Level = plain { plainFrames = UnknownFrame 0x20 : plainFrames plain }
  | otherwise        = plain

Hooks is passed to the client/server runners via their configuration. I used our lovely hspec to run the tests:

transportSpec :: ClientConfig -> SpecWith a
transportSpec cc0 = do
    describe "QUIC servers" $ do
        it "MUST send FRAME_ENCODING_ERROR if a frame of unknown type is received [Transport 12.4]" $ \_ -> do
            let cc = addHook cc0 $ setOnPlainCreated unknownFrame
            runC cc waitEstablished `shouldThrow` transportError FrameEncodingError

transportError :: TransportError -> QUICError -> Bool
transportError te (TransportErrorOccurs te' _) = te == te'
transportError _  _                            = False

The notation of `shouldThrow` transportError FrameEncodingError is really cool, isn't it?

After adding some error cases, I hit upon the idea of a command-line tool that reuses these error cases simply by adding another main function. Moto agreed that I could take the name h3spec. Here is an example run of h3spec:

% h3spec -v
h3spec 0.0.4
% h3spec mew.org 443

QUIC servers

  MUST send TRANSPORT_PARAMETER_ERROR if initial_source_connection_id is missing [Transport 7.3]
  MUST send TRANSPORT_PARAMETER_ERROR if original_destination_connection_id is received [Transport 18.2]
  MUST send TRANSPORT_PARAMETER_ERROR if preferred_address is received [Transport 18.2]
  MUST send TRANSPORT_PARAMETER_ERROR if retry_source_connection_id is received [Transport 18.2]
  MUST send TRANSPORT_PARAMETER_ERROR if stateless_reset_token is received [Transport 18.2]
  MUST send TRANSPORT_PARAMETER_ERROR if max_udp_payload_size is invalid [Transport 7.4 and 18.2]
  MUST send FRAME_ENCODING_ERROR if a frame of unknown type is received [Transport 12.4]
  MUST send PROTOCOL_VIOLATION on no frames [Transport 12.4]
  MUST send PROTOCOL_VIOLATION if reserved bits in Handshake are non-zero [Transport 17.2]
  MUST send PROTOCOL_VIOLATION if reserved bits in Short are non-zero [Transport 17.2]
  MUST send PROTOCOL_VIOLATION if NEW_TOKEN is received [Transport 19.7]
  MUST send STREAM_STATE_ERROR if MAX_STREAM_DATA is received for a non-existing stream [Transport 19.9] FAILED [1]
  MUST send PROTOCOL_VIOLATION if HANDSHAKE_DONE is received [Transport 19.20]
  MUST send no_application_protocol TLS alert if no application protocols are supported [TLS 8.1]
  MUST send the missing_extension TLS alert if the quic_transport_parameters extension is not included [TLS 8.2]

Failures:

  Transport.hs:60:13:
  1) QUIC servers MUST send STREAM_STATE_ERROR if MAX_STREAM_DATA is received for a non-existing stream [Transport 19.9]
       did not get expected exception: QUICError

  To rerun use: --match "/QUIC servers/MUST send STREAM_STATE_ERROR if MAX_STREAM_DATA is received for a non-existing stream [Transport 19.9]/"

Randomized with seed 1914918977

Finished in 0.7035 seconds
15 examples, 1 failure

I'm planning to add some more error cases. Enjoy!

The Current Plan for Haskell QUIC

These days, QUIC is a hot topic. As the following blog posts describe, the standardization of QUIC is now in its final stage and h3-29 (HTTP/3 over QUIC draft 29) is already out there:

These blog posts motivated me to explain my plan for "Haskell QUIC" to the Haskell community. As I described in "Implementing HTTP/3 in Haskell" and "Developing QUIC Loss Detection and Congestion Control in Haskell", Haskell QUIC servers can communicate with Firefox Nightly and Chrome via HTTP/3. Also, as you can see in "IETF QUIC Interop Matrix", the interoperability has been tested with many other QUIC implementations.

After the QUIC drafts are published as a set of RFCs, I will release several Haskell libraries in the following order:

  • tls to provide the Network.TLS.QUIC module
  • http2 to provide necessary functions for QPACK and HTTP/2 client library
  • quic to provide QUIC core features
  • http3 to provide HTTP/3 client/server libraries and QPACK
  • warp-quic to glue QUIC with WAI (Web Application Interface)
  • mighttpd2 to provide HTTP/3 server

I expect that this will happen around April 2021. For the second half of the Japanese fiscal year 2020 (Oct 2020 - Mar 2021), I'm planning to tackle the following (in no particular order):

Stay tuned.

Improving QUIC APIs of the TLS library in Haskell

In "Implementation status of QUIC in Haskell", I briefly described QUIC APIs of the TLS library in Haskell. I first invented APIs based on static functions but switched to the thread-based approach to follow Olivier Chéron's recommendation. The current APIs got two steps further. This article describes how Olivier and I improved the thread-based APIs.

Recall the handshake flow of TLS 1.3:

[Figure: The handshake flow of TLS 1.3]

This article focuses on the client side because it is more challenging than the server side.

Thread-based APIs

With the thread-based APIs, a designated TLS thread is spawned and its controller is created by the following API:

newQUICClient :: ClientParams -> IO ClientController

ClientController is defined as follows:

type ClientController = ClientControl -> IO ClientStatus

That is, the controller passes ClientControl to the designated thread and receives ClientStatus. ClientControl is defined as follows:

data ClientControl =
    GetClientHello                 -- ^ 'SendClientHello'
  | PutServerHello ServerHello     -- ^ 'SendClientHello', 'RecvServerHello', 'ClientNeedsMore'
  | PutServerFinished Finished     -- ^ 'SendClientFinished'
  | PutSessionTicket SessionTicket -- ^ 'RecvSessionTicket'
  | ExitClient                     -- ^ 'ClientHandshakeDone'

The comments on the right-hand side describe the corresponding ClientStatus values. The entire definition of ClientStatus is as follows:

data ClientStatus =
    ClientNeedsMore
  | SendClientHello ClientHello (Maybe EarlySecretInfo)
  | RecvServerHello HandshakeSecretInfo
  | SendClientFinished Finished [ExtensionRaw] ApplicationSecretInfo
  | RecvSessionTicket
  | ClientHandshakeDone

Users of the controller can receive key information, the TLS extension for QUIC transport parameters, and so on through ClientStatus. The following is an example usage of the controller:

handshakeClient :: ClientConfig -> Connection -> IO ()
handshakeClient conf conn = do
    ver <- getVersion conn
    let sendEarlyData = isJust $ ccEarlyData conf
    control <- clientController conf ver (setResumptionSession conn) sendEarlyData
    setClientController conn control
    sendClientHelloAndRecvServerHello control conn $ ccEarlyData conf
    recvServerFinishedSendClientFinished control conn

sendClientHelloAndRecvServerHello :: ClientController -> Connection -> Maybe (StreamId,ByteString) -> IO ()
sendClientHelloAndRecvServerHello control conn mEarlyData = do
    SendClientHello ch0 mEarlySecInf <- control GetClientHello
    setEarlySecretInfo conn mEarlySecInf
    sendCryptoData conn $ OutHndClientHello ch0 mEarlyData
    (InitialLevel, sh0) <- recvCryptoData conn
    state0 <- control $ PutServerHello sh0
    case state0 of
      RecvServerHello hndSecInf -> do
          setHandshakeSecretInfo conn hndSecInf
          setEncryptionLevel conn HandshakeLevel
      SendClientHello ch1 mEarlySecInf1 -> do
          setEarlySecretInfo conn mEarlySecInf1
          sendCryptoData conn $ OutHndClientHello ch1 Nothing
          (InitialLevel, sh1) <- recvCryptoData conn
          state1 <- control $ PutServerHello sh1
          case state1 of
            RecvServerHello hndSecInf -> do
                setHandshakeSecretInfo conn hndSecInf
                setEncryptionLevel conn HandshakeLevel
            _ -> E.throwIO $ HandshakeFailed "sendClientHelloAndRecvServerHello"
      _ -> E.throwIO $ HandshakeFailed "sendClientHelloAndRecvServerHello"

recvServerFinishedSendClientFinished :: ClientController -> Connection -> IO ()
recvServerFinishedSendClientFinished control conn = loop (1 :: Int)
  where
    loop n = do
        (HandshakeLevel, eesf) <- recvCryptoData conn
        state <- control $ PutServerFinished eesf
        case state of
          ClientNeedsMore -> do
              -- Sending ACKs for three times rule
              when ((n `mod` 3) == 2) $
                  sendCryptoData conn $ OutControl HandshakeLevel []
              loop (n + 1)
          SendClientFinished cf exts appSecInf -> do
              setApplicationSecretInfo conn appSecInf
              setEncryptionLevel conn RTT1Level
              setPeerParams conn exts
              sendCryptoData conn $ OutHndClientFinished cf
          _ -> E.throwIO $ HandshakeFailed "putServerFinished"

NewSessionTicket in a CRYPTO frame is passed to the controller by a receiver thread.

Introducing callbacks

Olivier noticed that both the designated thread and the user of the controller manage the TLS state. To reduce the number of states on the user side, he defined callbacks:

data QUICCallbacks = QUICCallbacks {
    quicSend              :: [(CryptLevel, ByteString)] -> IO ()
  , quicRecv              :: CryptLevel -> IO (Either TLSError ByteString)
  , quicInstallKeys       :: KeyScheduleEvent -> IO ()
  , quicNotifyExtensions  :: [ExtensionRaw] -> IO ()
  }

Now newQUICClient additionally takes QUICCallbacks, while the definition of ClientController remains the same:

newQUICClient :: ClientParams -> QUICCallbacks -> IO ClientController
type ClientController = ClientControl -> IO ClientStatus

Then Olivier divided the client side into two phases:

  • Phase 1: Send ClientHello, receive ServerHello etc and send Finished.
  • Phase 2: Receive NewSessionTicket and store it through a session manager.

The inside of each phase is opaque to users. Both ClientControl and ClientStatus were simplified as follows:

data ClientControl =
    EnterClient          -- ^ 'ClientHandshakeComplete', 'ClientHandshakeFailed'
  | RecvSessionTickets   -- ^ 'ClientRecvSessionTicket', 'ClientHandshakeFailed'
  | ExitClient           -- ^ 'ClientHandshakeDone'

data ClientStatus =
    ClientHandshakeComplete
  | ClientRecvSessionTicket
  | ClientHandshakeDone
  | ClientHandshakeFailed TLSError
  deriving Show

Phase 1 is handled by the user, while phase 2 is handled by a separate thread:

do handshakeClient conf conn `E.onException` clearThreads conn
   tid4 <- forkIO $ getClientController conn >>= handshakeClientAsync conn

The following are the definitions of handshakeClient and handshakeClientAsync:

handshakeClient :: ClientConfig -> Connection -> IO ()
handshakeClient conf conn = do
    ver <- getVersion conn
    hsr <- newHndStateRef
    let sendEarlyData = isJust $ ccEarlyData conf
        qc = QUICCallbacks { quicSend = sendTLS conn hsr
                           , quicRecv = recvTLS conn hsr
                           , quicInstallKeys = installKeysClient
                           , quicNotifyExtensions = setPeerParams conn
                           }
    control <- clientController qc conf ver (setResumptionSession conn) sendEarlyData
    setClientController conn control
    state <- control EnterClient
    case state of
        ClientHandshakeComplete -> return ()
        ClientHandshakeFailed e -> notifyPeer conn e >>= E.throwIO
        _ -> E.throwIO $ HandshakeFailed $ "handshakeClient: unexpected " ++ show state

  where
    installKeysClient (InstallEarlyKeys mEarlySecInf) = do
        setEarlySecretInfo conn mEarlySecInf
        sendCryptoData conn $ OutEarlyData (ccEarlyData conf)
    installKeysClient (InstallHandshakeKeys hndSecInf) = do
        setHandshakeSecretInfo conn hndSecInf
        setEncryptionLevel conn HandshakeLevel
    installKeysClient (InstallApplicationKeys appSecInf) = do
        setApplicationSecretInfo conn appSecInf
        setEncryptionLevel conn RTT1Level

-- second half of the TLS handshake, executed outside the main thread
handshakeClientAsync :: Connection -> ClientController -> IO ()
handshakeClientAsync conn control = handleLog logAction $ forever $ do
    state <- control RecvSessionTickets
    case state of
        ClientRecvSessionTicket -> return ()
        ClientHandshakeFailed e -> notifyPeerAsync conn e >>= E.throwIO
        _ -> E.throwIO $ HandshakeFailed $ "unexpected " ++ show state
  where
    logAction msg = connDebugLog conn ("client handshake: " ++ msg)

Stateless APIs

When I tried to implement HTTP/3, I noticed a disadvantage of these APIs: handshakeClient returns only when 1-RTT gets ready. This means that dynamically created 0-RTT early data cannot be sent.

So, I removed both ClientControl and ClientStatus from the user-facing API completely. newQUICClient was renamed to tlsQUICClient, and it does not return a ClientController anymore:

tlsQUICClient :: ClientParams -> QUICCallbacks -> IO ()

QUICCallbacks gains one more field, quicDone, which is mainly used on the server side:

data QUICCallbacks = QUICCallbacks {
    quicSend             :: [(CryptLevel, ByteString)] -> IO ()
  , quicRecv             :: CryptLevel -> IO (Either TLSError ByteString)
  , quicInstallKeys      :: Context -> KeyScheduleEvent -> IO ()
  , quicNotifyExtensions :: Context -> [ExtensionRaw] -> IO ()
  , quicDone             :: Context -> IO ()
  }

handshakeClient forks the tlsQUICClient action and synchronizes with it so that it returns when either 0-RTT or 1-RTT gets ready:

handshakeClient :: ClientConfig -> Connection -> AuthCIDs -> IO ()
handshakeClient conf conn myAuthCIDs = do
    ver <- getVersion conn
    hsr <- newHndStateRef
    let use0RTT = ccUse0RTT conf
        qc = QUICCallbacks { quicSend = sendTLS conn hsr
                           , quicRecv = recvTLS conn hsr
                           , quicInstallKeys = installKeysClient hsr
                           , quicNotifyExtensions = setPeerParams conn
                           , quicDone = done
                           }
        setter = setResumptionSession conn
        handshaker = tlsQUICClient ... qc
    tid <- forkIO (handshaker `E.catch` tell)
    qlogParamsSet conn (confParameters (ccConfig conf), "local")
    setKillHandshaker conn tid
    if use0RTT then
       wait0RTTReady conn
     else
       wait1RTTReady conn
  where
    tell (TLS.HandshakeFailed (TLS.Error_Misc _)) = return () -- thread blocked
    tell e = notifyPeer conn $ getErrorCause e
    installKeysClient _ _ctx (InstallEarlyKeys Nothing) = return ()
    installKeysClient _ _ctx (InstallEarlyKeys (Just (EarlySecretInfo cphr cts))) = do
        setCipher conn RTT0Level cphr
        initializeCoder conn RTT0Level (cts, ServerTrafficSecret "")
        setConnection0RTTReady conn
    installKeysClient hsr _ctx (InstallHandshakeKeys (HandshakeSecretInfo cphr tss)) = do
        setCipher conn HandshakeLevel cphr
        setCipher conn RTT1Level cphr
        initializeCoder conn HandshakeLevel tss
        setEncryptionLevel conn HandshakeLevel
        rxLevelChanged hsr
    installKeysClient hsr ctx (InstallApplicationKeys appSecInf@(ApplicationSecretInfo tss)) = do
        storeNegotiated conn ctx appSecInf
        initializeCoder conn RTT1Level tss
        setEncryptionLevel conn RTT1Level
        rxLevelChanged hsr
        setConnection1RTTReady conn
        cidInfo <- getNewMyCID conn
        let ncid = NewConnectionID cidInfo 0
        putOutput conn $ OutControl RTT1Level [ncid]
    done _ctx = do
        info <- getConnectionInfo conn
        connDebugLog conn $ bhow info

As with the first approach, NewSessionTicket in a CRYPTO frame is handed over by a receiver thread.

Now the APIs are thread-based, stateless and driven purely by callbacks!

Note

If you want to follow the discussions in depth, please refer to the following pull requests:

Developing QUIC Loss Detection and Congestion Control in Haskell

For the last two months, I have been trying to implement "QUIC Loss Detection and Congestion Control" in Haskell. This blog article gives a brief summary of what I have done.

ACK handling

Before loss detection and congestion control were developed, QUIC packets were retransmitted, if necessary, by a simple resender lightweight thread based on timeouts. The internal data structure was, of course, a Priority Search Queue (PSQ), where a packet number is the key, a timeout value is the priority and a plain packet is the value. Since both packet numbers and timeout values increase monotonically, packet numbers were used with their bit order reversed. With this PSQ, three threads behave as follows:

  • A sender stores sent packets in the PSQ.
  • A receiver removes sent packets from the PSQ when valid ACKs are received.
  • A resender repeatedly wakes up, deletes timed-out packets from the PSQ and retransmits them.

When the receiver receives an ACK frame, its ACK ranges are translated into a list of packet numbers like [4,5,7,8,9]. Then each packet number is used to delete the corresponding packet from the PSQ. Each deletion costs O(log n), where n is the size of the PSQ.

After deploying Haskell QUIC on www.mew.org, I noticed that Firefox Nightly could not download video streams smoothly while Chrome Canary could. Chrome Canary sends ACKs like this:

[0,1,2,3]
[4,5,6,7]
[8,9,10,11]

This means that Chrome Canary removes unnecessary packet numbers once an ACK is acknowledged. By contrast, Firefox Nightly sends ACKs like this:

[0,1,2,3]
[0,1,2,3,4,5,6,7]
[0,1,2,3,4,5,6,7,8,9,10,11]

This is perfectly valid according to the spec. But my poor implementation suffered from the O(m log n) cost, where m is the number of packet numbers in an ACK frame, and Firefox Nightly produces really large values of m.

As a workaround, I modified the code to keep track of the highest packet number already deleted so that the effective m stays small. Thanks to this, Firefox Nightly became able to download video streams smoothly. However, the computational complexity is still O(m log n).

Then I suddenly hit upon a simple and elegant algorithm. If a list-like data structure is used to store sent packets, ACK ranges can be expressed as a predicate. With the example of [4,5,7,8,9], the ranges can be expressed as [(4,5),(7,9)], which can be translated into the following predicate function:

predicate :: PlainPacket -> Bool
predicate pkt = (4 <= n && n <= 5) || (7 <= n && n <= 9)
  where
    n = packetNumber pkt

A list-like data structure provides a partition function:

-- | The partition function takes a predicate and a list and returns
-- the pair of lists of elements which do and do not satisfy the
-- predicate, respectively.
partition :: (a -> Bool) -> [a] -> ([a], [a])

So, partition splits the list into a list of ACKed packets and a list of not-yet-ACKed packets in O(n). I chose the finger tree (i.e. Data.Sequence) as the list-like data structure, since QUIC loss detection manipulates both ends.
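
To make the idea concrete, here is a rough sketch with Data.Sequence; the record type and its field name are hypothetical stand-ins for the library's sent-packet type:

import qualified Data.Sequence as Seq
import Data.Sequence (Seq)

-- Hypothetical, simplified sent-packet record.
data Sent = Sent { sentPacketNumber :: Int } deriving (Eq, Show)

-- Build a predicate from ACK ranges such as [(4,5),(7,9)].
ackedBy :: [(Int, Int)] -> Sent -> Bool
ackedBy ranges pkt = any (\(lo, hi) -> lo <= n && n <= hi) ranges
  where n = sentPacketNumber pkt

-- Split the outstanding packets into ACKed and still-outstanding parts in O(n).
onAckReceived :: [(Int, Int)] -> Seq Sent -> (Seq Sent, Seq Sent)
onAckReceived ranges = Seq.partition (ackedBy ranges)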

Translating Pseudocode

At that point, I started to develop loss detection and congestion control by translating the pseudocode in the draft into Haskell. As Haskellers can imagine, it was tough to convert the imperative, Python-like code into functional Haskell code. Through this work, I found several inconsistencies between the spec and the pseudocode. Yes, contributing to protocol standardization is one of my missions.

Since it was the first time for me to implement a congestion control algorithm, I wondered whether my code worked well. To make debugging easier, I enhanced my qlog code so that qvis could visualize the behavior of congestion control. qvis even visualizes flow control for both streams and the connection. Without qvis, I could not have made my code this stable. I am very grateful to Robin Marx, the developer of qlog and qvis.

According to the pseudocode, packet losses are detected and retransmission is carried out when an ACK frame is received. So the resender thread was not necessary anymore.

In addition to translating the pseudocode faithfully, I extended the code for an issue pointed out by Kazuho Oku. Suppose that the following ACK is received:

[100-101,103-104]

The pseudocode detects the loss of packet 102. However, the following ACK might arrive soon after:

[100-104]

This ACK pattern is caused by UDP packet reordering. To prevent such spurious retransmissions, I added another finger tree to keep loss candidates for a while and re-introduced the resender thread:

resender :: LDCC -> IO ()
resender ldcc@LDCC{..} = forever $ do
    atomically $ do
        lostPackets <- readTVar lostCandidates
        check (lostPackets /= emptySentPackets)
    delay $ Microseconds 10000 -- 10 ms
    packets <- atomically $ do
        SentPackets pkts <- readTVar lostCandidates
        writeTVar lostCandidates emptySentPackets
        return pkts
    when (packets /= Seq.empty) $ do
        onPacketsLost ldcc packets
        retransmit ldcc packets

The resender thread wakes up when lostCandidates becomes non-empty, waits for 10 milliseconds and retransmits the lost packets if they still remain. Spurious retransmissions do not occur as long as the necessary ACKs arrive within 10 milliseconds.

Testing

Due to COVID-19, I'm developing QUIC at home. In some sense, this is lucky because I see many packet losses from my home network to the internet, which is suitable for developing loss detection and congestion control. (Since our company, IIJ, is an ISP, our office network is really stable. :) In addition to testing my code on the real internet, I felt that comprehensive testing was necessary.

So, I developed a configurable pipe based on lightweight threads. This pipe can drop packets randomly or according to specified scenarios. I wrote scenarios, each of which drops a specific packet during the QUIC handshake. These scenarios uncovered a lot of deadlocks, resulting in much more stable code.
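
To give a flavor of the idea, here is a minimal sketch of such a lossy pipe (not the actual code in the quic test suite); the DropPolicy type and the dropThird scenario are made up for illustration:

import Control.Concurrent.Chan (Chan, readChan, writeChan)
import Control.Monad (forever, unless)
import qualified Data.ByteString as BS
import Data.IORef (atomicModifyIORef', newIORef)

-- Decide, given the index of a packet, whether to drop it.
type DropPolicy = Int -> Bool

-- Forward packets from one end to the other, dropping according to the policy.
-- Run one of these in each direction with forkIO to form a bidirectional pipe.
lossyPipe :: DropPolicy -> Chan BS.ByteString -> Chan BS.ByteString -> IO ()
lossyPipe shouldDrop from to = do
    counter <- newIORef (0 :: Int)
    forever $ do
        pkt <- readChan from
        n   <- atomicModifyIORef' counter (\c -> (c + 1, c))
        unless (shouldDrop n) $ writeChan to pkt

-- Example scenario: drop only the third packet.
dropThird :: DropPolicy
dropThird n = n == 2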

Timer

Timers are necessary everywhere when implementing QUIC. For loss detection, two timers are used: one to detect packet losses and the other to send a probe that detects tail packet losses. This section talks about how to implement low-cost timers.

Time library

To obtain the current time and calculate timeout values, I initially chose the hourglass library. Surprisingly, profiling showed that the hottest spot was this library's timeCurrentP, which is based on clock_gettime(2). So, I switched to my unix-time library. Internally it calls gettimeofday(2), which is highly optimized on some OSes.

Timeout function

Haskell provides beautiful abstraction for timeout:

timeout :: Int -> IO a -> IO (Maybe a) 

If the IO a action finishes before the timeout expires, Just a is returned. If the timeout fires, Nothing is returned.
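
For example, the following trivial usage waits at most one second for a line of input:

import System.Timeout (timeout)

main :: IO ()
main = do
    mline <- timeout 1000000 getLine   -- 1,000,000 microseconds = 1 second
    case mline of
        Nothing   -> putStrLn "timed out"
        Just line -> putStrLn ("got: " ++ line)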

As "Parallel and Concurrent Programming in Haskell" explains, the original implementation first forks a timeout thread to send a signal to the original thread on timeout. If the action is finished before timeout, it is the timer thread that is killed. The current implementation delays the execution of forkIO until the timeout is expired.

Anyway, I don't want to spawn a thread for each timeout. So, a single designated thread with a request queue is created per QUIC connection, and it sends signals according to the requests.

Reducing the number of timeouts

To send a probe packet, the timer is set every time a packet is sent. If packets are sent in bulk, the timer is set multiple times but only the last setting is effective. To reduce this waste, a queue for timeout operations was introduced. Like the resender thread above, a designated thread wakes up when the queue becomes non-empty, waits for 10 milliseconds and executes only the last of the queued timeout operations. This optimization is applied only to the probe timer at the 1-RTT level.
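
The following is a sketch of this coalescing loop in the style of the resender above; it is not the library's actual code. Callers would push a new operation with atomically (modifyTVar' pending (op :)), so the head of the list is always the most recent setting:

import Control.Concurrent (threadDelay)
import Control.Concurrent.STM
import Control.Monad (forever)

coalesceProbeTimer :: TVar [IO ()] -> IO ()
coalesceProbeTimer pending = forever $ do
    -- Block until at least one timeout operation has been queued.
    atomically $ readTVar pending >>= check . not . null
    threadDelay 10000                       -- 10 ms: let further settings accumulate
    ops <- atomically $ do
        ops <- readTVar pending
        writeTVar pending []
        return ops
    case ops of
        (latest : _) -> latest              -- execute only the last setting
        []           -> return ()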

Pacing

The QUIC recovery draft suggests pacing the sending of packets. On Linux, there are two pacing mechanisms: SO_MAX_PACING_RATE and SO_TXTIME. As "Accelerating UDP packet transmission for QUIC" describes, each has advantages and disadvantages:

  • SO_MAX_PACING_RATE: works with Generic Segmentation Offload (GSO) but is only meaningful for connected sockets.
  • SO_TXTIME: is meaningful even for wildcard sockets but does not work with GSO.

As I described in "Implementation status of QUIC in Haskell", the quic library in Haskell uses connected sockets, I tried SO_MAX_PACING_RATE. To use this socket option, you need to attach fq (Fair Queue) to your NIC:

% sudo tc qdisc add dev eth0 root fq 

After some trials, I realized that the quic library cannot tell how much delay is added to each packet it sends. This results in inaccurate RTT measurement. So, I gave up on this approach.

The good news is that people are trying to make SO_TXTIME work with GSO. I will get back to pacing when I try to use GSO in my library.

Corrections to "パケットの設計から見るQUIC"

QUIC is such a large protocol that even I, who have been implementing it for a year and a half, can hardly claim to grasp the whole picture. To get a quick understanding of one aspect of it, I recommend "パケットの設計から見るQUIC" (QUIC from the viewpoint of packet design), written by Nishida-san in n月刊ラムダノート Vol.2, No.1 (2020). From a QUIC expert's point of view, however, it contains some slightly inaccurate parts, so I have summarized the points that should be corrected. (Sorry for being late.) I have let Nishida-san know that I would publish this article, but everything written here is purely my own opinion.

The first page

  • "QUIC (Quick UDP Internet Connections) is a transport protocol designed as a mechanism that provides high reliability and can be used on top of the Internet architecture."

The QUIC being standardized at the IETF is not an abbreviation of "Quick UDP Internet Connections". It is not an abbreviation of anything.

1.3.1 Connection ID

  • "If the server does not want to use this destination connection ID, it can send back a Long header specifying a more convenient connection ID. The packet type in this case is Retry."

The connection ID can indeed be changed with Retry, but it is normally changed with Initial. Retry is used for address validation.

1.3.2 Packet number

  • "The Short header has a field called the packet number."

Packet numbers also exist in Initial and Handshake packets.

  • "For example, if the value of the Offset field is 10000, the data in this Stream Frame starts at the 10000th byte from the beginning of the application data."

Since an Offset of 0 would normally be described as the 1st byte, this should be the 10001st byte.

  • "Store application data with a larger Offset in the packet with packet number 2 than in the packet with packet number 3"

If slightly larger packet numbers had been used, the reordering caused by retransmission could have been expressed more naturally.

  • "In order to decrypt packets encrypted with different algorithms correctly and efficiently, an independent packet number space is required for each."

This reads as if the separation exists for efficient decryption, but it is not for efficiency. The packet number spaces are separated for security.

1.4 Establishing a QUIC connection

This is just my impression, but the structure becomes complicated because the discussion of 0-RTT starts before 1-RTT is explained in detail. Explaining 0-RTT after 1-RTT would have made it simpler.

  • "Incidentally, there is no big difference between 0-RTT and 1-RTT in the number of packets required to establish a connection."

0-RTT presupposes a PSK. A PSK is a mechanism for resuming a session, and its point is that server authentication is omitted. In other words, the server does not have to send its certificate. Certificates are generally large and can amount to four or more QUIC packets. Describing the saving of, say, four packets in a connection establishment that takes only a few packets as "no big difference" feels wrong to me. I think it would have been better to leave this sentence out.

1.5.2 ACK frame

  • "In this case, a total of two Gap and ACK Range fields are used, so the ACK Range Count field is 2."

The ACK Range Count does not include the First ACK Range, so in this case it is 1.

Implementing HTTP/3 in Haskell

Mew.org is now speaking HTTP/3 (HTTP over QUIC). If you access the site using Firefox Nightly, the first connection will be HTTP/2 and the following connections should be HTTP/3, guided by Alt-Svc:.

[Figure: Firefox Nightly]

This article explains insights that I gained through implementing QUIC and HTTP/3 in Haskell.

HTTP/2 server library

I started implementing QUIC in January 2019. It took four months to reach a toy QUIC client since the negotiation part is really complicated. When I tackled the server side, my brain got befuddled: I had no idea about the server architecture.

So, I went back to HTTP/2. As described in "HTTP/2 server library in Haskell", I succeeded in extracting an HTTP/2 server library from our HTTP/2 server.

QUIC client and server

After resuming the QUIC implementation, I spent countless hours developing it. Finally, I joined the 16th and 17th interop test events.

As described in Implementation status of QUIC in Haskell, I defined the following API:

runQUICServer :: ServerConfig -> (Connection -> IO ()) -> IO ()

When a QUIC connection is created on the server side, a designated lightweight thread is spawned and given a value of the Connection type. This abstraction seems reasonable because Connection hides the internal information about a QUIC connection. However, I had no idea how to abstract QUIC streams at that moment. So, I temporarily defined the following APIs:

type StreamID = Int
type Fin = Bool
recvStream :: Connection -> IO (StreamID, ByteString)
sendStream :: Connection -> StreamID -> Fin -> ByteString -> IO ()
shutdownStream :: Connection -> StreamID -> IO ()

These APIs seem awkward since they expose stream identifiers, which applications should not need to know. Anyway, through this development I gained the insight that a lot of code can be shared between client and server.

HTTP/2 client library

Now I wanted to verify that an HTTP/2 client library could be built by sharing a lot of the server code. The result is promising. The HTTP/2 library in Haskell now provides both the client and the server side and implements self-testing.

And importantly, I found beautiful abstractions for HTTP requests and responses. For clients, requests are outgoing data. For servers, responses are also outgoing data. Since response statuses can be expressed as a pseudo :status header, we can define outgoing data as follows:

data OutObj = OutObj {
    outObjHeaders  :: [Header]      -- ^ Accessor for header.
  , outObjBody     :: OutBody       -- ^ Accessor for outObj body.
  , outObjTrailers :: TrailersMaker -- ^ Accessor for trailers maker.
  }

data OutBody = OutBodyNone
             -- | Streaming body takes a write action and a flush action.
             | OutBodyStreaming ((Builder -> IO ()) -> IO () -> IO ())
             | OutBodyBuilder Builder
             | OutBodyFile FileSpec

For the client library, Request is just a wrapper data type:

-- | Request from client.
newtype Request = Request OutObj deriving (Show)

Response in the server library is also a wrapper:

-- | Response from server.
newtype Response = Response OutObj deriving (Show)

The same approach works for incoming data, thanks to pseudo headers such as :method and :path:

type InpBody = IO ByteString

-- | Input object
data InpObj = InpObj {
    inpObjHeaders  :: HeaderTable   -- ^ Accessor for headers.
  , inpObjBodySize :: Maybe Int     -- ^ Accessor for body length specified in content-length:.
  , inpObjBody     :: InpBody       -- ^ Accessor for body.
  , inpObjTrailers :: IORef (Maybe HeaderTable) -- ^ Accessor for trailers.
  }

Here comes Response for the client library:

-- | Response from server.
newtype Response = Response InpObj deriving (Show)

And Request in the server library is:

-- | Request from client.
newtype Request = Request InpObj deriving (Show)

I shouted about this experience:

HTTP/3 client and server library

Now it was time to implement HTTP/3. Thanks to Request and Response from the HTTP/2 library and the QUIC library itself, I was able to concentrate on how to manipulate multiple streams. Suddenly, I got an insight about QUIC streams:

Now the QUIC library in Haskell provides an abstract data type for streams:

data Stream

Clients can create a Stream, much like creating a socket:

stream :: Connection -> IO Stream
unidirectionalStream :: Connection -> IO Stream

A server gets a Stream when a new QUIC stream arrives:

acceptStream :: Connection -> IO (Either QUICError Stream) 

Data can be received and sent through a Stream:

-- return "" when FIN is received
recvStream :: Stream -> Int -> IO ByteString
sendStream :: Stream -> ByteString -> IO () 
-- Sending FIN
shutdownStream :: Stream -> IO () 
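
As a small illustration of how these fit together, here is a sketch of a server loop that echoes one chunk per stream, using only the signatures listed above (Connection, Stream and the stream functions come from the quic library); serverLoop itself is a hypothetical helper and error handling is omitted:

import Control.Concurrent (forkIO)
import Control.Monad (void)

serverLoop :: Connection -> IO ()
serverLoop conn = do
    estrm <- acceptStream conn
    case estrm of
      Left  _err -> return ()            -- the connection is going away
      Right strm -> do
          void $ forkIO $ do
              bs <- recvStream strm 4096
              sendStream strm bs         -- echo the chunk back
              shutdownStream strm        -- send FIN
          serverLoop conn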

With these APIs, I was able to develop HTTP/3 really fast. In the sense that a lightweight thread is used per stream, programming HTTP/3 is like HTTP/1.1. In the sense that frames are used, programming HTTP/3 is like HTTP/2. I felt that my long career with HTTP/1.1 and HTTP/2 has converged in HTTP/3!