
Developing QUIC Loss Detection and Congestion Control in Haskell

For the last two months, I have been trying to implement "QUIC Loss Detection and Congestion Control" in Haskell. This blog article gives a brief summary of what I have done.

ACK handling

Before loss detection and congestion control were developed, QUIC packets were retransmitted, if necessary, by a simple lightweight resender thread based on timeouts. The internal data structure was, of course, a Priority Search Queue (PSQ), where a packet number is the key, a timeout value is the priority and a plain packet is the value. Since both packet numbers and timeout values increase monotonically, packet numbers were stored with their bit ordering reversed. With this PSQ, three threads behaved as follows:

  • A sender stores sent packets into the PSQ.
  • A receiver removes packets from the PSQ when receiving valid ACKs.
  • A resender repeatedly wakes up, deletes timed-out packets from the PSQ, and retransmits them.

When the receiver receives an ACK frame, its ACK ranges are translated into a list of packet numbers such as [4,5,7,8,9]. Each packet number is then used to delete the corresponding packet from the PSQ. Each deletion is an O(log n) operation, where n is the size of the PSQ.
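
In a simplified form, the structure looked roughly like the sketch below, here written against Data.OrdPSQ from the psqueues package. The type and function names are stand-ins of mine, and the actual code stores packet numbers with their bits reversed as described above:

import Data.Word (Word64)
import qualified Data.OrdPSQ as PSQ

-- Simplified stand-ins for the real types in the quic library.
type PacketNumber = Word64
type Timeout      = Word64          -- e.g. microseconds
data PlainPacket  = PlainPacket     -- contents omitted

type SentQueue = PSQ.OrdPSQ PacketNumber Timeout PlainPacket

-- Sender: remember a sent packet together with its timeout value.
onSent :: PacketNumber -> Timeout -> PlainPacket -> SentQueue -> SentQueue
onSent = PSQ.insert

-- Receiver: remove one acknowledged packet, O(log n) per packet number.
onAcked :: PacketNumber -> SentQueue -> SentQueue
onAcked = PSQ.delete

-- Resender: split off every packet whose timeout has already passed.
timedOut :: Timeout -> SentQueue -> ([(PacketNumber, Timeout, PlainPacket)], SentQueue)
timedOut = PSQ.atMostView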

After deploying Haskell QUIC on www.mew.org, I noticed that Firefox Nightly could not download video streams smoothly while Chrome Canary could. Chrome Canary sends ACKs like this:

[0,1,2,3]
[4,5,6,7]
[8,9,10,11]

This means that Chrome Canary removes packet numbers that are no longer necessary once an ACK is acknowledged. By contrast, Firefox Nightly sends ACKs like this:

[0,1,2,3]
[0,1,2,3,4,5,6,7]
[0,1,2,3,4,5,6,7,8,9,10,11]

This is perfectly valid according to the spec. But my poor implementation suffered from the O(m log n) operation, where m is the number of packet numbers in an ACK frame. Firefox Nightly specifies really huge values for m.

As a workaround, I modified the code to keep track of the highest packet number already deleted so that the effective m stays small. Thanks to this, Firefox Nightly was now able to download video streams smoothly. However, the computational complexity is still O(m log n).

Suddenly, I hit upon a simple and elegant algorithm. If a list-like data structure is used to store sent packets, ACK ranges can be expressed as a predicate. With the example of [4,5,7,8,9], the ranges can be expressed as [(4,5),(7,9)], which can be translated into the following predicate function:

predicate :: PlainPacket -> Bool
predicate pkt = (4 <= n && n <= 5) || (7 <= n && n <= 9)
  where
    n = packetNumber pkt

A list-like data structure provides a partition function:

-- | The partition function takes a predicate and a list and returns
--   the pair of lists of elements which do and do not satisfy the
--   predicate, respectively.
partition :: (a -> Bool) -> [a] -> ([a], [a])

So, partition splits the stored packets into a list of ACKed packets and a list of not-yet-ACKed packets in O(n). I chose the finger tree (aka Data.Sequence) as the list-like data structure since QUIC loss detection manipulates both ends.
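
Putting the two together, the new ACK handling can be sketched as follows; fromAckRanges and splitAcked are illustrative names for this article, not the actual API of the quic library:

import Data.Sequence (Seq)
import qualified Data.Sequence as Seq

-- Simplified stand-ins for the real types in the quic library.
type PacketNumber = Int
newtype PlainPacket = PlainPacket { packetNumber :: PacketNumber }

-- Turn ACK ranges such as [(4,5),(7,9)] into a predicate.
fromAckRanges :: [(PacketNumber, PacketNumber)] -> PlainPacket -> Bool
fromAckRanges ranges pkt = any inRange ranges
  where
    n = packetNumber pkt
    inRange (lo, hi) = lo <= n && n <= hi

-- One partition over the sent packets yields (ACKed, not-yet-ACKed) in O(n).
splitAcked :: [(PacketNumber, PacketNumber)]
           -> Seq PlainPacket
           -> (Seq PlainPacket, Seq PlainPacket)
splitAcked ranges = Seq.partition (fromAckRanges ranges)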

Translating Pseudocode

At that point, I started to develop loss detection and congestion control by translating the pseudocode in the draft into Haskell. As Haskellers can imagine, it was tough to convert the imperative Python-like code into functional Haskell code. Through this work, I found several inconsistencies between the spec and the pseudocode. Yes, contributing to protocol standardization is one of my missions.

Since it was the first time for me to implement a congestion control algorithm, I wondered whether my code worked well. To make debugging easier, I enhanced my qlog code so that qvis could visualize the behavior of congestion control. qvis even visualizes flow control both for streams and for the connection. Without qvis, I could not have made my code this stable. I very much thank Robin Marx, the developer of qlog and qvis.

According to the pseudocode, packet losses are detected and retransmission is carried out when an ACK frame is received. So, the resender thread is not necessary anymore.

In addition to translating the pseudocode faithfully, I extended the code for an issue pointed out by Kazuho Oku. Suppose that the following ACK is received:

[100-101,103-104]

The pseudocode detects the loss of packet 102. However, the following ACK might arrive soon after:

[100-104]

This ACK pattern is caused by UDP packet reordering. To prevent such spurious retransmissions, I added another finger tree to keep loss candidates for a while and re-introduced the resender thread:

resender :: LDCC -> IO ()
resender ldcc@LDCC{..} = forever $ do
    atomically $ do
        lostPackets <- readTVar lostCandidates
        check (lostPackets /= emptySentPackets)
    delay $ Microseconds 10000 -- 10 ms
    packets <- atomically $ do
        SentPackets pkts <- readTVar lostCandidates
        writeTVar lostCandidates emptySentPackets
        return pkts
    when (packets /= Seq.empty) $ do
        onPacketsLost ldcc packets
        retransmit ldcc packets

The resender thread wakes up when lostCandidates becomes non-empty, waits for 10 milliseconds, and retransmits the lost packets if they still exist. Spurious retransmissions do not occur if the necessary ACKs arrive within 10 milliseconds.

Testing

Due to COVID-19, I'm developing QUIC at home. In some sense, this is lucky because I see many packet losses from my home network to the internet, which is suitable for developing loss detection and congestion control. (Since our company, IIJ, is an ISP, our office network is really stable. :) In addition to testing my code on the real internet, I felt that comprehensive testing was necessary.

So, I developed a configurable pipe based on lightweight threads. This pipe can drop packets randomly or according to specified scenarios. I wrote scenarios, each of which drops a specific packet during the QUIC handshake. They uncovered a lot of deadlocks, resulting in much more stable code.
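
A heavily simplified, one-directional sketch of the idea looks like this; Scenario, lossyPipe and dropThird are illustrative names, and the real pipe in the test suite also supports random drops and both directions:

import Control.Concurrent.STM
import Control.Monad (unless)
import qualified Data.ByteString as BS

-- A scenario decides, for the n-th packet, whether the pipe drops it.
type Scenario = Int -> BS.ByteString -> Bool

-- Forward packets from one endpoint to the other, dropping those that
-- the scenario selects.  The endpoints are represented by STM queues here.
lossyPipe :: Scenario -> TQueue BS.ByteString -> TQueue BS.ByteString -> IO ()
lossyPipe dropIt from to = go 0
  where
    go n = do
        pkt <- atomically $ readTQueue from
        unless (dropIt n pkt) $ atomically $ writeTQueue to pkt
        go (n + 1)

-- Example scenario: drop only the third packet.
dropThird :: Scenario
dropThird n _ = n == 2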

Timer

Timers are necessary everywhere to implement QUIC. For loss detection, two timers are used: one to detect packet losses and the other to send a probe to detect tail packet losses. This section talks about how to implement low-cost timers.

Time library

To obtain the current time and calculate timeout values, I chose the hourglass library at the beginning. Surprisingly, profiling showed that the hottest spot was timeCurrentP of this library, which is based on clock_gettime(2). So, I switched to my unix-time library. Internally it calls gettimeofday(2), which is highly optimized on some OSes.
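
For illustration, obtaining the current time with unix-time looks like this (a minimal example using getUnixTime and the UnixTime fields):

import Data.UnixTime (UnixTime(..), getUnixTime)

main :: IO ()
main = do
    UnixTime sec usec <- getUnixTime   -- seconds and microseconds
    putStrLn $ "now: " ++ show sec ++ "." ++ show usec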

Timeout function

Haskell provides a beautiful abstraction for timeouts:

timeout :: Int -> IO a -> IO (Maybe a) 

If the action of IO a finishes before the timeout expires, a value of Just a is returned. If the timeout occurs, Nothing is returned.
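
For example, with the timeout function from System.Timeout, whose first argument is in microseconds:

import Control.Concurrent (threadDelay)
import System.Timeout (timeout)

main :: IO ()
main = do
    r1 <- timeout 200000 (threadDelay 100000)   -- finishes in time
    print r1                                    -- Just ()
    r2 <- timeout 100000 (threadDelay 200000)   -- takes too long
    print r2                                    -- Nothing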

As "Parallel and Concurrent Programming in Haskell" explains, the original implementation first forks a timeout thread to send a signal to the original thread on timeout. If the action is finished before timeout, it is the timer thread that is killed. The current implementation delays the execution of forkIO until the timeout is expired.

Anyway, I don't want to spawn a thread for each timeout. So, a single designated thread with a request queue is created for each QUIC connection, and it sends signals according to the requests.

Reducing the number of timeouts

To send a probe packet, the timer is set every time a packet is sent. If packets are sent in bulk, the timer is set multiple times and only the last setting is effective. To reduce this inefficiency, a queue for timeout operations was introduced. Like the resender thread above, a designated thread wakes up when the queue becomes non-empty, waits for 10 milliseconds and executes only the last of the queued timeout operations. This optimization is applied only to the probe timer of the 1-RTT level.
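
A simplified sketch of this designated thread, analogous to the resender above; the list-based queue and the names are illustrative, not the actual internals:

import Control.Concurrent (threadDelay)
import Control.Concurrent.STM
import Control.Monad (forever)

-- A queued timeout operation is represented as a plain IO action here;
-- operations are assumed to be appended in arrival order.
type TimeoutOp = IO ()

-- Wake up when the queue becomes non-empty, wait 10 ms so that a burst of
-- timer settings collapses into one, and run only the last operation.
delayedTimer :: TVar [TimeoutOp] -> IO ()
delayedTimer ref = forever $ do
    atomically $ do
        ops <- readTVar ref
        check (not (null ops))
    threadDelay 10000                   -- 10 ms
    pending <- atomically $ do
        ops <- readTVar ref
        writeTVar ref []
        return ops
    case pending of
        [] -> return ()
        _  -> last pending              -- only the last setting is effective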

Pacing

The QUIC recovery draft suggests pacing the sending of packets. On Linux, there are two pacing mechanisms, SO_MAX_PACING_RATE and SO_TXTIME. As "Accelerating UDP packet transmission for QUIC" describes, each has advantages and disadvantages:

  • SO_MAX_PACING_RATE: works with Generic Segmentation Offload (GSO) but is only meaningful for connected sockets.
  • SO_TXTIME: is meaningful even for wildcard sockets but does not work with GSO.

Since the quic library in Haskell uses connected sockets, as I described in "Implementation status of QUIC in Haskell", I tried SO_MAX_PACING_RATE. To use this socket option, you need to attach fq (Fair Queue) to your NIC:

% sudo tc qdisc add dev eth0 root fq 
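
For reference, setting the option from Haskell might look roughly like the sketch below; the SockOpt constructor is from the network package (3.1.2 or later), the numbers 1 and 47 are my assumption of the Linux values of SOL_SOCKET and SO_MAX_PACING_RATE, and setPacingRate is a name of my own:

import Network.Socket (Socket, SocketOption (SockOpt), setSocketOption)

-- Linux-specific values from <asm-generic/socket.h> (an assumption here):
-- SOL_SOCKET = 1, SO_MAX_PACING_RATE = 47.
soMaxPacingRate :: SocketOption
soMaxPacingRate = SockOpt 1 47

-- Cap the pacing rate of a connected socket, in bytes per second,
-- assuming the fq qdisc is attached to the interface.
setPacingRate :: Socket -> Int -> IO ()
setPacingRate sock bytesPerSec = setSocketOption sock soMaxPacingRate bytesPerSec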

After some trials, I realized that the quic library cannot tell how much delay is added to each packet transmission. This results in inaccurate RTT measurement. So, I gave up on this approach.

The good news is that people are trying to make SO_TXTIME friendly with GSO. I will get back to pacing when I try to use GSO in my library.