In recent days, QUIC is hot. As the following blog posts describe, the standardization of QUIC is now in the final stage and h3-29 (HTTP/3 over QUIC of draft 29) is out there:
I expect that this would happen in April 2021. For the 2nd half of Japan fiscal year of 2020 (Oct 2020 - Mar 2021), I'm planning to tackle the followings (in no particular order):
Error handling -- the current error handling is really poor
QPACK encoding with the dynamic table -- the current implementation only makes use of the static table
In "Implementation status of QUIC in Haskell", I briefly described QUIC APIs of the TLS library in Haskell. I first invented APIs based on static functions but switched to the thread-based approach to follow Olivier Chéron's recommendation. The current APIs got two steps further. This article describes how Olivier and I improved the thread-based APIs.
Users of the controller can receive key information, TLS extension for QUIC transport parameters, etc through ClientStatus. The following is an example usage of the controller:
handshakeClient :: ClientConfig -> Connection -> IO ()
handshakeClient conf conn =do
ver <- getVersion conn
let sendEarlyData = isJust $ ccEarlyData conf
control <- clientController conf ver (setResumptionSession conn) sendEarlyData
setClientController conn control
sendClientHelloAndRecvServerHello control conn $ ccEarlyData conf
recvServerFinishedSendClientFinished control conn
sendClientHelloAndRecvServerHello :: ClientController -> Connection -> Maybe (StreamId,ByteString) -> IO ()
sendClientHelloAndRecvServerHello control conn mEarlyData =do
SendClientHello ch0 mEarlySecInf <- control GetClientHello
setEarlySecretInfo conn mEarlySecInf
sendCryptoData conn $ OutHndClientHello ch0 mEarlyData
(InitialLevel, sh0) <- recvCryptoData conn
state0 <- control $ PutServerHello sh0
case state0 of
RecvServerHello hndSecInf ->do
setHandshakeSecretInfo conn hndSecInf
setEncryptionLevel conn HandshakeLevel
SendClientHello ch1 mEarlySecInf1 ->do
setEarlySecretInfo conn mEarlySecInf1
sendCryptoData conn $ OutHndClientHello ch1 Nothing
(InitialLevel, sh1) <- recvCryptoData conn
state1 <- control $ PutServerHello sh1
case state1 of
RecvServerHello hndSecInf ->do
setHandshakeSecretInfo conn hndSecInf
setEncryptionLevel conn HandshakeLevel
_ -> E.throwIO $ HandshakeFailed "sendClientHelloAndRecvServerHello"
_ -> E.throwIO $ HandshakeFailed "sendClientHelloAndRecvServerHello"
recvServerFinishedSendClientFinished :: ClientController -> Connection -> IO ()
recvServerFinishedSendClientFinished control conn = loop (1:: Int)
where
loop n =do
(HandshakeLevel, eesf) <- recvCryptoData conn
state <- control $ PutServerFinished eesf
case state of
ClientNeedsMore ->do-- Sending ACKs for three times rule
when ((n `mod`3) ==2) $
sendCryptoData conn $ OutControl HandshakeLevel []
loop (n +1)
SendClientFinished cf exts appSecInf ->do
setApplicationSecretInfo conn appSecInf
setEncryptionLevel conn RTT1Level
setPeerParams conn exts
sendCryptoData conn $ OutHndClientFinished cf
_ -> E.throwIO $ HandshakeFailed "putServerFinished"
NewSessionTicket in a CRYPTO frame is passed to the controller by a receiver thread.
Introducing callbacks
Olivier noticed that both the designated thread and the user of the controller manages the TLS state. To reduce the number of status in the user side, he defined callbacks:
The followings are the definition of handshakeClient and handshakeClientAsync:
handshakeClient :: ClientConfig -> Connection -> IO ()
handshakeClient conf conn =do
ver <- getVersion conn
hsr <- newHndStateRef
let sendEarlyData = isJust $ ccEarlyData conf
qc = QUICCallbacks { quicSend = sendTLS conn hsr
, quicRecv = recvTLS conn hsr
, quicInstallKeys = installKeysClient
, quicNotifyExtensions = setPeerParams conn
}
control <- clientController qc conf ver (setResumptionSession conn) sendEarlyData
setClientController conn control
state <- control EnterClient
case state of
ClientHandshakeComplete -> return ()
ClientHandshakeFailed e -> notifyPeer conn e >>= E.throwIO
_ -> E.throwIO $ HandshakeFailed $"handshakeClient: unexpected "++ show state
where
installKeysClient (InstallEarlyKeys mEarlySecInf) =do
setEarlySecretInfo conn mEarlySecInf
sendCryptoData conn $ OutEarlyData (ccEarlyData conf)
installKeysClient (InstallHandshakeKeys hndSecInf) =do
setHandshakeSecretInfo conn hndSecInf
setEncryptionLevel conn HandshakeLevel
installKeysClient (InstallApplicationKeys appSecInf) =do
setApplicationSecretInfo conn appSecInf
setEncryptionLevel conn RTT1Level
-- second half the the TLS handshake, executed out of the main thread
handshakeClientAsync :: Connection -> ClientController -> IO ()
handshakeClientAsync conn control = handleLog logAction $ forever $do
state <- control RecvSessionTickets
case state of
ClientRecvSessionTicket -> return ()
ClientHandshakeFailed e -> notifyPeerAsync conn e >>= E.throwIO
_ -> E.throwIO $ HandshakeFailed $"unexpected "++ show state
where
logAction msg = connDebugLog conn ("client handshake: "++ msg)
Stateless APIs
When I tried to implement HTTP/3, I noticed a disadvantage of this APIs. handshakeClient returns when 1-RTT gets ready. This means that dynamically-created early data of 0-RTT cannot be sent.
So, I removed both ClientControl and ClientStatus completely from users. newQUICClient was renamed to tlsQUICClient and it does not return ClientController anymore:
Before loss detection and congestion control were developed, QUIC packets were retransmitted, if necessary, by a simple resender lightweight thread based on timeout. The internal data structure was, of course, Priority Search Queue (PSQ) where a packet number is the key, a timeout value is the priority and a plain packet is the value. Since both packet numbers and timeout values monotonically increase, packet numbers were used with the bit ordering reversed. With this PSQ, three threads behave as follows:
A sender stores sent packets to PSQ.
A receiver removes sent packets from PSQ when receiving valid ACKs.
A resender repeatedly wakes up, deletes timeouted packets from PSQ and retransmits them.
When the receiver receives an ACK frame, its ACK ranges are translated into a list of packet numbers like [4,5,7,8,9]. Then each packet number is used to delete the corresponding packet from PSQ. This operation is in O(n) where n is the size of PSQ.
After deploying Haskell QUIC in www.mew.org, I noticed that Firefox Nightly cannot download video streams smoothly while Chrome Canary can. Chrome Canary sends ACKs like this:
[0,1,2,3]
[4,5,6,7]
[8,9,10,11]
This means that Chrome Canary removes unnecessary packet numbers if an ACK is acknowledged. By contrast, Firefox Nightly sends ACKs like this:
This is perfectly valid for the spec. But my poor implementation was sacrificed for the operation of O(m log n) where m is the number of packet numbers in an ACK frame. Firefox Nightly specifies really huge values for m.
As workaround, I modified the code to keep track of the highest packet number already deleted to make valid m is small. Thanks to this, Firefox Nightly was now able to download video streams smoothly. However, its computational complexity is still O(m log n).
Suddenly, I hit upon a simple and elegant algorithm. If a list-like data structure is used to store sent packets, ACK ranges can be expressed as predicate. With the example of [4,5,7,8,9], the ranges can be expressed as [(4,5),(7,9)] and this can be translated into the following predicate function:
predicate :: PlainPacket -> Bool
predicate pkt = (4<= n && n <=5) || (7<= n && n <=9)
where
n = packetNumber pkt
A list-like data structure provides partition function:
-- | The partition function takes a predicate and a list and returns
the pair of lists of elements which do and do not satisfy the
predicate, respectively;
partition :: (a -> Bool) -> [a] -> ([a], [a])
So, partition splits the list into a list of ACKed and a list of not-yet-ACKed in O(n). I choose the finger tree (aka Data.Sequence) as a list-like data structure since QUIC loss detection manipulates both ends.
Translating Pseudocode
At that moment, I started to develop loss detection and congestion control by translating the pseudocode in the draft into Haskell. As Haskellers could imagine, it was tough to convert the imperative Python-like code into functional Haskell code. Through this work, I found several inconsistencies of the spec and the pseudocode. Yes, contributing protocol standardization is one of my missions.
Since it was first time for me to implement a congestion control algorithm, I wondered if my code worked well. To make debugging easier, I enhanced my qlog code so that qvis could visualize the passage of congestion control. qvis even visualizes the flow control both for streams and a connection. Without qvis, I could not make my code that stable. I very much thank Robin Marx, the developer of qlog and qvis.
According to the pseudocode, packet losses are detected and retransmittion is carried out when an ACK frame is received. So, the resender thread is not necessary anymore.
In addition to translate the pseudocode faithfully, I extended the code for the issue pointed by Kazuho Oku. Suppose that the following ACK is received:
[100-101,103-104]
The pseudocode detects the lost of packet 102. However, the following ACK might come soon later:
[100-104]
This ACK pattern is caused by UDP packet reordering. To prevent this spurious retransmissions, I added another finger tree to keep loss candidates for a while and re-introduced the resender thread:
The resender thread wakes up when lostCandidates becomes non-empty, waits for 10 milliseconds and retransmits lost packets if still exist. Spurious retransmissions do not occur if necessary ACKs arrive within 10 milliseconds,
Testing
Due to COVID-19, I'm developing QUIC at my home. In some sense, this is lucky because I see many packet losses from my home network to the internet, which is suitable to develop loss detection and congestion control. (Since our company IIJ is an ISP, our office network is really stable. :) In addition to test my code in the real internet, I felt that comprehensive testing was necessary.
So, I developed a configurable pipe based on lightweight threads. This pipe can drop packets randomly or according to specified scenarios. I wrote scenarios, each of which drops a specific packet during the QUIC handshake. These discovered a lot of deadlocks, resulting in much stabler code.
Timer
Timer is necessary everywhere to implement QUIC. For loss detection, two timers are used. One is a timer to detect packet losses and the other is a timer to send a probe to detect tail packet losses. This section talks about how to implement low-cost timers.
Time library
To obtain the current time and calculate the timeout value, I choose the hourglass library at the beginning. Surprisingly profiling showed the hottest spot was timeCurrentP of this library, which is based on clock_gettime(2). So, I switched to my unix-time library. Internally it calls gettimeofday(2) which is highly optimized on some OSes.
Timeout function
Haskell provides beautiful abstraction for timeout:
timeout :: Int -> IO a -> IO (Maybe a)
If the action of IO a is finished before timeout is expired, a value of Just a is returned. If timeout occurs, Nothing is returned.
As "Parallel and Concurrent Programming in Haskell" explains, the original implementation first forks a timeout thread to send a signal to the original thread on timeout. If the action is finished before timeout, it is the timer thread that is killed. The current implementation delays the execution of forkIO until the timeout is expired.
Anyway, I don't want to spawn a thread for each timeout. So, a single designated thread with a request queue is created for each QUIC connection and sends signals according to requests.
Reducing the number of timeout
To send a probe packet, the timer is set everytime when a packet is sent. If bulk packets are sent, the timer is set multiple times and only the last setting is effective. To reduce this ineffectiveness, a queue for timeout operations was introduced. Like the resender thread above, a designated thread wakes up when the queue becomes non-empty, waits for 10 milliseconds and executes the last of the timeout operations. This omission is applied only for the probe timer of the 1-RTT level.
Pacing
The QUIC recovery draft suggests to pace sending of packets. On Linux, there are two pacing mechanisms, SO_MAX_PACING_RATE and SO_TXTIME. As "Accelerating UDP packet transmission for QUIC" describes, each has advantages and disadvantages:
SO_MAX_PACING_RATE: can work with Generic Segmentation Offload (GSO) but can be only meaningful for connected sockets.
SO_TXTIME: is meaningful even for wildcard sockets but cannot work with GSO.
As I described in "Implementation status of QUIC in Haskell", the quic library in Haskell uses connected sockets, I tried SO_MAX_PACING_RATE. To use this socket option, you need to attach fq (Fair Queue) to your NIC:
% sudo tc qdisc add dev eth0 root fq
After some trials, I realized that the quic library cannot tell how much delay is added for each packet sending. This results in inaccurate measuring of RTT. So, I gave up this approach.
Good new is that people tries to make SO_TXTIME friendly with GSO. I will get back to pacing when I will try to use GSO in my library.
Mew.org is now speaking HTTP/3 (HTTP/2 over QUIC). If you gain access to the site using Firefox Nightly, the first connection would be HTTP/2 then the following connections should be HTTP/3 led by Alt-Svc:.
This article explains insights which I found through the implementation activities of QUIC and HTTP/3 in Haskell.
HTTP/2 server library
I started implementing QUIC in January 2019. It took four months to reach a toy QUIC client since the negotiation part is really complicated. When I tackled the server side, my brain got befuddled. I have no idea on server architecture.
So, I got back to HTTP/2. As described in HTTP/2 server library in Haskell, I succeed to extract HTTP/2 server library from our HTTP/2 server.
When a QUIC connection is created at the server side, a designated lightweight thread is spawned with the Connection type. This abstraction seems reasonable because Connection hides internal information about a QUIC connection. However, I have no idea on how to abstract QUIC streams at that moment. So, I defined the following APIs temporally:
type StreamID = Int
type Fin = Bool
recvStream :: Connection -> IO (StreamID, ByteString)
sendStream :: Connection -> StreamID -> Fin -> ByteString -> IO ()
shutdownStream :: Connection -> StreamID -> IO ()
These APIs seem awkward since it exposes stream identifiers which applications should not know. Anyway, through this development I got an insight that a lot of code can be shared between client and server.
HTTP/2 client library
Now I wanted to verify that HTTP/2 client library can be achieved by sharing a lot of server code. The result is promising. HTTP/2 library in Haskell now provides both client and server side and implements self-testing.
And importantly, I found beautiful abstractions for HTTP requests and responses. For clients, requests are outgoing data. For servers, responses are also outgoing data. Since response statuses can be expressed as a pseudo :status header, we can define outgoing data as follows:
data OutObj = OutObj {
outObjHeaders :: [Header] -- ^ Accessor for header.
, outObjBody :: OutBody -- ^ Accessor for outObj body.
, outObjTrailers :: TrailersMaker -- ^ Accessor for trailers maker.
}
data OutBody = OutBodyNone
-- | Streaming body takes a write action and a flush action.| OutBodyStreaming ((Builder -> IO ()) -> IO () -> IO ())
| OutBodyBuilder Builder
| OutBodyFile FileSpec
For the client libarary, Request is just a wrapper data type:
Based on #QUIC experience, now I'm implementing HTTP/2 client library in #Haskell. Beautiful abstraction was found. Based on this activity, I'll tackle HTTP/3 and QPACK.
Now it was time to implement HTTP/3. Thanks to Request and Response from HTTP/2 library and QUIC library itself, I was able to concentrate on how to manipulate multiple streams. Suddenly, I got an insight about QUIC streams:
I got an insight where a #QUIC stream is like a TCP connection while a QUIC connection is an OS. So, the classical socket API style can be applied for QUIC streams. Haskell QUIC library takes this approach.
-- return "" when FIN is received
recvStream :: Stream -> Int -> IO ByteString
sendStream :: Stream -> ByteString -> IO ()
-- Sending FIN
shutdownStream :: Stream -> IO ()
With these APIs, I was able to develop HTTP/3 really fast. In the sense where a lightweight thread is used per stream, programming HTTP/3 is like HTTP/1.1. In the sense where frames are used, programming HTTP/3 is like HTTP/2. I felt that my long career for HTTP/1.1 and HTTP/2 is converged in HTTP/3!
If you wonder why I'm using Haskell to implement network protocols, please give a look at my position paper for NetPL 2017. In short, I love its strong and rich type system and concurrency based on lightweight threads (green threads).
This article mainly describes the server side because it is more challenging than the client side.
APIs
To implement APIs for QUIC servers, I started with the accept-then-fork style inspired by Berkeley socket APIs:
It appeared that my test server (mew.org:4433) based on this APIs stacked occasionally. First I suspected buffer overruns and illegal UDP packets. So, I set exception handlers everywhere but no exception was caught. I checked every code of underlying libraries and found a careless bug but it was not the source of this problem. At this moment, I ran out of ideas.
After taking a deep breath, I squeezed the print code everywehere to try to unstarstand what's going on. Printing was less smooth than I expected and I realized that the source of this problem is this API itself. accept was processed in the main thread. So, if accept processing the handshake stacks, everything stacks. This experience led to simpler APIs:
There is no intermediate data type (QUICServer) anymore. The high order function (the loan pattern in OOP terminology) ensures that the handshake is processed in a spawned thread and Connection is closed finally.
QUIC multiplexes streams in a connection. To send and receive data, the following APIs are provided at this moment:
TLS stands for Transport Layer Security. But QUIC uses TLS 1.3 as data structures, not as transport. This means that the QUIC frames on UDP convey the TLS handshake messages without the TLS record layer. Encryption/decryption are carried out by QUIC itself, not by TLS.
To separate the TLS data types from its record layer and transport, I first divided many functions of the server and client code in the TLS library in Haskell and introduced static-function APIs for QUIC. A QUIC client with this APIs succeeded in communicating ngtcp2 server. However, this approach had the following drawbacks:
APIs only cover limited cases. To cover all cases including hello retry request, resumption and new session ticket, more APIs should be provided.
Modifications are too drastic. When some code are merged into the client and server code in the master branch, I need to do the division again. (How many time did I rework?)
Olivier Chéron, another maintainer of the TLS library, hesitated to merge my modification and suggested me to introduce a flexible record layer. This motivated me to explore another approach based on lightweight threads. My conclusion of the record layer structure is as follows:
Executing a TLS thread with a transparent record layer (no encryption/decryption and no I/O), we can obtain TLS handshake messages itself. The TLS thread can be controlled by the following APIs:
With this APIs, all cases are covered with a little modification in the client and server code. The stability has been checked with many QUIC implementations. The usage of this APIs can be found in Network.QUIC.Handshake.
One long-standing issue is the timing to terminate the TLS thread for clients. After sending Client Finished to a server, the client waits for New Session Ticket (NST). However, some servers do not send NST.
QUIC draft 25 introduced the HANDSHAKE_DONE frame which is sent from servers to clients. Thanks to this, the main thread of the QUIC client can now terminate the TLS thread when HANDSHAKE_DONE is received. During the inter-operability test for draft 25, I noticed that the ngtcp2 server sends NST in the CRYPTO frame after HANDSHAKE_DONE. So, I changed the Haskell QUIC server to wait for a period after receiving HANDSHAKE_DONE hoping that NST will be also received during the period.
The server architecture
runQUICServer first spawns a Dispatcher thread for each network interface specified in ServerConfig. Each Dispatcher manages one wildcard socket, {UDP, local-addr, local-port, *, *}. After receiving an Initial packet from a wildcard socket, a connected socket, {UDP, local-addr, local-port, remote-addr, remote-port}, is created based on peer's address. For this connected socket, several threads are spawns to maintain Connection as illustrated in Fig 1:
Launcher: a thread to make a new Connection and launch user server code (Connection -> IO ()) specified to runQUICServer. recvStream pops incoming data from InputQ and sendStream pushes outgoing data to OutputQ.
TLS: a thread for TLS handshake. It gets TLS handshake messages as ServerControl and gives back TLS handshake message as ServerStatus. This thread is terminated when the TLS handshake is completed.
Reader: a thread to read data from the connected socket and pass them to Receiver via RecvQ.
Receiver: a thread to read data from RecvQ and decrypt them. It passes CRYPTO and STREAM frames to Launcher and processes control frames such as ACK and PING. For instance, when ACK frames are received, corresponding packets are removed from RetransDB.
Sender: a thread to read data from outputQ and encrypt-then-send them to the connected socket. It also saves original plain packets to RetransDB.
Resender: a thread to pops packets from RetransDB and pushes them to OutputQ repeatedly.
Reader and Receiver
Processing incoming QUIC packets is two-pass: decoding and decryption. This separation made the code drastically simpler. Reader decodes the unprotected part of the header using the following function:
Note that this function does not take the Connection argument. CryptPacket is defined as follows:
data CryptPacket = CryptPacket Header Crypt
data Header = Initial Version CID CID Token
| RTT0 Version CID CID
| Handshake Version CID CID
| Short CID
data Crypt = Crypt {
cryptPktNumOffset :: Int
, cryptPacket :: ByteString
}
CryptPacket is passed to Receiver via RecvQ:
newtype RecvQ = RecvQ (TQueue CryptPacket)
It is Receiver's responsibility to decrypt the protected part of the header and the encrypted body using the following function:
The third argument controls Padding. If Just n is specified, padding is generated so that the size of the result packet is just n. Otherwise, no padding is used. PlainPacket is defined as follows:
data PlainPacket = PlainPacket Header Plain
Sender keeps PlainPackets to RetransDB while Resender obtains PlainPackets from RetransDB and enqueues them to OutputQ again.
Dispachers
Dispatchers carry out the following jobs:
Passing information for new connections
Handling retry and version negotiation packets
Handling migration and NAT rebiding
Combining fragmented Initial packets
Dispatchers decode incoming packets using the following function:
This data type captures that version negotiation packets and retry packets are not encrypted. VersionNegotiationPacket and RetryPacket should be received by clients only. And servers should receive CryptPacket only. For instance, if a server receives VersionNegotiationPacket, the server ignores it.
New connections
A Dispatcher maintains a dictionary for Connection. The keys are destination connection IDs and values are a pair of Connection and MigrationQ described later.
If a version of Initial CryptPacket is known, it checks the Connection dictionary to see if the destination connection ID is stored. If not stored, it prepares for a new Connection. A new RecvQ is created and the Initial packet are pushed into it. And information to create a Connection including the RecvQ and peer's address/port is queued into so-called accepting queue. The destination connection ID is not registered to the Connection dictionary at this moment to prevent the Initial flooding attack.
The main thread repeatedly spawns Launcher. It accepts the information and tries to make a new Connection. Recall that Reader and Sender use a connected socket while Dispatcher holds a wildcard socket. How we can make a connected socket for a new Connection safely?
Suppose that a wildcard socket, {UDP, 192.0.2.1, 4433, *, *}, exists and peer's address/port is 203.0.113.0:3456. A socket, {UDP, 192.0.2.1, 4433, 203.0.113.0, 3456} should be created without errors nor race condition. My first attempt is as follows:
Unfortunately, BSD variants reject (2). Linux accepts (2) but race condition would happen. Kazuho Oku, one of the quicly maintainers suggested me using an ANY address in (2). The improved process is as follows:
Connect it to 203.0.113.0:3456. This also binds the local address to 192.0.2.1.
This process succeeds even on BSD variants and there is no race conditions on any platforms.
After a connected socket is created and TLS handshake is done through the socket, Launcher registers the Connection to the dictionary.
Retry and version negotiation
When a version in a CryptPacket is unknown (for instance, a grease value), Dispatcher send a version negotiation packet.
If no token is contained in an Initial CryptPacket and Connection is not found in the dictionary but the server configuration requires retry, it sends a Retry packet.
If a valid token is provided, a new connection is created as described in the previous subsection.
Migration or NAT rebiding
When a client moves to a new address/port or the client port is changed due to NAT rebinding, Dispatcher receives Short CryptPackets from its wildcard socket. Their destination connection ID are found in the Connection dictionary. In this case, Dispatcher creates MigrationQ, registers it to the dictionary and spawns Migrator. After that, Dispatcher enqueues this first migration packet and other migration packets, if any, to MigrationQ.
Migrator creates a new connected socket for the new address/port and spawns another Reader. Then Migrator tells the new socket to Sender, asks Sender to send PathChallenge and sleeps until Reader receives PathResponse. Then Migrator reads packets from MigrationQ and enqueues them to RecvQ. After a period, Migrator terminates the old Reader.
Fragmented Initial packets
In normal cases, a client sends one Initial packet when it tries to make a new connection. However, if a client wants to send a large Client Hello message, it is divided into multiple Initial packets.
My old implementation naively handles them. Let's consider the case of two Initial packets. When the first packet arrives, a wildcard socket catches it. A connected socket is created and the packet is acknowledged. Then the second packet is also caught by the wildcard socket. Creating another connected socket for the peer fails due to the same parameters. Since the second packet is not acknowledged, the client resends the second packet. The connected socket captures it this time. So, a new connection can be created.
The advantage of this approach is that Dispatcher does not maintain any information. The disadvantage is that this logic does not work on Linux. To my surprise, connect(2) succeeds even for the same parameters! This results in unexpected behavior.
So, I gave up the zero-cost approach and introduced another dictionary of a fixed size. The keys are originaldestination connection IDs while the values are RecvQ. An entry is created when the first Initial packet arrives. Then succeeding Initial packets are queued to RecvQ found in the dictionary.
I used Priority Search Queue (PSQ) to make this fixed-size dictionary. Choosing time as priority implements FIFO. The combination of size, deleteMin and insert implements fixed-size.
The contents of this article is just a snapshot and the code would be substantially changed during the development. So, I should write another article to update the contents when the quic library in Haskell is released.
I thank all people, especially IETF QUIC WG guys, who helped me.
Closing connections gracefully is an old and new problem in network programming.
In the HTTP/1.1 days, this did not get attention since HTTP/1.1 is a synchronous protocol. However, as Niklas Hambüchen concretely and completely explained, HTTP/2 servers should close connections gracefully. This is because HTTP/2 is an asynchronous protocol.
Unfortunately, most HTTP/2 server implementations do not close connections gracefully, hence browsers cannot display pages correctly in some situations.
The first half of this article explains the problem and its solution step by step in general. The second half talks about how to implement graceful-close in Haskell network library.
Normal cases of HTTP/1.1
Roughly speaking, synchronous HTTP/1.1 can be implemented as follows:
Browser: the loop of writing request and reading response
Server: the loop of reading request and writing response
Since HTTP/1.1 uses persistent connections by default, a browser should set the Connection: close header to close the current connection.
When the server received the Connection: close header,
it closes the connection by close() after sending its response.
Of course, the browser knows that the connection is being closed.
So, the browser reads the response until read() returns 0,
which means EOF.
Then, the browser closes the connection by close().
Error cases of HTTP/1.1
For security reasons, HTTP/1.1 servers close connections. The followings are typical situations:
Idle timer is expired
The number of requests reaches the limitation
In these cases, an HTTP/1.1. server calls close() resulting in generating TCP FIN.
When the browser tries to write the next request to the same connection, it would be nice to see if the connection is alive. Are there any system calls to check it? If my understanding is correct, there is no such system calls without IO. What the browser can do is just read or write the connection optimistically if it wants to reuse the connection.
So, what happens if the browser reads or writes the connection which has already received TCP FIN?
write() succeeds. However,since the server socket is already closed, the TCP layer of the browser received TCP FIN, which is not informed to the browser.
Another intersecting question is what happens if the browser reads or writes the connection which has already received TCP RST?
write() causes SIGPIPE. If its signal handler ignores it, write() is resumed and returns EPIPE.
read() returns ECONNREST.
Recovering in HTTP/1.1
Suppose that an HTTP/1.1 server closed a connection by close() but
the browser tries to send one more request.
When the TCP layer of the server received the request,
it sends TCP RST back to the browser.
The browser tries to read the corresponding response and notices that
the server resets the connection.
So, the browser can make another new connection and re-send the request to the server.
In this way, recovering in HTTP/1.1 is not so difficult.
Normal cases of HTTP/2
HTTP/2 uses only one TCP connection between a browser and a server.
Since HTTP/2 is asynchronous, the browser can send requests at anytime.
The server can send back responses in any order.
To combine a request and its corresponding response,
a unique stream ID is given to the pair.
In the following figure, the order of response 1 and response 2 is flipped.
To close the connection, the browser should send GOAWAY.
When the HTTP/2 server received GOAWAY,
the server should send back GOAWAY.
Typical implementations call close() after that.
Error cases of HTTP/2
For security reasons, an HTTP/2 server itself closes a connection
by sending GOAWAY.
Again, typical implementations call close() after that.
It is likely that the browser sent a request asynchronously and
the request reaches to the server after the socket is gone.
In this case, as explained earlier,
TCP RST is sent back to the browser.
Unfortunately, the TCP RST drops all data to be read in the TCP layer of
the browser.
This means that when the browser tries to read its response,
only ECONNREST is returned.
GOAWAY disappears.
GOAWAY contains the last stream ID which the server actually processed.
Without receiving GOAWAY, the browser cannot tell the recovering point.
In other words, the browser cannot render the target page correctly.
This problem actually happens in the real world.
And most HTTP/2 server implementations have this problem.
Graceful close
So, what is a solution?
The masterpiece book, "UNIX Network Programming
Volume 1: The Sockets Networking API (3rd Edition)", W. Richard Stevens et al.
suggests the following way.
The server should call shutdown(SHUT_WR) to close the sending side but keep the receiving side open. Even if requests reach to the server after shutdown(), TCP RST is not generated.
The browser can read GOAWAY in this scenario and send back GOAWAY followed by close().
The server should read data until read() returns 0, EOF.
The server finally should call close() to deallocate the socket resource.
It is not guaranteed that the browser sends back TCP FIN.
So, the server should set time out to read().
One approach is the SO_RCVTIMEO socket option.
From here, I would like to explain how to implement graceful-close in Haskell
network library.
Approach 1: the SO_RCVTIMEO socket option
After reading "UNIX Network Programming",
I started with the C-language way but many features are missing in the network library.
To time out reading, SO_RCVTIMEO should be supported in setSocketOption.
Since SO_RCVTIMEO is effective only for blocking sockets, a function to set a non-blocking socket back to blocking is necessary.
Receiving data from blocking sockets without triggering the IO manager is also needed.
I confirmed that this actually works but threw this away.
To not block RTS by calling the receiving function of 3, the function should be called via safeFFI. This means that an additional native (OS) thread is consumed. Closing connections should not be that costly. All in all, blocking sockets are not the Haskell way!
Approach 2: the timeout function
Of course, a very easy way is combine the timeout function and the original recv function which may trigger the IO manager.
This actually works.
But again I threw this away since an additional lightweight thread is consumed in timeout.
Approach 3: the threadDelay function
I finally hit upon the idea of threadDelay.
For this approach, a new receiving function is necessary.
It uses non-blocking socket and does not trigger the IO manager.
The algorithm is as follows:
loop until the time out is expired
reading data
if it returns EAGAIN, call threadDelay with a small delay value. If it returns data, breaks the loop.
The advantage of this approach is availability. This works on all platforms with both threaded and non-threaded RTS. The disadvantage is that the timing of timeout would be inaccurate.
Approach 4: callbacks of the IO/Timer manager
Michael Snoyman suggested to use a pair of callbacks for the IO and Timer managers.
First, an MVar is prepared.
Then the main code sets a callback to the IO manager asking to put data
to the MVar when available.
At the same time, the main code also sets a callback to the Timer manager asking to put a time-out signal to the MVar when the timeout is expired.
The two callbacks race and the main code will accept the result of the race through the MVar.
This idea is awesome because no resource is wasted.
What I was impressed is that he knows the IO/Timer managers better than me, who is one of the developers of the mangers!
Final remark
The Haskell network library version 3.1.1.0 will provide gracefulClose.
For threaded-RTS on UNIX where the IO manager is available, approach 4 is taken.
For Windows or non-threaded-RTS where the IO manager is not available, approach 3 is taken.
EDIT: It appeared that approach 4 leaks TCP connections. So, the current network library adopts approach 3 on all platforms.
My deep thank goes to Niklas Hambüchen for pointing out this problem, discussing solutions patiently and reviewing my implementations thoroughly. I would like to thank Tamar Christina for helping the development on Windows and Michael Snoyman for suggesting approach 4.