あどけない話

Technical topics about the Internet and more

Seeking the reasons for segfaults of a Haskell program

My public Haskell QUIC server on Linux occasionally crashed with segfaults. I saw two types of segfaults. One is a plain segfault caused by accessing an invalid address:

mighty: segmentation fault

The other is related to free():

*** Error in `mighty': corrupted double-linked list: 0x00007fcdf0008f90 ***

I guessed that a buffer allocated by malloc() had been overrun and that this segfault happened when the buffer was freed.

Many Haskellers would be surprised by segfaults like these because it is hard to cause a segfault in normal Haskell programming. However, if you manipulate pointers or use unsafe functions, segfaults are as common as in other programming languages.

The first type of segfault can be reproduced with Foreign.Storable.peek:

% ghci
> import Foreign.Ptr
> import Foreign.Storable
> :type peek
peek :: Storable a => Ptr a -> IO a

Let's try to dereference the null pointer:

> peek nullPtr :: IO Int
sh: segmentation fault  ghci

Buffer overruns can be caused by Foreign.Storable.poke. Its type is as follows:

> :type poke
poke :: Storable a => Ptr a -> a -> IO ()
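A minimal sketch of the kind of boundary check that keeps a poke inside its buffer. The helper name pokeChecked and the convention of passing the buffer length explicitly are assumptions made for illustration, not code from the server:

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff, pokeByteOff)

-- | Hypothetical helper: write one byte at offset 'off' into a buffer of
--   'len' bytes. Offsets outside the buffer are rejected instead of written.
pokeChecked :: Ptr Word8 -> Int -> Int -> Word8 -> IO Bool
pokeChecked buf len off w
  | off < 0 || off >= len = return False            -- would overrun: do nothing
  | otherwise             = pokeByteOff buf off w >> return True

main :: IO ()
main = allocaBytes 4 $ \buf -> do
  ok  <- pokeChecked buf 4 3 0xff   -- last valid offset, written
  bad <- pokeChecked buf 4 4 0xff   -- one past the end, rejected
  v   <- peekByteOff buf 3 :: IO Word8
  print (ok, bad, v)                -- prints (True,False,255)
```

Without the guard, the second call would silently write past the end of the buffer, and the damage might only surface much later, for example when the allocator inspects its own bookkeeping data.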

I checked every peek and poke in my code but could not figure out the cause of the segfaults. So, I needed to take another approach.

The -g option of GHC

Like other compilers, GHC provides the -g option to add debug information to a compiled program. With it, we can run the program in gdb and get a backtrace when a segfault happens. To compile all dependent libraries with the -g option as well, I added a command-line option (also named -g) to cab, my Cabal wrapper. I also used the sandbox feature of Cabal-v1:

% cd mighty
% cab init             # creating a sandbox
% cab add ~/work/quic  # adding non-Hackage deps
...
% cab install -d -f tls -f quic -g
% cab conf -f tls -f quic -g
% cab build

Then run the compiled program in gdb:

% sudo gdb --args mighty conf route
(gdb) handle SIGPIPE nostop noprint pass
Signal        Stop  Print   Pass to program Description
SIGPIPE       No    No  Yes     Broken pipe
(gdb) handle SIGUSR1 nostop noprint pass
Signal        Stop  Print   Pass to program Description
SIGUSR1       No    No  Yes     User defined signal 1
(gdb) run

As you can see, I had to change how gdb handles two signals, SIGPIPE and SIGUSR1, so that it neither stops on them nor prints them and simply passes them through to the program.

Segfault 1

When I added some QPACK test cases to h3spec and tested the public server, gdb finally caught a segfault and showed a backtrace. The culprit was Data.Array.Base.unsafeAt. I did not check the boundary of an array! (My QPACK code is derived from my HPACK code, where this boundary check is not necessary.)
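A minimal sketch of guarding unsafeAt with a boundary check. The helper lookupChecked and the three-entry table are hypothetical and only illustrate the idea; they are not the actual QPACK code:

```haskell
import Data.Array (Array, bounds, listArray)
import Data.Array.Base (unsafeAt)
import Data.Ix (inRange)

-- | Hypothetical helper: look up index 'i', returning Nothing when it is
--   out of range. Note that unsafeAt takes a zero-based offset rather than
--   the array's own index, hence the subtraction of the lower bound.
lookupChecked :: Array Int a -> Int -> Maybe a
lookupChecked arr i
  | inRange (bounds arr) i = Just (arr `unsafeAt` (i - lo))
  | otherwise              = Nothing
  where
    (lo, _) = bounds arr

main :: IO ()
main = do
  let table = listArray (0, 2) ["a", "b", "c"] :: Array Int String
  print (lookupChecked table 1)    -- prints Just "b"
  print (lookupChecked table 61)   -- prints Nothing
```

Calling unsafeAt directly with an out-of-range offset reads whatever memory happens to be there, which is how an index taken from untrusted input can end in a segfault.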

Segfault 2

The segfault related to free() was really mysterious because the buffer boundary is always checked when poke is used. The error message of free() on Linux is not very helpful. But when I got the same segfault on macOS, the following message was displayed:

mighty(75755,0x700009519000) malloc: Incorrect checksum for freed object 0x7fb8de80ea00: probably modified after being freed.

Eureka! Even if the boundary is checked every time, this segfault happens because a buffer is used after it has been freed.

But why is a freed buffer used? This is one of the difficulties of multithreaded programming. Suppose thread A and thread B share a buffer. The following is an example clean-up procedure:

  • Thread A sends a kill signal to thread B
  • Thread A frees the buffer
  • Thread A exits

This looks perfect. However, the timing of thread B's termination depends on the scheduler. Even after thread A has freed the buffer, thread B may still be alive and can manipulate the buffer.

To prevent this race, I gave up on the combination of Foreign.Marshal.Alloc.mallocBytes and Foreign.Marshal.Alloc.free. Instead, I started using GHC.ForeignPtr.mallocPlainForeignPtrBytes. Buffers allocated by this function are garbage-collected, like the buffer of a ByteString, so a buffer stays alive as long as any thread still holds a reference to it.
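A minimal sketch of sharing a garbage-collected buffer between two threads. The function name sharedWrite, the buffer size, and the MVar are assumptions for illustration and do not come from the server code:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, withForeignPtr)
import Foreign.Storable (peekByteOff, pokeByteOff)
import GHC.ForeignPtr (mallocPlainForeignPtrBytes)

-- | Hypothetical demo: thread B writes into a buffer shared with the
--   calling thread, and the caller reads the byte back.
sharedWrite :: IO Word8
sharedWrite = do
  -- There is no explicit free: the buffer lives as long as 'fptr' is reachable.
  fptr <- mallocPlainForeignPtrBytes 16 :: IO (ForeignPtr Word8)
  done <- newEmptyMVar
  -- Thread B captures 'fptr' in its closure, so the buffer cannot be
  -- reclaimed while B may still touch it.
  _ <- forkIO $ do
    withForeignPtr fptr $ \p -> pokeByteOff p 0 (42 :: Word8)
    putMVar done ()
  -- The MVar only makes the result deterministic; it is not what keeps
  -- the buffer alive.
  takeMVar done
  withForeignPtr fptr $ \p -> peekByteOff p 0 :: IO Word8

main :: IO ()
main = sharedWrite >>= print   -- prints 42
```

The point of the design is that nobody has to decide when to free the buffer, so the order in which the two threads terminate no longer matters.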

Now I believe that my QUIC server is much more stable than before.