I have spent the past few days building a custom NixOS image for my Raspberry Pi, taking note of all the problems I encountered along the way so I could write about them. It has certainly been an interesting experience: since I am building the image while emulating AArch64 (or ARM64v8, the architecture of the Pi 3) with QEMU, I wasn't expecting a smooth ride at all. There was one particular problem, though, that had me scratching my head for a while.

After solving some other unrelated problems and attempting to build the image, I got this mysterious error:

$ nix-build '<nixpkgs/nixos>' -A config.system.build.sdImage --argstr system aarch64-linux -I nixos-config=./sd-image.nix
building '/nix/store/q4kcsy4f1jcxxa2kc6x02rjhg8z1911y-ext4-fs.img.zst.drv'...
[...]
copying store paths to image...
copying files to image...
e2fsck 1.45.5 (07-Jan-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
NIXOS_SD: 93738/177408 files (0.1% non-contiguous), 581365/708701 blocks
Resizing to minimum allowed size
resize2fs 1.45.5 (07-Jan-2020)
Please run 'e2fsck -f temp.img' first.

builder for '/nix/store/q4kcsy4f1jcxxa2kc6x02rjhg8z1911y-ext4-fs.img.zst.drv' failed with exit code 1
cannot build derivation '/nix/store/ari7yzjkhif5d7q256dy5rrdfkjhqb8f-nixos-sd-image-20.09pre221814.10100a97c89-aarch64-linux.img.drv': 1 dependencies couldn't be built
error: build of '/nix/store/ari7yzjkhif5d7q256dy5rrdfkjhqb8f-nixos-sd-image-20.09pre221814.10100a97c89-aarch64-linux.img.drv' failed

What? resize2fs seems to be complaining that the filesystem it is trying to resize was never checked with e2fsck, except that… well, that's exactly what is done right before resize2fs runs, as can be seen in the log. To understand the issue, I first need to understand what's being executed. nix show-derivation will display the derivation passed as its argument as JSON, and it's easy to parse it by leveraging the power of nix-shell and jq:

nix show-derivation /nix/store/q4kcsy4f1jcxxa2kc6x02rjhg8z1911y-ext4-fs.img.zst.drv | \
    nix-shell -p jq --run 'jq -r .[].env.buildCommand'

Here are the relevant bits (some comments stripped for brevity):

...
faketime -f "1970-01-01 00:00:01" mkfs.ext4 -L NIXOS_SD -U 44444444-4444-4444-8888-888888888888 $img
...
cptofs -t ext4 -i $img nix-path-registration /
echo "copying store paths to image..."
cptofs -t ext4 -i $img $storePaths /nix/store/
echo "copying files to image..."
cptofs -t ext4 -i $img ./files/* /
if ! fsck.ext4 -n -f $img; then
  ...
fi

echo "Resizing to minimum allowed size"
resize2fs -M $img

...

What is shown is the result of the compiled Nix expression that generates the ext4 filesystem for the image. It doesn't look like there's anything wrong here, especially considering that the build works on the official build server. Bug reports about the same error point to invalid system dates as the culprit, but the date on the VM is definitely correct. In addition, the image is deliberately created with the UNIX epoch as its timestamp.

resize2fs -M rabbit-hole.img

At this point, I wanted to understand under which conditions resize2fs threw that error. resize2fs is part of e2fsprogs, so I delved into its source code.

The code first checks whether resize2fs was called with the force option (not the case here), then does the following:

int checkit = 0;
if (fs->super->s_state & EXT2_ERROR_FS)
    checkit = 1;
if ((fs->super->s_state & EXT2_VALID_FS) == 0)
    checkit = 1;
if ((fs->super->s_lastcheck < fs->super->s_mtime) &&
    !print_min_size)
    checkit = 1;
if ((ext2fs_free_blocks_count(fs->super) >
     ext2fs_blocks_count(fs->super)) ||
    (fs->super->s_free_inodes_count > fs->super->s_inodes_count))
    checkit = 1;
if (checkit) {
    fprintf(stderr,
        _("Please run 'e2fsck -f %s' first.\n\n"),
        device_name);
    exit(1);
}

So the error is triggered if any of these is true:

  1. The filesystem is marked as errored or not marked as valid.
  2. The filesystem was last checked before it was last mounted.
  3. The number of free blocks is greater than the total number of blocks, or likewise for inodes.

I doubted that the error was due to any kind of corruption, so I took a closer look. Nix has very convenient ways to debug packages, but since building an image is a bit of a special process, I can't just drop into a shell when the build fails. What I can do, though, is run nix-build with the -K flag, which makes it keep the failed build directory (with the image):

$ nix-build -K '<nixpkgs/nixos>' -A config.system.build.sdImage --argstr system aarch64-linux -I nixos-config=./sd-image.nix
...
note: keeping build directory '/tmp/nix-build-ext4-fs.img.zst.drv-0'

With nix-shell I can be dropped into a shell with the same toolset that the builder had when it failed:

$ cd /tmp/nix-build-ext4-fs.img.zst.drv-0
$ nix-shell -p e2fsprogs.bin --option system aarch64-linux
[nix-shell:/tmp/nix-build-ext4-fs.img.zst.drv-0]$ ls
env-vars  files  nix  nix-path-registration  temp.img

So what’s the deal with the image?

[nix-shell:/tmp/nix-build-ext4-fs.img.zst.drv-0]$ dumpe2fs -h temp.img
dumpe2fs 1.45.5 (07-Jan-2020)
Filesystem volume name:   NIXOS_SD
[...]
Filesystem state:         clean
[...]
Inode count:              177408
Block count:              708701
Free blocks:              666124
Free inodes:              177081
[...]
Filesystem created:       Thu Jan  1 00:00:01 1970
Last mount time:          Thu Jan  1 00:00:02 1970
Last write time:          Thu Jan  1 00:00:05 1970
[...]
Last checked:             Thu Jan  1 00:00:01 1970

Thanks to dumpe2fs, in one shot we know that:

  • The filesystem is clean (i.e. valid, non-errored): check (1) passes.
  • The block and inode counts are coherent: check (3) passes.
  • The filesystem was last checked one second after the epoch (00:00:01), while its last mount happened at 00:00:02. This is what makes resize2fs refuse to run.

A single extra second was able to disrupt my build process. That's what I'd call interesting! But I still had quite a few questions that needed answering.

Why is the last checked time not changed, even though fsck was called?

To answer this, I first looked at the manpage of e2fsck to understand the meaning of the options given to it:

-n

Open the filesystem read-only, and assume an answer of 'no' to all questions. Allows e2fsck to be used non-interactively. This option may not be specified at the same time as the -p or -y options.

This already looks like an answer, and a quick look at the source code confirms the assumption:

if (!(ctx->flags & E2F_FLAG_RUN_RETURN) &&
    !(ctx->options & E2F_OPT_READONLY)) {
    [...]
    if (!(ctx->flags & E2F_FLAG_TIME_INSANE))
        sb->s_lastcheck = ctx->now;
}

How come the official builds aren't failing, then?

Let's dissect one. I took the latest successful build at the time of writing and downloaded it. The SD card image builder places the FAT32 partition used by the Raspberry Pi to boot in front of the root ext4 filesystem, so to extract the latter I need the offset at which it starts.

It's not too difficult to analyze the partition table or to find the ext4 magic number in the image, but it's quicker to check the build log on Hydra, which conveniently dumps the partition table:

Device Boot Start     End Sectors  Size Id Type
[snip]      16384   77823   61440   30M  b W95 FAT32
[snip] *    77824 5470991 5393168  2.6G 83 Linux

And this is what it had to tell me (dd's default 512-byte block size matches the sector size, so skip=77824 starts reading exactly at the root partition):

$ dd if=nixos-sd-image-20.03.1619.ab3adfe1c76-aarch64-linux.img skip=77824 of=nixos-root.img
$ dumpe2fs -h nixos-root.img
Filesystem created:       Thu Jan  1 00:00:01 1970
Last mount time:          Thu Jan  1 00:00:01 1970
Last write time:          Thu Jan  1 00:00:01 1970
[...]
Last checked:             Thu Jan  1 00:00:01 1970

Hmm… so resize2fs works on the official builds because the timestamps are all aligned, and thus last_check < mount_time does not hold.

A persistent clock that isn’t much of a clock

The hardest part of this investigation was figuring out why the timestamps were set to the UNIX epoch and where the extra second was coming from. To get to the answer, I first wanted to find out what was incrementing the “last mount time” on my system. The answer is not immediately apparent; it is hidden in these lines:

cptofs -t ext4 -i $img nix-path-registration /
echo "copying store paths to image..."
cptofs -t ext4 -i $img $storePaths /nix/store/
echo "copying files to image..."
cptofs -t ext4 -i $img ./files/* /

cptofs isn't a very widespread utility, and I didn't know it existed. A quick search brings up its source code and, more importantly, the repo it comes from: a fork of the Linux kernel.

LKL (Linux Kernel Library) is aiming to allow reusing the Linux kernel code as extensively as possible with minimal effort and reduced maintenance overhead.

Examples of how LKL can be used are: creating userspace applications […] that can read or write Linux filesystems or can use the Linux networking stack […]

The concept sounds pretty cool: to copy files into the build image without having to mount it, NixOS is essentially using a utility that runs the Linux kernel (with its ext4 fs driver) in userspace. This means that cptofs is:

  • bootstrapping a modified version of the Linux kernel along with some modules
  • loading the original implementation of ext4 from the Linux kernel
  • virtually mounting the image on this instance of the kernel
  • copying the source files to the virtual mount path.

The filesystem is effectively being mounted, so it's quite obvious that its mount time is going to be altered as well. But why is it set to the epoch, and moreover one second late?

I suspected some kind of issue with the timekeeping done by the userspace kernel, and I didn't know whether any changes had been made to ext4 in this regard, so I went to take a look. The first function of interest is ext4_setup_super, which increases the mount count and calls ext4_update_tstamp to update s_mtime:

le16_add_cpu(&es->s_mnt_count, 1);
ext4_update_tstamp(es, s_mtime);

ext4_update_tstamp is a macro which eventually sets the timestamp to:

time64_t now = ktime_get_real_seconds();

Which, according to a helpful comment:

ktime_get_real_seconds - Get the seconds portion of CLOCK_REALTIME

Returns the wall clock seconds since 1970. […]

Right, this makes sense. But why is it the epoch and not the actual wall clock time? It turns out the answer to this question is quite tricky. Originally, LKL never initialized the kernel timer and always started counting time from the epoch, ignoring the host's wall clock. Then a patch was submitted that used the host's time to set the userspace kernel's clock.

Some time later, someone else reverted the original patch in favor of the “proper” way of doing it: using the arch-specific function read_persistent_clock. As the knowledgeable author stated in response to a maintainer asking whether this patch allowed removing the previously introduced code:

yes [we can remove the code in setup.c that sets the clock directly]. […]

read_persistent_clock() is a weak symboled function so if there is no arch-overrided function, it skips the call.

That sounds very reasonable, and the patch worked.

What eventually went unnoticed downstream, though, is that over the following years the kernel started moving towards 64-bit timestamps in an effort to prevent the Y2038 problem. One step of this transition was migrating the existing read_persistent_clock to a new read_persistent_clock64, which uses timespec64 instead of timespec. It was a gradual transition, which eventually led to read_persistent_clock being phased out. In 2018, when there were no more consumers of the function, it was removed entirely in favor of the 64-bit variant.

The read_persistent_clock added by the original patch stayed in the LKL tree, never to be used again, and since the new function is weakly symboled too, the kernel quietly fell back to its default implementation. Here it is in all its glory:

/**
 * read_persistent_clock64 -  Return time from the persistent clock.
 *
 * Weak dummy function for arches that do not yet support it.
 * Reads the time from the battery backed persistent clock.
 * Returns a timespec with tv_sec=0 and tv_nsec=0 if unsupported.
 *
 *  XXX - Do be sure to remove it once all arches implement it.
 */
void __weak read_persistent_clock64(struct timespec64 *ts)
{
    ts->tv_sec = 0;
    ts->tv_nsec = 0;
}

And there is our answer. Since 2018, nobody noticed that LKL had regressed and started counting time from the Unix epoch again, exactly as the original issue reported. Welcome back, January 1st, 1970! I have reported my findings upstream.

One second too late

After all of this, one question is still outstanding: why on earth is my last mount time one second later than it should be? From what I've deduced so far, I know that LKL's timekeeping works, but it starts from the UNIX epoch rather than the current time. Could it be that LKL was just slow (also due to emulation) and took two seconds to mount the image?

Let’s open a shell with lkl and e2fsprogs and create an ext4 image:

$ nix-shell -p '[e2fsprogs.bin lkl]' --option system aarch64-linux
[nix-shell:~]$ truncate -s 10485760 /tmp/lkl_test.img
[nix-shell:~]$ mkfs.ext4 -q /tmp/lkl_test.img
[nix-shell:~]$ dumpe2fs -h /tmp/lkl_test.img
Filesystem created:       Wed Apr 29 22:26:54 2020
Last mount time:          n/a
Last write time:          Wed Apr 29 22:26:54 2020
[...]
Last checked:             Wed Apr 29 22:26:54 2020

Nothing wrong so far. Looking at cptofs's help output, I found a very interesting flag:

-p, --enable-printk

show Linux printks

This flag shows the kernel's log messages, which (exactly like dmesg output) are prefixed with the time elapsed since the kernel started. Time for the moment of truth.

[nix-shell:~]$ touch /tmp/empty_file
[nix-shell:~]$ cptofs -p -i /tmp/lkl_test.img -t ext4 /tmp/empty_file /
[    0.000000] Linux version 5.3.0 (nixbld@) (gcc version 9.3.0 (GCC)) #1 Sat Apr 4 07:54:38 UTC 2020
[...]
[    0.000000] clocksource: lkl: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000248] lkl: time and timers initialized (irq2)
[...]
[    0.483229] raid6: int64x8  gen()  1891 MB/s
[    0.700411] raid6: int64x8  xor()  1183 MB/s
[    0.923161] raid6: int64x4  gen()  1739 MB/s
[    1.139449] raid6: int64x4  xor()  1003 MB/s
[    1.358977] raid6: int64x2  gen()  1292 MB/s
[    1.572580] raid6: int64x2  xor()   672 MB/s
[    1.787268] raid6: int64x1  gen()   790 MB/s
[    2.001716] raid6: int64x1  xor()   418 MB/s
[    2.001838] raid6: using algorithm int64x8 gen() 1891 MB/s
[    2.001909] raid6: .... xor() 1183 MB/s, rmw enabled
[    2.002166] raid6: using intx1 recovery algorithm
[    2.032205] clocksource: Switched to clocksource lkl
[...]
[    2.563646] EXT4-fs (vda): mounted filesystem with ordered data mode. Opts:
[    2.682402] reboot: Restarting system
[nix-shell:~]$

Yes! Because btrfs is bundled with the userspace kernel, the raid6_pq and xor modules are also loaded, and they run some relatively slow benchmarks at startup to pick the best algorithms to use. Under emulation, the delay is high enough that the filesystem's mount time ends up two seconds past the epoch, which is the root cause of the original issue. As far as I know, NixOS uses a native AArch64 build box to create the images, which is probably fast enough to always reach the mount stage before the two-second mark.

[nix-shell:~]$ dumpe2fs -h /tmp/lkl_test.img
[...]
Filesystem created:       Wed Apr 29 22:26:54 2020
Last mount time:          Thu Jan  1 00:00:02 1970
Last write time:          Thu Jan  1 00:00:02 1970

Conclusion

With the root cause known, it's evident that the mechanism used by the NixOS builder is fragile: it relies on a fortunate coincidence that doesn't always hold. Moreover, once the LKL issue is fixed and the package updated, the build won't work at all anymore, as LKL will (correctly) set the mount time to the current timestamp, making it greater than the last check time.

For this reason, I have submitted a patch upstream to solve the issue once and for all, hoping that nobody else will have to delve into the guts of a forked Linux kernel to understand it. It was also brought to my attention that another contributor has gotten rid of cptofs entirely, which would also solve the issue.

Thanks for reading!