Why doesn't `resize2fs` resize my filesystem?
I have spent the past few days working on building a custom image of NixOS for my Raspberry Pi, taking note of all the problems that I encountered along the way to write about it. It has certainly been an interesting experience, as I am building the image emulating AArch64
(or ARM64v8, the architecture of the Pi 3) with QEMU, so I wasn’t expecting a smooth ride at all. There was one particular problem, though, which had my head scratching for a while.
After solving some other unrelated problems and attempting to build the image, I got this mysterious error:
$ nix-build '<nixpkgs/nixos>' -A config.system.build.sdImage --argstr system aarch64-linux -I nixos-config=./sd-image.nix
building '/nix/store/q4kcsy4f1jcxxa2kc6x02rjhg8z1911y-ext4-fs.img.zst.drv'...
[...]
copying store paths to image...
copying files to image...
e2fsck 1.45.5 (07-Jan-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
NIXOS_SD: 93738/177408 files (0.1% non-contiguous), 581365/708701 blocks
Resizing to minimum allowed size
resize2fs 1.45.5 (07-Jan-2020)
Please run 'e2fsck -f temp.img' first.
builder for '/nix/store/q4kcsy4f1jcxxa2kc6x02rjhg8z1911y-ext4-fs.img.zst.drv' failed with exit code 1
cannot build derivation '/nix/store/ari7yzjkhif5d7q256dy5rrdfkjhqb8f-nixos-sd-image-20.09pre221814.10100a97c89-aarch64-linux.img.drv': 1 dependencies couldn't be built
error: build of '/nix/store/ari7yzjkhif5d7q256dy5rrdfkjhqb8f-nixos-sd-image-20.09pre221814.10100a97c89-aarch64-linux.img.drv' failed
What? It seems that resize2fs
is complaining that the filesystem that is trying to resize was never checked with e2fsck
, except that… well, that’s exactly what’s being done before running resize2fs
, as it can be seen by the log. To understand the issue, I need to understand what’s being executed first. nix show-derivation
will display the derivation passed as its argument as JSON, and it’s easy to parse it leveraging the power of nix-shell
and jq
:
nix show-derivation /nix/store/q4kcsy4f1jcxxa2kc6x02rjhg8z1911y-ext4-fs.img.zst.drv | \
nix-shell -p jq --run 'jq -r .[].env.buildCommand'
Here are the relevant bits (some comments stripped for brevity):
...
faketime -f "1970-01-01 00:00:01" mkfs.ext4 -L NIXOS_SD -U 44444444-4444-4444-8888-888888888888 $img
...
cptofs -t ext4 -i $img nix-path-registration /
echo "copying store paths to image..."
cptofs -t ext4 -i $img $storePaths /nix/store/
echo "copying files to image..."
cptofs -t ext4 -i $img ./files/* /
if ! fsck.ext4 -n -f $img; then
...
fi
echo "Resizing to minimum allowed size"
resize2fs -M $img
...
What is being shown is the result of the compiled Nix expression which generates the ext4
filesystem for the image. It doesn’t look like there’s anything wrong here, especially considering that the build works on the official build server. Bug reports talking about the same issue seem to refer to invalid system dates being the culprit, but the date on the VM is definitely correct. In addition, the image is deliberately created with the UNIX epoch as its timestamp.
resize2fs -M rabbit-hole.img
At this point, I wanted to understand in which conditions resize2fs
threw that error. resize2fs
is part of e2fsprogs
, so I delved into its source code.
The code first checks if resize2fs
is called with the force
option (which is not the case here), then does the following:
int checkit = 0;
if (fs->super->s_state & EXT2_ERROR_FS)
checkit = 1;
if ((fs->super->s_state & EXT2_VALID_FS) == 0)
checkit = 1;
if ((fs->super->s_lastcheck < fs->super->s_mtime) &&
!print_min_size)
checkit = 1;
if ((ext2fs_free_blocks_count(fs->super) >
ext2fs_blocks_count(fs->super)) ||
(fs->super->s_free_inodes_count > fs->super->s_inodes_count))
checkit = 1;
if (checkit) {
fprintf(stderr,
_("Please run 'e2fsck -f %s' first.\n\n"),
device_name);
exit(1);
}
So the error is triggered if any of this is true:
- The filesystem is marked as errored or not marked as valid.
- The last check of the filesystem was done before the time it was last mounted.
- Whether the number of free blocks is greater than the number of available blocks, or the same for
inodes
.
I doubted that the error was due to any kind of corruption, so I took a closer look. Nix has very convenient ways to debug packages, but since building an image is a bit of a special process, I can’t just be dropped in a shell when the build fails. What I can do, though, is run nix-build
with the -K
flag which makes it retain the broken build (with the image):
$ nix-build -K '<nixpkgs/nixos>' -A config.system.build.sdImage --argstr system aarch64-linux -I nixos-config=./sd-image.nix
...
note: keeping build directory '/tmp/nix-build-ext4-fs.img.zst.drv-0'
With nix-shell
I can be dropped into a shell with the same toolset that the builder had when it failed:
$ cd /tmp/nix-build-ext4-fs.img.zst.drv-0
$ nix-shell -p e2fsprogs.bin --option system aarch64-linux
[nix-shell:/tmp/nix-build-ext4-fs.img.zst.drv-0]$ ls
env-vars files nix nix-path-registration temp.img
So what’s the deal with the image?
[nix-shell:/tmp/nix-build-ext4-fs.img.zst.drv-0]$ dumpe2fs -h temp.img
dumpe2fs 1.45.5 (07-Jan-2020)
Filesystem volume name: NIXOS_SD
[...]
Filesystem state: clean
[...]
Inode count: 177408
Block count: 708701
Free blocks: 666124
Free inodes: 177081
[...]
Filesystem created: Thu Jan 1 00:00:01 1970
Last mount time: Thu Jan 1 00:00:02 1970
Last write time: Thu Jan 1 00:00:05 1970
[...]
Last checked: Thu Jan 1 00:00:01 1970
Thanks to dumpe2fs
, in one shot we know that:
- The filesystem is clean (i.e. valid, non-errored). Check (1) passes.
- The number of blocks and inodes is coherent. Check (3) passes.
- The filesystem was last checked at
00:00:01
on the epoch, while its last mount was on00:00:02
. This is what makesresize2fs
refuse to execute.
An extra second was able to disrupt my build process. That’s what I’d define interesting! But I had quite a few questions in my mind that needed answering.
Why is the last checked time not changed, even though fsck
was called?
To answer this, I first looked at the manpage of e2fsck
to understand the meaning of the options given to it:
-n
Open the filesystem read-only, and assume an answer of `no’ to all questions. Allows e2fsck to be used non-interactively. This option may not be specified at the same time as the -p or -y options.
This already looks like an answer, but a quick look at the source code confirms the assumption:
if (!(ctx->flags & E2F_FLAG_RUN_RETURN) &&
!(ctx->options & E2F_OPT_READONLY)) {
[...]
if (!(ctx->flags & E2F_FLAG_TIME_INSANE))
sb->s_lastcheck = ctx->now;
}
How come that the official builds aren’t failing then?
Let’s dissect one. I took the last successful build at the time of writing and downloaded it. The SD card image builder prepends to the root ext4
fs the FAT32
system used by the Raspberry Pi to boot, so to extract it I need the offset at which it starts.
It’s not too difficult to analyze the partition table or find the ext
magic number in the image, but it’s quicker to check the log on Hydra which conveniently dumps the partition table:
Device Boot Start End Sectors Size Id Type
[snip] 16384 77823 61440 30M b W95
[snip] * 77824 5470991 5393168 2.6G 83 Linu
And this is what it had to tell me:
$ dd if=nixos-sd-image-20.03.1619.ab3adfe1c76-aarch64-linux.img skip=77824 of=nixos-root.img
$ dumpe2fs -h nixos-root.img
Filesystem created: Thu Jan 1 00:00:01 1970
Last mount time: Thu Jan 1 00:00:01 1970
Last write time: Thu Jan 1 00:00:01 1970
[...]
Last checked: Thu Jan 1 00:00:01 1970
Hmm… So resize2fs
is working on the official builds because the timestamps are all aligned, thus last_check < mount_time
is not true.
A persistent clock that isn’t much of a clock
The hardest part of this investigation was figuring out why the timestamps were set to the UNIX epoch and where this extra second was coming from. To get to the answer, first I wanted to figure out what was incrementing the “last mount time” on my system. The answer is not immediately apparent, but is hidden in these lines:
cptofs -t ext4 -i $img nix-path-registration /
echo "copying store paths to image..."
cptofs -t ext4 -i $img $storePaths /nix/store/
echo "copying files to image..."
cptofs -t ext4 -i $img ./files/* /
cptofs
isn’t a very widespread utility, and I didn’t know about its existence. A quick search brings up the source code and more importantly the repo where it comes from, a fork of the Linux kernel.
LKL (Linux Kernel Library) is aiming to allow reusing the Linux kernel code as extensively as possible with minimal effort and reduced maintenance overhead.
Examples of how LKL can be used are: creating userspace applications […] that can read or write Linux filesystems or can use the Linux networking stack […]
The concept sounds pretty cool: to copy its files to the build image without having to mount it, NixOS is basically using an utility which runs the Linux kernel (with its ext4
fs driver) in userspace. This means that cptofs
is:
- bootstrapping a modified version of the Linux kernel along with some modules
- loading the original implementation of
ext4
from the Linux kernel - virtually mounting the image on this instance of the kernel
- copying the source files to the virtual mount path.
The filesystem is effectively getting mounted, thus it’s quite obvious that its mount time is also going to be altered. But why is it set to the epoch, and moreover one second late?
I suspected some kind of issues with the timekeeping done by the userspace kernel, and I didn’t know if there were any changes made to ext4
in this regard, so I went to take a look. The first function of interest is ext4_setup_user
which increases the mount count and calls ext4_update_tstamp
to update s_mtime
:
le16_add_cpu(&es->s_mnt_count, 1);
ext4_update_tstamp(es, s_mtime);
ext4_update_tstamp
is a macro which eventually sets the timestamp to:
time64_t now = ktime_get_real_seconds();
Which, according to an helpful comment:
ktime_get_real_seconds
- Get the seconds portion of CLOCK_REALTIMEReturns the wall clock seconds since 1970. […]
Right, this makes sense. But why is it the epoch and not the actual wall clock time? Turns out that the answer to this question is quite tricky. Originally, LKL never initialized the kernel timer and always started counting time from the epoch, ignoring the host’s wall clock time. Then, a patch was submitted which used the time of the host to update the one of the userspace kernel.
Some time later someone else reverted the original patch in favor of the “proper” way of doing it, which was to use the arch-specific function read_persistent_clock
. As the knowledgeable author stated in response to a maintainer asking if this patch allowed to remove the previously introduced code:
yes [we can remove the code in
setup.c
that sets the clock directly]. […]
read_persistent_clock()
is a weak symboled function so if there is no arch-overrided function, it skips the call.
That sounds very reasonable, and the patch worked.
What wasn’t eventually caught downstream, though, is that in the coming years the kernel started to move towards 64-bit timestamps in an effort to prevent the Y2K38 problem. A step of this transition was migrating the existing read_persistent_clock
to a new read_persistent_clock64
, which used timespec64
instead of timespec
. It was a gradual transition, which eventually led read_persistent_clock
to be phased out. In 2018, when there were no more consumers of this function, it was completely removed in favor of the 64-bit option.
The original read_persistent_clock
added in the patch stayed in the LKL tree never to be used again, and since the new one is weakly symboled too, the kernel quietly started to use its default implementation. Here is it in all of its glory:
/**
* read_persistent_clock64 - Return time from the persistent clock.
*
* Weak dummy function for arches that do not yet support it.
* Reads the time from the battery backed persistent clock.
* Returns a timespec with tv_sec=0 and tv_nsec=0 if unsupported.
*
* XXX - Do be sure to remove it once all arches implement it.
*/
void __weak read_persistent_clock64(struct timespec64 *ts)
{
ts->tv_sec = 0;
ts->tv_nsec = 0;
}
And there is our answer. Since 2018, nobody ever noticed that LKL regressed and started counting time from the Unix epoch again, as the original issue reported. Welcome back January 1st, 1970! I have reported my findings upstream.
One second too late
After all of this, there is still one question outstanding: why the hell is my last mount time one second later than it should be? With what I’ve deduced before, I now know that LKL timekeeping is working, but it’s starting from the UNIX epoch rather than the current time. Could it be that LKL was just slow (also due to emulation) and took two seconds to mount the image?
Let’s open a shell with lkl
and e2fsprogs
and create an ext4
image:
$ nix-shell -p '[e2fsprogs.bin lkl]' --option system aarch64-linux
[nix-shell:~]$ truncate -s 10485760 /tmp/lkl_test.img
[nix-shell:~]$ mkfs.ext4 -q /tmp/lkl_test.img
[nix-shell:~]$ dumpe2fs -h /tmp/lkl_test.img
Filesystem created: Wed Apr 29 22:26:54 2020
Last mount time: n/a
Last write time: Wed Apr 29 22:26:54 2020
[...]
Last checked: Wed Apr 29 22:26:54 2020
Nothing wrong so far. Looking at the help of cptofs
, I found a very interesting flag:
-p, --enable-printk
show Linux printks
This flag allows to see the logging messages of the kernel, which (exactly as dmesg
outputs) are prefixed with the relative start time of the kernel. Time for the moment of truth.
[nix-shell:~]$ touch /tmp/empty_file
[nix-shell:~]$ cptofs -p -i /tmp/lkl_test.img -t ext4 /tmp/empty_file /
[ 0.000000] Linux version 5.3.0 (nixbld@) (gcc version 9.3.0 (GCC)) #1 Sat Apr 4 07:54:38 UTC 2020
[...]
[ 0.000000] clocksource: lkl: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.000248] lkl: time and timers initialized (irq2)
[...]
[ 0.483229] raid6: int64x8 gen() 1891 MB/s
[ 0.700411] raid6: int64x8 xor() 1183 MB/s
[ 0.923161] raid6: int64x4 gen() 1739 MB/s
[ 1.139449] raid6: int64x4 xor() 1003 MB/s
[ 1.358977] raid6: int64x2 gen() 1292 MB/s
[ 1.572580] raid6: int64x2 xor() 672 MB/s
[ 1.787268] raid6: int64x1 gen() 790 MB/s
[ 2.001716] raid6: int64x1 xor() 418 MB/s
[ 2.001838] raid6: using algorithm int64x8 gen() 1891 MB/s
[ 2.001909] raid6: .... xor() 1183 MB/s, rmw enabled
[ 2.002166] raid6: using intx1 recovery algorithm
[ 2.032205] clocksource: Switched to clocksource lkl
[...]
[ 2.563646] EXT4-fs (vda): mounted filesystem with ordered data mode. Opts:
[ 2.682402] reboot: Restarting system
[nix-shell:~]$
Yes! Due to btrfs
being bundled with the userspace kernel, the raid6_pq
and xor
modules are also being loaded which perform some relatively slow benchmarking on startup to determine the best algorithms to use. The delay is high enough when running with emulation that it causes the mount time of the filesystem to be set two seconds away from the epoch, which is the root cause of the original issue. As far as I know, NixOS is using a native AArch64 build box to create the images, which means that is probably fast enough to always get to the mount stage before the two seconds mark.
[nix-shell:~]$ dumpe2fs -h /tmp/lkl_test.img
[...]
Filesystem created: Wed Apr 29 22:26:54 2020
Last mount time: Thu Jan 1 00:00:02 1970
Last write time: Thu Jan 1 00:00:02 1970
Conclusion
With the root cause known, it’s still evident that the mechanism used in the NixOS builder is fragile, as it relies on a fortunate coincidence which doesn’t always apply. Moreover, when the issue on LKL is going to be fixed and the package updated, it won’t work anymore, as LKL will (correctly) update the mount time to the current timestamp, making it greater than the last check time.
Thus, I have submitted a patch upstream to solve this issue once and for all, hoping that nobody else will have to delve into the roots of a forked Linux kernel to understand an issue. It was also brought to my attention that another contributor got rid of cptofs
entirely, which would also solve the issue.
Thanks for reading!