I had a very interesting conversation with my friend T.H yesterday, and it all turned into an interesting disk usage puzzle for a Saturday night.
The Question
So he says, here’s a rebus to solve. Take a look at these:
$ ls -l /mnt/vbox/centos.img
-rw------- 1 alc alc 8589935104 Mar 15 20:32 /mnt/vbox/centos.img
$ du -sk /mnt/vbox/centos.img
1002388 /mnt/vbox/centos.img
$ df -hT | egrep "File|vbox"
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda5 ext4 20G 5.8G 13G 31% /mnt/vbox
# dumpe2fs /dev/sda5 | grep "Block size"
dumpe2fs 1.42.5 (29-Jul-2012)
Block size: 4096
And tell me, what’s the size of the centos.img file?
The Quest
So what do we know?
According to the ls listing, the file size is ~8GB.
The du command’s output shows a disk usage of 1002388KB, which is roughly 1GB.
The df output says there is 5.8GB of space in use and 13GB available out of 20GB total. The filesystem is ext4 with a block size of 4096B, or 4KB.
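To double-check the rough figures above, here is a small sketch that converts both numbers to GiB. It assumes the KB printed by du are units of 1024 bytes, and the to_gib helper is a name made up for this example.

```shell
# Convert the figures from the listings into GiB (1 GiB = 1024^3 bytes).
apparent_bytes=8589935104      # size column of `ls -l`
allocated_kib=1002388          # output of `du -sk`, assumed to be 1024-byte units

# to_gib is a hypothetical helper: $1 = bytes, prints GiB with two decimals.
to_gib() {
    LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

apparent_gib=$(to_gib "$apparent_bytes")
allocated_gib=$(to_gib $((allocated_kib * 1024)))

echo "apparent:  $apparent_gib GiB"
echo "allocated: $allocated_gib GiB"
```

So "8GB" and "1GB" are fair approximations: the file claims about 8 GiB but occupies a bit less than 1 GiB.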
Exploring Info Pages
Assuming the centos.img file size is 8GB, how is it possible that df shows only 5.8GB in use and du reports only 1GB for the file?
We need some more info. Let’s check the output of ls -ls:
$ ls -ls /mnt/vbox/centos.img
1002388 -rw------- 1 alc alc 8589935104 Mar 15 20:32 /mnt/vbox/centos.img
According to the ls info page:
-s, --size : print the disk allocation of each file to the left of the filename. This is the amount of disk space used by the file, which is usually a bit more than the file's size, but it can be less if the file has holes. Normally the disk allocation is printed in units of 1024 bytes, but this can be overridden.
As we can see above, the -s option reports the amount of disk space used by a file, in kilobytes (1KB = 1024B). For our centos.img file, it’s 1002388KB, or ~1GB. It is the same amount of space that was reported by the du -sk command earlier.
The ls -l long listing format reports the apparent file size (the offset of end-of-file, in bytes), while ls -s shows the amount of disk space actually allocated to the file. In our particular case it means that the centos.img file most likely contains holes: its apparent size is 8GB, but only about 1GB of written data is actually allocated on disk.
The du -sk command also reports the space allocated on disk, so its output matches ls -s.
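The two sizes can also be read side by side with stat. The following is a minimal sketch, assuming GNU stat and GNU truncate; demo.img is a throwaway file created only for this illustration, not the centos.img from the listings.

```shell
# Compare the apparent size with the allocated size of one file (GNU stat assumed).
dir=$(mktemp -d)
f="$dir/demo.img"                  # hypothetical scratch file
truncate -s 1M "$f"                # 1 MiB apparent size, no data written

apparent=$(stat -c %s "$f")        # bytes, same number as the size column of ls -l
blocks=$(stat -c %b "$f")          # number of blocks allocated to the file
unit=$(stat -c %B "$f")            # size in bytes of each block counted by %b
allocated=$((blocks * unit))

echo "apparent=$apparent allocated=$allocated"
if [ "$allocated" -lt "$apparent" ]; then
    echo "$f looks sparse"
fi
rm -rf "$dir"
```

GNU du draws the same line: du -k prints the allocated space, while du -k --apparent-size prints the size that ls -l would show.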
The dd Case
Now, let’s create a copy of the centos.img file with the dd command:
$ dd < /mnt/vbox/centos.img > /mnt/vbox/centos.img_dd
16777217+0 records in
16777217+0 records out
8589935104 bytes (8.6 GB) copied, 87.2762 s, 98.4 MB/s
And check the size of the new centos.img_dd file:
$ ls -ls /mnt/vbox/centos.img*
1002388 -rw------- 1 alc alc 8589935104 Mar 15 20:32 /mnt/vbox/centos.img
8388616 -rw-r--r-- 1 alc alc 8589935104 Mar 15 20:39 /mnt/vbox/centos.img_dd
$ du -sk /mnt/vbox/centos.img*
1002388 /mnt/vbox/centos.img
8388616 /mnt/vbox/centos.img_dd
What do we see here? The real size of the centos.img_dd file, or, in other words, its disk usage, is 8388616KB (~8GB). This is because dd simply reads the source from start to end and writes out every byte it receives. It does not bypass the filesystem here: when a hole is read, the filesystem returns zeros, and dd writes those zeros to the destination as real data. The file content itself doesn’t change, but the holes are no longer unallocated space, they are blocks full of zeros.
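For comparison, GNU dd can be asked to skip over all-zero blocks instead of writing them. Below is a minimal sketch, assuming a GNU coreutils dd recent enough to support conv=sparse; the file names are scratch files invented for the example.

```shell
# Copy a sparse file with dd, with and without conv=sparse (GNU dd assumed).
dir=$(mktemp -d)
truncate -s 8M "$dir/src.img"                       # 8 MiB that is one big hole

# Plain copy: every zero byte read from the hole is written out.
dd if="$dir/src.img" of="$dir/full.img" bs=1M 2>/dev/null
# conv=sparse: all-zero output blocks are skipped with a seek instead of a write.
dd if="$dir/src.img" of="$dir/thin.img" bs=1M conv=sparse 2>/dev/null

full_kib=$(du -k "$dir/full.img" | cut -f1)
thin_kib=$(du -k "$dir/thin.img" | cut -f1)
thin_bytes=$(stat -c %s "$dir/thin.img")            # apparent size of the sparse copy

echo "full=${full_kib}KB thin=${thin_kib}KB thin_size=${thin_bytes}B"
rm -rf "$dir"
```

On a typical ext4 setup the plain copy should occupy about 8MB and the conv=sparse copy close to nothing, while both keep the same file size. The exact numbers depend on the filesystem.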
The cp Case
The most interesting case we found was to copy the image file with the cp command:
$ cp -p /mnt/vbox/centos.img /mnt/vbox/centos.img_cp
If we check the size of the new centos.img_cp file, we see the following:
$ ls -ls /mnt/vbox/centos.img*
1002388 -rw------- 1 alc alc 8589935104 Mar 15 20:32 /mnt/vbox/centos.img
813036 -rw------- 1 alc alc 8589935104 Mar 15 20:51 /mnt/vbox/centos.img_cp
8388616 -rw-r--r-- 1 alc alc 8589935104 Mar 15 20:39 /mnt/vbox/centos.img_dd
$ du -sk /mnt/vbox/centos.img*
1002388 /mnt/vbox/centos.img
813036 /mnt/vbox/centos.img_cp
8388616 /mnt/vbox/centos.img_dd
So, the disk usage of the centos.img_cp file, 813036KB, is smaller than that of the original centos.img file, 1002388KB. How did that happen? Let’s get back to the info pages. The info page for cp says:
--sparse=WHEN : a "sparse file" contains "holes"--a sequence of zero bytes that does not occupy any physical disk blocks. By default, "cp" detects holes in input source files via a crude heuristic and makes the corresponding output file sparse as well. Only regular files may be sparse. The WHEN value can be one of the following:
auto : the default behavior: if the input file is sparse, attempt to make the output file sparse, too. However, if an output file exists but refers to a non-regular file, then do not attempt to make it sparse.
always : for each sufficiently long sequence of zero bytes in the input file, attempt to create a corresponding hole in the output file, even if the input file does not appear to be sparse.
never : never make the output file sparse.
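This sheds some light on the case: by default, the cp command uses a crude heuristic to detect holes in an input file and produces a sparse output file if possible. Therefore our centos.img file has to be a sparse file. A plausible explanation for the copy being even smaller than the original is that cp also turned runs of zero bytes that were stored in allocated blocks of the source into holes in the copy.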
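The effect of the WHEN values can be tried on a small file. This is a sketch only, assuming GNU cp; zeros.img and the two copies are made-up scratch files. The source is deliberately written as real zero bytes, so it does not look sparse to begin with.

```shell
# Compare cp --sparse=never and --sparse=always on a file full of real zeros.
dir=$(mktemp -d)
dd if=/dev/zero of="$dir/zeros.img" bs=1M count=4 2>/dev/null   # 4 MiB of written zeros

cp --sparse=never  "$dir/zeros.img" "$dir/never.img"    # never create holes
cp --sparse=always "$dir/zeros.img" "$dir/always.img"   # turn zero runs into holes

never_kib=$(du -k "$dir/never.img" | cut -f1)
always_kib=$(du -k "$dir/always.img" | cut -f1)
never_bytes=$(stat -c %s "$dir/never.img")
always_bytes=$(stat -c %s "$dir/always.img")

echo "never:  ${never_kib}KB on disk, ${never_bytes}B file size"
echo "always: ${always_kib}KB on disk, ${always_bytes}B file size"
rm -rf "$dir"
```

Both copies keep the 4194304B file size. With --sparse=always the disk usage is expected to drop well below that, which is the same kind of shrinking seen with centos.img_cp.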
Disk Usage and File Size: Some More Stuff
To get a better idea of disk usage and file size, let’s create a one-byte file with dd:
$ dd if=/dev/zero of=filex bs=1 count=1
1+0 records in
1+0 records out
1 byte (1 B) copied, 5.4057e-05 s, 18.5 kB/s
We know the file size is 1B; we can check that with the ls command:
$ ls -ls filex
4 -rw-r--r-- 1 alc alc 1 Mar 15 22:07 filex
The file size is 1B, but the disk usage is 4KB. Curious, isn’t it? Not really. This is because our filesystem is configured to use a block size of 4096B, or 4KB, and space is allocated in whole blocks. Even if we create a file of 1B, it still consumes 4KB of disk space. This means that creating a 1B file leaves 4095B of the allocated block unused.
Let’s create a 4K file now:
$ dd if=/dev/zero of=filex bs=1 count=4096
4096+0 records in
4096+0 records out
4096 bytes (4.1 kB) copied, 0.0121268 s, 338 kB/s
Here’s the ls output:
$ ls -ls filex
4 -rw-r--r-- 1 alc alc 4096 Mar 15 22:08 filex
The file size is now 4KB, and the disk usage is the same as before, 4KB. How about creating a 4097B file? This one should use 8KB of disk space.
$ dd if=/dev/zero of=filex bs=1 count=4097
4097+0 records in
4097+0 records out
4097 bytes (4.1 kB) copied, 0.0155807 s, 263 kB/s
List sizes again:
$ ls -ls filex
8 -rw-r--r-- 1 alc alc 4097 Mar 15 22:09 filex
Wuala, all exactly as we expected. The disk usage increased to 8KB, while the file’s size is slightly more than 4KB.
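The rounding behind these three results is plain integer arithmetic. Here is a small sketch of it; alloc_kib is a made-up helper, and it deliberately ignores holes, filesystem metadata and features such as inline data, so treat it as an approximation for fully written files only.

```shell
# Expected disk usage of a fully written file, rounded up to whole blocks.
# alloc_kib is a hypothetical helper: $1 = file size in bytes, $2 = block size in bytes.
alloc_kib() {
    blocks=$(( ($1 + $2 - 1) / $2 ))     # round up to the next whole block
    echo $(( blocks * $2 / 1024 ))       # convert the allocated bytes to KB
}

for size in 1 4096 4097; do
    echo "$size bytes -> $(alloc_kib "$size" 4096)KB on disk"
done
```

With a 4096B block size this gives 4KB, 4KB and 8KB, matching the ls -ls output for the three versions of filex.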
This example, and the one before, showed disk usage equal to or bigger than the file size. What if we do the opposite and create a file that is bigger than its disk usage?
Let’s extend the size of the existing “filex” file to 16KB:
$ truncate -s 16K filex
According to the truncate man page, if the file is shorter than the specified size, it is extended and the extended part (hole) reads as zero bytes. This should not affect the disk usage; it should only increase the apparent file size.
$ ls -ls filex
8 -rw-r--r-- 1 alc alc 16384 Mar 15 22:11 filex
As we may notice above, the disk usage is still the same, 8KB. The file size, however, has increased to 16KB. We learn something new every day.
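To hunt for such files in a directory, one option is GNU find’s %S format, which prints a file’s sparseness (allocated bytes divided by file size). This is a rough sketch, assuming GNU find; the directory and file names are invented for the example, and small files on some filesystems may also show up with a value below 1.

```shell
# List files whose allocated space is smaller than their file size (GNU find assumed).
dir=$(mktemp -d)
truncate -s 1M "$dir/hole.img"                   # sparse: file size without data
head -c 65536 /dev/urandom > "$dir/dense.img"    # dense: 64KB of written data

# %S is the sparseness ratio; values below 1 suggest the file contains holes.
sparse=$(LC_ALL=C find "$dir" -type f -printf '%S %p\n' |
         LC_ALL=C awk '$1 < 1 { print $2 }')

echo "possibly sparse files:"
echo "$sparse"
rm -rf "$dir"
```

Pointing the same find command at /mnt/vbox would be expected to list centos.img and centos.img_cp, but not centos.img_dd.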
Thanks for the write up! I recently had a bizarre situation where a NetApp system seemed to think there was more free space available than Windows. It turned out to be a SQL Instant File Initialisation feature, which reclaimed used disk space without filling that space with zeros. I had some empty database files which didn’t occupy much space on the storage controller until.
Great post, thank you
Thanks, this helped me understand things a lot better! FYI, “Wuala” should be “Voilà”.
You’re welcome!
It was a reference to an encrypted cloud storage quite popular at the time (it does not exist anymore). Not everyone will get it.
Oops, guess I didn’t do my research there. Clever!
I don’t blame you. It’s for paranoid people who don’t trust Dropbox, Google drive and the likes.