Add a O_NOCMTIME flag which prevents inode time updates on writes and
can greatly reduce the IO overhead of writes to allocated and
initialized regions of files.
ceph servers can have loads where they perform O_DIRECT overwrites of
allocated file data and then sync to make sure that the O_DIRECT writes
are flushed from write caches. If the writes dirty the inode with mtime
updates then the syncs also write out the metadata needed to track the
inodes which can add significant iop and latency overhead.
The ceph servers don't use mtime at all. They're using the local file
system as a backing store and any backups would be driven by their upper
level ceph metadata.
In simple tests a O_DIRECT|O_NOCMTIME overwriting write followed by a
sync went from 2 serial write round trips to 1 in XFS and from 4 serial
IO round trips to 1 in ext4.
file_update_time() is changed to call a file_is_nocmtime() helper which
tests the file flag in addition to testing the inode's S_NOCMTIME flag.
It doesn't check FMODE_NOCMTIME because that's only used by XFS to
trigger private flags which trigger private flags which do other things.
O_NOCMTIME can only be used if the mount has its MNT_NOCMTIME flag set.
This requires priviledged intervention to testify that mtime isn't
critical to, say, backup infrastructure or NFS server consistency
guarantees.