From: Kefu Chai <k.chai@proxmox.com>
Date: Wed, 13 May 2026 05:02:16 +0000 (+0800)
Subject: crimson/osd: disable ofstream buffering to fix concurrent logging
X-Git-Tag: v21.0.1~159^2~1
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=1d09118286aff3737807dc7ef40fd55dbc863608;p=ceph.git

crimson/osd: disable ofstream buffering to fix concurrent logging

seastar::logger::do_log() writes to the shared static _out pointer from
every shard's reactor thread with no lock. std::cerr is safe in this
setting because it is unbuffered: each write maps to a single write(2)
syscall, which POSIX serializes at the kernel level. A buffered
std::ofstream is not safe: multiple shards concurrently advance the
filebuf's put pointer (pptr) past the end-of-buffer marker (epptr),
causing _M_convert_to_external to compute a length longer than the
8192-byte internal buffer and write past it.

The C++ standard does not provide thread-safety guarantees for
std::ofstream. [res.on.data.races] (C++23 §16.4.6.10) specifies that
concurrent non-const access to a standard library object from multiple
threads is a data race with undefined behavior. std::basic_filebuf,
which std::ofstream owns internally, maintains mutable state (pptr,
epptr, _M_buf, and the codec state) that is updated on every write with
no synchronization. std::cerr is an explicit exception: [iostream.objects]
guarantees that concurrent writes to the standard stream objects are
safe, and cerr achieves this by being unbuffered (no pptr/epptr to race
on) and writing through a single atomic write(2) per flush.

This manifested as a heap-buffer-overflow in basic_filebuf::overflow()
via seastar::logger::do_log() during a multi-shard run, reported against
the build that included the dangling-pointer fix (6680e02d041). The
dangling-pointer bug had masked this latent race by crashing before
multiple shards came online.

In this change, we fix this by calling pubsetbuf(nullptr, 0) immediately
after opening the file. This suppresses _M_allocate_internal_buffer and
makes each operator<< call fall through to a single write(2) syscall,
matching std::cerr's thread-safety guarantee.

This fix is necessary for all production deployments, not just
pathological configurations: log_file has a daemon_default of
/var/log/ceph/$cluster-$name.log in global.yaml.in ($cluster is always
"ceph" in modern Ceph, as customized cluster names have been deprecated).
Every crimson-osd process therefore opens a log_file_stream by default,
and every multi-shard run is exposed to this race.

An alternative would be to follow ScyllaDB's approach: its service file
has no StandardOutput or StandardError directives, so systemd connects
the process to the journal, and the logger keeps _out pointing at
std::cerr. This sidesteps the buffered-ofstream problem entirely. For
Crimson to adopt that model it would need to respect log_to_file and
log_to_stderr (which it currently ignores, checking only log_file), and
a dedicated ceph-crimson-osd@.service unit would be needed so that a
StandardError=append:/var/log/ceph/ceph-osd.%i.log directive could be
added without affecting the classic OSD. That is a larger refactor;
pubsetbuf(nullptr, 0) is the minimal correct fix for now.

Fixes: https://tracker.ceph.com/issues/76524
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
---

diff --git a/src/crimson/osd/main.cc b/src/crimson/osd/main.cc
index c549277888d..6f287c9f1f8 100644
--- a/src/crimson/osd/main.cc
+++ b/src/crimson/osd/main.cc
@@ -165,6 +165,11 @@ int main(int argc, const char* argv[])
             } catch (const std::system_error& e) {
               ceph_abort_msg(fmt::format("unable to open log file: {}", e.what()));
             }
+            // seastar::logger::do_log() writes to _out from every shard's thread
+            // with no lock. std::cerr is safe because it is unbuffered; a buffered
+            // ofstream is not. Disable buffering so each write() is a single syscall,
+            // matching cerr's thread-safety guarantee.
+            log_file_stream.rdbuf()->pubsetbuf(nullptr, 0);
             logger().set_ostream(log_file_stream);
           }
           auto reset_logger = seastar::defer([] {