mds: drop partial entry and adjust write_pos when opening PurgeQueue
At the tail of the journal there can be a partially written entry. Before
appending new entries to the journal, we need to drop any partially written
entry and adjust write_pos. For the mds log, a partially written entry is
detected and dropped when replaying the journal.
For the PurgeQueue journal, we don't replay the whole journal when the MDS
starts, so before appending a new entry we need to drop any partially
written entry and adjust write_pos ourselves.
The previous patch aligns the journal header write_pos to the boundary
of a fully flushed entry. We can start looking for a partially written
entry from the journal header write_pos. It should be fast even when the
purge queue is very large.
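The scan described above can be sketched roughly as follows. This is only an illustration in Python, assuming a simplified entry layout (a 4-byte little-endian length followed by the payload); it is not the Journaler on-disk format, and find_safe_write_pos and make_entry are hypothetical names.

```python
import struct

# Assumed layout for illustration only: 4-byte little-endian payload length.
HEADER = struct.Struct("<I")


def find_safe_write_pos(journal: bytes, header_write_pos: int) -> int:
    """Return the offset just past the last complete entry.

    Scanning starts at header_write_pos, which is assumed to sit on an
    entry boundary, so the cost depends only on the data written after
    the header was last saved, not on the total journal size.
    """
    pos = header_write_pos
    while True:
        if pos + HEADER.size > len(journal):
            return pos  # no complete length prefix left: end of data or partial entry
        (length,) = HEADER.unpack_from(journal, pos)
        end = pos + HEADER.size + length
        if end > len(journal):
            return pos  # payload is truncated: partial entry
        pos = end


def make_entry(payload: bytes) -> bytes:
    """Encode one entry in the assumed length-prefixed layout."""
    return HEADER.pack(len(payload)) + payload


complete = make_entry(b"purge-1") + make_entry(b"purge-2")
torn = make_entry(b"purge-3")[:-2]  # last two bytes never reached disk
journal = complete + torn

# Start from the saved header position (here: after the first entry).
safe = find_safe_write_pos(journal, header_write_pos=len(make_entry(b"purge-1")))
journal = journal[:safe]  # drop the partial entry; safe becomes the new write_pos
```

Because the scan begins at the header write_pos rather than at the start of the journal, only the entries appended since the last header save are visited.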
David Zafman [Thu, 13 Apr 2017 18:41:18 +0000 (11:41 -0700)]
mon: Use currently configured full ratio to determine available space
Fix a bug where available space was adjusted using the default initial
value of mon_osd_full_ratio rather than the currently configured full
ratio.
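A minimal sketch of why the ratio in use matters, assuming that space above the full ratio is treated as unusable and subtracted from the raw available space. The function name usable_avail_kb and the numbers are hypothetical and are not taken from the monitor code.

```python
def usable_avail_kb(kb: int, kb_avail: int, full_ratio: float) -> int:
    """Available space after reserving everything above the full ratio.

    Assumption for illustration: the top (1 - full_ratio) fraction of the
    device can never be used, so it is subtracted from kb_avail.
    """
    unusable = round(kb * (1.0 - full_ratio))
    return max(0, kb_avail - unusable)


# Same OSD, two different ratios: an initial value versus a ratio that
# was lowered later. Using the stale value overstates the free space.
with_initial_ratio = usable_avail_kb(kb=1000, kb_avail=400, full_ratio=0.95)
with_current_ratio = usable_avail_kb(kb=1000, kb_avail=400, full_ratio=0.80)
```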
David Zafman [Wed, 12 Apr 2017 05:04:07 +0000 (22:04 -0700)]
osd: check_full_status() remove bogus comment and use equivalent computation
We actually compute kb_used as the kb - kb_avail. We don't have the
statfs() system call issue of non-privileged f_bavail vs f_bfree. It
was assumed that used was really like (blocks - f_bfree). It is not.
Matt Benjamin [Fri, 14 Apr 2017 19:56:37 +0000 (15:56 -0400)]
rgw_file: fix readdir after dirent-change
Also, fixes link count computation off-by-one, update of state.nlink
after computation, link computation reset at start, and a time print
in debug log.
Fixes: http://tracker.ceph.com/issues/19634
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
ceph-disk: enable directory backed OSD at boot time
https://github.com/ceph/ceph/commit/539385b143feee3905dceaf7a8faaced42f2d3c6
introduced a regression preventing directory backed OSD from starting at
boot time.
For device backed OSD the boot sequence starts with ceph-disk@.service
and proceeds to
systemctl enable --runtime ceph-osd@.service
where the --runtime flag ensures ceph-osd@12 is removed when the machine
reboots so that it does not compete with the ceph-disk@/dev/sdb1 unit at
boot time.
However, directory backed OSDs rely solely on the ceph-osd@.service unit
to start at boot time and will therefore fail to boot.
The --runtime flag is now set selectively, for device backed OSDs only.
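The selective behaviour can be sketched as below. This is an illustration only: systemctl_enable_cmd and its device_backed parameter are hypothetical and do not reproduce the ceph-disk code; only the systemctl enable --runtime invocation comes from the description above.

```python
def systemctl_enable_cmd(osd_id: str, device_backed: bool) -> list:
    """Build the systemctl command used to enable an OSD unit.

    Device backed OSDs get --runtime so the unit is dropped at reboot and
    the ceph-disk@ unit drives startup. Directory backed OSDs have no such
    unit, so the enablement must persist across reboots.
    """
    cmd = ["systemctl", "enable"]
    if device_backed:
        cmd.append("--runtime")
    cmd.append("ceph-osd@{}".format(osd_id))
    return cmd


device_cmd = systemctl_enable_cmd("12", device_backed=True)
directory_cmd = systemctl_enable_cmd("12", device_backed=False)
```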
mon/OSDMonitor: transit creating_pgs from pgmap when upgrading
there could be some pg(s) still being created when we are upgrading to
luminous, and the pools holding them are not changed in the sense of
pg_pool_t::last_change after the upgrade and before we scan for
creating pgs. in that case, the existing update_pending_creatings()
will fail to collect the pgs being created before the upgrade.
with this change, the creating_pgs in pgmap are also used for updating
the OSDMonitor's creating_pgs if it's updated.
but we should stop updating the pgmap once the upgrade completes. i.e.
stop dispatching MSG_PGSTATS messages to PGMonitor if the quorum and all
osds are luminous.
John Spray [Wed, 8 Mar 2017 12:13:46 +0000 (12:13 +0000)]
mds: shut down finisher before objecter
Some of the finisher contexts would try to call into Objecter.
We mostly are protected from this by mds_lock+the stopping
flag, but at the Filer level there's no mds_lock, so in the
case of file size probing we have a problem.
Fixes: http://tracker.ceph.com/issues/19204
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray [Tue, 28 Mar 2017 18:13:33 +0000 (14:13 -0400)]
mds: ignore ENOENT on writing backtrace
We get ENOENT when a pool doesn't exist. This can
happen because we don't prevent people deleting
former cephfs data pools whose files may not have
had their metadata flushed yet.
http://tracker.ceph.com/issues/19401
Signed-off-by: John Spray <john.spray@redhat.com>
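As a rough sketch of the idea, assuming the write completion reports 0 on success and a negative errno on failure; backtrace_write_ok is a hypothetical name, not a function in the MDS.

```python
import errno


def backtrace_write_ok(result: int) -> bool:
    """Decide whether a backtrace write result should be treated as fine.

    Assumed convention: 0 on success, -errno on failure.
    """
    if result == 0:
        return True
    if result == -errno.ENOENT:
        # The target pool no longer exists, so there is nowhere to write
        # the backtrace; treat this as benign instead of as a failure.
        return True
    return False


pool_deleted = backtrace_write_ok(-errno.ENOENT)
io_error = backtrace_write_ok(-errno.EIO)
```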
Matt Benjamin [Tue, 11 Apr 2017 10:42:07 +0000 (06:42 -0400)]
rgw_file: don't expire directories being read
If a readdir expire event turns out to be older than last_readdir,
just reschedule it (but actually, we should just discard it, as
another expire event must be in the queue).
Fixes: http://tracker.ceph.com/issues/19625
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
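The check can be sketched like this, assuming timestamps are plain numbers; handle_expire_event and its return values are hypothetical and only illustrate the comparison against last_readdir.

```python
def handle_expire_event(event_time: float, last_readdir: float) -> str:
    """Decide what to do with a readdir expire event.

    If the directory was read after the event was created, the event is
    stale and the directory must not be expired; it is rescheduled.
    """
    if event_time < last_readdir:
        return "reschedule"
    return "expire"


stale = handle_expire_event(event_time=10.0, last_readdir=12.5)
current = handle_expire_event(event_time=15.0, last_readdir=12.5)
```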
John Spray [Fri, 7 Apr 2017 13:24:01 +0000 (09:24 -0400)]
mon: emit cluster log messages on MDS health changes
Previously, when we got a beacon that updated the health
metrics for an MDS, the user would just see mysterious-looking
cluster log messages indicating a rising fsmap epoch number.
It would be good to do this for health messages in general at
some point, but for now just do it for the MDS ones.
Fixes: http://tracker.ceph.com/issues/19551
Signed-off-by: John Spray <john.spray@redhat.com>
Matt Benjamin [Tue, 11 Apr 2017 09:56:13 +0000 (05:56 -0400)]
rgw_file: chunked readdir
Adjust readdir callback path for new nfs-ganesha chunked readdir,
including changes to respect the result of callback to not
continue.
Pending introduction of offset name hint, our caller will just be
completely enumerating, so it is possible to remove the offset map
and just keep a last offset.
Fixes: http://tracker.ceph.com/issues/19624
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
Sage Weil [Wed, 12 Apr 2017 02:35:32 +0000 (22:35 -0400)]
mon/OSDMonitor: fix initial map when require_luminous_osds not set on mkfs
If we don't set the luminous flag, we should not set the new luminous
fields or else we'll get a crc mismatch. (Funnily that happens in the
epoch where the flag is eventually set and the encoded map finally includes
the field we have set in memory.)
Yang Honggang [Thu, 13 Apr 2017 12:09:07 +0000 (20:09 +0800)]
cephfs: fix write_buf's _len overflow problem
After I set about 400 64KB xattr key-value pairs on a file, the
MDS crashed. Every time I tried to start the MDS, it crashed again.
The root cause is that write_buf._len overflowed during
Journaler::append_entry().
This patch tries to fix the problem through the following changes:
John Spray [Wed, 29 Mar 2017 18:38:37 +0000 (19:38 +0100)]
tools/cephfs: set dir_layout when injecting inodes
When we left this as zero, the MDS would interpret it as HASH_LINUX
rather than the default HASH_RJENKINS. Potentially that
could cause problems if there perhaps were already dirfrags in
the metadata pool that were set up using rjenkins. Mainly
it just seems more appropriate to explicitly set this field
rather than hit the fallback behaviour.
Related: http://tracker.ceph.com/issues/19406
Signed-off-by: John Spray <john.spray@redhat.com>