Sage Weil [Tue, 23 Apr 2013 21:06:41 +0000 (14:06 -0700)]
mon: drop forwarded requests after an election
On each election, we resend routed requests to the new leader (or
requeue for ourselves). Therefore, if we receive a forwarded request,
we should drop it on the floor if there is a new election. Add a field
in the PaxosServiceMessage struct to track which election epoch we
received the request in, and drop it in PaxosService::dispatch() if
that is in the past.
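
A minimal sketch of the epoch check described above, written in Python rather than the monitor's C++. The names ForwardedRequest, rx_election_epoch and dispatch are illustrative assumptions, not the actual Ceph definitions.

```python
from dataclasses import dataclass


@dataclass
class ForwardedRequest:
    """Hypothetical stand-in for a PaxosServiceMessage."""
    payload: str
    # Election epoch in which this monitor received the request.
    rx_election_epoch: int


def dispatch(request: ForwardedRequest, current_epoch: int) -> bool:
    """Return True if the request is handled, False if it is dropped."""
    if request.rx_election_epoch < current_epoch:
        # An election happened after we received this request. The sender
        # resends routed requests after each election, so drop this copy.
        return False
    return True


stale = ForwardedRequest("osd boot", rx_election_epoch=4)
fresh = ForwardedRequest("osd boot", rx_election_epoch=6)
print(dispatch(stale, current_epoch=6))  # False
print(dispatch(fresh, current_epoch=6))  # True
```

Dropping is safe only because the resend happens on every election; without that guarantee the request would be lost.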
Sage Weil [Tue, 23 Apr 2013 20:45:59 +0000 (13:45 -0700)]
mon: requeue routed_requests for self if elected leader
If we have forwarded requests and are then elected leader, requeue those
requests for ourselves so they are processed normally, and clear out the
routed_requests map.
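
The requeue step might look roughly like this. It is a Python sketch under assumed names (Monitor, local_queue, on_elected_leader); only routed_requests comes from the commit message, and the real monitor is C++.

```python
from collections import deque


class Monitor:
    """Hypothetical, heavily simplified monitor."""

    def __init__(self):
        self.routed_requests = {}   # tid -> request forwarded to the old leader
        self.local_queue = deque()  # requests this monitor processes itself

    def forward(self, tid, request):
        self.routed_requests[tid] = request

    def on_elected_leader(self):
        # We are the leader now, so nothing should remain "forwarded":
        # requeue everything locally (in tid order) and empty the map.
        for tid in sorted(self.routed_requests):
            self.local_queue.append(self.routed_requests[tid])
        self.routed_requests.clear()


mon = Monitor()
mon.forward(2, "second request")
mon.forward(1, "first request")
mon.on_elected_leader()
print(list(mon.local_queue))  # ['first request', 'second request']
print(mon.routed_requests)    # {}
```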
Gary Lowell [Thu, 11 Apr 2013 16:42:13 +0000 (09:42 -0700)]
ceph-disk: OSD hotplug fixes for Centos
Two fixes for CentOS 6.3 and other systems with udev versions
prior to 172. The persistent disk name using the GPT UUID does
not exist, so use the by_path persistent name instead for the
journal symlink.
The GPT label fields are not available for use in udev rules. Add a
ceph-disk-udev wrapper script that extracts the partition
type GUID from the label and calls ceph-disk-activate if it is
a Ceph GUID type. (Bug #4632)
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Sage Weil [Mon, 22 Apr 2013 22:01:09 +0000 (15:01 -0700)]
mon: commit LogSummary on every message
This moves our version pointer up so that we don't re-log (by re-consuming)
log messages to /var/log/ceph/ceph.log on ceph-mon restart. OTOH, it means
we rewrite the summary of the last 50 messages, but we consider that to be
relatively cheap (and something we *always* did in bobtail and
earlier anyway).
ceph-mon: Attempt to obtain monmap from several possible sources
In order of interest/priority:
- our latest monmap version
- a backup monmap version created during sync start, if the store
appears to be in a post-aborted sync state
- a mkfs monmap version
If none of these are found, we should go ahead and try to build a
monmap from ceph.conf to join an existing cluster.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
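
The priority order above can be sketched as a first-match lookup. This is an illustration in Python; obtain_monmap, the store keys and the sync_aborted flag are assumed names, not the monitor's actual code.

```python
from typing import Callable, Optional, Sequence, Tuple

Source = Tuple[str, Callable[[], Optional[bytes]]]


def obtain_monmap(sources: Sequence[Source]) -> Tuple[str, Optional[bytes]]:
    """Try each source in priority order; return the first that yields data."""
    for name, load in sources:
        blob = load()
        if blob is not None:
            return name, blob
    # Nothing usable in the store: the caller would build a monmap
    # from ceph.conf to join an existing cluster.
    return "ceph.conf", None


# Hypothetical store contents: no latest monmap, but a backup and a mkfs one.
store = {"backup": b"monmap-v7", "mkfs": b"monmap-v1"}
sync_aborted = True  # store appears to be in a post-aborted sync state

sources = [
    ("latest", lambda: store.get("latest")),
    ("backup", lambda: store.get("backup") if sync_aborted else None),
    ("mkfs", lambda: store.get("mkfs")),
]
print(obtain_monmap(sources))  # ('backup', b'monmap-v7')
```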
mon: Monitor: backup monmap prior to starting a store sync
If by fate we end up attempting a store sync after failing at
least one before, we might not have a monmap to read from the
store to back up. Therefore, in that case, we shall back up the
current monmap being used by the monitor.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
David Zafman [Wed, 20 Mar 2013 06:12:35 +0000 (23:12 -0700)]
tools/ceph-filestore-dump: Error messages lost because stderr is closed
Use cout instead of cerr for command errors
Use cerr in debug mode, where stderr is still available
Output map_epoch in debug mode
Fix a message and emit it only in debug mode
Signed-off-by: David Zafman <david.zafman@inktank.com>
When an OSD's data and journal share a device, the previous code
created the journal partition from the end of the device. sgdisk
uses a uint32_t to get the last sector, which is too small for
large disks.
As a result, the journal partition is created backwards from
a sector in the middle of the disk, leaving space before
and after it. The data partition uses whichever of
these spaces is larger; the remainder is left unused.
This patch creates the journal partition from the start as a workaround.
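
A worked example of why a 32-bit sector number lands in the middle of a large disk. It assumes 512-byte sectors and that the truncated value simply wraps modulo 2^32; the exact sgdisk behaviour is not confirmed here.

```python
SECTOR_SIZE = 512        # bytes; assumed logical sector size
UINT32_MAX = 2**32 - 1


def last_sector(disk_bytes: int) -> int:
    """Index of the last sector on a disk of the given size."""
    return disk_bytes // SECTOR_SIZE - 1


def as_uint32(value: int) -> int:
    """Keep only the low 32 bits, as a uint32_t would."""
    return value & UINT32_MAX


disk = 3 * 2**40                    # a 3 TiB disk
true_last = last_sector(disk)
truncated = as_uint32(true_last)

print(true_last)                    # 6442450943
print(truncated)                    # 2147483647
print(truncated * SECTOR_SIZE / disk)  # fraction of the disk, about 1/3
```

With 512-byte sectors a uint32_t can address at most 2^32 sectors, i.e. 2 TiB, so any larger disk is affected under these assumptions.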
Sage Weil [Thu, 18 Apr 2013 03:11:33 +0000 (20:11 -0700)]
global: call observers (and start logging) in global_init
Call observers so that the logging infrastructure gets initialized and we
start logging. Otherwise, unless a default log setting has been modified,
we won't start logging until we daemonize, and we won't get the nice
version banner in the log file.
Unlike the previous attempt to fix this (a3091774), we do this after all
of the lockdep initialization has completed.
Samuel Just [Fri, 19 Apr 2013 00:54:39 +0000 (17:54 -0700)]
osd/: optionally track every pg ref
This involves three pieces:
For intrusive_ptr type references, we use TrackedIntPtr instead. This
uses get_with_id and put_with_id to associate an id and backtrace with
each particular ref instance.
For refs taken via direct calls to get() and put(), get and put now
require a tag string. The PG tracks individual ref counts for each tag
as well as the total.
Finally, PGs register/unregister themselves on construction/destruction
with OSDService.
As a result, on shutdown, we can check for live pgs and determine where
the references are held.
This behavior is compiled out by default, but can be included with the
--enable-pgrefdebugging flag.
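
The per-tag counting and the registry idea can be illustrated as follows. This is a Python sketch with assumed names (TrackedPG, Registry); it omits the TrackedIntPtr ids and backtraces, and registration is explicit because Python has no deterministic destructors.

```python
from collections import Counter


class TrackedPG:
    """Hypothetical PG that counts references per tag and in total."""

    def __init__(self, pgid):
        self.pgid = pgid
        self.total = 0
        self.by_tag = Counter()

    def get(self, tag):
        self.by_tag[tag] += 1
        self.total += 1

    def put(self, tag):
        if self.by_tag[tag] <= 0:
            raise ValueError(f"put() without matching get() for tag {tag!r}")
        self.by_tag[tag] -= 1
        self.total -= 1

    def live_refs(self):
        return {tag: n for tag, n in self.by_tag.items() if n > 0}


class Registry:
    """Stands in for OSDService: knows which PGs are still alive."""

    def __init__(self):
        self.live = {}

    def register(self, pg):
        self.live[pg.pgid] = pg

    def unregister(self, pg):
        del self.live[pg.pgid]

    def report(self):
        # On shutdown: which PGs are still alive, and who holds refs on them?
        return {pgid: pg.live_refs() for pgid, pg in self.live.items()}


registry = Registry()
pg = TrackedPG("1.0")
registry.register(pg)
pg.get("scrub")
pg.get("recovery")
pg.put("scrub")
print(registry.report())  # {'1.0': {'recovery': 1}}
```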
MDS crashes while journaling dirty root inode in handle_client_setxattr
and handle_client_removexattr. We should use journal_dirty_inode to
safely log root inode here.
Signed-off-by: Kuan Kai Chiu <big.chiu@bigtera.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
mon: PaxosService: fix trim criteria so as to avoid constantly trimming
Say a service establishes it will only keep 500 versions once a given
condition X is true. Now say that said condition X only becomes true
after said service has committed some 800 versions.
Once we decide to trim, this service would trim all 300 surplus versions
in one go. After that, each committed version would also trim the
previous version.
Trimming an unbounded number of versions is not a good practice
as it will generate bigger transactions (thus a greater workload on
leveldb) and therefore bigger messages too.
Constantly trimming versions implies more frequent accesses to leveldb,
and keeping around a couple more versions won't hurt us in any significant
way, so let us put off trimming unless we go over a predefined minimum.
This patch adds two new options:
  paxos service trim min - minimum number of versions to trigger a trim
                           (default: 30, 0 disables it)
  paxos service trim max - maximum number of versions to trim during a
                           single proposal
                           (default: 50, 0 disables it)
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
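
How the two options interact can be sketched as below. This is a Python illustration, not the PaxosService code; versions_to_trim and its parameters are assumed names, and whether the minimum comparison is strict is also an assumption.

```python
def versions_to_trim(first_committed, trim_to, trim_min=30, trim_max=50):
    """Return how many versions a single proposal would trim.

    first_committed: oldest version currently kept
    trim_to:         oldest version the service wants to keep
    """
    surplus = trim_to - first_committed
    if surplus <= 0:
        return 0
    if trim_min > 0 and surplus < trim_min:
        return 0          # too few surplus versions; put off trimming
    if trim_max > 0 and surplus > trim_max:
        return trim_max   # bound the size of a single trim
    return surplus


# The scenario from the message: 800 versions committed, keep only 500.
print(versions_to_trim(first_committed=1, trim_to=301))    # 50
# One more commit after the backlog is gone: surplus of 1 is below the minimum.
print(versions_to_trim(first_committed=301, trim_to=302))  # 0
```

With the defaults, the 300 surplus versions are removed over six proposals of 50 each instead of in one large transaction.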