Greg Farnum [Fri, 1 Nov 2013 22:45:02 +0000 (15:45 -0700)]
OSDMonitor: be a little nicer about letting users do pg splitting
We were previously blocking pg splits whenever pg creations were in-
progress, but we only really need to avoid splitting any pgs which are
currently being created. Let the user set a different pg_num if there
are no creating PGs on the pool in question.
Fixes: #6673, take two Signed-off-by: Greg Farnum <greg@inktank.com>
Samuel Just [Thu, 31 Oct 2013 20:19:32 +0000 (13:19 -0700)]
sharedptr_registry.hpp: removed ptrs need to not blast contents
See the included unit test update. Consider:
1) x = lookup_or_create(1, 1)
2) remove(1)
3) y = lookup_or_create(1, 2)
4) x.reset()
5) z = lookup(1)
The bug is that z will be null since x.reset() caused the
cleanup callback to remove y's key value from contents.
To fix this, contents also records the pointer value for
the weak_ptr. The removal callback only removes the
key from contents if it matches the ptr in contents.
This should work since the pointer passed to the removal
callback must be unique up to that point since it has
not yet been deleted.
This allowed a pg removal -> pg recreation -> pg removal
sequence to cause the second pg removal entry to be
erroneously cleared by the first pg removal's destructor
as it finally made its way through the removal queue.
Fixes: #5951 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Noah Watkins [Wed, 30 Oct 2013 23:34:29 +0000 (16:34 -0700)]
prio-q: initialize cur iterator
For new SubQueues `cur` is not intialized, so front/pop_front will freak
out. I honestly I have no idea how this hasn't been seen, but it was
being triggered frequently on OSX.
Fixes: #6686 Signed-off-by: Noah Watkins <noahwatkins@gmail.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Wed, 30 Oct 2013 23:54:39 +0000 (16:54 -0700)]
PGLog: remove obsolete assert in merge_log
This assert assumes that if olog.head != log.head, olog contains
a log entry at log.head, which may not be true since pg splitting
might have left the log with arbitrary holes.
Sage Weil [Tue, 29 Oct 2013 18:08:58 +0000 (11:08 -0700)]
upstart, sysvinit: use ceph-crush-location hook
Instead of hard-coding a check in ceph.conf and some reasonable
defaults, defer this work to ceph-crush-location, and allow users to
specify their own hook with alternative logic.
This can be helpful in a nubmer of cases, like:
- rack (or other) information included in hostname and easily parsed
out by a hook
- multiple types of devices in each host, resulting in 'parallel'
crush trees (e.g., one for hdd, one for ssd)
Sage Weil [Tue, 29 Oct 2013 17:10:21 +0000 (10:10 -0700)]
mon/PGMonitor: always send pg creations after mapping
At some point in the dumpling cycle I separated the map stage from the
send stage. We can send the creates any time we have a non-zero osdmap
epoch, and are in good shape as long as we do the map step after the
osdmap is loaded (hence the post_paxos_update).
Some background:
We originally introduced the map-but-don't send in a2fe0137, at which
point all was well because we only called it on ceph-mon startup.
Later, this turned into post_paxos_update in e635c478, at which point
it was now called by a running monitor.. but we didn't add in the
send_pg_creates(). This is where this bug stems from.
This particular path is responsible for the stalled test referenced in
bug #6673.
Backport: dumpling Signed-off-by: Sage Weil <sage@inktank.com>
We can't adjust last_backfill to object x until x has been fully
backfilled. pending_backfill_updates contains all those backfills
started, but which have not yet been reflected in pinfo.last_update.
backfills_in_flight contains those backfills which have not yet
completed. Thus, we can adjust last_update to the largest entry
in pending_backfill_updates not in backfills_in_flight.
Samuel Just [Mon, 28 Oct 2013 22:53:24 +0000 (15:53 -0700)]
ReplicatedPG: replace backfill_pos with last_backfill_started
last_backfill_started reflects what pinfo.last_backfill will be
once all currently outstanding backfills complete. backfill_pos
was tricky since we couldn't correctly inialize it without
doing the first backfill scan pair.
In recover_backfill, we rescan from last_backfill_started rather
than from backfill_pos. This ensures that we capture all clones
created between last_backfill_started and what previously had been
backfill_pos without special handling in make_writeable. The main
downside is that we will tend to "rescan" last_backfill_started.
Sage Weil [Mon, 28 Oct 2013 22:56:15 +0000 (15:56 -0700)]
init-ceph: make crush update on osd start time out
If the monitor is not currently available, this crush update would block
forever, preventing the OSD and (potentially) the rest of the system
from starting up. Instead, make it time out after 10 seconds and then
abort startup. This prevents startup of an OSD if we failed to update
the CRUSH position for some reason.
In fact, do not start up the OSD if the CRUSH update fails for any
reason--not just a timeout!
Works-around: #5612 Signed-off-by: Sage Weil <sage@inktank.com>
Samuel Just [Mon, 28 Oct 2013 18:02:34 +0000 (11:02 -0700)]
ReplicatedBackend: don't hold ObjectContexts in pull completion callback
We need flushing the sequencer to ensure that all Contexts which hold
ObjectContextRefs have been run or deleted.
C_ReplicatedBackend_OnPullComplete, however, gets queued in a second
work queue in order to avoid performing expensive push related reads
in the FileStore finisher.
Rather than keep the objects contexts around, we instead put off
removing the object from the pulling map until the call back
fires and read the object context out of the pulling map. This
way the ObjectContextRef will be cleaned up along with the rest
of the pulling map in on_change.
Samuel Just [Sat, 26 Oct 2013 00:58:10 +0000 (17:58 -0700)]
ReplicatedPG: have make_writeable adjust backfill_pos
If we are writing to backfill_pos and create a clone, we end
up failing to send the transaction creating the clone to the
backfill peer. This is fine as long as we end up backfilling
the clone. To that end, we simply add the clone to
backfill_info and adjust backfill_pos accordingly. This is less
brittle than the waiting_for_backfill_pos mechanism since it
works even if we wait between that check and issuing the repop,
which can happen for copy_from.
Samuel Just [Sat, 26 Oct 2013 23:52:16 +0000 (16:52 -0700)]
ReplicatedPG,osd_types: move rw tracking from its own map to ObjectContext
We also modify recovering to hold a reference to the recovering obc
in order to ensure that our backfill_read_lock doesn't outlive the
obc.
ReplicatedPG::op_applied no longer clears repop->obc since we need
it to live until the op is finally cleaned up. This is fine since
repop->obc is now an ObjectContextRef and can clean itself up.
Greg Farnum [Wed, 23 Oct 2013 18:28:45 +0000 (11:28 -0700)]
ReplicatedPG: take and drop read locks when doing backfill
All our interfaces are in place, so now we can actually take and
drop the locks.
1) Take locks in ReplicatedPG::recover_backfill. This is the entry
into the backfill code path, and covers all objects which are
added to backfills_in_flight (via prep_backfill_object_push()). If we
can't get the lock right away, we stop the backfill movement there
until we can do so.
2) Drop the locks in ReplicatedPG::on_peer_recover(), called when the
push is completed.
2b) Further drop the locks on all backfills_in_flight objects in
_clear_recovery_state(), for when we cancel peering.
Greg Farnum [Tue, 22 Oct 2013 00:36:04 +0000 (17:36 -0700)]
PG: switch the start_recovery_ops interface to specify work to do as a param
We previously inferred whether there was useful work to be done
by looking at the number of ops started, but with the upcoming
introduction of the rw_manager read locking on backfill, we could
start no ops while still having work to do. Switch around the
interfaces to specify these as separate pieces of information.
Greg Farnum [Mon, 21 Oct 2013 21:27:32 +0000 (14:27 -0700)]
ReplicatedPG: implement the RWTracker mechanisms for backfill read locking
We want backfill to take read locks on the objects it's pushing. Add
a get_backfill_read(hobject_t) function, a corresponding drop_backfill_read(),
and a backfill_waiting_on_read member in ObjState. Check that member when
getting a write lock, and in put_write(). Tell callers to requeue the recovery
if necessary, and clean up the backfill block when its read lock is dropped.
Sage Weil [Sat, 26 Oct 2013 00:45:06 +0000 (17:45 -0700)]
mon/OSDMonitor: make racing dup pool rename behave
If we get dup pool rename requests that are racing, make sure the second
one comes back with 'success' if the rename entry already exists in the
pending_inc map.
mon: OSDMonitor: Make 'osd pool rename' idempotent
'ceph osd pool rename' takes two arguments: source pool and dest pool.
If by chance 'source pool' does not exist and 'destination pool' does,
then, in order to assure it's idempotent, we want to assume that if
'source pool' no longer exists is because it was already renamed.
However, while we will return success in such case, we want to make sure
to let the user know that we made such assumption. Mostly to warn the
user of such a thing in case of a mistake on the user's part (say, the
user didn't notice that the source pool didn't exist, while the dest did),
but also to make sure that the user is not surprised by the command
returning success if the user expected an ENOENT or EEXIST.
Fixes: #6635 Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Gregory Farnum [Fri, 25 Oct 2013 20:57:21 +0000 (13:57 -0700)]
Merge pull request #769 from ceph/wip-copy-get
With this branch we make copy-get significantly easier to extend by applying our standard encode/decode stuff to it, instead of doing an inline encode-onto-the-payload. We also add some infrastructure for dealing with completion of RepGathers.
Greg Farnum [Fri, 25 Oct 2013 20:41:29 +0000 (13:41 -0700)]
osd: add category to object_copy_data_t
We don't bump the encoding version -- and stick it in the middle --
since it's still brand-new. For simplicity, we encode it unconditionally
rather than trying to embed it alongside the attrs or with its own
"complete" flag in the cursor.
Greg Farnum [Wed, 9 Oct 2013 17:39:19 +0000 (10:39 -0700)]
OSD: add back CEPH_OSD_OP_COPY_GET, and use it in the Objecter
This one is encoded with version information. We are not doing anything
to control which op gets sent by the client, but after discussion with
Sam we think this op isn't accessible enough to clients (right now it's
only triggered by a client sending copy-from, which can only happen via
ceph-test-rados) to require compatibility versioning.
Greg Farnum [Tue, 8 Oct 2013 21:57:31 +0000 (14:57 -0700)]
osd: Add a new object_copy_data_t, and use it in the OSD/Objecter
Right now this is very primitive, but we're about to extend it to
deal with request versioning appropriately, and adding in some
extra fields.
Sadly we are doing a little extra copying in the Objecter as a result, but
too bad -- being able to do updates will be worth it.
This version is a user version, and since we're in the OSD we
should call it such. (In particular, we may want to keep track
of the internal version too when doing cache promotes.)
Greg Farnum [Fri, 4 Oct 2013 20:30:46 +0000 (13:30 -0700)]
ReplicatedPG: copy: do not let start_copy() return error codes
There's no failure it can actually run into, and handling error
codes in some of its callers is going to be a pain.
While we're here, document the parameters.
Samuel Just [Fri, 25 Oct 2013 17:47:28 +0000 (10:47 -0700)]
PGLog::read_log: don't add items past backfill line to missing
Fixes: #6574 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
Adam Twardowski [Thu, 24 Oct 2013 16:24:11 +0000 (12:24 -0400)]
Update init-rbdmap
Add a chkconfig line for RHEL based distros to make chkconfig start rbdmap earlier on boot and stop later on shutdown. This will help prevent shutdown/reboot from hanging your system forever in the event that some daemon has a file held open on an rbd mounted filesystem.
Signed-off-by: Adam Twardowski <adam.twardowski@gmail.com>(cherry picked from commit 80384a1a24e681fff11c8715804b7f8cc4a2189a)
Josh Durgin [Thu, 24 Oct 2013 15:42:48 +0000 (08:42 -0700)]
rgw: escape bucket and object names in StreamReadRequests
This fixes copy operations for objects that contain unsafe characters,
like a newline, which would return a 403 otherwise, since the GET to
the source rgw would be unable to verify the signature on a partially
valid bucket name.
Josh Durgin [Thu, 24 Oct 2013 15:37:25 +0000 (08:37 -0700)]
rgw: move url escaping to a common place
This is useful outside of the s3 interface. Rename url_escape()
url_encode() for consistency with the exsting common url_decode()
function. This is in preparation for the next commit, which needs
to escape url-unsafe characters in another place.
Josh Durgin [Thu, 24 Oct 2013 15:34:24 +0000 (08:34 -0700)]
rgw: update metadata log list to match data log list
Send the last marker whether the log is truncated in the same format
as data log list, so clients don't have more needless complexity
handling the difference. Keep bucket index logs the same, since they
contain the marker already, and are not used in exactly the same way
metadata and data logs are.
Josh Durgin [Thu, 24 Oct 2013 15:26:19 +0000 (08:26 -0700)]
rgw: include marker and truncated flag in data log list api
Consumers of this api need to know their position in the log. It's
readily available when fetching the log, so return it. Without the
marker in this call, a client could not easily or efficiently figure
out its position in the log, since it would require getting the global
last marker in the log, and then reading all the log entries.
This would be slow for large logs, and would be subject to races that
would cause potentially very expensive duplicate work.
Returning this atomically while fetching the log entries simplifies
all of this.
Josh Durgin [Thu, 24 Oct 2013 15:18:19 +0000 (08:18 -0700)]
cls_log: always return final marker from log_list
There's no reason to restrict returning the marker to the case where
less than the whole log is returned, since there's already a truncated
flag to tell the client what happened.
Giving the client the last marker makes it easy to consume when the
log entries do not contain their own marker. If the last marker is not
returned, the client cannot get the last marker without racing with
updates to the log.
mon: MonClient: ping monitors without authenticating
* add support on the monitor to reply to MPing messages with the contents of
'mon_status' and 'health', regardless of a client having authenticated beforehand.
* add support on the MonClient to send a MPing message to a randomly picked
monitor (it was easier this way, '-m ip:port' allows for targeted ping) and block
waiting for a reply.
* add support on librados, pybind/rados.py and the 'ceph' tool to send pings to
monitors.
Resolves: #5984
Reviewed-by: Greg Farnum <greg@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>