Sage Weil [Fri, 4 Oct 2013 23:07:20 +0000 (16:07 -0700)]
osd/ReplicatedPG: factor out simple_repop_{create,submit} helpers
This makes it easier to create repops correctly, and should help
prevent bugs like the one we remove here in process_copy_op (we were
serializing on the wrong object!)
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
Sage Weil [Thu, 10 Oct 2013 16:56:39 +0000 (09:56 -0700)]
osdc/Objecter: reimplement list_objects
Return to caller at the end of each PG. This allows the caller to look at
the [pg_]hash_position and get something meaningful.
If there are no objects in the PG, we skip it so that every callback has
*some* data (unless the pool is totally empty!). So the real difference
here is that we don't move on to the next PG just to reach max_entries.
This gives the client some data sooner, but may mean more callbacks into
client code.
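A minimal sketch of the per-PG listing behavior described above. The Pool and Cursor types and the list_next name are hypothetical stand-ins for illustration and are not the Objecter interface:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a pool: pool[i] holds the object names in PG i.
using Pool = std::vector<std::vector<std::string>>;

// Hypothetical listing position: which PG, and how far into it.
struct Cursor {
  std::size_t pg = 0;
  std::size_t offset = 0;
};

// Return up to max_entries names from the next PG that still has data.
// Empty PGs are skipped, and the call returns at the PG boundary even
// if fewer than max_entries names were collected.
std::vector<std::string> list_next(const Pool& pool, Cursor& cur,
                                   std::size_t max_entries) {
  assert(max_entries > 0);
  std::vector<std::string> out;
  // Skip PGs with nothing left so every callback carries some data.
  while (cur.pg < pool.size() && cur.offset >= pool[cur.pg].size()) {
    ++cur.pg;
    cur.offset = 0;
  }
  if (cur.pg == pool.size())
    return out;  // the whole pool has been listed
  const std::vector<std::string>& objs = pool[cur.pg];
  while (cur.offset < objs.size() && out.size() < max_entries)
    out.push_back(objs[cur.offset++]);
  // Do not continue into the next PG just to reach max_entries.
  if (cur.offset == objs.size()) {
    ++cur.pg;
    cur.offset = 0;
  }
  return out;
}
```

After each call the cursor sits at a PG boundary or inside a single PG, which is what lets the caller read a meaningful position out of it.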
Sage Weil [Sun, 6 Oct 2013 20:22:31 +0000 (13:22 -0700)]
osdc/Objecter: separate explicit pg target from current target
The pgid field is used to store the pg the op mapped to. We were just
setting it directly for PGLS. Instead, fill in a new base_pgid, and copy that
to pgid in recalc_op_target(), the same way we do when we map an object
name to a PG.
In particular, we take this opportunity to map a raw pgid to an actual
pgid. This means the base_pg could come from a raw hash value (although
it doesn't, yet).
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
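To illustrate what mapping a raw pgid to an actual pgid can look like: a minimal sketch, assuming the raw value is a hash that gets folded into the pool's PG range with a stable modulo. The function and parameter names are illustrative and are not taken from this patch:

```cpp
#include <cassert>
#include <cstdint>

// Fold a raw hash value x into the range [0, pg_num).
// pg_num_mask is assumed to be (smallest power of two >= pg_num) - 1.
std::uint32_t stable_mod(std::uint32_t x, std::uint32_t pg_num,
                         std::uint32_t pg_num_mask) {
  assert(pg_num > 0);
  if ((x & pg_num_mask) < pg_num)
    return x & pg_num_mask;
  // The masked value falls past the last PG, so drop one more bit.
  return x & (pg_num_mask >> 1);
}
```

With pg_num = 12 and a mask of 15, raw values 13 and 29 both fold to PG 5, while 11 maps to itself.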
Samuel Just [Wed, 6 Nov 2013 22:33:03 +0000 (14:33 -0800)]
ReplicatedPG: don't skip missing if sentries is empty on pgls
Formerly, if sentries was empty, we skipped missing. In general,
we need to continue adding items from missing until we get
to next (returned from collection_list_partial) to avoid
missing any objects.
Fixes: #6633
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
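A minimal sketch of the merge rule described in the pgls fix above, with plain strings standing in for object ids. The merge_listing name and its signature are hypothetical; the real code works on the PG's missing set and the collection_list_partial results:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Combine the names listed from the store (sentries) with the names
// recorded as missing, returning everything in [current, next) in
// sorted order. Missing names are added even when sentries is empty.
std::vector<std::string> merge_listing(const std::vector<std::string>& sentries,
                                       const std::set<std::string>& missing,
                                       const std::string& current,
                                       const std::string& next) {
  assert(!(next < current));
  std::set<std::string> merged(sentries.begin(), sentries.end());
  for (auto it = missing.lower_bound(current);
       it != missing.end() && *it < next; ++it)
    merged.insert(*it);
  return std::vector<std::string>(merged.begin(), merged.end());
}
```

The case the fix targets is the first argument being empty: the missing entries below next must still be returned.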
Samuel Just [Tue, 5 Nov 2013 23:40:29 +0000 (15:40 -0800)]
RadosModel: use sharedptr_registry for snaps_in_use
There might be two concurrent rollback ops each of which
adds snap x to snaps_in_use. Between when the first
completes and the second completes, snap x may be removed
since the first would have removed snap x from snaps_in_use.
Using sharedptr_registry here avoids this by ensuring that
the snap won't be removed from snaps_in_use until all refs
are gone.
This patch also adds size() to sharedptr_registry.
Fixes: #6719
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Samuel Just [Mon, 4 Nov 2013 19:25:31 +0000 (11:25 -0800)]
FileStore::_collection_move_rename: handle missing dst dir on replay
In case of a replay, a missing destination directory indicates that
the destination object and directory have been removed by a later
transaction. Thus, we need to remove the src object and return
0.
Fixes: #6714
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Danny Al-Gaaf [Mon, 4 Nov 2013 22:30:47 +0000 (23:30 +0100)]
galois.c: fix compiler warning
galois_create_split_w8_tables() takes no parameter, remove '8' passed
to the function in one case.
osd/ErasureCodePluginJerasure/galois.c: In function 'galois_w32_region_multiply':
osd/ErasureCodePluginJerasure/galois.c:696:5: warning: call to function 'galois_create_split_w8_tables' without a real prototype [-Wunprototyped-calls]
In file included from osd/ErasureCodePluginJerasure/galois.c:53:0:
osd/ErasureCodePluginJerasure/galois.h:71:12: note: 'galois_create_split_w8_tables' was declared here
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Samuel Just [Mon, 4 Nov 2013 05:02:36 +0000 (21:02 -0800)]
OSD: allow project_pg_history to handle a missing map
If we get a peering message for an old map we don't have, we
can throw it out: the sending OSD will learn about the newer
maps and update itself accordingly, and we don't have the
information to know if the message is valid. This situation
can only happen if the sender was down for a long enough time
to create a map gap and its PGs have not yet advanced from
their boot-up maps to the current ones, so we can rely on it
Fixes: #6712
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Samuel Just [Sun, 3 Nov 2013 19:06:10 +0000 (11:06 -0800)]
OSD: don't clear peering_wait_for_split in advance_map()
I really don't know why I added this... Ops can be discarded from the
waiting_for_pg queue if we aren't primary simply because there must have
been an exchange of peering events before subops will be sent within a
particular epoch. Thus, any events in the waiting_for_pg queue must be
client ops which should only be seen by the primary. Peering events, on
the other hand, should only be discarded if we are in a new interval,
and that check might as well be performed in the peering wq.
Fixes: #6681
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Samuel Just [Sat, 2 Nov 2013 20:54:51 +0000 (13:54 -0700)]
ReplicatedPG::recover_backfill: adjust last_backfill to HEAD if snapdir
Otherwise, if last_backfill_started is a snapdir, we will fail to send a
transaction for a client IO creating the head object and removing the
snapdir object. The result will be that head will eventually be
backfilled, but the snapdir object will erroneously not be removed.
Fixes: #6685
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Josh Durgin [Sat, 2 Nov 2013 02:02:29 +0000 (19:02 -0700)]
rbd: omit 'rw' option during map
The ro and rw options were added in Linux 3.7. To be compatible with
older kernels, don't specify rw. The default will probably always be
rw, so this should not present any problems in the future.
Greg Farnum [Fri, 1 Nov 2013 22:45:02 +0000 (15:45 -0700)]
OSDMonitor: be a little nicer about letting users do pg splitting
We were previously blocking pg splits whenever pg creations were in
progress, but we only really need to avoid splitting any pgs which are
currently being created. Let the user set a different pg_num if there
are no creating PGs on the pool in question.
Fixes: #6673, take two
Signed-off-by: Greg Farnum <greg@inktank.com>
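A minimal sketch of the relaxed pg split check described above, assuming the set of creating PGs is kept sorted by (pool, seed). PgId and can_change_pg_num are hypothetical names, not the OSDMonitor code:

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

// Hypothetical pg id: (pool id, placement seed).
using PgId = std::pair<std::int64_t, std::uint32_t>;

// Allow a pg_num change only if no PG in `pool` is still being created.
// Creating PGs that belong to other pools do not block the change.
bool can_change_pg_num(const std::set<PgId>& creating_pgs, std::int64_t pool) {
  assert(pool >= 0);
  // First creating PG at or after the start of this pool's range.
  auto it = creating_pgs.lower_bound(PgId(pool, 0));
  return it == creating_pgs.end() || it->first != pool;
}
```

The old behavior corresponds to refusing whenever creating_pgs is non-empty, regardless of pool.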
Samuel Just [Thu, 31 Oct 2013 20:19:32 +0000 (13:19 -0700)]
sharedptr_registry.hpp: removed ptrs need to not blast contents
See the included unit test update. Consider:
1) x = lookup_or_create(1, 1)
2) remove(1)
3) y = lookup_or_create(1, 2)
4) x.reset()
5) z = lookup(1)
The bug is that z will be null since x.reset() caused the
cleanup callback to remove y's key value from contents.
To fix this, contents also records the pointer value for
the weak_ptr. The removal callback only removes the
key from contents if it matches the ptr in contents.
This should work since the pointer passed to the removal
callback must be unique up to that point since it has
not yet been deleted.
This allowed a pg removal -> pg recreation -> pg removal
sequence to cause the second pg removal entry to be
erroneously cleared by the first pg removal's destructor
as it finally made its way through the removal queue.
Fixes: #5951
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
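A minimal sketch of the pointer-matching fix described in the sharedptr_registry commit above. This is a simplified, single-threaded stand-in: the Registry class is hypothetical, has no locking, and is not the real sharedptr_registry.hpp:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>

// Simplified, single-threaded registry (hypothetical; no locking).
// The registry must outlive every pointer it hands out.
template <class K, class V>
class Registry {
  // Each slot keeps the weak_ptr plus the raw pointer it was created with.
  std::map<K, std::pair<std::weak_ptr<V>, V*>> contents;

 public:
  std::shared_ptr<V> lookup(const K& key) {
    auto it = contents.find(key);
    if (it == contents.end())
      return nullptr;
    return it->second.first.lock();
  }

  std::shared_ptr<V> lookup_or_create(const K& key, V value) {
    if (auto existing = lookup(key))
      return existing;
    V* raw = new V(std::move(value));
    std::shared_ptr<V> p(raw, [this, key](V* ptr) {
      // Erase the slot only if it still belongs to this exact object;
      // a newer object registered under the same key is left alone.
      auto it = contents.find(key);
      if (it != contents.end() && it->second.second == ptr)
        contents.erase(it);
      delete ptr;
    });
    contents[key] = std::make_pair(std::weak_ptr<V>(p), raw);
    assert(lookup(key) == p);
    return p;
  }

  // Forget the key; objects already handed out stay alive.
  void remove(const K& key) { contents.erase(key); }
};
```

Replaying steps 1 to 5 from the message against this sketch, x.reset() runs the deleter for the first object, the pointer comparison fails, and the slot for y survives, so z is non-null.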
Noah Watkins [Wed, 30 Oct 2013 23:34:29 +0000 (16:34 -0700)]
prio-q: initialize cur iterator
For new SubQueues `cur` is not initialized, so front/pop_front will freak
out. I honestly have no idea how this hasn't been seen, but it was
being triggered frequently on OSX.
Fixes: #6686
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
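A minimal sketch of why the `cur` initialization in the prio-q fix above matters, using a toy round-robin queue over int items. The class below is a hypothetical reduction and is not the real PrioritizedQueue SubQueue:

```cpp
#include <cassert>
#include <list>
#include <map>

// Toy round-robin queue: items are grouped per client and served in turn.
class SubQueue {
  using Classes = std::map<int, std::list<int>>;
  Classes q;
  Classes::iterator cur;  // client whose item is served next

 public:
  // q is declared before cur, so it is constructed first. For an empty
  // map begin() == end(), which is a valid value to compare against.
  SubQueue() : cur(q.begin()) {}

  // Copying would leave cur pointing into the source's map.
  SubQueue(const SubQueue&) = delete;
  SubQueue& operator=(const SubQueue&) = delete;

  bool empty() const { return q.empty(); }

  void enqueue(int client, int item) {
    q[client].push_back(item);
    // Comparing an uninitialized iterator here would be undefined behavior.
    if (cur == q.end())
      cur = q.begin();
  }

  int front() const {
    assert(!q.empty());
    return cur->second.front();
  }

  void pop_front() {
    assert(!q.empty());
    cur->second.pop_front();
    if (cur->second.empty())
      cur = q.erase(cur);  // erase returns the following iterator
    else
      ++cur;
    if (cur == q.end())
      cur = q.begin();
  }
};
```

Without the constructor initializer, the first comparison against q.end() reads an indeterminate iterator, which may happen to work on one platform and fail on another.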
Samuel Just [Wed, 30 Oct 2013 23:54:39 +0000 (16:54 -0700)]
PGLog: remove obsolete assert in merge_log
This assert assumes that if olog.head != log.head, olog contains
a log entry at log.head, which may not be true since pg splitting
might have left the log with arbitrary holes.
Sage Weil [Tue, 29 Oct 2013 18:08:58 +0000 (11:08 -0700)]
upstart, sysvinit: use ceph-crush-location hook
Instead of hard-coding a check in ceph.conf and some reasonable
defaults, defer this work to ceph-crush-location, and allow users to
specify their own hook with alternative logic.
This can be helpful in a number of cases, like:
- rack (or other) information included in hostname and easily parsed
out by a hook
- multiple types of devices in each host, resulting in 'parallel'
crush trees (e.g., one for hdd, one for ssd)
Sage Weil [Tue, 29 Oct 2013 17:10:21 +0000 (10:10 -0700)]
mon/PGMonitor: always send pg creations after mapping
At some point in the dumpling cycle I separated the map stage from the
send stage. We can send the creates any time we have a non-zero osdmap
epoch, and are in good shape as long as we do the map step after the
osdmap is loaded (hence the post_paxos_update).
Some background:
We originally introduced the map-but-don't-send in a2fe0137, at which
point all was well because we only called it on ceph-mon startup.
Later, this turned into post_paxos_update in e635c478, at which point
it was now called by a running monitor, but we didn't add in the
send_pg_creates(). This is where this bug stems from.
This particular path is responsible for the stalled test referenced in
bug #6673.
Backport: dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
We can't adjust last_backfill to object x until x has been fully
backfilled. pending_backfill_updates contains all those backfills
started, but which have not yet been reflected in pinfo.last_update.
backfills_in_flight contains those backfills which have not yet
completed. Thus, we can adjust last_update to the largest entry
in pending_backfill_updates not in backfills_in_flight.
Samuel Just [Mon, 28 Oct 2013 22:53:24 +0000 (15:53 -0700)]
ReplicatedPG: replace backfill_pos with last_backfill_started
last_backfill_started reflects what pinfo.last_backfill will be
once all currently outstanding backfills complete. backfill_pos
was tricky since we couldn't correctly initialize it without
doing the first backfill scan pair.
In recover_backfill, we rescan from last_backfill_started rather
than from backfill_pos. This ensures that we capture all clones
created between last_backfill_started and what previously had been
backfill_pos without special handling in make_writeable. The main
downside is that we will tend to "rescan" last_backfill_started.
Sage Weil [Mon, 28 Oct 2013 22:56:15 +0000 (15:56 -0700)]
init-ceph: make crush update on osd start time out
If the monitor is not currently available, this crush update would block
forever, preventing the OSD and (potentially) the rest of the system
from starting up. Instead, make it time out after 10 seconds and then
abort startup. This prevents startup of an OSD if we failed to update
the CRUSH position for some reason.
In fact, do not start up the OSD if the CRUSH update fails for any
reason--not just a timeout!
Works-around: #5612
Signed-off-by: Sage Weil <sage@inktank.com>
Samuel Just [Mon, 28 Oct 2013 18:02:34 +0000 (11:02 -0700)]
ReplicatedBackend: don't hold ObjectContexts in pull completion callback
Flushing the sequencer must guarantee that all Contexts which hold
ObjectContextRefs have been run or deleted.
C_ReplicatedBackend_OnPullComplete, however, gets queued in a second
work queue in order to avoid performing expensive push related reads
in the FileStore finisher.
Rather than keep the object contexts around, we instead put off
removing the object from the pulling map until the callback
fires and read the object context out of the pulling map. This
way the ObjectContextRef will be cleaned up along with the rest
of the pulling map in on_change.
Samuel Just [Sat, 26 Oct 2013 00:58:10 +0000 (17:58 -0700)]
ReplicatedPG: have make_writeable adjust backfill_pos
If we are writing to backfill_pos and create a clone, we end
up failing to send the transaction creating the clone to the
backfill peer. This is fine as long as we end up backfilling
the clone. To that end, we simply add the clone to
backfill_info and adjust backfill_pos accordingly. This is less
brittle than the waiting_for_backfill_pos mechanism since it
works even if we wait between that check and issuing the repop,
which can happen for copy_from.
Samuel Just [Sat, 26 Oct 2013 23:52:16 +0000 (16:52 -0700)]
ReplicatedPG,osd_types: move rw tracking from its own map to ObjectContext
We also modify recovering to hold a reference to the recovering obc
in order to ensure that our backfill_read_lock doesn't outlive the
obc.
ReplicatedPG::op_applied no longer clears repop->obc since we need
it to live until the op is finally cleaned up. This is fine since
repop->obc is now an ObjectContextRef and can clean itself up.
Greg Farnum [Wed, 23 Oct 2013 18:28:45 +0000 (11:28 -0700)]
ReplicatedPG: take and drop read locks when doing backfill
All our interfaces are in place, so now we can actually take and
drop the locks.
1) Take locks in ReplicatedPG::recover_backfill. This is the entry
into the backfill code path, and covers all objects which are
added to backfills_in_flight (via prep_backfill_object_push()). If we
can't get the lock right away, we stop the backfill movement there
until we can do so.
2) Drop the locks in ReplicatedPG::on_peer_recover(), called when the
push is completed.
2b) Further drop the locks on all backfills_in_flight objects in
_clear_recovery_state(), for when we cancel peering.