+++ /dev/null
-
-underlying client capabilities:
-
-- read + cache
-- read sync
-- write sync
-- write + buffer
- (...potentially eventually augmented by byte ranges)
-
-whatever system of modes, tokens, etc. has to satisfy the basic
-constraint that no conflicting capabilities are ever in the
-hands of clients.
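-
-a minimal sketch of that constraint as a pairwise conflict check. the
-capability names and the exact conflict rules below are assumptions for
-illustration, not from these notes:
-
```cpp
#include <cassert>

// Hypothetical capability bits (names assumed):
enum {
  CAP_RDCACHE  = 1,  // read + cache locally
  CAP_RDSYNC   = 2,  // read, sync (no caching)
  CAP_WRSYNC   = 4,  // write, sync (write-through)
  CAP_WRBUFFER = 8,  // write + buffer locally
};

// A cached reader conflicts with any writer; a buffering writer
// conflicts with anyone else touching the file. Sync readers and
// sync writers can coexist.
inline bool caps_conflict(int a, int b) {
  bool a_writes = a & (CAP_WRSYNC | CAP_WRBUFFER);
  bool b_writes = b & (CAP_WRSYNC | CAP_WRBUFFER);
  if ((a & CAP_RDCACHE) && b_writes) return true;
  if ((b & CAP_RDCACHE) && a_writes) return true;
  int any = CAP_RDCACHE | CAP_RDSYNC | CAP_WRSYNC | CAP_WRBUFFER;
  if ((a & CAP_WRBUFFER) && (b & any)) return true;
  if ((b & CAP_WRBUFFER) && (a & any)) return true;
  return false;
}
```
-
-whatever token protocol is chosen, the authority would only issue a
-capability after checking it against everything already outstanding.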
-
-
-questions:
-- is there any use to clients writing to a replica?
- - reading, yes.. e.g. 100,000 clients open the same file..
-
-
-------
-
-simplest approach:
-- all readers, writers go through authority
-- all open, close traffic at replicas forwarded to auth
-
-- fh state migrates with exports.
-
-
-
---------
-
-less simple:
-- all writers go through authority
- - open, close traffic fw
-- readers from any replica
- - need token from auth
-- weird auth <-> replica <-> client interactions ensue!
-
-
---------
-
-even more complex (and totally FLAWED, ignore this!)
-
-- clients can open a file with any replica (for read or write).
-- replica gets a read or write token from the primary
- - primary thus knows if it's all read, all write, mixed, or none.
-- once replica has a token it can service as many clients (of given type(s)) as it wants.
-- on export, tokens are moved too.
- - primary gives _itself_ a token too! much simpler.
-
-- clients maintain a mode for each open file: rdonly, wronly, rdwr, lock
-- globally, the mode is controlled by the primary, based on the mixture of
- read and write tokens issued
-
-
-
-- [optional] if a client has a file open rdwr and the mode is rdonly or wronly, it can
- request to read or write from the mds (which might twiddle the mode for performance
- reasons.. e.g. lots of ppl rdwr but no actual reading)
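-
-a sketch of how the primary might derive the global mode from the token
-mixture. the mapping below is an assumption ("lock" is left out, on the
-guess that it is entered explicitly rather than derived from counts):
-
```cpp
#include <cassert>

// Hypothetical global file modes (subset of the rdonly/wronly/rdwr/lock
// modes mentioned above).
enum FileMode { MODE_RDONLY, MODE_WRONLY, MODE_RDWR };

// Derive the mode from the mixture of issued read/write tokens.
inline FileMode mode_from_tokens(int read_tokens, int write_tokens) {
  if (write_tokens == 0) return MODE_RDONLY;  // readers only (or idle)
  if (read_tokens == 0)  return MODE_WRONLY;  // writers only
  return MODE_RDWR;                           // mixed readers + writers
}
```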
-
-
-
-
---------
-
-
+++ /dev/null
-
-
-- LogEvent.replay() must be idempotent, since we won't know whether the update is old or not.
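-
-one way to get that idempotence is to version everything: an event only
-applies if it is newer than what the cache already holds, so replaying
-an old or duplicate event is a no-op. a minimal sketch (names assumed):
-
```cpp
#include <map>
#include <string>
#include <cassert>

typedef unsigned long long version_t;

struct Cache {
  std::map<std::string, version_t> inode_version;  // ino -> last applied

  // Apply an event; returns true only if it actually changed state.
  bool replay(const std::string& ino, version_t v) {
    version_t& cur = inode_version[ino];  // default-constructs to 0
    if (v <= cur) return false;           // old or duplicate update: ignore
    cur = v;
    return true;
  }
};
```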
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-journal is distributed among different nodes. because authority changes over time, it's not immediately clear to a recovering node replaying the journal whether the data is "real" or not (it might be exported later in the journal).
-
-
-possibilities:
-
-
-ONE.. bloat the journal!
-
-- journal entry includes full trace of dirty data (dentries, inodes) up until import point
- - local renames implicit.. cache is reattached on replay
- - exports are a list of exported dirs.. which are then dumped
- ...
-
-recovery phase 1
-- each entry includes full trace (inodes + dentries) up until the import point
-- cache during recovery is fragmented/dangling beneath import points
-- when export is encountered items are discarded (marked clean)
-
-recovery phase 2
-- import roots ping store to determine attachment points (if not already known)
- - if it was imported during period, attachment point is already known.
- - renames affecting imports are logged too
-- import roots discovered from other nodes, attached to hierarchy
-
-then
-- maybe resume normal operations
-- if recovery is a background process on a takeover mds, "export" everything to that node.
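-
-the two recovery phases above can be sketched like this (a toy model
-with assumed names; real subtrees would carry dentries and inodes, not
-just root paths):
-
```cpp
#include <set>
#include <string>
#include <cassert>

// Phase 1: replay journal entries, tracking dangling subtrees by import
// root. An export encountered later in the journal discards the subtree
// (that data is no longer ours, so it is marked clean/forgotten).
struct Recovery {
  std::set<std::string> import_roots;  // dirs we still believe we own

  void replay_import(const std::string& root) { import_roots.insert(root); }
  void replay_export(const std::string& root) { import_roots.erase(root); }

  // Phase 2: whatever survives replay must be pinged in the store /
  // discovered from other nodes and attached to the hierarchy.
  const std::set<std::string>& roots_to_attach() const {
    return import_roots;
  }
};
```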
-
-
--> journal contains lots of clean data.. maybe 5+ times bigger as a result!
-
-possible fixes:
- - collect dir traces into journal chunks so they aren't repeated as often
- - each chunk summarizes traces in previous chunk
- - hopefully next chunk will include many of the same traces
- - if not, then the entry will include it
-
-
-
-
-=== log entry types ===
-- all inode, dentry, dir items include a dirty flag.
-- dirs are implicitly _never_ complete; even if they are, a fetch before commit is necessary to confirm
-
-ImportPath - log change in import path
-Import - log import addition (w/ path, dirino)
-
-InoAlloc - allocate ino
-InoRelease - release ino
-
-Inode - inode info, along with dentry+inode trace up to import point
-Unlink - (null) dentry + trace, + flag (whether inode/dir is destroyed)
-Link - (new) dentry + inode + trace
-
-
------------------------------
-
-TWO..
-- directories in store contain path at time of commit (relative to import, and root)
-- replay without attaching anything to hierarchy
-- after replay, directories pinged in store to attach to hierarchy
-
--> phase 2 too slow!
--> and nested dirs may reattach... that won't be apparent from journal.
- - put just parent dir+dentry in dir store.. even worse on phase 2!
-
-
-THREE
--
-
-
-
-
-
-
-
-metadata journal/log
-
-
-event types:
-
-chown, chmod, utime
- InodeUpdate
-
-mknod, mkdir, symlink
- Mknod .. new inode + link
-
-unlink, rmdir
- Unlink
-
-rename
- Link + Unlink (foreign)
-or Rename (local)
-
-link
- Link .. link existing inode
-
-
-
-
-InodeUpdate
-DentryLink
-DentryUnlink
-InodeCreate
-InodeDestroy
-Mkdir?
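-
-the op -> event mapping above, as a sketch (string names stand in for
-the real event classes; purely illustrative):
-
```cpp
#include <string>
#include <vector>
#include <cassert>

// Map an fs operation to the journal event type(s) it would log.
// A foreign rename is Link + Unlink; a local one is a single Rename.
std::vector<std::string> events_for(const std::string& op,
                                    bool local = true) {
  if (op == "chown" || op == "chmod" || op == "utime")
    return {"InodeUpdate"};
  if (op == "mknod" || op == "mkdir" || op == "symlink")
    return {"Mknod"};          // new inode + link
  if (op == "unlink" || op == "rmdir")
    return {"Unlink"};
  if (op == "link")
    return {"Link"};           // link existing inode
  if (op == "rename") {
    if (local) return {"Rename"};
    return {"Link", "Unlink"};
  }
  return {};
}
```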
+++ /dev/null
-
-
-SOME GENERAL REQUIREMENTS
-
-- cluster expansion:
- - any or all of the replicas may move to new OSDs.
-
-- cluster map may change frequently
- - map change should translate into pending replication/migration
- state quickly (or better yet, instantly), so that we could push
- through a series of (say, botched) maps quickly and be fine, so long
- as the final map is correct.
-
-- ideally, unordered osd<->osd, client<->osd communication
- (mds<->mds, client<->mds communication is ordered, but file i/o
- would be too slow that way?)
-
-
-
-
-PRIMARY ONLY PICTURE
-
-let's completely ignore replication for a while, and see how
-complicated the picture needs to be to reliably support cluster expansion.
-
-typedef __uint64_t version_t;
-
-
-per-Object metadata:
-- version #. incremented when an object is modified.
- e.g. version_t version;
-- on primary, keep list of stray replicas
- e.g. map<int,version_t> stray_replicas; // osds w/ stray replicas
- includes old primary osd(s), until deletion is confirmed. used while the RG
- is importing.
-
-
-per-RG metadata
-- object list. well, a method to fetch it by querying a collection or whatever.
-- negative <object,version> list
- e.g. map<object_t, version_t> deleted_objects;
- - used to enumerate deleted objects, when in "importing" state.
-- a RG "state" (enum/int)
-
-
-
-
-
-
-Normal RG state:
-- role=primary
- clean - i am primary, all is well. no stray copies. i can
- discard my negative object list, since my local
- object store tells me everything.
-
-
-After a map change:
-- new primary
- undef - initially; i don't know RG exists.
-- old primary
- homeless - i was primary, still have unmolested data. new primary is not yet migrating
- (presumably it's state=undef.) i need to contact new primary and tell them
- this RG exists.
-
-- new primary
- importing - i am migrating data from old primary. keep negative dir entries for deletions.
- write locally. proxy reads (force immediate migration). do whole objects
- initially (on write, block until i migrate the object). later we can do
- sub-object state (where "live" object data is spread across new/old primaries..)
-- old primary
- exporting - primary is migrating my data.
- undef - when it finishes. (i will forget this RG existed.)
-
-
-After a second map change (scenario 1):
- as above, if we were clean again.
-
-After a second map change (scenario 2):
- we weren't clean yet.
-- new primary
- undef - initially (until i learn RG exists)
-- old primary
- importing - i'm still migrating from old old primary
-- old old primary
- exporting - ...
-- old primary
-?? importing+exporting - proxy reads as before. continue migrating from old old primary.
-
-
-After a second map change (scenario 3):
- we weren't clean yet, and old old primary is also new primary
-- new primary (and old old primary)
- exporting - change state to importing. be sure to compare object versions, and neg dir
- entries (as we always should do, really!).
-- old primary
- importing - note that the old import source matches new primary, and change
- state to exporting, and stop importing. (unlike scenario 2)
-
--> this approach could mean that a series of fast map changes could
- force data to migrate down a "chain" of old primaries to reach the
- new one. maybe old primary should go from importing -> exporting,
- and pass along old old primary id to new primary such that the
- import is a many-to-one thing, instead of one-to-one. version
- numbers and neg entries will make it easy to pick out correct versions.
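-
-a sketch of that many-to-one handoff (assumed names): on a second map
-change, the old primary hands its unfinished import sources to the new
-primary instead of letting data hop down the chain:
-
```cpp
#include <set>
#include <cassert>

struct RGImport {
  std::set<int> sources;  // osds we are pulling this RG from

  void start_import_from(int osd) { sources.insert(osd); }

  // Called on the new primary when the (still-importing) old primary
  // reports in: absorb its pending sources, and add the old primary
  // itself, since it holds whatever it already migrated.
  void absorb(const RGImport& old_primary, int old_primary_osd) {
    sources.insert(old_primary.sources.begin(), old_primary.sources.end());
    sources.insert(old_primary_osd);
  }
};
```
-
-object versions and negative entries then disambiguate when the same
-object turns up at more than one source.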
-
-
-
-For the importing process on a given RG:
-
-- metadata for each source
- - each source has a state:
- 'starting' - don't know anything about source yet. query source!
- this probably induces the source to change from
- 'homeless' or something similar to 'exporting'.
- 'importing' - i've fetched the source's object list (and neg
- object list). i'm busy reading them! these lists
- will shrink as the process continues. after i fetch
- an object, i will erase it from the source.
- (object metadata will include stray copy info
- until i confirm that it's removed.)
- 'finishing' - i've read all my data, and i'm telling the old person
- to discard any remaining RG metadata (RG contents
- should already be gone)
- - unmigrated object list
- - migrated but not deleted object list
- - stray osd is also listed in per-object MD during this stage
- - negative object list
- - i can remove these items if i see a newer object version (say,
- from a different import source or something).
- i can remove any local objects, or ignore imported ones, if they are
- older than the deleted version
-
-- the lists should be sets or otherwise queryable so that while i'm
- importing and a real op comes through I can quickly determine if a
- given object_id is pending migration etc or if my local store is to
- be trusted.
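-
-that fast "is my local store authoritative?" check might look like this
-(a sketch with assumed names; versions elided from the negative list):
-
```cpp
#include <set>
#include <cassert>

typedef unsigned long long object_t;

// While importing, an incoming op first checks whether the object is
// still pending migration (local store not authoritative yet) or was
// deleted at the source (negative entry beats any local copy).
struct ImportState {
  std::set<object_t> unmigrated;  // still on the source
  std::set<object_t> deleted;     // negative entries

  bool local_store_is_authoritative(object_t oid) const {
    return !unmigrated.count(oid) && !deleted.count(oid);
  }
};
```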
-
-
-
-
-
-SOME CODE BITS
-
-
-#include <map>
-#include <set>
-#include <list>
-using namespace std;
-
-typedef __uint64_t version_t;
-typedef __uint64_t object_t;   // opaque object id
-
-class Object {
-  version_t version;                   // bumped on every modification
-  map<int, version_t> stray_replicas;  // osds w/ stray replicas
-};
-
-// defined before ReplicaGroup, which holds it by value
-struct RGExporter_t {
-  int import_state;
-
-  set<object_t> remaining_objects;   // remote object list
-  set<object_t> stray_objects;       // imported but not deleted.
-};
-
-class ReplicaGroup {
-  int enumerate_objects(list<object_t>& ls);
-
-  int state;
-
-  // for unstable states,
-  map<object_t, version_t> deleted_objects;  // deleted locally
-  map<int, RGExporter_t> exporters;          // importing from these guys.
-};
-
-// primary
-#define RG_STATE_CLEAN      1
-#define RG_STATE_IMPORTING  2  // pulling data
-
-// non-primary
-#define RG_STATE_HOMELESS   5  // old primary; new primary not yet
-                               // notified; not yet exporting.
-#define RG_STATE_EXPORTING  6  // a newer primary is extracting my
-                               // data.
-
-
-
-
-
-----
-all crap from here on down
-
-
-
-
-REPLICAS
--
-
-
-
-
-OSD STATES
-- primary, up to date.
-- replica, up to date.
-
-- primary, proxy to old primary (primaries?)
-
-- replica, not up to date.
-
-
-REPLICATION STUFF
-
-Per-RG metadata
-- primary
- - per-replica state: clean, catching up?
-- replica
-
-Per-object metadata
-- primary and replica
- - version number/mtime
- - rg (reverse indexed)
-- primary
- - replication level and state.
- - committed to memory and/or disk, on which replicas (#1, #2, etc.)
-- replica
-
-
-
-
-
-->
\ No newline at end of file