From b856a2fe5bd016750445f9f91ba88923c114ea77 Mon Sep 17 00:00:00 2001
From: sageweil
Date: Tue, 27 Nov 2007 20:26:50 +0000
Subject: [PATCH] removed some old outdated docs

git-svn-id: https://ceph.svn.sf.net/svnroot/ceph@2121 29311d96-e01e-0410-9327-a35deaab8ce9
---
 trunk/ceph/doc/dentries.txt        |   4 -
 trunk/ceph/doc/file_modes.txt      |  66 ---------
 trunk/ceph/doc/journal.txt         | 124 ----------------
 trunk/ceph/doc/osd_outline.txt     |  37 -----
 trunk/ceph/doc/osd_replication.txt | 226 -----------------------------
 trunk/ceph/doc/shutdown.txt        |  13 --
 6 files changed, 470 deletions(-)
 delete mode 100644 trunk/ceph/doc/dentries.txt
 delete mode 100644 trunk/ceph/doc/file_modes.txt
 delete mode 100644 trunk/ceph/doc/journal.txt
 delete mode 100644 trunk/ceph/doc/osd_outline.txt
 delete mode 100644 trunk/ceph/doc/osd_replication.txt
 delete mode 100644 trunk/ceph/doc/shutdown.txt

diff --git a/trunk/ceph/doc/dentries.txt b/trunk/ceph/doc/dentries.txt
deleted file mode 100644
index ab14765998b2f..0000000000000
--- a/trunk/ceph/doc/dentries.txt
+++ /dev/null
@@ -1,4 +0,0 @@
-
-null dentries only exist
- - on auth
- - on replica, if they are xlock
\ No newline at end of file
diff --git a/trunk/ceph/doc/file_modes.txt b/trunk/ceph/doc/file_modes.txt
deleted file mode 100644
index d4ceba4034e5f..0000000000000
--- a/trunk/ceph/doc/file_modes.txt
+++ /dev/null
@@ -1,66 +0,0 @@
-
-underlying client capabilities:
-
-- read + cache
-- read sync
-- write sync
-- write + buffer
-  (...potentially eventually augmented by byte ranges)
-
-whatever system of modes, tokens, etc. has to satisfy the basic
-constraint that no conflicting capabilities are ever in the
-hands of clients.
-
-questions:
-- is there any use to clients writing to a replica?
-  - reading, yes.. 100,000 open same file..
-
-------
-
-simplest approach:
-- all readers, writers go through authority
-- all open, close traffic at replicas forwarded to auth
-- fh state migrates with exports.
-
---------
-
-less simple:
-- all writers go through authority
-  - open, close traffic forwarded
-- readers from any replica
-  - need token from auth
-- weird auth <-> replica <-> client interactions ensue!
-
---------
-
-even more complex (and totally FLAWED, ignore this!)
-
-- clients can open a file with any replica (for read or write).
-- replica gets a read or write token from the primary
-  - primary thus knows if it's all read, all write, mixed, or none.
-- once a replica has a token it can service as many clients (of the given type(s)) as it wants.
-- on export, tokens are moved too.
-  - primary gives _itself_ a token too!  much simpler.
-
-- clients maintain a mode for each open file: rdonly, wronly, rdwr, lock
-- globally, the mode is controlled by the primary, based on the mixture of
-  read and write tokens issued
-
-- [optional] if a client has a file open rdwr and the mode is rdonly or wronly, it can
-  request to read or write from the mds (which might twiddle the mode for performance
-  reasons.. e.g. lots of ppl rdwr but no actual reading)
-
---------
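A note on the deleted file_modes.txt: its "no conflicting capabilities" constraint
reduces to a pairwise compatibility check over the four capability bits. A rough
sketch of that check follows; names and bit layout are hypothetical, not the
actual Ceph client-capability code.

#include <cassert>

// hypothetical capability bits mirroring the four modes listed above
enum {
  CAP_RDCACHE  = 1,   // read + cache
  CAP_RDSYNC   = 2,   // read sync
  CAP_WRSYNC   = 4,   // write sync
  CAP_WRBUFFER = 8,   // write + buffer
};

// two grants are compatible iff no client may cache stale data while another
// writes, and no client may buffer writes while anyone else reads or writes
bool compatible(int a, int b) {
  bool a_writes = a & (CAP_WRSYNC | CAP_WRBUFFER);
  bool b_writes = b & (CAP_WRSYNC | CAP_WRBUFFER);
  if ((a & CAP_RDCACHE) && b_writes) return false;   // cached reads vs. a writer
  if ((b & CAP_RDCACHE) && a_writes) return false;
  if ((a & CAP_WRBUFFER) && b != 0) return false;    // buffered writer is exclusive
  if ((b & CAP_WRBUFFER) && a != 0) return false;
  return true;
}

int main() {
  assert(compatible(CAP_RDCACHE, CAP_RDCACHE));    // many caching readers: fine
  assert(!compatible(CAP_RDCACHE, CAP_WRSYNC));    // cache vs. sync writer: no
  assert(compatible(CAP_RDSYNC, CAP_WRSYNC));      // sync reader + sync writer: ok
  assert(!compatible(CAP_WRBUFFER, CAP_RDSYNC));   // buffered writer excludes all
}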
diff --git a/trunk/ceph/doc/journal.txt b/trunk/ceph/doc/journal.txt
deleted file mode 100644
index 22cb4fc9e21b2..0000000000000
--- a/trunk/ceph/doc/journal.txt
+++ /dev/null
@@ -1,124 +0,0 @@
-
-- LogEvent.replay() is idempotent.  we won't know whether the update is old or not.
-
-journal is distributed among different nodes.
-because authority changes over time, it's not immediately clear to a
-recovering node replaying the journal whether the data is "real" or not
-(it might be exported later in the journal).
-
-possibilities:
-
-ONE.. bloat the journal!
-
-- journal entry includes full trace of dirty data (dentries, inodes) up until import point
-  - local renames implicit.. cache is reattached on replay
-  - exports are a list of exported dirs.. which are then dumped
-  ...
-
-recovery phase 1
-- each entry includes full trace (inodes + dentries) up until the import point
-- cache during recovery is fragmented/dangling beneath import points
-- when an export is encountered, items are discarded (marked clean)
-
-recovery phase 2
-- import roots ping store to determine attachment points (if not already known)
-  - if it was imported during the period, the attachment point is already known.
-  - renames affecting imports are logged too
-- import roots discovered from other nodes, attached to hierarchy
-
-then
-- maybe resume normal operations
-- if recovery is a background process on a takeover mds, "export" everything to that node.
-
--> journal contains lots of clean data.. maybe 5+ times bigger as a result!
-
-possible fixes:
-  - collect dir traces into journal chunks so they aren't repeated as often
-  - each chunk summarizes traces in the previous chunk
-    - hopefully the next chunk will include many of the same traces
-    - if not, then the entry will include it
-
-=== log entry types ===
-- all inode, dentry, dir items include a dirty flag.
-- dirs are implicitly _never_ complete; even if they are, a fetch before commit is necessary to confirm
-
-ImportPath - log change in import path
-Import - log import addition (w/ path, dirino)
-
-InoAlloc - allocate ino
-InoRelease - release ino
-
-Inode - inode info, along with dentry+inode trace up to import point
-Unlink - (null) dentry + trace, + flag (whether inode/dir is destroyed)
-Link - (new) dentry + inode + trace
-
------------------------------
-
-TWO..
-- directories in store contain path at time of commit (relative to import, and root)
-- replay without attaching anything to hierarchy
-- after replay, directories pinged in store to attach to hierarchy
-
--> phase 2 too slow!
--> and nested dirs may reattach... that won't be apparent from journal.
-   - put just parent dir+dentry in dir store.. even worse on phase 2!
-
-THREE
--
-
-metadata journal/log
-
-event types:
-
-chown, chmod, utime
-  InodeUpdate
-
-mknod, mkdir, symlink
-  Mknod .. new inode + link
-
-unlink, rmdir
-  Unlink
-
-rename
-  Link + Unlink (foreign)
-  or Rename (local)
-
-link
-  Link .. link existing inode
-
-InodeUpdate
-DentryLink
-DentryUnlink
-InodeCreate
-InodeDestroy
-Mkdir?
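The idempotency requirement at the top of journal.txt is what makes replay safe:
events carry versions, and an event at or below the cached version is skipped, so
the same journal segment can be applied any number of times. A minimal sketch,
with hypothetical types standing in for LogEvent and the MDS cache:

#include <cassert>
#include <cstdint>
#include <map>
#include <string>

typedef uint64_t version_t;

// stand-ins for the recovering mds's cache (illustration only)
std::map<std::string, version_t>   inode_version;
std::map<std::string, std::string> inode_data;

// replay is idempotent: a stale or duplicate event is a no-op
void replay_inode_update(const std::string& ino, version_t v,
                         const std::string& data) {
  auto it = inode_version.find(ino);
  if (it != inode_version.end() && it->second >= v)
    return;                  // already have this update (or newer): skip
  inode_version[ino] = v;    // newer: apply it
  inode_data[ino]    = data;
}

int main() {
  replay_inode_update("10000000001", 2, "new");
  replay_inode_update("10000000001", 1, "old");   // older event: ignored
  replay_inode_update("10000000001", 2, "new");   // duplicate: ignored
  assert(inode_data["10000000001"] == "new");
}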
diff --git a/trunk/ceph/doc/osd_outline.txt b/trunk/ceph/doc/osd_outline.txt
deleted file mode 100644
index 2c6f3287aac5f..0000000000000
--- a/trunk/ceph/doc/osd_outline.txt
+++ /dev/null
@@ -1,37 +0,0 @@
-
-intro
-
-osd cluster map
-  requirements
-  desirable properties
-  (c)rush
-
-failure detection
-  distributed ping or heartbeat
-  central filter, notifier
-
-design
-  placement seed, class/superset, groups
-
-normal operation
-  reads
-  writes
-
-recovery
-  triggers: failed disk, or total cluster reorganization
-  notify
-  peering
-  pull
-  push
-  clean
-
-writes during recovery
-
-graceful data loss + recovery?
-
diff --git a/trunk/ceph/doc/osd_replication.txt b/trunk/ceph/doc/osd_replication.txt
deleted file mode 100644
index 907d00e2050a2..0000000000000
--- a/trunk/ceph/doc/osd_replication.txt
+++ /dev/null
@@ -1,226 +0,0 @@
-
-SOME GENERAL REQUIREMENTS
-
-- cluster expansion:
-  - any or all of the replicas may move to new OSDs.
-
-- cluster map may change frequently
-  - map change should translate into pending replication/migration
-    state quickly (or better yet, instantly), so that we could push
-    through a series of (say, botched) maps quickly and be fine, so long
-    as the final map is correct.
-
-- ideally, unordered osd<->osd, client<->osd communication
-  (mds<->mds, client<->mds communication is ordered, but file i/o
-  would be too slow that way?)
-
-PRIMARY ONLY PICTURE
-
-let's completely ignore replication for a while, and see how
-complicated the picture needs to be to reliably support cluster expansion.
-
-typedef __uint64_t version_t;
-
-per-Object metadata:
-- version #.  incremented when an object is modified.
-  e.g. version_t version;
-- on primary, keep list of stray replicas
-  e.g. map<int,version_t> stray_replicas;  // osds w/ stray replicas
-  includes old primary osd(s), until deletion is confirmed.  used while rg
-  is importing.
-
-per-RG metadata
-- object list.  well, a method to fetch it by querying a collection or whatever.
-- negative list
-  e.g. map<object_t,version_t> deleted_objects;
-  - used to enumerate deleted objects, when in "importing" state.
-- a RG "state" (enum/int)
-
-Normal RG state:
-- role=primary
-  clean - i am primary, all is well.  no stray copies.  i can
-          discard my negative object list, since my local
-          object store tells me everything.
-
-After a map change:
-- new primary
-  undef - initially; i don't know RG exists.
-- old primary
-  homeless - i was primary, still have unmolested data.  new primary is not
-             yet migrating (presumably it's state=undef.)  i need to contact
-             new primary and tell them this RG exists.
-
-- new primary
-  importing - i am migrating data from old primary.  keep negative dir entries
-              for deletions.  write locally.  proxy reads (force immediate
-              migration).  do whole objects initially (on write, block until i
-              migrate the object).  later we can do sub-object state (where
-              "live" object data is spread across new/old primaries..
-- old primary
-  exporting - primary is migrating my data.
-  undef - when it finishes.  (i will forget this RG existed.)
-
-After a second map change (scenario 1):
-  as above, if we were clean again.
-
-After a second map change (scenario 2):
-  we weren't clean yet.
-- new primary
-  undef - initially (until i learn RG exists)
-- old primary
-  importing - i'm still migrating from old old primary
-- old old primary
-  exporting - ...
-- old primary
-  ?? importing+exporting - proxy reads as before.  continue migrating from
-                           old old primary.
-
-After a second map change (scenario 3):
-  we weren't clean yet, and old old primary is also new primary
-- new primary (and old old primary)
-  exporting - change state to importing.  be sure to compare object versions,
-              and neg dir entries (as we always should do, really!).
-- old primary
-  importing - note that the old import source matches new primary, and change
-              state to exporting, and stop importing.  (unlike scenario 2)
-
--> this approach could mean that a series of fast map changes could
-   force data to migrate down a "chain" of old primaries to reach the
-   new one.  maybe old primary should go from importing -> exporting,
-   and pass along old old primary id to new primary such that the
-   import is a many-to-one thing, instead of one-to-one.  version
-   numbers and neg entries will make it easy to pick out correct versions.
-
-For the importing process on a given RG:
-
-- metadata for each source
-  - each source has a state:
-    'starting' - don't know anything about source yet.  query source!
-                 this probably induces the source to change from
-                 'homeless' or something similar to 'exporting'.
-    'importing' - i've fetched the source's object list (and neg
-                  object list).  i'm busy reading them!  these lists
-                  will shrink as the process continues.  after i fetch
-                  an object, i will erase it from the source.
-                  (object metadata will include stray copy info
-                  until i confirm that it's removed.)
-    'finishing' - i've read all my data, and i'm telling the old person
-                  to discard any remaining RG metadata (RG contents
-                  should already be gone)
-  - unmigrated object list
-  - migrated but not deleted object list
-    - stray osd is also listed in per-object MD during this stage
-  - negative object list
-    - i can remove these items if i see a newer object version (say,
-      from a different import source or something).
-    - i can remove any local objects or ignore imported ones if it is
-      older than the deleted version
-
-- the lists should be sets or otherwise queryable so that while i'm
-  importing and a real op comes through i can quickly determine if a
-  given object_id is pending migration etc or if my local store is to
-  be trusted.
-
-SOME CODE BITS
-
-typedef __uint64_t version_t;
-class Object {
-  version_t version;
-  map<int,version_t> stray_replicas;   // osds w/ stray replicas
-};
-
-class ReplicaGroup {
-  int enumerate_objects(list<object_t>& ls);
-
-  int state;
-
-  // for unstable states,
-  map<object_t,version_t> deleted_objects;  // locally
-  map<int,RGExporter_t>   exporters;        // importing from these guys.
-};
-
-// primary
-#define RG_STATE_CLEAN      1
-#define RG_STATE_IMPORTING  2   // pulling data
-
-// non-primary
-#define RG_STATE_HOMELESS   5   // old primary; new primary not yet
-                                // notified; not yet exporting.
-#define RG_STATE_EXPORTING  6   // a newer primary is extracting my
-                                // data.
-
-struct RGExporter_t {
-  int import_state;
-
-  set<object_t> remaining_objects;  // remote object list
-  set<object_t> stray_objects;      // imported but not deleted.
-};
-
-----
-all crap from here on down
-
-REPLICAS
--
-
-OSD STATES
-- primary, up to date.
-- replica, up to date.
-
-- primary, proxy to old primary (primaries?)
-
-- replica, not up to date.
-
-REPLICATION STUFF
-
-Per-RG metadata
-- primary
-  - per-replica state: clean, catching up?
-- replica
-
-Per-object metadata
-- primary and replica
-  - version number/mtime
-  - rg (reverse indexed)
-- primary
-  - replication level and state.
-  - committed to memory and/or disk, on which replicas (#1, #2, etc.)
-- replica
-
-->
\ No newline at end of file
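The import-chaining discussion in osd_replication.txt is, at heart, a per-RG
state machine driven by map changes. A toy sketch of the scenario-3 shortcut
follows; names are hypothetical and real peering is far more involved.

#include <cassert>
#include <set>

enum rg_state_t { RG_UNDEF, RG_CLEAN, RG_IMPORTING, RG_HOMELESS, RG_EXPORTING };

struct RG {
  rg_state_t state = RG_UNDEF;
  std::set<int> import_sources;   // osds this RG is still pulling from
};

void handle_map_change(RG& rg, int whoami, int old_primary, int new_primary) {
  if (new_primary == whoami) {
    rg.state = RG_IMPORTING;             // just became primary: pull from last holder
    if (old_primary != whoami)
      rg.import_sources.insert(old_primary);
  } else if (rg.state == RG_IMPORTING &&
             rg.import_sources.count(new_primary)) {
    // scenario 3: our import source is primary again.  stop importing and let
    // it pull back whatever we already copied, instead of chaining migrations.
    rg.import_sources.erase(new_primary);
    rg.state = RG_EXPORTING;
  } else if (rg.state == RG_CLEAN) {
    rg.state = RG_HOMELESS;              // intact data, waiting to be drained
  }
}

int main() {
  RG rg;
  handle_map_change(rg, /*whoami=*/1, /*old_primary=*/0, /*new_primary=*/1);
  assert(rg.state == RG_IMPORTING);
  handle_map_change(rg, 1, 1, 0);        // map flips back: old source is primary
  assert(rg.state == RG_EXPORTING);
}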
diff --git a/trunk/ceph/doc/shutdown.txt b/trunk/ceph/doc/shutdown.txt
deleted file mode 100644
index e5ccde3171004..0000000000000
--- a/trunk/ceph/doc/shutdown.txt
+++ /dev/null
@@ -1,13 +0,0 @@
-
-- mds0 triggers shutdown by sending a shutdown_start to all nodes.
-
-- from here on out, all client requests are discarded (unless they are a file close?)
-
-- each mds checks for outstanding inter-mds transactions, e.g. imports, discoveries, etc.
-  once they're all done, send a shutdown_ready to mds0
-
-- each mds successively disassembles its cache, flushing data to long-term storage,
-  sending inode expires, and exporting imported dirs to their parents (after they're
-  clean + empty)
-
-- when the cache is empty, send shutdown_done to mds0 and exit.
-
-- mds0 exits when all mdss have finished.
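The sequence in shutdown.txt amounts to a two-phase barrier coordinated by mds0:
every node reports shutdown_ready, then every node reports shutdown_done. A small
sketch of the bookkeeping that implies (hypothetical names, not the actual MDS code):

#include <cassert>
#include <set>

struct ShutdownCoordinator {
  int num_mds;
  std::set<int> ready, done;

  explicit ShutdownCoordinator(int n) : num_mds(n) {}

  // returns true once every mds (including mds0 itself) is ready
  bool handle_shutdown_ready(int from) {
    ready.insert(from);
    return (int)ready.size() == num_mds;
  }
  // returns true once every mds has emptied its cache and exited
  bool handle_shutdown_done(int from) {
    done.insert(from);
    return (int)done.size() == num_mds;
  }
};

int main() {
  ShutdownCoordinator c(3);
  assert(!c.handle_shutdown_ready(0));
  assert(!c.handle_shutdown_ready(1));
  assert(c.handle_shutdown_ready(2));   // all ready: caches may be torn down
  c.handle_shutdown_done(1);
  c.handle_shutdown_done(2);
  assert(c.handle_shutdown_done(0));    // all done: mds0 exits
}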
--
2.39.5