+++ /dev/null
-
-all: index.html overview.html publications.html source.html tasks.html
-
-%.html: %.body template.html
- ./gen.pl $< > $@
+++ /dev/null
-html, body, table {
- margin: 0;
- padding: 0;
- font-family: Verdana, sans-serif;
- font-size: 12px;
-}
-a { text-decoration: none; }
-a:hover { text-decoration: underline; }
-#banner {
- width: 100%;
- margin: 0 0 10px 0;
- padding: 0;
- text-align: left;
- background-color: #ffffff;
- border-bottom: 1px solid #999999;
- font-family: Verdana, "Arial Black", Arial, sans-serif;
- color: white;
-}
-#banner div {
- margin: 0;
- border-bottom: 1px solid #333333;
-}
-#banner div h1 {
- margin: 0;
- padding: 5px;
- font-weight: bold;
- font-size: 40px;
-}
-small { font-size: 10px; }
-#navcolumn {
- vertical-align: top;
- margin: 0;
- padding: 0 5px 0 8px;
-}
-.navsegment {
- padding: 0;
- border: 1px solid #C9D2DC;
- background-color: #EEEEFF;
- margin-bottom: 10px;
-}
-.navsegment h4 {
- margin: 0 0 5px 0;
- font-family: Arial, sans-serif;
- font-weight: bold;
- color: #000066;
- background-color: #C9D2EC;
- border-top: 1px solid white;
- border-left: 1px solid white;
- border-bottom: 1px solid #93A4B7;
- border-right: 0;
- padding: 2px 0 2px 5px;
-}
-.navsegment ul {
- list-style: none;
- margin: 0;
- padding: 0 1px 0 7px;
- font-size: 12px;
-}
-.navsegment ul li {
- white-space: nowrap;
- padding: 0;
- margin: 0 0 5px 0;
-}
-#maincolumn {
- width: 100%;
- vertical-align: top;
- margin: 0;
- padding: 0 8px 0 5px;
-}
-.mainsegment {
- margin: 0 0 10px 0;
- padding: 0;
- border: 1px solid #93A4B7;
-}
-.mainsegment h3 {
- margin: 0;
- background-color: #7982BC;
- color: white;
- padding: 2px 1px 2px 6px;
- font-family: "Trebuchet MS", Tahoma, Arial, sans-serif;
- border-top: 1px solid #C9D2DC;
- border-left: 1px solid #C9D2DC;
- border-bottom: 1px solid #51657B;
- border-right: 0;
-}
-.mainsegment div {
- margin: 0;
- padding: 10px;
-}
-.mainsegment h4 {
- margin: 0;
- padding: 2px 1px 2px 6px;
- font-family: Arial, Tahoma, sans-serif;
- background-color: #DDDDEE;
- border-top: 1px solid white;
- border-left: 1px solid white;
- border-bottom: 1px solid #999999;
- border-right: 0;
-}
-pre, .programlisting {
- background-color: #EFEFEF;
- border: 1px solid #CCCCEE;
- padding: 10px;
-}
-
-/* Docbook formatting */
-h3.SECT3, h3.AUTHOR {
- border: 0;
- font-family: sans-serif;
- background-color: white;
- color: black;
- margin: 0;
- padding: 0;
-}
-h1 a:hover, h2 a:hover, h3 a:hover {
- color: #51657B;
- text-decoration: none;
-}
-DD { padding-bottom: 0 }
-.synopsis {
- background: #eeeeff;
- border: solid 1px #aaaaff;
- padding: 10px;
-}
-
-
-/* Diagram thinger */
-td.kernel {text-align: center; background-color: pink}
-td.entity {text-align: center; background-color: #4ae544}
-td.lib {text-align: center; background-color: orange}
-td.abstract {text-align: center; background-color: #dde244}
-td.net {text-align: center; background-color: lightgray}
-
+++ /dev/null
-#!/usr/bin/perl
-# gen.pl -- splice a page body into template.html to produce a full HTML page
-# usage: ./gen.pl <page>.body > <page>.html  (see the %.html rule in the Makefile)
-
-use strict;
-use warnings;
-
-my $bodyf = shift @ARGV or die "usage: $0 <page.body>\n";
-my $templatef = 'template.html';
-
-open(my $bfh, '<', $bodyf) or die "can't open $bodyf: $!\n";
-my $body = join('', <$bfh>);
-close $bfh;
-
-open(my $tfh, '<', $templatef) or die "can't open $templatef: $!\n";
-my $template = join('', <$tfh>);
-close $tfh;
-
-# replace the --body-- placeholder in the template with the page body
-$template =~ s/--body--/$body/;
-print $template;
+++ /dev/null
-<div class="mainsegment">
- <h3>Interested in working on Ceph?</h3>
- <div>
-We are actively seeking experienced C/C++ and Linux kernel developers who are interested in helping turn Ceph into a stable production-grade storage system. Competitive salaries, benefits, etc. If interested, please contact sage at newdream dot net.
- </div>
-</div>
-
-
-<div class="mainsegment">
- <h3>Welcome</h3>
- <div>
- Ceph is a distributed network file system designed to provide excellent performance, reliability, and scalability. Ceph fills two significant gaps in the array of currently available file systems:
-
- <ol>
- <li><b>Robust, open-source distributed storage</b> -- Ceph is released under the terms of the LGPL, which means it is free software (as in speech and beer). Ceph will provide a variety of key features that are generally lacking from existing open-source file systems, including seamless scalability (the ability to simply add disks to expand volumes), intelligent load balancing, and efficient, easy-to-use snapshot functionality.
- <li><b>Scalability</b> -- Ceph is built from the ground up to seamlessly and gracefully scale from gigabytes to petabytes and beyond. Scalability is considered in terms of workload as well as total storage. Ceph is designed to handle workloads in which tens of thousands of clients or more simultaneously access the same file or write to the same directory--usage scenarios that bring typical enterprise storage systems to their knees.
- </ol>
-
- Here are some of the key features that make Ceph different from existing file systems that you may have worked with:
-
- <ol>
- <li><b>Seamless scaling</b> -- A Ceph filesystem can be seamlessly expanded by simply adding storage nodes (OSDs). However, unlike most existing file systems, Ceph proactively migrates data onto new devices in order to maintain a balanced distribution of data. This effectively utilizes all available resources (disk bandwidth and spindles) and avoids data hot spots (e.g., active data residing primarily on old disks while newer disks sit empty and idle).
- <li><b>Strong reliability and fast recovery</b> -- All data in Ceph is replicated across multiple OSDs. If any OSD fails, data is automatically re-replicated to other devices. However, unlike typical RAID systems, the replicas for data on each disk are spread out among a large number of other disks, and when a disk fails, the replacement replicas are also distributed across many disks. This allows recovery to proceed in parallel (with dozens of disks copying to dozens of other disks), removing the need for explicit "spare" disks (which are effectively wasted until they are needed) and preventing a single disk from becoming a "RAID rebuild" bottleneck.
- <li><b>Adaptive MDS</b> -- The Ceph metadata server (MDS) is designed to dynamically adapt its behavior to the current workload. As the size and popularity of the file system hierarchy changes over time, that hierarchy is dynamically redistributed among available metadata servers in order to balance load and most effectively use server resources. (In contrast, current file systems force system administrators to carve their data set into static "volumes" and assign volumes to servers. Volume sizes and workloads inevitably shift over time, forcing administrators to constantly shuffle data between servers or manually allocate new resources where they are currently needed.) Similarly, if thousands of clients suddenly access a single file or directory, that metadata is dynamically replicated across multiple servers to distribute the workload.
- </ol>
-
- For more information about the underlying architecture of Ceph, please see the <a href="overview.html">Overview</a>. This project is based on a substantial body of research conducted by the <a href="http://ssrc.cse.ucsc.edu/proj/ceph.html">Storage Systems Research Center</a> at the University of California, Santa Cruz over the past few years that has resulted in a number of <a href="publications.html">publications</a>.
- </div>
-
- <h3>Current Status</h3>
- <div>
- Ceph is roughly alpha quality, and is under very active development. <b>Ceph is not yet suitable for any uses other than testing and review.</b> The file system is mountable and more or less usable using a FUSE-based client, and development is underway on a native Linux kernel client. Many features are planned but not yet implemented, including snapshots.
-
- <p>The Ceph project is actively seeking participants. If you are interested in using Ceph, or contributing to its development, please <a href="http://lists.sourceforge.net/mailman/listinfo/ceph-devel">join the mailing list</a>.
- </div>
-</div>
-
-<b>Please feel free to <a href="mailto:sage@newdream.net">contact me</a> with any questions or comments.</b>
\ No newline at end of file
+++ /dev/null
-
-<div class="mainsegment">
- <h3>Ceph Overview -- What is it?</h3>
- <div>
- Ceph is a scalable distributed network file system that provides both excellent performance and reliability. As with network file protocols such as NFS and CIFS, clients require only a network connection to mount and use the file system. Unlike NFS and CIFS, however, Ceph clients can communicate directly with storage nodes (which we call OSDs) instead of a single "server" (something that limits the scalability of installations using NFS and CIFS). In that sense, Ceph resembles "cluster" file systems based on SANs (storage area networks) and FC (fibre-channel) or iSCSI. The main difference is that FC and iSCSI are block-level protocols that communicate with dumb, passive disks; Ceph OSDs are intelligent storage nodes, and all communication runs over TCP on commodity IP networks.
- <p>
- Ceph's intelligent storage nodes (basically, storage servers running software to serve "objects" instead of files) facilitate improved scalability and parallelism. NFS servers (i.e. NAS devices) and cluster file systems funnel all I/O through a single (or limited set of) servers, limiting scalability. Ceph clients interact with a set of (perhaps dozens or hundreds of) metadata servers (MDSs) for high-level operations like open() and rename(), but communicate directly with storage nodes (OSDs) for I/O, of which there may be thousands.
- <p>
- There are a handful of new file systems and enterprise storage products adopting a similar object- or brick-based architecture, including Lustre (also open-source, but with restricted access to source code) and the Panasas file system (a commercial storage product). Ceph is different:
- <ul>
- <li><b>Open source, open development.</b> We're hosted on SourceForge, and are actively looking for interested users and developers.
- <li><b>Scalability.</b> Ceph sheds legacy file system design principles like explicit allocation tables that are still found in almost all other file systems (including Lustre and the Panasas file system) and ultimately limit scalability.
- <li><b>Commodity hardware.</b> Ceph is designed to run on commodity hardware running Linux (or any other POSIX-ish Unix variant). (Lustre relies on a SAN or other shared storage failover to make storage nodes reliable, while Panasas is based on custom hardware using integrated UPSs.)
- </ul>
- In addition to promising greater scalability than existing solutions, Ceph also promises to fill the huge gap between open-source file systems and commercial enterprise systems. If you want network-attached storage without shelling out the big bucks, you are usually stuck with NFS and a direct-attached RAID. Technologies like ATA-over-Ethernet and iSCSI help scale raw volume sizes, but the relative lack of "cluster-aware" open-source file systems (particularly those with snapshot-like functionality) still ties you to a single NFS "server," which limits scalability.
-<p>
-Ceph fills this gap by providing a scalable, reliable file system that can seamlessly grow from gigabytes to petabytes. Moreover, Ceph will eventually provide efficient snapshots, which almost no freely available file system (besides ZFS on Solaris) provides, despite snapshots having become almost ubiquitous in enterprise systems.
- </div>
-
- <h3>Ceph Architecture</h3>
- <div>
- <center><img src="images/ceph-architecture.png"></center>
- <p>
- A thorough overview of the system architecture can be found in <a href="http://www.usenix.org/events/osdi06/tech/weil.html">this paper</a> that appeared at <a href="http://www.usenix.org/events/osdi06">OSDI '06</a>.
- <p>
- A Ceph installation consists of three main elements: clients, metadata servers (MDSs), and object storage devices (OSDs). Ceph clients can be either individual processes linking directly to a user-space client library or hosts mounting the Ceph file system natively (much as with NFS). OSDs are servers with attached disks and are responsible for storing data.
- <p>
- The Ceph architecture is based on three key design principles that set it apart from traditional file systems.
-
- <ol>
- <li><b>Separation of metadata and data management.</b><br>
- A small set of metadata servers (MDSs) manages the file system hierarchy (namespace). Clients communicate with an MDS to open/close files, get directory listings, remove files, or perform any other operation that involves file names. Once a file is opened, clients communicate directly with OSDs (object-storage devices) to read and write data. A large Ceph system may involve anywhere from one to many dozens (or possibly hundreds) of MDSs, and anywhere from four to hundreds or thousands of OSDs.
- <p>
- Both file data and file system metadata are striped over multiple <i>objects</i>, each of which is replicated on multiple OSDs for reliability. A special-purpose mapping function called <a href="http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf">CRUSH</a> is used to determine which OSDs store which objects. CRUSH resembles a hash function in that this mapping is pseudo-random (it appears random, but is actually deterministic). This provides load balancing across all devices that is relatively invulnerable to "hot spots," while Ceph's policy of redistributing data ensures that workload remains balanced and all devices are equally utilized even when the storage cluster is expanded or OSDs are removed. (A toy sketch of this kind of deterministic, hash-style placement appears after this list.)
-
- <p>
-
- <li><b>Intelligent storage devices</b><br>
- Each Ceph OSD stores variably sized, named <i>objects</i>. (In contrast, conventional file systems are built directly on top of raw hard disks that store small, fixed-size, numbered <i>blocks</i>.) In practice, Ceph OSDs can be built from conventional server hardware with attached storage (either raw disks or a small RAID). Each Ceph OSD runs a special-purpose "object file system" called <b>EBOFS</b> designed to efficiently and reliably store variably sized objects.
- <p>
- More importantly, Ceph OSDs are intelligent. Collectively, the OSD cluster manages data replication, data migration (when the cluster composition changes due to expansion, failures, etc.), failure detection (OSDs actively monitor their peers), and failure recovery. We call the collective cluster <b>RADOS</b>--Reliable Autonomic Distributed Object Store--because it provides the illusion of a single logical object store while hiding the details of data distribution, replication, and failure recovery.
- <p>
-
- <li><b>Dynamic distributed metadata</b><br>
- Ceph dynamically distributes responsibility for managing the file system directory hierarchy over tens or even hundreds of MDSs. Because Ceph embeds most inodes directly within the directory that contains them, this hierarchical partitioning allows each MDS to operate independently and efficiently. This distribution is entirely adaptive, based on the current workload, allowing the cluster to redistribute the hierarchy to balance load as client access patterns change over time. Ceph also copes with metadata hot spots: popular metadata is replicated across multiple MDS nodes, and extremely large directories can be distributed across the entire cluster when necessary.
-
- </ol>
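-
- <p>
- As a toy illustration of the "pseudo-random but deterministic" placement idea above (this is <i>not</i> the real CRUSH algorithm, which also separates replicas across failure domains and minimizes data movement when the cluster changes; the object name and cluster size here are made up):
-<pre>
-object="10000000001.00000000"   # hypothetical object name
-num_osds=8                      # hypothetical cluster size
-for r in 0 1; do                # two replicas
-    h=$(printf '%s.%d' "$object" "$r" | md5sum | cut -c1-8)
-    echo "replica $r of $object -> osd$(( 0x$h % num_osds ))"
-done
-</pre>
- Every node that evaluates such a mapping computes the same answer for the same object, so no central allocation table is needed; when the cluster description changes, objects land in new places, which is why Ceph pairs the mapping with automatic data migration.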
-
-
- </div>
-
-</div>
-
+++ /dev/null
-
-<div class="mainsegment">
- <h3>Publications</h3>
- <div>
- Ceph has grown out of the petabyte-scale storage research at the Storage Systems Research Center at the University of California, Santa Cruz. The project is funded primarily by a grant from the Lawrence Livermore, Sandia, and Los Alamos National Laboratories. A range of publications related to scalable storage systems has resulted.
- <p>
- The following publications are directly related to the current design of Ceph.
- <ul>
- <li>Sage A. Weil, Andrew W. Leung, Scott A. Brandt, Carlos Maltzahn. <a href="http://www.pdsi-scidac.org/SC07/resources/rados-pdsw.pdf">RADOS: A Fast, Scalable, and Reliable Storage Service for Petabyte-scale Storage Clusters</a>. Petascale Data Storage Workshop SC07, November, 2007. [ <a href="http://www.pdsi-scidac.org/SC07/resources/weil-20071111-rados-pdsw.pdf">slides</a> ]
- <li>Sage Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Carlos Maltzahn, <a href="http://www.usenix.org/events/osdi06/tech/weil.html"><b>Ceph: A Scalable, High-Performance Distributed File System</b></a>, Proceedings of the 7th Conference on Operating Systems Design and Implementation (OSDI '06), November 2006.
- <li>Sage Weil, Scott A. Brandt, Ethan L. Miller, Carlos Maltzahn, <a href="http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf">CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data</a>, Proceedings of SC '06, November 2006.
- <li>Sage Weil, Kristal Pollack, Scott A. Brandt, Ethan L. Miller, <a href="http://www.ssrc.ucsc.edu/Papers/weil-sc04.pdf">Dynamic Metadata Management for Petabyte-Scale File Systems</a>, Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04), November 2004.
- <li>Qin Xin, Ethan L. Miller, Thomas Schwarz, <a href="http://www.ssrc.ucsc.edu/Papers/xin-mss05.pdf">Evaluation of Distributed Recovery in Large-Scale Storage Systems</a>, Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing (HPDC 2004), June 2004, pages 172-181.
- </ul>
-
- The following papers describe aspects of subsystems of Ceph that have not yet been fully designed or integrated, but soon will be.
- <ul>
- <li>Andrew Leung, Ethan L. Miller, <u>Scalable Security for Large, High Performance Storage Systems</u>, Proceedings of the 2nd ACM Workshop on Storage Security and Survivability (StorageSS 2006), October 2006.
- <li>Joel C. Wu, Scott A. Brandt, <a href="http://www.ssrc.ucsc.edu/Papers/wu-mss06.pdf">The Design and Implementation of AQuA: an Adaptive Quality of Service Aware Object-Based Storage Device</a>, Proceedings of the 23rd IEEE / 14th NASA Goddard Conference on Mass Storage Systems and Technologies, May 2006, pages 209-218.
-
- </ul>
-
- The following papers represent earlier research upon which Ceph's design is partially based.
- <ul>
- <li>Christopher Olson, Ethan L. Miller, <a href="http://www.cs.ucsc.edu/~elm/Papers/storagess05.pdf">Secure Capabilities for a Petabyte-Scale Object-Based Distributed File System</a>, Proceedings of the 2005 ACM Workshop on Storage Security and Survivability (StorageSS 2005), November 2005.
- <li>Qin Xin, Thomas Schwarz, Ethan L. Miller, <a href="http://www.ssrc.ucsc.edu/Papers/xin-mascots05.pdf">Disk Infant Mortality in Large Storage Systems</a>, Proceedings of the 13th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS '05), September 2005.
- <li>Joel C. Wu, Scott A. Brandt, <a href="http://www.soe.ucsc.edu/~jwu/papers/wu-nossdav05.pdf">Hierarchical Disk Sharing for Multimedia Systems and Servers</a>, Proceedings of the 15th ACM International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV 2005), June 2005, pages 189-194.
- <li>Qin Xin, Ethan L. Miller, Thomas Schwarz, Darrell D. E. Long, <a href="http://www.ssrc.ucsc.edu/Papers/xin-mss05.pdf">Impact of Failure on Interconnection Networks in Large Storage Systems</a>, Proceedings of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies, April 2005.
-
- <li>Feng Wang, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, <a href="http://ssrc.cse.ucsc.edu/Papers/wang-mss04b.pdf">OBFS: A File System for Object-Based Storage Devices</a>, Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies, April 2004, pages 283-300.
- <li>Feng Wang, Qin Xin, Bo Hong, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Tyce T. Mclarty, <a href="http://ssrc.cse.ucsc.edu/Papers/wang-msst04c.pdf">File System Workload Analysis For Large Scientific Computing Applications</a>, NASA/IEEE Conference on Mass Storage Systems and Technologies (MSST 2004), April 2004, pages 139-152.
- <li>Andy Hospodor, Ethan L. Miller, <a href="http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf">Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems</a>, Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies, April 2004, pages 273-281.
- <li>R. J. Honicky, Ethan L. Miller, <a href="http://www.ssrc.ucsc.edu/Papers/honicky-ipdps04.pdf">Replication Under Scalable Hashing: A Family of Algorithms for Scalable Decentralized Data Distribution</a>, Proceedings of the 18th International Parallel & Distributed Processing Symposium (IPDPS 2004), April 2004.
- </ul>
-
- This is a partial selection. A complete list of publications for the project is available on the <a href="http://ssrc.cse.ucsc.edu/proj/ceph.html">SSRC Ceph project web site</a>.
-
- </div>
-</div>
+++ /dev/null
-
-<div class="mainsegment">
- <h3>Getting Started</h3>
- <div>
- The Ceph source code is managed with Git. For a Git crash course, there is a <a href="http://www.kernel.org/pub/software/scm/git/docs/tutorial.html">tutorial</a> and more from the <a href="http://git.or.cz/#documentation">official Git site</a>. Here is a quick <a href="http://git.or.cz/course/svn.html">crash course for Subversion users</a>.
-
- <p>The Ceph project is always looking for more participants. If you are interested in using Ceph, or contributing to its development, please <a href="http://lists.sourceforge.net/mailman/listinfo/ceph-devel">join our mailing list</a> and <a href="mailto:ceph-devel@lists.sourceforge.net">drop us a line</a>.
-
- <h4>Checking out the source</h4>
- <div>
- You can check out a working copy (actually, clone the repository) with
-<pre>
-git clone git://ceph.newdream.net/ceph.git
-</pre>
-or
-<pre>
-git clone http://ceph.newdream.net/git/ceph.git
-</pre>
- To pull the latest,
-<pre>
-git pull
-</pre>
- You can browse the git repo at <a href="http://ceph.newdream.net/git">http://ceph.newdream.net/git</a>.
- </div>
-
- <h4>Build Targets</h4>
- <div>
- There is a range of binary targets, mostly for ease of development and testing (see the example after this list):
- <ul>
- <li><b>cmon</b> -- monitor</li>
- <li><b>cosd</b> -- OSD storage daemon</li>
- <li><b>cmds</b> -- MDS metadata server</li>
- <li><b>cfuse</b> -- client, mountable via FUSE</li>
- <li><b>csyn</b> -- client synthetic workload generator</li>
- <li><b>cmonctl</b> -- control tool</li>
- <p>
- <li><b>fakesyn</b> -- places all logical elements (MDS, client, etc.) in a single binary, with synchronous message delivery (for easy debugging!). Includes synthetic workload generation.</li>
- <li><b>fakefuse</b> -- same as fakesyn, but mounts a single client via FUSE.</li>
- </ul>
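- <p>
- For example, once the tree has been configured and built (see the next section), individual binaries can be built by name from the <tt>src/</tt> directory; a sketch of building just the daemons and the FUSE client:
-<pre>
-cd src
-make cmon cosd cmds cfuse
-</pre>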
- </div>
-
- <h4>Runtime Environment</h4>
- <div>
- A few quick steps to get things started. Note that these instructions assume either that you are running on one node or that you have a shared directory (e.g. over NFS) mounted on each node.
-
- <ol>
- <li>Check out the source, change into the <tt>ceph/src</tt> directory, and build. E.g.,
-<pre>
-git clone git://ceph.newdream.net/ceph.git
-cd ceph
-./autogen.sh
-./configure # or CXXFLAGS="-g" ./configure to disable optimizations (for debugging)
-cd src
-make
-</pre>
-
- <li>Create a <tt>log/</tt> dir for various runtime stats.
-<pre>
-mkdir log
-</pre>
- <li>Identify the EBOFS block devices. This is accomplished with symlinks (or actual files) in the <tt>dev/</tt> directory. Devices can be identified by symlinks named after the hostname (e.g. <tt>osd.googoo-1</tt>), logical OSD number (e.g. <tt>osd4</tt>), or simply <tt>osd.all</tt> (in that order of preference). For example,
-<pre>
-mkdir dev
-ln -s /dev/sda3 dev/osd.all # all nodes use /dev/sda3
-ln -s /dev/sda4 dev/osd0 # except osd0, which should use /dev/sda4
-</pre>
- That is, when an OSD starts up, it first looks for a symlink named after its hostname (e.g. <tt>dev/osd.googoo-1</tt>), then <tt>dev/osd$n</tt>, then <tt>dev/osd.all</tt>, in that order.
-
- These need not be "real" devices--they can be regular files too. To get going with fakesyn, for example, or to test a whole "cluster" running on the same node,
-<pre>
-# create small "disks" for osd0-osd3
-for f in 0 1 2 3; do # default is 4 OSDs
-dd if=/dev/zero of=dev/osd$f bs=1048576 count=1024 # 1 GB each
-done
-</pre>
- Note that if your home/working directory is mounted via NFS or similar, you'll want to symlink <tt>dev/</tt> to a directory on a local disk.
- </div>
-
-
- <h4>Starting up a full "cluster" on a single host</h4>
- <div>
- You can start up the full cluster of daemons on a single host. Assuming you've created a set of individual files for each OSD's block device (the second option of step 3 above), there are <tt>start.sh</tt> and <tt>stop.sh</tt> scripts that will start and stop the whole cluster, with the monitor on port 12345.
-<p>
-One caveat here is that the ceph daemons need to know what IP they are reachable at; they determine that by doing a lookup on the machine's hostname. Since many/most systems map the hostname to 127.0.0.1 in <tt>/etc/hosts</tt>, you either need to change that (the easiest approach, usually) or add a <tt>--bind 1.2.3.4</tt> argument to cmon/cosd/cmds to help them out.
-<p>
-Note that the monitor has the only fixed and static ip:port in the system. The rest of the cluster daemons bind to a random port and register themselves with the monitor.
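-<p>
-A minimal session might look like the following (a sketch only; it assumes the scripts sit alongside the freshly built binaries in <tt>ceph/src</tt> and that your hostname resolves to a reachable address, per the note above):
-<pre>
-cd ceph/src
-getent hosts $(hostname)   # should print a real IP, not 127.0.0.1 (otherwise fix /etc/hosts or use --bind)
-./start.sh                 # start the monitor (port 12345) and the cosd/cmds daemons
-# ... mount with cfuse or the kernel client, run tests, etc. ...
-./stop.sh                  # shut everything back down
-</pre>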
- </div>
-
- <h4>Mounting with FUSE</h4>
- <div>
- The easiest route is <tt>fakefuse</tt>:
-<pre>
-modprobe fuse # make sure fuse module is loaded
-mkdir mnt # or wherever you want your mount point
-make fakefuse && ./fakefuse --mkfs --debug_ms 1 mnt
-</pre>
- You should be able to ls, copy files, or whatever else (in another terminal; fakefuse will stay in the foreground). Control-C will kill fuse and cause an orderly shutdown. Alternatively, <tt>fusermount -u mnt</tt> will unmount. If fakefuse crashes or hangs, you may need to <tt>kill -9 fakefuse</tt> and/or <tt>fusermount -u mnt</tt> to clean up. Overall, FUSE is pretty well-behaved.
-
-If you already have the cluster daemons running (as above), you can mount via the standalone FUSE client:
-<pre>
-modprobe fuse
-mkdir mnt
-make cfuse && ./cfuse mnt
-</pre>
- </div>
-
- <h4>Running the kernel client in a UML instance</h4>
- <div>
- Any recent mainline kernel will do here.
-<pre>
-$ cd linux
-$ patch -p1 < ~/ceph/src/kernel/kconfig.patch
-patching file fs/Kconfig
-patching file fs/Makefile
-$ cp ~/ceph/src/kernel/sample.uml.config .config
-$ ln -s ~/ceph/src/kernel fs/ceph
-$ ln -s ~/ceph/src/include/ceph_fs.h include/linux
-$ make ARCH=um
-</pre>
- I am using <a href="http://uml.nagafix.co.uk/Debian-3.1/Debian-3.1-AMD64-root_fs.bz2">this x86_64 Debian UML root fs image</a>, but any image will do (see <a href="http://user-mode-linux.sf.net">http://user-mode-linux.sf.net</a>) as long as the architecture (e.g. x86_64 vs i386) matches your host. Start up the UML guest instance with something like
-<pre>
-./linux ubda=Debian-3.1-AMD64-root_fs mem=256M eth0=tuntap,,,1.2.3.4 # 1.2.3.4 is the _host_ ip
-</pre>
- Note that if UML crashes/oopses/whatever, you can restart quick-and-dirty (up arrow + enter) with
-<pre>
-reset ; killall -9 linux ; ./linux ubda=Debian-3.1-AMD64-root_fs mem=256M eth0=tuntap,,,1.2.3.4
-</pre>
- You'll need to configure the network in UML with an unused IP. For my debian-based root fs image, this <tt>/etc/network/interfaces</tt> file does the trick:
-<pre>
-iface eth0 inet static
- address 1.2.3.5 # unused ip in your host's network for the uml guest
- netmask 255.0.0.0
- gateway 1.2.3.4 # host ip
-auto eth0
-</pre>
- Note that you need to install uml-utilities (<tt>apt-get install uml-utilities</tt> on Debian distros) and add yourself to the <tt>uml-net</tt> group on the host (or run the UML instance as root) for the network to start up properly.
- <p>
- Inside UML, you'll want an <tt>/etc/fstab</tt> line like
-<pre>
-none /host hostfs defaults 0 0
-</pre>
- You can then load the kernel client module and mount from the UML instance with
-<pre>
-insmod /host/path/to/ceph/src/kernel/ceph.ko
-mount -t ceph 1.2.3.4:/ mnt # 1.2.3.4 is host
-</pre>
-
- </div>
-
- <h4>Running fakesyn -- everything one process</h4>
- <div>
- A quick example, assuming you've set up "fake" EBOFS devices as above:
-<pre>
-make fakesyn && ./fakesyn --mkfs --debug_ms 1 --debug_client 3 --syn rw 1 100000
-# where those options mean:
-# --mkfs # start with a fresh file system
-# --debug_ms 1 # show message delivery
-# --debug_client 3 # show limited client stuff
-# --syn rw 1 100000 # write 1MB to a file in 100,000 byte chunks, then read it back
-</pre>
- Once the synthetic workload finishes, the synthetic client unmounts and the whole system shuts down.
-
- The full set of command line arguments can be found in <tt>config.cc</tt>.
- </div>
-
- </div>
-</div>
-
+++ /dev/null
-
-<div class="mainsegment">
- <h3>Current Roadmap</h3>
- <div>
- Here is a brief summary of what we're currently working on, and what state we expect Ceph to take in the foreseeable future. This is a rough estimate, and highly dependent on what kind of interest Ceph generates in the larger community.
-
- <ul>
- <li><b>Q1 2008</b> -- Basic in-kernel client (Linux)
- <li><b>Q2 2008</b> -- Snapshots
- <li><b>Q2 2008</b> -- User/group quotas
- </ul>
-
- </div>
-
- <h3>Tasks</h3>
- <div>
- Although Ceph is currently a working prototype that demonstrates the key features of the architecture, a variety of features need to be implemented in order to make Ceph a stable file system that can be used in production environments. Some of these tasks are outlined below. If you are a kernel or file system developer and are interested in contributing to Ceph, please join the email list and <a href="mailto:ceph-devel@lists.sourceforge.net">drop us a line</a>.
-
- <p>
-
- <h4>Snapshots</h4>
- <div>
- The distributed object storage fabric (RADOS) includes a simple mechanism for versioning objects and performing copy-on-write when old objects are updated. In order to use this mechanism to implement flexible snapshots, the MDS needs to be extended to manage versioned directory entries and maintain some additional directory links. For more information, see <a href="http://www.soe.ucsc.edu/~sage/papers/290s.osdsnapshots.pdf">this tech report</a>.
- </div>
-
- <h4>Content-addressable Storage</h4>
- <div>
- The underlying problems of reliable, scalable and distributed object storage are solved by the RADOS object storage system. This mechanism can be leveraged to implement a content-addressable storage system (i.e. one that stores duplicated data only once) by selecting a suitable chunking strategy and naming objects by the hash of their contents. Ideally, we'd like to incorporate this into the overall Ceph file system, so that different parts of the file system can be selectively stored normally or by their hash. Eventually, the system could (perhaps lazily) detect duplicated data when it is written and adjust the underlying storage strategy accordingly in order to optimize for space efficiency or performance.
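- <p>
- As a rough illustration of the idea (fixed-size chunking and SHA-1 are just one possible choice of chunking strategy and hash; nothing here reflects a settled Ceph design):
-<pre>
-split -b 1048576 somefile chunk.                         # cut a file into fixed-size chunks
-for c in chunk.*; do
-    echo "$c -> object $(sha1sum "$c" | cut -d' ' -f1)"  # name each chunk by the hash of its contents
-done
-</pre>
- Identical chunks hash to the same object name, so duplicated data ends up stored only once.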
- </div>
-
- <h4>Ebofs</h4>
- <div>
- Each Ceph OSD (storage node) runs a custom object "file system" called EBOFS to store objects on locally attached disks. Although the current implementation of EBOFS is fully functional and already demonstrates promising performance (outperforming ext2/3, XFS, and ReiserFS under the workloads we anticipate), a range of improvements will be needed before it is ready for prime time. These include:
- <ul>
- <li><b>NVRAM for data journaling.</b> Actually, this has been implemented, but is untested. EBOFS can utilize NVRAM to journal uncommitted requests much like WAFL does, significantly lowering write latency while facilitating more efficient disk scheduling, delayed allocation, and so forth.
- <li><b>RAID-aware allocation.</b> Although we conceptually think of each OSD as a disk with an attached CPU, memory, and network interface, it is more likely that the actual OSDs deployed in production systems will be small to medium sized storage servers: a standard server with a locally attached array of SAS or SATA disks. In order to properly take advantage of the parallelism inherent in the use of multiple disks, the EBOFS allocator and disk scheduling algorithms have to be aware of the underlying structure of the array (be it RAID0, 1, 5, 10, etc.) in order to reap the performance and reliability rewards.
- </ul>
- </div>
-
-
- <h4>Native kernel client</h4>
- <div>
- The prototype Ceph client is implemented as a user-space library. Although it can be mounted under Linux via the <a href="http://fuse.sourceforge.net/">FUSE (file system in userspace)</a> library, this incurs a significant performance penalty and limits Ceph's ability to provide strong POSIX semantics and consistency. A native Linux kernel implementation of the client is needed in order to properly take advantage of the performance and consistency features of Ceph. We are actively looking for experienced kernel programmers to help guide development in this area!
- </div>
-
- <h4>CRUSH tools</h4>
- <div>
- Ceph utilizes a novel data distribution function called CRUSH to distribute data (in the form of objects) to storage nodes (OSDs). CRUSH is designed to generate a balanced distribution while allowing the storage cluster to be dynamically expanded or contracted, and to separate object replicas across failure domains to enhance data safety. There is a certain amount of finesse involved in properly managing the OSD hierarchy from which CRUSH generates its distribution in order to minimize the amount of data migration that results from changes. An administrator tool would be useful for helping to manage the CRUSH mapping function in order to best exploit the available storage and network infrastructure. For more information, please refer to the technical paper describing <a href="publications.html">CRUSH</a>.
- </div>
-
- <p>The Ceph project is always looking for more participants. If any of these projects sound interesting to you, please <a href="http://lists.sourceforge.net/mailman/listinfo/ceph-devel">join our mailing list</a>.
- </div>
-</div>
-
-
-<b>Please feel free to <a href="mailto:sage@newdream.net">contact me</a> with any questions or comments.</b>
+++ /dev/null
-<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
-<html>
-<head>
- <title>Ceph - Petascale Distributed Storage</title>
- <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
- <meta name="keywords" content="ceph, distributed, storage, file system, posix, object-based storage, osd">
- <meta name="description" content="The Ceph file system, a scalable open source distributed file system with POSIX semantics.">
- <link rel="stylesheet" href="ceph.css" type="text/css">
-</head>
-<body>
-<br>
- <div id="banner"><a href="index.html"><img src="images/ceph-logo1.jpg" border=0 width=244 height=84></a></div>
-
- <table><tr>
- <td id="navcolumn">
- <div class="navsegment">
- <h4>Main</h4>
- <ul>
- <li><a href="index.html">Home</a></li>
- <li><a href="http://sourceforge.net/projects/ceph">SF Project Page</a></li>
- </ul>
- </div>
-
- <div class="navsegment">
- <h4>Documentation</h4>
- <ul>
- <li><a href="overview.html">Overview</a></li>
- <li><a href="publications.html">Publications</a></li>
- <li><a href="http://wiki.soe.ucsc.edu/bin/view/Main/PetascaleObject-basedStorage">Wiki</a></li>
- </ul>
- </div>
-
- <div class="navsegment">
- <h4>Development</h4>
- <ul>
- <li><a href="tasks.html">Tasks and Roadmap</a></li>
- <li><a href="source.html">Getting Started</a></li>
- <li><a href="http://ceph.newdream.net/git">Browse GIT repository</a></li>
- </ul>
- </div>
-
- <div class="navsegment">
- <h4>Lists</h4>
- <ul>
- <li><a href="https://lists.sourceforge.net/lists/listinfo/ceph-devel">Development List</a></li>
- <li><a href="http://sourceforge.net/mailarchive/forum.php?forum_name=ceph-devel">Archives</a></li>
- <li><a href="https://lists.sourceforge.net/lists/listinfo/ceph-commit">Commit List</a></li>
- </ul>
- </div>
-
- <div class="navsegment">
- <h4>Related Sites</h4>
- <ul>
- <li><a href="http://ssrc.cse.ucsc.edu/">Storage Systems Research Center</a></li>
- </ul>
-
- </div>
-
- <a href="http://sourceforge.net"><img src="http://sflogo.sourceforge.net/sflogo.php?group_id=145953&type=1" width="88" height="31" border="0" alt="SourceForge.net Logo" /></a>
- </td>
-
- <td id="maincolumn">
-
---body--
-
- </td>
- </tr></table>
-</body>
-</html>