Section 3 describes the design decisions that drove the techniques detailed in Section 4 on metadata partitioning and Section 5 on hotspot mitigation. Section 6 reports on design lessons we learned, and Section 7 describes our implementation. Section 8 experimentally evaluates the system. Sections 9 and 10 describe future work and related work. Section 11 summarizes and concludes. Many more details can be found in our paper [1].

Author:Nanos Turan
Language:English (Spanish)
Published (Last):14 December 2012

In this paper, we address the specific sub-goal of automated load balancing.

Farsite expects that machines may fail and that network connections may be flaky, but it is not designed to gracefully handle long-term disconnections. Each machine functions in two roles, a client that performs operations for its local user and a server that provides directory service to clients. To perform an operation, a client obtains a copy of the relevant metadata from the server, along with a lease [15] over the metadata.

For the duration of the lease, the client has an authoritative value for the metadata. If the lease is a write lease, then the server temporarily gives up authority over the metadata, because the lease permits the client to modify the metadata.
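The lease behavior described above can be illustrated with a minimal sketch. The class name `LeasedField` and its methods are hypothetical and are not taken from Farsite's code; the sketch only shows that the server stops being authoritative while a write lease is outstanding and regains authority when the lease is returned with the client's update.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class LeasedField:
    """Server-side state for one metadata field (hypothetical model)."""
    value: str
    readers: Set[str] = field(default_factory=set)
    writer: Optional[str] = None

    def is_authoritative(self) -> bool:
        # The server gives up authority while a write lease is outstanding.
        return self.writer is None

    def grant_read(self, client: str) -> Optional[str]:
        # Assumption: refuse instead of recalling the write lease.
        if self.writer is not None:
            return None
        self.readers.add(client)
        return self.value

    def grant_write(self, client: str) -> Optional[str]:
        # A write lease is exclusive: no other reader or writer may exist.
        if self.writer is not None or self.readers - {client}:
            return None
        self.readers.discard(client)
        self.writer = client
        return self.value

    def return_write(self, client: str, new_value: str) -> None:
        # Returning the lease hands the updated value and authority back.
        if self.writer != client:
            raise ValueError("client does not hold the write lease")
        self.value = new_value
        self.writer = None
```

A real server would recall conflicting leases rather than refuse the request; refusal is used here only to keep the sketch short.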

After the client performs the metadata operation, it lazily propagates its metadata updates to the server. When the load on a server becomes excessive, the server selects another machine in the system and delegates a portion of its metadata to this machine.

From the perspective of the directory service, there is little distinction between directories and data files, other than that directories have no content and data files may not have children. Decades of experience by Unix and Windows users have shown that fully functional rename is part of what makes a hierarchical file system a valuable tool for organizing data. It is slightly more involved because, for example, one can create a file only if its parent exists, so in the flattened name space, one could create a new name only if a constrained prefix of the name already exists.

Some prior distributed file systems have divided the name space into user-visible partitions and disallowed renames across partitions; examples include volumes in AFS [19], shares in Dfs [25], and domains in Sprite [27].

Instead, we designed our metadata service to present a semantically seamless name space. Client X renames file C to be a child of file G, as shown in Fig. No single server is directly involved in both rename operations, and each independent rename is legal. Yet, if both renames were allowed to proceed, the result would be an orphaned loop, as shown in Fig.
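The hazard can be reproduced on a toy tree. In the sketch below, the tree shape and the second rename (file F moved under file D) are assumptions chosen for illustration; only the first rename (C moved under G) comes from the text. Each rename passes the usual "not an ancestor of the destination" check when evaluated on its own, yet applying both disconnects four files from the root.

```python
def is_ancestor(parent, a, b):
    """Return True if a is b or an ancestor of b in the parent map."""
    node = b
    while node is not None:
        if node == a:
            return True
        node = parent.get(node)
    return False


def reaches_root(parent, node, root="A"):
    """Return True if following parent links from node arrives at root."""
    seen = set()
    while node is not None and node not in seen:
        if node == root:
            return True
        seen.add(node)
        node = parent.get(node)
    return False


# Hypothetical name space: A/B/C/D and A/E/F/G.
parent = {"A": None, "B": "A", "C": "B", "D": "C",
          "E": "A", "F": "E", "G": "F"}

# Each rename is legal when checked independently against the original tree.
x_ok = not is_ancestor(parent, "C", "G")   # client X: move C under G
y_ok = not is_ancestor(parent, "F", "D")   # client Y: move F under D (assumed)

# Applying both produces a cycle that no longer reaches the root.
after = dict(parent)
after["C"] = "G"
after["F"] = "D"
orphaned = sorted(n for n in after if not reaches_root(after, n))
```

Here `orphaned` contains C, D, F, and G, which form the loop C, G, F, D.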

Once we had a mechanism for scalable name-space consistency, it seemed reasonable to use this mechanism to provide a consistent name space for all path-based operations, not merely for renames. Since we employ a lease-based framework, we attach expiration times — typically a few hours — to all leases issued by servers. When a lease expires, the server reclaims its authority over the leased metadata.

If a client had performed operations under the authority of this lease, these operations are effectively undone when the lease expires. Farsite thus permits lease expiration to cause metadata loss, not merely file content loss as in previous distributed file systems.
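As a rough illustration of this behavior, the sketch below models a client that buffers operations under a lease and pushes them lazily. The `ClientLease` class and the explicit `now` arguments are hypothetical; they stand in for whatever clock and update protocol the real system uses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClientLease:
    """Client-side view of one lease with buffered updates (hypothetical)."""
    expires_at: float
    pending: List[str] = field(default_factory=list)

    def record(self, op: str, now: float) -> None:
        # Operations may be performed only while the lease is valid.
        if now >= self.expires_at:
            raise RuntimeError("lease expired")
        self.pending.append(op)

    def flush(self, server_log: List[str], now: float) -> int:
        """Push buffered updates; return how many the server accepted."""
        if now >= self.expires_at:
            # The server has reclaimed authority, so the updates are lost.
            self.pending.clear()
            return 0
        accepted = len(self.pending)
        server_log.extend(self.pending)
        self.pending.clear()
        return accepted
```

An update flushed before expiry reaches the server log, whereas one still buffered at expiry is silently dropped, which is the metadata loss described above.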

However, these situations are not as radically different as they may first appear, because the user-visible semantics depend on how applications use files. Under Outlook, content-lease expiration can cause email folder metadata updates to get lost; whereas under maildir, expiring a file-system metadata lease can cause the same behavior.

Our intent had been to partition file metadata among servers according to file path names. Each client would maintain a cache of mappings from path names to their managing servers, similar to a Sprite prefix table [43]. The client could verify the authority of the server over the path name by evaluating a chain of delegation certificates extending back to the root server.

To diffuse metadata hotspots, servers would issue stale snapshots instead of leases when the client load got too high, and servers would lazily propagate the result of rename operations throughout the name space. Partitioning by path name complicates renames across partitions, as detailed in the next section. In the absence of path-name delegation, name-based prefix tables are inappropriate. Similarly, if partitioning is not based on names, consistently resolving a path name requires access to metadata from all files along a path, so delegation certificates are unhelpful for scalability.

Stale snapshots and lazy rename propagation allow the name space to become inconsistent, which can cause orphaned loops, as described above in Section 3. For these reasons, we abandoned these design ideas. One approach is to partition by file path name, as in Sprite [27] and the precursor to AFS [32], wherein each server manages files in a designated region of name space.

The former approach (migration without delegation) is insufficient, because if a large subtree is being renamed, it may not be manageable by a single server. We avoid these problems by partitioning according to file identifiers, which are not mutable.

All three systems need to efficiently store and retrieve information on which server manages each identifier. AFS addresses this problem with volumes [34], and xFS addresses it with a similar unnamed abstraction. All files in a volume reside on the same server. Volumes can be dynamically relocated among servers, but files cannot be dynamically repartitioned among volumes without reassigning file identifiers.

Specifically, we considered four issues: (1) To maximize delegation-policy freedom, regions of identifier space should be partitionable with arbitrary granularity. (2) To permit growth, each server should manage an unbounded region of file-identifier space. (3) For efficiency, file identifiers should have a compact representation. (4) Also for efficiency, the dynamic mapping from file identifiers to servers should be stored in a time- and space-efficient structure.

Each server manages identifiers beginning with a specified prefix, except for those it has explicitly delegated away to other servers. The file identifier space is thus a tree, and servers manage subtrees of this space. At any moment, some portion of each file identifier determines which server manages the file; however, the size of this portion is not fixed over time, unlike AFS file identifiers.

For example, in the partitioning of Fig. This arbitrary partitioning granularity addresses issue 1 above. Because file identifiers have unbounded sequence length, each server manages an infinite set of file identifiers.

This addresses issue 2. Variable-length identifiers may be unusual, but they are not complicated in practice. Our code includes a file-identifier class that stores small identifiers in an immediate variable and spills large identifiers onto the heap.

Class encapsulation hides the length variability from all other parts of the system. To store information about which servers manage which regions of file-identifier space, clients use a file map, which is similar to a Sprite prefix table [43], except it operates on prefixes of file identifiers rather than of path names.

The file map is stored in an index structure adapted from Lampson et al. With respect to the count of mapped prefixes, the storage cost is linear, and the lookup time is logarithmic, thereby addressing issue 4.
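The prefix-based mapping can be sketched as a longest-prefix match over identifier sequences. This is a simplified stand-in, not the index structure the paper adapts: it probes a dictionary once per prefix length instead of achieving lookup time logarithmic in the number of mapped prefixes. The `FileMap` class, the server names, and the tuple representation of identifiers are all assumptions made for illustration.

```python
from typing import Dict, Tuple

FileId = Tuple[int, ...]  # assumed representation: a sequence of integers


class FileMap:
    """Maps file-identifier prefixes to managing servers (hypothetical)."""

    def __init__(self, root_server: str):
        # The empty prefix matches every identifier.
        self.prefixes: Dict[FileId, str] = {(): root_server}

    def delegate(self, prefix: FileId, server: str) -> None:
        # Delegating a prefix carves a subtree out of its current manager.
        self.prefixes[prefix] = server

    def lookup(self, fid: FileId) -> str:
        # Try the longest prefix first, then progressively shorter ones.
        for n in range(len(fid), -1, -1):
            server = self.prefixes.get(fid[:n])
            if server is not None:
                return server
        raise KeyError(fid)


file_map = FileMap("server-A")
file_map.delegate((1,), "server-B")
file_map.delegate((1, 3), "server-C")
```

With these delegations, identifier (1, 3, 5) resolves to server-C, (1, 2) to server-B, and (2, 7) to server-A, so a server manages its prefix except for the subtrees it has delegated away.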

For example, in Fig. This rule tends to keep files that are close in the name space also close in the identifier space, so partitioning the latter produces few cuts in the former. This minimizes the work of path resolution, which is proportional to the number of server regions along a path.

Renames can disrupt the alignment of the name and identifier spaces. For example, Fig. Unless renames occur excessively, there will still be enough alignment to keep the path-resolution process efficient. Our protocol does not support POSIX-style rename, which enables a fourth file to be overwritten by the rename, although our protocol could possibly be extended to handle this case.

These files may be managed by three separate servers, so the client must obtain leases from all three servers before performing the rename. Before the client returns its leases, thus transferring authority over the metadata back to the managing servers, its metadata update must be validated by all three servers with logical atomicity. Each follower validates its part of the rename, locks the relevant metadata fields, and notifies the leader. The leader decides whether the update is valid and tells the followers to abort or commit their updates, either of which unlocks the field.

While a field is locked, the server will not issue a lease on the field. Since a follower that starts a multi-server operation is obligated to commit if the leader commits, a follower cannot unlock a field on its own, even to time out a spurious update from a faulty client.

Instead, the leader centrally handles timeouts by setting a timer for each notification it receives from a follower. The leader, which manages the destination directory, can afford to check the one non-local condition, namely that the file being moved is not an ancestor of the destination. This check is facilitated by means of a path lease, as described in the following section. More generally, Farsite provides name-space consistency for all path-based operations.
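The leader and follower roles resemble a two-phase commit, which the following sketch illustrates under simplifying assumptions: validation is reduced to a Boolean flag, messages are plain method calls, and the leader's timers are omitted. The `Follower` class and `leader_decide` function are hypothetical names.

```python
from typing import List


class Follower:
    """A server that validates and locks its part of a rename (hypothetical)."""

    def __init__(self, name: str, locally_valid: bool):
        self.name = name
        self.locally_valid = locally_valid
        self.locked = False
        self.committed = False

    def prepare(self) -> bool:
        # Validate the local part; on success, lock the field and notify.
        if self.locally_valid:
            self.locked = True
        return self.locally_valid

    def can_issue_lease(self) -> bool:
        # No lease is issued on a field while it is locked.
        return not self.locked

    def finish(self, commit: bool) -> None:
        # Both commit and abort unlock the field; only the leader calls this.
        self.committed = commit
        self.locked = False


def leader_decide(followers: List[Follower], leader_valid: bool) -> bool:
    """Collect follower notifications, decide, and broadcast the outcome."""
    votes = [f.prepare() for f in followers]
    commit = leader_valid and all(votes)
    for f in followers:
        f.finish(commit)
    return commit
```

Because only `leader_decide` calls `finish`, a follower never unlocks on its own, which mirrors the obligation described above.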

In particular, it makes the root server responsible for providing leases to all interested parties in the system. Our solution to this problem is the mechanism of recursive path leases. Path leases are recursive, in that they are issued to other files, specifically to the children of the file whose path is being leased; a path lease on a file can be issued only when the file holds a path lease from its parent.

A path lease is accessible to the server that manages the file holding the lease, and if the file is delegated to another server, its path lease migrates with it. The recursive nature of path leases makes them scalable; in particular, the root server need only deal with its immediate child servers, not with every interested party in the system. When a rename operation causes a server to change the sequence of files along a path, the server must recall any relevant path leases before making the change, which in turn recalls dependent path leases, and so on down the entire subtree.

The time to perform a rename thus correlates with the size of the subtree whose root is being renamed. This implies that renames near the root of a large file system may take considerable time to execute.
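A small model makes the recursion and the recall cost concrete. In this sketch, the `Node` class is hypothetical, and the root is assumed to always know its own path; a file can obtain a path lease only after its parent holds one, and recalling leases from a file cascades through every descendant that holds one.

```python
class Node:
    """A file in the name-space tree with a path-lease flag (hypothetical)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        # Assumption: the root trivially holds a lease on its own path.
        self.has_path_lease = parent is None
        if parent is not None:
            parent.children.append(self)

    def acquire(self):
        """Obtain a path lease, first ensuring the parent holds one."""
        if self.has_path_lease:
            return
        self.parent.acquire()
        self.has_path_lease = True

    def recall(self):
        """Recall leases issued to descendants; return how many were recalled."""
        count = 0
        for child in self.children:
            if child.has_path_lease:
                count += child.recall()      # dependent leases go first
                child.has_path_lease = False
                count += 1
        return count


root = Node("root")
a = Node("a", root)
b = Node("b", a)
c = Node("c", b)
d = Node("d", a)
c.acquire()                   # leases flow root -> a -> b -> c
recalled = a.recall()         # renaming a must recall b's and c's leases
```

The number of recalls grows with the number of lease-holding files beneath the renamed file, which is why a rename near the root of a large tree can be slow.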

Such renames have considerable semantic impact, so this slowness appears unavoidable. Nonetheless, to plausibly argue that our system can reach such a scale, we believe it necessary to address the problem of workload hotspotting. However, even non-commutative sharing can result in hotspotting if the metadata structure induces false sharing. We avoid false sharing by means of file-field leases and disjunctive leases.

For the latter fields, since there are infinitely many potential child names, we have a shorthand representation of lease permission over all names other than an explicitly excluded set; we call this the infinite child lease.
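One way to picture this shorthand is as a complement set: the lease stores only the excluded names and covers everything else. The `InfiniteChildLease` class below is a hypothetical sketch of that idea, not Farsite's representation.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class InfiniteChildLease:
    """Lease over all child names except an explicit excluded set."""
    excluded: FrozenSet[str] = frozenset()

    def covers(self, name: str) -> bool:
        # Any name not explicitly excluded falls under the lease.
        return name not in self.excluded

    def without(self, name: str) -> "InfiniteChildLease":
        # Carve one name out, for example to lease it to another client.
        return InfiniteChildLease(self.excluded | {name})


full = InfiniteChildLease()
narrowed = full.without("report.txt")
```

After the carve-out, `narrowed` still covers every other potential child name, so the holder can keep creating files in the directory without contacting the server.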

File-field leases are beneficial when, for example, two clients edit two separate files in the same directory using GNU Emacs [36], which repeatedly creates, deletes, and renames files using a primary name and a backup name. So, to process an open correctly, a client must determine whether another client has the file open for a particular mode. For example, if some client X has a file open for read access, no other client Y can open the file with a mode that excludes others from reading.

In Farsite, this false sharing is avoided by applying disjunctive leases [12] to each of the above fields. For a disjunctive leased field, each client has a Boolean self value that it can write and a Boolean other value that it can read. The other value for each client x is defined as:

$$\mathit{other}_x = \sum_{y \neq x} \mathit{self}_y$$

where the summation symbol indicates a logical OR. When client Z opens the file for read access, it sets its self value to TRUE, but this does not change the other value that Y sees.
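The definition of the other value as the logical OR of every other client's self value can be checked with a few lines of code. The function name `other_value` and the dictionary representation are assumptions for illustration; the scenario follows the text, with client X already holding the file open for read access.

```python
from typing import Dict


def other_value(self_values: Dict[str, bool], client: str) -> bool:
    """Logical OR of the self values of every client except the given one."""
    return any(v for c, v in self_values.items() if c != client)


# Client X has the file open for read access; Y and Z do not.
self_values = {"X": True, "Y": False, "Z": False}
y_before = other_value(self_values, "Y")

# Client Z opens the file for read access and sets its self value.
self_values["Z"] = True
y_after = other_value(self_values, "Y")
x_after = other_value(self_values, "X")
```

Y's other value is TRUE both before and after Z's open, so Y does not need to be notified, whereas X's other value changes from FALSE to TRUE.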


Distributed Directory Service in the Farsite File System
