The Google File System has inspired many Distributed File Systems (Big Table and HDFS to name a few). It’s interesting to note that GFS gives up strong consistency (it promises to be eventually consistent) for the sake of simplicity and performance. Also GFS was developed for certain kinds of applications (example mapreduce) and IS NOT suitable for most applications. Moreover Google has replaced GFS with Colossus as a response to changing needs.
Need Case and Assumptions
- Hundreds or thousands of inexpensive computers accessed by a large number of clients. One Master Server. Components fail often.
Multi GB files are common. Millions of 100MB sized files. Each file contains application objects like web documents (my understanding is that the files are similar to zipped files. One large file containing many small files). Arguments for large files
- Fewer large sized files are better than billions of Kilobyte sized files (storing metadata and file mappings becomes a pain as we will see later on)
- Redesign IO operations and block sizes due to the specific needs of the applications
- Most of the file mutations append data. Random writes in a file is almost zero.
Once a file is written, it is mostly read. Most of the times the reads are sequential. Surprisingly a large number of applications can benefit from such an assumption. Data analysis applications passing through large chunks of data, mapreduce applications reading large files during reduction stage etc. Therefore there are two types of read operations.
- Large Streaming Reads - Reads 100s of KBs of data sequentially.
- Random Reads - Reads a few KBs at some offset of a file.
Applications & GFS have to complement each other’s design. This increase flexibility.
- Normally clients requesting data cache it. Clients using GFS should not cache it as the data is large and might be subject to change. (Clients only cache metadata like location to a particular file on a chunk)
- Familiar but does not implement POSIX.
- Create, Read, Open, Delete.
- Snapshot and Record Append.
- Record Append supports simultaneous appends from multiple clients without blocking.
- 1 master many chunkservers
- Uniquely identified by 64 bit integers
- Store files on local disk as Linux files
- Replicated. Normally 3 times (can be configured)
- They take care of reading and writing data. The master gives the client the chunkserver location and the file byte range.
- HeartBeat messages - Periodic communication with chunkservers. Gives instructions and finds out more about their state.
- Metadata (map individual file to a chunk and byte range, location of each individual chunk, replicas store etc)
- Simplified actions when it comes to file mapping and replication. Single leaders and masters are known to reduce complexity (Eg Raft Consensus Algorithm).
- Minimum involvement with the clients. Say the client wants to read or write to a file, it gets the location from the master and deals with that chunk server. Also this metadata (i.e server location (with details of their replicas( and byte offset for a particular file) is cached. So the client needn’t bother the master frequently for metadata.
Chunk or Block Size
- Key design parameter. They’ve designed each block size to use 64MB. To help you contrast :- A normal linux system uses 4KB as it’s block size.
- Larger block size reduces client master interaction. It can get the required details about file locations in one request. It is not uncommon for clients to cache metadata for multi TB datasets hence even random reads can be performed without the involvement of the master.
- Lazy space allocation is used as a counter measure to internal fragmentation. Read this StackOverflow question for more details. You can also read about Sparse Files.
- Since clients are probably going to read data from a chunkserver for a long period of time (remember the base assumption, files are large and data is read sequentially), it can make use of a persistent TCP connection.
- Small files can turn the chunkserver which house them into hot spots (active with lots of incoming requests). This problem can be alleviated by having more replicas of such files. This StackOverflow discussion might be of use. The requests are therefore distributed over a larger number of servers.
- They had an executable hosted (it fit in a single block). It was used by hundreds of machine on startup at the same time and this created problems. They solved this issue by increasing the replication factor and staggering the application start times (instead of starting all the applications that needed the file at once, they spread it out to reduce the number of concurrent requests). This is another such occassion where they’ve resorted to making modifications to the client to keep the architecture simple.
I’ll cover the remaining sections in the future :)