I’ve tried to match the sections and their content as closely as possible to the original paper. However, I’ve shuffled a few sections for the sake of brevity and ease of understanding.
Three types of metadata. The first two are persistent and are also replicated on remote machines; all mutations to them are logged to an operation log:
- file and chunk namespaces
- file-to-chunk mapping
- location of chunk replicas (not persistent)
In-Memory Data Structures
- Metadata is stored in memory, so master operations are fast.
- Periodic scans of the entire state are feasible because they’re efficient (this enables garbage collection and re-replication).
- Roughly 64 bytes of metadata per 64 MB chunk.
- Drawback: the master is limited by the amount of RAM. But this is easily overcome by adding more RAM.
- There is no persistent record of which chunkservers hold a chunk’s replicas. The reasoning: the chunkserver ultimately decides whether it has a chunk, since the relevant portion of its disk might have gone bad or the data might have vanished for some reason. A persistent record would only make things harder, because the master would need to be notified and updated every time the set of chunks on a chunkserver changed; with hundreds of servers, that synchronization would be a considerable burden.
- The master collects this data at startup or whenever a new chunkserver joins.
- The master monitors subsequent changes via the HeartBeat messages it exchanges with chunkservers.
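A minimal sketch of the master’s in-memory structures may make this concrete. The class and field names below are illustrative, not from the paper; the key points it captures are that file-to-chunk mappings are persistent while replica locations are rebuilt from chunkserver reports, and that per-chunk version numbers let the master ignore stale replicas:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    version: int = 0                              # bumped by the master on each new mutation lease
    locations: set = field(default_factory=set)   # chunkserver ids; NOT persisted

class Master:
    def __init__(self):
        # Persistent (every mutation is logged to the operation log):
        self.file_to_chunks = {}   # file path -> [chunk_handle, ...]
        # Volatile: chunk handle -> ChunkInfo; locations rebuilt at runtime
        self.chunks = {}

    def handle_heartbeat(self, server_id, reported):
        """reported: {chunk_handle: version held by that chunkserver}."""
        for handle, version in reported.items():
            info = self.chunks.setdefault(handle, ChunkInfo())
            if version < info.version:
                # Stale replica: don't record it; it is a garbage-collection candidate.
                continue
            info.locations.add(server_id)
```

Losing the master’s memory is then not a disaster for replica locations: a fresh round of HeartBeat reports repopulates them.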
Operation Log
- The operation log is a record of important metadata changes.
- It defines the order of concurrent operations, maintaining a logical timeline.
- Files and chunks, and their versions, are identified by the logical times at which they were created. (The master maintains a version number for each chunk to distinguish up-to-date replicas from stale ones. Whenever the master grants a client permission to mutate a chunk, it bumps the version number and informs all up-to-date replicas. If a chunkserver is down, it misses the version update; when it comes back up, the master can garbage-collect its copy of that chunk, since it carries an older version.)
- The log must be stored reliably, and changes must not become visible to clients until they are persistent. Since the log is critical, it is replicated across machines. Log records are batched before being flushed to disk, on the local master as well as the remote machines, to reduce the overhead.
- The master can restore its state by replaying these logs. Even during recovery, when log records are read back from disk, they are processed in batches to reduce the impact on recovery time.
- When the log grows beyond a threshold, the master creates a checkpoint (a compact B-tree-like structure that maps files to chunks and requires no extra parsing) and stores it to disk.
- Incoming requests to mutate files should not be blocked while the master is creating a checkpoint. To handle this, the master first switches to a new log file and then builds the checkpoint in a separate thread. The new checkpoint includes all mutations before the switch, since it is built from the state covered by the old log. Creating such a checkpoint takes a minute or so for a cluster with a few million files.
- Recovery needs only the latest complete checkpoint and the log files written after it. Older checkpoints and log files can be deleted (though Google keeps some around as a safeguard against disasters).
- The master’s recovery code detects and skips incomplete checkpoints, so a failure during checkpointing is handled safely.
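The recovery path described above can be sketched in a few lines. This is a toy model, not the paper’s format: checkpoints are dicts carrying a `complete` flag (standing in for an end-of-file marker that exposes incomplete checkpoints) and a sequence number saying how much of the log they already reflect:

```python
def latest_complete_checkpoint(checkpoints):
    """checkpoints: list of dicts, oldest first. Incomplete ones are skipped."""
    for cp in reversed(checkpoints):
        if cp.get("complete"):
            return cp
    return None

def recover(checkpoints, log_records):
    """Rebuild file->chunks state: load the newest complete checkpoint,
    then replay only the log records that came after it."""
    cp = latest_complete_checkpoint(checkpoints)
    state = dict(cp["state"]) if cp else {}
    start = cp["seq"] if cp else 0
    for rec in log_records:
        if rec["seq"] <= start:
            continue                  # already reflected in the checkpoint
        if rec["op"] == "add_chunk":
            state.setdefault(rec["file"], []).append(rec["chunk"])
    return state
```

With no usable checkpoint, recovery degenerates to replaying the whole log, which is exactly why checkpoints exist: they bound replay time.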
As mentioned earlier, GFS opts for weak consistency for performance and simplicity: different clients may see different data for short periods of time, though the designers ensured this doesn’t drastically affect applications.
Let’s start by explaining a few terms for those who aren’t familiar with GFS. The terms consistent, defined, undefined and inconsistent describe the state of a file region after a mutation.
- Consistent - A file region is consistent if all the clients see the same data no matter which chunkserver they read from.
- Defined - A region is defined after a mutation if clients can see that mutation in its entirety. Say one client appends data and no other client is mutating that region concurrently; the region is then defined. By implication, a defined region is also consistent.
- Undefined - When many clients mutate a region concurrently and all succeed, all clients see the same data (so the region is consistent), but the data may be mingled fragments from the different mutations, so no single client’s write is visible in its entirety.
A failed mutation leaves the region inconsistent (and therefore undefined): different clients may see different data.
Mutations are either writes (at an application-specified offset) or record appends to existing files.
- Mutations are applied in the same order to all the replicas.
- Stale replicas do not participate in mutations (recall version numbers for each chunk).
- For writes, GFS lets the client choose the file offset it sees fit. For appends, however, clients cannot pick the offset: GFS chooses the offset at which the record is appended and returns it to the client (GFS may insert some padding to take care of concurrent appends to the file).
- Some data might become inconsistent in the process (many writers writing to the same region), but the amount of such data pales in comparison to the total. Applications, which normally work on the entire data set, can still operate unaffected.
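The append path above can be illustrated with a small sketch. This is a simplified, hypothetical model of the primary replica’s offset choice, not the actual GFS protocol: the server, not the client, picks the offset, and a record that would straddle the chunk boundary causes the remainder to be padded so the client retries on a fresh chunk:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

class Chunk:
    def __init__(self):
        self.data = bytearray()

def record_append(chunk, record, max_size=CHUNK_SIZE):
    """Server-side record append (simplified). Returns the chosen offset,
    or None if the chunk was padded out and the client must retry on a
    new chunk."""
    if len(chunk.data) + len(record) > max_size:
        # Pad to the chunk boundary; this padding is what readers must skip.
        chunk.data.extend(b"\x00" * (max_size - len(chunk.data)))
        return None
    offset = len(chunk.data)
    chunk.data.extend(record)
    return offset
```

The padding bytes are the inconsistent filler the post mentions: they occupy space in some replicas but carry no application data.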
How are data inconsistencies dealt with?
- Recall that clients cache chunk location data, so a client might read from a stale replica. When this happens, it may not see the complete file, but it doesn’t see outdated or wrong data either. Moreover, the cache timeout limits the window: when the cache entry expires and the master is contacted again, the client gets up-to-date information.
- Applications (i.e. clients) can distinguish between defined and undefined regions.
- GFS identifies stale chunkservers through regular handshakes (recall HeartBeats) and ensures data integrity by checksumming.
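The cache-timeout argument can be sketched as a small client-side cache. The class and parameter names are illustrative, and the TTL value is an assumption, but it shows how expiry bounds how long a client can keep reading from replicas the master no longer considers current:

```python
import time

class ChunkLocationCache:
    """Client-side cache of chunk locations (a sketch). Entries expire
    after a timeout, forcing a refresh from the master."""
    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}   # (file, chunk_index) -> (locations, expiry)

    def get(self, file, chunk_index, fetch_from_master):
        key = (file, chunk_index)
        entry = self._entries.get(key)
        if entry and entry[1] > self.clock():
            return entry[0]               # still fresh: skip the master
        locations = fetch_from_master(file, chunk_index)
        self._entries[key] = (locations, self.clock() + self.ttl)
        return locations
```

Injecting the clock as a parameter keeps the sketch testable; a real client would also invalidate entries when a chunkserver reports an error.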
Also keep in mind that clients have to adapt their behaviour to ensure the overall success of the distributed file system.
- Mutations only by appending. No overwriting of existing data in chunks.
- If a writer (a client writing to the file system) generates a new file, it either renames the file to a permanent name once it’s done, or periodically checkpoints how much has been successfully written. These checkpoints can also include checksums or hashes.
- Readers (clients reading the data) only read data up to the latest checkpoint.
- Client-side checkpointing also ensures that readers don’t read data that writers consider incomplete. Say the writer creates a checkpoint (indicating it has finished writing up to that point) and then continues appending (still in progress); the reader reads only up to the checkpoint.
- Clients have to deal with duplicates (GFS promises to append at least once, so it might append a record more than once) and padding (GFS may or may not let a writer append immediately after the current end of file). Writers can prefix each record with extra information (special bits or patterns) marking the start of a valid record, so readers can detect and skip padding. A unique identifier per record lets readers drop records whose identifiers they’ve already encountered.
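The techniques in the last bullet are commonly called self-validating, self-identifying records. A possible encoding, purely as an illustration (the magic marker, header layout and checksum choice are all assumptions, not the paper’s format), might look like this:

```python
import hashlib
import struct

MAGIC = b"RECD"  # hypothetical marker for the start of a valid record

def encode_record(record_id: int, payload: bytes) -> bytes:
    """Frame a record with a magic marker, unique id, length and checksum
    so readers can skip padding and drop duplicates."""
    header = MAGIC + struct.pack(">QI", record_id, len(payload))
    checksum = hashlib.sha256(payload).digest()[:8]
    return header + checksum + payload

def decode_records(blob: bytes):
    """Yield (record_id, payload) pairs, skipping padding, corrupt
    fragments, and records whose id has already been seen."""
    seen = set()
    i = 0
    while i < len(blob):
        if blob[i:i+4] != MAGIC:
            i += 1                        # padding or garbage: scan forward
            continue
        record_id, length = struct.unpack(">QI", blob[i+4:i+16])
        checksum = blob[i+16:i+24]
        payload = blob[i+24:i+24+length]
        if hashlib.sha256(payload).digest()[:8] != checksum:
            i += 1                        # corrupt fragment: keep scanning
            continue
        i += 24 + length
        if record_id in seen:
            continue                      # duplicate from an append retry
        seen.add(record_id)
        yield record_id, payload
```

A duplicated append simply produces a second framed copy of the same record id, which the reader silently drops; padding bytes never match the magic marker and are scanned past.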
This part should have given you a hint of what happens under the hood. I’ll cover more details about the read/write flow in the next part.