Does NetApp's Write Anywhere File Layout (WAFL) suffer the same type of garbage collection issues that traditional log structured file systems (e.g. STK's Iceberg) or flash transition layers encounter? Why or why not?
- Here's some background on LSF systems.
There has been substantial marketing FUD aimed at WAFL performance over the years, claiming that NetApp arrays degrade over time due to the nature of WAFL. For example, this performance analysis from HP was posted on Wikibon today by Calvin Zito.
In response to claims that WAFL performance degrades over time, NetApp has published this summary of a 48-hour sustainability test under the SPC benchmark.
Thanks to Val Bercovici for providing these additional useful links in response to Wikibon's request to help shed light on this issue:
These links provide insight into the challenges of managing LSF performance in general and WAFL specifically. The first analysis points out some of the nuances of managing WAFL in a world where applications assume data must be laid out on disk sequentially, the predominant approach for the vast majority of disk arrays.
Essentially, NetApp uses historical increases in processing power, intelligent algorithms, and its fifteen years of experience managing this issue to minimize the problem and turn it into an advantage, allowing NetApp, for example, to enable thin provisioning during benchmarks, something rarely done by vendors.
The second link provides insight into how WAFL minimizes the need for write cache in RAID operations. Importantly, it also confirms that WAFL, like other LSF systems, suffers performance degradation as the array's storage utilization increases. The following statement applies to WAFL and, to Wikibon's knowledge, to every LSF system:
...WAFL will do very little work when the array has plenty of free space and therefore perform much better. A side effect, though, is that as the array free space shrinks, WAFL will have to do more work to find free space, and that more work will translate into lower performance.
Wikibon's Take
It seems conceptually clear that WAFL, which performs writes to any available free space, must at some point perform so-called garbage collection to re-organize and re-allocate free space. When an array that uses this technique is less full, users will see no performance degradation because there is plenty of free space available for writes. As the array's capacity becomes increasingly utilized, the array must work harder to find free space, and this will negatively impact performance, as NetApp itself admits.
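This relationship can be illustrated with a toy model from the classic log-structured file system literature (this is a generic LSF cleaning model, not NetApp's actual algorithm): reclaiming a region of the log at average utilization u requires copying its u live blocks elsewhere to free the remaining (1 - u), so the copying work per freed block grows as u/(1 - u) and explodes as the array approaches full.

```python
# Toy LSF cleaning model (illustrative only; not NetApp's implementation).
# Cleaning a log segment at utilization u copies u live blocks aside and
# frees (1 - u) blocks, so work per freed block is u / (1 - u).

def cleaning_cost(utilization: float) -> float:
    """Live blocks copied per block of free space reclaimed."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization)

for u in (0.10, 0.50, 0.80, 0.90, 0.95):
    print(f"{u:.0%} full -> {cleaning_cost(u):.1f} live blocks copied per freed block")
```

At 50% utilization the cleaner copies one live block per block freed; at 95% it copies nineteen, which is the intuition behind the performance curve NetApp's statement describes.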
Our understanding is that NetApp performs free space collection as a background task in an effort to minimize performance degradation. However, as the array becomes more full, the window for free space management shrinks and could negatively impact performance.
Competitive vendor analyses and subsequent NetApp responses (e.g. its 48-hour sustainability test) attempt, respectively, to highlight or refute the claim that WAFL has a fragmentation issue. Perhaps a more interesting benchmark would show the performance of a WAFL-based array at utilization levels ranging from very low to very high to assess the performance impact.
The bottom line is that there are always trade-offs in any technology choice. NetApp's WAFL innovation brings dramatic simplicity and very often efficiency benefits to users. Ironically, however, in certain workloads and environments, some of these benefits will be negated by the fact that as a WAFL array fills up, users will need to allocate more headroom to maintain consistent performance.
The reality is that many NetApp customers will never see this problem as they're not running high volume transaction environments where performance is king. Nonetheless, users should be aware of the inverse relationship between capacity and performance for any storage array, not just WAFL-based arrays. By understanding this relationship and the value of applications running on the arrays, trade-offs can be assessed and ROI maximized.