A Critique of Seltzer's 1993 USENIX Paper
John Ousterhout / john.ousterhout@eng.sun.com
The paper "An Implementation of a Log-Structured File System For UNIX",
by Margo Seltzer et al., appears in the Winter 1993 USENIX Conference
Proceedings (pages 307-326). The paper describes a new implementation
of a log-structured file system (LFS) in the BSD UNIX kernel, then
presents a performance comparison between three file systems: the new
LFS implementation; FFS, the traditional BSD file system; and EFS, a
modification of FFS that supports extent-based allocation.
The BSD implementation of LFS differs from the original Sprite
implementation in several ways, some of which are definite improvements.
The most notable is a new approach to cleaning
that uses optimistic concurrency control and a user-level cleaner;
I believe this is a much better way to do cleaning than the approach
used in Sprite, and we adopted Seltzer's approach in Sprite's Zebra
file system.
However, several of the differences between BSD-LFS and the Sprite
implementation are steps backwards (see below for details).
More importantly, the performance measurements in Section 5 of the paper
are seriously flawed. The comparisons between BSD-LFS, FFS, and
EFS suffer from three general problems: a bad implementation of
BSD-LFS, a poor choice of benchmarks, and a poor analysis of the
benchmarks. Together, these problems invalidate many of
the paper's conclusions; LFS is a much better file system architecture
than this paper suggests.
Fortunately, a newer paper by Seltzer in the Winter '95 USENIX,
"File System Logging Versus Clustering: A Performance Comparison",
corrects many of the flaws in the 1993 paper.
Although the newer paper also has flaws,
its conclusions are much closer to the truth.
Thus I recommend that the quantitative results of the 1993 paper be
disregarded in favor of those in the 1995 paper.
The sections below elaborate on the three general problems with
Seltzer's implementation and measurements.
Poor BSD-LFS Implementation
At the time this paper was written, the BSD-LFS implementation
contained a number of flaws that affected its performance:
-
The system did not implement fragments, so the smallest block size
was 4 KB or 8 KB, compared to 512 or 1024 bytes for EFS and FFS. This
resulted in an unnecessary 4-8x reduction in LFS performance for
small files (the sketch after this list works through the arithmetic).
-
BSD-LFS contained a bug causing it to lay out segments backwards
on disk. This hurts sequential read performance
by wasting almost a full disk rotation between each block of a
file and the next. The penalty is most noticeable
for medium-sized files.
-
BSD-LFS stores the access time in inodes, rather than putting it
in the inode map as Sprite LFS does. This means that the inode must be
rewritten (to the head of the log) each time the file is read, so the
inode tends to migrate away from the file's data on disk, causing long
seeks during read accesses.
-
BSD-LFS flushes indirect blocks and inodes to disk unnecessarily,
resulting in excess writes and increased cleaner overhead
(this problem is described in more detail in my
critique of the 1995 USENIX paper).
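To make the first of these concrete, here is a minimal sketch of the
arithmetic behind the 4-8x figure. The block and fragment sizes are the
ones quoted above; the file sizes are examples of mine, and the sketch
assumes, as a simplification, that small-file write cost scales with
the amount of space allocated.

    /* Sketch: write amplification for small files when the smallest
     * allocation unit is a full 8 KB block (BSD-LFS without fragments)
     * versus a 1 KB fragment (FFS/EFS).  A simplified model, not code
     * from either system.
     */
    #include <stdio.h>

    static int allocated(int file_bytes, int unit) {
        /* Round the file size up to a whole number of allocation units. */
        return ((file_bytes + unit - 1) / unit) * unit;
    }

    int main(void) {
        int sizes[] = { 512, 1024, 2048, 4096 };   /* example small files */
        int lfs_unit = 8192;                       /* no fragments */
        int ffs_unit = 1024;                       /* 1 KB fragments */
        for (int i = 0; i < 4; i++) {
            int lfs = allocated(sizes[i], lfs_unit);
            int ffs = allocated(sizes[i], ffs_unit);
            printf("%4d-byte file: LFS writes %4d bytes, FFS/EFS %4d (%dx)\n",
                   sizes[i], lfs, ffs, lfs / ffs);
        }
        return 0;
    }

For files of a kilobyte or less the penalty is the full 8x; it tapers
off as file sizes approach the block size.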
The first two of these problems were fixed for Seltzer's 1995 paper,
while the other two are still present in the system. I could
not find any mention of these problems in the 1993 paper, yet they
invalidate much of the data in Figures 9 and 12 and affect the other
measurements as well.
Poor Benchmark Choice
The goal in choosing benchmarks should be to find ones that reflect
as closely as possible the real-world usage of a system. This is
a difficult task for file systems and I don't know of any "perfect"
file system benchmarks, but the selection for the paper is particularly
bad:
-
The paper does not evaluate small-file performance at all, even though
most files in real-world applications are small. For example, the
region for files smaller than 100 Kbytes is barely visible at the left
edges of Figures 9 and 12. The software development workload does use
small files, but it is mostly CPU bound, so it doesn't demonstrate
differences in file system performance for small files. LFS is at its
best for small files, so by omitting small-file benchmarks the paper
biases the results against LFS.
-
Section 5.2.1, on raw file system performance, is based on a most
unusual benchmark: it overwrites existing file data rather than creating
new data, and it does the writes synchronously. Furthermore, it appears
not to include the time to open the file being read or
written, and it apparently accesses a single file repeatedly rather than
a collection of files (a sketch of this workload appears after this
list). It is difficult to imagine a real application that would generate
this workload, and the benchmark certainly doesn't reflect typical
behavior. The data in Figures 9 and 12, where EFS appears to be
uniformly better than LFS, are an artifact of the bad benchmark and the
bugs in the BSD-LFS implementation.
In the '95 USENIX paper some of the BSD-LFS bugs have been fixed and
a different benchmark is used, which is much more representative of
actual file system use; in that paper LFS performs substantially
better than EFS or FFS.
-
The Andrew benchmark used in Section 5.2.2 is mostly CPU bound,
so it doesn't provide much information on file system performance.
The multi-user version of the benchmark seems totally artificial
to me, since a single user is unlikely to run multiple compilations
simultaneously.
-
All of the benchmarks use a file system cache of only 1 Mbyte.
Considering that the machine had a total of 16 Mbytes of memory,
this is a relatively small cache, and it hurt the transaction
processing results in Section 5.2.3. LFS was designed for systems that
cache aggressively, so restricting the cache size biases against
LFS. The 1995 USENIX paper used a
more realistic cache size, and as a result LFS performed much better
relative to EFS in the transaction processing tests.
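To make clear how odd the Section 5.2.1 workload is, here is a rough
reconstruction of the write benchmark as the paper describes it: the
same pre-existing file is overwritten in place, every write is
synchronous, and the open happens outside the measured loop. The file
name, file size, transfer size, and pass count below are placeholders
of mine, not values from the paper.

    /* Approximate reconstruction of the Section 5.2.1 write workload. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define FILE_SIZE (4L * 1024 * 1024)   /* hypothetical file size */
    #define XFER      8192                 /* hypothetical transfer size */
    #define PASSES    10                   /* same single file, repeatedly */

    int main(void) {
        char *buf = calloc(1, XFER);
        int fd = open("testfile", O_WRONLY | O_SYNC);  /* open is not timed */
        if (fd < 0 || buf == NULL)
            return 1;
        for (int pass = 0; pass < PASSES; pass++) {
            lseek(fd, 0, SEEK_SET);
            /* Overwrite existing data in place; O_SYNC forces every
             * write to disk before the call returns. */
            for (long off = 0; off < FILE_SIZE; off += XFER)
                write(fd, buf, XFER);
        }
        close(fd);
        return 0;
    }

A workload like this plays to the strengths of an in-place, extent-based
file system and against a log-structured one, which is part of why
Figures 9 and 12 look the way they do.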
In discussions about the paper, Seltzer and one of her co-authors
admitted that they intentionally chose benchmarks that were
unfavorable to LFS and omitted benchmarks where LFS performed best.
They justified this by claiming that the original LFS
paper by Rosenblum and myself used benchmarks that favored
LFS and they were trying to present the other side of the story.
Although I disagree with this assessment of Rosenblum's and my paper,
the proper response is to present a fair set of benchmarks, not an
unfair set biased the other way.
Poor Analysis
Unfortunately, the analyses in the paper do not address the
deficiencies in the BSD-LFS implementation and the benchmarks.
As a result, readers are left with the conclusion that LFS is
architecturally inferior to EFS. In addition, several important
factors are omitted from the discussions of the benchmarks. Here
are some examples of problems:
-
The paper does not state clearly enough that the measurements
in Figure 9 are entirely an artifact of a bad benchmark and bugs
in the LFS implementation. There is some explanation in the text,
but it is hard to follow, and the figure caption gives no
indication that the results are meaningless. Readers will remember
the figure, which suggests a 1.5-2x advantage for EFS, not the
fine print in the text, which states that the two systems are in
fact equal.
-
The explanation for Figures 9 and 12 doesn't discuss the impact
of a 56-Kbyte I/O limit, which accounts for most of the performance
differences at large transfer sizes.
-
Figure 14 is a blow-up of the Andrew benchmark measurements,
and the paper elaborates on this blow-up to "explain" the performance
differences. In fact the differences at high degrees of multiprogramming
are statistically insignificant: a class project at Berkeley
repeated these measurements and found that the results vary enough
from run to run to cover all of the differences in the figure
(the sketch after this list shows the kind of significance check
that is needed).
-
In the discussion of transaction performance, the paper does not
satisfactorily explain the cleaning performance for LFS. For example,
it measures only a single disk utilization, so it isn't possible to
see how LFS performance varies with utilization.
-
The descriptions of the benchmarks are not sufficient to
reproduce the measurements. For example, it isn't mentioned in
Section 5.2.1 that the same file is used over and over again, and
Section 5.2.3 doesn't say what the disk utilization was for LFS.
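On the Figure 14 point, a simple run-to-run check is all that is
needed to see the problem: compare the gap between two systems' mean
running times to their standard deviations across repeated runs. The
sketch below does this; the timings in it are invented for
illustration, not measurements from the paper or the Berkeley project.

    /* Sketch: is a measured difference bigger than run-to-run noise? */
    #include <math.h>
    #include <stdio.h>

    static void stats(const double *x, int n, double *mean, double *sd) {
        double sum = 0.0, ss = 0.0;
        for (int i = 0; i < n; i++)
            sum += x[i];
        *mean = sum / n;
        for (int i = 0; i < n; i++)
            ss += (x[i] - *mean) * (x[i] - *mean);
        *sd = sqrt(ss / (n - 1));   /* sample standard deviation */
    }

    int main(void) {
        double lfs[] = { 61.2, 63.8, 59.9, 62.5 };   /* invented run times */
        double efs[] = { 62.0, 60.4, 63.1, 61.7 };
        double m1, s1, m2, s2;
        stats(lfs, 4, &m1, &s1);
        stats(efs, 4, &m2, &s2);
        printf("LFS %.1f +/- %.1f s, EFS %.1f +/- %.1f s\n", m1, s1, m2, s2);
        /* If the gap between means is within a standard deviation or so,
         * the difference is noise; that is the situation in Figure 14. */
        return 0;
    }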
Conclusion
Of all the performance numbers in the paper, only the single-user
Andrew performance numbers are particularly useful. All of the
other measurements are inadequate for one of the reasons given
above. I recommend that readers ignore the measurements in this
paper and use instead those in Seltzer's 1995 USENIX paper, which
are better designed and reflect bug fixes in BSD-LFS.
The paper concludes that "FFS (with read and write clustering)
provides comparable and sometimes superior performance to our LFS".
This may have been true for certain benchmarks on the flawed version
of BSD-LFS that existed when the paper was written, but it is not
true for realistic benchmarks on a well-tuned LFS implementation.
Seltzer's newer paper shows that in fact BSD-LFS is much faster than
EFS over a wide variety of operating conditions (as much as an order
of magnitude in places), and even at its worst it is only a few
percent slower than EFS.