Irregularities in File-Size Distributions
Kylie M. Evans and Geoffrey H. Kuenning


In simulation, modeling, analysis, and design of file systems, it is often necessary to understand how the sizes of files are statistically distributed. Previous researchers have usually hypothesized that file sizes follow a lognormal pattern. Although the lognormal distribution has quite consistently failed to fit observed data, it is closer than any other simple and well-known distribution, and thus has been used by default.

However, the failure to fit has been troubling. In addition, the advent of multimedia files (such as MP3s and video material) has the potential to significantly change previously observed distributions.

The Study

To investigate the effect of multimedia on file sizes, and to better quantify the observed distributions, we undertook a study of college- and student-owned machines at Harvey Mudd College. We collected file-size data from a number of machines, fitted curves to the data, and characterized the results.

Our raw data will soon be available for download for use by other researchers.

Our paper on the study appeared in SPECTS 2002 and is available for download.

Curve Shapes

Because of the wide range of file sizes and the shape of the observed distributions, we worked with the logarithm of the file size rather than the size itself. This transformation has the useful property that a lognormal distribution is converted to a normal (Gaussian) distribution, simplifying modeling.
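As a minimal sketch of this transformation (not the code used in the study), the following Python takes a list of file sizes in bytes, converts them to logarithms, and estimates the mean and standard deviation of the resulting values. The base-2 logarithm, the function names, the sample sizes, and the decision to skip zero-byte files (whose logarithm is undefined) are all assumptions made for illustration.

```python
import math
import statistics


def log_sizes(sizes):
    """Return log2 of each file size in bytes.

    Zero-byte files are skipped because log(0) is undefined; how to
    treat them is an assumption of this sketch, not something the
    study specifies.
    """
    return [math.log2(s) for s in sizes if s > 0]


def fit_normal_to_logs(sizes):
    """Estimate the mean and standard deviation of the log2 sizes.

    If the sizes were truly lognormal, these two numbers would fully
    describe the normal distribution of the logarithms.
    """
    logs = log_sizes(sizes)
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs)
    return mu, sigma


# Hypothetical sample of file sizes in bytes.
sample_sizes = [0, 128, 256, 256, 512, 1024, 4096, 65536]
mu, sigma = fit_normal_to_logs(sample_sizes)
print(f"mean of log2(size) = {mu:.2f}, std dev = {sigma:.2f}")
```

Working in log space keeps a handful of very large files from dominating the summary statistics, which is one practical reason for the transformation.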

Contrary to the findings of other researchers, we found that a lognormal curve could not reasonably explain the observed data. In particular, the kurtosis ("peakiness") of the distribution was high, indicating that the observed distribution was narrower, with a taller peak, than a lognormal distribution. Instead, we chose the lambda distribution to account for the observed kurtosis (and occasional skew) in the data.
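To illustrate what is being measured here, the sketch below computes the sample skewness and excess kurtosis of a list of values, such as log-transformed file sizes, using the standard moment-based definitions. A normal distribution has an excess kurtosis of 0, so a clearly positive value suggests a taller peak than the normal curve allows. The function names and sample data are hypothetical, and this is not the fitting procedure used in the study.

```python
import statistics


def central_moment(values, k):
    """Return the k-th central moment of the values (population form)."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2^1.5 (0 for symmetric data)."""
    m2 = central_moment(values, 2)
    return central_moment(values, 3) / m2 ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2^2 - 3 (0 for a normal)."""
    m2 = central_moment(values, 2)
    return central_moment(values, 4) / m2 ** 2 - 3.0


# Hypothetical log2 file sizes: most values sit at the center, with a
# few far from it, which gives a sharp peak and positive excess kurtosis.
peaked_logs = [10.0] * 8 + [6.0, 14.0]
print("skewness:", skewness(peaked_logs))
print("excess kurtosis:", excess_kurtosis(peaked_logs))
```

These estimators are the simple biased forms; a statistics package would typically offer bias-corrected variants as well.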


Perhaps the most interesting result of our study is that real-world machines have significant irregularities in their distributions. For example, one machine studied had its primary mode at about 256 bytes, caused by a large number of HTML files produced by a Macintosh PowerPoint presentation. Another machine had a significant secondary mode due to MP3 files.

Previous studies have dealt with these irregularities either by ignoring them completely or by averaging large numbers of machines to smooth out the variation. However, we believe that these techniques mask the behavior observed in real-world file systems, and suggest instead that modeling techniques should be developed to account for these deviations from mathematically elegant distributions.
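One simple way to see such irregularities on a single machine, rather than averaging them away, is to histogram the file sizes by power of two and look for more than one local peak. The sketch below is only an illustration under stated assumptions: the power-of-two binning, the 5% threshold for a bin to count as a mode, the function names, and the sample sizes are all hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def log2_histogram(sizes):
    """Count files per power-of-two bin: bin b holds sizes in [2^b, 2^(b+1)).

    Zero-byte files are skipped (an assumption of this sketch).
    """
    return Counter(math.floor(math.log2(s)) for s in sizes if s > 0)


def find_modes(counts, min_fraction=0.05):
    """Return bins that are local peaks holding at least min_fraction of files.

    A bin is a local peak if it has more files than the bin below it
    and at least as many as the bin above it.
    """
    total = sum(counts.values())
    modes = []
    for b in sorted(counts):
        c = counts[b]
        if c < min_fraction * total:
            continue
        if c > counts.get(b - 1, 0) and c >= counts.get(b + 1, 0):
            modes.append(b)
    return modes


# Hypothetical machine: many small files plus a cluster of
# multi-megabyte files, such as audio.
sizes = [300] * 50 + [600] * 20 + [1200] * 10 + [2_000_000] * 5 + [4_000_000] * 30
for b in find_modes(log2_histogram(sizes)):
    print(f"mode in bin 2^{b} to 2^{b + 1} bytes")
```

A real analysis would likely need finer bins and some smoothing, since a single power-of-two bin can hide or merge nearby peaks.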

More information is available from our SPECTS 2002 paper on the project.

This page maintained by Geoff Kuenning.