TreeFix: Statistically Informed Gene Tree Error Correction Using Species Trees
Yi-Chieh Wu,
Matthew D. Rasmussen,
Mukul S. Bansal, and
Manolis Kellis.
Systematic Biology. 2013. doi: 10.1093/sysbio/sys076
Address correspondence to: Yi-Chieh Wu (yjw at mit.edu) and Manolis Kellis (manoli at mit.edu)
Additionally, if you use the default module for computing the test statistic, please cite:
Stamatakis, A. RAxML-VI-HPC: Maximum Likelihood-based Phylogenetic Analyses
with Thousands of Taxa and Mixed Models. Bioinformatics 22(21):2688-2690, 2006.
TreeFix is a phylogenetic method for improving gene tree
reconstructions using a test statistic for likelihood equivalence and
a species tree aware (reconciliation) cost function.
The default TreeFix package is meant for use in eukaryotic genomes.
For prokaryotes, use TreeFix-DTL.
The TreeFix package includes the Python source code, modified RAxML
source code, as well as several library interfaces for Python.
A detailed README and sample dataset are also included.
- Python (2.5.4 or greater): http://python.org
- C compiler (gcc)
- SWIG (1.3.29 or greater): http://www.swig.org
- Numpy (1.5.1 or greater): http://www.numpy.org
If Numpy is not found, TreeFix uses Python's built-in 'random' module.
- Scipy (0.7.1 or greater): http://www.scipy.org
If Scipy is not found, TreeFix uses internal libraries to approximate
the normal distribution (so p-values may be slightly off).
- Additionally, Python modules are required for computing
(1) the p-value for likelihood equivalence and
(2) the reconciliation cost.
By default, TreeFix computes p-values based on the Shimodaira-Hasegawa (SH) test statistic with RAxML site-wise likelihoods.
This is included in the main TreeFix package.
Modules based on other phylogenetic programs or using other test statistics may be added in the future.
For more test statistics, see CONSEL.
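To illustrate the idea behind an SH-style p-value computed from site-wise likelihoods, here is a minimal sketch in pure Python. It follows the general SH procedure with RELL resampling (resample sites, center each tree's bootstrap totals, compare against the observed difference from the best tree). It is not the code used by TreeFix or RAxML; the function name `sh_test`, its parameters, and the input layout (one list of site-wise log-likelihoods per tree) are assumptions made for illustration.

```python
import random


def sh_test(site_lnl, n_boot=1000, seed=0):
    """Sketch of an SH-style test (hypothetical helper, not TreeFix's API).

    site_lnl: list of per-tree lists of site-wise log-likelihoods,
              all of the same length.
    Returns one p-value per tree.
    """
    rng = random.Random(seed)
    n_trees = len(site_lnl)
    n_sites = len(site_lnl[0])

    # Observed statistic: how far each tree is below the best tree.
    totals = [sum(row) for row in site_lnl]
    best = max(totals)
    observed = [best - t for t in totals]

    # RELL bootstrap: resample site indices and re-sum the log-likelihoods.
    boot = []
    for _ in range(n_boot):
        idx = [rng.randrange(n_sites) for _ in range(n_sites)]
        boot.append([sum(row[i] for i in idx) for row in site_lnl])

    # Center each tree's bootstrap totals at zero.
    means = [sum(b[t] for b in boot) / float(n_boot) for t in range(n_trees)]

    # p-value: fraction of replicates whose statistic reaches the observed one.
    counts = [0] * n_trees
    for b in boot:
        centered = [b[t] - means[t] for t in range(n_trees)]
        top = max(centered)
        for t in range(n_trees):
            if top - centered[t] >= observed[t]:
                counts[t] += 1
    return [c / float(n_boot) for c in counts]
```

A tree with a high p-value cannot be statistically distinguished from the best tree, which is the sense of "likelihood equivalence" used above.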
By default, TreeFix uses maximum parsimony reconciliation (MPR) and computes the duplication-loss cost.
This is included in the main TreeFix package.
Modules based on other reconciliation models may be added in the future.
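As a rough illustration of a duplication-loss cost under maximum parsimony reconciliation, the sketch below maps each gene tree node to the least common ancestor (LCA) of its children's species and counts duplications and losses from that mapping. It assumes binary trees encoded as nested tuples with string leaves, and a plain dictionary as the species map; these encodings and all function names are hypothetical and are not taken from the TreeFix source, whose reconciliation module may differ in detail.

```python
def index_species_tree(stree):
    """Record the parent and depth of every node in a nested-tuple species tree."""
    parent, depth = {}, {}
    stack = [(stree, None, 0)]
    while stack:
        node, par, d = stack.pop()
        parent[node] = par
        depth[node] = d
        if isinstance(node, tuple):
            for child in node:
                stack.append((child, node, d + 1))
    return parent, depth


def lca(a, b, parent, depth):
    """Least common ancestor of two species tree nodes."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def dup_loss_cost(gtree, stree, gene2species, dup_cost=1.0, loss_cost=1.0):
    """Return (duplications, losses, weighted cost) under LCA reconciliation.

    Assumes binary gene and species trees (illustrative sketch only).
    """
    parent, depth = index_species_tree(stree)
    counts = {"dup": 0, "loss": 0}

    def reconcile(node):
        if not isinstance(node, tuple):
            return gene2species[node]  # leaf: look up its species
        left, right = node
        child_maps = [reconcile(left), reconcile(right)]
        here = lca(child_maps[0], child_maps[1], parent, depth)
        # A node is a duplication if it maps to the same species as a child.
        is_dup = here in child_maps
        if is_dup:
            counts["dup"] += 1
        # Species tree nodes skipped along each gene tree branch imply losses.
        for cm in child_maps:
            gap = depth[cm] - depth[here]
            counts["loss"] += gap if is_dup else gap - 1
        return here

    reconcile(gtree)
    cost = dup_cost * counts["dup"] + loss_cost * counts["loss"]
    return counts["dup"], counts["loss"], cost
```

For example, with species tree `(("A", "B"), "C")` and gene tree `(("a", "c"), "b")`, where each gene maps to the species of the same letter, the sketch reports one duplication and three losses.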
A fairly thorough tutorial, with detailed installation instructions, descriptions of command-line options,
and step-by-step instructions for using TreeFix, is available.
In our paper, we evaluated TreeFix on two clades of species, the 12 Drosophila and the 16 fungi, using the same datasets that were used to evaluate SPIMAP. These included 5351 real gene families across the 16 fungal genomes, as well as 1000 simulated gene families (generated under the SPIMAP model) for each clade. We also evaluated TreeFix on simulated gene families with simulated species trees generated using a range of speciation rates and tree sizes. Note that TreeFix uses many of the same conventions as SPIMAP, so please refer to the SPIMAP website for more detail on any of these files. Additional simulated datasets are available upon request.
Real fungi dataset: real-fungi.tar.gz
5351 gene families downloaded from the January 2009 release of SYNERGY.
Simulated fungi dataset: sim-fungi.tar.gz
Reconstructions: TreeFix (long),
Relation files: real-fungi-rel.tar.gz
Simulated fly dataset: sim-flies.tar.gz
Each tarball contains 1000 simulated gene families under various duplication and loss rates.
TreeFix was evaluated using a duplication and loss rate equal to that seen in real data (e.g. DUPRATE=1, LOSSRATE=1).
Species trees: fungi.stree,
Species maps: fungi.smap,
fungi2.smap (for real dataset),
Species abbreviations: fungi.names.txt,
TreeFix requires a species tree and species map. We use the species trees estimated by Butler2009 (fungi) and Tamura2004 (flies).
Additionally, we provide the species map that specifies which genes belong to which species,
and the species name abbreviations used in *.stree and *.smap.
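The sketch below shows one way a species map could be applied to gene names. It assumes a tab-separated, two-column layout (gene-name pattern, then species abbreviation) in which the pattern may contain a `*` wildcard; this layout is an assumption here, so check the SPIMAP documentation for the actual *.smap format. The function names and the example gene names are hypothetical.

```python
import fnmatch


def parse_smap(lines):
    """Parse lines of an assumed 'pattern<TAB>species' species map."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        pattern, species = line.split("\t")
        rules.append((pattern, species))
    return rules


def gene_to_species(gene, rules):
    """Return the species of the first rule whose pattern matches the gene name."""
    for pattern, species in rules:
        if fnmatch.fnmatchcase(gene, pattern):
            return species
    raise KeyError("no species rule matches gene %r" % gene)
```

With rules parsed from the two lines `scer_*<TAB>scer` and `calb_*<TAB>calb`, a gene named `scer_123` would be assigned to `scer`.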
Simulated species trees and species map: sim-stree.tar.gz
Dataset for simulated species trees: sim-stree.tar.gz
For the configuration files, the data is stored in the following structure: sim/TREESIZE-SPECRATE/STREE.stree
For the gene families, the data is stored in the following structure: sim-stree/TREESIZE-SPECRATE/STREE/FAMID/FAMID.EXT
- TREESIZE: number of extant species (5, 10, 20, 50, 100)
- SPECRATE: speciation rate in events/myr (0.05, 0.1, 0.2, 0.5, 1)
- STREE: species tree number (0-9)
- FAMID: the gene family ID (0-99)
- EXT: filename extension, which can be one of the following:
  - align: DNA alignment in FASTA format
  - fasta: DNA sequence data in FASTA format
  - info: extra information about the simulation process
  - times.tree: simulated gene tree with branch lengths in units of time
  - tree: simulated gene tree with branch lengths in substitutions per site
Corresponding relation files are stored in the following structure: sim-stree/TREESIZE-SPECRATE/STREE.rel.txt
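The directory layout described above can be turned into paths with a small helper such as the following sketch. The function names are hypothetical, and the exact spelling of SPECRATE in directory names (for example `0.05` versus another formatting) is an assumption, so it is passed in as a string; check the unpacked tarball before relying on it.

```python
# Values taken from the parameter list above; SPEC_RATES are kept as strings
# because their exact spelling in directory names is an assumption.
TREE_SIZES = (5, 10, 20, 50, 100)
SPEC_RATES = ("0.05", "0.1", "0.2", "0.5", "1")
EXTENSIONS = ("align", "fasta", "info", "times.tree", "tree")


def family_file(root, treesize, specrate, stree, famid, ext):
    """Build ROOT/TREESIZE-SPECRATE/STREE/FAMID/FAMID.EXT."""
    if ext not in EXTENSIONS:
        raise ValueError("unknown extension: %r" % ext)
    condition = "%s-%s" % (treesize, specrate)
    return "/".join([root, condition, str(stree), str(famid),
                     "%s.%s" % (famid, ext)])


def relation_file(root, treesize, specrate, stree):
    """Build ROOT/TREESIZE-SPECRATE/STREE.rel.txt."""
    condition = "%s-%s" % (treesize, specrate)
    return "/".join([root, condition, "%s.rel.txt" % stree])
```

For instance, `family_file("sim-stree", 10, "0.1", 3, 42, "align")` gives `sim-stree/10-0.1/3/42/42.align`.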
- (Butler2009) Butler, G.; Rasmussen, M. D.; Lin, M. F.; Santos, M. A. S.; Sakthikumar, S.; Munro, C. A.; Rheinbay, E.; Grabherr, M.; Forche, A.; Reedy, J. L.; Agrafioti, I.; Arnaud, M. B.; Bates, S.; Brown, A. J. P.; Brunke, S.; Costanzo, M. C.; Fitzpatrick, D. A.; de Groot, P. W. J.; Harris, D.; Hoyer, L. L.; Hube, B.; Klis, F. M.; Kodira, C.; Lennard, N.; Logue, M. E.; Martin, R.; Neiman, A. M.; Nikolaou, E.; Quail, M. A.; Quinn, J.; Santos, M. C.; Schmitzberger, F. F.; Sherlock, G.; Shah, P.; Silverstein, K. A. T.; Skrzypek, M. S.; Soll, D.; Staggs, R.; Stansfield, I.; Stumpf, M. P. H.; Sudbery, P. E.; Srikantha, T.; Zeng, Q.; Berman, J.; Berriman, M.; Heitman, J.; Gow, N. A. R.; Lorenz, M. C.; Birren, B. W.; Kellis, M. & Cuomo, C. A. Evolution of pathogenicity and sexual reproduction in eight Candida genomes. Nature, 2009, 459, 657-662.
- (Tamura2004) Tamura, K.; Subramanian, S. & Kumar, S. Temporal patterns of fruit fly (Drosophila) evolution revealed by mutation clocks. Mol Biol Evol, 2004, 21, 36-44.
Last updated 06/19/14.