Accurate Gene Tree Reconstruction Using TreeFix and TreeFix-DTL: A Tutorial
Taught by Mukul S. Bansal during the Workshop on New Methods for Phylogenomics and Metagenomics on February 17, 2013.
TreeFix and TreeFix-DTL
are programs for reconstructing very accurate gene trees. TreeFix is
designed for reconstructing eukaryotic gene trees (where horizontal
gene transfer is assumed to be negligible) and TreeFix-DTL for
prokaryotic gene trees. Both programs take as input a multiple sequence
alignment for the gene family, a maximum likelihood gene tree (which
can be constructed, for example, using RAxML or PhyML), and a known
rooted species tree topology for that gene family. The idea is to use
the species tree topology to guide the reconstruction of the gene tree
and to balance sequence and species tree information through a
statistical hypothesis testing framework. TreeFix assumes that
discordance between the gene tree and species tree topologies is due to
gene duplication and gene loss, while TreeFix-DTL assumes that the
discordance is due to gene duplication, horizontal gene transfer, and
gene loss. These programs are currently the best performing programs
for gene tree reconstruction, outperforming even the most sophisticated
species tree aware Bayesian methods. An additional advantage of TreeFix
and TreeFix-DTL is that they do not require species divergence times or
any other parameters such as rates of gene duplication or gene loss.
Moreover, they are scalable to gene trees with hundreds of leaves.
The goal of this
tutorial is to instruct participants on how to reconstruct highly
accurate gene trees using TreeFix and TreeFix-DTL.
By attending this tutorial, participants will be able to: (1)
appreciate the importance of reconstructing gene trees accurately, (2)
understand why reconstructing gene trees accurately can be a
challenging problem, (3) understand the ideas and principles underlying
TreeFix and TreeFix-DTL, and (4) confidently use both programs on their
own datasets.
Software requirements
TreeFix and TreeFix-DTL can be easily installed on Windows, Mac OS, or Linux. The basic requirements are as follows:
Given these
requirements, installation is easiest on Linux. To install on Windows,
users must install and use the Cygwin environment. Installation on a Mac
is straightforward once the basic requirements are met, though
installing SWIG on Mac OS can be a slight hassle.
In addition, the
speed and accuracy of TreeFix and TreeFix-DTL may be slightly improved
if the following optional Python packages are installed:
- Numpy (1.5.1 or greater):
If Numpy is not found, TreeFix and TreeFix-DTL use Python's built-in 'random' module.
- Scipy (0.7.1 or greater):
If Scipy is not found, TreeFix and TreeFix-DTL use internal libraries
to approximate the normal distribution (so p-values may be slightly
Installation instructions
You can download TreeFix from and TreeFix-DTL from
Both packages contain a file called INSTALL.txt with detailed
instructions on how to install the software. Participants may choose to
carefully read this file and proceed with the installation on their
own. Alternatively, participants can follow the simple step-by-step
installation instructions given below. Tutorial participants should
install at least one of TreeFix or TreeFix-DTL on their computers, and
are encouraged to install both. If installing both software packages,
we recommend that TreeFix be installed first.
Detailed step-by-step installation instructions now follow:
Step-by-step installation instructions for TreeFix:
- Create a new directory called TreeFix in your home directory.
- Download TreeFix from and copy it to the newly created directory.
- Extract TreeFix from the tarball and enter the extracted folder.
tar -xvzf treefix-1.1.7.tar.gz
cd treefix-1.1.7
- Run the installation scripts.
python build
python install
If both the above steps were successfully executed, then TreeFix is now
installed and ready to be used and you may proceed to the installation
instructions for TreeFix-DTL below.
- If users do not have system permissions to install in the default
location then the install step above will fail. If this happens then
the --prefix flag can be used to specify the directory where TreeFix
should be installed. Thus, if the build step above succeeded but the
install step failed, then please execute the following command:
python install --prefix=~/TreeFix/sw
Finally, if you used the --prefix option above, then to
ensure that the operating system can find the newly installed scripts
and executables, add the installation directories to the PATH and
PYTHONPATH variables as follows:
export PATH=$PATH:~/TreeFix/sw/bin
export PYTHONPATH=$PYTHONPATH:~/TreeFix/sw/lib/python2.6/site-packages/
We recommend adding the two lines above to the .bashrc,
.bash_profile, or another similar file. Otherwise, you will need to
execute the two lines above each time you start a new command line
session to use TreeFix. Also note that "python2.6" in the PYTHONPATH
may change depending on the Python
version installed.
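If these lines are placed in a startup file that is sourced more than once, the same directory can end up listed repeatedly. The following is a minimal sketch of one way to avoid that, not part of the TreeFix distribution: append_path is a hypothetical helper, and SW_PREFIX and PY_DIR are assumed names that must be adjusted to the prefix and Python version actually used.

```shell
# Hypothetical settings; adjust to the --prefix and Python version you used.
SW_PREFIX="$HOME/TreeFix/sw"
PY_DIR="python2.6"

# Print the colon-separated list $1 with directory $2 appended,
# unless $2 is already one of its entries.
append_path() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;      # already present: unchanged
        "::")     printf '%s\n' "$2" ;;      # list was empty: no stray colon
        *)        printf '%s\n' "$1:$2" ;;   # otherwise append
    esac
}

PATH="$(append_path "$PATH" "$SW_PREFIX/bin")"
PYTHONPATH="$(append_path "${PYTHONPATH:-}" "$SW_PREFIX/lib/$PY_DIR/site-packages")"
export PATH PYTHONPATH
```

The empty-list case matters because a leading colon in PATH is interpreted as the current directory.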
Step-by-step installation instructions for TreeFix-DTL:
- Create a new directory called TreeFix-DTL in your home directory.
- Download TreeFix-DTL from and copy it to the newly created directory.
- Extract TreeFix-DTL from the tarball and enter the extracted folder.
tar -xvzf treefixDTL-1.0.1.tar.gz
cd treefixDTL-1.0.1
- Run the installation scripts.
python build
python install
NOTE: If you installed TreeFix using the --prefix option,
then TreeFix-DTL must be installed to the same directory where TreeFix
was installed. This can be done as follows:
python build
python install --prefix=~/TreeFix/sw
- If you did not install TreeFix and are unable to install
TreeFix-DTL in its default location, please follow the instructions
given in step 5 of the installation instructions for TreeFix (taking
care to replace "TreeFix" with "TreeFix-DTL" in the commands).
Please email Mukul Bansal if you are unable to successfully install TreeFix or TreeFix-DTL.
Datasets for testing
You may check whether TreeFix and TreeFix-DTL installed correctly by invoking the TreeFix and TreeFix-DTL executables with the -h option as follows:
treefix -h
treefixDTL -h
The commands above
will prompt TreeFix and TreeFix-DTL to display their respective help
messages with details on how to use the programs.
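If a help message does not appear, a common cause is that the executables are not on the PATH. The sketch below is one way to check this; check_tool is a hypothetical helper, and the executable names treefix and treefixDTL are taken from the commands used in this tutorial.

```shell
# Report whether a command can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

# Check both executables; "|| true" keeps going if one is missing.
for tool in treefix treefixDTL; do
    check_tool "$tool" || true
done
```

If a tool is reported as missing after an installation with --prefix, the PATH and PYTHONPATH settings described in the installation instructions are the first thing to revisit.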
Also, TreeFix and
TreeFix-DTL each include a small test dataset that you can use to learn
how to use these programs. These are available in the following
directories:
cd ~/TreeFix/treefix-1.1.7/examples/
cd ~/TreeFix-DTL/treefixDTL-1.0.1/examples/
Details on how to
use TreeFix and TreeFix-DTL are available in the file called in
those directories. Next, we provide step-by-step instructions on how to
execute TreeFix and TreeFix-DTL on the test datasets.
Analyzing the test dataset using TreeFix:
To analyze the test dataset using TreeFix, execute the following commands:
cd ~/TreeFix/treefix-1.1.7/examples/
treefix -s config/fungi.stree -S config/fungi.smap -A .nt.align -o .nt.raxml.tree -n .nt.raxml.treefix.tree -V 1 -l sim-fungi/0/0.nt.raxml.treefix.log sim-fungi/0/0.nt.raxml.tree
TreeFix should
require less than a minute to execute on the dataset above. The
reconstructed gene tree will be available in the folder
~/TreeFix/treefix-1.1.7/examples/sim-fungi/0/ as the file
0.nt.raxml.treefix.tree (the input prefix 0 followed by the extension given with -n).
Analyzing the test dataset using TreeFix-DTL:
To analyze the test
dataset using TreeFix-DTL, execute the following commands (but also see
the additional instructions below if you are using Cygwin):
cd ~/TreeFix-DTL/treefixDTL-1.0.1/examples/
treefixDTL -s config/S1.stree -S config/S.smap -A .pep.align -o .pep.raxml.boot.tree -n .pep.raxml.treefixDTL.tree -V 1 -e "-m PROTGAMMAJTT" -l sim/G1/G1.pep.raxml.treefixDTL.log sim/G1/G1.pep.raxml.boot.tree
TreeFix-DTL should
require about three hours to execute on the dataset above. The
reconstructed gene tree will be available in the folder
~/TreeFix-DTL/treefixDTL-1.0.1/examples/sim/G1/ as the file
G1.pep.raxml.treefixDTL.tree (the input prefix G1 followed by the extension given with -n).
If executing
TreeFix-DTL on Cygwin, an additional temporary working directory must
be created and TreeFix-DTL must be informed of the location of this
working directory using an additional command line parameter. Thus, if
using Cygwin, a revised set of commands for executing TreeFix-DTL on
the test dataset is as follows:
cd ~/TreeFix-DTL/treefixDTL-1.0.1/examples/
mkdir tmp
treefixDTL -s config/S1.stree -S config/S.smap -A .pep.align -o .pep.raxml.boot.tree -n .pep.raxml.treefixDTL.tree -V 1 -e "-m PROTGAMMAJTT" -l sim/G1/G1.pep.raxml.treefixDTL.log -E "--tmp ./tmp" sim/G1/G1.pep.raxml.boot.tree
Further details on the command line options used in the commands above are given below.
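For scripts that must run both on Cygwin and elsewhere, the Cygwin case can be detected before the command is assembled. This is a sketch under the assumption that `uname -s` reports a name beginning with CYGWIN on such systems; is_cygwin is a hypothetical helper, and other Windows environments are not covered.

```shell
# Return success if the given `uname -s` string identifies a Cygwin system.
is_cygwin() {
    case "$1" in
        CYGWIN*) return 0 ;;
        *)       return 1 ;;
    esac
}

if is_cygwin "$(uname -s)"; then
    # Create the temporary working directory that TreeFix-DTL needs on Cygwin.
    mkdir -p tmp
    echo 'Cygwin detected: append -E "--tmp ./tmp" to the treefixDTL command'
else
    echo 'Not Cygwin: the plain treefixDTL command can be used'
fi
```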
Explanation of Command Line Options
A complete list of
available command line options for TreeFix and for TreeFix-DTL can be
obtained by using the -h option (as shown above), and we recommend that
users read through these help messages to familiarize themselves with
the kinds of options available. Here, we describe in detail the most
important and fundamental command line options. Each of the options
described below is applicable to both TreeFix and TreeFix-DTL.
-s <species tree>, --stree=<species tree>
specifies the location of the species tree file (in newick format)
-S <species map>, --smap=<species map>
specifies the location of the file mapping gene names to species names
-A <alignment file extension>, --alignext=<alignment file extension>
alignment file extension (default: ".align")
-o <old tree file extension>, --oldext=<old tree file extension>
old tree file extension (default: ".tree")
-n <new tree file extension>, --newext=<new tree file extension>
file extension for the file where the reconstructed gene tree will be written (default: ".treefix.tree")
-l <log file>, --log=<log file>
log filename. Use '-' to display on stdout.
-V <verbosity level>, --verbose=<verbosity level>
verbosity level of the log file (0=quiet, 1=low, 2=medium, 3=high)
The default value is 0 (i.e., no log file will be created), but we recommend setting this value to 1.
-e <extra arguments to module>, --extra=<extra arguments to module>
extra arguments to pass to the program that computes likelihoods (the default implementations of TreeFix and TreeFix-DTL use RAxML)
The primary use of this option will be to pass along the RAxML likelihood model to be used by TreeFix or TreeFix-DTL. Further details appear below.
--niter=<# iterations>
number of search iterations to be performed (default: 100 for TreeFix and 1000 for TreeFix-DTL)
Further details on the proper use of this option appear below.
As illustrated above
in the TreeFix and TreeFix-DTL command blocks for executing the test
datasets, each command block begins with the name of the program
(treefix or treefixDTL), followed by required and optional command line
options, and ends with the specification of the file containing the
maximum likelihood gene tree (constructed previously using, for
example, RAxML or PhyML).
File Naming conventions:
While the species tree
file (specified using the -s option) and the name mapping file
(specified using the -S option) may have arbitrary names, both TreeFix
and TreeFix-DTL expect the alignment file and the maximum likelihood
gene tree file to be in the same directory and to share a common
prefix. This prefix is indirectly specified using the -o option. For
instance, if the name of the maximum likelihood tree file is and if the command block says "-o",
then the prefix is inferred to be G1. Additionally, suppose the -A and
-n options are invoked as follows: "-A .align -n .treefix.tree", then
the programs will assume that the alignment file is called G1.align and
that the final reconstructed gene tree should be written to the file
G1.treefix.tree.
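The naming convention amounts to replacing one file extension with another. The sketch below reproduces that substitution in the shell so the expected file names can be checked before a run; swap_ext is a hypothetical helper, not part of either program, and the example values are those of the TreeFix-DTL test dataset.

```shell
# Print file name $1 with its trailing extension $2 replaced by $3.
# Fails if $1 does not end in $2, which would mean the -o value is wrong.
swap_ext() {
    prefix="${1%"$2"}"
    if [ "$prefix" = "$1" ]; then
        echo "error: $1 does not end in $2" >&2
        return 1
    fi
    printf '%s\n' "$prefix$3"
}

tree="sim/G1/G1.pep.raxml.boot.tree"
old_ext=".pep.raxml.boot.tree"     # value passed with -o

echo "alignment expected at: $(swap_ext "$tree" "$old_ext" .pep.align)"
echo "output will be written: $(swap_ext "$tree" "$old_ext" .pep.raxml.treefixDTL.tree)"
```

Checking that the alignment path printed here actually exists is a quick way to catch a mismatched -o or -A value.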
Increasing the number of search iterations:
As explained above, the
command line option --niter can be used to specify the number of search
iterations performed by TreeFix and TreeFix-DTL. In general, more search
iterations yield a more accurate reconstructed gene tree, at the cost of
a longer running time. For
TreeFix-DTL, the default number of iterations is set to 1000. This
should work well for a range of gene tree sizes and nicely balances
accuracy and running time. For TreeFix, however, the default number of
iterations is only 100, which is appropriate for small gene trees (say
with no more than a couple dozen leaves), but may be too small to
effectively reconstruct larger gene trees. Thus, when reconstructing
larger gene trees using TreeFix, we recommend increasing the number of
iterations to 1000. In general, we strongly recommend that the --niter
option be used only to increase the number of search iterations
compared to the default value; reducing the number of iterations will
negatively impact the accuracy of these programs.
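The recommendation for TreeFix can be turned into a small rule of thumb. The sketch below counts the leaves of a Newick tree and suggests a --niter value; it assumes a single tree per file and leaf labels without commas, and the threshold of 24 leaves is only one reading of "a couple dozen". Both function names are hypothetical.

```shell
# Count the leaves of a Newick tree read from standard input.
# A tree with n leaves contains n - 1 commas.
count_leaves() {
    echo $(( $(tr -cd ',' | wc -c) + 1 ))
}

# Suggest a TreeFix --niter value: $1 = number of leaves, $2 = threshold.
choose_niter() {
    if [ "$1" -gt "$2" ]; then
        echo 1000
    else
        echo 100
    fi
}

leaves=$(printf '%s\n' '((a,b),(c,d));' | count_leaves)
echo "leaves=$leaves suggested --niter=$(choose_niter "$leaves" 24)"
```

For a real dataset, the tree would be supplied with input redirection, for example `count_leaves < sim-fungi/0/0.nt.raxml.tree`.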
The -e option:
By default, the
likelihood module used by TreeFix and TreeFix-DTL assumes a GTRGAMMA
model of sequence evolution. To change this, add the following to the
treefix or treefixDTL command: -e '-m <model>'
Note that the
specified model must be supported by RAxML. The TreeFix-DTL command
block for executing the test datasets, shown above, illustrates the use
of this option.
Changing the parameters of the reconciliation model (if necessary):
Both TreeFix and
TreeFix-DTL allow users to change the parameters used for performing
the reconciliation step. In general, however, we recommend that users
make use of the default parameters since these have been tested to work
well for a variety of scenarios. If needed, these parameters can be
changed as follows.
TreeFix: By default,
the reconciliation cost module used by TreeFix assumes equal costs
(D=1, L=1) for inferred (duplication-loss) events. To change this, add
the following to the treefix command:
-E '-D <dup cost> -L <loss cost>'
TreeFix-DTL: By
default, the reconciliation cost module used by TreeFix-DTL uses costs
D=2, T=3, and L=1 for the reconciliation. To change this, add the
following to the treefixDTL command:
-E '-D <dup cost> -T <trans cost> -L <loss cost>'
Note that the costs must be non-negative. Also be sure to keep the quotes around the cost arguments, so that they reach the program as a single value of the -E option.
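The effect of the quotes can be seen with a helper that only counts the arguments it receives. This is an illustration of shell quoting in general, using the default TreeFix-DTL costs as example values; count_args is a hypothetical helper.

```shell
# Print the number of arguments received.
count_args() {
    echo "$#"
}

# Quoted: -E and its value arrive as two arguments.
count_args -E '-D 2 -T 3 -L 1'

# Unquoted: the costs are split into separate arguments.
count_args -E -D 2 -T 3 -L 1
```

The first call prints 2 and the second prints 7, which is why the unquoted form does not pass the costs to the reconciliation module as intended.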
Using TreeFix and TreeFix-DTL in Practice
Reconstructing highly accurate gene trees using TreeFix or TreeFix-DTL in practice entails the following simple steps:
- Obtain (or compute) a multiple sequence alignment for the gene family of interest.
- Obtain (or compute) a rooted species tree for the species in the gene family.
- Construct a maximum likelihood gene tree for the gene family
(using your favorite maximum likelihood phylogeny program, e.g., RAxML
or PhyML).
- Arbitrarily root the maximum likelihood gene tree. (TreeFix and TreeFix-DTL require as input a rooted maximum likelihood gene tree. The actual position of the root is unimportant.)
- Create a file that maps the gene tree leaf labels to species
tree leaf labels. Examples of the format of such a file are available
as part of the test datasets discussed above.
- Depending on whether the gene family is eukaryotic or
prokaryotic, execute either TreeFix or TreeFix-DTL on the rooted
maximum likelihood gene tree, using the appropriate command line
options as described above.
- Upon termination, TreeFix and TreeFix-DTL will write the
topology of the reconstructed gene tree to the specified file. Note
that this reconstructed gene tree will not have any branch lengths
specified; if needed, branch lengths can be easily computed for the
reconstructed gene tree using software such as RAxML.
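When many gene families share one species tree and one mapping file, the final steps above can be scripted. The following sketch only prints the treefix commands it would run, skipping trees without a matching alignment, so the commands can be inspected first. print_treefix_commands is a hypothetical helper, the log file name is an arbitrary choice, and the options are those described in this tutorial.

```shell
# Print one treefix command per gene tree found in a directory.
# $1 = directory, $2 = species tree, $3 = species map,
# $4 = old tree extension, $5 = alignment extension, $6 = new tree extension
print_treefix_commands() {
    for tree in "$1"/*"$4"; do
        [ -e "$tree" ] || continue            # the pattern matched nothing
        prefix="${tree%"$4"}"
        if [ ! -e "$prefix$5" ]; then
            echo "# skipping $tree: missing alignment $prefix$5" >&2
            continue
        fi
        echo "treefix -s $2 -S $3 -A $5 -o $4 -n $6 -V 1 -l $prefix.treefix.log $tree"
    done
}

# Example call using the names of the TreeFix test dataset; it prints
# nothing when run outside the examples directory.
print_treefix_commands sim-fungi/0 config/fungi.stree config/fungi.smap \
    .nt.raxml.tree .nt.align .nt.raxml.treefix.tree
```

For prokaryotic gene families, treefixDTL would take the place of treefix in the printed command.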
The following paper
describes the computational and statistical framework used by TreeFix
and TreeFix-DTL, and demonstrates the performance and accuracy of
The following paper describes the Duplication-Transfer-Loss reconciliation model used by TreeFix-DTL.
The paper describing TreeFix-DTL in detail and evaluating its performance is currently under review.
- Reliably Reconstructing Highly Accurate Prokaryotic Gene Trees and its Impact on Deciphering Microbial Evolution
Mukul S. Bansal, Yi-Chieh Wu, Eric J. Alm, and Manolis Kellis.
Under review.
Last updated on January 21, 2012