eMPRess

What is eMPRess

eMPRess is a software tool for reconciling pairs of phylogenetic trees, such as host-parasite, host-symbiont, and species-gene trees, under the Duplication-Transfer-Loss (DTL) model.  The eMPRess tool was developed at Harvey Mudd College and is the successor to our Jane reconciliation tool.  eMPRess has many features that are based on new and efficient algorithms.  Read more about those features below.

eMPRess takes two undated binary phylogenetic trees (e.g., host and parasite, host and symbiont, species and gene) and an association of their tips as input.  eMPRess addresses several important issues that are generally not supported in other existing tools.  Among them are:


eMPRess video - tutorial and use case

This 20-minute video tutorial provides a brief primer on the reconciliation problem and demonstrates the eMPRess workflow and functionality.

A touch of theory (recommended before getting started)

Maximum Parsimony Reconciliation

eMPRess, Jane, and most other reconciliation tools use a maximum parsimony approach for finding a "best" mapping of the parasite/symbiont tree onto the host tree.  In this formulation, each type of event (duplication, transfer, and loss) has a non-negative cost specified by the user.  The objective is to find a reconciliation that minimizes the total cost, that is, the sum over its constituent events of their event costs.  Cospeciation is considered a "null event" and therefore has its cost preset to zero.
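As a minimal sketch of this objective (not eMPRess's own code), the total cost of a reconciliation can be computed from its event counts.  The EventCounts class and the two candidate reconciliations below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventCounts:
    """Number of events of each type in one reconciliation (hypothetical helper)."""
    cospeciations: int
    duplications: int
    transfers: int
    losses: int


def total_cost(counts, dup_cost, transfer_cost, loss_cost):
    """Weighted sum of events; cospeciation is a null event with cost zero."""
    if min(dup_cost, transfer_cost, loss_cost) < 0:
        raise ValueError("event costs must be non-negative")
    return (dup_cost * counts.duplications
            + transfer_cost * counts.transfers
            + loss_cost * counts.losses)


# Two made-up candidate reconciliations of the same pair of trees.
a = EventCounts(cospeciations=5, duplications=1, transfers=2, losses=3)
b = EventCounts(cospeciations=3, duplications=2, transfers=4, losses=0)

# Under costs duplication=1, transfer=2, loss=1:
print(total_cost(a, 1, 2, 1))  # 1*1 + 2*2 + 1*3 = 8
print(total_cost(b, 1, 2, 1))  # 1*2 + 2*4 + 1*0 = 10
```

Under these costs, candidate a is the more parsimonious of the two.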

Event Costs

Event costs are notoriously difficult to estimate.  Many tools have default event costs (e.g., Jane's defaults are 1, 2, and 1 for duplication, transfer, and loss, respectively) and studies are often performed using just the default values.  However, different event costs can lead to different solutions and thus different conclusions.  For example, if one event has a much lower cost than others, a maximum parsimony reconciliation is likely to favor solutions with more of those kinds of events.  

eMPRess's "View cost space" feature uses a technique called Pareto-optimal event counts to show you the impact of different event costs and to allow you to select event costs in a principled and systematic way.  Specifically, note that event costs are just relative amounts; there is no intrinsic meaning to a unit of cost, so choosing duplication, transfer, and loss costs of 2, 3, and 1, respectively, is the same as choosing costs of 200, 300, and 100; the ratios of the costs are the same in both cases.  The "View cost space" feature in eMPRess fixes the cost of a loss at 1.0 and then examines the range of costs of duplication and transfer events relative to this cost of 1 for losses.  (Recall that cospeciation is a null event and thus has a fixed cost of zero.)
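The scale invariance described above can be checked with a short sketch.  The candidate event counts are made up, and the helper names (total_cost, cheapest, normalize) are illustrative assumptions rather than part of eMPRess.

```python
def total_cost(counts, costs):
    """counts and costs are (duplication, transfer, loss) triples."""
    return sum(n * c for n, c in zip(counts, costs))


def cheapest(candidates, costs):
    """Names of the candidates with minimum total cost under the given costs."""
    best = min(total_cost(counts, costs) for counts in candidates.values())
    return sorted(name for name, counts in candidates.items()
                  if total_cost(counts, costs) == best)


def normalize(costs):
    """Rescale (duplication, transfer, loss) costs so that a loss costs 1.0."""
    dup, transfer, loss = costs
    if loss <= 0:
        raise ValueError("loss cost must be positive to normalize")
    return (dup / loss, transfer / loss, 1.0)


# Made-up (duplication, transfer, loss) event counts for three reconciliations.
candidates = {"A": (1, 2, 3), "B": (2, 4, 0), "C": (0, 1, 7)}

print(cheapest(candidates, (2, 3, 1)))        # ['C']
print(cheapest(candidates, (200, 300, 100)))  # ['C']
print(normalize((200, 300, 100)))             # (2.0, 3.0, 1.0)
```

Multiplying every cost by the same positive factor multiplies every total by that factor, so the ranking of candidates does not change.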

The plot that is displayed by "View cost space" divides up the duplication and transfer cost space into color-coded regions.  For any combination of costs in the same region, we will get the same set of MPRs.  In other words, in a given color-coded region, it suffices to choose just one point, that is, one combination of costs.  For example, see this figure, which shows the event cost regions for the gopher-louse dataset.
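To give a feel for how such regions arise, here is a rough sampling-based sketch.  It is not how eMPRess computes the regions: it assumes a made-up list of Pareto-optimal event-count vectors, fixes the loss cost at 1.0, and identifies a region by which event counts are cheapest there, which is a simplification of "the same set of MPRs."

```python
# Made-up Pareto-optimal (duplications, transfers, losses) event counts.
event_counts = [(0, 5, 0), (1, 3, 2), (4, 0, 9)]


def optimal_counts(dup_cost, transfer_cost, loss_cost=1.0):
    """Event-count vectors of minimum total cost at one point in cost space."""
    def cost(v):
        return dup_cost * v[0] + transfer_cost * v[1] + loss_cost * v[2]
    best = min(cost(v) for v in event_counts)
    return frozenset(v for v in event_counts if abs(cost(v) - best) < 1e-9)


def sample_regions(dup_costs, transfer_costs):
    """Group sampled (duplication, transfer) cost points by their optimal set."""
    regions = {}
    for d in dup_costs:
        for t in transfer_costs:
            regions.setdefault(optimal_counts(d, t), []).append((d, t))
    return regions


grid = [0.5, 1.0, 2.0, 4.0]
for optimum, points in sample_regions(grid, grid).items():
    print(sorted(optimum), "is optimal at", len(points), "sampled points")
```

Every sampled point that lands in the same group would be colored the same way in a plot of this kind, so one representative point per group is enough.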

Dated versus Undated Trees

eMPRess, Jane, and many other tools assume that the trees are undated.  That is, while branch lengths may be provided in the newick input files, they are not used in the reconciliation process.  Branch lengths - if given - are not assumed to correspond to actual dates when speciation events occurred in the host and parasite/symbiont trees.  

Time-Consistency

While a parent node in a tree clearly occurred before its children, the order of the two children is assumed not to be known.  In general, the order of nodes that are not ancestrally related to one another is not known.  Consider a reconciliation of a parasite/symbiont tree onto a host tree and consider any particular parasite/symbiont species node p.  That node p is mapped by the reconciliation to some host node h (or, perhaps, to the edge terminating at h).  Clearly, no descendant of p should be mapped by the reconciliation to an ancestor of h.  Any reconciliation that satisfies this condition is said to be weakly time-consistent.
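The condition above can be written as a direct check.  This is a minimal sketch, assuming each tree is given as a child-to-parent dictionary and the reconciliation as a dictionary from every parasite node to a host node; the node names are hypothetical and this is not eMPRess's own implementation.

```python
def ancestors(node, parent):
    """Strict ancestors of node, given a child -> parent dictionary."""
    result = set()
    while node in parent:
        node = parent[node]
        result.add(node)
    return result


def is_weakly_time_consistent(parasite_parent, host_parent, mapping):
    """True if no descendant of p is mapped to a strict ancestor of mapping[p]."""
    for p_desc in mapping:
        for p in ancestors(p_desc, parasite_parent):
            if mapping[p_desc] in ancestors(mapping[p], host_parent):
                return False
    return True


# Hypothetical host tree: root h0 with children h1 and h2; h1 has tips h3 and h4.
host_parent = {"h1": "h0", "h2": "h0", "h3": "h1", "h4": "h1"}
# Hypothetical parasite tree: root p0 with tips p1 and p2.
parasite_parent = {"p1": "p0", "p2": "p0"}

ok = {"p0": "h1", "p1": "h3", "p2": "h4"}
bad = {"p0": "h1", "p1": "h3", "p2": "h0"}  # p2 sits above its parent's host

print(is_weakly_time_consistent(parasite_parent, host_parent, ok))   # True
print(is_weakly_time_consistent(parasite_parent, host_parent, bad))  # False
```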

Why "weak"?  There is another constraint that we also wish to satisfy and this one has to do with transfer (aka host switch) events and the fact that the trees are undated.  When a transfer event occurs involving a parasite node p, one of its children, say p', is transferred to a branch in the host tree that is not ancestrally related (that is, neither an ancestor nor a descendant) to the host on which p is mapped.  We say that p "takes off" from the host branch on which it resides and that p' "lands" on a branch somewhere else on the tree.  The place where p' lands is called the "landing site."

Because the tree is not dated, we don't know if the landing site is contemporaneous with the take-off site.  In theory, the take-off and landing sites should be contemporaneous, but there's no way to know for sure.  We say that a reconciliation is strongly time-consistent if it is not only weakly time-consistent but there also exists some ordering of the internal nodes of the host tree that guarantees that, for every transfer event, the take-off and landing sites are contemporaneous.
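One way to test whether such an ordering exists is to collect "must come earlier" constraints and look for a cycle.  The sketch below is a simplified illustration under stated assumptions, not eMPRess's algorithm: the host tree is a child-to-parent dictionary, each transfer is given as the pair of host nodes at the lower ends of the take-off and landing branches, and any further constraints a full implementation may impose (for example, from the order of events in the parasite tree) are ignored.  The weak time-consistency condition would be checked separately.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+


def has_consistent_ordering(host_parent, transfers):
    """Check whether some ordering of host nodes makes every transfer
    contemporaneous.

    host_parent: child -> parent dictionary for the host tree.
    transfers: (take_off, landing) pairs, each naming the host node at the
        lower end of the branch involved.
    """
    must_precede = {}  # node -> set of nodes that must come earlier

    def before(earlier, later):
        must_precede.setdefault(later, set()).add(earlier)
        must_precede.setdefault(earlier, set())

    # A parent always precedes its children.
    for child, parent in host_parent.items():
        before(parent, child)

    # Two branches overlap in time exactly when each one starts before
    # the other one ends.
    for take_off, landing in transfers:
        if take_off in host_parent:
            before(host_parent[take_off], landing)
        if landing in host_parent:
            before(host_parent[landing], take_off)

    try:
        # static_order() raises CycleError when the constraints conflict.
        list(TopologicalSorter(must_precede).static_order())
    except CycleError:
        return False
    return True


# Hypothetical host tree: h0 has children h1 and h2; h1 has tips h3 and h4;
# h2 has tips h5 and h6.
host_parent = {"h1": "h0", "h2": "h0", "h3": "h1", "h4": "h1",
               "h5": "h2", "h6": "h2"}

print(has_consistent_ordering(host_parent, [("h3", "h5")]))                # True
print(has_consistent_ordering(host_parent, [("h1", "h5"), ("h2", "h3")]))  # False
```

In the second call, the first transfer forces h2 to precede h1 and the second forces h1 to precede h2, so no ordering can satisfy both.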

Ideally, we would like strongly time-consistent reconciliations.  Here is some good news and bad news:  Even finding weakly time-consistent maximum parsimony reconciliations is computationally intractable (NP-hard).  Jane uses a heuristic that only considers strongly time-consistent reconciliations, but doesn't guarantee that they are truly maximum parsimony reconciliations (i.e., their total event costs may be higher than optimal).  eMPRess, and most other tools, use much faster exact algorithms that do guarantee maximum parsimony but with the possibility that the resulting reconciliations are not time-consistent.  eMPRess, however, checks each solution that it finds and indicates whether it is strongly time-consistent (the best outcome), weakly time-consistent, or not even weakly time-consistent.

Dealing with many MPRs

The number of MPRs for a given dataset and a fixed set of event costs can be huge.  In some datasets that we have explored, there have been more than 10^50 (1 with 50 zeros after it) MPRs.  Nguyen et al. have proposed computing a median MPR in such cases.  The median is an MPR that is, roughly speaking, in the "middle" of the space of MPRs and is thus a plausibly good representative.  More precisely, the distance between two MPRs is the number of events in which they disagree, and a median MPR is one that minimizes the total distance to all other MPRs.
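The distance and median definitions can be illustrated by brute force on a tiny made-up example.  This sketch assumes each MPR is represented as a set of (parasite node, event type, host node) triples, which is an illustrative encoding; enumerating all MPRs like this is only feasible for very small solution spaces, and it is not the algorithm eMPRess uses.

```python
def distance(r1, r2):
    """Number of events in which two reconciliations disagree."""
    return len(r1 ^ r2)  # symmetric set difference


def medians(mprs):
    """All MPRs that minimize the total distance to the other MPRs."""
    totals = {name: sum(distance(events, other) for other in mprs.values())
              for name, events in mprs.items()}
    best = min(totals.values())
    return sorted(name for name, total in totals.items() if total == best)


# Made-up MPRs; each event is a (parasite node, event type, host node) triple.
mprs = {
    "R1": frozenset({("p0", "cospeciation", "h0"), ("p1", "transfer", "h2")}),
    "R2": frozenset({("p0", "cospeciation", "h0"), ("p1", "duplication", "h1")}),
    "R3": frozenset({("p0", "duplication", "h0"), ("p1", "duplication", "h1")}),
}

print(distance(mprs["R1"], mprs["R2"]))  # 2
print(medians(mprs))                     # ['R2']
```

Here R2 shares one event with each of the other two reconciliations, so its total distance (4) is smaller than that of R1 or R3 (6 each).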

In general, there's not just one median.  For example, consider the numbers 1, 2, 3, 4.  Both 2 and 3 are medians.  In higher-dimensional spaces (such as the space of all MPRs), there can be many medians - in fact a huge number of medians.  But, a median is still presumably more representative than a completely random MPR.  Thus, in "View reconciliations", if "One MPR" is selected, eMPRess chooses a random median.  Since there are many medians, in general, you won't necessarily see the same MPR each time you do this!

There's another useful feature in the event that the number of MPRs is large.  That option is to cluster the space of MPRs into groups based on similarity.  In the "View solution space" pull-down menu, choose "Clusters".  A window pops up to allow you to enter the number of clusters that you desire to construct (which can be any number between 1 - which means no clustering - and the total number of MPRs).  In our experience, 2 or 3 clusters is generally sufficient.  Then, eMPRess uses a clustering algorithm that clusters MPRs according to their distance from one another, using the distance measure described above.  eMPRess displays a histogram of the distances between all pairs of MPRs in the first row, the distances between all MPRs within each of the two clusters in the second row, and so forth up to the maximum number of clusters that you've specified.

Finally in "View Reconciliations", you can choose "One per cluster", which will display one randomly selected median reconciliation in each cluster.

This set of features provides a systematic way to find best representative sets of MPRs when the space of MPRs is too large to be adequately represented by a single MPR.

Download and Install eMPRess

Software license information

eMPRess Software

Copyright (C) 2020 Libeskind-Hadas Research Group, Harvey Mudd College

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version.  This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details (https://www.gnu.org/licenses).

Register (Optional)

If you would like to be notified of updates or announcements regarding eMPRess, please complete this form.

One-click installation for the eMPRess GUI 

If you plan to use only the graphical user interface version of eMPRess, you may be able to perform a quick-and-easy one-click install.  If this installer doesn't work on your platform, please use the "Install eMPRess from GitHub" instructions below.

MacOS

The current one-click installer for eMPRess is no longer maintained; please use the "Install eMPRess from GitHub" instructions below.

Linux

Windows

Install eMPRess from GitHub

Please refer to the Install Empress for Development wiki page on GitHub.

Please see the documentation for details on running both the GUI and the CLI.

Sample data

This zip file contains four sample datasets, each comprising a host tree, a parasite/symbiont tree, and a mapping between the tips of the two trees.

Fig-wasp dataset from Weiblen GD and Bush GW, Speciation in fig pollinators and parasites. Molecular Ecology 2002, 11, 1573-1578.

Gopher-louse dataset from Hafner MS and Nadler SA, Phylogenetic trees support the coevolution of parasites and their hosts. Nature 1988, 332:258-259.

Seabird-louse dataset from Paterson AM, Wallis GP, Wallis LJ, Gray RD, Seabird louse coevolution: complex histories revealed by 12S rRNA sequences and reconciliation analyses. Systematic Biology 2000, 49, 383-399.

Finches and brood parasites from Sorenson MD, Balakrishnan CN, Payne RB, Clade-limited colonization in brood parasitic finches (Vidua spp.) Systematic Biology, 2004, 53, 140-153.

Documentation

Input Files

Three files are required as input:  The host tree, the parasite (symbiont) tree, and a tip mapping.  

The host and parasite trees must be in newick format in which all leaves and internal nodes are named and have unique names (no names repeated anywhere).  The files must have the extension .nwk.  These trees can have branch length information, but eMPRess ignores it.  Species names must not contain whitespace; the newick convention used here is to replace whitespace with an underscore symbol.  For example, Diomedea epomophora should instead be written Diomedea_epomophora.
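If your species names come from a spreadsheet or similar source, a small helper along these lines can prepare them before you build the newick files.  The function names are illustrative assumptions and are not part of eMPRess.

```python
import re
from collections import Counter


def newick_safe(name):
    """Replace each run of whitespace in a species name with one underscore."""
    return re.sub(r"\s+", "_", name.strip())


def repeated_names(names):
    """Names that occur more than once; every node name must be unique."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


print(newick_safe("Diomedea epomophora"))          # Diomedea_epomophora
print(repeated_names(["n1", "n2", "n1", "root"]))  # ['n1']
```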

The mapping is a text file that ends with the extension .mapping and specifies the association of the tips of the parasite tree to the tips of the host tree.  Each line in the file is of the form:

parasiteTipName : hostTipName

Note that this mapping must associate each parasite tip with at most one host tip.  It is fine for a parasite tip not to be mapped to any host tip, but a parasite tip cannot be mapped to more than one host tip.  Similarly, it is fine for a host tip not to be mapped from any parasite tip.  Finally, it's fine for multiple parasite tips to be mapped to the same host tip.
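The rules above can be sketched as a small parser for the mapping format.  This is an illustration only, not eMPRess's own reader: the tip names are hypothetical, and tolerating blank lines and arbitrary spacing around the colon is an assumption.

```python
def parse_mapping(text):
    """Parse 'parasiteTipName : hostTipName' lines into a dictionary."""
    tip_map = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are skipped
        parasite, sep, host = line.partition(":")
        parasite, host = parasite.strip(), host.strip()
        if not sep or not parasite or not host:
            raise ValueError(f"line {line_number}: expected 'parasite : host'")
        if parasite in tip_map:
            # A parasite tip may be associated with at most one host tip.
            raise ValueError(
                f"line {line_number}: {parasite} appears more than once")
        tip_map[parasite] = host
    return tip_map


example = """
louse_a : gopher_1
louse_b : gopher_1
louse_c : gopher_2
"""
print(parse_mapping(example))
# {'louse_a': 'gopher_1', 'louse_b': 'gopher_1', 'louse_c': 'gopher_2'}
```

Two parasite tips sharing gopher_1 is accepted, while a second line for louse_a would be rejected.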

Running eMPRess through the Graphical User Interface

Documentation on running eMPRess through the GUI is available here.

Running eMPRess through the Command Line Interface

Documentation on running eMPRess through the CLI is available here.

Credits and citing eMPRess

Many people contributed to the development of eMPRess, both in the development of the algorithms and the implementation of the software tool.  

If you use eMPRess in your work, please cite:

"eMPRess: A Systematic Cophylogeny Reconciliation Tool" by S. Santichaivekin, Q. Yang, J. Liu, R. Mawhorter, J. Jiang, T. Wesley, Y-C. Wu, and R. Libeskind-Hadas, in preparation.

The algorithms employed in eMPRess were published in these papers:

"Pareto-Optimal Phylogenetic Tree Reconciliation" by R. Libeskind-Hadas, Y-C. Wu, M. Bansal, and M. Kellis, Bioinformatics, Volume 30, Issue 12, 15 June 2014, Pages i87–i95, https://doi.org/10.1093/bioinformatics/btu289

"An Efficient Exact Algorithm for Computing All Pairwise Distances Between Reconciliations in the Duplication-Transfer-Loss Model" by S. Santichaivekin, R. Mawhorter, and R. Libeskind-Hadas, BMC Bioinformatics, Volume 20, Supplement 20, 17 December 2019, Article 636, https://doi.org/10.1186/s12859-019-3203-9

"Hierarchical Clustering of Maximum Parsimony Reconciliations" by R. Mawhorter and R. Libeskind-Hadas, BMC Bioinformatics, Volume 20, Issue 1, 26 November 2019, Article 612, https://doi.org/10.1186/s12859-019-3223-5

The eMPRess code base was developed by S. Santichaivekin, R. Mawhorter, J. Liu, Q. Yang, J. Jiang, T. Wesley, Y-C. Wu, and R. Libeskind-Hadas with additional contributions by C. Ngo, P. Andrews, S. Sehra, Adrian Garcia, Alberto Garcia, D. Makhervaks, and Z. Witzel.


FAQ

Why doesn't my newick file load?

Make sure that the trees don't have polytomies. 

My files load, but eMPRess can't find reconciliations.  Why?

Make sure that all of the nodes have different names.

Feedback, known issues, reporting bugs, etc.

The current version of eMPRess (version 1.1) has some known limitations or bugs listed below.  If you find others, or would like to give us feedback or suggestions, please complete this feedback form.

Here are some issues that we're aware of in the current version of eMPRess:

The development of eMPRess was supported by grant 1905885 from the National Science Foundation to Harvey Mudd College.