Reconciliation Feasibility in the Presence of Gene Duplication, Loss, and Coalescence with Multiple Individuals per Species
Address correspondence to: Yi-Chieh Wu (yjw AT cs DOT hmc DOT edu)
PLCT is a package for understanding gene tree evolution through
gene duplications, losses, and coalescence with multiple samples per species.
In our paper, we evaluated feasibility using
6798 real gene families across 7 ape genomes, as well as
simulated gene families across the 12-species Drosophila clade.
Species trees: apes.stree,
Species maps: apes.smap,
Species abbreviations: flies.names.txt
PLCT does not require a species tree. However, we provide the species trees, the species maps
that specify which genes belong to which species, and the species name abbreviations for reference.
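As an illustration of how a species map can be used, here is a minimal Python sketch. It assumes that each line of the .smap file holds a gene-name pattern and a species name separated by a tab, and that patterns may contain a "*" wildcard. That layout is an assumption and is not specified in this README, so check apes.smap before relying on it. The function names and the gene names in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase


def parse_smap(text):
    """Parse species-map text into a list of (pattern, species) rules.

    ASSUMPTION: one rule per line, written as PATTERN<TAB>SPECIES.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        pattern, species = line.split("\t")
        rules.append((pattern, species))
    return rules


def species_of(gene, rules):
    """Return the species of the first rule whose pattern matches the gene name."""
    for pattern, species in rules:
        if fnmatchcase(gene, pattern):
            return species
    raise KeyError("no species-map rule matches gene %r" % gene)
```

For example, with the made-up rules "human_*" -> human and "chimp_*" -> chimp, species_of("chimp_geneA_1", rules) returns "chimp".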
Real ape dataset
Filenames have the format FAMID.HAPLOTYPE.EXT.
Alignments are in FASTA format
and trees in Newick format.
- FAMID: the gene family ID from Ensembl
- HAPLOTYPE: the haplotype number (1,2)
- EXT: align for alignments, tree for gene trees
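The naming scheme above can be unpacked with a few lines of Python. This is a minimal sketch, not part of PLCT. It assumes that FAMID itself contains no "." characters, and the names ApeFile and parse_ape_filename are hypothetical.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ApeFile(NamedTuple):
    famid: str      # gene family ID
    haplotype: int  # 1 or 2
    ext: str        # "align" or "tree"


def parse_ape_filename(path):
    """Split a FAMID.HAPLOTYPE.EXT filename into its three fields."""
    name = PurePosixPath(path).name
    parts = name.split(".")
    if len(parts) != 3:  # ASSUMPTION: FAMID has no dots of its own
        raise ValueError("expected FAMID.HAPLOTYPE.EXT, got %r" % name)
    famid, haplotype, ext = parts
    if haplotype not in ("1", "2"):
        raise ValueError("haplotype must be 1 or 2, got %r" % haplotype)
    if ext not in ("align", "tree"):
        raise ValueError("extension must be align or tree, got %r" % ext)
    return ApeFile(famid, int(haplotype), ext)
```

For example, parse_ape_filename("FAM0001.2.tree") yields famid "FAM0001", haplotype 2, and ext "tree". The ID "FAM0001" is a placeholder, not a real family ID from the dataset.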
Simulated fly dataset
Each gene family is stored in its own directory sim-flies/POP-RATE/FAMID.
- POP: the effective population size (1e6-100e6)
- RATE: the (duplication and loss) rate multiplier (1x,2x,4x),
  where 1x is the rate observed in real data
- FAMID: the gene family ID (0-499)
Each directory has the following files:
- the true reconciliation in DLCoal format:
  - the gene tree in Newick format with named internal nodes
  - the reconciliation mapping between the gene tree (*.coal.tree)
    and the locus tree (*.locus.tree)
  - the locus tree in Newick format
  - the reconciliation mapping between the locus tree (*.locus.tree)
    and the species tree (*.stree)
  - the set of daughter nodes
- FAMID.coal.align: the simulated alignment in FASTA format
- FAMID.coal.raxml.tree: the reconstructed gene tree in Newick format
- the reconciliations, alignments, and trees for multiple samples per species
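To recover the simulation settings from the sim-flies/POP-RATE/FAMID directory layout, a small Python sketch such as the following may help. It assumes that directory names are spelled like "1e6-2x", that is, POP in a form float() accepts and RATE as an integer followed by "x". The exact spelling on disk is not stated in this README. The names SimFamily and parse_sim_dir are hypothetical.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class SimFamily(NamedTuple):
    pop_size: float        # effective population size
    rate_multiplier: int   # duplication and loss rate multiplier
    famid: int             # gene family ID


def parse_sim_dir(path):
    """Read POP, RATE, and FAMID from a .../POP-RATE/FAMID directory path."""
    parts = PurePosixPath(path).parts
    if len(parts) < 2:
        raise ValueError("expected .../POP-RATE/FAMID, got %r" % path)
    pop_rate, famid = parts[-2], parts[-1]
    # Split on the last hyphen so that POP keeps any hyphen of its own.
    pop, sep, rate = pop_rate.rpartition("-")
    if not sep or not rate.endswith("x"):
        raise ValueError("expected POP-RATE such as 1e6-2x, got %r" % pop_rate)
    # ASSUMPTION: POP parses with float() and RATE is an integer plus "x".
    return SimFamily(float(pop), int(rate[:-1]), int(famid))
```

For example, parse_sim_dir("sim-flies/1e6-2x/42") gives a population size of 1000000.0, a rate multiplier of 2, and family ID 42.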
Last updated 07/20/16.