STAR-MP: Species Tree informed Architecture Reconstruction

STAR-MP

Species Tree informed Architecture Reconstruction - Maximum Parsimony

Paper

Evolution at the Subgene Level: Domain Rearrangements in the Drosophila Phylogeny
Yi-Chieh Wu, Matthew D. Rasmussen, and Manolis Kellis.
Molecular Biology and Evolution. 2012. doi: 10.1093/molbev/msr222

Address correspondence to: Yi-Chieh Wu (yjw at mit.edu) and Manolis Kellis (manoli at mit.edu)

Download

STAR-MP is a phylogenetic method for reconstructing architecture evolution based on a known species tree, extant architectures, and (reconstructed) module (domain) phylogenies.

2011.06.15

STAR-MP 1.0: starmp-1.0.tar.gz (source code)

Requirements

Python (2.5.4 or greater): http://python.org
Numpy (1.5.1 or greater): http://numpy.scipy.org
Networkx (0.99 or greater): http://networkx.lanl.gov/

Supplemental data

Analysis of 9 Drosophila genomes

In our paper, we considered domain architecture rearrangements in 9 fully sequenced Drosophila species (FlyBase May 2009 Release). We used BLAST and OrthoMCL to determine modules and module families, and connected components to determine architecture families.

Species tree: flies9.stree
Species map: flies9.smap
Species abbreviations: flies9.names.txt

STAR-MP requires a species tree and a species map. We use the species tree estimated in (Tamura2004). We also provide the species map, which specifies which genes belong to which species, and the species name abbreviations used in *.stree and *.smap. See SPIDIR and SPIMAP for more detail on these files.
Gene names: flies9.ids.txt

Our files use the FlyBase peptide id (e.g. dmel_FBpp0079164) as the unique gene id. Users who wish to use alternative identifiers can use this tab-delimited file to map the peptide id to a (1) CG protein id (e.g. CG7562-PA), (2) common protein name (e.g. Trf-PA), (3) FlyBase gene id (e.g. dmel_FBgn0010287), (4) CG gene id (e.g. CG7562), (5) short gene name (e.g. Trf), or (6) long gene name (e.g. TBP-related factor).
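The mapping described above can be loaded with a few lines of Python. This is a minimal sketch, not part of STAR-MP: the function name load_id_map and the field labels are hypothetical, the column order is assumed to follow the order in which the identifiers are listed above, and the example row is assembled from the identifiers quoted in the text (that they all belong to one record is an assumption).

```python
import io

# Assumed column order (peptide id first, then the six alternatives in the
# order listed in the text). Check against the real flies9.ids.txt.
ID_FIELDS = ("peptide_id", "cg_protein_id", "protein_name",
             "gene_id", "cg_gene_id", "short_name", "long_name")

def load_id_map(handle, target="short_name"):
    """Return a dict mapping FlyBase peptide id -> the chosen identifier."""
    if target not in ID_FIELDS[1:]:
        raise ValueError("unknown target column: %s" % target)
    col = ID_FIELDS.index(target)
    mapping = {}
    for line in handle:
        row = line.rstrip("\n").split("\t")
        if len(row) < len(ID_FIELDS):
            continue  # skip blank or incomplete lines
        mapping[row[0]] = row[col]
    return mapping

# Illustrative row only; with the real file, pass open("flies9.ids.txt").
example = io.StringIO(
    "dmel_FBpp0079164\tCG7562-PA\tTrf-PA\tdmel_FBgn0010287\t"
    "CG7562\tTrf\tTBP-related factor\n"
)
id_map = load_id_map(example, target="cg_gene_id")
print(id_map["dmel_FBpp0079164"])
```

Splitting on tabs only (rather than on any whitespace) keeps long gene names that contain spaces intact.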
Modules and module families: regs.tar.gz
Each line provides the gene, the start and end position (1-indexed) of the module, and the module family.
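Because module positions are 1-indexed, some care is needed when slicing sequences in a 0-indexed language. The sketch below parses one record and converts it to a Python slice. It assumes whitespace-delimited fields in the order given above and an inclusive end position; neither is confirmed by this page, and the names Module, parse_module_line, and module_slice, as well as the family label and sequence, are made up for illustration.

```python
from collections import namedtuple

Module = namedtuple("Module", "gene start end family")

def parse_module_line(line):
    """Parse one 'gene start end family' record (delimiter is an assumption)."""
    gene, start, end, family = line.split()
    return Module(gene, int(start), int(end), family)

def module_slice(module):
    """Convert 1-indexed coordinates (end assumed inclusive) to a Python slice."""
    return slice(module.start - 1, module.end)

# Hypothetical record and toy peptide sequence.
record = parse_module_line("dmel_FBpp0079164\t11\t20\tfam42")
peptide = "M" * 10 + "ACDEFGHIKL" + "W" * 5
print(peptide[module_slice(record)])  # residues at 1-indexed positions 11..20
```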
Architecture families: fams.txt
"Merge/split" architecture families: fams.ms.txt, fams.ms.tar.gz
Conservative "merge/split" architecture families: fams.ms.cons.txt

Each line in the text files lists the genes belonging to a single architecture family.

To focus on gene fusions and fissions, the architecture families were filtered to a set of "merge/split" families, in which one species has a gene with two connected modules and another species has a gene with at least one of these modules unconnected. STAR-MP was used to reconstruct the evolutionary histories of these families. These families are indexed by their line number in "fams.ms.txt", and for each family, we have provided the architecture family (*.fam), the (100 bootstrapped) gene trees as reconstructed by SPIMAP (*.nt.uniq.trees), the architecture scenario as reconstructed by STAR-MP (*.mp), and a figure of this reconstructed architecture scenario (*.mp.svg).
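Since the families are indexed by line number, a lookup table from index to member genes is a convenient starting point for locating the per-family files. The following is a rough sketch under stated assumptions: genes on a line are taken to be whitespace-separated, numbering is taken to start at 1 (adjust first_index if the files are 0-based), and load_families and the gene ids are hypothetical.

```python
import io

def load_families(handle, first_index=1):
    """Map line number -> list of gene ids for one architecture family file.

    Blank lines are kept as empty families so that later line numbers
    are not shifted.
    """
    families = {}
    for number, line in enumerate(handle, start=first_index):
        families[number] = line.split()
    return families

# Toy stand-in for fams.ms.txt; with the real file, pass open("fams.ms.txt").
toy = io.StringIO("dmel_geneA dsim_geneA\n"
                  "dmel_geneB dyak_geneB dana_geneB\n")
fams = load_families(toy)
print(len(fams), fams[2])
```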

Finally, to limit the effect of genome annotation errors, we also considered a conservative set of "merge/split" families, in which no genes within the family are adjacent, no genes are at the ends of scaffolds, and no genes have transitive BLAST hits through alternatively spliced forms.

Mechanisms for domain architecture rearrangement

In addition, we considered four possible mechanisms for module rearrangement and catalogued ~9000 Drosophila genes involved in such mechanisms.

Fusion/fission of adjacent genes: nbrs.txt
Two adjacent genes merge into a single gene, or a single gene splits into two genes.
File description: In nbrs.txt, column 1 lists the fused genes and column 2 lists the fragmented genes. The related file nbrs_genes.txt lists whether each gene is supported by experimental evidence (column 1: gene, column 2: EST support, column 3: mRNA-seq support), where 'T' indicates consistency, 'F' indicates inconsistency, and 'U' indicates no evidence.
Large-loop mismatch repair or replication slippage: dupmerge.txt
Large-loop mismatch repair or replication slippage results in a merged gene located between the ancestral split (but not necessarily adjacent) genes.
File description: Column 1 lists the (fused) child gene, and column 2 lists the parental genes.
Retrotransposition and exon shuffling: retro.txt
A retrotransposed copy of a gene combines with exons from another gene.
File description: Column 1 lists the fused or fragmented gene that contains the retrotransposed element, and column 2 lists the parental genes.
Duplication-degeneration: dupdeg.txt
A chromosomal segment duplicates, and alternative portions of the duplicates are lost.
File description: Columns 1-2 list the (fragmented) child genes, column 3 indicates whether either of those genes is located on the end of a scaffold, and column 4 lists the parental genes.
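As an illustration of working with these files, the evidence codes in nbrs_genes.txt can be decoded and tallied as sketched below. The field delimiter (whitespace) is an assumption, the helper names parse_support_line and tally_support are hypothetical, and the example rows are invented rather than taken from the data.

```python
from collections import Counter

# Meaning of the evidence codes as described for nbrs_genes.txt.
EVIDENCE = {"T": "consistent", "F": "inconsistent", "U": "no evidence"}

def parse_support_line(line):
    """Parse 'gene EST mRNA-seq' (whitespace-delimited: assumption)."""
    gene, est, rnaseq = line.split()
    return gene, EVIDENCE[est], EVIDENCE[rnaseq]

def tally_support(lines):
    """Count how many genes fall into each (EST, mRNA-seq) evidence pair."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        _, est, rnaseq = parse_support_line(line)
        counts[(est, rnaseq)] += 1
    return counts

# Invented rows for demonstration only.
rows = ["geneA\tT\tU", "geneB\tT\tT", "geneC\tF\tU", "geneD\tT\tU"]
support = tally_support(rows)
print(support[("consistent", "no evidence")])
```

An unrecognized code raises a KeyError, which makes format surprises visible early instead of silently miscounting.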

References

(Tamura2004) Tamura K, Subramanian S, Kumar S (2004) Temporal patterns of fruit fly (Drosophila) evolution revealed by mutation clocks. Mol Biol Evol 21: 36-44.

Last updated 03/08/13.