CCAPD: Consolidated Curated Alignments for Phylogenetics Database
The CCAPD is a multiple sequence alignment database of 1215 protein
sequences (including variants) in 28 homologous groups, retrieved from 217
species. 233 of the protein sequences have structures associated, and all 28
homologous groups have at least two sequences with structures associated
with them. The database concentrates on sequences and species of
phylogenetic utility, including for the use of phylogenetics in structural
biology. CCAPD consolidates manually-curated structurally-informed
alignments from 3D_ali, HOMSTRAD, and Pfam as well as locally-performed
structural alignments with manual review. Uncertain regions in the alignment
(due to unreliable structural information, disagreements between databases,
or other reasons) have been determined with manual review.
Interpretation of files
The primary alignment file format is the HTML display version produced using
the showalign program from EMBOSS (with the locally-created ESIMILARITY
matrix). The output was then processed further using locally-created Perl
programs.
Showalign format interpretation
Interpretation of the primary showalign format files
- Reliable regions in the structural alignment are marked in blue. Regions that were not structurally alignable due
to gaps in the structures (intrinsically disordered regions) are marked in red. Areas of uncertain alignment (as determined by
structural alignment distances, disagreements between databases, and manual
review) are in black.
- Groups of lines separated by the "=" symbol are "clusters": the sequences
in each group are 65% or more identical to a sequence with known structure.
Areas not in blue are aligned only within clusters.
- Sequence origin color coding:
- Archaeal sequences are marked with a purple A
- Sequences from fungi or metazoa (animals) are marked with a red *
- Plant sequences are marked with a green P
- Sequences from bacteria are marked with a blue B
Further information on sequence names ("entries") can be found in the group files
(below).
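The 65% identity criterion for clusters can be illustrated with a minimal Python sketch. It is not the program used to build CCAPD. The function names, the gap characters, and the choice to count identity only over columns where both sequences have a residue are all assumptions made for illustration; other definitions of percent identity use a different denominator and give different values.

```python
def percent_identity(a: str, b: str, gap_chars: str = "-.") -> float:
    """Percent identity of two aligned sequences (hypothetical helper).

    Assumption: only columns in which both sequences have a residue
    are compared; columns with a gap in either sequence are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in gap_chars or y in gap_chars:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def in_same_cluster(a: str, b: str, threshold: float = 65.0) -> bool:
    """True if the two aligned sequences meet the identity threshold."""
    return percent_identity(a, b) >= threshold
```

For example, percent_identity("ACDE-F", "ACDQGF") compares five columns, finds four matches, and so in_same_cluster returns True for that pair.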
Alternative showalign formats
Two alternative versions of the showalign-produced alignment files are also
available. (These are recommended for users of non-graphical browsers.)
- The first version uses the HTML tag "STRONG"
for structurally-aligned regions and the "EM" tag for
intrinsically disordered regions.
- The second version shows only the structurally-aligned regions
differently (unless the browser is capable of displaying
text as red).
Sequence origins are coded in these files using the same characters as
above, but without colors.
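For users who want to process the first alternative version programmatically, the following Python sketch separates text by the tag that encloses it. It assumes only what is described above (STRONG for structurally-aligned regions, EM for intrinsically disordered regions, everything else treated as uncertain). The class name, the function name, and the label strings are hypothetical, and the real files also contain sequence names and other text that would need to be filtered out.

```python
from html.parser import HTMLParser

# Hypothetical labels for the three kinds of region described above.
LABELS = {"strong": "aligned", "em": "disordered", None: "uncertain"}


class RegionParser(HTMLParser):
    """Collect (label, text) segments according to the enclosing tag."""

    def __init__(self):
        super().__init__()
        self._stack = []     # currently open STRONG/EM tags
        self.segments = []   # list of (label, text) tuples

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so STRONG arrives as "strong".
        if tag in ("strong", "em"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        current = self._stack[-1] if self._stack else None
        self.segments.append((LABELS[current], data))


def classify_regions(html_text: str):
    """Return the (label, text) segments found in html_text."""
    parser = RegionParser()
    parser.feed(html_text)
    parser.close()
    return parser.segments
```

Calling classify_regions on a fragment such as "<PRE>ACD<STRONG>EFG</STRONG><EM>HI</EM>K</PRE>" yields four segments, labelled uncertain, aligned, disordered, and uncertain in that order.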
Plotcon plots of residue conservation
Plots of degree of residue conservation versus the position along the
alignment are also available (in postscript and PDF formats). The plots were
produced using the EMBOSS plotcon program with an all-positive version of the ESIMILARITY matrix (and a window size of 20
residues). Gaps - including at the ends - are considered the same as a
nonconservative amino acid substitution.
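The idea of a windowed conservation plot, with gaps counted like a nonconservative substitution, can be sketched in Python as follows. This is a simplified illustration and not plotcon's actual calculation: it swaps in a toy identity score (1 for identical residues, 0 otherwise) in place of the ESIMILARITY matrix, and the function names are hypothetical.

```python
from itertools import combinations


def column_scores(alignment, gap_chars="-."):
    """Average pairwise score per alignment column.

    Toy scoring (an assumption, not the ESIMILARITY matrix): identical
    residues score 1; mismatches and any pair involving a gap score 0,
    so a gap counts the same as a nonconservative substitution.
    """
    if len(alignment) < 2:
        raise ValueError("need at least two sequences")
    ncols = len(alignment[0])
    if any(len(seq) != ncols for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    pairs = list(combinations(range(len(alignment)), 2))
    scores = []
    for col in range(ncols):
        total = 0.0
        for i, j in pairs:
            x, y = alignment[i][col], alignment[j][col]
            if x not in gap_chars and x == y:
                total += 1.0
        scores.append(total / len(pairs))
    return scores


def windowed(scores, window=20):
    """Mean score over each full window of consecutive columns."""
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(scores) - window + 1)
    ]
```

With the three-sequence alignment ["ACD-", "ACE-", "AC-F"], the per-column scores are 1, 1, 0, and 0, and a window of 2 smooths these to 1.0, 0.5, and 0.0. The default window of 20 matches the window size stated above.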
Groups (of homologous proteins)
- Alcohol Dehydrogenase 1 (ADH1), Alpha Isozyme for primates
- Alcohol Dehydrogenase 1 (ADH1), Beta Isozyme for primates
- Alcohol Dehydrogenase 1 (ADH1), Gamma Isozyme for primates
- Catalase
- Cellulase A (glycosyl hydrolase 5; exo-1,3-beta-glucanase)
- Cellulase B (glycosyl hydrolase 6; 1,4-beta-cellobiohydrolase)
- Cellulase C (glycosyl hydrolase 7; 1,4-beta-cellobiosidase or endo-1,4-beta-glucanase)
- Cellulase F (glycosyl hydrolase 10; endo-1,4-beta-xylanase)
- Cellulase G (glycosyl hydrolase 11; endo-1,4-beta-xylanase)
- Cellulase H (glycosyl hydrolase 12; endo-beta-1,4-glucanase)
- Cinnamyl Alcohol Dehydrogenase
- Copper/Zinc-containing SuperOxide Dismutase (Copper/Zinc SOD)
- Glutathione-S-Transferase (GST) Class Pi
- Glutathione-S-Transferase (GST) Class Sigma or Glutathione-requiring prostaglandin D synthase
- Glutathione-S-Transferase (GST) Class Zeta
- Hemoglobin V/Alpha
- Myoglobin
- Orotidine-5'-phosphate Decarboxylase
- Poly(A) Polymerase
- Rad51/RadA/RecA
- Sorbitol Dehydrogenase
- TATA-Binding Protein (TBP or TF2D)
- Triosephosphate Isomerase (TPIS)
- Ubiquitin-conjugating enzymes, E2 family
- eIF2a (eukaryotic Initiation Factor 2a)
- eIF4e (eukaryotic Initiation Factor 4e)
- eIF6 (eukaryotic Initiation Factor 6)
- eTF2a (eukaryotic Termination Factor 2a)
Further information
A listing of species with sequences in CCAPD is
available. It includes information on species name variants and species merged
(due to gene flow evidence and/or a high likelihood of confusion among
sequence depositors) for some purposes. The choice of species names used is
purely for the sake of maximizing recognizability, and may not correspond
to current phylogenetic/taxonomic thinking.
Please see http://www.drallensmith.org/research/dissertation.final.pdf
for:
- More information on the alignment methodology
- Further details on the proteins selected, including citations
for information given in the group files (above)
- An example usage of a prior version of the database for
phylogenetic work
Under http://www.drallensmith.org/research/
are also examples of NEXUS-format data files produced using an earlier version
of the database (also available at that location) and some of the programs
used to produce CCAPD. Other programs can be found
here; all are available under an open-source (GNU
Affero GPL Version 3) license. Further data files can be found here.
Some further alignment formats are available under the group files, and more
formats are in progress, although unfortunately most extant alignment
formats have difficulty showing areas of uncertainty. The MSF-format
sequence files, as used by the plotcon and showalign
programs from EMBOSS and the local program consensus.weights.pl, contain
weights for the sequences. These weights are at the present time largely
arbitrary, but in the future MrBayes will be used to derive
distances from which better weights can be found. The weights do not affect the
alignments in any event; they are only of importance for the consensus
sequences and for which letters are capitalized in the display.
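The per-sequence weights can be read from the header of an MSF file with a short Python sketch such as the one below. It assumes the usual GCG/MSF header layout, in which each sequence has a line of the form "Name: ... Len: ... Check: ... Weight: ..." and the header ends with a line containing only "//". The function name is hypothetical, and this is not consensus.weights.pl.

```python
import re

# Matches MSF header lines such as:
#   " Name: seqA  Len: 10  Check: 1234  Weight: 1.00"
_NAME_LINE = re.compile(r"^\s*Name:\s*(\S+).*?Weight:\s*([0-9]*\.?[0-9]+)")


def read_msf_weights(lines):
    """Return {sequence name: weight} from the header of an MSF file.

    `lines` is any iterable of text lines (for example an open file).
    Reading stops at the "//" line that ends the MSF header.
    """
    weights = {}
    for line in lines:
        if line.strip() == "//":
            break
        match = _NAME_LINE.match(line)
        if match:
            weights[match.group(1)] = float(match.group(2))
    return weights
```

A typical use would be to open one of the MSF-format files and pass the file object to read_msf_weights, which returns a dictionary keyed by sequence name.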
CCAPD was created by Dr. Allen Smith with Dr. Peter Kahn (for structural
alignment reviews) and Dr. Theodore Chase, Jr. (for alcohol dehydrogenases).
Dr. Karl Kjer inspired both our use of structural alignments for
phylogenetics and the creation of CCAPD for public usage.
This webpage, and all other files in or
under this directory (except for those explicitly copyrighted or licensed
otherwise), are licensed (copyright 2001-2008) by Allen Smith under a
Creative Commons Attribution-ShareAlike 2.5 License; also available is a
text version of the Legal Code of the license. (Of course, factual material
is not copyrightable - fortunately!)