Once data in this format has been pasted into the Phylogenator Work area, it can be processed in the following ways. Clicking on the 'Show Diffs' button replaces the most frequent character in each data column by a '.', and the second most frequent character in the column by a '*'. Other characters are left as is. This exposes the true phylogenetic information content of the array of base positions. Try this to see what is meant. This operation, like any other Phylogenator operation, can be undone by clicking on the 'Restore' button (try this). (Sorry, only single-level restore is available.)
Clicking on the 'Show Diffs' button also puts the consensus sequence of the data originally in the work area into the small Phylogenator input field seen immediately underneath the Work Area.
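The masking rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Phylogenator's own code: the function name `show_diffs` is hypothetical, the input is taken to be a list of equal-length strings with any species names already removed, and ties between equally frequent characters are broken in favor of the character seen first in the column (the text does not say how Phylogenator breaks ties).

```python
from collections import Counter


def show_diffs(rows):
    """Return (masked_rows, consensus) for a list of equal-length strings.

    Hypothetical helper: per column, the most frequent character becomes '.',
    the second most frequent becomes '*', and all others are left unchanged.
    """
    consensus = []
    masked_columns = []
    for col in zip(*rows):  # walk the data column by column
        # most_common() orders by count; ties keep first-seen order (assumption)
        ranked = [ch for ch, _ in Counter(col).most_common()]
        first = ranked[0]
        second = ranked[1] if len(ranked) > 1 else None
        consensus.append(first)
        masked_columns.append(
            ["." if ch == first else "*" if ch == second else ch for ch in col]
        )
    # turn the masked columns back into rows
    masked_rows = ["".join(chars) for chars in zip(*masked_columns)]
    return masked_rows, "".join(consensus)


masked, consensus = show_diffs(["ACGT", "ACGA", "ATGA", "AGCA"])
print("\n".join(masked))
print("consensus:", consensus)
```

The second return value plays the role of the consensus sequence that 'Show Diffs' places in the small input field.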
The columns with the most significant evidential weight are those which contain the most bases differing from the majority. These can be moved to the left of the array by clicking on 'Sort-Cols-by-*'. Try this by clicking first on 'Show Diffs', and then immediately on 'Sort-Cols-by-*'. Permuting species to bring those with the greatest number of apparent mutations to the top will make the situation even clearer. This is done by clicking on 'Sort' (try this). These simple operations on the Harris-Hey data make the two chimpanzees stand out quite clearly; they can then be selected and deleted, following which the data array can be simplified by dropping all columns in which no differences remain (click on 'Diffs Only' for this), or more drastically by clicking on 'Doubles Only' (which removes all columns in which no bases differ, or in which only one species differs from the others). Once all this has been done, the data which remains (23 columns, out of the original 4,220 in the case of the Harris-Hey data) contains the essential evidence for parsimony-based phylogeny inferences.
In the Harris-Hey case, a most parsimonious phylogeny tree can easily be guessed from the final reduced data that we have described. To prepare for this, you may want to space out the groups seen in the data by inserting blank lines in appropriate positions. Click here to see the result. From this data, it is not hard to guess the following tree:
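For readers who like to see such column operations spelled out, here is a minimal Python sketch of the column sorting and column pruning described above. It assumes the data has already been masked so that '.' marks the majority character in each column; the helper names are hypothetical, and counting every non-'.' character as a difference is an assumption about how Phylogenator weighs a column.

```python
def column_diff_counts(masked_rows):
    """Number of characters differing from the majority ('.') in each column."""
    return [sum(ch != "." for ch in col) for col in zip(*masked_rows)]


def select_columns(masked_rows, order):
    """Rebuild every row using only the column indices listed in `order`."""
    return ["".join(row[i] for i in order) for row in masked_rows]


def sort_cols_by_diffs(masked_rows):
    """Move the columns with the most differences to the left (stable sort)."""
    counts = column_diff_counts(masked_rows)
    order = sorted(range(len(counts)), key=lambda i: -counts[i])
    return select_columns(masked_rows, order)


def keep_informative(masked_rows, min_diffs):
    """min_diffs=1 mimics 'Diffs Only'; min_diffs=2 mimics 'Doubles Only'."""
    counts = column_diff_counts(masked_rows)
    order = [i for i, n in enumerate(counts) if n >= min_diffs]
    return select_columns(masked_rows, order)


masked = ["...*", "....", ".*..", ".G*."]
print(sort_cols_by_diffs(masked))
print(keep_informative(masked, 1))
print(keep_informative(masked, 2))
```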
          ._____ S1
          |
          |_____ S10
          |
          |_____ S11
          |
          |_____ S12
          |
          |_____ S13
          |
          |     ._____ S14
          |     |
          |     |_____ S16
          |_____|
          |     |_____ S17
          |     |
          |     |_____ S5
    ._____|
    |     |_____ S15
    |     |
    |     |_____ S18
    |     |
    |     |_____ S2
    |     |
    |     |_____ S21
    |     |
    |     |_____ S22
    |     |
    |     |_____ S23
    |     |
    |     |_____ S3
    |     |
    |     |_____ S32
    |     |
____|     |_____ S7
    |
    |           ._____ S19
    |           |
    |           |_____ S27
    |           |
    |     ._____|_____ S28
    |     |     |
    |     |     |_____ S35
    |     |     |
    |     |     |_____ S36
    |     |
    |     |           ._____ S20
    |     |     ._____|
    |     |     |     |_____ S24
    |     |_____|
    |     |     |     ._____ S26
    |_____|     |     |
          |     |_____|_____ S34
          |           |
          |           |_____ S9
          |
          |           ._____ S28
          |           |
          |     ._____|_____ S30
          |     |     |
          |     |     |_____ S8
          |     |
          |_____|     ._____ S31
                |_____|
                |     |_____ S33
                |
                |_____ S6
However, fixing on this tree forces us to classify various '*' in the processed data as accidental 'double mutations' to be ignored. Click here to see these marked with a '?'. By arbitrarily converting these seven scattered embarrassments to dots, we (like other happily data-fudging students?) get an edited data set which breaks cleanly between subtrees of the tree displayed.
One can weave the following interpretation around this tree. Its uppermost half subtree is noticeably shallower than its lower half. This can be seen as the footprint of a population that spread out of its ancestral area relatively recently, and so has had time to accumulate only a scattering of isolated mutations (but not yet any more deeply structured subtrees). In contrast, the lower subtree is more deeply structured, and so may represent a population that has remained in its ancestral area.
The 'Count' tool can be used to find about 37 mutations unique to the two chimpanzees in the Harris-Hey data set as given above. If this is taken (from what archeological evidence there is on the time of divergence between man and the apes) to represent the mutations accumulated during about 4 million years, and if we note that most of the twigs in the upper half of the tree seen above are separated from the root of that half by only a single mutation, we can estimate the time since the top half subtree began to diverge as being roughly 4,000,000/37, or about 108,000 years.
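As a quick check of this back-of-the-envelope arithmetic, the division can be carried out explicitly. The two constants below are simply the figures quoted in the text; the variable names are hypothetical, and the calculation assumes, as the text does, that mutations accumulate at a steady rate.

```python
# Figures quoted in the text (both are rough estimates)
CHIMP_SPECIFIC_MUTATIONS = 37   # mutations unique to the two chimpanzees
DIVERGENCE_YEARS = 4_000_000    # assumed time since the human/ape split

# Average time needed to accumulate one mutation, under a steady-rate assumption
years_per_mutation = DIVERGENCE_YEARS / CHIMP_SPECIFIC_MUTATIONS

# Twigs one mutation away from the root of the upper subtree
twig_depth_in_mutations = 1
estimated_age = twig_depth_in_mutations * years_per_mutation
print(f"{estimated_age:,.0f} years")
```

With the figures quoted, the quotient comes to a little over 108,000 years, which is the order of magnitude that matters for the argument.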
'Lace' and 'Unlace' are provided to allow manipulation of phylogeny datasets involving more bases than can fit on a single line. If a data set to be examined involves more bases than can fit on single lines in the Phylogenator work area, Phylogenator expects it to be represented in 'interlaced' format. In this format, the data is represented by a succession of blocks, all of the same length, separated by empty lines. The j-th line of block n + 1 continues the sequence of bases for the j-th species from where the j-th line of block n left off. Clicking here will insert a block of interlaced data (mitochondrial Cytochrome Oxidase 2 data for a group of 9 Hominids) into the Phylogenator work area, and so will let you see exactly what data in interlaced format looks like. 'Unlace' takes data in laced format and regroups it so that all the data for each single species falls together in a single contiguous block, successive blocks being separated by empty lines. 'Laced' data is the format in which prealigned sequences can most readily be obtained from the NCBI 'Pops' database, which is the official master site for posting aligned and unaligned nucleotide data collections covering entire population groups (try their 'endangered tortoises' collection). If instead ('naturally prealigned') data for corresponding genes of separate species is obtained from the NCBI Nucleotide database or from a protein database, it may be easier to start with 'Unlaced' data and to Lace it before starting to examine it using Phylogenator. Try using 'Lace' and 'Unlace' several times in succession to see how they work. Note that 'Lace' automatically prefixes each line of each data block other than the first with the number of its starting base, followed by a ';'; such prefixes are always harmless, since Phylogenator understands that the data in a line is always the string of characters following the last semicolon in the line.
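The regrouping that 'Lace' and 'Unlace' perform can be pictured as swapping the roles of "block" and "line within block". The Python sketch below shows that idea under stated assumptions: the function names are hypothetical, every block is assumed to have the same number of lines, and the numeric prefixes that 'Lace' adds to later blocks are simply carried along rather than regenerated.

```python
def split_blocks(text):
    """Split text into blocks of non-empty lines separated by empty lines."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def regroup(text):
    """Line j of block n becomes line n of block j (assumes equal-size blocks)."""
    blocks = split_blocks(text)
    return "\n\n".join("\n".join(group) for group in zip(*blocks))


# In this simplified picture both operations are the same regrouping.
unlace = regroup
lace = regroup


def sequence_of(line):
    """The data in a line is whatever follows the last semicolon."""
    return line.rsplit(";", 1)[-1]


laced = "human;ACGT\nchimp;ACGA\n\n5;TTAA\n5;TTAC"
unlaced = unlace(laced)
print(unlaced)
# Full sequence for the first species, rebuilt from its unlaced block
print("".join(sequence_of(line) for line in split_blocks(unlaced)[0]))
```

Because the regrouping is its own inverse, applying `lace` to the unlaced text gives back the original laced text, which mirrors the advice to try the two buttons several times in succession.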
'Hide' and 'Show' make it easier to work with data in interlaced format. 'Hide' conceals all blocks of interlaced data past the first, but retains the logical connection between the lines of the first block and the associated lines in subsequent blocks. (The titles prefixed to each of the lines in the first block are used to maintain this association, hence none of these titles should be changed between a 'Hide' operation and the next following 'Show' operation, which restores the hidden data.) Once 'Hide' has been used to reduce interlaced data to a single visible block, the lines which remain can be permuted using ordinary cut-and-paste operations, and some of them can even be deleted or duplicated; 'Show' will then restore the data in the appropriate permuted or otherwise modified arrangement. Try using 'Hide' and 'Show' to see how they work.
Phylogenator provides a few additional utility tools. 'Transpose' interchanges the rows and columns of the data matrix. This eases manual column-edit operations. 'New Window' opens a new copy of Phylogenator and copies the existing data into it. This is a useful way of allowing backtracking if multiple manual operations (e.g. deletion of suspected 'outlier' species) are to be tried. 'Trim' eliminates the prefixed names from a data set, but puts them (in their semicolon-separated form) into the small Phylogenator input field seen immediately underneath the Work Area. 'Untrim' reverses this operation, and also provides an easy way of prefixing data lines by species names (just paste them, as a semicolon-separated list, into the small Phylogenator text field, and click 'Untrim').
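To make the relationship between these utilities concrete, here is a small Python sketch of 'Transpose', 'Trim' and 'Untrim'. The function names are hypothetical, each line is assumed to look like `name;sequence`, and species names are assumed to contain no semicolons of their own (otherwise the semicolon-separated name list would be ambiguous).

```python
def transpose(rows):
    """Interchange rows and columns of a list of equal-length strings."""
    return ["".join(col) for col in zip(*rows)]


def trim(lines):
    """Split 'name;sequence' lines into (semicolon-joined names, sequences)."""
    names, data = [], []
    for line in lines:
        name, _, seq = line.rpartition(";")  # data follows the last semicolon
        names.append(name)
        data.append(seq)
    return ";".join(names), data


def untrim(names_field, data):
    """Prefix each data line with its name from a semicolon-separated list."""
    names = names_field.split(";")
    return [f"{name};{seq}" for name, seq in zip(names, data)]


lines = ["human;ACG", "chimp;ACT"]
names_field, data = trim(lines)
print(names_field)        # what would land in the small input field
print(transpose(data))    # columns of the data become rows
print(untrim(names_field, data))
```

Note that transposing twice returns the original rows, and that `untrim` undoes `trim`, which is what makes the trim, transpose, edit, transpose, untrim routine safe.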
The following sequence of operations can often be used to ease visual examination of the evidence for a conjectured phylogeny tree. (i) First use "Show Diffs" to highlight variant base positions only; (ii) then use "Doubles Only" to prune the data set; (iii) use "Hide" and then "Show" to put the species being examined into their leaf order in the conjectured phylogeny tree. After these preliminary steps, (a) use "Sort-Cols-by-*" as a first step toward bringing together the base positions of greatest evidential weight; (b) use "Trim" to remove the species names temporarily; (c) "Transpose", and then immediately "Sort", and then "Transpose" again, to bring significant base differences to the left, in the priority order of the leaf arrangement introduced by step (iii); (d) re-introduce the temporarily removed species names by using "Untrim". The "Sort Cols" operation is provided as a one-click convenience for executing steps (a) through (d) together.
If no tree suggests itself, the 'Sort-by-Diffs' operation may suggest one. This operation tries to put the species present into an order which brings related species close together. This is done by starting with whatever species comes first, and then successively moving the species whose associated data sequence lies closest to a species already placed into the next available position. This operation also writes the name of the species most closely related to another, which might make a good line 1 for a subsequent rerun of 'Sort-by-Diffs', into the small Phylogenator text area. (Since the code for this operation makes special use of the unusual character 'ß', you should avoid all use of this tricky-to-type character.)
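The greedy ordering described for 'Sort-by-Diffs' can be sketched as follows. This is an illustration only: the function names are hypothetical, "closest" is assumed to mean the smallest number of differing positions (Hamming distance) between equal-length sequences, and ties are broken in favor of the species that appears earlier in the input. The text does not specify either choice.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def sort_by_diffs(species):
    """Greedy reordering of (name, sequence) pairs.

    Start with the first species, then repeatedly append the unplaced
    species whose sequence is closest to any species already placed.
    """
    if not species:
        return []
    placed = [species[0]]
    remaining = list(species[1:])
    while remaining:
        best = min(
            remaining,
            key=lambda cand: min(hamming(cand[1], p[1]) for p in placed),
        )
        remaining.remove(best)
        placed.append(best)
    return placed


data = [("a", "AAAA"), ("b", "TTTT"), ("c", "AAAT"), ("d", "ATTT")]
print([name for name, _ in sort_by_diffs(data)])
```

The result depends on which species comes first, which is why rerunning the operation with a different first line can suggest a different grouping.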
You may want to use these operations to examine the mitochondrial Cytochrome Oxidase 2 evidence for the following popular hominid phylogeny: ((human,(chimpanzee,pigmychimp)),gorilla,(baboon,gibbon)). It will be clear that the evidence is somewhat muddy. You can click here to get and analyze the somewhat longer mitochondrial Cytochrome Oxidase 1 dataset for the same hominids, simplified by elimination of all no-difference and 1-difference base positions. This is provided for your phylogenizing pleasure. More aligned sequences to experiment with are available here. Prof. Joe Felsenstein of the University of Washington provides the very nice 'Phylip' collection of phylogeny tools, and the University of Arizona maintains a very useful general catalog of such tools.
Phylogenator works as well for amino acid sequences (proteins), and for other DNA-based sequence footprints, as it does for sequences of bases. The aligned sequences collection mentioned above contains a sample protein that you can use to verify this.