When sequencing genomes such as human it is helpful to know the location of copies of various repeat family sequences such as Alu (Jurka, J., Walichiewicz, J. and Milosavljevic, A. (1992) Prototypic sequences for human repetitive DNA. J. Mol. Evol. 35, 286-291). The program repe is being used successfully for locating Alu segments in readings derived from human DNA, but could also be used for other repeat families (see section Using Repe for Other Sequence Families).

The program compares a batch of readings named in a file of file names against a library of Alu sequences. It reports the extent of Alu in each reading, including those with Alu at both ends. As input the program takes two files of file names: one containing the names of all the readings to screen, and the other the names of the files in the Alu library. Its output is two new files of file names: one containing the names of all the readings judged to contain Alu and one the names of those that do not.

The program also modifies all the files found to contain Alu segments by writing TG records (see section Experiment file records) that define their extent. These readings can be treated in special ways during assembly, and the TG records are converted to ALUS tags (see section Tag types) when the readings are entered into the assembly database.
The algorithm used is similar to that of vector_clip (see section Screening Against Vector Sequences), but the only parameter required is "Cutoff score". This value is the proportion of the best group of diagonals that is covered by hits composed of four character words. Currently, as for vector_clip, the best diagonal is found and its score is then combined with the scores for the three diagonals on either side.
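The scoring idea described above can be illustrated with a minimal sketch. It is not the code used by repe or vector_clip: the function name diagonal_score, the choice to pool covered positions across the seven diagonals, and the normalisation by the length of the best diagonal's overlap are all assumptions made for illustration. Only the word length of four and the three diagonals on either side come from the text.

```python
from collections import defaultdict

WORD = 4   # word length, as stated in the text
FLANK = 3  # diagonals combined on either side of the best one, as stated


def diagonal_score(reading: str, repeat: str) -> float:
    """Return a score between 0 and 1 for the best band of diagonals.

    Illustrative only: the normalisation used here is an assumption.
    """
    reading, repeat = reading.upper(), repeat.upper()

    # Index every four-character word of the repeat by its start position.
    index = defaultdict(list)
    for j in range(len(repeat) - WORD + 1):
        index[repeat[j:j + WORD]].append(j)

    # For each diagonal (i - j), record the reading positions covered by hits.
    covered = defaultdict(set)
    for i in range(len(reading) - WORD + 1):
        for j in index.get(reading[i:i + WORD], ()):
            covered[i - j].update(range(i, i + WORD))
    if not covered:
        return 0.0

    # Find the best single diagonal, then pool it with its neighbours.
    best = max(covered, key=lambda d: len(covered[d]))
    band = set()
    for d in range(best - FLANK, best + FLANK + 1):
        band |= covered.get(d, set())

    # Assumption: divide by the part of the reading that the best diagonal
    # overlaps, counting only covered positions inside that overlap.
    start = max(best, 0)
    end = min(len(reading), best + len(repeat))
    inside = sum(1 for p in band if start <= p < end)
    return inside / (end - start)
```

Under these assumptions a reading identical to a library sequence scores 1.0 and a reading sharing no four character word with it scores 0.0; a reading would be reported as containing the repeat when its score reaches the cutoff.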
The program also writes a "log" file which contains the name of every reading processed, the top score it obtained and the Alu file that gave that value. This file is useful for establishing an appropriate cutoff score and for monitoring the program when it is used during production. The default cutoff score of 0.6 has been set for the standard Alu files currently in use.
The final input required by the program is the tag type to be created for each match found. Tag types have four character names and the default type is ALUS. Any four character string can be entered, and typing only carriage return will cause the string ALUS to be used. To enable GAP4 to use new tag types, users should enter them into their personal tag databases (see section Configure the tag database).
The following shows typical dialogue.
 repe v2.0: repeat examination program. April 96
 Author: Rodger Staden
 Copyright: Medical Research Council, UK
 ? Input file of gel reading file names=files
 ? Input file of repeat file names=ALUNAMES
 ? Output file of passed file names=alu.pass
 ? Output file of failed file names=alu.fail
 ? Log file name=files.log
 ? Cutoff score (0.00-1.00) (0.60) =
 Default Four letter tag type=ALUS
 ? Four letter tag type=
For both output files of file names the program writes out the reading name, the top score, the top score for any part of the reading that was not covered by the top match, and the number of bases at one end of the reading that are not found to match Alu. An example is shown below.
 h4a02.s1 0.97 0.89 0
 h4a03.s1 0.89 0.00 0
 h4a01.s1 0.75 0.00 129
The first line shows a reading (h4a02.s1) with a top score of 0.97 and a score of 0.89 for the other end of the reading, and consequently no sequence at either end that is free of Alu. The next line is for a reading that scores 0.89 and in which the match covers the whole reading. The last line is for a reading that has a score of 0.75 at one end and 0.0 at the other, with 129 bases free of Alu.
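The four columns described above can be read into named fields with a short script. This is a minimal sketch: the names RepeResult and parse_result_line are hypothetical, and it assumes that the columns are separated by white space, as they appear in the example.

```python
from typing import NamedTuple


class RepeResult(NamedTuple):
    name: str           # reading name
    top_score: float    # top score for the reading
    other_score: float  # top score for the part not covered by the top match
    free_bases: int     # bases at one end not found to match Alu


def parse_result_line(line: str) -> RepeResult:
    """Split one white-space separated output line into typed fields."""
    name, top, other, free = line.split()
    return RepeResult(name, float(top), float(other), int(free))


example = """\
h4a02.s1 0.97 0.89 0
h4a03.s1 0.89 0.00 0
h4a01.s1 0.75 0.00 129
"""

results = [parse_result_line(line) for line in example.splitlines() if line.strip()]
for r in results:
    print(r.name, r.top_score, r.other_score, r.free_bases)
```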
One assembly strategy would be to assemble the readings in the "Output file of passed file names" first. Next sort the "Output file of failed file names" on the amount of non-Alu sequence at one end (e.g. sort -n -r +3 -o alu.fail.sorted alu.fail; where sort no longer accepts the obsolete +3 key syntax, sort -n -r -k 4 -o alu.fail.sorted alu.fail is equivalent). Then use the sorted file (here alu.fail.sorted) as the file of file names for assembly. Sorted in this way, the readings with the most non-Alu sequence will be assembled first, and those with the least last. Also see section Masked Assembly Mode.
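Where the sort command is not available, the same ordering can be produced with a few lines of Python. This is a minimal sketch: the function names sort_by_free_bases and sort_fail_file are hypothetical, and the file names in the usage comment are simply those from the example above.

```python
from typing import Iterable, List


def sort_by_free_bases(lines: Iterable[str]) -> List[str]:
    """Order output lines by the fourth column (non-Alu bases), largest first."""
    records = [line.rstrip("\n") for line in lines if line.strip()]
    return sorted(records, key=lambda rec: int(rec.split()[3]), reverse=True)


def sort_fail_file(src: str, dst: str) -> None:
    """Read the failed-names file, sort it and write the sorted copy."""
    with open(src) as handle:
        ordered = sort_by_free_bases(handle)
    with open(dst, "w") as handle:
        handle.writelines(rec + "\n" for rec in ordered)


# Usage (assumes alu.fail exists in the current directory):
# sort_fail_file("alu.fail", "alu.fail.sorted")
```

Readings with equal amounts of non-Alu sequence keep their original relative order here, which may differ from the order produced by sort.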
The following error messages can be generated.
 1 Error opening experiment file (the file could not be opened)
 2 Error getting gel reading (no sequence found in experiment file)
 No repeat files found (no repeat files found, probably an error in the file of repeat file names)
 Insufficient data in repeat files (repeat files too short)
 Empty line in file of file names
 Error reading file of file names