Discussion:
Question about the RepeatMasker track
Marco Santagostino
2011-06-13 16:32:23 UTC
Dear Sirs,

where can I find the parameters used to generate the RepeatMasker track?
The problem is as follows: I need to extract a certain repetitive
element from the horse genome, and I'm supposed to classify all the hits
found according to their identity (with respect to the consensus
sequence). Some colleagues of mine already took all the sequences with at
least 98% identity by BLAST search, so now I'm supposed to find
those which have a lower identity, but I can't find out how to set up
the Table Browser so that it finds the elements with the identity that I
choose. How do I set up the Table Browser so that, for example, it returns
only the elements with at least 90% identity to the consensus sequence?

Thank you,

Marco Santagostino
--
Marco Santagostino, PhD
Lab. Molecular and Cellular Biology
Dept. Genetics and Microbiology, University of Pavia
Ferrata street, 1 - 27100 Pavia, Italy
Tel.: +39 0382 985540
Fax: +39 0382 528496
e-mail: ***@unipv.it
Greg Roe
2011-06-13 22:25:15 UTC
Hi Marco,

There is some information on the track info page for the RepeatMasker
track. Click the track title. There is also some info in the downloads'
README file: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
(see esp. chromOut.tar.gz)

To set up the Table Browser so that it returns only the elements with at
least 90% identity to the consensus sequence:

(For some background, a definition of RepeatMasker output columns can be
found here: http://repeatmasker.org/webrepeatmaskerhelp.html )

The 2nd, 3rd and 4th columns of the .out files are useful:

15.6 = % substitutions in matching region compared to the consensus
6.2 = % of bases opposite a gap in the query sequence (deleted bp)
0.0 = % of bases opposite a gap in the repeat consensus (inserted bp)

In our database table, those are multiplied by 10 in order to get
integer parts-per-thousand, and called milliDiv (substitutions),
milliDel and milliIns.

The simplest % identity measurement uses milliDiv only, that is,
% identity = 100 - milliDiv/10. If you wish, you can factor in milliDel
and milliIns too.

So, to get % identity >= 90% in the Table Browser, create a filter with
milliDiv <= 100 (milliDiv is divergence in parts per thousand, so at
most 10% divergence corresponds to at most 100).
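
As a rough illustration of the arithmetic: milliDiv counts substitutions rather than matches, so a minimum identity translates into a maximum milliDiv. The sketch below assumes rows have already been exported from the RepeatMasker table; the `RmskHit` type, the function names, and the sample element name are hypothetical, and only the three milli* columns come from the table described above.

```python
from typing import List, NamedTuple


class RmskHit(NamedTuple):
    """Hypothetical container for the relevant columns of one table row."""
    name: str
    milli_div: int  # substitutions, parts per thousand
    milli_del: int  # deleted bases, parts per thousand
    milli_ins: int  # inserted bases, parts per thousand


def percent_identity(hit: RmskHit, include_indels: bool = False) -> float:
    """Estimate % identity to the consensus from the parts-per-thousand columns."""
    penalty = hit.milli_div
    if include_indels:
        penalty += hit.milli_del + hit.milli_ins
    return 100.0 - penalty / 10.0


def filter_by_identity(hits: List[RmskHit], min_identity: float,
                       include_indels: bool = False) -> List[RmskHit]:
    """Keep only hits whose estimated identity is at least min_identity."""
    return [h for h in hits
            if percent_identity(h, include_indels) >= min_identity]


if __name__ == "__main__":
    # The example values from the .out columns above: 15.6, 6.2, 0.0 percent.
    example = RmskHit("exampleRepeat", 156, 62, 0)
    print(percent_identity(example))
    print(percent_identity(example, include_indels=True))
```

With substitutions only, the example row is well below a 90% cutoff; counting deletions and insertions lowers the estimate further.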

Please let us know if you have any additional questions: ***@soe.ucsc.edu

-
Greg Roe
UCSC Genome Bioinformatics Group
Marco Santagostino
2011-07-05 17:09:58 UTC
Dear Sirs,

I worked a bit with the RepeatMasker Track, but I found that, oddly, the
consensus sequence of the transposable element (which we are
investigating) used to mask the genome (and in general used by
RepeatMasker) is different from that annotated in RepBase (and which we
used for some preliminary analysis). I can download the hit list
generated by BLAST using "our" consensus sequence, but I don't have the
coordinates in the ordered horse genome for each BLAST hit, I have just
the coordinates in the contig sequences; is there a way to submit this
hit list (in csv or txt format, or whatever) to Table Browser and
retrieve the coordinates in the horse genome for each hit?

Thanks,

Marco
Mary Goldman
2011-07-06 23:31:27 UTC
Hi Marco,

We are unsure if your contig names are from the Broad Institute (the
organization who performed the sequencing) or NCBI. Can you please check
the Assembly track and see if your contig names match the ones in this
track (here is a link for equCab2:
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=equCab2&g=gold)? If they do,
you can download the data in this track to convert your coordinates. If
not, please send us an example of your contig names and we can see if we
have a conversion file.
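
A minimal sketch of such a conversion, assuming the downloaded Assembly (gold) table has the usual columns chrom, chromStart, chromEnd, frag, fragStart, fragEnd and strand, with 0-based half-open coordinates (please check the table schema before relying on this). The `GoldRow` type, the function name, and the sample rows are hypothetical.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class GoldRow(NamedTuple):
    """Hypothetical subset of one row of the Assembly (gold) table."""
    chrom: str
    chrom_start: int  # 0-based start on the chromosome
    chrom_end: int    # end on the chromosome (exclusive)
    frag: str         # contig (fragment) name
    frag_start: int   # 0-based start within the fragment
    frag_end: int     # end within the fragment (exclusive)
    strand: str       # "+" or "-"


def frag_to_chrom(rows: Iterable[GoldRow], frag: str,
                  pos1: int) -> Optional[Tuple[str, int]]:
    """Map a 1-based position on a contig to (chrom, 1-based position).

    Returns None if no row covers the position.
    """
    p0 = pos1 - 1  # BLAST reports 1-based positions
    for r in rows:
        if r.frag != frag or not (r.frag_start <= p0 < r.frag_end):
            continue
        if r.strand == "-":
            # The fragment is placed reverse-complemented on the chromosome.
            offset = r.frag_end - 1 - p0
        else:
            offset = p0 - r.frag_start
        return r.chrom, r.chrom_start + offset + 1
    return None


if __name__ == "__main__":
    # Made-up rows for illustration only, not real equCab2 data.
    gold = [
        GoldRow("chr8", 1000, 1500, "contig_A", 0, 500, "+"),
        GoldRow("chr8", 2000, 2300, "contig_B", 0, 300, "-"),
    ]
    print(frag_to_chrom(gold, "contig_A", 1))
    print(frag_to_chrom(gold, "contig_B", 1))
```

For a hit on the minus strand, the start and end of the hit swap order after conversion, so both ends should be converted and then sorted.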

Best,
Mary
------------------
Mary Goldman
UCSC Bioinformatics Group
Mary Goldman
2011-07-13 21:26:24 UTC
Hi Marco,

It looks like your contig names are from NCBI and, unfortunately, we do
not have a way to convert those names into our assembled chromosome name
space. If you search on NCBI for your contig name, it will give you the
scaffold name, which you can then use with our scaffold track/table to
convert into assembled chromosome name space.

Best,
Mary
------------------
Mary Goldman
UCSC Bioinformatics Group
Post by Marco Santagostino
Hello,
I think that the contig names are different, here is an example of the
gi|194214692|ref|NW_001867430.1|Eca8_WGA28_2 Equus caballus chromosome 8 genomic contig, reference assembly (based on EquCab2)
subject ids, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
gi|194214692|ref|NW_001867430.1|Eca8_WGA28_2 98.25 228 4 0 1 228 835135 835362 2e-109 399
I suppose the contig name is NW_001867430.1, which is also the
accession number, but it seems that it doesn't match those found in
ACTIONS          QUERY    SCORE  START  END   QSIZE  IDENTITY  CHRO  STRAND  START     END       SPAN
------------------------------------------------------------------------------------------------------
browser details  YourSeq  2228   1      2228  2228   100.0%    8     +       30966059  30968286  2228
which should be placed in "contig_21911" according to GenomeBrowser (I
do not know the exact position).
All the best,
Marco
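
To look up a contig at NCBI or in a conversion table, the accession first has to be pulled out of each BLAST hit line. The sketch below assumes the 11 whitespace-separated fields listed in the message above (subject id first, no query id column) and a subject id of the form gi|...|ref|ACCESSION|...; the `BlastHit` type and `parse_hit` function are hypothetical names.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    """Hypothetical container for the fields needed for coordinate lookup."""
    accession: str
    identity: float  # % identity reported by BLAST
    length: int      # alignment length
    s_start: int     # 1-based start on the subject (contig)
    s_end: int       # 1-based end on the subject (contig)


def parse_hit(line: str) -> BlastHit:
    """Parse one tabular BLAST hit line with the 11 fields shown above."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    subject = fields[0]
    parts = subject.split("|")
    # Take the token after "ref" if present, else keep the whole subject id.
    if "ref" in parts and parts.index("ref") + 1 < len(parts):
        accession = parts[parts.index("ref") + 1]
    else:
        accession = subject
    return BlastHit(accession, float(fields[1]), int(fields[2]),
                    int(fields[7]), int(fields[8]))


if __name__ == "__main__":
    line = ("gi|194214692|ref|NW_001867430.1|Eca8_WGA28_2 98.25 228 4 "
            "0 1 228 835135 835362 2e-109 399")
    print(parse_hit(line))
```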