Digest for genome@soe.ucsc.edu - 10 updates in 10 topics

g***@soe.ucsc.edu

2014-09-11 17:06:29 UTC

=============================================================================
Today's topic summary
=============================================================================

Group: ***@soe.ucsc.edu
Url:
https://groups.google.com/a/soe.ucsc.edu/forum/?utm_source=digest&utm_medium=email/#!forum/genome/topics

- GENCODE gtf [1 Update]
http://groups.google.com/a/soe.ucsc.edu/group/genome/t/cc7fa5fe83720933
- Table browser looking up sequences by positions [1 Update]
http://groups.google.com/a/soe.ucsc.edu/group/genome/t/cb3113b186616b0
- Genome browser and exon data conventions [1 Update]
http://groups.google.com/a/soe.ucsc.edu/group/genome/t/a20686ead53bb3aa
- should be simple in table browser [1 Update]
http://groups.google.com/a/soe.ucsc.edu/group/genome/t/321f65b11d9c4794
- BAM file [1 Update]
http://groups.google.com/a/soe.ucsc.edu/group/genome/t/b0cc1c6f8abbc1d9
- Genome Build #37 [1 Update]
http://groups.google.com/a/soe.ucsc.edu/group/genome/t/28757a88d0c65bbe
- hCD44v6 sequence [1 Update]
http://groups.google.com/a/soe.ucsc.edu/group/genome/t/21cbf6f6e8556d77
- Please register PhyloCSF track hub [1 Update]
http://groups.google.com/a/soe.ucsc.edu/group/genome/t/d0f82a1d27a088fb
- hg38 with decoy sequences [1 Update]
http://groups.google.com/a/soe.ucsc.edu/group/genome/t/399dba1ed3e134d2
- cwc22 gene in the rat [1 Update]
http://groups.google.com/a/soe.ucsc.edu/group/genome/t/1d489434eef58a79

=============================================================================
Topic: GENCODE gtf
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/t/cc7fa5fe83720933
=============================================================================

---------- 1 of 1 ----------
From: "Steve Heitner" <***@soe.ucsc.edu>
Date: Sep 11 10:05AM -0700
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/msg/ef75841c038a8182

Hello, Ephraim.

GTF does not require records to be in any specific order, so it should not be
a problem to cat both files into a single file. The absence of gene symbols
from the GTF output is a limitation of the way the Table Browser creates GTF
output. If you would like the gene symbols to be a part of your GTF
files, it will require some scripting on your part. We cannot provide
advice on creating a script, but if you would like to use the Table Browser
to provide output that will equate transcript ID to gene symbol and RefSeq
ID for use in your script, you can follow these instructions:

For GENCODE:

1. As your output format, select "selected fields from primary and related
tables"

2. Click the "get output" button

3. In the "Select Fields from mm10.wgEncodeGencodeCompVM2" section, check
the "name" and "name2" checkboxes

4. Click the "get output" button

For UCSC Genes:

1. As your output format, select "selected fields from primary and related
tables"

2. Click the "get output" button

3. In the "Select Fields from mm10.knownGene" section, check the "name"
checkbox

4. In the "mm10.kgXref fields" section, check the "geneSymbol" and "refseq"
checkboxes

5. Click the "get output" button
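
As a rough illustration of the kind of script the instructions above feed
into, here is a minimal Python sketch. It is not a UCSC tool. It assumes the
Table Browser output is tab-separated with the transcript ID in the first
column and the gene symbol in the second (name and name2 for GENCODE, name
and geneSymbol for UCSC Genes), and that header lines start with "#". The
function names and the sample symbol "GeneA" are hypothetical.

```python
import re

# Matches the transcript_id attribute found in Table Browser GTF lines.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def load_symbol_map(lines):
    """Build {transcript ID: gene symbol} from tab-separated lines.

    Assumes column 1 is the transcript ID and column 2 is the gene symbol;
    any further columns (for example a RefSeq ID) are ignored here.
    """
    mapping = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header row and blank lines
        fields = line.rstrip("\n").split("\t")
        mapping[fields[0]] = fields[1]
    return mapping


def add_gene_name(gtf_line, mapping):
    """Append a gene_name attribute when the transcript ID is in the mapping.

    Lines without a known transcript ID are returned unchanged. For matched
    lines, trailing whitespace (including the newline) is stripped first.
    """
    match = TRANSCRIPT_ID.search(gtf_line)
    if match is None or match.group(1) not in mapping:
        return gtf_line
    symbol = mapping[match.group(1)]
    return f'{gtf_line.rstrip()} gene_name "{symbol}";'


if __name__ == "__main__":
    table = ["#name\tgeneSymbol\trefseq\n", "uc007aet.1\tGeneA\t\n"]
    gtf = 'chr1\tsrc\texon\t1\t10\t.\t-\t.\tgene_id "uc007aet.1"; transcript_id "uc007aet.1";'
    print(add_gene_name(gtf, load_symbol_map(table)))
```

The same mapping could carry the RefSeq ID as well by storing the third
column, if that is wanted in the merged file.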

Please contact us again at ***@soe.ucsc.edu if you have any further
questions. Questions sent to that address will be archived in a
publicly-accessible forum for the benefit of other users. If your question
contains sensitive data, you may send it instead to genome-***@soe.ucsc.edu.

---
Steve Heitner
UCSC Genome Bioinformatics Group

From: Trakhtenberg, Feliks
[mailto:***@childrens.harvard.edu]
Sent: Sunday, September 07, 2014 1:11 PM
To: Jonathan Casper
Cc: ***@soe.ucsc.edu
Subject: RE: [genome] GENCODE gtf

Hello,

Thank you for the advice. My goal is to predict novel genes/transcripts. I
would like to compile a comprehensive mouse GTF, so that it does not turn
out that the novel transcripts I find in my RNAseq have already been
predicted in some major database. So, I thought that merging Gencode and
UCSC Genes would provide such a comprehensive set. Please let me know if this
is insufficient.

Using the intersection tool you recommended below, even with no overlap
selection, there are about 8k UCSC Gene transcripts not in the Gencode. Does
the Table Browser have an option for merging these entries with the Gencode
GTF? If not, would this command "cat out.gtf0[0-1] > merged.gtf" produce a
GTF that is compatible with the Table Browser?

The UCSC Gene GTF produced by the Table Browser reports gene and transcript
IDs like this: gene_id "uc007aet.1"; transcript_id "uc007aet.1". However, it
does not add to the entry the original database (e.g., RefSeq) accession nor
gene name. The Gencode GTF from the Table Browser is also missing the gene names.
How could I have the original database IDs and the gene names included in
the UCSC Gene GTF produced by the Table Browser, and the gene names included
in the Gencode GTF from the Table Browser?

Thanks,
Ephraim

_____

From: Jonathan Casper [***@soe.ucsc.edu]
Sent: Thursday, August 14, 2014 9:10 PM
To: Trakhtenberg, Feliks
Cc: ***@soe.ucsc.edu
Subject: Re: [genome] GENCODE gtf

Hello Ephraim,

Our engineers comment that it is difficult to advise you on how to combine
gene sets without knowing what you're trying to accomplish specifically.
Different gene sets use different predictive models, making it hard to
combine them in a scientifically meaningful way.

That said, you can use the UCSC Table Browser intersection tool to get a
list of entries found in UCSC Genes but not in GENCODE.

1. Open the UCSC Table Browser at http://genome.ucsc.edu/cgi-bin/hgTables
2. Use the following settings

clade: Mammal
genome: Mouse
assembly: Dec. 2011 (GRCm38/mm10)
group: Genes and Gene Predictions
track: UCSC Genes
table: knownGene
region: genome

3. Click the "intersection: create" button
4. On the "Intersect with UCSC Genes" page, set the following options:

group: Genes and Gene Predictions
track: GENCODE Genes VM2 (or V3, after it is released)
table: Basic (wgEncodeGencodeBasicVM2)

If you decide after reading the GENCODE track page that the Comprehensive
table would be more useful to you, that is also an option.

5. Choose to return "All UCSC Genes records that have no overlap with
GENCODE Genes VM2"

Note that the "no overlap" requirement here is fairly strict. You may wish
to instead restrict to UCSC Genes records with no more than 50% overlap, for
example, depending on your needs.

6. Click "submit" to return to the main Table Browser page

Note that the output format has been changed to BED. You can leave it
that way or change it to GTF output. Just remember that the GTF output of the
UCSC Table Browser will not exactly match the format of your GENCODE GTF
file.

7. Click "get output"
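
For readers who would rather apply a threshold such as the 50% overlap
mentioned in step 5 to downloaded BED records, the following is a minimal
Python sketch. It is not how the Table Browser computes overlap: it compares
whole record spans (not exon blocks), ignores strand, and uses hypothetical
function names and sample coordinates.

```python
def covered_fraction(start, end, others):
    """Fraction of [start, end) covered by the union of intervals in others.

    All intervals are 0-based, half-open (start, end) pairs, as in BED.
    """
    # Keep only intervals that overlap the record, clipped to its span.
    clipped = sorted(
        (max(start, s), min(end, e)) for s, e in others if s < end and e > start
    )
    covered = 0
    cursor = start  # everything left of cursor has already been counted
    for s, e in clipped:
        s = max(s, cursor)
        if e > s:
            covered += e - s
            cursor = e
    return covered / (end - start)


def filter_by_overlap(records, reference, max_fraction=0.5):
    """Keep records whose covered fraction does not exceed max_fraction.

    records: list of (chrom, start, end, name) tuples
    reference: dict mapping chrom to a list of (start, end) pairs
    """
    kept = []
    for chrom, start, end, name in records:
        fraction = covered_fraction(start, end, reference.get(chrom, []))
        if fraction <= max_fraction:
            kept.append((chrom, start, end, name))
    return kept


if __name__ == "__main__":
    ucsc = [("chr1", 0, 100, "a"), ("chr1", 200, 300, "b")]
    gencode = {"chr1": [(0, 60)]}
    print(filter_by_overlap(ucsc, gencode))
```

Setting max_fraction to 0 reproduces a strict "no overlap" filter on spans.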

We also have command line tools that will perform this kind of operation,
but they are not designed to work with files in GTF. If you would like to
explore this alternative, the relevant programs are called "featureBits" and
"overlapSelect". They are available as part of the kent utilities on our
download server at http://hgdownload.soe.ucsc.edu. We provide precompiled binaries for
these utilities at http://hgdownload.soe.ucsc.edu/admin/exe/, but only for a
few computer architectures. You may need to download the source code and
compile these tools yourself if your computer is not listed there. You can
run each program by itself on a command line with no arguments to see a
description of how to use it.

As for your other question, RefSeq is a curated set of transcripts drawn
from GenBank. As with GenBank, it is quite possible that some RefSeq
transcripts are not represented in GENCODE.

I hope this is helpful. If you have any further questions, please reply to
***@soe.ucsc.edu or genome-***@soe.ucsc.edu. Questions sent to those
addresses will be archived in publicly-accessible forums for the benefit of
other users. If your question contains sensitive data, you may send it
instead to genome-***@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

On Tue, Aug 12, 2014 at 11:26 AM, Trakhtenberg, Feliks
<***@childrens.harvard.edu> wrote:

Hello,

Regarding your answer in point 4 below, is it possible to identify which
UCSC Genes track transcripts from GenBank are not found in Ensembl and
GENCODEv3? I would like to add them to the GENCODE gtf but do not want
redundancies.

What about Refseq transcripts - might there also be some that are included
in the UCSC Genes track but not in GENCODEv3, similar to how you explained
about the GenBank transcripts?

Thank you,

Ephraim

_____

From: Steve Heitner [***@soe.ucsc.edu]
Sent: Monday, August 11, 2014 5:46 PM
To: Trakhtenberg, Feliks; ***@soe.ucsc.edu
Subject: RE: [genome] GENCODE gtf

Hello, Ephraim.

To address all of your questions:

1. We recommend that you get the GTF files from GENCODE
(http://www.gencodegenes.org). The Table Browser generates
least-common-denominator GTFs for many tracks, and its output will not
contain all of the information available in the official GENCODE GTFs.

2. The GENCODE mouse V3 track will hopefully be available this month
(August 2014).

3. For information regarding the different GENCODE subtracks available at
UCSC, I recommend reading through the description page at
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=mm10&g=wgEncodeGencodeVM2.

4. Concerning whether or not the GENCODE track contains everything
contained in the UCSC Genes track, I don't believe this can be answered
definitively. The UCSC Genes track is based on GenBank while the GENCODE
track is based on Ensembl. Because these are constructed using completely
different methods, you will find in many cases that GenBank contains items
that Ensembl does not and vice versa.

Please contact us again at ***@soe.ucsc.edu if you have any further
questions. All messages sent to that address are archived on a
publicly-accessible Google Groups forum. If your question includes
sensitive data, you may send it instead to genome-***@soe.ucsc.edu.

---
Steve Heitner
UCSC Genome Bioinformatics Group

From: Trakhtenberg, Feliks
[mailto:***@childrens.harvard.edu]
Sent: Sunday, August 10, 2014 3:12 PM
To: ***@soe.ucsc.edu
Subject: [genome] GENCODE gtf

Hello,

I would appreciate it if someone could explain why the GENCODE gtf generated
through the Table Browser is lacking gene, transcript, UTR, and
Selenocysteine rows, which are present in the original GENCODE file. I plan
to use this gtf for Tophat/Cufflinks RNA-seq analysis and just wanted to
make sure I am using the right file.

When will the GENCODE mouse V3 be available through the Table Browser?

Does the table option called Comprehensive have most of the GENCODE
transcripts, including those that are only predicted? Or do other GENCODE
tables, such as pseudogenes, have additional transcripts?

Is everything that is in the UCSC Gene table also included in the
Comprehensive GENCODE table?

Thank you

Ephraim Trakhtenberg, PhD

=============================================================================
Topic: Table browser looking up sequences by positions
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/t/cb3113b186616b0
=============================================================================

---------- 1 of 1 ----------
From: Kevin Lopez <***@virginia.edu>
Date: Sep 11 12:43PM -0400
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/msg/a0a4e7e474e29e77

Hello,
I am trying to retrieve the nucleotide sequence by inputting the positions
of the chromosome. I know one way to do this is to add the positions on
genome browser, click view and then click Dna. However, this method only
works to find the nucleotide sequences for positions one at a time. I have
a list of 10,000 positions. I was wondering if there was a way to find the
nucleotide sequences of all these positions all at once. I tried using
table browser, but it gives me back a region larger than I want.
Thank you.
Best,
Kevin Lopez
University of Virginia Researcher

=============================================================================
Topic: Genome browser and exon data conventions
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/t/a20686ead53bb3aa
=============================================================================

---------- 1 of 1 ----------
From: "Turgay Aytac" <***@scidesktop.org>
Date: Sep 11 11:40AM -0400
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/msg/b5c00a710631bf9c

Hi,

I am a newcomer to genomic research (though experienced in data-related fields)
and I would like to verify a couple of things about the gene/exon data I
have been downloading for research purposes:

I downloaded all HG38 gene/transcript data and as far as I understood from
the data structure information on the web site, exon starts and ends are 0
based and the end point is not inclusive. Is that really the case? Because
when I review one through the genome browser (and check against the
chromosome reference), I see that starts and ends are marked on exactly the
same bases denoted by the gene-transcript table I downloaded and correspond
to the +1 bases on the reference chromosome when we consider the first one
as 0th base. Additionally, exon ends seem to be inclusive.

Example:

HG38

NBPF1 on chromosome 1 with 29 exons

Exon #12 starts at 16,577,316 and ends at 16,577,419

These numbers seem to be +1 based when compared to the chromosome sequence
and genome browser displays the exon on the exact same locations.

Another observation about AA sequence displayed below the genomic sequence:
I could not verify the corresponding AA when I track the sequence codon by
codon matching the displayed AA (right below the sequence of bases as far as
I understood). Example: On the same exon, locations 16577387-16577389
displays codon AGA and the corresponding AA seems to be S (??).
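
One possible reading of both observations, offered as a sketch rather than a
confirmed answer: UCSC tables store coordinates as 0-based, half-open
intervals, while the Genome Browser displays them as 1-based, inclusive
positions, so a displayed start is the stored start plus one and the end is
unchanged. Separately, if the transcript is annotated on the minus strand
(check the strand column of the table; the strand is assumed here, not
verified), codons are read from the reverse complement of the displayed
bases. The stored start value 16577315 and the function names below are
assumptions for illustration.

```python
def table_to_browser(start0, end0):
    """Convert a 0-based, half-open interval to 1-based, inclusive positions."""
    return start0 + 1, end0


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


# Only the two codons needed for this illustration (standard genetic code).
CODONS = {"AGA": "R", "TCT": "S"}

if __name__ == "__main__":
    # Assumed stored values for the exon shown as 16,577,316-16,577,419.
    print(table_to_browser(16577315, 16577419))
    # On the plus strand AGA codes for R; read on the minus strand it is TCT.
    codon = reverse_complement("AGA")
    print(codon, CODONS[codon])
```

Under these assumptions the exon length is simply end minus start in table
coordinates (104 bases here), and AGA on the displayed strand corresponds to
TCT, which codes for S, on the opposite strand.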

Thanks in advance.

Sincerely,

Turgay Aytac

=============================================================================
Topic: should be simple in table browser
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/t/321f65b11d9c4794
=============================================================================

---------- 1 of 1 ----------
From: "LaFramboise, William A" <***@upmc.edu>
Date: Sep 11 02:54PM
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/msg/2c50230e17437b51

Pardon my abilities, but I am trying to submit a list of genomic positions (I have many) like the following:

chr1 987462 987463
chr1 1149055 1149056
chr1 1196236 1196237
chr1 1234068 1234069
chr1 1263077 1263078
chr1 1288539 1288540

and get back the gene names and/or symbols encoded for each region.

Following your directions in the Table Browser and "help" list from previous questions for submission of a list of positions I used the following settings:

Mammal
Human
Feb. 2009 (hg19)
Genes and Gene Predictions
UCSC Genes
knownGene

output format: all fields from selected table

I get an output of

#name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID

However, I cannot get gene names, nor can I relate this output back to my input file to interrogate with the protein IDs. May I trouble you for insight as to the settings that would reveal the gene names associated with these positions?

Many thanks,

Bill

=============================================================================
Topic: BAM file
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/t/b0cc1c6f8abbc1d9
=============================================================================

---------- 1 of 1 ----------
From: Jonathan Casper <***@soe.ucsc.edu>
Date: Sep 10 03:25PM -0700
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/msg/7c4d1b42a6af2a83

Hello Luca,

Thank you for your question about loading BAM files as custom tracks. I am
sorry to hear that your institution will not provide you with hosting space
for your data. I hope you can work out another solution. It is your choice
how to upload your data - all 10 GB or only part of it. The UCSC Genome
Browser will only have access to the data that you upload. The important
requirement is that whatever you upload, it must be a sorted, indexed BAM
file, and the index file must also be in the same directory. More
information about how to sort and index BAM files is available on the BAM
data format page at http://genome.ucsc.edu/goldenPath/help/bam.html.

For online offers of web hosting space, other members of the community may
have suggestions about what worked for them. It is important that whichever
offer you choose, it must be able to support byte range requests. That is
how the UCSC Genome Browser is able to display your data without having to
first download all 10 GB. More information about byte range requests is
available in the answers to the last two questions of our custom tracks
help page at http://genome.ucsc.edu/goldenPath/help/customTrack.html.
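
To make the byte range idea concrete, here is a minimal Python sketch using
only the standard library. It builds an HTTP request with a Range header and
treats a 206 (Partial Content) reply as evidence that the host honors such
requests. This is an illustration, not the check the Genome Browser itself
performs; the function names and the example URL are hypothetical.

```python
import urllib.request


def build_range_request(url, first_byte, last_byte):
    """Build (but do not send) a GET request for an inclusive byte range."""
    headers = {"Range": f"bytes={first_byte}-{last_byte}"}
    return urllib.request.Request(url, headers=headers)


def supports_byte_ranges(url, timeout=10):
    """Return True if the server answers a small range request with 206.

    A server that ignores the Range header typically replies 200 and sends
    the whole file. Network or HTTP errors propagate to the caller.
    """
    request = build_range_request(url, 0, 1023)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 206


if __name__ == "__main__":
    # Hypothetical URL; nothing is sent until supports_byte_ranges is called.
    req = build_range_request("http://example.org/sample.bam", 0, 1023)
    print(req.get_header("Range"))
```

Running supports_byte_ranges against the URL of an uploaded BAM file is one
way to vet a hosting offer before creating the custom track.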

I hope this is helpful. If you have any further questions, please reply to
***@soe.ucsc.edu or genome-***@soe.ucsc.edu. Questions sent to those
addresses will be archived in publicly-accessible forums for the benefit of
other users. If your question contains sensitive data, you may send it
instead to genome-***@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

On Tue, Sep 9, 2014 at 3:53 PM, Luca Cavallone <***@gmail.com>
wrote:

=============================================================================
Topic: Genome Build #37
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/t/28757a88d0c65bbe
=============================================================================

---------- 1 of 1 ----------
From: Jonathan Casper <***@soe.ucsc.edu>
Date: Sep 10 02:46PM -0700
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/msg/c5c195287ca53a72

Hello Michael,

Thank you for your question about downloading sequence data from the UCSC
Genome Browser. Are you including introns in your output selection with the
Table Browser? If so, I'm not surprised that the query times out. With
introns, that winds up being a massive amount of sequence data that spans
nearly half the genome, with frequent repetition. Unfortunately there is
also no way to perform this query using the public mysql server if you want
to include intron sequence. You will most likely need to download the
genomic sequence file and the gene set from our download server at
http://hgdownload.cse.ucsc.edu/downloads.html#human (see the links for
"Full dataset" and "Annotation database" under the GRCh37 heading). Then
you will need to find or write your own tool to do the upper/lower case
assignment, as there are no appropriate tools in the kent repository.
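
A minimal Python sketch of such a case-assignment tool follows. It assumes
the convention of upper case for exons and lower case for introns, that the
sequence passed in starts at the transcript start, and that exon coordinates
are 0-based, half-open chromosome positions as stored in tables such as
knownGene. Strand is not handled, and the function name is hypothetical.

```python
def case_by_exon(sequence, tx_start, exons):
    """Uppercase exon bases and lowercase everything else.

    sequence: genomic sequence of the transcript, beginning at tx_start
    tx_start: 0-based chromosome position of the first base in sequence
    exons: (start, end) pairs in 0-based, half-open chromosome coordinates
    """
    bases = list(sequence.lower())
    for start, end in exons:
        # Shift chromosome coordinates to offsets within the sequence.
        for i in range(start - tx_start, end - tx_start):
            bases[i] = bases[i].upper()
    return "".join(bases)


if __name__ == "__main__":
    # Two exons of 3 bases each around a 4-base intron.
    print(case_by_exon("ACGTACGTAC", 100, [(100, 103), (107, 110)]))
```

For minus-strand genes the result would still need to be reverse
complemented if transcript orientation is wanted.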

If you do not need intron sequence, I suggest that you re-run your query on
the UCSC Table Browser and make sure that you uncheck the "Introns" box in
the "Sequence Retrieval Region Options:" section. The resulting file is
only around 224 MB.

Note also that there are a number of gene sets available for the GRCh37
assembly; you will have to determine which one is most appropriate for your
needs. Among the options are UCSC Genes (in the file knownGene.txt.gz),
GENCODE Genes Basic V19 (in the file wgEncodeGencodeBasicV19.txt.gz), and
RefSeq Genes (in the file refGene.txt.gz). More information about these
tracks can be found by clicking on the track name on our main browser page
at http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19, then scrolling down to
read the track description.

I hope this is helpful. If you have any further questions, please reply to
***@soe.ucsc.edu or genome-***@soe.ucsc.edu. Questions sent to those
addresses will be archived in publicly-accessible forums for the benefit of
other users. If your question contains sensitive data, you may send it
instead to genome-***@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

On Tue, Sep 9, 2014 at 11:34 AM, Michael Johnson <

=============================================================================
Topic: hCD44v6 sequence
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/t/21cbf6f6e8556d77
=============================================================================

---------- 1 of 1 ----------
From: Luvina Guruvadoo <***@soe.ucsc.edu>
Date: Sep 10 11:09AM -0700
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/msg/7adb108f835794c3

Hello Anna,

May I ask what are the exact steps you are taking to download the sequence?

If you have any further questions, please reply to ***@soe.ucsc.edu. All
messages sent to that address are archived on a publicly-accessible forum.
If your question includes sensitive data, you may send it instead to
genome-***@soe.ucsc.edu.

- - -
Luvina Guruvadoo
UCSC Genome Bioinformatics Group

On Wed, Sep 10, 2014 at 6:21 AM, Anna Stornaiuolo <

=============================================================================
Topic: Please register PhyloCSF track hub
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/t/d0f82a1d27a088fb
=============================================================================

---------- 1 of 1 ----------
From: Jonathan Casper <***@soe.ucsc.edu>
Date: Sep 10 10:42AM -0700
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/msg/918b6b34693e6143

Hello Irwin,

Thank you for making this data available! I have passed your track hub to
our engineering team for review and inclusion in the UCSC Genome Browser
list of public track hubs. If you haven't already, you may also find it
helpful to read our list of track hub guidelines at
http://genomewiki.ucsc.edu/index.php/Public_Hub_Guidelines. It contains a
list of common suggestions and requests that we make of hub providers
before adding their data to the list. My initial impression, however, is
that your hub is already well organized and looks great.

I hope this is helpful. If you have any further questions, please reply to
***@soe.ucsc.edu or genome-***@soe.ucsc.edu. Questions sent to those
addresses will be archived in publicly-accessible forums for the benefit of
other users. If your question contains sensitive data, you may send it
instead to genome-***@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

On Mon, Sep 8, 2014 at 5:53 PM, Irwin Jungreis <***@csail.mit.edu>
wrote:

=============================================================================
Topic: hg38 with decoy sequences
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/t/399dba1ed3e134d2
=============================================================================

---------- 1 of 1 ----------
From: Jonathan Casper <***@soe.ucsc.edu>
Date: Sep 10 10:30AM -0700
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/msg/1a65bc1a8863d7ad

Hello Albert,

Thank you for your question about the hg38 human genome assembly. Many
details about this assembly are available on our gateway page at
http://genome.ucsc.edu/cgi-bin/hgGateway when you select the human/hg38
assembly. In particular, this assembly does include two versions of an
analysis set that contains Epstein-Barr virus sequence as a decoy. More
information is available in NCBI's README file for the analysis set at
ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/README_ANALYSIS_SETS
and in the parent README for the hg38 assembly at
ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/README
.

You may also be interested in NCBI's Insights site, which gives more
information about NCBI resources and the work underlying them. You can find
the posts tagged "GRCh38" at
http://ncbiinsights.ncbi.nlm.nih.gov/tag/grch38/.

I hope this is helpful. If you have any further questions, please reply to
***@soe.ucsc.edu or genome-***@soe.ucsc.edu. Questions sent to those
addresses will be archived in publicly-accessible forums for the benefit of
other users. If your question contains sensitive data, you may send it
instead to genome-***@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

=============================================================================
Topic: cwc22 gene in the rat
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/t/1d489434eef58a79
=============================================================================

---------- 1 of 1 ----------
From: Jonathan Casper <***@soe.ucsc.edu>
Date: Sep 10 10:17AM -0700
Url: http://groups.google.com/a/soe.ucsc.edu/group/genome/msg/5d1d81affffa05ee

Hello Sha,

Thank you for your question about finding genes in the rat genome. Where
are you searching for the CWC22 gene? It appears in different places
depending on which genome assembly you are looking at. In the rn4 genome
assembly, you can find CWC22 in the RGD Genes track. In the rn5 genome
assembly, it is part of the Ensembl Genes track (RGD Genes is not provided
for the rn5 assembly). rn6 is still quite a new assembly, and does not have
much annotation available yet. There are no gene annotations for CWC22 in
the rn6 assembly. We do intend to add more gene annotation to the rn6
assembly in the future, and a gene track that contains CWC22 will likely be
part of that.

I hope this is helpful. If you have any further questions, please reply to
***@soe.ucsc.edu or genome-***@soe.ucsc.edu. Questions sent to those
addresses will be archived in publicly-accessible forums for the benefit of
other users. If your question contains sensitive data, you may send it
instead to genome-***@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

--
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page:
https://groups.google.com/a/soe.ucsc.edu/forum/?utm_source=digest&utm_medium=email/#!forum/genome/join
.
To unsubscribe from this group and stop receiving emails from it send an email to genome+***@soe.ucsc.edu.
