Agenda for the 22nd Annual International Sequence Database Collaborative Meeting
Tuesday May 12, 2009 - Wednesday May 13, 2009
Tuesday May 12, 9:00-12:00 (break 10:30-10:45)
Chair: Ilene Mizrachi Minutes: EBI
Welcome and Reports
EBI
DDBJ
GenBank
Follow‐up Items
1.
Use of institute codes in structured specimen vouchers (EBI)
We would like an update on the use of institute codes in structured specimen vouchers. We find that we spend a lot of time verifying vouchers for correct institute code, collection codes and voucher identifiers. It would be helpful if the complete code table could be made publicly available. Are there any plans to do this?
DISCUSSION
Bob started by explaining that in his opinion submitters do not understand how to use institute codes. Submitters often use their own abbreviations, which may not match, or may conflict with, existing abbreviations. Bob said that this problem would be alleviated by making the list of institutions publicly available and searchable. He also asked whether other databases are experiencing problems with institution codes and how they are solving them. Scott explained that submitters often try to provide collection codes, even though this is optional, and that the collection codes are in many cases wrong. NCBI has considered sending automatic reports back to submitters containing NCBI's interpretations of the submitted institute codes. Scott said that NCBI is happy to make the list of institute codes public. NCBI already has an internal search page, but no external search page.
Redundancies of institute codes were briefly discussed. The redundancies in the list reflect the historic use of the institute codes prior to the creation of the current format list. NCBI considers their
submission they have processed.
CONCLUSION
NCBI agreed to publish the institute codes through Entrez. There will be a link to this service from the INSDC web site.
TIMELINES
NCBI will provide an FTP dump this week. The Entrez search will be ready in 6 months.
CONTACTS
2.
TSA standards (EBI)
Rules on the acceptance of TSA submissions and standard of sequence quality were discussed and agreed at collab. 2008 and in subsequent e‐mails:
1. TSA submissions must be registered with an INSDC project id.
2. The submitter must own the primary sequence data.
3. Primary sequence data must reside in EST, trace or SRA and be publicly available.
4. The submitter must provide instructions for assemblies.
5. The TSA entry must have at least 1x coverage by primary sequence at each base.
6. Regions of a TSA record can be assembled from a single EST or read so that coverage is only 1x.
7. Limits to ambiguity in the assembled TSA record should be:
• the allowable percent of bases that are 'n' should be less than 5%
• TSA record can have a stretch of no more than 5 n's in a row
8. The DE line must start with 'TSA:' and mandatory keywords, "TSA" and "Transcriptome Shotgun Assembly", must be present.
We propose in addition further rules governing acceptable data quality; particularly the percentage of non‐'n' ambiguous bases (m, w, r, y, k, s, v, h, d, b <= 5%) and the mandatory inclusion of at least one feature annotation to improve on and add value to this resource.
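The ambiguity limits in rule 7, together with the proposed limit on non‐'n' ambiguous bases, could be checked mechanically. The following is a minimal sketch in Python, not an existing INSDC validator: the function name, the messages and the choice to report problems as a list are assumptions made for illustration. Only the thresholds (5% and a run of 5) come from the text.

```python
import re

# Ambiguity codes other than 'n', as listed in the proposal.
NON_N_AMBIGUOUS = set("mwryksvhdb")


def check_tsa_ambiguity(seq, max_fraction=0.05, max_n_run=5):
    """Return a list of problems found in a TSA sequence (hypothetical helper)."""
    seq = seq.lower()
    if not seq:
        return ["empty sequence"]
    problems = []

    # Rule 7, first bullet: 'n' bases should be less than 5% of the record.
    n_fraction = seq.count("n") / len(seq)
    if n_fraction >= max_fraction:
        problems.append(f"{n_fraction:.1%} of bases are 'n'")

    # Rule 7, second bullet: no stretch of more than 5 n's in a row.
    longest_run = max((len(m.group()) for m in re.finditer(r"n+", seq)), default=0)
    if longest_run > max_n_run:
        problems.append(f"run of {longest_run} consecutive n's")

    # Proposed addition: non-'n' ambiguous bases limited to 5%.
    other = sum(1 for base in seq if base in NON_N_AMBIGUOUS)
    if other / len(seq) > max_fraction:
        problems.append(f"{other / len(seq):.1%} of bases are non-'n' ambiguity codes")

    return problems
```

A caller could choose to turn the returned problems into warnings rather than rejections.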
DISCUSSION
they are either submitted into the HTC division (low quality sequences) or into the correct taxonomic division. A discussion followed about ambiguous bases in TSA entries. NCBI said that because TSA are computer generated, any rules for restricting ambiguous bases would pose constraints on the underlying sequences. Guy pointed out that there are two sources of ambiguity: ambiguity in the source sequences and ambiguity arising from disagreements in alignments. In practice, these two cases are very difficult to differentiate from each other. NCBI has not seen any cases of ambiguity in the TSA sequences and proposed that the topic be discussed only after ambiguity is observed in TSA entries for the first time.
Susan Schafer expressed her opinion that annotation in TSA entries does not in many cases add any value to the sequences (e.g. for transcriptome assembly) but instead would just add to the badly annotated entries in the archive. The collaboration discussed the merits of unannotated TSA vs. mapped short read sequences and noted that storing mapping files could be more cost effective than storing the TSA sequences. However, submitters want to make the transcripts searchable, and currently they can do this only by submitting assembled transcripts into the TSA division, as search against mapping files is not currently possible. NCBI suggested that in the future, possibly on a one‐year timescale, the analysis object might be used as a better alternative for TSA sequences.
Ruth suggested that the collaboration could consider removing the ASSEMBLY/AS lines. The motive behind the AS line removal proposal is the difficulty EBI is having with validating the ASSEMBLY/AS lines and in generating them for the submitters. Susan countered that the assembly lines are essential evidence for the assembly and as such should be preserved in the flat files. Guy pointed out that the AS/ASSEMBLY lines are already optional for TSA entries.
NCBI's strong position is that assemblies (including TSA) should be validated upon submission; in some cases NCBI provides the ASSEMBLY/AS line coordinates to the users, and in other cases, submitters provide the coordinates to NCBI. NCBI mentioned that they support assembly submissions in several different formats. Guy proposed that the archives would always build the assemblies for the submitters. Ilene replied that there is a large number of ways to build (different) assemblies, each of which could be valuable enough to be captured in the archives and which can't be easily regenerated by the archives. Guy expressed his worry about INSDC's possible inability to store assemblies being produced from the short read sequence experiments. This is highly relevant for TSA entries, because EST sequence submissions are being replaced by short read submissions into the short read archive, but their assemblies are being submitted into the TSA division. These short read submissions for ESTs are made possible by the ability to now search against these sequences using BLAST. NCBI reminded the collaboration about the existence of NCBI's trace assembly archive and that it might play an important role for INSDC assembly storage in the future. For short reads, the collaboration discussed that it may be infeasible to store assembly instructions for individual reads regardless of the storage strategy.
CONCLUSION
The collaboration will add a warning if TSA sequences contain >= 5% of ambiguous bases. The decision to reject TSA entries based on ambiguous codes will be deferred until such cases are detected in TSA entries. The AS/ASSEMBLY lines will continue to be used in TSA entries for EST assemblies.
TIMELINES
NCBI will activate the warning message by the end of the week. EBI will do the same within one month.
CONTACTS
3.
Project data exchange (EBI)
We were very encouraged to receive the new project XML dump as it provides a starting point for us to contribute curated content to this important dataset. We propose an exchange mechanism for project data based on DDBJ, EBI or NCBI ownership of project records that combines project identifier assignment and locus_tag prefix reservation from the NCBI webservice with INSDC flatfile‐like exchange for the remaining project data.
DISCUSSION
Guy proposed a flat file like exchange of project records, with project id and locus_tag registration being done by NCBI and all other project data being exchanged on a regular daily basis using FTP. Ilene outlined the two options available for us: either have a master database at NCBI and provide means for DDBJ/EBI to edit the content, or adopt a flat file like exchange of the project XMLs. EBI expressed a strong preference for the flat file exchange model. Karl preferred the model where DDBJ/EBI is given editorial access to the project data stored in NCBI. Guy stressed that because EBI is recruiting for a metagenomics curator position there is an immediate need for EBI to start editing project entries. A third option was discussed in which the XMLs are exchanged using web services rather than FTP. EBI expressed a strong preference for the FTP exchange model. According to Kousaku, DDBJ has a need to edit project XML records as well. When asked, NCBI confirmed that the authority for project ID mappings should be the INSDC entries, not the project XML entries. EBI reminded the participants about a previous collaboration meeting decision to include project IDs in the live lists. Mark agreed that this would be useful for checking that all project ID mappings in the flat files have been captured as part of the data exchange. The project relationship exchange mechanism was also briefly discussed. Rasko expressed a preference that this information be present in the project XML files themselves.
CONCLUSION
NCBI (Karl) will investigate timelines for enabling project XML. Project IDs will be added to the live lists.
TIMELINES
Timeline for finishing the investigation for enabling project XML exchange is 1 week. Timeline for adding project IDs into the live lists is 6 months.
CONTACTS
4.
Use of project ID for TSA and EST (EBI)
We wish to clarify the use of project IDs for TSA projects. One option would be to assign projects for both TSA and EST data, the former project being a parent of the latter. The alternative would be to assign a single TSA project for the entirety of the data.
DISCUSSION
NCBI has not been using parent‐child projects for TSA and underlying EST entries nor has NCBI been assigning project IDs to EST entries. However, NCBI is assigning project IDs to TSA entries.
CONCLUSION
NCBI will add project IDs to EST entries assembled as TSA entries.
TIMELINES
Timeline is 6 months for ESTs used in TSA entries to have project IDs.
5.
Add /db_xref to project (EBI)
In order to enrich the content of project records, EBI proposes that we adopt a /db_xref‐like system for the representation of cross‐references in project records. This would allow us to share a common registry of cross‐reference resources across sequence and project entries.
DISCUSSION
Both NCBI and DDBJ are fine with the proposal to add search tool neutral database cross‐references into project entries. The suggested simple format would be name + identifier.
CONCLUSION
Ilene will check if the project XML already has a structure to represent database cross‐references.
TIMELINES
TODO: I failed to minute a timeline decision.
CONTACTS
6.
Add /haplogroup as source qualifier
Haplotype is a combination of alleles at multiple loci that are transmitted together on the same chromosome. A haplogroup is assigned from a combination of haplotypes. Haplogroup is a group of similar haplotypes that share a common ancestor with a single nucleotide polymorphism mutation.
The majority of submitters of complete human mitochondrial genomes provide information about their haplogroup rather than their haplotype. Stable mtDNA polymorphic variants clustered together in specific combination form a haplogroup.
source 1..16570
/organism="Homo sapiens"
/organelle="mitochondrion"
/mol_type="genomic DNA"
/db_xref="taxon:9606"
/note="haplogroup: K1a1b2"
We would like to propose the addition of /haplogroup to the list of legal source qualifiers.
Tuesday May 12, 1:00-5:00 (break 3:00-3:15)
Chair: Yoshio Tateno Minutes: GenBank
7.
Proposal to add three new /exception cases (GenBank)
More and more genomes are being sequenced by large and small groups, and many of these groups are also annotating the genomes. Sometimes the submitter wants to annotate what we describe as
mutations. Our experience has been that in most cases the submitter does not have the resources to do more sequencing to address the potential sequence problems, and they want to annotate these problematic coding regions since the presence of the proteins is required for proteomic analyses.
We propose adding three new exceptions to use for three cases where the genomic sequence doesn't encode the "expected" protein. The last two of these will trigger prefacing the product name in our retrieval tools for GenPept & fasta definition lines with "LOW QUALITY PROTEIN:" as a warning to database users that the protein may have problems. The protein definition line will look like this:
DEFINITION LOW QUALITY PROTEIN: succinate dehydrogenase, C subunit [Campylobacter jejuni].
Here are the three cases:
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
[1] When the sequence is high quality but there is evidence (eg transcript or peptide evidence) that a particular protein is made. eg Drosophila annotation from FlyBase
/exception= "annotated by transcript or proteomic data"
We currently use the exception "reasons given in citation" for this case, but think that "annotated by transcript or proteomic data" is more accurate.
For this situation, the submitter provides the desired translation and includes this exception. Protein definition lines are not touched in this case.
Here is an example:
LOCUS AE014134 23011544 bp DNA linear INV 15‐JAN‐2009
DEFINITION Drosophila melanogaster chromosome 2L, complete sequence.
CDS complement(join(1792057..1792209,1792261..1793268))
/gene="Gr22b"
/locus_tag="Dmel_CG31931"
/gene_synonym="CG31931"
/gene_synonym="CR31931"
/gene_synonym="Dmel\CG31931"
/gene_synonym="Gr22bP"
/gene_synonym="Gr2940.4"
/note="CG31931 gene product from transcript CG31931‐RA; transition; CG31931‐PA; Gr22b‐PA; gustatory receptor 2940.4; translated product replaced"
/codon_start=1
/product="gustatory receptor 22b"
/protein_id="AAX52653.1"
/db_xref="GI:61678278"
/db_xref="FLYBASE:FBgn0045500"
/translation="MFGSSREIRPYLARQMLKTTLYGSWLLGIFPFTLDSGKRIRQLR
[2] When a heterogenous population was sequenced, eg tick (Ixodes) or the phase variation gene of a bacterial culture
/exception= "heterogenous population sequenced"
"LOW QUALITY PROTEIN:" will be at the beginning of protein definition lines
These proteins will be encoded by the underlying sequence, but there will be adjustments to the CDS to adjust for sequence problems. Therefore, submitters should use transl_except="X" at internal stop codons encoded by the sequence and a joined CDS (=fake introns) to adjust for indels, rather than importing a translation.
Here's an example where we propose using this exception:
LOCUS AL111168 1641481 bp DNA circular BCT 23‐OCT‐2008
DEFINITION Campylobacter jejuni subsp. jejuni NCTC 11168 complete genome.
CDS 1..1243
/locus_tag="Cj0031"
/coded_by="join(AL111168.1:46424..49000, AL111168.1:49002..50156)"
/inference="protein motif:Prosite:PS00092"
/note="Original (2000) note: Cj0031, probable DNA
restriction/modification enzyme, N‐terminal half, len: 867 <<stuff deleted>>
macromolecules ‐ DNA replication,restriction/modification, recombination and repair"
/product="putative type IIS restriction/modification enzyme"
/protein_id="CAL34212.1"
[3] When the sequence is of low‐quality, eg multiple bacterial strains are sequenced and compared to a reference strain
/exception= "low‐quality sequence region"
This is like case [2], and has the same prefacing of the protein definition line:
"LOW QUALITY PROTEIN:"
Here's an example from a new submission:
CDS join(8207..8824,8824..9066)
/note="Features C_000070007, C_000070008 joined to form C_000070129 to correct 454 sequencing error (C_000070007 identified by Glimmer3 with a raw score of 13.99, C_000070008 identified by Glimmer3 with a raw score of 5.19). Gene function extrapolated from CJJ81176_0465 (CP000538) after blastn alignment (98.84 percent identity over a 861 bp alignment)."
/product="succinate dehydrogenase, C subunit"
Discussion and Conclusion
It was agreed that only the first case, "annotated by transcript or proteomic data", should be a new /exception text. The timeline for implementation is October, but we would like to require that an /inference qualifier be used in conjunction with this /exception case. This inference would be the “similar to” type and would indicate the EST/cDNA/protein source of the translation. Karen will do a reality‐check to see if this is feasible by conferring with FlyBase to see if they would be able to meet this requirement (they are the group that we know is already in this situation).
Karen/Jun/Bob will correspond and write definitions for all of the /exception cases.
NCBI agreed to add the check for short introns (<10bp) to their validator (this has been done).
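A check for short introns of this kind can be sketched as follows. This is an illustration in Python and not NCBI's validator: the function name and return shape are hypothetical, and the sketch assumes that all segments of the location lie on the same strand of a single sequence, so that gaps can be measured after sorting the segments by start coordinate.

```python
import re

# Matches one "start..end" segment, tolerating partial markers such as "<3..716".
SEGMENT = re.compile(r"<?(\d+)\.\.>?(\d+)")


def short_introns(location, min_intron=10):
    """Return (previous_end, next_start, gap_length) for gaps shorter than min_intron."""
    segments = sorted((int(start), int(end)) for start, end in SEGMENT.findall(location))
    flagged = []
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        # Bases strictly between two segments; negative when the segments overlap.
        gap = next_start - prev_end - 1
        if gap < min_intron:
            flagged.append((prev_end, next_start, gap))
    return flagged
```

Applied to the location join(8207..8824,8824..9066) from case [3], the sketch reports a gap of -1, i.e. the two segments overlap by one base, which is the pattern used to adjust for a frameshift rather than a biological intron.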
After much discussion, the other two proposed /exception values were not approved.
However, a new qualifier, /artificial_location , was adopted, for use with coding regions that fall into classes [2] and [3] of the initial NCBI proposal.
The reasoning behind this choice is that the conceptual translation of the coding regions in those situations *does* match the presented /translation, and thus /exception isn't really appropriate.
However, the conceptual translation match is only possible because the location of the coding regions has been artificially adjusted. So a new qualifier, which directly indicates the unusual nature of the coding region locations, is a more direct way of handling this annotation problem.
It was agreed that internal stop codons would be handled with a /transl_except qualifier, and frameshifts would be handled via a joined CDS location.
The protein definition lines are outside the scope of Collab, so discussion about them was dropped.
The new qualifier and definition that we agreed upon is:
Qualifier /artificial_location
Definition indicates that location of the CDS or mRNA is modified to adjust for the presence of a frameshift or internal stop codon and not because of biological processing between the regions. This is expected to be used only for genome‐scale annotation, either because a heterogeneous population was sequenced, or because the feature is in a region of low‐quality sequence.
The earliest implementation timeframe for EBI was 6 months.
8.
CDS/pseudo VS pseudogene (DDBJ)
Pseudogene annotation and CDS/translation
Background:
a) Usages of /pseudo qualifier for CDS feature
* CDS/pseudo is applied to many cases, including non‐pseudogenes. As FT‐Doc defines, if a CDS feature has a /pseudo qualifier, it cannot have a /translation qualifier.
* Many users likely misunderstand that CDS/pseudo is equal to "pseudogene".
b) Changing the definition of pseudogene
* The "pseudogene" definition is not commonly shared in biological communities.
* Recent transcriptome results show that it is difficult to determine if a (pseudo)gene is really transcribed or not. So, the classical definition for pseudogene is out of date.
We should avoid using the term, "pseudo", as a qualifier name for CDS feature.
# It would be OK to use /pseudo qualifier for mRNA and other 'RNA' features.
DDBJ proposes to replace the /pseudo qualifier for CDS features with a new one, /disrupted.
If they assume and/or confirm that the CDS feature is pseudogene, submitters should denote it in the values of /gene, /product and/or /note.
At the meeting, we will briefly explain usages of CDS/pseudo, which can be classified into
1) pseudogenes, 2) bulk (genomic) annotation, 3) disrupted genes, or others
One possible solution is the expansion of /exception usage. This would resolve some confusing cases of CDS/pseudo, but some others would remain confusing.
Discussion and Conclusion
/pseudo is a confusing qualifier because it is often thought to mean that the gene/CDS is a pseudogene, even though the feature table definition is that the feature is a non‐functional version of the element named in the feature key. It was agreed that we would rename this qualifier /non_functional. This qualifier will become legal as of April 15, 2010 for exchange and releases.
9.
No joined CDS's on mRNA records unless exception is used (GenBank)
Several of the genome centers are submitting large amounts of mRNA data that are often not manually reviewed by the submitters or by the receiving database. This has led to the inclusion of joined CDS on mRNA sequences in order to compensate for reading frameshifts and internal stop codons in the conceptual translation of the coding region.
CDS join(<3..716,713..814)
/gene="DKFZp434I092"
/codon_start=1
/product="hypothetical protein"
/protein_id="CAB55986.2"
/db_xref="GI:50978452"
/db_xref="GOA:Q8NE09"
/db_xref="HGNC:24499"
/db_xref="InterPro:IPR000342"
/db_xref="UniProtKB/Swiss-Prot:Q8NE09"
We would like to propose that records with a /mol_type of mRNA cannot be released with a CDS with a joined location unless an appropriate /exception is used.
Discussion and Conclusion
An mRNA record should not have a joined CDS unless there is a biological reason, in which case it should be annotated with /ribosomal_slippage or /exception. It was agreed that after June, 2009, we would prohibit the loading of new entries that do not follow this rule. EBI will look at the feasibility of retrofitting the problematic data over the next year.
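The agreed rule can be expressed as a small predicate. This is a minimal sketch in Python under stated assumptions: the function name is hypothetical, qualifiers are assumed to be available as a dictionary keyed by qualifier name without the leading slash, and a joined location is recognised simply by the presence of the join operator.

```python
def joined_cds_allowed(mol_type, cds_location, qualifiers):
    """Return True if a CDS on this record passes the joined-CDS rule (hypothetical helper)."""
    # The rule only concerns mRNA records.
    if mol_type != "mRNA":
        return True
    # A CDS with a simple (unjoined) location is always acceptable.
    if "join(" not in cds_location:
        return True
    # A joined CDS on mRNA needs an explicit biological justification.
    return "ribosomal_slippage" in qualifiers or "exception" in qualifiers
```

For the DKFZp434I092 example above, which carries neither qualifier, the predicate returns False.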
10.
tRNA encoded by a circularly permuted gene (DDBJ)
Since 2005, we have used two qualifiers, /ribosomal_slippage and /trans_splicing, to indicate that the location of a feature is not joined in the typical way.
DDBJ accepted a rare case of exceptionally joined location, a circularly permuted tRNA.
The case is neither /ribosomal_slippage nor /trans_splicing.
Reference: Soma A et al. (2007). Science 318: 450‐453
"Permuted tRNA genes expressed via a circular RNA intermediate in Cyanidioschyzon merolae"
http://www.sciencemag.org/cgi/content/full/318/5849/450
AB304513
tRNA join(96..115,1..17,39..85)
/anticodon=(pos:14..16,aa:Leu)
/experiment="Northern blot analysis, RT‐PCR and sequencing analysis"
/note="This tRNA is encoded by a circularly permuted gene in which the 3'‐half of the tRNA lies upstream of the 5'‐half on the genome."
/product="tRNA‐Leu(TAA)"
Since more cases similar to this are likely in the future, we propose to merge the two qualifiers, /ribosomal_slippage and /trans_splicing, into a new one;
/unusual_join="<join type>"
Qualifier /unusual_join=
Value format "ribosomal slippage", "trans‐splicing", "permuted", "circular", "rearrangement required for product", "reasons given in citation"
Example /unusual_join="ribosomal slippage"
Comment should be used on features such as CDS, mRNA and others to indicate that a location is not orderly joined.
i.e. [/unusual_join="trans‐splicing"] should be used only when the splice event is indicated in the "join" operator,
e.g. join(complement(69611..69724), 139856..140087)
Discussion and Conclusion
Although the biology of these tRNAs is interesting, there are only 11 instances in the database. Rather than add a new qualifier, it was agreed that we would use the /trans_splicing qualifier and a note that says
/note="permuted splicing: unusual use of trans_splicing qualifier"
Over the next year, we will collect examples and re‐evaluate at the next collab meeting, if necessary.
11.
FT‐Doc: Meaning of the explanation, "Each qualifier should have a single value." (DDBJ)
In FT‐Doc, we find the following explanation;
> 3.3.2 Format and conventions
<snipped>
> Each qualifier should have a single value; if multiple values are necessary, these should be represented by iterating the same qualifier...
We would like to clarify the meaning of this explanation.
The rule a), "Each qualifier should have a single value" seems to be applied to 1) "<single token>", 2) well‐formatted, and 3) controlled qualifiers and not to 4) free‐text and 5) one without value.
1) <single token>; /gene, /gene_synonym, /locus_tag, /number, /old_locus_tag, /operon, etc.
2) well‐formatted; /anticodon, /bio_material, /citation, /codon, etc.
3) well‐controlled; /chromosome, /cultivar, /organism, /variety, etc.
5) no value; /environmental_sample, /focus, /germline, /proviral, /pseudo, etc.
However, the rule b), "if multiple values are necessary, these should be represented by iterating the same qualifier." is not applied to some qualifiers that can include only one value for a feature; /gene, /locus_tag, /strain, and so on.
Instead of multiple descriptions for /gene and /locus_tag, we can use /gene_synonym and /old_locus_tag, respectively. But in the case of /strain, we usually use descriptions like the following:
/strain="ATCC #### (= JCM ### = NBRC ###)"
# It should gradually be replaced by multiple /culture_collection qualifiers.
Should the section, "Format and conventions", be modified?
Also, should we clarify which qualifiers can be used two or more times for a feature?
Discussion and Conclusion
It was agreed that each database would generate their own lists of which qualifiers were singly‐occurring on a feature and then they would be compared and reconciled by email. Once this is complete, we will review which qualifiers can be multiply‐occurring on a feature. The timeline for reconciling singly‐occurring qualifiers is 6 months (October 2009). The contacts are: EMBL‐Rasko, DDBJ‐Jun, GenBank‐Mark
12.
BARCODE status update (GenBank)
The next phase of the barcoding project (iBOL) aims to generate barcodes for 5M specimens from 500K species over the next five years. In contrast to the last phase (which generated 500K barcodes from 50K species, only 5% of which are public) the next phase will operate under a genome‐project‐like data release policy. In most cases, there will be a lag between sequencing and taxonomic identification. We have agreed on a standard formula for informal names based on barcode bin clusters that will allow iBOL to operate under an immediate data release mode.
Discussion and Conclusion
Scott Federhen presented details about phase 1 of the Barcode project. There are about 450K
within 7 days of sequencing. These submissions will contain BIN taxonomic names which are based on clustering before they are confident with name assignment. EMBL indicated that these BIN names will only be for BOLD data and not the European data.
There is a plan to set up BOLD mirror sites in the Netherlands and in China. Each group will do their own sequencing but it is unclear how the data will be submitted to INSDC. EBI is interested in making sure that the European sequencing is deposited into EMBL to assure that the sequencing centers are adequately served. Many centers are frustrated with BOLD. EBI has proposed to the funding agency that they become the DCC for the European centers since EBI already captures the sequence in their database.
DDBJ indicated that there is no government funding for Barcoding in Japan. They are receiving some data via the GBIF framework. Sequencing groups claim that DDBJ isn’t “ready” to support voucher data, geographic data, etc.
GenBank indicated that they would be using the structured comment for any new/unsupported data elements. EBI indicated that we should not store ancillary data such as images. This was agreed to.
Wednesday May 13, 9:00-12:00 (break 3:00-3:15)
Chair: Bob Vaughan Minutes: DDBJ
13.
Strain equivalence (EMBL)
At present we represent multiple equivalent strain names in a (relatively) small number of entries. We have heard from users that this can cause problems with parsing the information in this field. At present EBI does not check that the listed equivalences are correct ‐ in some cases this is not possible, as one or more of the strain names is internal to the submitting lab. EBI proposes that we should no longer represent equivalent strain names, and require that submitters settle on a single strain name.
DISCUSSION and CONCLUSION:
When multiple equivalent strain names are described in a /strain qualifier, as in "ATCC #### (= LMG ###)", database users have problems parsing the information in this field. Also, there is no reliable way to know whether the listed equivalences are correct.
All agreed to avoid using "=", ";", or similar delimiters in values of /strain qualifiers.
To describe equivalent strain names, appropriate usage of the /note qualifier is recommended, as in the case of EF415556:
http://www.ncbi.nlm.nih.gov/nuccore/EF415556
/note="strain coidentify: LMG 23894 = FS‐8.1"
To indicate that a strain is "type strain", a recommended description is "type strain of [species (or lower taxon) name]" in /note qualifier.
/isolate qualifier should also be used for a representative name.
Related to this issue, retrofits are optional.
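As an illustration of the agreed convention, a /strain value that lists equivalents could be split into a single strain name plus a /note value in the style of the EF415556 example. This is a minimal sketch in Python: the function name is hypothetical, and the choice to keep the first listed name as the strain is an assumption, since in practice the submitter would settle on the name.

```python
import re


def split_strain_equivalents(value):
    """Return (strain, note) for a /strain value; note is None if nothing was split."""
    # Drop parentheses, then split on the delimiters that are to be avoided.
    flattened = value.replace("(", " ").replace(")", " ")
    names = [part.strip() for part in re.split(r"[=;]", flattened)]
    names = [name for name in names if name]
    if len(names) <= 1:
        # No equivalence list found: leave the value as submitted.
        return value.strip(), None
    # Assumption: the first listed name is kept as the strain name.
    note = "strain coidentify: " + " = ".join(names)
    return names[0], note
```

For example, "ATCC 1234 (= LMG 5678)" would yield the strain "ATCC 1234" and the note "strain coidentify: ATCC 1234 = LMG 5678".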
14.
Strain level taxonomy nodes (EMBL/GenBank)
In the past it has been standard practice for taxonomy to create strain‐level taxonomic nodes for bacteria which have completely sequenced genomes. This has resulted in a number of inconsistencies and problems with data retrieval. EBI proposes that these nodes should be removed, and that we retrofit to ensure that the organism and strain information are represented in the correct qualifiers for consistency.
DISCUSSION:
To allow the NCBI genome team and others to retrieve information quickly, it has been standard practice for the taxonomy database to create strain‐level taxonomy IDs for microorganisms which have completely sequenced genomes. This policy has caused a number of inconsistencies and problems with data retrieval.
Since project IDs have become available to index genomic data, we do not have to assign strain‐level taxonomic nodes for future genome sequences.
However, we should be careful about effects when we remove strain‐level taxonomic nodes for many microorganisms.
So, no retrofit would be made for the moment.
Effects to consider:
1) There are many strain‐level taxonomy IDs, perhaps several thousand. # In fact, they are around 2,000.
2) For users, the policy change would cause confusion.
3) Stricter curation of /strain qualifier values will be required.
ACTION:
Ask users likely to be affected (TrEMBL, UniProt or any other databases) about the effects.
TIMELINES:
3 months
FOLLOW‐UP:
From: "Scott Federhen, NCBI" <[email protected]>
To: NLM/NCBI List collab <[email protected]>,NLM/NCBI List taxonomy <[email protected]>,
RJ Vaughan <[email protected]>, Guy Cochrane <[email protected]>,Jun Mashima <[email protected]>
Subject: strain‐level taxids for genomes
Date: Fri, 22 May 2009 16:23:18 ‐0400
15.
Specimen voucher format for uniquing unclassified organisms (EMBL)
There have been conflicts in advice from NCBI taxonomy regarding the use of specimen vouchers to unique unclassified organisms. We would like an agreement on how these values should be used, to prevent creation of multiple tax nodes.
e.g. Eutagenia sp. BMNH:12345 (our preference)
Eutagenia sp. BMNH 12345
Eutagenia sp. 12345
DISCUSSION and CONCLUSION:
Ruth explained that there were conflicts in advice from NCBI taxonomy regarding the use of specimen vouchers to unique unclassified organisms. Scott answered that NCBI taxonomy would assign taxonomy nodes in the following way:
For scientific names, without a colon between center code and voucher ID, e.g. Eutagenia sp. BMNH 12345
For synonyms, with a colon between center code and voucher ID, e.g. Eutagenia sp. BMNH:12345
No longer used: voucher ID only, without center code, e.g. Eutagenia sp. 12345
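The naming convention described by Scott can be illustrated with a short Python sketch. The function name and the dictionary keys are hypothetical and chosen only for this example. The two name forms follow the minuted convention.

```python
def voucher_based_names(genus, center_code, voucher_id):
    """Build the scientific name and synonym for an unclassified organism (hypothetical helper)."""
    prefix = f"{genus} sp."
    return {
        # Scientific name: no colon between center code and voucher ID.
        "scientific_name": f"{prefix} {center_code} {voucher_id}",
        # Synonym: colon between center code and voucher ID.
        "synonym": f"{prefix} {center_code}:{voucher_id}",
    }
```

Calling voucher_based_names("Eutagenia", "BMNH", "12345") gives "Eutagenia sp. BMNH 12345" as the scientific name and "Eutagenia sp. BMNH:12345" as the synonym. The form with the voucher ID alone is deliberately not produced.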
16.
Concerns about INSD‐XML DTD ver. 1.5 and structured COMMENT (DDBJ)
DDBJ would like to evaluate the proposed version, INSD‐XML DTD ver. 1.5.
When DDBJ system administrators checked it, some problems were found;
1) Definition of "INSDSeq_tagset" is not found.
3) "authority,version" should be defined as "required"
Also, we should be careful to use "structured COMMENT".
1) How to control and validate user's tag definitions?
2) We should keep the equality of the contents between the text flat file and INSD‐XML
DISCUSSION and CONCLUSION:
DDBJ raised the following problems in the INSDSeq‐XML v.1.5 DTD proposed by GenBank:
‐ <INSDSeq_tagset> should be defined.
‐ <INSDTagset_id> should be added for the “field name” of a structured comment.
‐ Where to describe <INSDTagsetRuleSet>?
‐ How to link <INSDTagset> and <INSDTagsetRuleSet>?
‐ How can we exchange <INSDTagset_authority>, <INSDTagset_version>, <INSDTagset_url>, and <INSDTagset_unit>?
Basically, GenBank accepted the DDBJ requirements to modify their DTD. Two alternatives related to XML modifications were discussed:
‐ Disconnect XML and structured COMMENT/CC
‐ New field other than COMMENT/CC lines?
For the following two reasons, all agreed to use structured COMMENT/CC lines:
‐ The structured comment approach is still flexible.
‐ There are many GenBank entries in which structured comments have been used.
Some examples of structured comments were introduced by GenBank:
EU577696 has a structured comment for “HIVDataBaseData”.
http://www.ncbi.nlm.nih.gov/nuccore/EU577696
WGS master record ABUB01000000 has a structured comment for “Metadata” after “normal comments”.
http://www.ncbi.nlm.nih.gov/nuccore/ABUB01000000
Some alternative formats for structured COMMENT/CC lines were shown by DDBJ and EMBL.
From DDBJ: 1) fields for name and values are delimited by doubled colons, “::”; 2) the names are enclosed by doubled colons, “::”.
From EMBL: a simple indented structure for names and values, as in Feature/Qualifier.
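As a rough illustration of the first DDBJ alternative (name and value delimited by a doubled colon), the Python sketch below splits such lines into name/value pairs. The function name, the sample field names and the exact line layout are assumptions made for illustration only; the minutes do not fix a format.

```python
def parse_structured_comment(lines):
    """Parse 'name :: value' lines into a dict (hypothetical layout)."""
    fields = {}
    for line in lines:
        if "::" not in line:
            # No delimiter: treat as an ordinary free-text comment line.
            continue
        name, value = line.split("::", 1)
        fields[name.strip()] = value.strip()
    return fields


# Invented sample lines; not taken from any real entry.
sample = [
    "This is an ordinary free-text comment line.",
    "Subtype :: B",
    "Sampling year :: 2005",
]
print(parse_structured_comment(sample))
```

A parser of this kind would also make it easier to check that the flat file and the INSD‐XML carry the same names and values, which is one of the concerns raised by DDBJ.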
ACTION:
In a week, GenBank will provide the proposed samples for structured comment and INSDSeq‐XML DTD ver. 1.5.
TIMELINES:
CONTACTS:
GenBank: Mark Cavanaugh EMBL‐bank: Rasko Leinonen DDBJ: Jun Mashima
17.
Large scale data from patent office (DDBJ)
JPO has asked DDBJ to release about 2,500,000 entries of their data. Since it is the first time for DDBJ to accept such large scale PAT data, DDBJ would like to ask EMBL‐Bank and GenBank how they handle large scale PAT data.
DISCUSSION and CONCLUSION:
DDBJ asked EMBL‐Bank and GenBank how they handle large scale PAT data.
Neither GenBank nor EMBL‐Bank has any particular policy for large numbers of PAT entries. From a practical standpoint, GenBank suggests that USPTO send a representative entry for a patent application that has many sequences.
All agreed that the release policy for large number of PAT entries is up to each patent office and data bank.
18.
Data exchange for WGS (GenBank)
A recent update to DDBJ WGS project BAAB01 involved: a) a publication change; b) changes to two of the sequences; c) a changed update‐date value for every contig in the project.
At GenBank, when a publication for a WGS project is updated (for example, the addition of a PubMed ID for ABWF00000000, which occurred on 04‐MAR‐2009), even if we 'refresh' the contents of the WGS files on our FTP site, the update‐dates of the contig records themselves do *not* change.
In the DDBJ case, if we wanted to automate the processing of their new/refreshed BAAB01 data files, we would have to load EVERY ONE of the sequence records into our database. Why? The update‐date is intrinsic to the contigs. If their values change, we cannot get a byte‐identical result when we compare them to what we already have in our sequence database. As a result, all 213,289 BAAB01 records were loaded, as opposed to just the two that have sequence changes. This is an inefficient procedure.
For a publication update that affects all contigs, the update date of just the *master* changes. We exchange the new master, process it, and we're done. If there are also a handful of sequence changes, we would have to examine all of the contigs, but at least only the ones that REALLY changed would actually be loaded into our sequence database.
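The comparison problem described above can be sketched in Python as follows. This is only an illustration, not GenBank's actual loader: it assumes GenBank‐style flatfile text in which the update‐date is the last token of the LOCUS line, and that records are held in dictionaries keyed by accession. The function names are hypothetical.

```python
import hashlib


def content_digest(record_text):
    """Hash a flatfile record while ignoring the date on its LOCUS line."""
    kept = []
    for line in record_text.splitlines():
        if line.startswith("LOCUS"):
            # Assumption: the update-date is the final token of the LOCUS line.
            line = " ".join(line.split()[:-1])
        kept.append(line)
    return hashlib.sha256("\n".join(kept).encode("utf-8")).hexdigest()


def changed_accessions(old_records, new_records):
    """Return accessions that are new or differ once the date is ignored."""
    return sorted(
        acc
        for acc, text in new_records.items()
        if acc not in old_records
        or content_digest(old_records[acc]) != content_digest(text)
    )
```

With a comparison of this kind, a refreshed file set in which only the dates moved would yield an empty list, and only contigs with real changes would be reloaded.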
We raise this issue because the volume of WGS data can only grow with the decrease in sequencing costs. So perhaps it is time to explore some new directions for data exchange, so that the exchange mechanism more closely mirrors the manner in which WGS data is actually stored and maintained?
DISCUSSION and CONCLUSION
For a publication update that affects all of a WGS dataset, the three banks have conventionally exchanged all entries. However, this often seemed redundant.
As an alternative approach, GenBank proposed adding WGS masters to the INSDC exchange, limited to unannotated WGS entries, to update the fields that affect all entries in a WGS dataset (i.e. references, comments and so on), because the volume of WGS data grows with the decrease in sequencing costs.
Other alternatives were also discussed:
‐ Utilize INSDSeq‐XML
‐ Multi‐fasta
‐ Cutting different lines from flatfile
Although the details were not yet clear, resolving this issue would involve some cost at EMBL and DDBJ. All agreed to investigate, each on their own side, how to add WGS masters to the INSDC exchange.
TIMELINES:
Six weeks to propose some appropriate format.
CONTACTS:
GenBank: Mark Cavanaugh EMBL‐bank: Rasko Leinonen DDBJ: Yasukazu Nakamura
19.
Public clarification of WGS dataclass/division usage conventions
(EMBL)
Conventionally, finishing a whole genome shotgun project requires the migration of all sequence to non‐WGS dataclasses/divisions, either by creation of new STD entries or new STD segment entries with CON/ANN to tie them together (case 1). We have datasets that have CON/ANN pointing to WGS segments (case 2), for projects that have mature data without being finished correctly, and those that have immature data for which the submitters have some reasons to tie entries together to represent high‐level assemblies. Case 2 confounds two project statuses. We will propose a text that we would like to publish on INSDC.org to clarify finishing requirements to data providers.
DISCUSSION:
To clarify finishing requirements to sequence data providers, EMBL proposed the following text about usage of WGS category to publish on INSDC.org site;
The INSDC WGS dataclass/division represents transient assemblies of contigs from whole genome/metagenome sequencing projects. Whenever the sequences are reassembled, all entries from the prior build are suppressed and no attempt is made to track sequences in the assemblies.
Where scaffolds are assembled from WGS data, these should be present as CON entries representing gapped clones, and these CONs may further be assembled into chromosomes.
Because of the volatile nature of these data (both WGS and CONs built from them) we would discourage the use of annotation on these data, but where groups have added features we would strongly encourage submitters to facilitate the tracking of CDS features using the protein_id qualifier.
For non‐metagenomic projects, we would expect that WGS entries are migrated to the main section of the database when the initial cycle of sequencing and assembly is over. This point is reached when tracking of sequences between builds is possible, when each entry represents a specific replicon, or when a project is no longer active.
‐‐‐‐‐‐‐‐‐
EMBL believes that INSDC should create a subset of unannotated WGS – with definition, accession.version and sequence differing between them only.
Conventionally, finishing a whole genome shotgun project requires the migration of all sequence to non‐WGS data classes/divisions (case 1). We have datasets that have CON/ANN pointing to WGS segments (case 2), for projects that have mature data without being finished correctly, and those that have immature data for which the submitters have some reasons to tie entries together to represent high‐level assemblies. Case 2 confounds the two project statuses.
As a counterpart example, JCVI submitted annotated metagenome data. GenBank suggested that we can use project database for phase classification of WGS projects.
To distinguish this kind of metagenomic data from the other two cases mentioned above, using controlled keyword system would be helpful for EBI and UniProt.
An example of a questionable WGS‐CON data set was found:
A WGS data set, ABVS01, has only one member of WGS sequence data, ABVS01000001.
http://www.ncbi.nlm.nih.gov/nuccore/ABVS01000000
However, an annotated CON entry, DS995298, has only one piece entry, ABVS01000001.
http://www.ncbi.nlm.nih.gov/nuccore/DS995298
Why is the sequence data treated as a normal record in “HTG phase 2” or else?
Discussion for these issues will be continued via mail.
TIMELINES:
Two weeks for conclusion.
CONTACTS:
GenBank: Karen Clark
EMBL‐bank: Nadeem Faruque DDBJ: Jun Mashima
Wednesday May 12, 1:00-5:00 (break 2:30-2:45)
Chair: Takashi Gojobori Minutes: EBI
20.
Pointers from flatfiles to SRA/ERA run data (EMBL)
EBI wishes to adopt a mechanism for pointing to sets of reads in the short read archives from AS lines (equivalent to ASSEMBLY blocks). Although for practical reasons of volume, these pointers will not provide logical connections to individual reads, they will permit the association of short read metadata (such as sample information) with assembled sequence.
DISCUSSION
EBI
TPA/TSA and normal entries should be able to refer to a run within SRA/ERA (can't point toward individual reads)
can point toward run (SRR) objects
use AS lines with no co‐ordinates
NCBI
suggested using DR lines
use record level cross references (dblink)
EBI
considered DR but update cycle is independent
NCBI
could we switch your AS lines to dblink (and vice versa)
EBI
might be confusing for users
NCBI
not willing to use AS lines ‐ only want one entry level cross reference
EBI
don't need to use the same method
NCBI
dblink is part of record. names resource, then provides identifier
EBI
could introduce new line type ‐ this would be an internal discussion for EBI
DDBJ
think it's a good idea, not sure how they would implement it
NCBI
agreement that TSA/TPA record should point to run where AS information isn't available
DDBJ
don't know where this information would be stored
EBI
wouldn't put sample references in AS lines
NCBI
need a convention on naming ERA/SRA resource
how do we deal with exchanged entries from other databases where the run object doesn't exist?
EBI
fails to load (agreed by all)
NCBI
TSA/TPA documentation should include the requirement for the SRA/ERA link where AS lines are not used.
EBI
how many TSA/TPA will change as a result of this?
NCBI
2 TSA (1 of ~60,000, 1 of ~20,000), no TPA
DDBJ
will display in dblink
EBI
will use DR line (as it's quick)
CONCLUSION
NCBI will use dblink, EBI will use DR line (subject to internal discussion), DDBJ will use dblink.
TPA/TSA without conventional AS lines must have a link to SRA/ERA run (this will be added to the documentation).
Must validate that the run object exists ‐ if cross database, must be public for this to happen.
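The two rules in this conclusion can be sketched as a load‐time check. The Python below is a minimal sketch under stated assumptions: the record is modelled as a plain dictionary with hypothetical keys (dataclass, as_lines, run_links), and known_runs stands in for whatever lookup of public run objects each database actually uses.

```python
def check_run_link(record, known_runs):
    """Return a list of problems for one record (hypothetical data model)."""
    problems = []
    if record["dataclass"] not in ("TSA", "TPA"):
        return problems  # the rules discussed here concern TSA/TPA only
    runs = record.get("run_links", [])
    # Rule 1: without conventional AS lines, a run link is mandatory.
    if not record.get("as_lines") and not runs:
        problems.append("no AS lines and no SRA/ERA run link")
    # Rule 2: every referenced run object must exist (and be public).
    for run in runs:
        if run not in known_runs:
            problems.append("run object %s not found or not public" % run)
    return problems


# Illustrative identifier only; not taken from the minutes.
record = {"dataclass": "TSA", "as_lines": [], "run_links": ["SRR000001"]}
print(check_run_link(record, known_runs=set()))
```

A record that returns a non‐empty list would fail to load, in line with the position agreed by all three databases.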
TIMELINES
CONTACTS
21.
Use of SRA/ERA sample objects for enrichment of contextual sample information (EMBL)
Due to their extensible structure, short read sample objects provide a sensible holder for contextual sample information that might previously have been directed towards structured comments in flatfiles. Although this concept has been agreed in principle across INSDC, EBI wishes to establish a mechanism through which sequence and project records can point to SRA/ERA sample objects.
DISCUSSION
NCBI
what happens if the sample object changes so that the community keyword no longer applies?
EBI
calculation on output to check whether the stamp should be there
NCBI
would you update and redistribute the record if the sample object changed?
EBI
No. We would produce a mapping file (regularly) that would know why they were compliant.
NCBI
If a BARCODE record was updated, we would rerun the checks that it complied with the rules
Slightly different, as BARCODE rules are internal to INSDC
EBI
Would treat the rules as internal in these cases
Owner of the sample object could be different to owner of the sequence record
Link between sequence and sample objects is within DR ‐ we control these, so it's outside submitter control.
NCBI
would you worry about removing a keyword without warning the submitter?
EBI
No.
NCBI
Keyword is more important than just DR ‐ seen by submitter as quality mark. If removed by third party, would submitter be unhappy.
Would we contact submitters?
EBI
Face the update/keyword removal issue when we meet it.
NCBI
tracking changes of status of keywords for compliance may be important
‐ GenBank has concerns about the synchronising validation of sample object 'quality' or 'compliance'. Ownership issues may arise in the future.
CONCLUSION
TIMELINES
CONTACTS
22.
Discussion Item: Blurring the Distinction Between Traditional Sequence Databases and Short Read Archives (GenBank)
With the advent of next generation sequencing, we have seen a huge increase in the amount of sequence data deposited at NCBI. We have noticed, particularly with EST sequence submissions, a significant increase in the size and number of submissions. Most of the EST sequencing now done using 454 technology is being submitted to both SRA and dbEST. There are technical reasons why submitters want to do this, including the need for accession numbers for publication, the need to retrieve and analyze all transcript sequences from a single organism, and the inability to retrieve and analyze individual sequences from SRA. As we work to overcome these hurdles in SRA, we question whether it is still necessary to submit these sequences to GenBank. We would like to discuss the future of the two databases and how we can satisfy the needs of scientists to use data from both datastreams without duplicating all of the data.
DISCUSSION
NCBI
No longer need to assign so many EST accnos any more since we have been able to BLAST sequences from SRA. Less concern from submitters now the data is searchable - not fussed about making EST records.
EBI
Similar to our proposal from last year. Keen to do this - need the SRA toolkit first. (this will be discussed tomorrow in short read meeting).
DDBJ
Are NCBI trying to separate 'old' EST data from 'new' EST data?
NCBI
No - want to present all data together
EBI
NCBI
Kurt will be able to supply this
EBI
What will NCBI make blastable?
NCBI
everything - no technical issues
EBI
presume this won't include variation data etc.
NCBI
correct
EBI
how many transcriptome sets?
NCBI
around 100 - select subset of runs and blast against them (select by organism)
DDBJ
do you still want EST submissions (as well as SRA)?
NCBI
No. We want to be able to search across all data (no distinction between data types or sets)
BLAST system is first implementation - will be discussed more tomorrow. NCBI is planning a more complex search interface in the future.
Don't know at what stage the system won't cope with numbers of reads
NCBI
could EBI work with fastq files
EBI
EBI has a strong preference for the toolkit – more sustainable system than using fastq.
DDBJ
if data is divided between old and new, users may get confused
NCBI
may happen, but there isn't a way to deal with this.
Ideally would like to see dbEST go away, but don't plan to migrate existing data to new format.
NCBI
CAGE data will also be a blur between traditional and new format data - CAGE data submission volumes are also rising rapidly.
DDBJ
can see that they would also merge, but as technique is based on old sequencing methods won't accelerate that much.
NCBI
CAGE may be replaced by new tech sequencing (particularly 454)
EBI
in short reads these should be represented as tags
NCBI
not sure this would work
EBI
gene expression data is being represented as tags
NCBI
in this case tags represent the whole of the reads
DDBJ
dbEST is outside the collaboration
NCBI
not entirely - data is exchanged into collab
DDBJ
Will need to revisit this issue in the future
Will you make a new version of dbEST to serve all data in the future?
NCBI
CONCLUSION
EBI is keen, but needs the SRA toolkit. We could stop creating ESTs in the future. DDBJ isn't sure that they would agree to stop EST data creation. NCBI feels that EST is no longer required. Legacy EST records will remain.
INSDC now includes SRA/ERA and trace archives. This is agreed by all collaborating databases.
Supplemental WGS‐Related Agenda Items (NCBI)
W1. Multiple Products for NCBI CON Division Records: Still Needed By INSDC?
NCBI's standard representation for its CON‐division data product intended for the general public (which includes WGS scaffolds; WGS super‐scaffolds/chromosomes; HTG‐based scaffolds) is as follows:
• CONTIG join() statement, rather than an instantiated sequence
• If the scaffold/super‐scaffold/chromosome is annotated, then display that annotation, in the coordinate system established by the CON record itself.
• Both Annotated and Un‐annotated CON records are provided in a single data product.
Here's an example of an annotated CON‐division record as presented in NCBI’s daily incremental CON‐division update:
LOCUS       DP001105 769093 bp DNA linear CON 04‐MAY‐2009
DEFINITION  Dasypus novemcinctus ENCODE region ENr122 genomic scaffold.
ACCESSION   DP001105
VERSION     DP001105.1 GI:229368688
KEYWORDS    ENCODE.
....
     source          1..769093
                     /organism="Dasypus novemcinctus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9361"
                     /note="ENCODE region ENr122"
     mRNA            join(<48165..48332,50783..50917,51711..51848,52844..52961,
                     58204..58346,60940..61107,62072..>62473)
                     /gene="SERPINB12"
                     /product="SERPINB12 protein (predicted)"
                     /inference="ab initio prediction:JIGSAW:3.2"
                     /inference="similar to AA sequence:INSD:AAI03886.1"
     CDS             join(48165..48332,50783..50917,51711..51848,52844..52961,
                     58204..58346,60940..61107,62072..62473)
                     /gene="SERPINB12"
                     /inference="similar to AA sequence:INSD:AAI03886.1"
                     /note="similar to AAI03886.1 SERPINB12 protein (Homo sapiens)"
                     /product="SERPINB12 protein (predicted)"
                     /protein_id="ACQ62981.1"
....
CONTIG      join(AC198003.2:1..13874,gap(unk100),AC198003.2:13975..140515,
            gap(50000),AC159587.2:1..17686,gap(unk100),AC159587.2:17787..28127,
            gap(unk100),AC159587.2:28228..52623,gap(unk100),
            AC151560.2:15842..15908,AC159587.2:52724..99606,gap(unk100),
            AC151560.2:62646..84285,gap(unk100),AC151560.2:84386..97538,
            gap(unk100),AC151560.2:97639..99167,AC196357.2:1..53124,
            AC186952.2:1..79891,AC152469.2:1..48782,gap(unk100),
            AC152469.2:48883..108140,gap(unk100),AC152469.2:108241..108376,
            AC185584.2:52845..87143,gap(unk100),AC185584.2:87244..107710,
            gap(unk100),AC185584.2:107811..115293,AC199308.2:1..16759,
            gap(unk100),AC199308.2:16860..18913,gap(unk100),
            AC199308.2:19014..52798,gap(unk100),AC199308.2:52899..70818,
            gap(unk100),AC199308.2:70919..84363,gap(unk100),
            AC199308.2:84464..138443)
//
Several years ago, at INSDC request, we implemented a different policy for 'collab' CON‐division data products that are intended for use only within the INSDC:
1) Un‐annotated and Annotated CON records are provided via independent data products. For example:
ncbi.con_nc_annot.0505.2009.gbff.gz
ncbi.con_nc.0505.2009.gbff.gz
2) For Annotated CON records, both the CONTIG join() statement and instantiated sequence data are presented.
Requirements (1) and (2) add significant complexity to our data processing. For each source of CON records, we have to dump each day's records, split them into annotated/un‐annotated, process them in different ways (sequence instantiated/non‐instantiated), and then combine (or not combine) the records to generate either one or two final data products (public vs. collab).
QUESTION: Are practices (1) and (2) still required by EBI and DDBJ?
We recently processed our first TPA‐WGS project. It is quite likely that we will eventually receive annotated and un‐annotated scaffolds for TPA‐WGS. If there is a possibility that (1) and (2) are no longer necessary, that would greatly simplify NCBI's processing of future TPA‐WGS scaffolds, and of existing CON/scaffold records.
DISCUSSION
NCBI
Should we still provide both annotated and un‐annotated CON records?
EBI
considering discontinuing ANN anyway
would be glad to remove this product, and not have a separation
might need a bit of extra time (~9 months)
needs some work in the database, but is a streamlining step
DDBJ
also agree that CON and ANN should be merged into CON
should not be too much of a problem
CONCLUSION
We should aim to discontinue ANN and merge with CON.
TIMELINE
In around 9 months' time
W2. Modification of WGS Accession Format to Support WGS ‘Scaffold‐Masters’
This is a discussion topic, rather than a formal proposal. However, the issue is important, and we are hoping for some productive debate during the meeting, and development of some creative ideas.
At NCBI, WGS projects consist of underlying sequence‐overlap contigs plus a “WGS Master” record, which stores data that are common to all of the contigs (references, comments, Project IDs, etc).
Upon user retrieval of a WGS contig record, the incorporation of WGS master data is possible only because there is a shared accession number convention for both the contig and its master. For example, given a request for contig record AZZZ05123456, software recognizes that the accession is that of a WGS contig (Project Code + Assembly‐Version + Digits) and then infers the accession of the corresponding WGS master (AZZZ + 00 + 000000). Data is obtained from that master, and incorporated into AZZZ05123456 on‐the‐fly.
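The inference step described above can be sketched in a few lines of Python. This is an illustration only, not NCBI's software: the function name is hypothetical, and the pattern assumes exactly the layout of the example (four letters, a two‐digit assembly version, six digits).

```python
import re

# Assumed layout, taken from the example AZZZ05123456:
# 4-letter project code + 2-digit assembly version + 6 digits.
WGS_CONTIG = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


def wgs_master_accession(accession):
    """Infer the WGS master accession for a contig, or None if not WGS."""
    match = WGS_CONTIG.match(accession)
    if match is None:
        return None
    project, _version, serial = match.groups()
    # The master keeps the project code and zeroes everything else.
    return project + "00" + "0" * len(serial)


print(wgs_master_accession("AZZZ05123456"))  # AZZZ00000000
print(wgs_master_accession("DP001105"))      # None (traditional 2+6 format)
```

The second call shows why scaffolds cannot use the same trick today: a traditional 2+6 accession carries no project code from which a master could be derived.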
The advantages of a WGS master are clear: for example, publication data for the WGS project are not stored redundantly, and an update to a master’s publication is immediately visible for all of the contigs.
Unfortunately, NCBI has not used the same approach for scaffolds (also known as supercontigs, ultracontigs, etc) which are constructed from WGS contigs. Instead, the scaffolds/CON‐records all have their own publications, comments, Project IDs, etc, and utilize accession numbers of the traditional 2+6 format. There is no equivalent of a ‘master’ for scaffolds.
And without a master, maintenance of WGS scaffold records at NCBI becomes a significant issue: any simple update must be performed on 1,000s/10,000s/100,000s of individual records, rather than in one place.
The barrier which prevents us from implementing a ‘master’ for WGS scaffolds is the lack of a uniform accession number convention for scaffolds, like the convention that exists for contigs. If a new accession format for scaffolds could be developed, similar to that of the contigs, yet also different enough that users would understand the distinction, we could then make use of the ‘master’ approach, and storage and maintenance of scaffolds would be much improved.
Here are a few possibilities:
A contig        A scaffold           Chromosomes too?
AZZZ05012345    AZZZ05_s:012345      AZZZ_c:000001
AZZZ05012345    AZZZ05‐scf:012345    AZZZ‐chr:000001
AZZZ05012345    AZZZ05‐s‐012345      AZZZ‐c‐000001
The basic idea is to elaborate on the WGS contig accession format, and provide a means to recognize that a given record is a scaffold.
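To illustrate how software might recognize a scaffold from its accession alone, the Python sketch below classifies accessions using the third candidate format from the table. The choice of that candidate, the use of ASCII hyphens, and the function name are all assumptions for illustration; no format was agreed at the meeting.

```python
import re

# Patterns for one candidate scheme (third row of the table above).
PATTERNS = {
    "contig": re.compile(r"^[A-Z]{4}\d{2}\d{6}$"),
    "scaffold": re.compile(r"^[A-Z]{4}\d{2}-s-\d{6}$"),
    "chromosome": re.compile(r"^[A-Z]{4}-c-\d{6}$"),
}


def classify_accession(accession):
    """Return 'contig', 'scaffold', 'chromosome' or 'other'."""
    for kind, pattern in PATTERNS.items():
        if pattern.match(accession):
            return kind
    return "other"


for acc in ("AZZZ05012345", "AZZZ05-s-012345", "AZZZ-c-000001", "DP001105"):
    print(acc, classify_accession(acc))
```

Whichever candidate is chosen, the point is the same: the project code stays visible in the accession, so a master record can be located without a separate lookup.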
Some points to consider:
• We are consuming new prefixes for 2+6 accession numbers at an alarming rate. Adopting a convention for scaffolds which is similar to the convention for contigs would reduce that rate significantly.
• In most cases, there are only a handful of chromosome‐level scaffolds, and preservation of