
  BOLD MANUAL
Welcome to BOLD – the Barcode of Life Data System.  In this document, you will learn the basic data structure of BOLD (what it is and what it can do for you), the expected workflow (from project generation to analysis), and some of the important features that BOLD has to offer.


Table Of Contents

  1. Data Structure
    1. Specimen Identifiers
    2. Collection Data
    3. Specimen Taxonomy
    4. Specimen Details
  2. Data Analysis
  3. Downloads
    1. Sequences
    2. Data Spreadsheet
  4. Sequence Analysis
    1. Taxon ID Tree
    2. Distance Summary
    3. Sequence Composition
    4. Nearest Neighbour Summary
  5. Specimen Aggregates
    1. Distribution Map
    2. Image Comparison
  6. Barcoding Workflow

Data Structure


All of the data in BOLD is organized by projects.  There is a limit of 1000 entries for a given project, to keep the size manageable.  Related projects can be grouped into containers.

An individual entry in the database represents a barcode of a given specimen.  The Process ID uniquely represents a specimen in BOLD.  This is the identifier that is used to track a specimen through the barcoding process:  collection, taxonomic identification, sequencing, analysis and final publication of data.  Process ID is assigned internally when a specimen record is created.

BOLD is designed to hold not only sequence data (including primer sets, electropherogram trace files, and translations), but also complete taxonomic designation, geo-temporal collection data, and specimen images.

Specimen data can be entered in one of two ways.  Users can enter sample data through the web interface (“Project View, Data Management – Submit Specimens”), which is convenient for small numbers of entries.

For larger sets of samples, the data can be entered on the BOLD Submission Template spreadsheet and sent to BOLD.  Data managers will review the data to ensure that it meets the minimum requirements, and then enter it into BOLD.  This method is preferable for large sets of data.

 





Specimen Identifiers


Each specimen is further identified by several fields.  Sample ID is often an extension of the field or catalog number.  In the case of vouchered specimens, Isolate or Field Number, Catalog Number, and Collection Code should match the respective identifiers from the host institution.  Institution represents the name of the institution which donated the tissue sample (or where the specimen is vouchered).

Collection Data


Collection Data describes the geo-temporal collection event.  Collectors identifies the individual(s) responsible for the collection.  Date Collected should be in numeric day/month/year format, e.g. 14/11/2005.  The fields Continent/Ocean, Country/FAO (Food And Agriculture Organization), and State/Province are chosen from drop-down menus.

The fields Region/Country/Lake/River, Sector, and Exact Site are free-text fields intended to allow a hierarchy of information to be captured while remaining flexible.

Latitude and Longitude should be entered in decimal degrees, Elevation in meters.  Coordinate source should indicate how these values were obtained (e.g. GPS model number, reference map, etc.).
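
The constraints described above can be checked before a record is submitted.  The following is a minimal sketch only:  the function name and its parameters are hypothetical and are not part of BOLD, and the day/month/year order is assumed from the example date given above.

```python
from datetime import datetime


def check_collection_data(date_collected, latitude, longitude, elevation_m=None):
    """Return a list of problems found in one collection record (empty if none)."""
    problems = []

    # Assumption: dates are day/month/year, as in the example 14/11/2005.
    try:
        datetime.strptime(date_collected, "%d/%m/%Y")
    except ValueError:
        problems.append("Date Collected is not in dd/mm/yyyy format")

    # Decimal degrees: latitude in [-90, 90], longitude in [-180, 180].
    if not -90.0 <= latitude <= 90.0:
        problems.append("Latitude out of range")
    if not -180.0 <= longitude <= 180.0:
        problems.append("Longitude out of range")

    # Elevation is treated as optional here; when given, it must be a number (meters).
    if elevation_m is not None and not isinstance(elevation_m, (int, float)):
        problems.append("Elevation must be a number (meters)")

    return problems
```

An empty list means that no problems were found.  The check does not cover the drop-down or free-text fields.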

Specimen Taxonomy


In the online submission form, all fields in this section are selected from drop-down menus.

Specimen Details


Here the researcher can enter extra information about the specimen.  The Voucher Type must conform to a set of standard types.  Tissue type should also include the storage medium (e.g. alcohol, formalin, etc.).  Extra Info can include anything that might be interesting.  Sex can be male, female, or hermaphrodite.  Reproduction can be sexual, asexual, or cyclic parthenogenesis.  Life stage can be adult or immature.  Notes may contain any other information.

 

Data Analysis


The following section follows the online BOLD interface (http://www.BOLDsystems.org).

BOLD offers several options for examining your data.  From the “Project List View” (available by clicking on “Main Menu” at the top right of all pages), select a single project by clicking on the project name.  A “container project” holds several sub-projects, and clicking the container name will merge all the projects in the container.  Any combination of projects or containers can be merged by placing a checkmark in the box next to each project or container name and selecting “Project Options – Merge Projects”.

The “Summary View” allows the user to perform various functions on all of the specimens in a project.  “Detailed View” (discussed below) allows the same functions on a limited set of specimens within a project.  Note that users with Sequence Access Analyze permission, but without View permission, can perform analyses from Summary View only (see Barcoding Workflow).

Summary View shows various details:  the project start date and the total number of sequences, specimens, and species (total number and number sequenced).  Data Report displays the number of specimen records, and provides easy access to missing data, such as the number of specimens lacking geographic data, lacking photographs, or containing stop codons.

A list of the users associated with the project and a graph of the distribution of sequence length are provided.

In the case of a container project, or if projects have been merged, this data is derived from all specimens found in the merged projects.  A list of the merged project names is also presented.



Clicking on “Options – Detailed View” presents the user with summary information for each specimen in the project (or projects, if projects have been merged).  The “Contains” column displays icons indicating the data available for the given specimen.  A globe represents geographic reference data, a camera indicates an image of the specimen, a number in a box indicates the number of sequencing trace files available, and a red asterisk indicates that a stop codon is present in the sequence.







Clicking on the Sample ID allows the user to view the specimen data.  If the user has Sequence Access, Edit permission, an “Edit Specimen” button will allow this data to be modified.  Clicking on the Process ID (the unique BOLD identifier) will show the sequence information.

The “Downloads”, “Specimen Analysis” and “Specimen Aggregates” sections at the left of the page contain the available analyses.  If selected from “Summary View”, the analysis is applied to all of the specimens.  If selected from “Detailed View”, individual specimens can be selected.





Downloads

Sequences


Clicking on “Sequences” will provide a text file with all of the barcode sequences in FASTA format.  The header line contains the Process ID, Sample ID, and the species name.

Data Spreadsheet

“Data Spreadsheet” allows the user to download a spreadsheet containing all of the identification data, taxonomy, and collection data.  Additionally, a set of labels can be printed for the specimens.









Sequence Analysis

Taxon ID Tree


“Taxon ID Tree” allows the user to generate a tree based on the selected sequences.  The tree can be generated from the nucleotide or amino acid sequences, using the Kimura 2 Parameter, Jukes Cantor, or Pairwise Distance method.  The user can select how the terminal branches will be labeled, and specify which codon positions are to be included in the analysis.  A filter is applied to disregard sequences below a given length (since very short sequences are not adequate and can skew the results).  Finally, the tree can be colourized based on various information associated with the specimens.

Clicking on “Build Tree” initiates the process, and the user can then download the results in PDF format.
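
For readers unfamiliar with the three distance methods named above, the following sketch computes each of them for a pair of aligned nucleotide sequences using the standard textbook formulas.  It illustrates the methods and is not BOLD's own implementation; all function names are hypothetical, and the minimum length is left as a parameter because the manual does not state the value used.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq1, seq2):
    """Count compared sites, transitions and transversions in two aligned sequences."""
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    return sites, transitions, transversions


def pairwise_distance(seq1, seq2):
    """Proportion of compared sites that differ (p-distance)."""
    sites, ts, tv = count_differences(seq1, seq2)
    return (ts + tv) / sites  # raises ZeroDivisionError if no sites are comparable


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3)."""
    p = pairwise_distance(seq1, seq2)
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq1, seq2):
    """Kimura 2-parameter distance: d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    sites, ts, tv = count_differences(seq1, seq2)
    p_ts, q_tv = ts / sites, tv / sites
    return -0.5 * math.log((1 - 2 * p_ts - q_tv) * math.sqrt(1 - 2 * q_tv))


def long_enough(sequences, min_length):
    """Keep only sequences with at least min_length unambiguous bases."""
    return [s for s in sequences if sum(c in "ACGT" for c in s.upper()) >= min_length]
```

Both corrected distances are slightly larger than the raw proportion of differing sites, because they estimate multiple substitutions at the same site.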

Distance Summary

It is desirable for barcodes to show a very low frequency of divergence within a species, but a significant divergence at higher taxonomic levels.

“Distance Summary” gives an indication of the differences of the barcode sequences at the level of species, genus, family, order, and class.

The distance is calculated using the Kimura 2 Parameter method.  Comparisons are done on all samples matching at the given taxonomic levels, and the frequency is plotted for various levels of divergence.  Details for the comparisons done at the level of species, genus, and family are available by clicking on the appropriate buttons.
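
The grouping of comparisons by taxonomic level can be sketched as follows.  This is one possible reading of the description above, not BOLD's implementation:  each pair of specimens is assigned to the lowest level at which the two records match, so that a pair from the same species is not also counted at the genus level.  The record layout (dictionaries with one key per level and a "sequence" key) and the function names are hypothetical, and the distance function is passed in by the caller.

```python
from collections import defaultdict
from itertools import combinations

LEVELS = ("species", "genus", "family", "order", "class")


def shared_level(rec_a, rec_b):
    """Return the lowest taxonomic level at which two records match, or None."""
    for level in LEVELS:
        if rec_a.get(level) and rec_a.get(level) == rec_b.get(level):
            return level
    return None


def distance_summary(records, distance):
    """Group pairwise distances by the lowest level shared by each pair."""
    summary = defaultdict(list)
    for rec_a, rec_b in combinations(records, 2):
        level = shared_level(rec_a, rec_b)
        if level is not None:
            summary[level].append(distance(rec_a["sequence"], rec_b["sequence"]))
    return dict(summary)
```

The list of distances collected for each level can then be binned to plot the frequency of each level of divergence.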







Sequence Composition

The frequency of occurrence of DNA bases, especially GC-content, can be a useful metric to the evolutionary biologist.  GC-content within the barcoding region of CO1 has been shown to correlate with GC-content of the entire mitochondrial genome for many species.

Clicking on “Sequence Composition” allows the user to view the frequency of each base, G, C, A and T, as well as combined GC content.  This information is presented for the whole of the sequence, as well as for codon positions 1, 2, and 3.  “Detailed View” tabulates the results for each specimen.
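
The calculation behind this report is simple enough to sketch.  The code below assumes that the sequence is in reading frame from its first base unless an offset is given; the function names and the `frame_offset` parameter are hypothetical and only illustrate the idea.

```python
from collections import Counter


def base_composition(sequence):
    """Return the percentage of G, C, A and T, plus the combined GC content."""
    counts = Counter(b for b in sequence.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return {}
    result = {base: 100.0 * counts[base] / total for base in "GCAT"}
    result["GC"] = result["G"] + result["C"]
    return result


def composition_by_codon_position(sequence, frame_offset=0):
    """Composition of the whole sequence and of codon positions 1, 2 and 3.

    Assumption: the first codon starts at index frame_offset.
    """
    in_frame = sequence[frame_offset:]
    report = {"all": base_composition(sequence)}
    for pos in range(3):
        # Every third base, starting at this codon position.
        report[f"pos{pos + 1}"] = base_composition(in_frame[pos::3])
    return report
```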


Nearest Neighbour Summary

Clicking on “Nearest Neighbour Summary” presents the user with an examination of the distance to the nearest neighbour for each of the species in the list of specimens.  Distances are highlighted if the distance to the nearest neighbour is less than 2%, or when the distance is less than the intra-specific distance.  “Summarize by Family” combines the results for each taxonomic family in the project.
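
The highlighting rule can be sketched as follows.  Two assumptions are made that the manual does not confirm:  the “intra-specific distance” is taken to be the maximum distance observed within the species, and distances are expressed as proportions, so that 2% is 0.02.  The function name, the input layout, and the caller-supplied distance function are hypothetical.

```python
from itertools import combinations


def nearest_neighbour_summary(specimens, distance, threshold=0.02):
    """Summarize nearest-neighbour distances per species.

    specimens: list of (species_name, aligned_sequence) pairs.
    distance:  function taking two sequences and returning a proportion.
    """
    max_intra = {}  # species -> largest within-species distance
    nearest = {}    # species -> (nearest other species, distance)

    for (sp_a, seq_a), (sp_b, seq_b) in combinations(specimens, 2):
        d = distance(seq_a, seq_b)
        if sp_a == sp_b:
            max_intra[sp_a] = max(max_intra.get(sp_a, 0.0), d)
        else:
            # The pair is a candidate nearest neighbour for both species.
            for own, other in ((sp_a, sp_b), (sp_b, sp_a)):
                if own not in nearest or d < nearest[own][1]:
                    nearest[own] = (other, d)

    summary = {}
    for species, (neighbour, d) in nearest.items():
        intra = max_intra.get(species, 0.0)
        summary[species] = {
            "nearest_species": neighbour,
            "nearest_distance": d,
            "max_intraspecific": intra,
            # Highlight when closer than the threshold or than the species' own spread.
            "flagged": d < threshold or d < intra,
        }
    return summary
```

A flagged species is one whose barcode may not separate it cleanly from its nearest neighbour.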






  






Specimen Aggregates

Distribution Map

“Distribution Map” pinpoints the collection point for all specimens for which geographic reference data is available.  The satellite map resolution is 1 km per pixel.  The user can zoom in or out, or pan the map in 4 directions.



Image Comparison



“Image Comparison” shows the images associated with each specimen, when available.  This allows taxonomists an easy way to compare morphological differences between specimens.










Barcoding Workflow

A project begins (“Create New Project”) by assigning a name, project code and description.  Other users can be added and assigned levels of access.  A project can only be accessed by users who have been specifically granted permission.  Permissions for an existing project can be viewed by selecting the project, then selecting “Options – Modify Project Properties”.  There are separate permissions for Sequence Access and Specimen Access.

If a user is assigned to a project, but no access level is specified, they can view summary data only.  Sequence Access permissions consist of three levels:  with Analyze permission, the user can perform analysis on the data, but cannot view more than a summary of the data.  With View permission, the user can view or download the sequence data.  With Edit permission, the user can upload sequences or make changes to existing sequence features.

Specimen Access permission allows the user control over sample identifiers, taxonomy, collection data, and images of the specimen:  this level is intended for project managers, collectors, and taxonomists only.

For example, if the DNA sequencing work is being done by a separate lab, the researcher responsible for generating the sequences can be added to the list of users.  Since they will be uploading sequence data, this user would require Sequence Access, Analyze/View/Edit permissions.
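
One way to picture the Sequence Access levels is as a set of independent flags, which matches the example of a user holding Analyze, View and Edit together.  The sketch below is only a model of the rules described in this manual; the class and function names are hypothetical and do not correspond to anything in BOLD itself.

```python
from enum import Flag, auto


class SequenceAccess(Flag):
    NONE = 0          # assigned to the project with no access level
    ANALYZE = auto()  # may run analyses
    VIEW = auto()     # may view or download sequence data
    EDIT = auto()     # may upload or change sequence data


def can_run_analysis(perms):
    return SequenceAccess.ANALYZE in perms


def can_download_sequences(perms):
    return SequenceAccess.VIEW in perms


def can_upload_sequences(perms):
    return SequenceAccess.EDIT in perms


def analysis_views(perms):
    """Views from which the user can start an analysis."""
    if not can_run_analysis(perms):
        return []
    views = ["Summary View"]
    # Without View permission, analyses run from Summary View only.
    if SequenceAccess.VIEW in perms:
        views.append("Detailed View")
    return views


# The sequencing-lab user from the example above.
sequencing_lab = SequenceAccess.ANALYZE | SequenceAccess.VIEW | SequenceAccess.EDIT
```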


Specimens can be added to the project in one of two ways:  either through the web interface or a sample submission spreadsheet (see Data Structure).

Sequence information can be uploaded (“Project View – Data Management, Submit Sequences”) for several specimens at once in FASTA format.  The FASTA header line must conform to the following format:  it should begin with a ‘>’ followed by the Process ID, followed by either a bar (‘|’), an underscore (‘_’) or a space (‘ ’), followed by any other information the user wishes to add.  There can be no spaces before the end of the Process ID.
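
Header lines can be checked against this rule before a file is uploaded.  The sketch below makes two assumptions that the manual does not state:  that a Process ID never itself contains a bar, an underscore or a space, and that a header consisting of the Process ID alone is acceptable.  The function name and the Process ID used in the usage note are hypothetical.

```python
import re

# '>' immediately followed by the Process ID (no whitespace, '|' or '_'),
# then optionally one separator ('|', '_' or ' ') and free text.
HEADER_RE = re.compile(r"^>([^\s|_]+)(?:[|_ ](.*))?$")


def parse_header(line):
    """Return (process_id, extra_info) for a valid header, else raise ValueError."""
    match = HEADER_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"Invalid FASTA header: {line!r}")
    return match.group(1), match.group(2) or ""
```

For example, `parse_header(">ABCD123-05|Homo sapiens")` returns the Process ID and the remaining text separately, while a header with a space directly after the ‘>’ is rejected.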