BIRCH
Downloading and Maintaining GenBank
Organization of GenBank flatfile distribution
The GenBank
database is produced by the National
Center for Biotechnology Information (NCBI) at the NIH. It is
distributed as a set of flatfiles (text files), as described in the GenBank Release Notes.
As summarized in Table 1, the sequences are divided among several
divisions. Additionally, there are various index files. Note: Index
files not used by BIRCH are omitted from the table.
Table 1. GenBank divisions and flatfiles
code | division | file(s)
PRI | primate sequences | gbprixxx.seq.gz
ROD | rodent sequences | gbrodxxx.seq.gz
MAM | other mammalian sequences | gbmamxxx.seq.gz
VRT | other vertebrate sequences | gbvrtxxx.seq.gz
INV | invertebrate sequences | gbinvxxx.seq.gz
PLN | plant, fungal, and algal sequences | gbplnxxx.seq.gz
BCT | bacterial sequences | gbbctxxx.seq.gz
VRL | viral sequences | gbvrlxxx.seq.gz
PHG | bacteriophage sequences | gbphgxxx.seq.gz
SYN | synthetic sequences | gbsynxxx.seq.gz
UNA | unannotated sequences | gbunaxxx.seq.gz
EST | EST sequences (expressed sequence tags) | gbestxxx.seq.gz
PAT | patent sequences | gbpatxxx.seq.gz
STS | STS sequences (sequence tagged sites) | gbstsxxx.seq.gz
GSS | GSS sequences (genome survey sequences) | gbgssxxx.seq.gz
HTG | HTGS sequences (high throughput genomic sequences) | gbhtgxxx.seq.gz
HTC | HTC sequences (high throughput cDNA sequences) | gbhtcxxx.seq.gz
    | GenPept - translated proteins from GenBank | genpept.fsa.gz
    | Accession number index | gbacc.idx.gz
    | Secondary accession number index | gbsec.idx.gz
    | Release notes | gbrel.txt
The .gz extension indicates that files are compressed using the gzip
protocol for faster download. In most cases, each GenBank division is
split up among several files. For example, as of Release 133.0,
the patented division (PAT) was split among gbpat1.seq,
gbpat2.seq... gbpat7.seq. All files can be downloaded by FTP
from NCBI and from mirror sites listed in Table 2. Because of the
size of GenBank, and the amount of traffic at the NCBI server, it is
usually best to use the server geographically closest to your site.
Automated downloading and installation of GenBank
1. Getting ready
When you install BIRCH for the first time, the directories
$BIRCH/GenBank ($GB, $gb, $GENBANK, $genbank) and $BIRCH/GenPept ($GP,
$gp) will be created. $GENBANK will contain two files, gbupdate and
master.filelist. $GP will be empty.
The most critical consideration is space. For example, after
reformatting as described below, GenBank Release 133.0 required 78 GB
including all files listed in Table 1. GenBank releases occur every
two months, and for many years growth has been steady at about 9 to
10% per release, corresponding to an annual growth factor of about
1.7 (six releases per year, so roughly 1.095^6). The actual file sizes for each GenBank release are found in
the current GenBank
Release Notes. Obviously, maintaining a local copy of GenBank is
only feasible with a very high speed internet connection.
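The growth arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are illustrative and not part of BIRCH, and the 9.5% figure is simply the midpoint of the 9 to 10% range quoted above.

```python
RELEASES_PER_YEAR = 6  # one GenBank release every two months


def annual_growth_factor(per_release_growth, releases_per_year=RELEASES_PER_YEAR):
    """Compound a per-release growth fraction (e.g. 0.095) over one year."""
    return (1.0 + per_release_growth) ** releases_per_year


def projected_size_gb(current_gb, per_release_growth, releases_ahead):
    """Estimate the database size after a number of further releases."""
    return current_gb * (1.0 + per_release_growth) ** releases_ahead


if __name__ == "__main__":
    for rate in (0.09, 0.10):
        factor = annual_growth_factor(rate)
        print(f"{rate:.0%} per release -> x{factor:.2f} per year")
    # 78 GB is the Release 133.0 figure quoted above (assumed starting point).
    size = projected_size_gb(78, 0.095, RELEASES_PER_YEAR)
    print(f"estimated size one year later: about {size:.0f} GB")
```

An estimate like this is useful mainly for deciding how much headroom to leave on the filesystem that holds $GENBANK.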
Where disk space is limiting, it may still be feasible to maintain a
partial copy of GenBank. For example, the EST division is by far the
largest, accounting for well over half of the entire database. Other
big divisions that can easily be omitted are GSS and HTG. Elimination
of these divisions will not eliminate any of the major biological
taxa, and will still retain all annotated genes.
In some cases, it is useful to keep GenBank in a
separate filesystem from the rest of BIRCH. For example, at the U. of
M., BIRCH resides in /home/socs/birch, while GenBank resides in
/local/genbank. All that is required is to make a symbolic link called
$BIRCH/GenBank that points to /local/genbank. One advantage of a
separate filesystem is that it simplifies backups. While it is worth
backing up $BIRCH, which is relatively small, there is no point in
doing backups on $GENBANK, since it can always be restored from the
Internet.
2. Running gbupdate
Table 3. A sample filelist
|
gbrel.txt.Z
gbacc.idx.gz
gbsec.idx.gz
est
gss
htg
htc
pri
pln
bct
inv
rod
sts
vrl
vrt
mam
pat
syn
phg
una
genpept.fsa.gz
|
To keep current on when new GenBank releases become available, you
should follow the USENET news group bionet.molbio.genbank.
The gbupdate script automates the process of downloading and
reformatting some or all divisions of GenBank. Before running this
script, you need to set the environment variable $MAILID to your
email address. Most anonymous FTP sites request an email address,
and the variable is most easily set in the .cshrc file of the BIRCH
administrator.
The file 'filelist' defines which files and divisions to download.
Rules for the filelist file:
- Any file with a .gz or .Z file extension will be uncompressed
after downloading.
- A complete GenBank division can be downloaded and processed by
simply putting the 3-letter code found in Table 1 into filelist.
- Alternatively, any individual file can be downloaded by putting
the name into filelist (e.g. gbpln5.seq.gz).
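The rules above can be sketched in Python as follows. This is not the logic of the gbupdate script itself, which is not reproduced here. The names DIVISION_CODES, classify_entry and read_filelist are hypothetical, and the division codes are taken from Table 1.

```python
# Three-letter division codes from Table 1, in the lower case used in filelist.
DIVISION_CODES = {
    "pri", "rod", "mam", "vrt", "inv", "pln", "bct", "vrl", "phg",
    "syn", "una", "est", "pat", "sts", "gss", "htg", "htc",
}
COMPRESSED_SUFFIXES = (".gz", ".Z")


def classify_entry(entry):
    """Return (kind, name, needs_uncompress) for one filelist line."""
    entry = entry.strip()
    if entry.lower() in DIVISION_CODES:
        # A bare division code means "every file in this division".
        return ("division", entry.lower(), False)
    # Anything else is treated as an individual file name.
    return ("file", entry, entry.endswith(COMPRESSED_SUFFIXES))


def read_filelist(text):
    """Classify every non-blank line of a filelist."""
    return [classify_entry(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = "gbrel.txt.Z\ngbacc.idx.gz\nvrt\ngbpln5.seq.gz\n"
    for item in read_filelist(sample):
        print(item)
```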
A typical download session
Before running gbupdate, you need to specify which mirror site
to use. This is done by editing the gbupdate script so that only the
lines for the geographically closest mirror site are left uncommented.
We will now show the sequence of events in a typical download session.
The file master.filelist is distributed with BIRCH. It's probably
safest to copy this to another file called 'filelist' to use as a
working copy. To launch gbupdate, move to the GenBank directory and
run it with the name of the filelist. Terminating the line with '&'
makes the command run in the background:
cd $GB
./gbupdate filelist &
The advantage of running gbupdate in the background is that you can
log out at any time during the download without interrupting it.
The first file in the list is gbrel.txt.Z. In the example,
gbrel.txt is the file containing the GenBank release notes. On some
mirror sites, this file is compressed, and is named gbrel.txt.Z. At
other FTP sites, this file is not compressed, so 'gbrel.txt' must be
used in filelist.
When the file is received, the sizes of the original file from the FTP
server and the file received are listed.
gbrel.txt.Z
ORIGINAL= 58663
RECEIVED= 58663
If these numbers are equal, the name of the file is written to
files_received. Otherwise, the name of the file is written to
files_missed. gbrel.txt is a special case. After being uncompressed,
gbrel.txt is automatically moved to $doc/GenBank. By default, files
remain in the $GENBANK directory.
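The size check described above amounts to the following bookkeeping. This is a minimal Python sketch, not the code of gbupdate. The function name record_transfer and its parameters are hypothetical, and only the log file names files_received and files_missed come from the text.

```python
import tempfile
from pathlib import Path


def record_transfer(name, original_size, received_size, log_dir="."):
    """Append name to files_received if the sizes match, else to files_missed.

    Returns the name of the log file that was written to.
    """
    matched = original_size == received_size
    log_name = "files_received" if matched else "files_missed"
    with open(Path(log_dir) / log_name, "a") as log:
        log.write(name + "\n")
    return log_name


if __name__ == "__main__":
    # Use a scratch directory so the demonstration leaves $GENBANK untouched.
    with tempfile.TemporaryDirectory() as scratch:
        print(record_transfer("gbrel.txt.Z", 58663, 58663, scratch))
        print(record_transfer("gbvrt1.seq.gz", 50808692, 1024, scratch))
```

After a long download, any name left in files_missed can be put into a new filelist and fetched again.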
Next on the list is gbacc.idx.gz. This file is the Accession number
index, listing, for each accession number, the LOCUS name and division
code (Table 1). This file is uncompressed and remains in the GenBank
directory. (gbacc.idx is used by the XYLEM fetch
program, which retrieves GenBank entries by ACCESSION number or LOCUS name.)
The file gbsec.idx contains secondary accession numbers, as described in
the GenBank Release
Notes. It is not used by any of the programs in the current
BIRCH implementation.
Next on the list are GenBank division names. For each division,
gbupdate will find out how many files are contained in the division, and
list them to the output. For example, in Release 133.0, there were
two files in the vertebrate (vrt) division:
-r--r--r-- 1 IUBio archive 50808692 Jan 6 00:54 gbvrt1.seq.gz
-r--r--r-- 1 IUBio archive 22515849 Jan 6 00:54 gbvrt2.seq.gz
gbvrt1.seq.gz
gbvrt2.seq.gz
Removing file(s) for gbvrt1, if they exist
The full listing of files for this division is written, and then the
name of each file is echoed to the output. Before beginning the
download, gbupdate will remove the current files for this division, if
they exist, as a way of making sure that enough space is available.
If a file contains the .seq extension, it is assumed to be a sequence
file containing GenBank entries. After unzipping the file, the .seq file
is split into 3 files containing annotation, sequence and an index.
Thus, gbvrt1.seq is split into gbvrt1.ano, gbvrt1.wrp and
gbvrt1.ind. This is fully described in the documentation for the XYLEM
program splitdb. It is strongly recommended that you read this documentation file. The
key point is that the annotation files and sequence files can be
searched independently, saving a great deal of disk I/O. Thus, fasta or
blast need only search the .wrp files containing sequence, and
do not have to read any of the annotation. When sequence entries
are retrieved by fetch,
the index (.ind) file is used to find the annotation and sequence for
each entry so that the complete entry can be retrieved.
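The idea of separating annotation from sequence can be illustrated with a short Python sketch. It does not reproduce splitdb or the actual layout of the .ano, .wrp and .ind files, which are described in the splitdb documentation. It assumes only the standard GenBank flatfile conventions: each entry begins with a LOCUS line, the sequence follows the ORIGIN line, and "//" ends the entry. The function name split_entries is hypothetical.

```python
def split_entries(flatfile_text):
    """Yield (locus_name, annotation_lines, sequence) for each GenBank entry."""
    name, annotation, seq_parts, in_sequence = None, [], [], False
    for line in flatfile_text.splitlines():
        if line.startswith("LOCUS"):
            # Start of a new entry; the second field is the LOCUS name.
            name = line.split()[1]
            annotation, seq_parts, in_sequence = [line], [], False
        elif name is None:
            continue  # file header text before the first entry
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line.startswith("//"):
            yield name, annotation, "".join(seq_parts)
            name = None
        elif in_sequence:
            # Drop the position numbers and spaces, keeping only the bases.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
        else:
            annotation.append(line)


if __name__ == "__main__":
    sample = (
        "LOCUS       TEST1   12 bp    DNA\n"
        "DEFINITION  example entry.\n"
        "ORIGIN\n"
        "        1 acgtacgtac gt\n"
        "//\n"
    )
    for locus, ano, seq in split_entries(sample):
        print(locus, len(ano), seq)
```

A search program that needs only the sequences can then skip the annotation entirely, which is the saving in disk I/O described above.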
The vertebrate division shown above is a small division. At the other
extreme, the EST division is the largest, consisting of 235 files
in GenBank 133.0. Thus, a download of all GenBank divisions will
spend the majority of time on the EST division.
The last file listed in master.filelist is genpept.fsa.gz. This file
contains amino acid sequences translated from all annotated open reading
frames in GenBank. Like the files produced by splitdb,
genpept.fsa is also in FASTA format,
which can be read by fasta or blast. After unzipping, this file is
automatically renamed as genpept.wrp, and moved to $BIRCH/GenPept.
The two critical factors influencing the time required for a download
are the speed of the internet connection and the speed of the
filesystem. On our Sun Ultra 60 at the University of Manitoba, using a
remotely-mounted NFS fileserver, a complete download of GenBank
133.0 took just under 24 hours.
Because of the system and network resources used, it is best to do a
small download before trying to download all of GenBank. For example,
if your filelist contained only
gbrel.txt
gbacc.idx
vrt
you would get just the release notes, the accession index, and the
vertebrate division. If these were successful, you could put the
remaining division codes into filelist and complete the download.
The progress of the download can be monitored in a number of ways.
Just doing a directory listing of $genbank periodically will list all
files in the $GENBANK directory. 'less files_received' will list the
files successfully downloaded. 'top' will show the program
currently running: FTP if a file is being downloaded, gunzip if
a file is being uncompressed, or splitdb if the file is being split.
When splitdb finishes processing a .seq file, the .ano, .wrp and .ind
files are ready for use with no further processing. Thus, when gbupdate
is complete all files are ready to use. The only thing remaining is to
regenerate the index files used by FASTA, as described in the next
section.
Configuring FASTA for GenBank searches
How FASTA finds database files
FASTA reads a list of database files from the file 'fastgbs'. The
location of fastgbs is specified by the environment variable $FASTLIBS,
which is set to $BIRCH/dat/fasta/fastgbs. A typical fastgbs file is
shown below:
PIR Protein Identification Resource 72.02 $00@/home/socs/birch/dat/fasta/pir.fil
GenPept GenBank 133.0 CDS translations$01/home/socs/birch/GenPept/genpept.wrp
GB133 Primate$1P@/home/socs/birch/dat/fasta/gbpri.fil
GB133 Rodent$1R@/home/socs/birch/dat/fasta/gbrod.fil
GB133 other Mammal$1M@/home/socs/birch/dat/fasta/gbmam.fil
GB133 verteBrates$1B@/home/socs/birch/dat/fasta/gbvrt.fil
GB133 Invertebrates$1I@/home/socs/birch/dat/fasta/gbinv.fil
GB133 pLants$1L@/home/socs/birch/dat/fasta/gbpln.fil
GB133 Expressed Sequence Tags$1E@/home/socs/birch/dat/fasta/gbest.fil
GB133 Bacteria$1T@/home/socs/birch/dat/fasta/gbbct.fil
GB133 Viral$1V@/home/socs/birch/dat/fasta/gbvrl.fil
GB133 Phage$1G@/home/socs/birch/dat/fasta/gbphg.fil
GB133 Synthetic$1Y@/home/socs/birch/dat/fasta/gbsyn.fil
GB133 Unannotated$1U@/home/socs/birch/dat/fasta/gbuna.fil
GB133 Patented$1D@/home/socs/birch/dat/fasta/gbpat.fil
GB133 STS$1X@/home/socs/birch/dat/fasta/gbsts.fil
GB133 HTG$1h@/home/socs/birch/dat/fasta/gbhtg.fil
GB133 GSS$1s@/home/socs/birch/dat/fasta/gbgss.fil
GB133 All sequences (VERY long!)$1A@/home/socs/birch/dat/fasta/genbank.fil
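Each line of the listing above can be taken apart as in the following Python sketch. It rests on assumptions about the FASTLIBS layout that should be checked against the FASTA documentation: after the '$' comes one digit for the library type (0 for protein, 1 for DNA), then a one-character selection key, then a path, with a leading '@' marking a file that lists other files. The names LibraryEntry and parse_fastlibs_line are hypothetical.

```python
from typing import NamedTuple


class LibraryEntry(NamedTuple):
    description: str
    is_dna: bool
    key: str
    indirect: bool  # True when the path names a .fil list of files
    path: str


def parse_fastlibs_line(line):
    """Split one fastgbs line into its assumed components."""
    description, sep, rest = line.rstrip("\n").partition("$")
    if not sep or len(rest) < 3:
        raise ValueError(f"not a library line: {line!r}")
    type_digit, key, target = rest[0], rest[1], rest[2:].strip()
    indirect = target.startswith("@")
    if indirect:
        target = target[1:]
    return LibraryEntry(description.strip(), type_digit == "1", key, indirect, target)


if __name__ == "__main__":
    line = "GB133 Primate$1P@/home/socs/birch/dat/fasta/gbpri.fil"
    print(parse_fastlibs_line(line))
```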
For each GenBank division, there is a file with the .fil extension
listing the files comprising that division. An example is shown in
Table 4.
Table 4. gbinv.fil
|
</home/socs/birch/GenBank
gbinv1.wrp 0
gbinv2.wrp 0
gbinv3.wrp 0
gbinv4.wrp 0
gbinv5.wrp 0
|
This file lists the location and names of the 5 files in the
Invertebrate division. A complete description of syntax for these
files can be found in $BIRCH/doc/fasta/fasta3x.asc.
The problem is that the number of files increases in some
divisions with each GenBank release, requiring additional lines to be
added to the .fil files. The next section describes how to
automatically generate these files for a new GenBank Release.
Updating .fil files
cd $dat/fasta/fil
This directory contains a Python script called fil.py and a
file called filnum. For GenBank 133.0 filnum looked as shown in Table 5.
Table 5. filnum
|
est 235
gss 63
htg 57
htc 3
pri 24
pln 7
bct 6
inv 5
rod 6
sts 2
vrl 3
vrt 2
mam 1
pat 7
syn 1
phg 1
una 1
|
For each GenBank division, a number indicates the number of
files in the division. The actual number of files per division can be
found in GenBank
Release Notes, under the heading "ORGANIZATION OF DATA FILES".
As well, a short list of those divisions for which the number of
files has changed since the previous release is found under the heading
"Important Changes in Release xxx.x". The filnum file should be edited to
reflect the number of files in the current release.
After you have updated filnum, you can run fil.py by typing
python fil.py
fil.py reads filnum and creates a set of new .fil files in
the current directory. To move them to the parent directory (i.e. $dat/fasta), type
mv *.fil ..
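For readers curious about what this step involves, the following Python sketch shows one way the .fil contents could be derived from filnum. It is not the source of fil.py, which ships with BIRCH. The names parse_filnum and make_fil are hypothetical, and the sketch assumes that a .fil file consists of a '<directory' line followed by one 'filename 0' line per file, as in gbinv.fil (Table 4).

```python
def parse_filnum(text):
    """Read 'division count' pairs, whether on one line or many."""
    tokens = text.split()
    if len(tokens) % 2:
        raise ValueError("filnum must contain division/count pairs")
    return {tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens), 2)}


def make_fil(division, count, db_dir, fmt="0"):
    """Build the text of gb<division>.fil for a division split into count files."""
    lines = [f"<{db_dir}"]
    lines += [f"gb{division}{n}.wrp {fmt}" for n in range(1, count + 1)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    counts = parse_filnum("inv 5 vrt 2")
    for division, count in counts.items():
        print(f"--- gb{division}.fil ---")
        print(make_fil(division, count, "/home/socs/birch/GenBank"), end="")
```

The combined genbank.fil entry in fastgbs is not covered by this sketch.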
Finally, remember to edit fastgbs, using Find/Replace in your
text editor to change the GenBank Release number to the current release.
Configuring GDE to read GenBank
The default FASTA menus for GDE are located in
$dat/GDE/makemenus/menus/Database. These menus only have one database
choice for User-created files. The FASTA menus in
$birch/local/dat/GDE/makemenus/menus/Database have additional menu
choices for each database file listed in $BIRCH/dat/fasta/fastgbs. All
we need to do is to
edit $birch/local/dat/GDE/makemenus/menus/menulist to choose the
local menu item files. This is done by adding lines to menulist, such as
Database
FASTADNA
TFASTA
Now, re-run makemenus.py to update the .GDEmenus files.
Please send suggestions or comments regarding this
page to frist@cc.umanitoba.ca