BIRCH
 Downloading and Maintaining GenBank



Organization of GenBank flatfile distribution

The GenBank database is produced by the National Center for Biotechnology Information (NCBI) at the NIH. It is distributed as a set of flatfiles (text files), as described in the GenBank Release Notes. As summarized in Table 1, the sequences are divided among several divisions. Additionally, there are various index files. Note: Index files not used by BIRCH are omitted from the table.

Table 1. GenBank divisions and flatfiles
code division file(s)
PRI primate sequences gbprixxx.seq.gz
ROD rodent sequences gbrodxxx.seq.gz
MAM other mammalian sequences gbmamxxx.seq.gz
VRT other vertebrate sequences gbvrtxxx.seq.gz
INV invertebrate sequences gbinvxxx.seq.gz
PLN plant, fungal, and algal sequences gbplnxxx.seq.gz
BCT bacterial sequences gbbctxxx.seq.gz
VRL viral sequences gbvrlxxx.seq.gz
PHG bacteriophage sequences gbphgxxx.seq.gz
SYN synthetic sequences gbsynxxx.seq.gz
UNA unannotated sequences gbunaxxx.seq.gz
EST EST sequences (expressed sequence tags) gbestxxx.seq.gz
PAT patent sequences gbpatxxx.seq.gz
STS STS sequences (sequence tagged sites) gbstsxxx.seq.gz
GSS GSS sequences (genome survey sequences) gbgssxxx.seq.gz
HTG HTGS sequences (high throughput genomic sequences) gbhtgxxx.seq.gz
HTC HTC sequences (high throughput cDNA sequences) gbhtcxxx.seq.gz

GenPept - translated proteins from GenBank genpept.fsa.gz
Accession number index gbacc.idx.gz
Secondary accession number index gbsec.idx.gz
Release notes gbrel.txt

The .gz extension indicates that files are compressed with gzip for faster download. In most cases, each GenBank division is split among several files. For example, as of Release 133.0, the patent division (PAT) was split among gbpat1.seq, gbpat2.seq ... gbpat7.seq. All files can be downloaded by FTP from NCBI and from the mirror sites listed in Table 2. Because of the size of GenBank and the amount of traffic at the NCBI server, it is usually best to use the server geographically closest to your site.
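The numbered-file naming scheme described above can be sketched in a few lines of Python. This is only an illustration: the function name division_files is hypothetical, and the file count for each division has to be taken from the release notes for the release you are downloading.

```python
def division_files(code, count, compressed=True):
    """Return the numbered flatfile names for one GenBank division.

    code  -- three-letter division code from Table 1, e.g. "PAT"
    count -- number of files in the division for the release at hand
    """
    suffix = ".seq.gz" if compressed else ".seq"
    return ["gb%s%d%s" % (code.lower(), n, suffix) for n in range(1, count + 1)]


if __name__ == "__main__":
    # Release 133.0 split the patent division among seven files.
    for name in division_files("PAT", 7):
        print(name)
```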

Table 2. FTP sites for GenBank downloads.
Location                               URL                     directory
USA - NCBI, Bethesda                   ftp.ncbi.nih.gov        genbank
Japan                                  bio-mirror.jp.apan.net  pub/biomirror/genbank
Australia                              bio-mirror.au.apan.net  biomirror/genbank
Singapore                              bio-mirror.sg.apan.net  biomirrors/genbank
China                                  bio-mirror.im.ac.cn     genbank
USA - Indiana University               bio-mirror.net          biomirror/genbank
USA - San Diego Supercomputing Center  genbank.sdsc.edu        pub

Automated downloading and installation of GenBank 

1. Getting ready

When you install BIRCH for the first time, the directories $BIRCH/GenBank ($GB, $gb, $GENBANK, $genbank) and $BIRCH/GenPept ($GP, $gp) will be created. $GENBANK will contain two files, gbupdate and master.filelist. $GP will be empty.

The most critical consideration is space. For example, after reformatting as described below, GenBank Release 133.0 required 78 GB including all files listed in Table 1. GenBank releases occur every two months, and for many years growth has been constant at about 9 to 10% per release. Compounded over six releases, this corresponds to roughly a 1.7-fold increase per year. The actual file sizes for each GenBank release are found in the current GenBank Release Notes. Maintaining a local copy of GenBank is therefore only feasible with a very high-speed Internet connection.
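The growth arithmetic can be checked with a short Python sketch. It assumes exactly six releases per year and a constant per-release growth rate; the function names are hypothetical. Growth of 9% and 10% per release compounds to factors of about 1.68 and 1.77 per year, which bracket the figure of about 1.7 quoted above.

```python
RELEASES_PER_YEAR = 6  # one release every two months


def annual_factor(per_release_growth):
    """Compound a per-release growth fraction (e.g. 0.09) over one year."""
    return (1.0 + per_release_growth) ** RELEASES_PER_YEAR


def projected_size(current_gb, per_release_growth, releases_ahead):
    """Estimate the space needed after a given number of further releases."""
    return current_gb * (1.0 + per_release_growth) ** releases_ahead


if __name__ == "__main__":
    for rate in (0.09, 0.10):
        print("%.0f%% per release -> x%.2f per year" % (rate * 100, annual_factor(rate)))
    # Starting point: the 78 GB quoted above for Release 133.0.
    print("One year on at 9.5%%: about %.0f GB" % projected_size(78, 0.095, 6))
```

A projection like this is useful when sizing a filesystem that should hold several future releases, not just the current one.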

Where disk space is limiting, it may still be feasible to maintain a partial copy of GenBank. For example, the EST division is by far the largest, accounting for well over half of the entire database. Other big divisions that can easily be omitted are GSS and HTG. Elimination of these divisions will not eliminate any of the major biological taxa, and will still retain all annotated genes.

In some cases, it is useful to keep GenBank in a separate filesystem from the rest of BIRCH. For example, at the U. of M., BIRCH resides in /home/socs/birch, while GenBank resides in /local/genbank. All that is required is to make a symbolic link called $BIRCH/GenBank that points to /local/genbank. One advantage of a separate filesystem is that it simplifies backups. While it is worth backing up $BIRCH, which is relatively small, there is no point in doing backups on $GENBANK, since it can always be restored from the Internet.
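As a rough illustration of the symbolic-link arrangement, the Python sketch below creates a link named GenBank inside a BIRCH directory. The function name link_genbank is hypothetical, and the demonstration runs in a scratch directory so that nothing real is touched. Because a new BIRCH installation already has a real $BIRCH/GenBank directory containing gbupdate and master.filelist, the sketch refuses to overwrite anything that is not already a link; those files would have to be moved to the new location by hand first.

```python
import os
import tempfile


def link_genbank(birch_dir, genbank_dir):
    """Create birch_dir/GenBank as a symbolic link to genbank_dir."""
    link = os.path.join(birch_dir, "GenBank")
    if os.path.islink(link):
        os.remove(link)          # replace a stale link, never a real directory
    elif os.path.exists(link):
        raise RuntimeError("%s exists and is not a symbolic link" % link)
    os.symlink(genbank_dir, link)
    return link


if __name__ == "__main__":
    # Demonstration in a scratch directory; substitute the real paths,
    # e.g. /home/socs/birch and /local/genbank, on your own system.
    scratch = tempfile.mkdtemp()
    birch = os.path.join(scratch, "birch")
    genbank = os.path.join(scratch, "genbank")
    os.makedirs(birch)
    os.makedirs(genbank)
    print(link_genbank(birch, genbank), "->", genbank)
```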

2. Running gbupdate

Table 3. A sample filelist
gbrel.txt.Z
gbacc.idx.gz
gbsec.idx.gz
est
gss
htg
htc
pri
pln
bct
inv
rod
sts
vrl
vrt
mam
pat
syn
phg
una
genpept.fsa.gz

To keep current on when new GenBank releases become available, you should follow the USENET news group bionet.molbio.genbank.

The gbupdate script automates the process of downloading and reformatting some or all divisions of GenBank. Before running this script, you need to set the environment variable $MAILID to your email address. Most anonymous FTP sites request an email address as the password. $MAILID is most easily set in the .cshrc file of the BIRCH administrator.

The file 'filelist' defines which files and divisions to download.

Rules for the filelist file:
  1. Any file with a .gz or .Z file extension will be uncompressed after downloading. 
  2. A complete GenBank division can be downloaded and processed by simply putting the 3-letter code found in Table 1 into filelist.
  3. Alternatively, any individual file can be downloaded by putting its name into filelist (e.g., gbpln5.seq.gz).
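The three rules can be expressed as a small Python classifier. This is a sketch of the rules as stated here, not the logic of the gbupdate script itself; the names DIVISION_CODES, classify and read_filelist are hypothetical.

```python
# Three-letter division codes from Table 1, in lower case as used in filelist.
DIVISION_CODES = {
    "pri", "rod", "mam", "vrt", "inv", "pln", "bct", "vrl", "phg",
    "syn", "una", "est", "pat", "sts", "gss", "htg", "htc",
}


def classify(entry):
    """Return (kind, needs_uncompress) for one line of filelist."""
    name = entry.strip()
    if name.lower() in DIVISION_CODES:
        return "division", True   # division files are distributed as .seq.gz
    return "file", name.endswith((".gz", ".Z"))


def read_filelist(lines):
    """Classify every non-blank line, preserving download order."""
    return [(line.strip(),) + classify(line) for line in lines if line.strip()]


if __name__ == "__main__":
    sample = ["gbrel.txt.Z", "gbacc.idx.gz", "vrt", "gbpln5.seq.gz", "gbrel.txt"]
    for name, kind, compressed in read_filelist(sample):
        print("%-15s %-9s uncompress=%s" % (name, kind, compressed))
```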

A typical download session

Before running gbupdate, you need to specify which mirror site to use. This is done by editing the gbupdate script so that only the lines for the geographically closest mirror site are left uncommented.

We will now show the sequence of events in a typical download session. The file master.filelist is distributed with BIRCH. It is probably safest to copy this to another file called 'filelist' to use as a working copy. To launch gbupdate, move to the GenBank directory and run it with the name of the filelist. By terminating the line with '&' you can make the command run in the background:
cd $GB
./gbupdate filelist &

The advantage of running gbupdate in the background is that you can log out at any time during the download without interrupting it.

The first file in the list is gbrel.txt.Z, the compressed form of gbrel.txt, which contains the GenBank release notes. On some mirror sites this file is compressed and named gbrel.txt.Z. On other FTP sites it is not compressed, so 'gbrel.txt' must be used in filelist.

When the file is received, the sizes of the original file from the FTP server and the file received are listed.

gbrel.txt.Z
ORIGINAL=  58663
RECEIVED=  58663
If these numbers are equal, the name of the file is written to files_received. Otherwise, the name of the file is written to files_missed. By default, files remain in the $GENBANK directory. gbrel.txt is a special case: after being uncompressed, it is automatically moved to $doc/GenBank.
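The size check can be sketched in Python as follows. This is an illustration of the bookkeeping described above, not the code in gbupdate; the function name record_transfer and its arguments are hypothetical, and the size reported by the FTP server is assumed to be already known.

```python
import os
import tempfile


def record_transfer(name, original_size, local_path, log_dir):
    """Append name to files_received or files_missed in log_dir.

    original_size is the size reported by the FTP server; the received
    size is measured from the local copy.
    """
    received = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    log = "files_received" if received == original_size else "files_missed"
    with open(os.path.join(log_dir, log), "a") as handle:
        handle.write(name + "\n")
    return log


if __name__ == "__main__":
    scratch = tempfile.mkdtemp()
    path = os.path.join(scratch, "gbrel.txt.Z")
    with open(path, "wb") as handle:
        handle.write(b"x" * 58663)      # stand-in for the downloaded file
    print(record_transfer("gbrel.txt.Z", 58663, path, scratch))
```

After a long session, any names in files_missed can be copied into a new filelist and downloaded again.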

Next on the list is gbacc.idx.gz. This file is the accession number index, listing, for each accession number, the LOCUS name and division code (Table 1). This file is uncompressed and remains in the GenBank directory. (gbacc.idx is used by the XYLEM fetch program, which retrieves GenBank entries by ACCESSION number or LOCUS name.) The file gbsec.idx contains secondary accession numbers, as described in the GenBank Release Notes. It is not used by any of the programs in the current BIRCH implementation.

Next on the list are GenBank division names. For each division, gbupdate will find out how many files are contained in the division, and list them to the output. For example, in Release 133.0, there were two files in the vertebrate (vrt) division:
-r--r--r--   1 IUBio    archive  50808692 Jan  6 00:54 gbvrt1.seq.gz
-r--r--r--   1 IUBio    archive  22515849 Jan  6 00:54 gbvrt2.seq.gz
gbvrt1.seq.gz
gbvrt2.seq.gz
Removing file(s) for gbvrt1, if they exist
The full listing of files for this division is written, and then the name of each file is echoed to the output. Before beginning the download, gbupdate will remove the current files for this division, if they exist, as a way of making sure that enough space is available.

If a file contains the .seq extension, it is assumed to be a sequence file containing GenBank entries. After unzipping the file, the .seq file is split into 3 files containing annotation, sequence and an index. Thus, gbvrt1.seq is split into gbvrt1.ano, gbvrt1.wrp and gbvrt1.ind. This is fully described in the documentation for the XYLEM program splitdb. It is strongly recommended that you read this documentation file. The key point is that the annotation files and sequence files can be searched independently, saving a great deal of disk I/O. Thus, fasta or blast would only search the .wrp files containing sequence, and wouldn't have to read any of the annotation. When sequence entries are retrieved by fetch, the index (.ind) file is used to find the annotation and sequence for each entry so that the complete entry can be retrieved.
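To make the idea of the three-way split concrete, here is a deliberately simplified Python sketch. It is not splitdb and does not reproduce the real .ano, .wrp or .ind layouts, which are defined in the splitdb documentation. It assumes only the standard GenBank entry structure (a LOCUS line, an ORIGIN line before the sequence, and a terminating // line); the function name split_entries and the index layout are hypothetical.

```python
import io
import re


def split_entries(seq_handle, ano, wrp, ind):
    """Split GenBank-format entries into annotation, sequence and index.

    ano, wrp and ind are writable text handles. The layout written here
    is illustrative only; see the splitdb documentation for the real one.
    """
    name, in_sequence = None, False
    for line in seq_handle:
        if line.startswith("LOCUS"):
            name, in_sequence = line.split()[1], False
            # Index line: entry name plus its position in each output.
            ind.write("%s %d %d\n" % (name, ano.tell(), wrp.tell()))
            wrp.write(">%s\n" % name)
        if name is None:
            continue                      # text outside any entry is skipped
        if line.startswith("//"):
            ano.write(line)               # keep the terminator with the annotation
            name = None
        elif in_sequence:
            # Drop the base counts and blanks, keeping only the residues.
            wrp.write(re.sub(r"[\d\s]", "", line) + "\n")
        else:
            ano.write(line)
            in_sequence = line.startswith("ORIGIN")


if __name__ == "__main__":
    sample = io.StringIO(
        "LOCUS       TEST1   12 bp    DNA\n"
        "DEFINITION  Example entry.\n"
        "ORIGIN\n"
        "        1 gatcctccat at\n"
        "//\n"
    )
    ano, wrp, ind = io.StringIO(), io.StringIO(), io.StringIO()
    split_entries(sample, ano, wrp, ind)
    print(wrp.getvalue())
```

The point of the design is visible even in this toy version: a search program only has to read the sequence output, while the index records where the matching annotation begins.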

The vertebrate division shown above is a small division. At the other extreme, the EST division is the largest, consisting of 235 files in GenBank 133.0. Thus, a download of all GenBank divisions will spend the majority of its time on the EST division.

The last file listed in master.filelist is genpept.fsa.gz. This file contains amino acid sequences translated from all annotated open reading frames in GenBank. Like the files produced by splitdb, genpept.fsa is also in FASTA format, which can be read by fasta or blast. After unzipping, this file is automatically renamed as genpept.wrp, and moved to $BIRCH/GenPept.

The two critical factors influencing the time required for a download are the speed of the internet connection and the speed of the filesystem. On our Sun Ultra 60 at the University of Manitoba, using a remotely-mounted NFS fileserver, a complete download of GenBank 133.0 took just under 24 hours.

Because of the system and network resources used, it is best to do a small download before trying to download all of GenBank. For example, if your filelist contained only

gbrel.txt
gbacc.idx
vrt


you would get just the release notes, the accession index, and the vertebrate division. If these were successful, you could put the remaining division codes into filelist and complete the download.

The progress of the download can be monitored in a number of ways. A periodic directory listing of $GENBANK will show all files currently in the directory. 'less files_received' will list the files successfully downloaded. 'top' will show the program currently running: ftp if a file is being downloaded, gunzip if a file is being uncompressed, or splitdb if the file is being split.

When splitdb finishes processing a .seq file, the .ano, .wrp and .ind files are ready for use with no further processing. Thus, when gbupdate is complete, all files are ready to use. The only thing remaining is to regenerate the .fil files used by FASTA, as described in the next section.

Configuring FASTA for GenBank searches

How FASTA finds database files

FASTA reads a list of database files from the file 'fastgbs'. The location of fastgbs is specified by the environment variable $FASTLIBS, which is set to $BIRCH/dat/fasta/fastgbs. A typical fastgbs file is shown below:

PIR   Protein Identification Resource 72.02 $00@/home/socs/birch/dat/fasta/pir.fil 
GenPept GenBank 133.0 CDS translations$01/home/socs/birch/GenPept/genpept.wrp
GB133 Primate$1P@/home/socs/birch/dat/fasta/gbpri.fil
GB133 Rodent$1R@/home/socs/birch/dat/fasta/gbrod.fil
GB133 other Mammal$1M@/home/socs/birch/dat/fasta/gbmam.fil
GB133 verteBrates$1B@/home/socs/birch/dat/fasta/gbvrt.fil
GB133 Invertebrates$1I@/home/socs/birch/dat/fasta/gbinv.fil
GB133 pLants$1L@/home/socs/birch/dat/fasta/gbpln.fil
GB133 Expressed Sequence Tags$1E@/home/socs/birch/dat/fasta/gbest.fil
GB133 Bacteria$1T@/home/socs/birch/dat/fasta/gbbct.fil
GB133 Viral$1V@/home/socs/birch/dat/fasta/gbvrl.fil
GB133 Phage$1G@/home/socs/birch/dat/fasta/gbphg.fil
GB133 Synthetic$1Y@/home/socs/birch/dat/fasta/gbsyn.fil
GB133 Unannotated$1U@/home/socs/birch/dat/fasta/gbuna.fil
GB133 Patented$1D@/home/socs/birch/dat/fasta/gbpat.fil
GB133 STS$1X@/home/socs/birch/dat/fasta/gbsts.fil
GB133 HTG$1h@/home/socs/birch/dat/fasta/gbhtg.fil
GB133 GSS$1s@/home/socs/birch/dat/fasta/gbgss.fil
GB133 All sequences (VERY long!)$1A@/home/socs/birch/dat/fasta/genbank.fil
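The layout of each fastgbs line can be read off the sample above: a description, a '$', a one-digit type (0 on the protein rows, 1 on the GenBank rows), a one-character key, an optional '@' marking a .fil list of files, and then a path. The Python sketch below splits a line into those fields. It is inferred from this sample only, and the function name parse_fastgbs_line is hypothetical; the FASTA documentation remains the authority on the format.

```python
def parse_fastgbs_line(line):
    """Split one fastgbs entry into its fields.

    Layout, as read off the sample above:
    description $ type-digit key-character [@] path
    """
    description, _, rest = line.rstrip("\n").partition("$")
    if not description or len(rest) < 3:
        raise ValueError("not a fastgbs entry: %r" % line)
    path = rest[2:]
    is_list = path.startswith("@")      # '@' marks a .fil list of files
    return {
        "description": description.strip(),
        "type": rest[0],                # 0 on the protein rows, 1 on the GenBank rows
        "key": rest[1],                 # single character selecting the library
        "is_list": is_list,
        "path": path[1:].strip() if is_list else path.strip(),
    }


if __name__ == "__main__":
    entry = parse_fastgbs_line("GB133 Primate$1P@/home/socs/birch/dat/fasta/gbpri.fil")
    for field in ("description", "type", "key", "is_list", "path"):
        print("%-12s %s" % (field, entry[field]))
```

A parser like this makes it easy to check, after an update, that every path named in fastgbs actually exists.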

For each GenBank division, there is a file with the .fil extension listing the files comprising that division.  An example is shown in Table 4.
Table 4. gbinv.fil
</home/socs/birch/GenBank
gbinv1.wrp 0
gbinv2.wrp 0
gbinv3.wrp 0
gbinv4.wrp 0
gbinv5.wrp 0

This file lists the location and names of the 5 files in the Invertebrate division.  A complete description of syntax for these files can be found in $BIRCH/doc/fasta/fasta3x.asc.
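As an illustration of how a .fil file such as Table 4 resolves to full file names, the Python sketch below joins the directory given on the '<' line with each file name that follows. It is based only on the example shown here; the function name read_fil is hypothetical, and the trailing number on each line is passed through unchanged as an opaque format code (see fasta3x.asc for its meaning).

```python
import posixpath


def read_fil(lines):
    """Return (path, format_code) pairs for the files named in a .fil list."""
    directory, entries = "", []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("<"):
            directory = line[1:].strip()    # '<' introduces the directory
            continue
        fields = line.split()
        code = fields[1] if len(fields) > 1 else None
        entries.append((posixpath.join(directory, fields[0]), code))
    return entries


if __name__ == "__main__":
    table4 = ["</home/socs/birch/GenBank", "gbinv1.wrp 0", "gbinv2.wrp 0"]
    for path, code in read_fil(table4):
        print(path, code)
```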

The problem is that the number of files increases in some divisions with each GenBank release, requiring additional lines to be added to the .fil files. The next section describes how to automatically generate these files for a new GenBank release.

Updating .fil files

To update the .fil files, first move to the fil directory:

cd $dat/fasta/fil

This directory contains a Python script called fil.py and a file called filnum. For GenBank 133.0 filnum looked as shown in Table 5.

Table 5. filnum
est 235
gss 63
htg 57
htc 3
pri 24
pln 7
bct 6
inv 5
rod 6
sts 2
vrl 3
vrt 2
mam 1
pat 7
syn 1
phg 1
una 1

For each GenBank division, a number indicates the number of files in the division. The actual number of files per division can be found in the GenBank Release Notes, under the heading "ORGANIZATION OF DATA FILES". As well, a short list of those divisions for which the number of files has changed since the previous release is found under the heading "Important Changes in Release xxx.x". filnum should be edited to reflect the number of files in the current release.

After you have updated filnum, you can run fil.py by typing

python fil.py

fil.py reads filnum and creates a set of new .fil files in the current directory. To move them to the parent directory (i.e., $dat/fasta), type

mv *.fil ..
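For readers curious about what this step involves, the Python sketch below generates .fil files in the Table 4 layout from filnum-style lines. It is not the fil.py distributed with BIRCH, whose internals are not shown here; the function names are hypothetical, and the GenBank directory is passed in as a parameter on the assumption that it matches the '<' line of your existing .fil files. The sketch also does not build genbank.fil, the combined list used for the "All sequences" entry in fastgbs.

```python
import os


def read_filnum(lines):
    """Parse 'division count' lines into an ordered list of pairs."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            pairs.append((fields[0].lower(), int(fields[1])))
    return pairs


def fil_text(division, count, genbank_dir):
    """Build the contents of gb<division>.fil in the Table 4 layout."""
    rows = ["<" + genbank_dir]
    rows += ["gb%s%d.wrp 0" % (division, n) for n in range(1, count + 1)]
    return "\n".join(rows) + "\n"


def write_fil_files(filnum_lines, genbank_dir, out_dir):
    """Write one .fil file per division and return the names written."""
    written = []
    for division, count in read_filnum(filnum_lines):
        name = "gb%s.fil" % division
        with open(os.path.join(out_dir, name), "w") as handle:
            handle.write(fil_text(division, count, genbank_dir))
        written.append(name)
    return written


if __name__ == "__main__":
    print(fil_text("inv", 5, "/home/socs/birch/GenBank"))
```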

Finally, remember to edit fastgbs, using Find/Replace in your text editor to change the GenBank release number to the current release.
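If you would rather script the release-number change than do it by hand, a sketch along the following lines may help. It assumes the labels look like those in the sample fastgbs file ('GB133' and '133.0') and that releases are written with a '.0' suffix; the function name bump_release is hypothetical. Check the result before replacing the real fastgbs.

```python
import re


def bump_release(text, old, new):
    """Replace GenBank release labels such as 'GB133' and '133.0'.

    old and new are whole release numbers given as strings, e.g. "133".
    """
    text = re.sub(r"\bGB%s\b" % re.escape(old), "GB" + new, text)
    return re.sub(r"\b%s\.0\b" % re.escape(old), new + ".0", text)


if __name__ == "__main__":
    sample = (
        "GenPept GenBank 133.0 CDS translations$01/home/socs/birch/GenPept/genpept.wrp\n"
        "GB133 Primate$1P@/home/socs/birch/dat/fasta/gbpri.fil\n"
    )
    print(bump_release(sample, "133", "134"))
```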

Configuring GDE to read GenBank

The default FASTA menus for GDE are located in $dat/GDE/makemenus/menus/Database. These menus only have one database choice for User-created files. The FASTA menus in $birch/local/dat/GDE/makemenus/menus/Database have additional menu choices for each database file listed in $BIRCH/dat/fasta/fastgbs. All we need to do is to edit $birch/local/dat/GDE/makemenus/menus/menulist to choose the local menu item files. This is done by adding lines to menulist, such as
Database
FASTADNA
TFASTA

Now, re-run makemenus.py to update the .GDEmenus files.



Please send suggestions or comments regarding this page to frist@cc.umanitoba.ca