BIRCH Administration

Downloading and Maintaining GenBank

Organization of GenBank flatfile distribution
Automated downloading and installation of GenBank
Configuring FASTA for GenBank searches
Configuring GDE to read GenBank

Organization of GenBank flatfile distribution

The GenBank database is produced by the National Center for Biotechnology Information (NCBI) at the NIH. It is distributed as a set of flatfiles (text files), as described in the GenBank Release Notes. As summarized in Table 1, the sequences are divided among several divisions. Additionally, there are various index files. Note: Index files not used by BIRCH are omitted from the table.

Table 1. GenBank divisions and flatfiles
code	division	file(s)
PRI	primate sequences	gbprixxx.seq.gz
ROD	rodent sequences	gbrodxxx.seq.gz
MAM	other mammalian sequences	gbmamxxx.seq.gz
VRT	other vertebrate sequences	gbvrtxxx.seq.gz
INV	invertebrate sequences	gbinvxxx.seq.gz
PLN	plant, fungal, and algal sequences	gbplnxxx.seq.gz
BCT	bacterial sequences	gbbctxxx.seq.gz
VRL	viral sequences	gbvrlxxx.seq.gz
PHG	bacteriophage sequences	gbphgxxx.seq.gz
SYN	synthetic sequences	gbsynxxx.seq.gz
UNA	unannotated sequences	gbunaxxx.seq.gz
EST	EST sequences (expressed sequence tags)	gbestxxx.seq.gz
PAT	patent sequences	gbpatxxx.seq.gz
STS	STS sequences (sequence tagged sites)	gbstsxxx.seq.gz
GSS	GSS sequences (genome survey sequences)	gbgssxxx.seq.gz
HTG	HTGS sequences (high throughput genomic sequences)	gbhtgxxx.seq.gz
HTC	HTC sequences (high throughput cDNA sequences)	gbhtcxxx.seq.gz
	GenPept - translated proteins from GenBank	genpept.fsa.gz
	Accession number index	gbacc.idx.gz
	Secondary accession number index	gbsec.idx.gz
	Release notes	gbrel.txt

The .gz extension indicates that files are compressed with gzip for faster download. In most cases, each GenBank division is split among several files. For example, as of Release 133.0, the patent division (PAT) was split among gbpat1.seq, gbpat2.seq ... gbpat7.seq. All files can be downloaded by FTP from NCBI and from the mirror sites listed in Table 2. Because of the size of GenBank and the amount of traffic at the NCBI server, it is usually best to use the server geographically closest to your site.

Table 2. FTP sites for GenBank downloads.
Location	URL	directory
USA - NCBI, Bethesda	ftp.ncbi.nih.gov	genbank
Japan	bio-mirror.jp.apan.net	pub/biomirror/genbank
Australia	bio-mirror.au.apan.net	biomirror/genbank
Singapore	bio-mirror.sg.apan.net	biomirrors/genbank
China	bio-mirror.im.ac.cn	genbank
USA - Indiana University	bio-mirror.net	biomirror/genbank
USA - San Diego Supercomputing Center	genbank.sdsc.edu	pub

Automated downloading and installation of GenBank

1. Getting ready

When you install BIRCH for the first time, the directories $BIRCH/GenBank ($GB, $gb, $GENBANK, $genbank) and $BIRCH/GenPept ($GP, $gp) will be created. $GENBANK will contain two files, gbupdate and master.filelist. $GP will be empty.

The most critical consideration is disk space. For example, after reformatting as described below, GenBank Release 133.0 required 78 GB, including all files listed in Table 1. GenBank releases occur every two months, and for many years growth has been steady at about 9 to 10% per release, which compounds over six releases to an annual growth factor of about 1.7. The actual file sizes for each GenBank release are found in the current GenBank Release Notes. Maintaining a local copy of GenBank is feasible only with a very high-speed Internet connection.
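The relationship between per-release growth and annual growth can be checked with a short calculation. The sketch below assumes six releases per year and uses the 78 GB figure for Release 133.0 as a starting point. The function names are illustrative only, and any projection holds only as long as growth stays roughly constant.

```python
RELEASES_PER_YEAR = 6  # one GenBank release every two months


def annual_growth_factor(per_release_growth: float) -> float:
    """Compound a per-release growth rate (e.g. 0.095 for 9.5%) over one year."""
    return (1.0 + per_release_growth) ** RELEASES_PER_YEAR


def projected_size_gb(current_gb: float, per_release_growth: float, releases: int) -> float:
    """Estimate the space needed after a given number of future releases."""
    return current_gb * (1.0 + per_release_growth) ** releases


if __name__ == "__main__":
    for rate in (0.09, 0.10):
        print(f"{rate:.0%} per release -> factor {annual_growth_factor(rate):.2f} per year")
    # Rough space estimate one year (six releases) after Release 133.0
    print(f"Estimated size after one year: {projected_size_gb(78.0, 0.095, 6):.0f} GB")
```

A calculation like this is a rough guide for deciding how much headroom to leave when sizing the GenBank filesystem.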

Where disk space is limiting, it may still be feasible to maintain a partial copy of GenBank. For example, the EST division is by far the largest, accounting for well over half of the entire database. Other big divisions that can easily be omitted are GSS and HTG. Elimination of these divisions will not eliminate any of the major biological taxa, and will still retain all annotated genes.

In some cases, it is useful to keep GenBank in a separate filesystem from the rest of BIRCH. For example, at the U. of M., BIRCH resides in /home/socs/birch, while GenBank resides in /local/genbank. All that is required is to make a symbolic link called $BIRCH/GenBank that points to /local/genbank. One advantage of a separate filesystem is that it simplifies backups. While it is worth backing up $BIRCH, which is relatively small, there is no point in doing backups on $GENBANK, since it can always be restored from the Internet.
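The symbolic link can be made with the usual system tools. As an illustration only, here is a minimal Python sketch of the same step. The function name link_genbank is hypothetical, and the demonstration runs on throwaway temporary directories rather than on a real BIRCH installation.

```python
import os
import tempfile


def link_genbank(birch_dir: str, genbank_dir: str) -> str:
    """Create birch_dir/GenBank as a symbolic link pointing to genbank_dir."""
    link_path = os.path.join(birch_dir, "GenBank")
    if os.path.islink(link_path):
        os.remove(link_path)  # replace a stale link
    elif os.path.exists(link_path):
        # Refuse to clobber a real directory that may hold data.
        raise FileExistsError(f"{link_path} exists and is not a symbolic link")
    os.symlink(genbank_dir, link_path)
    return link_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        birch = os.path.join(tmp, "birch")
        genbank = os.path.join(tmp, "local_genbank")
        os.makedirs(birch)
        os.makedirs(genbank)
        link = link_genbank(birch, genbank)
        print(link, "->", os.readlink(link))
```

On a real system the two arguments would be the BIRCH directory and the separate GenBank filesystem, for example /home/socs/birch and /local/genbank.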

2. Running gbupdate

Table 3. A sample filelist

gbrel.txt.Z
gbacc.idx.gz
gbsec.idx.gz
est
gss
htg
htc
pri
pln
bct
inv
rod
sts
vrl
vrt
mam
pat
syn
phg
una
genpept.fsa.gz

To keep current on when new GenBank releases become available, you should follow the USENET news group bionet.molbio.genbank.

The gbupdate script automates the process of downloading and reformatting some or all divisions of GenBank. Before running this script, you need to set the environment variable $MAILID to your email address, because most anonymous FTP sites request an email address at login. The variable is most easily set in the .cshrc file of the BIRCH administrator.

The file 'filelist' defines which files and divisions to download.

Rules for the filelist file:

Any file with a .gz or .Z file extension will be uncompressed after downloading.
A complete GenBank division can be downloaded and processed by simply putting the 3-letter code found in Table 1 into filelist.
Alternatively, any individual file can be downloaded by putting its name into filelist (e.g., gbpln5.seq.gz).
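The rules above can be sketched as a small classifier. This is not the gbupdate code, just a minimal Python illustration of how a filelist entry might be interpreted. The names classify_entry and read_filelist are hypothetical, and the division codes are taken from Table 1.

```python
# Three-letter division codes from Table 1 (lowercase, as used in filelist).
DIVISION_CODES = {
    "pri", "rod", "mam", "vrt", "inv", "pln", "bct", "vrl", "phg",
    "syn", "una", "est", "pat", "sts", "gss", "htg", "htc",
}
COMPRESSED_SUFFIXES = (".gz", ".Z")


def classify_entry(entry: str) -> dict:
    """Decide whether a filelist line names a whole division or a single file."""
    name = entry.strip()
    if name.lower() in DIVISION_CODES:
        # Division files are distributed as gbXXXn.seq.gz (Table 1).
        return {"kind": "division", "name": name.lower(), "uncompress": True}
    return {
        "kind": "file",
        "name": name,
        "uncompress": name.endswith(COMPRESSED_SUFFIXES),
    }


def read_filelist(text: str) -> list:
    """Classify every non-blank line of a filelist."""
    return [classify_entry(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    for item in read_filelist("gbrel.txt.Z\nest\ngbpln5.seq.gz\n"):
        print(item)
```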

A typical download session

Before running gbupdate, you need to specify which mirror site to use. This is done by editing the gbupdate script, uncommenting the lines for whichever mirror site is geographically closest and commenting out the lines for the others.

We will now show the sequence of events in a typical download session. The file master.filelist is distributed with BIRCH. It is probably safest to copy this to another file called 'filelist' to use as a working copy. To launch gbupdate, move to the GenBank directory and run the script. Terminating the line with '&' makes the command run in the background:

cd $GB
./gbupdate filelist &

The advantage of running gbupdate in the background is that you can log out at any time during the download without interrupting it.

The first file in the list is gbrel.txt.Z, which contains the GenBank release notes. On some mirror sites this file is compressed and is named gbrel.txt.Z. At other FTP sites it is not compressed, so 'gbrel.txt' must be used in filelist.

When the file is received, the sizes of the original file from the FTP server and the file received are listed.

gbrel.txt.Z
ORIGINAL=  58663
RECEIVED=  58663

If these numbers are equal, the name of the file is written to files_received. Otherwise, the name of the file is written to files_missed. gbrel.txt is a special case. After being uncompressed, gbrel.txt is automatically moved to $doc/GenBank. By default, files remain in the $GENBANK directory.
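The size check described above can be illustrated with a short sketch. This is an assumption about the logic, not the actual gbupdate code: the function record_transfer is hypothetical, and only the log file names files_received and files_missed come from the text.

```python
import os
import tempfile


def record_transfer(name: str, original_size: int, local_path: str, log_dir: str = ".") -> bool:
    """Log name to files_received if the local size matches, else to files_missed."""
    received_size = os.path.getsize(local_path) if os.path.exists(local_path) else -1
    ok = received_size == original_size
    log_name = "files_received" if ok else "files_missed"
    with open(os.path.join(log_dir, log_name), "a") as log:
        log.write(name + "\n")
    return ok


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "gbrel.txt.Z")
        with open(path, "wb") as f:
            f.write(b"\0" * 58663)  # stand-in for the downloaded file
        print("complete" if record_transfer("gbrel.txt.Z", 58663, path, tmp) else "incomplete")
```

Comparing sizes catches truncated transfers but not corrupted content, so a file listed in files_received can still fail at the uncompress step.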

Next on the list is gbacc.idx.gz. This file is the accession number index, listing, for each accession number, the LOCUS name and division code (Table 1). This file is uncompressed and remains in the GenBank directory. (gbacc.idx is used by the XYLEM fetch program, which retrieves GenBank entries by ACCESSION number or LOCUS name.) The file gbsec.idx contains secondary accession numbers, as described in the GenBank Release Notes. It is not used by any of the programs in the current BIRCH implementation.

Next on the list are GenBank division names. For each division, gbupdate will find out how many files are contained in the division, and list them to the output. For example, in Release 133.0, there were two files in the vertebrate (vrt) division:

-r--r--r--   1 IUBio    archive  50808692 Jan  6 00:54 gbvrt1.seq.gz
-r--r--r--   1 IUBio    archive  22515849 Jan  6 00:54 gbvrt2.seq.gz
gbvrt1.seq.gz
gbvrt2.seq.gz
Removing file(s) for gbvrt1, if they exist

The full listing of files for this division is written, and then the name of each file is echoed to the output. Before beginning the download, gbupdate removes the current files for this division, if they exist, to make sure that enough space is available.

If a file has the .seq extension, it is assumed to be a sequence file containing GenBank entries. After unzipping the file, the .seq file is split into three files containing annotation, sequence, and an index. Thus, gbvrt1.seq is split into gbvrt1.ano, gbvrt1.wrp and gbvrt1.ind. This is fully described in the documentation for the XYLEM program splitdb. It is strongly recommended that you read this documentation file. The key point is that the annotation files and sequence files can be searched independently, saving a great deal of disk I/O. Thus, fasta or blast only needs to search the .wrp files containing sequence, and does not have to read any of the annotation. When sequence entries are retrieved by fetch, the index (.ind) file is used to find the annotation and sequence for each entry so that the complete entry can be retrieved.
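To make the idea of the three-way split concrete, here is a deliberately simplified Python sketch. It is not splitdb and does not reproduce its file formats, which are defined in the splitdb documentation. The index layout used here (entry name plus line offsets) and the sample entries are invented for illustration.

```python
def split_entries(seq_text: str):
    """Split GenBank flatfile text into annotation, sequence and index text."""
    ano, wrp, ind = [], [], []
    name, in_sequence, residues = None, False, []
    for line in seq_text.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]
            # Hypothetical index: where this entry starts in each output (line offsets).
            ind.append(f"{name} {len(ano)} {len(wrp)}")
            ano.append(line)
            in_sequence, residues = False, []
        elif name is None:
            continue  # skip anything before the first entry (file header)
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line.startswith("//"):
            wrp.append(f">{name}")
            wrp.append("".join(residues))
            ano.append("//")
            name, in_sequence = None, False
        elif in_sequence:
            # Sequence lines look like "        1 gatcctccat ..."; drop the position number.
            residues.extend(line.split()[1:])
        else:
            ano.append(line)
    return "\n".join(ano), "\n".join(wrp), "\n".join(ind)


if __name__ == "__main__":
    sample = (
        "LOCUS       EXAMPLE1  20 bp  DNA\n"
        "DEFINITION  Made-up entry one.\n"
        "ORIGIN\n"
        "        1 gatcctccat atacaacggt\n"
        "//\n"
    )
    annotation, sequence, index = split_entries(sample)
    print(sequence)
```

The point of the sketch is only that a search program reading the sequence output never touches the annotation text, while the index lets a retrieval program find both halves of an entry.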

The vertebrate division shown above is a small division. At the other extreme, the EST division is the largest, consisting of 235 files in GenBank 133.0. Thus, a download of all GenBank divisions will spend the majority of time on the EST division.

The last file listed in master.filelist is genpept.fsa.gz. This file contains amino acid sequences translated from all annotated open reading frames in GenBank. Like the .wrp files produced by splitdb, genpept.fsa is in FASTA format, which can be read by fasta or blast. After unzipping, this file is automatically renamed genpept.wrp and moved to $BIRCH/GenPept.

The two critical factors influencing the time required for a download are the speed of the internet connection and the speed of the filesystem. On our Sun Ultra 60 at the University of Manitoba, using a remotely-mounted NFS fileserver, a complete download of GenBank 133.0 took just under 24 hours.

Because of the system and network resources used, it is best to do a small download before trying to download all of GenBank. For example, if your filelist contained only

gbrel.txt
gbacc.idx
vrt

you would get just the release notes, the accession index, and the vertebrate division. If these were successful, you could put the remaining division codes into filelist and complete the download.

The progress of the download can be monitored in a number of ways. A periodic directory listing of $GENBANK shows which files have arrived so far. 'less files_received' will list the files successfully downloaded. 'top' will show the program currently running: FTP if a file is being downloaded, gunzip if a file is being uncompressed, or splitdb if the file is being split.

When splitdb finishes processing a .seq file, the .ano, .wrp and .ind files are ready for use with no further processing. Thus, when gbupdate is complete all files are ready to use. The only thing remaining is to regenerate the index files used by FASTA, as described in the next section.

Configuring FASTA for GenBank searches

How FASTA finds database files

FASTA reads a list of database files from the file 'fastgbs'. The location of fastgbs is specified by the environment variable $FASTLIBS, which is set to $BIRCH/dat/fasta/fastgbs. A typical fastgbs file is shown below:

PIR   Protein Identification Resource 72.02 $00@/home/socs/birch/dat/fasta/pir.fil 
GenPept GenBank 133.0 CDS translations$01/home/socs/birch/GenPept/genpept.wrp 
GB133 Primate$1P@/home/socs/birch/dat/fasta/gbpri.fil
GB133 Rodent$1R@/home/socs/birch/dat/fasta/gbrod.fil
GB133 other Mammal$1M@/home/socs/birch/dat/fasta/gbmam.fil
GB133 verteBrates$1B@/home/socs/birch/dat/fasta/gbvrt.fil
GB133 Invertebrates$1I@/home/socs/birch/dat/fasta/gbinv.fil
GB133 pLants$1L@/home/socs/birch/dat/fasta/gbpln.fil
GB133 Expressed Sequence Tags$1E@/home/socs/birch/dat/fasta/gbest.fil
GB133 Bacteria$1T@/home/socs/birch/dat/fasta/gbbct.fil
GB133 Viral$1V@/home/socs/birch/dat/fasta/gbvrl.fil
GB133 Phage$1G@/home/socs/birch/dat/fasta/gbphg.fil
GB133 Synthetic$1Y@/home/socs/birch/dat/fasta/gbsyn.fil
GB133 Unannotated$1U@/home/socs/birch/dat/fasta/gbuna.fil
GB133 Patented$1D@/home/socs/birch/dat/fasta/gbpat.fil
GB133 STS$1X@/home/socs/birch/dat/fasta/gbsts.fil
GB133 HTG$1h@/home/socs/birch/dat/fasta/gbhtg.fil
GB133 GSS$1s@/home/socs/birch/dat/fasta/gbgss.fil
GB133 All sequences (VERY long!)$1A@/home/socs/birch/dat/fasta/genbank.fil
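Each line of fastgbs packs several fields together. The sketch below shows one way to take a line apart, based on our reading of the entries above and of the FASTA documentation: the text before '$' is the menu description, the next character is 0 for protein or 1 for DNA, the following character is the selection key, and a leading '@' on the path marks an indirect file of file names. Treat this interpretation as an assumption and check it against fasta3x.asc. The function name is hypothetical.

```python
def parse_fastlibs_line(line: str) -> dict:
    """Split one fastgbs line into its fields (assumed layout, see lead-in)."""
    description, sep, rest = line.rstrip().partition("$")
    if not sep or len(rest) < 3:
        raise ValueError(f"not a library line: {line!r}")
    target = rest[2:]
    return {
        "description": description,
        "type": "DNA" if rest[0] == "1" else "protein",
        "key": rest[1],                      # character typed to select the library
        "indirect": target.startswith("@"),  # True: path is a .fil list of files
        "path": target.lstrip("@"),
    }


if __name__ == "__main__":
    print(parse_fastlibs_line("GB133 Primate$1P@/home/socs/birch/dat/fasta/gbpri.fil"))
```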

For each GenBank division, there is a file with the .fil extension listing the files comprising that division. An example is shown in Table 4.

Table 4. gbinv.fil

</home/socs/birch/GenBank
gbinv1.wrp 0
gbinv2.wrp 0
gbinv3.wrp 0
gbinv4.wrp 0
gbinv5.wrp 0

This file lists the location and names of the 5 files in the Invertebrate division. A complete description of syntax for these files can be found in $BIRCH/doc/fasta/fasta3x.asc.
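As an illustration of how a .fil file maps to actual files, the following sketch expands the contents of Table 4 into full path names. It assumes only what the table shows: a line starting with '<' gives the directory, and each following line gives a file name followed by a number. The trailing number is ignored here; its meaning is covered in fasta3x.asc. The function name expand_fil is hypothetical.

```python
import posixpath


def expand_fil(text: str) -> list:
    """Return the full paths of the database files named in a .fil file."""
    directory = ""
    paths = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("<"):
            directory = line[1:]          # directory for the files that follow
        else:
            filename = line.split()[0]    # drop the trailing number
            paths.append(posixpath.join(directory, filename))
    return paths


if __name__ == "__main__":
    gbinv_fil = "</home/socs/birch/GenBank\ngbinv1.wrp 0\ngbinv2.wrp 0\n"
    for path in expand_fil(gbinv_fil):
        print(path)
```

A check like this, combined with a test that each path exists, is a quick way to confirm that a .fil file matches what gbupdate actually downloaded.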

The problem is that the number of files in some divisions increases with each GenBank release, requiring additional lines to be added to the .fil files. The next section describes how to automatically generate these files for a new GenBank release.

Updating .fil files

cd $dat/fasta/fil

This directory contains a Python script called fil.py and a file called filnum. For GenBank 133.0 filnum looked as shown in Table 5.

Table 5. filnum

est 235
gss 63
htg 57
htc 3 
pri 24
pln 7
bct 6
inv 5 
rod 6
sts 2
vrl 3
vrt 2
mam 1
pat 7 
syn 1
phg 1
una 1

For each GenBank division, a number indicates the number of files in the division. The actual number of files per division can be found in the GenBank Release Notes, under the heading "ORGANIZATION OF DATA FILES". As well, a short list of those divisions for which the number of files has changed since the previous release is found under the heading "Important Changes in Release xxx.x". The filnum file should be edited to reflect the number of files in the current release.

After you have updated filnum, you can run fil.py by typing

python fil.py

fil.py reads filnum and creates a set of new .fil files in the current directory. To move them to the parent directory (i.e., $dat/fasta), type

mv *.fil ..
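For readers curious about what this step involves, here is a minimal sketch of generating .fil contents from filnum. It is not the fil.py distributed with BIRCH, only an illustration built from the formats shown in Tables 4 and 5. The function names are hypothetical, and the combined genbank.fil file is not covered.

```python
import os


def parse_filnum(text: str) -> dict:
    """Read 'division count' pairs, as in Table 5."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            counts[fields[0]] = int(fields[1])
    return counts


def fil_contents(division: str, count: int, genbank_dir: str) -> str:
    """Build the text of one .fil file in the layout of Table 4."""
    lines = [f"<{genbank_dir}"]
    lines += [f"gb{division}{i}.wrp 0" for i in range(1, count + 1)]
    return "\n".join(lines) + "\n"


def write_fil_files(filnum_text: str, genbank_dir: str, out_dir: str = ".") -> list:
    """Write gbXXX.fil for every division in filnum; return the file names."""
    written = []
    for division, count in parse_filnum(filnum_text).items():
        name = f"gb{division}.fil"
        with open(os.path.join(out_dir, name), "w") as out:
            out.write(fil_contents(division, count, genbank_dir))
        written.append(name)
    return written


if __name__ == "__main__":
    print(fil_contents("inv", 5, "/home/socs/birch/GenBank"), end="")
```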

Finally, remember to edit fastgbs, using Find/Replace in your text editor to change the GenBank release number to the current release.
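If you prefer to script the release-number change, a sketch along the following lines may help. It assumes the labels follow the pattern shown in the sample fastgbs above ("GB133 ..." and "GenBank 133.0 ..."). The function name bump_release and the release number 134.0 are examples only.

```python
import re


def bump_release(fastgbs_text: str, old: str, new: str) -> str:
    """Rewrite release labels, e.g. old='133.0', new='134.0'."""
    old_major, new_major = old.split(".")[0], new.split(".")[0]
    # Short labels such as "GB133 Primate"
    text = re.sub(rf"\bGB{re.escape(old_major)}\b", f"GB{new_major}", fastgbs_text)
    # Long labels such as "GenBank 133.0 CDS translations"
    text = re.sub(rf"\bGenBank {re.escape(old)}(?!\d)", f"GenBank {new}", text)
    return text


if __name__ == "__main__":
    line = "GB133 Primate$1P@/home/socs/birch/dat/fasta/gbpri.fil"
    print(bump_release(line, "133.0", "134.0"))
```

Because the patterns match only the labels, path names such as /home/socs/birch/GenBank are left unchanged. It is still worth reviewing the edited file before FASTA uses it.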

Configuring GDE to read GenBank

The default FASTA menus for GDE are located in $dat/GDE/makemenus/menus/Database. These menus only have one database choice for User-created files. The FASTA menus in $birch/local/dat/GDE/makemenus/menus/Database have additional menu choices for each database file listed in $BIRCH/dat/fasta/fastgbs. All we need to do is to edit $birch/local/dat/GDE/makemenus/menus/menulist to choose the local menu item files. This is done by adding lines to menulist, such as

Database
        FASTADNA
        TFASTA

Now, re-run makemenus.py to update the .GDEmenus files.

Please send suggestions or comments regarding this page to frist@cc.umanitoba.ca