Homework 5 - Comp 364

Assigned Apr 4, Due Apr 11

Turn in by email to perkins@mcb.mcgill.ca

1. Creating a table and modifying its contents (10 points)

Professor A is running a set of experiments to see how different treatments affect cultured cells. In each experiment, a number of cells are inserted into the culture dish. Then the culture is left alone (the control group, C), or treated with a drug (group D). One day after treatment, each culture is examined and the numbers of normal and abnormal cells are counted. (The drug can do something to the cells, to make them abnormal, but they may also become abnnormal without treatment.) The results of the experiments are this:

Experiment	InitialCells	Treatment	Normal	Abnormal
1		22		C		22	0
2		24		C		22	2
3		18		D		10	8
4		34		C		33	1
5		32		C		28	3
6		29		D		20	9
7		27		D		14	13
8		25		D		10	15
(a) Log in to your mysql database and CREATE a table for storing the data above. Enter the data into the table using one or more INSERT INTO commands. Show the commands you used to do this and show the resulting table.

(b) Perform a SELECT query to check whether or not in each row the total number of normal and abnormal cells equals the initial number of cells. Show the command(s) you use and the results.

(c) Realizing (from part (b)) that there is a discrepancy in experiment 5, imagine that the initial and after-treatment images are reanalyzed. It is found that the abnormal cell count should have been 4, not 3. UPDATE this entry in the table to reflect the correct number. Show the command(s) you use and show the table afterwards.

(d) Professor A has a theory that the control cells should turn abnormal at a rate of 10%, but cells treated with the drug should turn abnormal at a rate of 50%. Add a new column to the table, ExpectedAbnormal, using an ALTER TABLE [table_name] ADD COLUMN [column name] [column type] command. Then, fill in the entries of that column with the appropriate numbers: 10% of InitialCells for the control treatment and 50% of the InitialCells for the drug treatment. Show the all the commands you use, and show the resulting table. Does it look like Professor A is right?

2. Simplified sequence analysis problems (10 points)

To begin, log in to the mysql server and run the sql commands here: MakeTables.sql by copying and pasting them into your prompt. Alternatively, you can save the file to a directory, and then before logging in to the mysql server, run this command at the terminal:

mysql -h mysql.cs.mcgill.ca -p"[password]" 2008Winter364[username] < MakeTables.sql
In this command, [password] should be replaced by your mysql password, and [username] should be replaced by your username. Running this command executes the SQL commands in the file MakeTables.sql. Either way (with the terminal, or pasting in to the mysql prompt) the sequence of commands will create two tables, called GeneInfoHW5 and GeneSeqsHW5, and will insert data into these tables. If all goes well, the tables should look like:
mysql> SELECT * FROM GeneInfoHW5;
+--------+-----------------+-------------+
| GeneID | Organism        | Function    |
+--------+-----------------+-------------+
| E001   | E. Coli         | Signaling   | 
| E002   | E. Coli         | Transport   | 
| E003   | E. Coli         | Signaling   | 
| E004   | E. Coli         | Degradation | 
| E005   | E. Coli         | Motility    | 
| E006   | E. Coli         | Metabolism  | 
| D001   | D. Melanogaster | Development | 
| D002   | D. Melanogaster | Signaling   | 
| D003   | D. Melanogaster | Metabolism  | 
| D004   | D. Melanogaster | Signaling   | 
| D005   | D. Melanogaster | Transport   | 
| B001   | B. Subtilis     | Metabolism  | 
| B002   | B. Subtilis     | Metabolism  | 
| B003   | B. Subtilis     | Signaling   | 
| B004   | B. Subtilis     | Transport   | 
| B005   | B. Subtilis     | Signaling   | 
| B006   | B. Subtilis     | Motility    | 
| B007   | B. Subtilis     | Motility    | 
| B008   | B. Subtilis     | Degradation | 
+--------+-----------------+-------------+
19 rows in set (0.00 sec)

mysql> SELECT * FROM GeneSeqsHW5;
+--------+--------------------------------+
| GeneID | Promoter                       |
+--------+--------------------------------+
| E001   | TCACGCCTTCAGTCTTCGGTTGCAGGACAA | 
| E002   | TGCAGGAACTACTGAGCGGACATAAGCGCT | 
| E003   | TCCACTCAAACATTGGGTCCGGCGTGGTTG | 
| E004   | TATCCACAATTGGGATGCAACCGTAAGCGG | 
| E005   | TGCTAAAAGTAAGGCTTTGAGACTACTGTC | 
| E006   | GTGGACGCCCGACAGGGGAGTCAGGTGACA | 
| D001   | CCAGCAAGAGATTCTAACTTGGCAGTATGG | 
| D002   | GGGCCAAAACTAGACTACTCACTGGGAGAA | 
| D003   | GATTGTGCAATGTAACTACTTCAGCTCCTC | 
| D004   | TAGTTCGGGGGCTATACTACTACATTCACA | 
| D005   | GAGTATACTTCGTCTTGACTACTTCAATGC | 
| B001   | CATCGACCTTCGTTGTGTTTATAACTACTT | 
| B002   | GCTTTGCACTTCCCTTAGACCTCCTTTGCG | 
| B003   | TACCTTACTCTATGCCGTCGCGTCCACTAC | 
| B004   | GCTGATATGCCTCACCGATAGAGATGTTCC | 
| B005   | TACGTACTAAGCAAGTCTCGCGTATGGGCA | 
| B006   | ATATCCACTACTCCAAACTCTACGAGCGGA | 
| B007   | TGACTCTGGGGTACTACTACTCCGAAATCG | 
| B008   | TCCGTCGCGGTCTTATTAGTCACTACGCAA | 
+--------+--------------------------------+
19 rows in set (0.00 sec)

mysql> 

These tables contain information about 19 imaginary genes in three different organisms. The table GeneInfoHW5 contains, for each gene, a unique GeneID, the Organism in which it is found, and the Function that the gene plays. In the second table, GeneSeqsHW5, we find for each gene a unique GeneID and a DNA sequence taken from the promoter of the gene. Your task is to produce one or more SQL commands to answer each of the following questions. Show the SQL commands that you use to get your answer, and show the answers produced by mysql when those commands are executed. (Of course, these tables are small enough that you could answer each of the questions "by eye". This is useful for checking your answers, but try as much as possible to get SQL to do the work for you.)

(a) Professor X is interested in the regulation of these genes, and in particular, is curious about which genes have promoters containing the string 'ACTACT', which represents a binding site for a particular transcription factor Professor X is studying. Generate a list of the GeneID's of genes whose promoters contain this pattern. (Hint: You will need the INSTR function.)

(b) In preparing to study the functions of these genes, Professor X is curious how many different functions appear in the GeneInfoHW5 table, and how often they appear. (For example, 'Signaling' is one of the functions, and it occurs 6 times in the list.) Produce a list of the functions in the table along with how many times each function occurs. (Hint: There are several ways to approach this problem. You may be interested in the COUNT function, and also the GROUP BY clause. Or a SELECT DISTINCT command might help you.)

(c) Professor X wants to know which genes have the 'ACTACT' pattern in their promoter and have a 'Signaling' function. Produce a list of the GeneIDs and Organisms of all such genes.

(d) Professor X wants to know how many genes each organism has that contain the 'ACTACT' pattern in its promoter. Produce a table with this result.

3. Feedback

How long did it take you? Fun / interesting / hard / boring?