We want to analyze GSE18789, the GEO study "mRNA expression profiling in adult mouse liver following benzo(a)pyrene (BaP) treatment" [Agilent gene expression array].
Option 1: Go to InSilico DB and start interpreting the data straight away with your favorite analysis tools.
Option 2: Convert the data to a matrix format and use, for example, Excel to interrogate the data.
This is a tutorial for converting the data into a tabular matrix:
1- Download the Series matrix file. This file contains almost all the information except the gene symbols and names.
- To find the gene symbols and names download the raw data here. This link is provided at the bottom of the GEO page.
4- The raw data is a .tar archive that expands into a directory containing the sample gene expression values and the platform information (this is the file we want), each in a separate .gz-compressed file.
The directory should look like
5- Expand the GPL...txt.gz file in that directory and move it to a new sub-directory together with the Series matrix file. In this example I will call my directory GSE18789.
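If you would rather script steps 4 and 5 than do them by hand, they can be sketched in Python with the standard tarfile, gzip and shutil modules. This is only a minimal sketch under assumptions: the helper name unpack_platform_file is hypothetical, and the archive and directory names in the usage comment are guesses that you should replace with whatever you actually downloaded.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_platform_file(tar_path, raw_dir, work_dir):
    """Hypothetical helper: expand a RAW .tar archive into raw_dir, then
    gunzip every GPL*.txt.gz file found there into work_dir.

    Returns the list of expanded platform files.
    """
    raw_dir = Path(raw_dir)
    work_dir = Path(work_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    work_dir.mkdir(parents=True, exist_ok=True)

    # Step 4: expand the .tar archive
    with tarfile.open(tar_path) as archive:
        archive.extractall(raw_dir)

    # Step 5: expand the platform file(s) into the working sub-directory
    expanded = []
    for gz_path in sorted(raw_dir.rglob("GPL*.txt.gz")):
        target = work_dir / gz_path.name[: -len(".gz")]
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        expanded.append(target)
    return expanded


# Usage (file and directory names are assumptions):
# unpack_platform_file("GSE18789_RAW.tar", "GSE18789_RAW", "GSE18789_RAW/GSE18789")
```

The Series matrix file still has to be copied into the same sub-directory, as described in step 5.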
Open a terminal and launch an IPython notebook. Follow these instructions to install IPython. In this example we will be using Python 3.
Alains-MacBook-Pro% cd ~/Downloads/data/GSE18789_RAW/GSE18789
Alains-MacBook-Pro% ipython3 notebook
Click on the "New Notebook" button on the top right of the window. Mine is called Agilent_GSE18789_insertion.
The IPython notebook is great for interactively writing and correcting your scripts as you go.
First we will use Pandas, a great Python library for handling tabular data. Follow the Pandas installation instructions.
Back on the notebook, import pandas:
import os
import re
import pandas as pd
# List all files in the current directory (this is where all the .txt files are)
list_of_files = os.listdir()
# Find the file containing the GPL platform information
gpl_file_path = None
for f in list_of_files:
    match = re.search(r'^(GPL[0-9]+)_.+$', f)
    if match:
        gpl_file_path = match.group(0)
# Read the platform file using the pandas read_csv() function
gpl_df = pd.read_csv(gpl_file_path, index_col='ID', sep='\t', comment='#', header=1)
# Read the Series matrix file, skipping the annotation lines that start with '!'
series_matrix_file_path = 'GSE18789_series_matrix 2.txt'
sdf = pd.read_csv(series_matrix_file_path, encoding='latin1', sep='\t', comment='!', index_col='ID_REF')
# Concatenate both matrices using the index column (.loc replaces .ix, which recent pandas versions no longer provide)
GSE_df = pd.concat([sdf, gpl_df.loc[:, ['GENE_SYMBOL', 'GENE_NAME']]], axis=1)
# Keep the non-null values (missing entries stay as NaN)
GSE_df = GSE_df[pd.notnull(GSE_df[:])]
# Write the csv file (the output file name here is only an example)
GSE_df.to_csv('GSE18789_matrix.csv')
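To see what the concatenation step does, here is a small self-contained sketch on made-up data (the probe IDs, sample names and gene names below are invented, not taken from GSE18789). With axis=1, pd.concat aligns the rows of both tables on their index, so each probe's expression values end up next to its gene symbol and name. The dropna line is an optional extra, not part of the steps above.

```python
import pandas as pd

# Toy expression matrix indexed by probe ID (made-up values)
sdf = pd.DataFrame(
    {"GSM_A": [1.2, 0.4, 2.2], "GSM_B": [1.1, 0.5, 2.0]},
    index=pd.Index([1, 2, 3], name="ID_REF"),
)

# Toy platform table indexed by the same probe IDs (made-up annotations)
gpl_df = pd.DataFrame(
    {
        "GENE_SYMBOL": ["GeneA", "GeneB", None],
        "GENE_NAME": ["gene A", "gene B", None],
        "SPOT_ID": ["s1", "s2", "s3"],
    },
    index=pd.Index([1, 2, 3], name="ID"),
)

# Align both tables on the index and keep only the two annotation columns
merged = pd.concat([sdf, gpl_df.loc[:, ["GENE_SYMBOL", "GENE_NAME"]]], axis=1)

# Optional: keep only the probes that have a gene symbol
annotated = merged.dropna(subset=["GENE_SYMBOL"])

# With no path, to_csv() returns the CSV text; pass a file name to write to disk
csv_text = merged.to_csv()
print(csv_text)
```

Opening the resulting CSV in Excel gives one row per probe, with the sample columns first and the two gene columns last.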
Now we need the annotations. Back in the terminal parse the Series matrix file:
grep '^!' GSE18789_series_matrix\ 2.txt > GSE18789_annotations.txt
Back in the Notebook:
# Skip the first 30 lines of the file, so reading starts at line 31
annot_df = pd.read_csv('GSE18789_annotations.txt', encoding='latin1', sep='\t', skiprows=30, index_col=0)
# Transpose the matrix
annot_df_t = annot_df.transpose()
# Keep only the relevant annotations (.loc replaces .ix, which recent pandas versions no longer provide)
annot_df_t = annot_df_t.loc[:, ['!Sample_geo_accession', '!Sample_source_name_ch1', '!Sample_characteristics_ch1']]
# Write to csv (the output file name here is only an example)
annot_df_t.to_csv('GSE18789_annotations.csv')
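The annotation steps can be tried out on a few made-up lines before running them on the real file. The sketch below is only an illustration: the sample titles, accessions and characteristics are invented, the filtering is done in Python instead of with grep, and only one line is skipped because the toy text has a single series-level line.

```python
import io

import pandas as pd

# A few made-up lines in Series matrix style (tab-separated, '!' prefix)
toy_series_matrix = (
    '!Series_title\t"toy study"\n'
    '!Sample_title\t"control 1"\t"treated 1"\n'
    '!Sample_geo_accession\t"GSM0000001"\t"GSM0000002"\n'
    '!Sample_source_name_ch1\t"liver"\t"liver"\n'
    '!Sample_characteristics_ch1\t"dose: 0"\t"dose: 1"\n'
    '"ID_REF"\t"GSM0000001"\t"GSM0000002"\n'
    '1\t0.5\t0.7\n'
)

# Same idea as the grep step: keep only the lines that start with '!'
annotation_lines = [
    line for line in toy_series_matrix.splitlines() if line.startswith("!")
]

# Skip the single series-level line and use the first column as the index
annot_df = pd.read_csv(
    io.StringIO("\n".join(annotation_lines)),
    sep="\t",
    skiprows=1,
    index_col=0,
)

# Transpose so that each row is a sample and each column an annotation
annot_df_t = annot_df.transpose()

# Keep only the relevant annotations
annot_df_t = annot_df_t.loc[
    :,
    ["!Sample_geo_accession", "!Sample_source_name_ch1", "!Sample_characteristics_ch1"],
]

# With no path, to_csv() returns the CSV text; pass a file name to write to disk
csv_text = annot_df_t.to_csv()
print(csv_text)
```

After the transpose, the sample titles become the row labels, which makes the table easy to line up with the sample columns of the expression matrix.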