Chapter 2 Bioconductor Package

2.1 Overview of the BioConductor package

2.2 Installation of BioConductor package

To install BioConductor run the below command

if (!requireNamespace("BiocManager", quitely=TRUE))
     install.packages("BiocManager")
 BiocManager::install("ChemmineR")

2.3 Importing Data in R

Depending on the type of data file, R has a varying methods to load the data. The below list shows how to load different types of files to the R program.

data <- read_csv("path/to/your/file.csv") for csv file
data <- read_excel("path/to/your/file.xlsx", sheet = "Sheet1") for excel file
data <- fread("path/to/your/file.csv") for large csv files.
data <- read_sav("path/to/your/file.sav") for SPSS files.
data <- read_dta("path/to/your/file.dta") for stata files.
data <- read_sas("path/to/your/file.sas7bdat") for SAS files.

2.4 Writing Script of Code in R

In R, the use can interact directly with the console or prepare a script inform of *.R or *.Rmd (R Markdown) files. Script are useful as they allow one to;

easily pick up where they left off on a project and progressively build and refine code and analysis.
easily reproduce analysis run earlier and documentation. This also saves time on repetetive tasks.
do a batch processing such that tasks are automotated instead of running them one by one.
debug the code and find errors line by line.
version control such as git that tracks the changes and enables collaboration with other developers.

A script is a set of commands that usually includes commenting on what each piece of code intended to do

2.5 Data Interface

Follow the following steps when working with data from external interfaces;

Refer CSV working file(filepath) e.g test.csv
Get the current working directory, getwd()
Set the current working directory setwd().

2.5.1 Reading the *.csv file

In this section, we will work with two csv files, test.csv and data.csv. Both files are stored in a directory named data. Lets load the data.csv to the program.

person_info <- read.csv("data/data.csv")

The read.csv() function automatically converts the loaded to a data frame. is.data.frame() is used to find if a variable is a data frame

# just to confirm if the said data is a data frame 
print(is.data.frame(person_info))

## [1] TRUE

The person_info data set is confirmed to be a variable. The ncol() function counts the number of columns/variables/features in a data frame

print(ncol(person_info)) # count the number of columns

## [1] 6

This data frame has 6 columns. Alternatively, the nrow() function counts the number of rows/observations/records.

print(nrow(person_info)) # count the number of records/rows/observations

## [1] 10

This data set has 10 rows and 6 columns.

2.5.2 Brief data analysis

When the function, attach() is called to a data frame, the R program can refer to specific column without associating it with the data frame name. In this section the person_info will be attached to R and dive shallow into data analysis.

Calculate the mean of the salary

# attach the file inorder to avoid the $ sign
attach(person_info)

mean(Salary)

## [1] 65200

The average salary in the data set is 65200. The names() function is called to find the column names of the data set.

# find the column names 
names(person_info)

## [1] "Name"   "Emp.Id" "Gender" "Age"    "Smoke"  "Salary"

Now, lets jump to the protein data which is stored in the test.csv file. The head() function is used to show the first 6 records of the dataset

# Load the data set 
proteindata <- read.csv("data/test.csv")
head(proteindata)

##   SNO  DBNAME PROTEIN_NAME NO_DIMER NO_RESIDUES DEPOSITED_DATE
## 1   1     PDB      TUBULIN        3         349     05-05-2014
## 2   2     PDB      KERATIN        5         450     05-06-2015
## 3   3 UNIPROT   HEMOGLOBIN        7         345     05-07-2017
## 4   4     PIR  CYTOKINEASE       10         950     05-05-2014
## 5   5  EXPASY     PROTEASE        5         493     05-06-2015
## 6   6  ENTREX     TUBULIN1        4         760     05-07-2017

Count the number of rows and columns in the protein dataset

ncol(proteindata)

## [1] 6

nrow(proteindata)

## [1] 10

The data frame has 10 rows and 6 columns. Alternatively, the dim() can be used, it returns the number of rows and columns respectively.

dim(proteindata)

## [1] 10  6

Get the protein detail having the most residues and the max residues value.

objname1 <- max(proteindata$NO_RESIDUES)
objname1

## [1] 950

objname2 <- subset(proteindata, NO_RESIDUES == max(NO_RESIDUES))
objname2

##   SNO DBNAME PROTEIN_NAME NO_DIMER NO_RESIDUES DEPOSITED_DATE
## 4   4    PIR  CYTOKINEASE       10         950     05-05-2014

The max() function is used to find the maximum value in a vector or column. In this code, max(proteindata$NO_RESIDUES) returns the highest number of residues (NO_RESIDUES) found in the proteindata dataset, and this value is stored in objname1.

The subset() function is used to filter rows in a data frame that meet certain conditions. Here, subset(proteindata, NO_RESIDUES == max(NO_RESIDUES)) returns all rows from the proteindata dataset where the NO_RESIDUES column matches the maximum value, and the result is stored in objname2.

Get the protein detail having entry in PDB database

objname3 <- subset(proteindata, DBNAME == "PDB")
print(objname3)

##   SNO DBNAME PROTEIN_NAME NO_DIMER NO_RESIDUES DEPOSITED_DATE
## 1   1    PDB      TUBULIN        3         349     05-05-2014
## 2   2    PDB      KERATIN        5         450     05-06-2015
## 7   7    PDB     TUBULIN2        3         387     05-05-2014

The proteindata dataset is filtered to select rows where the DBNAME column is equal to "PDB". This means it extracts all proteins that are listed in the PDB database. The resulting subset is stored in objname3.

Get the protein detail having more than 3 dimer from PDB database

objname5 <- subset(proteindata, as.Date(DEPOSITED_DATE) > as.Date("2014-05-05"))
print(objname5)

## [1] SNO            DBNAME         PROTEIN_NAME   NO_DIMER       NO_RESIDUES   
## [6] DEPOSITED_DATE
## <0 rows> (or 0-length row.names)

The subset function is used to filter the proteindata dataset to include only rows where DBNAME is "PDB" and DIMER_COUNT is greater than 3.

The filtered data is stored in objname5

Now that we have the basic concept of how R works, lets introduce the Bioconductor package.

2.6 Introduction to Bioconductor

This open-source project provides tools for analyzing and understanding high-throughput genomic data. It includes a broad selection of packages for both bioinformatics and chemoinformatics. Before we explore the ChemmineR package in depth, let’s start with a few basic concepts.

2.6.1 Exploring Genomic data

The GenomicRanges package is used to work with genomic intervals. It installed by:

BiocManager::install("GenomicRanges")

Lets now import the genomic ranges package to the program

library(GenomicRanges)

GRanges objects represent genomic ranges, allowing you to manipulate genomic intervals and align them with genomic data like gene annotations.

# Create a simple genomic range object
gr <- GRanges(seqnames = "chr1", ranges = IRanges(start = 100, end = 200))
gr

## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1   100-200      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

2.6.2 Visualizing Gene Expression Data

The DESeq2 package is used to find the differential gene expression analysis. It is widely used to identify genes that are differentially expressed under different conditions. Its is installed by;

BiocManager::install("DESeq2")

Load the library

library(DESeq2)

DESeq2 handles RNA-Seq data to model and estimate the expression levels of genes, providing statistical tools for identifying differentially expressed genes.

# Example dataset
dds <- makeExampleDESeqDataSet()
dds <- DESeq(dds)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

res <- results(dds)
plotMA(res)

2.6.3 Annotating Genomic Data

The AnnotationDbi and org.Hs.eg.db packages are used to annotate gene identifiers with gene names, descriptions and other relevant metadata. The packages can be installed by;

BiocManager::install("org.Hs.eg.db")

Load the org.Hs.eg.db library.

library(org.Hs.eg.db)

The packages provide a framework for mapping and retrieving biological data from databases. It’s crucial for adding biological context to your data, such as converting gene IDs to gene names

# Example: Map Entrez IDs to gene symbols
gene_ids <- c("1", "2", "3")
symbols <- mapIds(org.Hs.eg.db, keys = gene_ids, column = "SYMBOL", keytype = "ENTREZID")

## 'select()' returned 1:1 mapping between keys and columns

symbols

##       1       2       3 
##  "A1BG"   "A2M" "A2MP1"

2.6.4 Working with Sequence Data

The popular `BioStrings package is used to analyze the protein, RNA and DNA sequences. The library is installed as the ones before.

BiocManager::install("Biostrings")

Load the library

library(Biostrings)

Lets manipulate the DNA String "ATCGGCTA" by reverse complimenting.

# Create a DNAString object
dna_seq <- DNAString("ATCGGCTA")
reverseComplement(dna_seq)

## 8-letter DNAString object
## seq: TAGCCGAT

This Bioconductor is an essential toolkit for anyone working with genomic data. Its packages are designed to facilitate a wide range of bioinformatics tasks, from basic sequence manipulation to complex statistical analyses.