Chapter 3 Chemical Properties prediction and visualized.

Structural properties similarity searching
Molecular visualizations

3.1 Whats is ChemmineR?

ChemmineR is an R package under the BioConductor library that is designed for the analysis and visualization of chemical data, especially on molecules and compounds. Below are some of its key features;

  1. Visualization: It provide tools that visualize the chemical structures and theor relationships. This make it easier to interpret and communicate your findings.
  2. Chemical Structure Handling: ChemmineR can read, write, and manipulate chemical structures. These structures are stored in file formats like SDF and SMILES which are the industry standard ways to describe the molecular structure in text format.
  3. Clustering: ChemmineR can identify patterns and relationships between different molecules by grouping the like ones together.
  4. Structure Similarity Search: ChemmineR can compare different chemical structures and find their similarity and how useful they are for certain tasks like identifying compounds that may have similar properties or biological effects.

3.2 Installation

The ChemmineR package can be installed from R with:

if(!requireNamespace("BioManager", quitely=TRUE))
install.packages("BioManager")
BiocManager::install("ChemminerR")

3.3 Loading the package and Documentation

The below command is called to invoke the ChemmineR package to the program

library("ChemmineR") # Loads the package

To find the documentation and list all functions and classes call the command below

library(help="ChemmineR") # Lists all the functions and classes

To open a PDF manual for ChemmineR documentation from R

vignette("ChemmineR")

3.4 Create Instances of SDFset class

SDF(Structure-Data-File) is the standard format for representing the chemical structures. Let me briefly walk you through a practical session on how to work with SDF files before diving deep into ChemmineR package.

  • Load the sdfsample file
    This is a built in dataset in ChemmineR. It is loaded by
data("sdfsample")
  • Explore the SDFset object
    This data type falls under the class SDFset.
class(sdfsample)
## [1] "SDFset"
## attr(,"package")
## [1] "ChemmineR"

Calling the data set by its name provides the summary

sdfsample
## An instance of "SDFset" with 100 molecules

In this case, the sdfsample has 100 molecules. To access a specific molecule, call it by index. For instance,accessing the first molecule;

view(sdfsample[1])
## $CMP1
## An instance of "SDF"
## 
## <<header>>
##                             Molecule_Name 
##                                  "650001" 
##                                    Source 
##                  "  -OEChem-07071010512D" 
##                                   Comment 
##                                        "" 
##                               Counts_Line 
## " 61 64  0     0  0  0  0  0  0999 V2000" 
## 
## <<atomblock>>
##           C1      C2  C3  C5  C6  C7  C8  C9 C10 C11 C12 C13 C14 C15 C16
## O_1   7.0468  0.0839   0   0   0   0   0   0   0   0   0   0   0   0   0
## O_2  12.2708  1.0492   0   0   0   0   0   0   0   0   0   0   0   0   0
## ...      ...     ... ... ... ... ... ... ... ... ... ... ... ... ... ...
## H_60  1.8411 -1.5985   0   0   0   0   0   0   0   0   0   0   0   0   0
## H_61  2.6597 -1.2843   0   0   0   0   0   0   0   0   0   0   0   0   0
## 
## <<bondblock>>
##      C1  C2  C3  C4  C5  C6  C7
## 1     1  16   2   0   0   0   0
## 2     2  23   1   0   0   0   0
## ... ... ... ... ... ... ... ...
## 63   33  60   1   0   0   0   0
## 64   33  61   1   0   0   0   0
## 
## <<datablock>> (33 data items)
##           PUBCHEM_COMPOUND_CID PUBCHEM_COMPOUND_CANONICALIZED 
##                       "650001"                            "1" 
##      PUBCHEM_CACTVS_COMPLEXITY  PUBCHEM_CACTVS_HBOND_ACCEPTOR 
##                          "700"                            "7" 
##                                
##                          "..."

-Get the names of the molecules
To find column names of the sdf data set call the sdfid command.

sdfid(sdfsample)
##   [1] "650001" "650002" "650003" "650004" "650005" "650006" "650007" "650008"
##   [9] "650009" "650010" "650011" "650012" "650013" "650014" "650015" "650016"
##  [17] "650017" "650019" "650020" "650021" "650022" "650023" "650024" "650025"
##  [25] "650026" "650027" "650028" "650029" "650030" "650031" "650032" "650033"
##  [33] "650034" "650035" "650036" "650037" "650038" "650039" "650040" "650041"
##  [41] "650042" "650043" "650044" "650045" "650046" "650047" "650048" "650049"
##  [49] "650050" "650052" "650054" "650056" "650058" "650059" "650060" "650061"
##  [57] "650062" "650063" "650064" "650065" "650066" "650067" "650068" "650069"
##  [65] "650070" "650071" "650072" "650073" "650074" "650075" "650076" "650077"
##  [73] "650078" "650079" "650080" "650081" "650082" "650083" "650085" "650086"
##  [81] "650087" "650088" "650089" "650090" "650091" "650092" "650093" "650094"
##  [89] "650095" "650096" "650097" "650098" "650099" "650100" "650101" "650102"
##  [97] "650103" "650104" "650105" "650106"

The SDF dataset can alsio be converted to SMILES (Simplified Molecular Input Line Entry System) format. However, you need to install the ChemmineOB before converting SMILE format

BiocManager::install("ChemmineOB")

the convert the file

smiles <- sdf2smiles(sdfsample)
head(smiles)
## An instance of "SMIset" with 1 molecules
  • Save and export the SDF data
    The SDF file can be exported to a file locally.
write.SDF(sdfsample, file="data/sdfsample_export.sdf")

3.5 Retrieving atom, bond and data block

Atom block
The atom block provides information about each atom in a molecule, including its coordinates in 3D space and the element symbol. It can be retrieved using ChemmineR by invoking the function atomblock. In this case, we are retrieving the atom block information of the first four molecules.

atomblock(sdfset[1:4])


atomblock-sample

Bond block
The bond block provide the details of the connection between/among atoms and the type of bond. The bond block information can be retrieved by the function bondblock. In this case we are retrieving the bonblock information of the first four molecules

bondblock(sdfset[1:4])


bondblock-sample

Data Block
This function contains the additional information or the metadata found on the SDF file. This metadata is the information connected to the molecular properties for instance the molecular weight and logP.

datablock(sdfset[1:4])
## $`650001`
##           MF           MW            C            H            N            O 
## "C23H28N4O6"   "456.4916"         "23"         "28"          "4"          "6" 
##            S            F           Cl 
##          "0"          "0"          "0" 
## 
## $`650002`
##           MF           MW            C            H            N            O 
## "C18H23N5O3"   "357.4069"         "18"         "23"          "5"          "3" 
##            S            F           Cl 
##          "0"          "0"          "0" 
## 
## $`650003`
##            MF            MW             C             H             N 
## "C18H18N4O3S"    "370.4255"          "18"          "18"           "4" 
##             O             S             F            Cl 
##           "3"           "1"           "0"           "0" 
## 
## $`650004`
##            MF            MW             C             H             N 
## "C21H27N5O5S"    "461.5346"          "21"          "27"           "5" 
##             O             S             F            Cl 
##           "5"           "1"           "0"           "0"

3.5.1 Assigning compound IDS and keeping them unique

Return IDs from SDFset object. They are the compound IDS stored in the data block of each molecule within the SDFset that are used to unique identify each molecule. In this case we will subset and return the first four molecules.

cid(sdfset)[1:4]
## [1] "650001" "650002" "650003" "650004"

Returns IDs from SD file header. The data set might have been combined from other different smaller data sets therefore, the makeUnique() function is used a unique ID to each molecule. It does by appending the suffixes “_1”, “_2” to the compound IDS. This is also used to handle the duplicates.

unique_ids <- makeUnique(sdfid(sdfset))
## [1] "No duplicates detected!"
cid(sdfset) <- unique_ids

3.5.2 Converting the data blocks in an SDFset to a matrix

These data blocks can also be converted to a matrix by

blockmatrix <- datablock2ma(datablocklist = datablock(sdfset))

The matrix can also be split into numeric and character matrix by splitNumChar() function

numchar <- splitNumChar(blockmatrix = blockmatrix)

Slice the numeric matrix

numchar[[1]][1:2,1:2]
##              MW  C
## 650001 456.4916 23
## 650002 357.4069 18

Slice the character matrix

numchar[[2]][1:2, 10:11]

Compute atom frequency matrix, molecular weight and formula The molecular formula for each molecule can be computed by the MF() function and will return an output that is a character of vectors, each entry corresponding to a molecule in the SDFset. MW calculates the molecular weight similar to the MF function where each molecule corresponds to the molecule in the SDFset.

propma <- data.frame(MF=MF(sdfset),
                     MW=MW(sdfset), atomcountMA(sdfset))

propma[1:4,]
##                 MF       MW  C  H N O S F Cl
## 650001  C23H28N4O6 456.4916 23 28 4 6 0 0  0
## 650002  C18H23N5O3 357.4069 18 23 5 3 0 0  0
## 650003 C18H18N4O3S 370.4255 18 18 4 3 1 0  0
## 650004 C21H27N5O5S 461.5346 21 27 5 5 1 0  0

Assign matrix data to data block

datablock(sdfset) <- propma
datablock(sdfset[1]) # call the first compound in the databases 
## $`650001`
##           MF           MW            C            H            N            O 
## "C23H28N4O6"   "456.4916"         "23"         "28"          "4"          "6" 
##            S            F           Cl 
##          "0"          "0"          "0"

3.5.3 String searching in SDFset

Returns the summary view of the matches

grepSDFset("650001", 
           sdfset,
           field="datablock",
           mode="subset")
## $`650001`
## An instance of "SDF"
## 
## <<header>>
##                             Molecule_Name 
##                                  "650001" 
##                                    Source 
##                  "  -OEChem-07071010512D" 
##                                   Comment 
##                                        "" 
##                               Counts_Line 
## " 61 64  0     0  0  0  0  0  0999 V2000" 
## 
## <<atomblock>>
##           C1      C2  C3  C5  C6  C7  C8  C9 C10 C11 C12 C13 C14 C15 C16
## O_1   7.0468  0.0839   0   0   0   0   0   0   0   0   0   0   0   0   0
## O_2  12.2708  1.0492   0   0   0   0   0   0   0   0   0   0   0   0   0
## ...      ...     ... ... ... ... ... ... ... ... ... ... ... ... ... ...
## H_60  1.8411 -1.5985   0   0   0   0   0   0   0   0   0   0   0   0   0
## H_61  2.6597 -1.2843   0   0   0   0   0   0   0   0   0   0   0   0   0
## 
## <<bondblock>>
##      C1  C2  C3  C4  C5  C6  C7
## 1     1  16   2   0   0   0   0
## 2     2  23   1   0   0   0   0
## ... ... ... ... ... ... ... ...
## 63   33  60   1   0   0   0   0
## 64   33  61   1   0   0   0   0
## 
## <<datablock>> (9 data items)
##           MF           MW            C            H              
## "C23H28N4O6"   "456.4916"         "23"         "28"        "..."

The molecules can searched using the grepSDSFset function. The searching is performed by string marching and then returns an SDF summary of the matching entries. In this case we will search the molecule by index.

grepSDFset("650001", 
           sdfset,
           field="datablock",
           mode="index")
## 1 1 1 1 1 1 1 1 1 
## 1 2 3 4 5 6 7 8 9

3.6 Molecular visualization in R

Export SDFset to SDF file locally

write.SDF(sdfset[1:4], file="data/sub.sdf", sig=TRUE)

Plot molecule structure of one or many SDFs

plot(sdfset[1:4], print=FALSE) # plots structures to R graphics device

Compound viewing in web browser

sdf.visualize(sdfset[1:4])

You save the image of the visualization

save.image("nstc.Rdata")

3.6.1 Molecular Property Functions

Different types of atoms vary in their frequency across all molecules in an SDFset. The atomCountMA() is used to compute their frequency.

propma <- atomcountMA(sdfset, addH=FALSE) 
boxplot(propma, col="blue", main="Atom Frequency")

In the above SDFset, Carbon and Hydrogen atoms are the most frequent while Flourine and Chlorine atoms are the least frequent in this sample SDFset. These data is stored in the variable atomprop.

3.6.2 Similarity Comparisons and Searching

3.6.2.1 Maximum Common Substructure(MCS) Searching

ChemmineR provides a functionality/algorithm that can be used to perform a pairwise compound comparison, clustering and similarity searching. This package is just but an add-on package to ChemmineR referred to as fmcsR and can be installed by

BiocManager::install("fmcsR")

To load the library to the console run

library(fmcsR)

We will use a different SDSFset provided by the fmcsR package and compare between caffeine and viagra

 data(fmcstest) # Loads test sdfset object 
 test <- fmcs(fmcstest[1], fmcstest[2], au=2, bu=1) # Searches for MCS with mismatches 
 plotMCS(test) # Plots both query compounds with MCS in color 

3.6.2.2 AP/APset Classes for Storing Atom Pair Descriptors

Atom pair descriptors are a type of molecular fingerprint that capture information about the types of atoms and the distances between them in a molecule, which is useful for comparing and analyzing molecular structures.

ChemmineR and fmcsR libraries has got you covered on this, the function sdf2ap() function is used to convert an SDF data of a molecule into atom pair descriptors. Lets extract the atom pair descriptors of the first molecule.

 ap <- sdf2ap(sdfset[[1]]) # For single compound
 ap
## An instance of "AP"
## <<atom pairs>>
## 52614450304 52615497856 52615514112 52616547456 52616554624 ... length: 528
# Extract for many compounds 
 apset <- sdf2ap(sdfset)

cid(apset[1:4]) # Compound IDs 
ap(apset[1:4]) # Atom pair descriptors 
db.explain(apset[1]) # Return atom pairs in human readable format 
apset
apset