8.8 Compound clustering

When large databases such as PubChem or ChemSpider are used for compound annotation, it is common to find many candidate structures for even a single feature. While choosing the right scoring settings can significantly improve their ranking, many candidates of potential interest may still remain. In this situation it may help to perform compound clustering. During this process, all candidates for a feature are clustered hierarchically on the basis of similar chemical structure. From each resulting cluster the maximum common substructure (MCS) can be derived, which represents the largest possible substructure that ‘fits’ in all candidates of that cluster. By visual inspection of the MCS it may be possible to identify likely common structural properties of a feature.

Compound clustering is performed with the makeHCluster() generic function. This function relies heavily on the chemical fingerprinting functionality provided by rcdk.

compounds <- generateCompounds(...) # get our compounds
compsClust <- makeHCluster(compounds)

This function accepts several arguments to fine-tune the clustering process:

  • method: the clustering method (e.g. "complete" (default), "ward.D2"), see ?hclust for options
  • fpType: the fingerprinting type ("extended" by default), see ?get.fingerprint
  • fpSimMethod: the similarity method used to generate the distance matrix ("tanimoto" by default), see ?fp.sim.matrix

For all arguments see the reference manual (?makeHCluster).
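
To make the roles of the similarity method and the clustering method more concrete, the following base R sketch clusters a handful of made-up binary fingerprints by Tanimoto distance. It is only an illustration of the general idea and not how makeHCluster() is implemented: the fingerprint values, the candidate names and the tanimoto() helper are hypothetical, and real fingerprints would be derived from candidate structures with rcdk.

```r
# Toy binary fingerprints: rows are candidates, columns are fingerprint bits.
# These values are invented for illustration only.
fps <- rbind(
    candA = c(1, 1, 1, 0, 0, 0),
    candB = c(1, 1, 0, 0, 0, 0),
    candC = c(0, 0, 0, 1, 1, 1),
    candD = c(0, 0, 1, 1, 1, 1)
)

# Hypothetical helper: Tanimoto similarity is the number of shared 'on' bits
# divided by the number of bits set in either fingerprint.
tanimoto <- function(a, b) sum(a & b) / sum(a | b)

# Build the pairwise similarity matrix
n <- nrow(fps)
sim <- matrix(1, n, n, dimnames = list(rownames(fps), rownames(fps)))
for (i in seq_len(n))
    for (j in seq_len(n))
        sim[i, j] <- tanimoto(fps[i, ], fps[j, ])

# Convert similarities to distances and cluster hierarchically
distMat <- as.dist(1 - sim)
hc <- hclust(distMat, method = "complete")

# Assign the candidates to two clusters
clusters <- cutree(hc, k = 2)
print(clusters)
```

In this toy data, candA and candB share most of their bits, as do candC and candD, so the two pairs end up in separate clusters. Choosing another linkage method (e.g. "ward.D2") changes how the distances between groups of candidates are computed while the pairwise distances stay the same.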

The resulting object is of type compoundsCluster. Several methods are defined to modify and inspect these results:

# plot MCS of first cluster from candidates of M192_R355_191
plotStructure(compsClust, groupName = "M192_R355_191", cluster = 1)

# plot dendrogram
plot(compsClust, groupName = "M192_R355_191")

# re-assign clusters for a feature group
compsClust <- treeCut(compsClust, k = 5, groupName = "M192_R355_191")
# ditto, but automatic cluster determination
compsClust <- treeCutDynamic(compsClust, minModuleSize = 3, groupName = "M192_R355_191")
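
The effect of re-cutting a dendrogram with a fixed number of clusters can be sketched with base R's cutree(). This assumes that the k argument of treeCut() follows the same semantics as in cutree(); the toy one-dimensional values below merely stand in for structural distances between candidates and are not real data.

```r
# Hypothetical data: five 'candidates' placed on a line so that the
# distances between them are easy to follow.
x <- c(a = 1, b = 1.2, c = 5, d = 5.3, e = 10)
hc <- hclust(dist(x), method = "complete")

# Cut by the desired number of clusters...
byK <- cutree(hc, k = 3)

# ...or by a height (distance threshold) in the dendrogram
byH <- cutree(hc, h = 1)

print(byK)
print(byH)
```

Here both cuts give the same grouping (a and b, c and d, e on its own), since the two closest pairs merge at heights well below 1 and all remaining merges happen well above it. A smaller k, or a larger h, would combine these groups further.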

For a complete overview see the reference manual (?compoundsCluster).