Organisation of large collections of chemical structures for computer searching
Abstract
New techniques for partitioning large files of chemical structures are described that allow searches for whole structures to be concentrated on a small part of the file. Using these techniques, compounds new to the system can be recognised without generating a unique representation of the chemical structure.
The chemical structure file is arranged into molecular formula groups, and the larger groups are subdivided by comparing ordered lists of the atom-bond-atom pairs present in each compound. Where a finer division is required, augmented versions of the atom-bond-atom pairs are used. Analysis of a number of molecular formula groups has shown that even the largest groups can be divided into small subgroups by the augmented pair technique. In the majority of cases, the subgroups contain only one or two compounds.
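The two-level partitioning described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made for illustration only: structures are given as hydrogen-suppressed graphs (a list of element symbols plus `(i, j, bond_order)` tuples), the formula is computed over those listed atoms, and an "augmented" pair is taken to mean a pair in which each atom symbol carries its number of attached neighbours. The abstract does not define the augmentation, so the real scheme may differ. The names `molecular_formula`, `pair_key`, `partition`, and `max_group` are hypothetical.

```python
from collections import Counter, defaultdict


def molecular_formula(atoms):
    """Formula string over the listed atoms, e.g. ['C','C','O'] -> 'C2O1'."""
    counts = Counter(atoms)
    return "".join(f"{el}{n}" for el, n in sorted(counts.items()))


def pair_key(atoms, bonds, augmented=False):
    """Ordered list of atom-bond-atom pairs, usable as a grouping key.

    Assumption: when augmented=True, each atom symbol is suffixed with
    its neighbour count; the source does not specify the augmentation.
    """
    degree = Counter()
    for i, j, _ in bonds:
        degree[i] += 1
        degree[j] += 1

    def label(i):
        return f"{atoms[i]}{degree[i]}" if augmented else atoms[i]

    pairs = []
    for i, j, order in bonds:
        a, b = sorted((label(i), label(j)))  # make the pair direction-free
        pairs.append((a, order, b))
    return tuple(sorted(pairs))


def partition(molecules, max_group=1):
    """Group by formula, then split groups larger than max_group by plain
    pairs, then split any still-large groups by augmented pairs."""
    groups = defaultdict(list)
    for name, (atoms, _) in molecules.items():
        groups[(molecular_formula(atoms),)].append(name)

    for augmented in (False, True):
        refined = defaultdict(list)
        for key, names in groups.items():
            if len(names) <= max_group:
                refined[key].extend(names)  # small enough, leave as is
                continue
            for name in names:
                atoms, bonds = molecules[name]
                sub = pair_key(atoms, bonds, augmented)
                refined[key + (sub,)].append(name)
        groups = refined
    return dict(groups)


# Three isomers sharing one formula (heavy atoms only).
molecules = {
    "1-propanol": (["C", "C", "C", "O"], [(0, 1, 1), (1, 2, 1), (2, 3, 1)]),
    "2-propanol": (["C", "C", "C", "O"], [(0, 1, 1), (1, 2, 1), (1, 3, 1)]),
    "methyl ethyl ether": (["C", "C", "O", "C"], [(0, 1, 1), (1, 2, 1), (2, 3, 1)]),
}

subgroups = partition(molecules)
```

In this example the plain pair lists separate the ether from the two alcohols, which share the same pairs (two C-C and one C-O); only the augmented pairs, under the neighbour-count assumption, distinguish 1-propanol from 2-propanol. A whole-structure search would then need to compare a query only against the members of its own subgroup.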