In collaboration with AstraZeneca, Bristol-Myers Squibb, DuPont and Pfizer, IBM is providing a database of more than 2.4 million chemical compounds extracted from about 4.7 million patents and 11 million biomedical journal abstracts from 1976 to 2000. The announcement was made at an IBM forum on U.S. economic competitiveness in the 21st century, exploring how private sector innovations and investment can be more easily shared in the public domain.
The publicly available chemical data can be used by researchers worldwide to gain new insights and enable new areas of research. It will help researchers save time by more efficiently finding information buried in millions of pages of patent documents. Access to this data will also allow researchers to analyze far larger sets of documents than the traditional manual process permits, adding a new dimension to the ability to search intellectual property.
The data was extracted using the IBM business analytics and optimization strategic IP insight platform (SIIP), a combination of data and analytics delivered via the IBM SmartCloud and developed by IBM Research in collaboration with several major life sciences organizations. SIIP is a new cloud-driven method for curating and analyzing massive amounts of patents, scientific content and molecular data. It uses techniques such as automated image analysis and enhanced optical recognition of chemical images and symbols to extract information from patents and literature upon publication. Done manually, this task takes weeks or months, but the new technology completes it rapidly.
"Information overload continues to be a challenge in drug discovery and other areas of scientific research," said Steve Heller, project director for the InChI Trust, a non-profit which supports the InChI international standard to represent chemical structures. "Rich data and content is often buried in patents, drawings, figures and scholarly articles. This contribution by IBM and its collaborators will make it easier for researchers to use this data, link to other data using the InChI structure representation and derive new insight."
Over the past six years, several major life sciences organizations have worked on this project with IBM Research, gaining access to a comprehensive chemical library extracted from worldwide patents and scientific abstracts. Public structure extraction tools developed by researchers at the National Institutes of Health were also used successfully in this project.
"The scientific community will receive enormous benefit from this advancement," said Heller. "This is an important addition to the open chemistry data sets. The comprehensiveness of the data and the new ways researchers can look at these data and cross-link to other data associated with each chemical is expected to help with drug development to fight many forms of cancers and other human diseases, as well as the development of other chemical compounds."
The data will be contributed to the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine (NLM), and the Computer-Aided Drug Design (CADD) Group of the National Cancer Institute (NCI) at the National Institutes of Health. It will be incorporated into the NCBI's PubChem, a public resource for the scientific community that serves as an aggregator for scientific results, as well as into NCI CADD Group services such as the Chemical Structure Lookup Service and the Chemical Identifier Resolver.