How to develop new drugs based on merged datasets

Polymorphs are crystal forms of the same compound that differ in molecular packing arrangement despite having identical chemical compositions. In a recent paper, researchers at GlaxoSmithKline (GSK) and the Cambridge Crystallographic Data Centre (CCDC) combined their proprietary (GSK) and published (CCDC) datasets to better train machine learning (ML) models that predict stable polymorphs for use in new drug candidates.

What are the key differences between the CCDC and GSK datasets?

CCDC curates and maintains the Cambridge Structural Database (CSD). For the past century, scientists all over the world have contributed published, experimental crystal structures to the CSD, which now has over 1.1 million structures. The paper’s authors used a drug subset from the CSD combined with structures from GSK. The GSK structures were collected at different stages of the pharmaceutical pipeline and are not limited to marketed products. Co-author Dr Jason Cole, senior research fellow on CCDC’s research and development team, explained why structures gathered at different stages of the drug discovery pipeline are so important.

"In early-stage drug discovery, a crystal structure can help to rationalize conformational effects, for example, or characterize the chemistry of a new chemical entity where other techniques have led to ambiguity," Cole said. "Later in the process, when a new chemical entity is studied as a candidate molecule, crystal structures are critical as they inform form selection and can later aid in overcoming formulation and tabletting issues."

This information can help researchers prioritize their efforts, saving time and potentially lives down the road.

"By understanding a range of crystal structures, scientists can also assess the risk of a given form being long-term unstable," Cole said. "A full characterization of the structural landscape leads to confidence in taking a form forward."

How do ML models in pharmaceutical science benefit from multiple datasets?

Industrial data sets reflect more than just science; they reflect cultural choices within a given organization.

"You will only find co-crystals if you look for co-crystals," Cole said, as an example. "Most companies prefer to formulate a free, or unbound, drug. One can assume that the types of structures in an industrial set reflect conscious decisions to search for forms of given types, whereas fewer bounds are placed on the researchers who contribute to the CSD."

ML models benefit from two key things: data volume and data specificity. That’s why coupling the volume and variety of data in the CSD with proprietary data sets is so helpful.

"Large amounts of data lead to more confident predictions," Cole said. "Data that are most directly relevant to the problem lead to more accurate predictions. In the predictions that use CCDC software, we select a subset of the most relevant entries that is large enough to give confidence. The GSK set is bound to have highly relevant compounds to other compounds in their commercial portfolio. So the model-building software can use these."

Industrial researchers working with highly relevant data can run into issues when they don’t have enough to generate confident models.

"Consider that CSD software typically picks around two thousand structures from the 1.1 million in the CSD," Cole said. "The industrial set is tiny by comparison, but you could pick, say, 40 or 50 highly relevant structures. You'd have insufficient data to build a good model with that alone, but the added compounds from the CSD supplement the data set. In essence, by including the GSK and CSD sets we get the best of both worlds: all the highly relevant industrial structures and a set of quite relevant CSD structures together to build a high-quality model."
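The selection Cole describes can be illustrated with a minimal sketch. It is not the CCDC software's method: the Jaccard similarity measure, the feature labels, the 0.5 threshold, and the names `jaccard` and `select_training_set` are all assumptions made for illustration. The sketch keeps every highly relevant proprietary entry and tops the set up with the most similar public entries.

```python
from typing import Dict, FrozenSet, List, Tuple

Features = FrozenSet[str]  # hypothetical descriptor labels for one structure


def jaccard(a: Features, b: Features) -> float:
    """Jaccard similarity between two feature sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_training_set(
    query: Features,
    proprietary: Dict[str, Features],
    public: Dict[str, Features],
    public_k: int = 2000,
    proprietary_min_sim: float = 0.5,  # assumed cut-off, not from the article
) -> List[Tuple[str, str, float]]:
    """Return (source, entry_id, similarity) rows to use for model building."""
    # Keep every proprietary entry that is highly relevant to the query.
    rows = [
        ("proprietary", entry_id, jaccard(query, feats))
        for entry_id, feats in proprietary.items()
    ]
    rows = [row for row in rows if row[2] >= proprietary_min_sim]

    # Top up with the k most similar public entries, skipping duplicate ids.
    seen = {row[1] for row in rows}
    ranked = sorted(
        (
            (jaccard(query, feats), entry_id)
            for entry_id, feats in public.items()
            if entry_id not in seen
        ),
        key=lambda pair: (-pair[0], pair[1]),
    )
    rows += [("public", entry_id, sim) for sim, entry_id in ranked[:public_k]]
    return rows


# Toy data: invented entries, far smaller than any real database.
query = frozenset({"amide", "carboxylic_acid", "aromatic_ring"})
proprietary = {
    "P1": frozenset({"amide", "carboxylic_acid", "aromatic_ring"}),
    "P2": frozenset({"amide", "aromatic_ring"}),
    "P3": frozenset({"nitro"}),
}
public = {
    "C1": frozenset({"amide", "carboxylic_acid"}),
    "C2": frozenset({"aromatic_ring"}),
    "C3": frozenset({"sulfonamide"}),
    "C4": frozenset({"amide", "aromatic_ring", "hydroxyl"}),
}

selected = select_training_set(query, proprietary, public, public_k=2)
for source, entry_id, sim in selected:
    print(f"{source:12s} {entry_id}  similarity={sim:.2f}")
```

In this toy run the two relevant proprietary entries are kept and the two closest public entries are added, mirroring on a small scale the idea of supplementing 40 or 50 industrial structures with a larger relevant subset of the CSD.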

Why do polymorphs present a risk to the pharmaceutical industry?

The different packing arrangements mean that one polymorph might be well suited for therapeutic delivery, while another form of the same compound might not. Researchers use crystal structure databases to make knowledge-based predictions about whether a potential new drug exists in a good, stable form that manufacturers can make, store, and deliver in a therapeutic manner. The authors at GSK and CCDC completed a robust analysis of the small-molecule crystal structures containing X-ray diffraction results from GSK and its heritage companies for the past 40 years. They then combined those results with a drug subset of structures from CCDC's CSD, which contains over 1.1 million small-molecule organic and metal-organic crystal structures from researchers all over the world.

Leen N Kalash, Jason C Cole, Royston CB Copley, Colin M Edge, Alexandru A Moldovan, Ghazala Sadiq, Cheryl L Doherty.
First global analysis of the GSK database of small molecule crystal structures.
CrystEngComm, 2021. doi: 10.1039/D1CE00665G
