Turning the Unknown into Known

Data mining is increasingly used to prospect for rare-disease biology and treatments.

Taken as a whole, rare diseases are not very rare. Even though a rare disease by definition is one that affects fewer than 200,000 Americans or fewer than one in 2,000 Europeans at any time, when rare diseases are considered together, they affect some 350 million people worldwide, or about 5% of the population (Figure 1). What is even more alarming is that 7,800 of the approximately 8,000 known rare diseases have no treatments available. It’s not that rare diseases are harder to treat than more widespread illnesses. Rather, compared to more common disorders, rare diseases simply do not draw the same level of attention from granting agencies, pharmaceutical companies, medical professionals, and researchers, so they languish in the shadows.

Figure 1: Rare diseases affect approximately 350 million people worldwide, more than the entire population of the United States. Here, Lisa Guay-Woodford, M.D., principal investigator at the Clinical and Translational Science Institute at Children’s National, talks with patient Michael Lewis during a 2012 visit. Lewis has the rare disease known as Morquio A syndrome (mucopolysaccharidosis IVA), which has no cure. (Photo courtesy of the National Center for Advancing Translational Sciences.)

That means patients and patients’ families are often the ones scrambling to get information, poring through scientific journals for details about biology and chemistry, and scouring the Internet to find something that will fight the symptoms or the disease itself. This is about to change, according to rare-disease experts. And it’s thanks to big data and new data-mining approaches that are poised to compile and analyze information from obscure and sometimes disparate resources in a much faster and more methodical manner. The objective is to help patients get correct diagnoses earlier and to reveal more treatment options.

Unanswered Questions

Pamela Gavin

Because rare diseases occur in small numbers of people and those people are often spread around the world, it’s little wonder that any given doctor is at a loss to put a name to these illnesses, let alone treat them. After all, the doctor may have never come across these particular suites of symptoms no matter how long he or she has been in practice. “These patients can go for years through a diagnostic labyrinth to find out what’s wrong with them, and in many cases they’re misdiagnosed before the actual accurate diagnosis is achieved,” says Pamela Gavin, chief operating officer at the National Organization for Rare Disorders (NORD) (Figures 2, right, and 3) [1].

A correct diagnosis is important to patients, even if that disease has no approved therapy, she asserts. “It’s just the notion that you’re not crazy, you do have a disease, and this is its name. The unknown is not the unknown anymore, so you can get on to the next phase of your journey and start to galvanize your focus on what resources may be out there to support you or your loved one.” A diagnosis is also critical for the patient’s physician, who will not only have some direction in caring for the patient, but can also contact researchers or clinical experts who have studied the illness or refer patients to clinicians specializing in that disease.

Figure 3: The NORD logo. (Image courtesy of NORD.)

Once the disease is identified, the next challenge is to learn about the cause. For more common diseases, scientists conduct comprehensive studies with large numbers of patients to try to figure out the source, whether it is a gene, molecular pathway, or another factor. With rare diseases, that is much more difficult. For one thing, a sizable patient pool is not typically available. For another, there is usually little funding to pay for research.

Mining for Rare Gems

Fortunately, data mining can help. Today, scientists are extracting rare-disease data from myriad sources, including databases designed to pick and choose relevant information from electronic health records and insurance claim reports, as well as information that comes directly from the patients who either make themselves available to researchers or provide biospecimens for scientific studies through different registries, Gavin explains. “Patients and patient groups also contribute to the research by raising funds or by contributing to the discussion around what are appropriate endpoints for a certain clinical trial design.”

NORD developed a web-based platform to provide a cost-effective tool to foster better diagnostics and reduce misdiagnoses, improve the ability to find patients to participate in clinical trials, and enhance understanding of the impact of medical treatments on rare diseases. Gavin notes that the platform allows any entity, including patient organizations and clinical coordinators at academic institutions, to launch a natural-history study under an Institutional Review Board-approved protocol and through a series of consent forms, so that the entity can attract patients from across the country to participate in studies remotely.

“Through a collaborative grant we received last year, we have worked with the U.S. Food and Drug Administration to launch 20 natural-history studies this fall with patient organizations on diseases that have unmet medical needs, which means there’s no approved therapy,” she continues. These studies are designed to put a spotlight on disorders that currently receive very limited funding and draw almost no attention from the research community.

Noel Southall, Ph.D.

Other data-mining initiatives are under way. One is the International Rare Diseases Research Consortium (IRDiRC), which brings together researchers, research-funding agencies, patients, and patient advocates [2]. Formally established in 2011, IRDiRC has developed a two-part goal it plans to reach by the year 2020: 1) to deliver 200 new rare-disease therapies and 2) to generate the ability to diagnose most rare diseases. “We hope to identify what the bottlenecks are in developing diagnostics and therapeutics for the rare-disease community,” explains Noel Southall, Ph.D., a member of the steering committee of the IRDiRC Data Mining/Repurposing Task Force (the steering committee includes eight experts from international institutions and organizations) and an informatics scientist with the National Center for Advancing Translational Sciences at the National Institutes of Health (Figure 4, right). “Besides understanding business models for supporting the development of new products, that means looking at the basic-research side and how new technologies and approaches can really be leveraged to specific needs for the patient community.”

Making Sense of the Data

At this point, data mining is on the verge of shedding significant light on rare diseases and treatments, but a clear success story has yet to emerge, Southall acknowledges. “So far, it’s more the case that we find out something by other means, and after the fact, we can see that if we had looked at the data we already have, we could have come to the same conclusion. The question is, then, whether we can actually do that prospectively rather than in hindsight.”

Part of the issue is that the data are only now becoming truly accessible and useful. “It’s not like the information was available in some database and all we had to do was perform the right search on it,” he says. Rather, the data have to be indexed in such a way that a researcher can easily identify an individual case where a patient responded to a certain treatment, find out who the physician was, contact that physician, and learn the full story. Or, conversely, it has to be set up so that a physician can send word to the biological community about a patient’s response and ask whether it jibes with anything the biologists have seen in their work.

“Fortunately, those sorts of connections are being made more robustly now,” Southall explains, “because the data are being indexed better and are more accessible to more people.” Similar efforts are proceeding to develop and index patient registries, so the patients themselves can share their personal data. “The task force is very interested in trying to work with partners who can provide that kind of data so we can see what kinds of insights we can get out of it and move things forward,” he continues.
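The kind of indexing Southall describes can be pictured as a lookup from a treatment to the cases in which a patient responded, along with the physician to contact. The following is a minimal sketch of that idea only, not a description of any system the task force uses; the case records, field names, and the build_response_index function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical case reports; every field and value is invented for illustration.
cases = [
    {"case_id": "c1", "treatment": "drug_a", "responded": True, "physician": "dr_one"},
    {"case_id": "c2", "treatment": "drug_a", "responded": False, "physician": "dr_two"},
    {"case_id": "c3", "treatment": "drug_b", "responded": True, "physician": "dr_one"},
]


def build_response_index(case_list):
    """Index responding cases by treatment so a researcher can find whom to contact."""
    index = defaultdict(list)
    for case in case_list:
        if case["responded"]:
            index[case["treatment"]].append((case["case_id"], case["physician"]))
    return index


index = build_response_index(cases)
print(index["drug_a"])  # responding cases for drug_a, with the treating physician
```

A real registry would also need consent tracking and de-identification, which this sketch leaves out.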

Figure 5: The American Association for Cancer Research launched an international data-sharing initiative, Project GENIE, in November 2015. (Image courtesy of the American Association for Cancer Research.)

At the same time, the American Association for Cancer Research launched an international data-sharing initiative in November 2015 called Project Genomics, Evidence, Neoplasia, Information Exchange (GENIE) (Figure 5) [3]. Project GENIE is developing a registry that gathers clinical-grade cancer genomic data and links the data to clinical outcomes. “This is needed because we have this amazing proliferation of genomic data on patients’ tumors, including rare tumors, but the proliferation is ahead of our ability to react to the data,” says Charles Sawyers, M.D., Project GENIE’s steering committee chair and chair of the Human Oncology and Pathogenesis Program at the Memorial Sloan Kettering Cancer Center (Figure 6, below right). “We just don’t know what it all means.”

Charles Sawyers, M.D.

The eight cancer centers in the United States, Europe, and Canada that are part of the Project GENIE consortium have invested in the data-collecting technology because they know these data are going to be incredibly valuable someday, and they have decided to compile all of that sequencing data into one central database, Sawyers explains. “The consortium reasoned that by pooling the data together, we would have greater numbers of patients with rare variants, and we could then track down what happened to them and start to compile the disease registry information that you need to tell patients and their clinicians what mutations are responsible and what we know about their illness.”
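The arithmetic behind pooling is simple: a variant seen in only a handful of patients at any one center may reach a usable cohort size once the counts are combined. The sketch below illustrates that reasoning under stated assumptions. The center names, variant names, patient counts, and the MIN_COHORT threshold are all invented, and the code does not reflect how Project GENIE actually stores or analyzes its data.

```python
from collections import Counter

# Hypothetical per-center counts of patients carrying each variant.
center_counts = {
    "center_a": {"VARIANT_X": 3, "VARIANT_Y": 120},
    "center_b": {"VARIANT_X": 2, "VARIANT_Y": 95},
    "center_c": {"VARIANT_X": 4, "VARIANT_Z": 1},
}

MIN_COHORT = 8  # assumed minimum number of patients for a usable cohort


def pool_counts(per_center):
    """Sum per-variant patient counts across all centers."""
    pooled = Counter()
    for counts in per_center.values():
        pooled.update(counts)
    return pooled


def newly_viable(per_center, min_cohort):
    """Return variants that reach min_cohort only after pooling."""
    pooled = pool_counts(per_center)
    viable = []
    for variant, total in pooled.items():
        best_single = max(c.get(variant, 0) for c in per_center.values())
        if total >= min_cohort and best_single < min_cohort:
            viable.append(variant)
    return sorted(viable)


pooled = pool_counts(center_counts)
print(pooled["VARIANT_X"])
print(newly_viable(center_counts, MIN_COHORT))
```

In this toy data, VARIANT_X never exceeds four patients at a single center but totals nine when pooled, which is the situation the consortium describes.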

The eight founding Project GENIE consortium members are

  • the Center for Personalized Cancer Treatment/The Netherlands Cancer Center, Amsterdam
  • the Dana-Farber Cancer Institute, Boston, Massachusetts
  • the Institut Gustave Roussy, Villejuif, France
  • Johns Hopkins University’s Sidney Kimmel Comprehensive Cancer Center, Baltimore, Maryland
  • the Memorial Sloan Kettering Cancer Center, New York City
  • the Princess Margaret Cancer Center, Toronto, Canada
  • the University of Texas M.D. Anderson Cancer Center, Houston
  • the Vanderbilt-Ingram Cancer Center, Nashville, Tennessee.

Informatics partners include Sage Bionetworks of Seattle, Washington, and cBioPortal of New York. As of October 2016, the database had already accumulated genomic data on approximately 20,000 patients.

Sawyers admits that such a publicly available database is not free of hurdles. “There are complications associated with sharing clinical data from hospital systems that have pretty serious privacy and legal concerns, not to mention the technological challenges of electronic medical record systems.” To address those issues, each participating medical institution has a Project GENIE gatekeeper behind its own firewall. This gatekeeper releases data to conform to the institution’s privacy and legal guidelines.

It works like this: researchers at the eight institutions make their case with concise project descriptions, the Project GENIE steering committee and subcommittees sift through the proposals and prioritize them, and Project GENIE headquarters then sends out a request for study data to the relevant institutions. As Sawyers explains, “We’ll tell the institute that we’re interested in a certain mutation, stating that we found patients at your institution with the mutation and would like this specific information. Then, the GENIE person who is already behind the firewall can pull up the data and send it right off.” With that information in hand, researchers can assemble control groups and begin studies.
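The gatekeeper step can be pictured roughly as follows: the request names a mutation and the fields wanted, and the gatekeeper returns only the fields the institution permits for the matching patients. This is a hypothetical illustration of that pattern, not Project GENIE's software. The Gatekeeper class, its fulfil method, and every record and field name are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Gatekeeper:
    """Hypothetical stand-in for the gatekeeper behind one institution's firewall."""
    institution: str
    records: list  # local patient records, never handed out wholesale
    releasable_fields: set = field(default_factory=set)

    def fulfil(self, mutation, requested_fields):
        """Return only institution-approved fields for patients with the mutation."""
        allowed = [f for f in requested_fields if f in self.releasable_fields]
        return [
            {f: rec[f] for f in allowed}
            for rec in self.records
            if mutation in rec["mutations"]
        ]


# Invented records for illustration only.
site = Gatekeeper(
    institution="example_center",
    records=[
        {"name": "redacted-1", "mutations": {"MUT_A"}, "outcome": "responded"},
        {"name": "redacted-2", "mutations": {"MUT_B"}, "outcome": "progressed"},
    ],
    releasable_fields={"outcome"},
)

# The request asks for a name as well, but only the approved field is released.
result = site.fulfil("MUT_A", ["outcome", "name"])
print(result)
```

Keeping the filter on the institution's side means each site can apply its own privacy and legal rules without the central office ever seeing the full record.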

The model already seems to be working. Several research projects are in progress, including two on rare alleles, according to Sawyers. “We have enough patients in the registry so that we will be able to show how well they do on the current standard of care, and we are planning to release the results of those studies at our annual meeting in the spring.”

Genomic data on cancer patients will only increase in abundance as healthcare professionals realize their importance for personalizing treatments and getting the best outcomes. Several studies have documented that understanding the mutations in patients with lung cancer and melanoma can help oncologists select more effective therapies for a large percentage of patients. By sequencing all cancer patients, which Sawyers hopes will soon be the norm, those with rare diseases will reap similar benefits. “For now, though, no single center can make a compelling case that it has saved a high number of lives by sequencing all cancer patients. But with the larger quantity afforded through Project GENIE, we could potentially show better outcomes to, for instance, 100 to 200 such patients across our eight participating centers in different countries. This would be a much stronger evidence base to say that, look, we have to do this for all our patients.”

For now, the initial goal of Project GENIE is the public release of the clinical sequencing data from the eight participating institutions. “It has taken us this first year to get all the data collated and harmonized and to get all the definitions of different subtypes of diseases coordinated because they are coded differently across hospitals and sometimes even within hospitals,” Sawyers says. “But we plan to release all of the data prior to the end of 2016, so anyone in the world can log on and look at the genomic and baseline clinical information on all of these patients.”

The next milestone for Project GENIE is to expand its membership. Sawyers remarks, “There’s a lot of interest now, much more so than there was when we started talking about this idea two years ago. At first, it was a coalition of the willing, but now we have a lot of people knocking on the doors and asking if they can come in.”

Same Drug, New Use

One especially provocative application of data mining is the repurposing of drugs, or finding drugs that are already approved for one use and applying them to another. “For example, our lab has been working on repurposing a drug called Auranofin,” Southall explains. “It was originally approved for rheumatoid arthritis … and we found through random screening [that it] had an effect on a type of cancer that we were studying. When we went back to read the primary literature on this drug, it became obvious to us that there were already a lot of data on how it was inhibiting pathways that could be very useful against this specific cancer.”

Through the right data-mining approaches, such discoveries could become commonplace. Southall points to a series of recent papers suggesting that data mining would be useful for pulling a wide variety of relevant information from drug labels alone. Drug labels note not only the medication’s targeted action but also the side effects, both of which could have implications for other diseases. For instance, a drug may include neutropenia (a reduction in certain white blood cells) as a deleterious side effect. A patient with another disease, however, may benefit from that “side effect,” Southall explains.
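The label-mining idea can be reduced to a matching problem: list the side effects each label reports, list the physiological effects that might help in each disease, and look for overlaps. The sketch below shows only that matching step. The drug names, disease names, and the repurposing_leads function are hypothetical, the data are invented, and a real pipeline would first have to extract and normalize terms from label text, which is not shown here.

```python
# Hypothetical label excerpts: drug name -> side effects listed on its label.
label_side_effects = {
    "drug_a": {"neutropenia", "nausea"},
    "drug_b": {"headache"},
    "drug_c": {"neutropenia"},
}

# Hypothetical hypotheses: disease -> effects that might be therapeutic there.
beneficial_effects = {
    "disease_x": {"neutropenia"},
    "disease_y": {"weight gain"},
}


def repurposing_leads(labels, wanted):
    """Map each disease to drugs whose labelled side effects overlap its wanted effects."""
    leads = {}
    for disease, effects in wanted.items():
        matches = sorted(drug for drug, listed in labels.items() if listed & effects)
        if matches:
            leads[disease] = matches
    return leads


leads = repurposing_leads(label_side_effects, beneficial_effects)
print(leads)
```

Any match found this way is only a lead for further study, not evidence that the drug would work or be safe in the new setting.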

One company that is working on data mining and analytics for drug repurposing is Healx (pronounced like “helix”), which bills itself as “a tech startup with a social mission.” Based in Cambridge, England, it spun out of technology originating from Cambridge University. “What we’re doing is helping rare-disease charities and patient groups to identify existing drugs that can help treat their diseases, and we’re doing it through a technology platform that combines machine learning, data analytics, and bioinformatics,” says Tim Guilliams, Ph.D., chief executive officer of Healx (Figure 7) [4]. Specifically, the company uses its Rarepurposing model, which integrates different data types, including transcriptomic, proteomic, or other omic data, along with natural language processing. “Then we employ machine-learning algorithms to mine data to find new links between diseases and existing drugs that have already gone through the approval process for other diseases,” he explains. “Patients and their family members play an important role, too, as they help accelerate research and provide valuable information about the disease symptoms and treatments currently prescribed to the patients.”

Figure 7: Tim Guilliams, chief executive officer of Healx, explains how the company integrates different data types and employs machine-learning algorithms with the goal of repurposing existing drugs as new applications for rare diseases. (Photo courtesy of Healx.)

Healx has completed proof-of-concept demonstrations that have revealed promising drug candidates for a number of diseases, including a variety of cancers and orphan diseases such as CDKL5 deficiency disorder. In addition, as not every patient responds equally well to a given treatment, “it is also important to look at drug response of individuals,” says Guilliams. “The company has successfully developed a drug response signature for a drug from Johnson & Johnson in a rare cancer called multiple myeloma. It revealed how likely patients would respond positively to that particular medication. Taken further, this approach could potentially help determine which first-in-line treatment any patient should receive,” he adds. So far, Healx is working directly with charities and patient groups and is also beginning to partner with biotech and pharma companies that want to collaborate on rare-disease repurposing or use Healx’s technology themselves.

Making a Difference

“Data mining is beginning to shed light on rare diseases that have spent too long in the dark,” Guilliams contends. “With 350 million people affected by rare diseases, it’s an enormous issue, especially since fewer than 5% of rare-disease patients have an approved treatment.” For Healx, the opportunity to work directly with patient communities has brought home those statistics. “There is no one really interested in their rare diseases and there are no drugs approved, so the patients and their parents are setting up organizations, raising money, starting clinical trials, and it’s great to be working with such inspirational people,” he remarks. “Our technology can speed up this process and make it significantly cheaper because, by repurposing existing drugs, we can shortcut the drug-discovery process. The drug is already approved, so we know it’s safe, which means we can very quickly get the drug to patients and without the cost of developing a completely new drug.”

Insights into rare diseases can have even broader implications, according to Southall. “While our fundamental goal at IRDiRC is to leverage the data we already have to identify new treatments for rare-disease patients, we will also be learning a lot about common diseases through that work.” He points to the example of a lysosomal disorder called Gaucher disease, which manifests in the spleen and liver. “We have developed small molecules that help alleviate the molecular phenotype of patient-affected cells for Gaucher disease, and while doing that work, we’ve also tested them in a model of Parkinson’s disease, which is a neurodegenerative disorder, and seen that it has a curative effect there as well. That has spawned some novel research into Parkinson’s disease treatments.”

Drug discovery will indeed benefit tremendously from data mining, Sawyers comments. “The drug-development sector will have access to much more information about potential drug targets, and it will not only know the size of the patient population but also how quickly it can put a patient trial together to get a drug approved.” He adds, “Databases and data mining work will begin to have an effect almost immediately and will certainly have a noticeable impact in a year or two.”

While these data-mining approaches may be in their early stages, many experts agree that they offer a new and encouraging path forward, as well as a view of the future that includes faster diagnoses and treatment options for the hundreds of millions of people with rare diseases. Guilliams observes, “Data mining is a way to help power personalized medicine for rare diseases and drive the development of medicines for patients in need.”


  1. National Organization for Rare Disorders (NORD). [Online].
  2. International Rare Diseases Research Consortium. [Online].
  3. American Association for Cancer Research (AACR) Project GENIE. [Online].
  4. Healx. [Online].