A Series of Fortunate Events
Isidore Rigoutsos, PhD, sees complexity not as a barrier to understanding the genome, but as an essential ingredient
By Zach Nichols
Isidore Rigoutsos is in Athens, sitting in a lecture hall waiting for a physics class to begin.
A man walks in and announces that their professor is ill and that he was asked to give a demonstration of a new technology he studied while completing his PhD in France. The substitute rolls in a cart covered in a tangle of wires and topped with a screen. “This is a computer that I built,” he says, and writes “Good morning” on the screen to the amazement of everyone present.
Rigoutsos is so smitten that after finishing his physics degree, he decides to pursue graduate degrees in computer science in the United States.
Deeply versed in the hidden potential of machines, Isidore Rigoutsos, PhD, the Richard W. Hevner Professor of Computational Medicine and founding director of the Jefferson Computational Medicine Center, has done work in everything from computer vision to simulating airflow over an airplane wing. In the ’90s, he became involved with the emerging field of computational biology and co-founded IBM’s Computational Biology Center.
Among his important contributions to the field is Teiresias. Named for the blind seer of Greek myth, it’s an algorithm he and one of his PhD students created in 1996 to help spot recurring patterns in long strings of letters. It did this, ingeniously, by breaking the motifs down into component patterns, greatly reducing the amount of time needed to complete a task.
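The published algorithm is considerably more sophisticated, but its two-phase idea, scan for short "elementary" patterns that recur often enough, then convolve overlapping ones into longer motifs, can be sketched in a few lines of Python. Everything below, from the function names to the toy sequences and thresholds, is illustrative rather than Teiresias itself:

```python
from collections import defaultdict

def elementary_patterns(sequences, length, min_support):
    """Scanning phase: collect every substring of a given length
    that recurs at least `min_support` times across the input."""
    counts = defaultdict(int)
    for seq in sequences:
        for i in range(len(seq) - length + 1):
            counts[seq[i:i + length]] += 1
    return {p for p, c in counts.items() if c >= min_support}

def convolve(patterns, sequences, min_support):
    """Convolution phase (simplified): glue two elementary patterns
    whose suffix/prefix overlap, keeping only glued patterns that
    still meet the support threshold in the original sequences."""
    text = "\x00".join(sequences)  # separator that never matches
    grown = set()
    for a in patterns:
        for b in patterns:
            if a[1:] == b[:-1]:  # overlap everywhere but the ends
                candidate = a + b[-1]
                occurrences = sum(
                    text[i:i + len(candidate)] == candidate
                    for i in range(len(text)))
                if occurrences >= min_support:
                    grown.add(candidate)
    return grown

seqs = ["GATTACAGATTACA", "TTGATTACATT"]
elem = elementary_patterns(seqs, 3, 3)
longer = convolve(elem, seqs, 3)
```

On the toy input, the scan keeps the five 3-letter patterns that occur at least three times, and the convolution step grows them into 4-letter motifs such as GATT and TACA: the divide-and-recombine trick that keeps the search tractable.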
Today, he is one of the people providing vital proofs of concept on the value of embracing complexity as a way of seeing the world. We often say “it’s complicated” or “it’s very complex” as a way of shrugging at an excess of details. For Rigoutsos, by contrast, complexity is no reason to shrug: he understands the world not in spite of its complexity but because of it.
Rigoutsos (left) meets with his team.
“Complexity is a measure of how many distinct things you need to account for to make sense of ‘it,’” he says, “whatever ‘it’ is.” We tend to think in straight lines: cause followed neatly, if somewhat later, by effect. This isn’t wrong, but it doesn’t tell the whole story, especially at scales much different from those in our day-to-day experience.
This is true of the sciences, too, including genetics. After the first draft of the human genome was published in 2000, scientists believed that the relevant parts of our DNA were the genes, the pieces that code for proteins and account for just two percent of the double helix. The rest of the human genome’s six billion base pairs—composed of combinations of the molecules adenine, thymine, cytosine, and guanine—was deemed “junk DNA,” a relic of our evolutionary history.
Rigoutsos’ work is motivated in part by a modest premise—that DNA is not so wasteful. “The prevailing understanding is that nature is parsimonious,” he points out. “Everybody says that, and why wouldn’t it be the case for the human genome?” From a biological standpoint, there is an energy cost to keeping all that junk around and, while some of it may certainly be noise, it is possible, probable even, that the rest does something.
He speaks with reverence about DNA’s ability to compress and express extremely complex relationships within just four repeating base molecules. “Its job is information storage, information processing in the face of very limited resources,” he says, “and it had nearly two billion years to optimize itself.”
Rigoutsos gives a presentation at the United States Department of Agriculture facility outside Philadelphia. On his way home, he detours to visit a friend for coffee. In the midst of their conversation his friend’s colleague stops by and introduces himself.
They get to talking about their work, and the colleague, a geneticist, starts telling the group about these molecules he is studying called microRNAs. As he explains, he pulls out a napkin and begins sketching a little of what they know. It seems that the newcomer and his team are trying to target these sequences in their experiments, but are having trouble uncovering the rules of the interactions.
After the meeting, Rigoutsos drops everything to focus on the problem.
Little Things and How to See Them
MicroRNAs (miRNAs) are prominent members of a larger class of molecules Rigoutsos studies called non-coding RNA (ncRNA). While the protein-coding genes that have preoccupied biologists for decades can be thousands of bases long, these ncRNAs can contain as few as a dozen and a half bases.
RNAs play many roles throughout the cell, including assisting with transcription, the first step in the process by which genes are turned into proteins, and have typically been viewed as molecular middlemen. “What we discovered,” says Rigoutsos, “is that ncRNA actually meddles with this very process and helps to control what gets expressed in proteins and beyond.” Moreover, multiple ncRNAs seem to be involved at each stage, affecting the process through a complex push-pull relationship, variously blocking or promoting their targets.
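How a small RNA finds its target comes down to base pairing. The canonical rule for miRNAs is that nucleotides 2 through 8, the “seed,” must find their reverse complement in the target message. The sketch below applies only that one rule; real target prediction, the kind of problem Rigoutsos works on, weighs many additional signals such as site context and pairing beyond the seed:

```python
# Watson-Crick pairing for RNA: A pairs with U, G pairs with C
COMPLEMENT = str.maketrans("AUGC", "UACG")

def seed_sites(mirna, mrna):
    """Return positions in the mRNA (5'->3') that match the reverse
    complement of the miRNA seed (nucleotides 2-8, 5'->3')."""
    seed = mirna[1:8]                        # 0-indexed characters 1..7
    site = seed.translate(COMPLEMENT)[::-1]  # reverse complement
    return [i for i in range(len(mrna) - len(site) + 1)
            if mrna[i:i + len(site)] == site]

# let-7a, one of the first miRNAs ever discovered, against a toy mRNA
hits = seed_sites("UGAGGUAGUAGGUUGUAUAGUU", "AAACUACCUCAAA")
```

For let-7a the seed is GAGGUAG, so the code searches for CUACCUC and reports one hit at position 3 in the toy message.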
“People think something has to be big to be powerful,” he says. “It’s the interconnectivity that we have failed to account for in our efforts to understand ‘it.’” As such, the first step in explaining these relationships is the problem of keeping count, a task for which computers and machine-assisted hypothesis generation are ideally suited.
To Rigoutsos, this is nothing short of applying a tried-and-true method that has been used in numerous other fields. Take the decipherment of Linear B, a very early written form of Greek: at the beginning of the 20th century, scholars began comparing text from clay tablets, pottery fragments, and other artifacts (see figure on right), using pencil-and-paper charts to test different combinations of symbols, until the script was finally cracked in the early ’50s.
Once they had their dataset, the archaeologists began grouping the symbols together according to factors like their relative spacing and frequency. This is analogous to the data-mining phase of the team’s work, when they identify potential targets, and doesn’t require any knowledge of the content, instead relying on frequency of appearance as a guidepost.
After generating a set of candidate words, it comes time to set definitions—what do they mean? This is systematic guesswork. In the case of Linear B, text found on a tablet near unearthed wine jugs could have “grapes” or “wine” substituted in. Ditto for names of popular trading posts and other context-specific possibilities. The process of substitution is akin to the experimental phase, when genomic regions are added or omitted and the effects are observed in the lab.
In this way, a mass of gibberish can quickly be turned into smaller, more manageable repeating patterns, which people are able to wrap their minds around and manipulate. This is a piecemeal solution to the bandwidth limitations of the human brain. In the past, this was a source of bias—toward large genes and simple diagrams—belying the vast, multicausal picture that is now emerging.
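The decipherers’ counting step translates directly to code. A sketch of that data-mining phase, with a toy alphabet standing in for Linear B signs (or DNA bases), only needs to tally symbols and adjacent pairs; no knowledge of what anything means is required:

```python
from collections import Counter

def profile(symbols):
    """Tally single symbols and adjacent pairs, with no knowledge
    of content: frequency of appearance is the only guidepost."""
    singles = Counter(symbols)
    pairs = Counter(zip(symbols, symbols[1:]))
    return singles, pairs

singles, pairs = profile("ABABCAB")
```

In the toy string, A and B each appear three times and the pair AB dominates: the kind of regularity that suggested candidate words to the decipherers and suggests candidate targets to the genomicists.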
Rigoutsos, by now a well-established authority on miRNAs, is at a conference in Ohio, prepared to give a presentation, when a friend leans over and asks if he’s going to talk about the “crazy stuff”—his explorations of ncRNA. He responds no, but his friend insists, saying it’ll go over well, so Rigoutsos begins rewriting his entire presentation while sitting in the audience, an hour and a half before he’s due on stage.
He goes on to give the talk, and afterward a man comes up to him, agreeing effusively with everything he had just presented. The man has a plane to catch, but promises to give Rigoutsos a call.
A few months later, Rigoutsos gets the promised call. The man on the other end of the phone is George Calin, MD, PhD. As a postdoc at Jefferson working on an orphan project, Calin was the first to identify miRNA as a key factor in chronic lymphocytic leukemia in 2002.
Calin thinks that the “crazy” connections Rigoutsos had presented a few months earlier could help spot ncRNA involved in colorectal cancer, the third-leading cause of cancer-related death in the United States. Soon after the call, Calin and Rigoutsos set to work. Using computational analysis, they cherry-pick among the many genomic sites of interest flagged by a study Rigoutsos had done in 2006.
They converge on about 1,200 sites and build a special chip with probes to test them. The team begins by testing numerous samples from healthy subjects and patients until the probes light up like a Christmas tree, following patterns that change with tissue and disease. Then they extend the work to samples from patients with colorectal cancer, and in a matter of months they identify an ncRNA—several hundred letters long—whose abundance is associated with patient survival.
One of the team’s most striking findings using this methodology has to do with differences across races, ethnicities, and populations, and between men and women. By looking at data from four European populations and one African population, they found notable differences in the miRNA spectrum while also uncovering consistencies within each group. Even white European populations—Italians and Finns, for instance—showed consistent differences from one another.
This has direct implications for the clinical treatment, not only of groups, but also of individuals, who are always members of more than one group (French women, Greek men, and so on with greater specificity). “Everybody had anecdotes about these variances between people, but never thought to look for any consistency at the other end of the room. And, until recently we did not even have the data to answer this question,” says Rigoutsos. “We simply asked, ‘Does this variance show some coherence?’”
In a similar vein, the team analyzed publicly available datasets derived from more than 10,000 cancer samples and was able to distinguish 32 cancers by their ncRNA profiles, suggesting that different subsets of these molecules play different roles in different diseases.
Unbeknownst to Rigoutsos, Mark Tykocinski, MD, Anthony F. and Gertrude M. DePalma Dean of the Sidney Kimmel Medical College (SKMC), was in the audience that day in Ohio and was excited by Rigoutsos’ talk. Tykocinski immediately saw the implications of big data and the importance of an increasingly granular view of the genome.
Tykocinski, who had taken charge of the medical college just the year before, calls Rigoutsos. A few months later, in February 2010, Rigoutsos joins Jefferson.
Differences have long been observed among U.S. patients with triple-negative breast cancer—the most aggressive kind. In black patients, the cancer has an increased incidence rate, appears at a younger age, and progresses faster than in white patients. These differences persist even after socioeconomic variables are taken into account, suggesting that genetics may be the cause.
Clearly, these findings point to an underlying complexity in triple-negative breast cancer and many other cancers, as the Jefferson team showed. But, Rigoutsos stresses, it is important to realize that this uncovered complexity is a powerful and invaluable development.
“Say 1 percent of your active molecules can ever become drugs,” Rigoutsos muses. “One percent of 20,000 is one number, but 1 percent of two million is a much bigger number.” Uncovering more causal factors means more targets are available for developing therapies tuned to the disease and the patient, and a greater chance that more effective treatments can be found.
Another eight and a half years pass before Rigoutsos and Calin publish their 48-author paper with their findings on the ncRNA, which by now has a name, N-BLR (pronounced “enabler”). N-BLR shows promise as a possible biomarker that could one day save lives as a diagnostic tool or a target for personalized treatments.
Rigoutsos is amazed at how these things take on a life of their own, how data begets even more data, and how finally they have a result. It feels improbable. A physics professor in Greece calls out sick.
A diagram on a napkin. One question more than a decade ago.
A few hundred bases out of six billion.
A cascade of events gaining momentum over time and successive interactions, becoming—with every iteration—inevitable.
Thinking Through Computers
“A computer is not just the thing that allows me to buy things from Amazon,” says Isidore Rigoutsos, PhD, of the ubiquitous machines. “Its original purpose was to actually solve computational problems that we couldn’t do with pencil and paper.”
In the era of Big Data, these problems could be anything from the location of earthlike planets or quantum particles to why a particular gene causes cancer or what left-handed Philadelphians want for dinner. What all these problems have in common is that they require computers and the know-how to use them effectively. The world has recognized how important this skillset is, and the number of undergraduates studying computer science in the United States—about 89,000 at any given time—shows it.
But there is a problem: Each year, universities throughout the country graduate about 1,200 computer science PhDs, many of whom are promptly recruited by tech giants like Facebook, Google, Apple, and Amazon. That leaves only about 300 to pursue academic careers, an insufficient number by any measure.
A bidding war for new hires with some of the most valuable companies in the world is not a feasible solution to the emerging shortage of qualified faculty. Instead, Rigoutsos and his computational medicine colleagues at Jefferson have devised a new way to secure the next generation—find them, train them, and hire (some of) them.
By putting together a certificate program in computer science and computational thinking, with the aim of eventually granting doctoral degrees, they are helping to bring technical talent in-house. Since 2015, they have been doing the much-needed preparatory work and now have launched two brand-new courses in computational science: Introduction to R Programming and Data Visualization.
“Our version of ‘data visualization’ is likely not what you might think,” says Rigoutsos. “Our version of the course pulls together understanding from data science and the human visual system in order to teach students how best to present their findings. Along the way, we also show them how, by visualizing data, they can pick up patterns that highlight which topics are most important to the scientific community at a given point in time and a host of other areas.”
Through problem sets and hands-on exercises familiar to any engineering student, these courses aim to give Jefferson’s students their first experience of thinking with a computer. “The amount of information they need to work with is daunting, and it can be difficult to know where to begin,” says Rigoutsos. “I try to teach them not to be afraid to try things, to let the computer loose on the data.” This is perhaps the crux of “computational thinking”: seeing the overarching question, breaking it down into well-defined, computable pieces that yield definite answers, then weaving those answers into a single story.
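As a concrete, entirely hypothetical instance of that decomposition: a fuzzy question such as “what distinguishes a disease sample?” can be split into two computable pieces, count the short words in each sample, then score the differences. The sample strings and the simple count-difference score below are inventions for illustration, not course material:

```python
from collections import Counter

def kmer_counts(seq, k):
    """Computable piece 1: how often does each k-letter word occur?"""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def enriched(disease, healthy, k):
    """Computable piece 2: score each word by how much more often it
    appears in the disease sample, weaving the counts into one story."""
    d, h = kmer_counts(disease, k), kmer_counts(healthy, k)
    scores = {w: d[w] - h[w] for w in d}
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranking = enriched("ACGACGACG", "ACGTTTTTT", 3)
```

Each piece yields a definite answer (a table of counts, a scored list), and the final sort assembles them into one ranked story a researcher can act on.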
One of the goals of this effort is to embed these abilities in different fields by teaching traditionally specialized professionals: physicians, designers, architects, engineers, anyone whose work could be affected by a deep dive into the data. In other words, everyone. Big data has implications for every vocation and subject, from automobile design and political polling to advertising and fraud detection, anywhere that patterns need to be discerned and deciphered.