Using Big Data to Find Small Changes

Tiny variations in our genes account for everything about us. Some variations are responsible for benign things like your eye colour or how wet your earwax is (yup). Other variations can cause or increase your risk for genetic disorders like Multiple Sclerosis or Rheumatoid Arthritis. Computational geneticists gather and examine data sets of genetic information from thousands of people to find these variations. “In practice, it comes down to looking at very small differences between people. That’s why you have to have many, many people to do this type of work in order to pick up the slight differences,” explains Annique Claringbould, a PhD student in genetics at the University of Groningen.

When you’re trying to identify risk factors for about 1,200 genetic disorders at the same time, you need a large database of people. Annique’s research group works with different biobanks, which store biological samples for research. Using biobanks gives researchers access to data from a large group of people. Her research group has also been contacting other research groups who do the same type of analysis to gather data from 32,000 people, making it the largest data set in the field. This is an impressive improvement on the previous largest data set, which covered around 7,000 people.

Using this sizable data set, Annique compares people with the risk variants for different genetic conditions to people with the non-risk variants to determine the molecular differences between them. With just her computer, Annique is able to take on more projects than if she had to run experiments in a lab. She can test for a wide variety of traits in 32,000 people all at once. “If you have some crazy idea you can just try it. It’s not going to hurt anyone and it’s not going to cost a lot of money. You can just try and see if it works,” says Annique.

Through computational genetics it could be possible not only to pinpoint the genes that cause specific genetic disorders, but also to develop new targets for medications to treat them. It’s been shown that drugs that use targets with a molecular or genetic background are far more likely to be successful than those that don’t. Current development approaches involve screening thousands of compounds to see if they cause any reaction in the cells, a very untargeted approach. Using their data sets, computational geneticists can identify targets faster and at lower cost.

This type of research highlights the need for larger pools of health data. Even if a person has all the known genetic risk factors for a condition, researchers can still only predict who will actually develop the disease with 20-30% accuracy. But if they had a larger pool of samples that was more representative of the population at large, they could identify more genetic variants and predict diagnoses more accurately. “Because the field is relatively new, people are hesitant since they don’t know what in the end researchers will be able to know about you,” Annique says. She thinks that modern society should start getting used to the idea that health data, which most people consider to be very sensitive information, can have such a huge impact on research potential that it might be worth sharing. With a data set of 32,000 people, research groups around the world have identified anywhere from 50 to 100 variants for each genetic disease, but it seems that more than 1,000 variants need to be identified to accurately assess someone’s risk for disease. “What we’re doing is very interesting but it’s not close to being solved.”
