Need for speed

Research Computing program connects UofSC scientists with high-powered data analytics solutions

Posted on: April 2, 2019; Updated on: April 2, 2019
By Chris Horn, chorn@sc.edu, 803-777-3687

They don’t look like a NASCAR pit crew, but the Research Computing team is speeding up the data-crunching capabilities of the university’s scientists, bringing faster analytical results and more efficient computing to those who work with very large and complex data sets.

“Research computing is the third pillar of science now,” says Paul Sagona, who directs the four-person team in the university’s Division of Information Technology. “You’ve got theoretical, experimentation, and now computation and simulation. It’s become an absolute necessity for science.”

The Research Computing program began several years ago to help researchers improve project performance by optimizing data analytics workflows, identifying bottlenecks and problems in computer code, and securing computing resources at the university level and beyond.

“Depending on a researcher’s computing needs, we can help migrate them onto larger systems with more CPU cores and more RAM. If their code scales well, we’ll move them up to what we have available on campus and, if necessary, find ways to graduate them to larger resources, whether it be supercomputing centers across the country or cloud computing.”

The program offers services at no charge to university researchers, although there are opportunities to pay for dedicated and priority access to the university’s computer cluster. Research Computing’s services sometimes improve research grant competitiveness by demonstrating that a researcher has high-powered data analytics support and that the research question itself is scalable.

Sean Norman, an environmental health sciences professor in the Arnold School of Public Health, needed the kind of speed Research Computing can deliver. His metagenomics research involves collecting environmental field samples and analyzing their genetic profiles.

“With the advances in DNA sequencing, we generate millions of DNA sequences, and we have to rely on computers to help us,” Norman says. “We can’t sit down and actually analyze 30 or 40 million sequences by hand.”

When the information technology division’s Hyperion high-performance computing cluster became operational last fall, Norman worked with Research Computing to use the new machine for his team’s data analytics needs. He had previously used the national supercomputer network, which provides major computing horsepower but also requires waiting in line with other researchers around the country.

The in-house Hyperion was far more convenient to use, but Sagona’s team noticed that some of the analyses Norman was running were taking a month or more to complete.

“By optimizing the workflow of Sean Norman’s research, we were able to speed it up,” says Sagona, adding that the goal was to reduce the computing time from months to weeks. Even after optimizing the computer code, however, Sagona’s team determined that Hyperion’s capacity wasn’t sufficient for the task. That’s when they looked at cloud computing.

Sagona’s team talked to Google Cloud engineers and began to optimize Norman’s code for a cloud-based approach. They tested small samples first to ensure the code worked, then uploaded the entire data set to Google Cloud.

“We ended up accessing nearly 4,000 computing nodes, each one of them with up to 10 times the computing horsepower of a high-powered laptop,” Sagona says.

Ben Torkian, senior application scientist for Research Computing, wrote software tools that could split the job into smaller segments, then reassemble the results after the computational analysis was completed. It turned out to be one of the largest data sets ever run on the Google Cloud Platform, and the results were phenomenal, according to Sagona.
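
For readers curious about what "split, analyze, reassemble" looks like in practice, here is a minimal Python sketch of the general pattern. It is not Torkian's software, which the article does not describe in detail. The function names, the segment size and the stand-in analysis (the share of G and C bases in each sequence) are all hypothetical, and a thread pool on one machine stands in for the thousands of cloud nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split_into_segments(items: Sequence[str], segment_size: int) -> List[List[str]]:
    """Cut the full job into consecutive segments of at most segment_size items."""
    if segment_size < 1:
        raise ValueError("segment_size must be at least 1")
    return [list(items[i:i + segment_size]) for i in range(0, len(items), segment_size)]


def gc_fraction(sequence: str) -> float:
    """Hypothetical stand-in analysis: fraction of G and C bases in one sequence."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def analyze_segment(segment: List[str]) -> List[float]:
    """Run the stand-in analysis on every sequence in one segment."""
    return [gc_fraction(seq) for seq in segment]


def run_split_job(items: Sequence[str], segment_size: int, workers: int = 4) -> List[float]:
    """Split the job, analyze the segments concurrently, then reassemble in order."""
    segments = split_into_segments(items, segment_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in submission order, so the segments
        # come back lined up with the original input.
        per_segment = list(pool.map(analyze_segment, segments))
    # Reassemble: flatten the per-segment results into one list.
    return [value for segment_result in per_segment for value in segment_result]


if __name__ == "__main__":
    sequences = ["GGCC", "ATAT", "GATC", "", "gcgc"]
    print(run_split_job(sequences, segment_size=2))
```

The key design point is that each segment can be analyzed independently of the others, which is what lets a job like this spread across many machines at once.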

“It was a huge job, and it was completed in 16.6 hours. We calculated that if Sean had tried to run this on his PC, it would have taken seven years,” Sagona says. “We provided a lot of feedback to Google during that project, and I think we learned a lot and they did, too. We’ll be better prepared to help the next person at USC who needs that much computing power.”

Perhaps not surprisingly, Sagona was invited to the Google Cloud Next conference in London this past October to present a talk on the scale of the project and the role of cloud computing in the performance of data analytics.

“My talk was well received, and I had a lot of positive feedback,” he says. “I made some great connections there that will prove to be very valuable moving forward with cloud and funding agencies.”


Topics: Faculty, Research, Arnold School of Public Health