WEST LAFAYETTE, Ind. – A Purdue University data science and machine learning innovator wants to help organizations and users get the most for their money when it comes to cloud-based databases. Her same technology may help self-driving vehicles operate more safely on the road when latency is the primary concern.
Somali Chaterji, a Purdue assistant professor of agricultural and biological engineering who directs the Innovatory for Cells and Neural Machines [ICAN], and her team created a technology called OPTIMUSCLOUD.
The system is designed to help achieve cost and performance efficiency for cloud-hosted databases, rightsizing resources to benefit both the cloud vendors who do not have to aggressively over-provision their cloud-hosted servers for fail-safe operations and to the clients because the data center savings can be passed on them.
“It also may help researchers who are crunching their research data on remote data centers, compounded by the remote working conditions during the pandemic, where throughput is the priority,” Chaterji said. “This technology originated from a desire to increase the throughput of data pipelines to crunch microbiome or metagenomics data.”
The Purdue technology works with the three major cloud database providers: Amazon’s AWS, Google Cloud, and Microsoft Azure. Chaterji said it would work with other more specialized cloud providers such as Digital Ocean and FloydHub, with some engineering effort.
It is benchmarked on Amazon’s AWS cloud computing services with the NoSQL technologies Apache Cassandra and Redis.
“Let’s help you get the most bang for your buck by optimizing how you use databases, whether on-premise or cloud-hosted,” Chaterji said. “It is no longer just about computational heavy lifting, but about efficient computation where you use what you need and pay for what you use.”
Chaterji said current cloud technologies using automated decision making often only work for short and repeat tasks and workloads. She said her team created an optimal configuration to handle long-running, dynamic workloads, whether it be workloads from the ubiquitous sensor networks in connected farms or high-performance computing workloads from scientific applications or the current COVID-19 simulations from different parts of the world in a rush to find the cure against the virus.
“Our right-sizing approach is increasingly important with the myriad applications running on the cloud with the diversity of the data and the algorithms required to draw insights from the data and the consequent need to have heterogeneous servers that drastically vary in costs to analyze the data flows,” Chaterji said. “The prices for on-demand instances on Amazon EC2 vary by more than a factor of five-thousand, depending on the virtual memory instance type you use.”
The Purdue team’s technology has been accepted for publication at the 2020 USENIX Annual Technical Conference, taking place as a virtual event in July.
Chaterji said OPTIMUSCLOUD has numerous applications for databases used in self-driving vehicles (where latency is a priority), health care repositories (where throughput is a priority), and Internet of Things (IoT) infrastructures in farms or factories.
OPTIMUSCLOUD is a software that is run with the database server. It uses machine learning and data science principles to develop algorithms that help jointly optimize the virtual machine selection and the database management system options.
“Also, in these strange times when both traditionally compute-intensive laboratories such as ours and wet labs are relying on compute storage, such as to run simulations on the spread of COVID-19, throughput of these cloud-hosted VMs is critical and even a slight improvement in utilization can result in huge gains,” Chaterji said. “Consider that currently, even the best data centers run at lower than 50% utilization and so the costs that are passed down to end-users are hugely inflated.”
The other members of the team that developed OPTIMUSCLOUD are Saurabh Bagchi, a Purdue professor of electrical and computer engineering and computer science (by courtesy); Ashraf Mahgoub, a Ph.D. student in computer science; and Karthik Shankar, an undergraduate researcher in Chaterji’s lab headed to Carnegie Mellon for graduate school in computer science.
“Our system takes a look at the hundreds of options available and determines the best one normalized by the dollar cost,” Chaterji said. “When it comes to cloud databases and computations, you don’t want to buy the whole car when you only need a tire, especially now when every lab needs a tire to cruise.”
The work is supported by the National Institutes of Health through a 5-year R01 grant and a grant from the Wabash Heartland Innovation Network (Lilly Endowment Inc.).
The team worked with the Purdue Research Foundation Office of Technology Commercialization to patent the technology. The office recently moved into the Convergence Center for Innovation and Collaboration in Discovery Park District, adjacent to the Purdue campus.
The creators are looking for partners to commercialize their technology. For more information on licensing this innovation, contact Matt Halladay of OTC at email@example.com and mention track code 2020-CHAT-68827.
About Purdue Research Foundation Office of Technology Commercialization
The Purdue Research Foundation Office of Technology Commercialization operates one of the most comprehensive technology transfer programs among leading research universities in the U.S. Services provided by this office support the economic development initiatives of Purdue University and benefit the university’s academic activities through commercializing, licensing and protecting Purdue intellectual property. The office recently moved into the Convergence Center for Innovation and Collaboration in Discovery Park District, adjacent to the Purdue campus. In fiscal year 2019, the office reported 136 deals finalized with 231 technologies signed, 380 disclosures received and 141 issued U.S. patents. The office is managed by the Purdue Research Foundation, which received the 2019 Innovation and Economic Prosperity Universities Award for Place from the Association of Public and Land-grant Universities. In 2020, IPWatchdog Institute ranked Purdue third nationally in startup creation and in the top 20 for patents. The Purdue Research Foundation is a private, nonprofit foundation created to advance the mission of Purdue University. Contact firstname.lastname@example.org for more information.
About Purdue University
Purdue University is a top public research institution developing practical solutions to today’s toughest challenges. Ranked the No. 6 Most Innovative University in the United States by U.S. News & World Report, Purdue delivers world-changing research and out-of-this-world discovery. Committed to hands-on and online, real-world learning, Purdue offers a transformative education to all. Committed to affordability and accessibility, Purdue has frozen tuition and most fees at 2012-13 levels, enabling more students than ever to graduate debt-free. See how Purdue never stops in the persistent pursuit of the next giant leap at purdue.edu.
Saurabh Bagchi, email@example.com
OPTIMUSCLOUD: Heterogeneous Configuration Optimization for Distributed Databases in the Cloud
Achieving cost and performance efficiency for cloud-hosted databases requires exploring a large configuration space, including both the parameters exposed by database and the variety of VM configurations available in the cloud. Even small deviations from an optimal configuration have significant consequences for performance and cost. Existing systems that automate cloud deployment configuration can select near-optimal instance types for homogeneous clusters of virtual machines and for stateless, recurrent data analytics workloads. We show that to find optimal performance-per-$ cloud deployments for NoSQL database applications, it is important to (1) consider heterogeneous cluster configurations, (2) jointly optimize database and VM configurations, and (3) dynamically adjust configuration as workload behavior changes. We present OPTIMUSCLOUD, an online reconfiguration system that can efficiently perform such joint and heterogeneous configuration for dynamic workloads. We evaluate our system with two clustered NoSQL systems: Cassandra and Redis, using three representative workloads and show that OPTIMUSCLOUD provides 40% higher throughput/$ and 4.5× lower 99-percentile latency on average compared to state-of-the-art prior systems, CherryPick, Selecta, and SOPHIA.