Cutting cloud waste at scale: Akamai saves 70% using AI agents orchestrated by Kubernetes

Particularly in this dawning era of generative AI, cloud costs are at an all-time high. But that is not merely because enterprises are using more compute; they are also not using it efficiently. In fact, just this year, enterprises are expected to waste $44.5 billion on unnecessary cloud spending.

This is an amplified problem for Akamai Technologies: The company has a large and complex cloud infrastructure on multiple clouds, not to mention numerous strict security requirements.

To resolve this, the cybersecurity and content delivery provider turned to the Kubernetes automation platform Cast AI, whose AI agents help optimize cost, security and speed across cloud environments.

Ultimately, the platform helped Akamai cut cloud costs by between 40% and 70%, depending on workload.

“We needed a continuous way to optimize our infrastructure and reduce our cloud costs without sacrificing performance,” Dekel Shavit, senior director of cloud engineering at Akamai, told VentureBeat. “We’re the ones processing security events. Delay is not an option. If we’re not able to respond to a security attack in real time, we have failed.”

Specialized agents that monitor, analyze and act

Kubernetes manages the infrastructure that runs applications, making it easier to deploy, scale and manage them, particularly in cloud-native and microservices architectures.

Cast AI has integrated into the Kubernetes ecosystem to help customers scale their clusters and workloads, select the best infrastructure and manage compute lifecycles, explained founder and CEO Laurent Gil. Its core platform is Application Performance Automation (APA), which operates through a team of specialized agents that continuously monitor, analyze and take action to improve application performance, security, efficiency and cost. Companies provision only the compute they need from AWS, Microsoft, Google or others.

APA is powered by several machine learning (ML) models with reinforcement learning (RL) based on historical data and learned patterns, enhanced by an observability stack and heuristics. It is coupled with infrastructure-as-code (IaC) tools on several clouds, making it a completely automated platform.

Gil explained that APA was built on the tenet that observability is just a starting point; as he called it, observability is “the foundation, not the goal.” Cast AI also supports incremental adoption, so customers don’t have to rip out and replace; they can integrate into existing tools and workflows. Further, nothing ever leaves customer infrastructure; all analysis and actions occur within their dedicated Kubernetes clusters, providing more security and control.

Gil also emphasized the importance of human-centricity. “Automation complements human decision-making,” he said, with APA maintaining human-in-the-loop workflows.

Akamai’s unique challenges

Shavit explained that Akamai’s large and complex cloud infrastructure powers content delivery network (CDN) and cybersecurity services delivered to “some of the world’s most demanding customers and industries” while complying with strict service level agreements (SLAs) and performance requirements.

He noted that for some of the services Akamai consumes, it is probably the vendor’s largest customer, adding that his team has done “tons of core engineering and reengineering” with its hyperscaler to support its needs.

Further, Akamai serves customers of various sizes and industries, including large financial institutions and credit card companies. The company’s services are directly related to its customers’ security posture.

Ultimately, Akamai needed to balance all this complexity with cost. Shavit noted that real-life attacks on customers could drive capacity 100X or 1,000X on specific components of its infrastructure. But “scaling our cloud capacity by 1,000X in advance just isn’t financially feasible,” he said.

His team considered optimizing on the code side, but the inherent complexity of their business model required focusing on the core infrastructure itself.

Automatically optimizing the entire Kubernetes infrastructure

What Akamai really needed was a Kubernetes automation platform that could optimize the costs of running its entire core infrastructure in real time on several clouds, Shavit explained, and scale applications up and down based on constantly changing demand. But all this had to be done without sacrificing application performance.
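To make "scale applications up and down based on constantly changing demand" concrete, here is a minimal sketch of the replica calculation documented for the stock Kubernetes Horizontal Pod Autoscaler. It is an illustration only: the article does not say how Cast AI's agents compute scaling decisions, and the function and parameter names below are hypothetical.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Replica count following the documented Kubernetes HPA rule:
    ceil(current_replicas * current_metric / target_metric).

    This is NOT Cast AI's algorithm; it is the generic rule, shown
    only to illustrate demand-driven scaling.
    """
    if current_replicas < 1 or target_metric <= 0:
        raise ValueError("need at least one replica and a positive target")
    ratio = current_metric / target_metric
    # Skip scaling when the metric is close enough to the target,
    # which avoids flapping on small fluctuations.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Ten replicas averaging 90% CPU against a 60% target scale out to 15.
print(desired_replicas(10, 90, 60))
```

The same rule scales in when demand drops: at 30% average CPU against the 60% target, ten replicas shrink to five.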

Shavit noted that before implementing Cast AI, Akamai’s DevOps team manually tuned all its Kubernetes workloads just a few times a month. Given the scale and complexity of the infrastructure, this was challenging and costly. And by analyzing workloads only sporadically, the team missed any potential for real-time optimization.

“Now, hundreds of Cast agents do the same tuning, except they do it every second of every day,” said Shavit.

The core APA features Akamai uses are autoscaling, in-depth Kubernetes automation with bin packing (placing workloads on as few nodes as possible), automatic selection of the most cost-efficient compute instances, workload rightsizing, Spot instance automation throughout the entire instance lifecycle and cost analytics capabilities.
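Bin packing is the easiest of these features to show in a few lines. The sketch below uses first-fit decreasing, a classic textbook heuristic, to place pod CPU requests on as few identically sized nodes as possible. It is an illustration under stated assumptions, not Cast AI's implementation: the article does not describe the algorithm the platform uses, and the request sizes, node capacity and function name are hypothetical.

```python
def first_fit_decreasing(requests_millicores, node_capacity):
    """Pack CPU requests onto as few equal-sized nodes as the heuristic finds.

    Returns a list of nodes, each a list of the requests placed on it.
    First-fit decreasing is a generic heuristic, used here for illustration.
    """
    if any(r > node_capacity for r in requests_millicores):
        raise ValueError("a request exceeds the capacity of a single node")

    nodes = []  # requests placed on each node
    free = []   # remaining capacity on each node
    # Placing the largest requests first tends to leave fewer awkward gaps.
    for request in sorted(requests_millicores, reverse=True):
        for i, remaining in enumerate(free):
            if request <= remaining:
                nodes[i].append(request)
                free[i] -= request
                break
        else:
            # No existing node has room, so provision another one.
            nodes.append([request])
            free.append(node_capacity - request)
    return nodes


# Hypothetical pod CPU requests in millicores, packed onto 4-core nodes.
pods = [2000, 500, 1500, 1000, 3000, 500, 1500]
placement = first_fit_decreasing(pods, node_capacity=4000)
print(len(placement), placement)
```

Here seven pods requesting 10 cores in total fit on three 4-core nodes, which is also the theoretical minimum for this input. Fewer, fuller nodes mean less idle capacity to pay for.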

“We got insight into cost analytics two minutes into the integration, which is something we’d never seen before,” said Shavit. “Once active agents were deployed, the optimization kicked in automatically, and the savings started to come in.”

Spot instances, which give enterprises access to unused cloud capacity at discounted prices, obviously made business sense, but they turned out to be complicated to use with Akamai’s complex workloads, particularly Apache Spark, Shavit noted. This meant the team needed to either overengineer workloads or put more working hands on them, which turned out to be financially counterproductive.

With Cast AI, they were able to use Spot instances on Spark with “zero investment” from the engineering team or operations. The value of Spot instances was “super clear”; they just needed to find the right tool to be able to use them. This was one of the reasons they moved forward with Cast AI, Shavit noted.

While saving 2X or 3X on their cloud bill is great, Shavit pointed out that automation without manual intervention is “priceless.” It has resulted in “massive” time savings.
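The "2X or 3X" figure and the 40% to 70% range describe the same savings in two ways, if "saving 2X" is read as the bill shrinking to half its former size. That reading is an assumption; the article does not define the multiple. Under it, the conversion is simple arithmetic, sketched here with hypothetical function names:

```python
def bill_shrink_factor(cut_fraction: float) -> float:
    """How many times smaller the bill is after cutting it by cut_fraction."""
    if not 0 <= cut_fraction < 1:
        raise ValueError("cut_fraction must be in [0, 1)")
    return 1 / (1 - cut_fraction)


def cut_for_factor(factor: float) -> float:
    """The fractional cut needed to shrink the bill by the given factor."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return 1 - 1 / factor


for cut in (0.4, 0.5, 0.7):
    print(f"{cut:.0%} cut -> bill is {bill_shrink_factor(cut):.2f}x smaller")
```

A 50% cut corresponds to a 2X reduction and a 70% cut to roughly 3.3X, while a 3X reduction requires a cut of about 67%, so the two sets of figures are broadly consistent at the upper end of the range.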

Before implementing Cast AI, his team was “constantly moving around knobs and switches” to ensure that their production environments, and their customers, got the level of service they needed.

“Hands down the biggest benefit has been the fact that we don’t need to manage our infrastructure anymore,” said Shavit. “The team of Cast’s agents is now doing this for us. That has freed our team up to focus on what matters most: Releasing features faster to our customers.”
