November 15, 2024

KubeCon 2024 Recap: The Future of AI and Kubernetes in Cloud-Native Ecosystems

Discover the latest AI and Kubernetes advancements from KubeCon North America 2024. Dive into sessions on scaling AI workloads, GPU scheduling, and cloud-native infrastructure, and see how Harness’s AI-Native Software Delivery platform is enhancing DevOps productivity and security. Explore insights from keynotes, AI-driven workflows, and co-located events that are shaping the future of cloud-native tech.

As the flagship event of the Cloud Native Computing Foundation, KubeCon brought together engineers, thought leaders, and cloud-native professionals to explore innovations and best practices in Kubernetes and cloud-native technologies. AI was a major theme this year, with sessions on scaling AI workloads, optimizing model deployment, and leveraging reinforcement learning for GPU scheduling. These insights highlighted Kubernetes' role in AI-driven workflows, where managing large-scale models is crucial.

Harness’s AI-Native Software Delivery platform complements these advancements by accelerating DevOps with intelligent automation. In this blog post, we’ll dive into the key trends from KubeCon’s AI-focused talks and how platforms like Harness are redefining productivity and security for AI and Kubernetes practitioners alike.

Keynote Highlights

KubeCon North America 2024 opened with CNCF leaders setting the stage for key themes, including the convergence of AI, Kubernetes, and cloud-native innovation. Here are a few highlights:

  • Scaling AI Workloads: Peter Salanki, CTO, and Chen Goldberg, SVP of Engineering, both of CoreWeave, shared real-world insights into the complexities of training large-scale foundation models on Kubernetes. With extensive GPU clusters, even small hardware issues can become major bottlenecks. They discussed lessons learned in building Kubernetes clusters at scale, including strategies for handling hardware failures, optimizing GPU scheduling, and enhancing observability. Leveraging CNCF projects, they demonstrated how Kubernetes offers a robust platform for managing complex infrastructure for generative AI, making it easier to monitor and maintain large AI workloads.
  • Platform Engineering and AI: Kasper Borg Nissen, Staff Platform Engineer at Lunar, explored how AI can move beyond isolated teams and become a core part of an organization’s infrastructure. Drawing on cloud-native platform engineering principles, he demonstrated how businesses can leverage existing tools and architectures to democratize AI, making it accessible to all teams without the need for complex setups.

These highlights reflect the key trends driving the cloud-native landscape, from AI integration to enhanced security and scalability.

AI and Machine Learning in the Cloud-Native Ecosystem

KubeCon showcased the expanding role of AI and machine learning in cloud-native environments, with a focus on the unique demands of training, serving, and scaling models on Kubernetes. Talks explored advanced techniques for managing complex AI workloads in production and shared best practices for making machine learning accessible, efficient, and operationally sound within Kubernetes clusters.

Advanced Model Serving with Ray on Kubernetes, led by Andrew Sy Kim from Google and Kai-Hsun Chen of Anyscale, highlighted Ray's growing role in handling distributed model serving, especially for large language models (LLMs). The speakers delved into techniques like model composition, multiplexing, and fractional GPU scheduling to optimize resource allocation. A key takeaway was how Ray’s integration with Kubernetes and support for GPU-native communication facilitates scalable tensor parallelism, making it feasible to deploy LLMs across multiple GPUs. The session’s live demo of KubeRay underscored how these techniques are applied in real-world scenarios, enabling complex models to be efficiently orchestrated across diverse hardware environments.
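Fractional GPU scheduling can be understood with a simple packing model: each model replica requests a fraction of a device (much as a Ray deployment can request, say, a quarter of a GPU), and the scheduler packs requests onto as few GPUs as possible. The first-fit sketch below is purely illustrative, not Ray’s actual placement logic; the model names are hypothetical:

```python
# Illustrative first-fit packing of model replicas onto GPUs, where each
# replica requests a fraction of a device. This is a simplified sketch of
# the fractional-GPU idea, not Ray's real scheduling algorithm.

def pack_replicas(requests, gpu_capacity=1.0):
    """Assign each (model, fraction) request to the first GPU with room."""
    gpus = []        # remaining capacity per GPU
    placement = []   # (model, gpu_index) assignments
    for model, frac in requests:
        for i, free in enumerate(gpus):
            if free >= frac:
                gpus[i] -= frac
                placement.append((model, i))
                break
        else:
            gpus.append(gpu_capacity - frac)  # provision a new GPU
            placement.append((model, len(gpus) - 1))
    return placement, len(gpus)

requests = [("llm-a", 0.5), ("llm-b", 0.25), ("llm-c", 0.5), ("llm-d", 0.25)]
placement, n_gpus = pack_replicas(requests)
print(placement, n_gpus)  # four fractional replicas fit on 2 GPUs
```

Even this toy version shows why fractional requests matter: four replicas that would otherwise each claim a whole device share two GPUs instead.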

In AI and ML: The Operational Side, Rob Koch and Milad Vafaeifard from Slalom Build and EPAM discussed the often-overlooked, yet essential, operational challenges of running machine learning applications. They presented a compelling case for leveraging service meshes to streamline ML infrastructure. By enhancing observability, maintaining reliability, and simplifying operations, a service mesh allows ML engineers to focus on building robust models rather than reinventing infrastructure. This approach ensures reliable access to compute resources like GPUs and provides the much-needed separation between datasets and training processes in complex, production-scale environments.

Another practical perspective was provided in Operationalizing High-Performance GPU Clusters in Kubernetes by Will Gleich and Wai Wu from Databricks. They shared hard-earned insights from training Databricks’ DBRX model on a massive 400-node GPU cluster. The speakers emphasized the intricacies of GPU health monitoring, including their innovative use of Prometheus and DCGM Exporter for real-time health detection. They also addressed the complexity of monitoring GPUDirect RDMA (GDRDMA) and of managing failure scenarios in distributed environments. Databricks' experience highlighted the importance of infrastructure resilience and scalability in multi-cloud GPU clusters, especially when handling the varying demands of large-scale AI training.
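The core of that kind of health detection is simple: scrape per-GPU metrics and flag devices that trip a rule. The sketch below illustrates the idea over DCGM-style samples; the thresholds are illustrative, and the metric names are loosely modeled on DCGM Exporter fields rather than taken from the Databricks setup:

```python
# Toy health evaluator over DCGM-style GPU metric samples, sketching the
# kind of real-time checks described in the talk. Thresholds and metric
# names are illustrative, loosely modeled on DCGM Exporter field names.

UNHEALTHY_RULES = {
    "DCGM_FI_DEV_GPU_TEMP": lambda v: v >= 90,   # sustained high temperature
    "DCGM_FI_DEV_XID_ERRORS": lambda v: v > 0,   # any XID error is suspect
}

def unhealthy_gpus(samples):
    """samples: list of (gpu_id, metric_name, value). Returns flagged GPU ids."""
    flagged = set()
    for gpu_id, metric, value in samples:
        rule = UNHEALTHY_RULES.get(metric)
        if rule and rule(value):
            flagged.add(gpu_id)
    return sorted(flagged)

samples = [
    ("gpu-0", "DCGM_FI_DEV_GPU_TEMP", 71),
    ("gpu-1", "DCGM_FI_DEV_GPU_TEMP", 93),
    ("gpu-2", "DCGM_FI_DEV_XID_ERRORS", 1),
]
print(unhealthy_gpus(samples))  # prints ['gpu-1', 'gpu-2']
```

In production this logic typically lives in Prometheus alerting rules rather than application code, so that unhealthy nodes can be cordoned before they stall a multi-node training job.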

Production AI at Scale: Cloudera’s Journey, led by Zoram Thanga and Peter Ableda, showcased Cloudera's approach to creating a high-performance inference platform using Kubernetes and open-source technologies. Their talk provided a blueprint for building secure, scalable, and standards-compliant inference services in production, with strong emphasis on authentication, authorization, and audit logging. By ensuring openness and security, Cloudera illustrated how large organizations can integrate AI inferencing into big data ecosystems, making production-grade AI accessible and reliable for enterprise applications.

Finally, Democratizing AI Model Training on Kubernetes with Kubeflow TrainJob and JobSet, presented by Andrey Velichkevich and Yuki Iwai, introduced the new Kubeflow TrainJob API. This solution simplifies distributed training and LLM fine-tuning on Kubernetes, reducing the need for extensive DevOps support. By providing reusable, extendable training runtimes, TrainJob empowers data scientists to run large-scale, distributed training with ease. Integrated with Kubernetes JobSet, this API enhances scalability and resilience, promoting a more inclusive, accessible, and productive ML development ecosystem.

One of the most innovative approaches to improving AI infrastructure was explored in Unlocking the Future of GPU Scheduling in Kubernetes with Reinforcement Learning by Nikunj Goyal from Adobe Systems and Aditi Gupta from Disney+ Hotstar. They discussed the challenges of GPU scheduling, particularly in multi-GPU setups, and how reinforcement learning (RL) could optimize GPU resource utilization. Unlike traditional scheduling algorithms, RL continuously adapts to the dynamic nature of Kubernetes clusters, addressing issues like resource fragmentation and low utilization. The talk highlighted state-of-the-art RL algorithms and the potential for Reinforcement Learning with Human Feedback (RLHF) to enhance GPU scheduling in Kubernetes, offering a glimpse into the future of more efficient and cost-effective AI workloads.
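What distinguishes an RL scheduler from a static policy is the feedback loop: placements produce observed rewards (for example, resulting utilization), and the policy updates its preferences over time. The epsilon-greedy bandit below is a deliberately tiny sketch of that loop, with hypothetical node names and synthetic rewards; real RL schedulers use far richer state and algorithms:

```python
# A tiny epsilon-greedy sketch of RL-style scheduling: the agent learns which
# node to prefer from observed utilization rewards. Real RL schedulers use far
# richer state and algorithms; this only illustrates the feedback loop.
import random

def epsilon_greedy_schedule(rewards_by_node, steps=1000, epsilon=0.1, seed=0):
    """rewards_by_node: node -> expected reward (e.g., resulting utilization)."""
    rng = random.Random(seed)
    nodes = list(rewards_by_node)
    value = {n: 0.0 for n in nodes}   # running reward estimate per node
    count = {n: 0 for n in nodes}
    for _ in range(steps):
        if rng.random() < epsilon:
            node = rng.choice(nodes)             # explore a random node
        else:
            node = max(nodes, key=value.get)     # exploit the current best
        reward = rewards_by_node[node] + rng.gauss(0, 0.05)  # noisy feedback
        count[node] += 1
        value[node] += (reward - value[node]) / count[node]  # incremental mean
    return max(nodes, key=value.get)

best = epsilon_greedy_schedule({"node-a": 0.55, "node-b": 0.80, "node-c": 0.60})
print(best)  # converges on the node with the highest observed utilization
```

The key contrast with a static bin-packing heuristic is that nothing here is hard-coded: if cluster conditions shift so that a different node yields better utilization, the estimates shift with it.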

These sessions, among others, illustrate how Kubernetes is evolving to meet the needs of modern AI and machine learning workflows. While this is not an exhaustive list, it provides a snapshot of the exciting advancements being made in the cloud-native AI space at KubeCon.

Harness’s AI-Native Software Delivery platform accelerates DevOps by integrating intelligent automation across the entire software lifecycle, enhancing processes from code generation and testing to CI/CD pipeline creation and cloud governance. AI tools like the AI DevOps Assistant and AI QA Assistant help streamline pipeline management, testing, and vulnerability remediation, improving developer productivity and application security. This platform brings efficiency to modern software delivery, ensuring seamless, scalable, and secure cloud-native operations.

For more details, visit the Harness AI-Native Software Delivery page.

Highlights from Co-Located Events

Harness participated in several co-located events at KubeCon North America 2024, including ArgoCon, OpenTofu Day, and BackstageCon, each of which provided a deep dive into specialized areas of cloud-native technology.

Harness IDP: A Powerful Internal Developer Portal powered by Backstage

At BackstageCon, Himanshu Mishra from Harness delivered a practical and insightful talk titled “Backstage Adoption Deep Dive - Navigating the Pitfalls.” As Backstage gains momentum among organizations looking to enhance developer productivity, Himanshu’s session tackled the intricacies of adopting Backstage, including where to start and how to lay a solid foundation for success. He outlined critical milestones from day one to the first month, shared common pitfalls, and highlighted the essential cultural and technical considerations. By offering a clear roadmap for Backstage adoption, Himanshu provided Platform Engineers with actionable guidance on creating a developer portal that’s both effective and aligned with organizational goals.

Harness IDP builds on the Backstage framework to address adoption challenges, offering a fully managed portal within a secure, enterprise-grade platform with user management, RBAC, curated plugins, OPA policies, and audit trails. The Web Admin UI makes it easier to configure Backstage, enabling faster time-to-value with fewer engineering resources. The Harness booth, Z16, was buzzing with activity at BackstageCon, where we had many in-depth conversations about the demand for developer portals and the journey toward effective adoption across platform and development teams.

ArgoCon: The Journey of Argo CD Maturity

ArgoCon brought together a diverse group of companies, each on a unique journey with Argo CD—from early manual implementations to automated deployment strategies. Many attendees expressed a growing interest in achieving higher levels of automation with Argo CD, although automated rollbacks and robust automation practices are still evolving across organizations. 

Harness GitOps extends GitOps with automated pull request pipelines, simplifying deployments across services and environments without manual promotion. With continuous health monitoring and intelligent rollback capabilities, it leverages observability tools to ensure smooth rollbacks if issues arise. Harness also offers fully managed GitOps infrastructure, removing the burden of setup and maintenance for users. 

The Harness team engaged in conversations with companies eager to enhance their Argo CD workflows and address real-world challenges, underscoring the excitement and potential for Argo CD’s continued growth in the DevOps space.

OpenTofu Day: A Community United in Open-Source Enthusiasm

OpenTofu Day was marked by palpable enthusiasm, with the community rallying around OpenTofu’s potential to become an open-source alternative in the infrastructure as code (IaC) space. Attendees shared insights on the importance of open standards and collaboration as OpenTofu gains traction. The Harness team witnessed the community’s passion for creating a transparent, interoperable IaC solution, highlighting OpenTofu’s promise as a pivotal project in the cloud-native ecosystem.

Building on this energy, Harness IaCM (Infrastructure as Code Management) provides a unified platform that enables developers and cloud engineers to manage Terraform/OpenTofu Infrastructure-as-Code in a reliable, repeatable way. With integrated self-service IaC, shared pipelines, and CI/CD automation, it accelerates adoption, mitigates risks, and enhances governance. By offering visibility, reusable templates, and cost control features, Harness IaCM helps teams streamline infrastructure changes, reduce errors, speed up deployments, and maintain budget compliance—contributing to the broader goal of collaboration and efficiency in the IaC space.

Harness at KubeCon: Highlights 

Harness had an impactful presence at KubeCon North America 2024, engaging attendees through insightful talks, exciting booth activities, and networking events. Here are some of the key highlights:

Database DevOps: CD for Stateful Applications

On Thursday, November 14, Stephen Atwell from Harness and Christopher Crow from Pure Storage presented “Database DevOps: CD for Stateful Applications.” This session tackled the unique challenges of deploying stateful applications on Kubernetes, demonstrating how Continuous Delivery (CD) can simplify the management of persistent data and database structural changes. They showcased how tools like Kubernetes and Liquibase can transform stateful application deployment into a repeatable, testable process. Through a live demo, Stephen and Chris demonstrated how CD tooling can automate data migrations within Kubernetes, providing a more reliable and efficient deployment method for stateful applications.
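Liquibase makes database changes repeatable by tracking them as versioned changesets. As a minimal illustration of the approach (not the demo from the session; table and author names here are hypothetical), a formatted-SQL changelog might look like:

```sql
--liquibase formatted sql

--changeset alice:1
CREATE TABLE orders (
    id BIGINT PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    created_at TIMESTAMP NOT NULL
);
--rollback DROP TABLE orders;

--changeset alice:2
ALTER TABLE orders ADD COLUMN status VARCHAR(32) DEFAULT 'new';
--rollback ALTER TABLE orders DROP COLUMN status;
```

Each changeset is applied exactly once and recorded in Liquibase’s tracking table, which is what lets a CD pipeline treat schema changes like any other deployable artifact, with testable rollbacks when something goes wrong.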

KubeCon Kick-off Event at Flanker Kitchen + Sporting Club

Harness hosted a lively kick-off event on Tuesday, November 12, at Flanker Kitchen + Sporting Club. Attendees enjoyed cocktails, small bites, and the chance to connect with technical experts from Harness. The event offered a relaxed setting to explore Harness’s product suite and have in-depth discussions about the latest in DevOps and cloud-native solutions.

Increased Interest in Open Source and Scalable Platforms

Throughout KubeCon, Harness observed a growing interest from customers in open-source software (OSS) as they increasingly recognize its value in modern infrastructure. Alongside this interest, attendees expressed a desire for robust, scalable platform solutions that can integrate with open-source tools, underscoring a shift towards a more resilient and flexible approach to cloud-native development.

The excitement around Harness Open Source at this year’s KubeCon was inspiring. Attendees engaged in deep technical discussions and hands-on demos, eager to explore how Harness’s open-source platform can transform cloud-native workflows. By integrating source code management, CI/CD pipelines, hosted development environments, and artifact management, Harness Open Source provides a cohesive approach that simplifies operations, reduces setup time, and enables teams to focus on building and deploying applications efficiently. 

Interested in trying Harness Open Source? Visit Harness Developer Hub to learn more and get started today.

Engagement at Booth J7

Harness’s booth, J7, was a vibrant hub of activity, drawing crowds with demos, raffles, and interactive experiences. The team offered live demonstrations of Harness’s latest features, provided insights on continuous delivery best practices, and hosted exciting LEGO and swag giveaways that added an element of fun to the learning experience.

With engaging presentations, memorable events, and hands-on demos, Harness’s presence at KubeCon showcased its commitment to empowering organizations on their cloud-native and DevOps journeys.

Harness Crew at KubeCon North America 2024

Next Stop: London for KubeCon Europe 2025!

As KubeCon North America 2024 wraps up, we're already looking forward to our next big destination: London, for KubeCon Europe 2025! Join us as we dive deeper into the latest advancements in cloud-native technology, DevOps, and open-source innovation. We’ll bring more hands-on demos, insightful talks, and new announcements tailored to help you scale your operations with confidence.

If you’re planning to attend, keep an eye on our social channels and blog for updates on our sessions, events, and exclusive opportunities to connect with the Harness team. Don’t miss out—register for KubeCon Europe 2025 and be part of the conversation that’s shaping the future of cloud-native!

Ready to meet us in London? Follow Harness on LinkedIn for the latest news and event details, and sign up for updates on our KubeCon Europe plans. See you in London!
