Company
Chuhaijiang, developed by Data Nebula, is a social commerce platform designed to support players in the cross-border e-commerce ecosystem with data-driven insights and intelligent workflows.
Leveraging big data and AI, the platform helps businesses analyze market trends and optimize operational strategies.
Chuhaijiang is a data-centric company whose core offerings include:
- Real-time analytics: Monitoring social commerce platforms to track user engagement, product popularity, and conversion rates.
- AI-powered insights: Using machine learning to uncover emerging trends in global markets and recommend high-potential products.
- Data visualization: Presenting key market data through intuitive dashboards and reports to guide ad placement and marketing strategies.
Architecture Optimization
Previously, Chuhaijiang ran a hybrid infrastructure: real-time analytics workloads ran on AWS EC2 on-demand instances, while large-scale data processing was done on-premises.
Following a comprehensive evaluation by CloudPilot AI, the team improved its architecture. They first migrated all big data workloads to the cloud, then adopted Kubernetes to take full advantage of cloud elasticity. Finally, they moved workloads to AWS Spot Instances to reduce costs.
With CloudPilot AI's optimization, the company maintained high performance and service stability while significantly cutting cloud expenses.
Challenges
High EC2 Costs Limiting Growth
Chuhaijiang’s real-time data analytics—core to its business—originally ran on AWS on-demand EC2 instances. However, this setup didn’t take advantage of elastic compute resources or cost-efficient Spot Instances.
As workloads ran continuously, EC2 costs soared. Meanwhile, big data tasks remained on-premises, constrained by limited compute capacity—making it difficult to scale with growing business demands.
Complex Resource Management and Limited Scalability
Before adopting Amazon EKS (Elastic Kubernetes Service), Chuhaijiang managed compute resources manually, without automated scaling. This created several operational challenges:
- Low resource utilization: On-demand instances were often either underused or overloaded.
- Slow response to traffic spikes: Manual scaling couldn't keep up with changing workloads, affecting real-time analytics.
- High operational overhead: The team spent significant effort managing infrastructure instead of focusing on core development.
Concerns Around Spot Instance Reliability
While Spot Instances offer major cost savings, Chuhaijiang was initially hesitant to adopt them due to several risks:
- Unpredictable interruptions could lead to task failures and impact service reliability.
- Lack of intelligent scheduling made it hard to allocate resources efficiently across availability zones.
- Spark jobs involved complex communication patterns, so frequent Spot instance changes risked increasing cross-AZ network costs and potentially offsetting the savings.
Caught between rising cloud costs and the need for scalable infrastructure, Chuhaijiang needed a solution that could deliver both efficiency and stability.
Results
- ✅ 60% Reduction in EC2 Costs: Smarter Spot selection and autoscaling cut heavy reliance on on-demand instances.
- ✅ Increased Compute Stability: 120-minute interruption prediction and auto-migration kept workloads running smoothly.
- ✅ Faster Spark Performance: Running tasks in the same availability zone reduced network latency.
- ✅ Lower Operational Overhead: Fully automated scheduling freed the team from manual instance management.
Solution
To support rapidly growing workloads, Chuhaijiang decided to fully migrate its big data infrastructure to the cloud—boosting flexibility and scalability. At the same time, the team needed to reduce cloud spending to ensure long-term sustainability.
After extensive research and testing, Chuhaijiang chose CloudPilot AI as its intelligent cloud optimization partner.
By combining Spot automation, intelligent node selection, and Kubernetes resource optimization, CloudPilot AI helped the team significantly cut costs—without compromising workload stability.
Automated Kubernetes Scaling and Scheduling
With the adoption of Amazon EKS, the team aimed to automate resource scaling to handle fluctuating demand. Previously, they relied on manual instance adjustments — an inefficient and slow process that couldn't respond in real time.
CloudPilot AI introduced intelligent scheduling and autoscaling for Kubernetes workloads. When Spot capacity was available, it prioritized the lowest-cost instances. During peak demand or Spot shortages, it automatically switched to on-demand instances to maintain performance.
This automation not only improved resource efficiency but also helped the team cut cloud costs by 60%.
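The Spot-first, on-demand-fallback behavior described above can be illustrated with a minimal sketch. This is not CloudPilot AI's implementation: the `Offer` type, the `plan_capacity` function, and all prices and availability counts are hypothetical, chosen only to show the ordering logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    instance_type: str
    capacity_type: str   # "spot" or "on-demand"
    hourly_price: float  # USD per hour (hypothetical figures)
    available: int       # instances that can be launched right now


def plan_capacity(offers: list[Offer], needed: int) -> list[tuple[Offer, int]]:
    """Fill `needed` instances: cheapest Spot first, on-demand as fallback."""
    # Spot offers sort ahead of on-demand; within each group, cheapest first.
    ordered = sorted(offers, key=lambda o: (o.capacity_type != "spot", o.hourly_price))
    plan: list[tuple[Offer, int]] = []
    remaining = needed
    for offer in ordered:
        if remaining == 0:
            break
        take = min(offer.available, remaining)
        if take > 0:
            plan.append((offer, take))
            remaining -= take
    if remaining > 0:
        raise RuntimeError(f"{remaining} instances could not be placed")
    return plan


# Hypothetical market snapshot: only 5 Spot instances are available, 8 are needed.
offers = [
    Offer("m5.xlarge", "on-demand", 0.192, 100),
    Offer("m5.xlarge", "spot", 0.070, 3),
    Offer("m5a.xlarge", "spot", 0.065, 2),
]
plan = plan_capacity(offers, needed=8)
for offer, count in plan:
    print(f"{count} x {offer.instance_type} ({offer.capacity_type})")
```

In this snapshot the plan uses all five Spot instances and covers the remaining three with on-demand capacity, which mirrors the fallback during a Spot shortage.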
Intelligent Spot Instance Management
During the PoC phase, the CloudPilot AI team conducted an in-depth analysis of Chuhaijiang’s workloads, focusing on the stability of real-time data analytics and Spark tasks running on Spot Instances.
The evaluation showed that CloudPilot AI's optimization strategies could significantly reduce computing costs while ensuring task stability. As a result, Chuhaijiang decided to fully migrate its real-time analytics and big data workloads to Spot Instances.
Previously, the team was concerned that Spot Instances might be terminated unexpectedly, disrupting computations and affecting data accuracy.
To mitigate this, CloudPilot AI implemented a 120-minute early-warning system (compared to AWS's 2-minute notice) that raised alerts when Spot Instances were at risk of being reclaimed. The system would then automatically migrate tasks to more stable compute resources.
This proactive approach and automated task migration ensured continuity, minimized risks from Spot terminations, and preserved cost savings without manual intervention.
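As a rough illustration of the predict-then-migrate flow, the sketch below cordons nodes flagged as at risk and moves their tasks to the lowest-risk node with free capacity. The prediction model itself is not shown: `interruption_risk` stands in for whatever signal the early-warning system produces, and the `Node` type, `migrate_at_risk` function, and `RISK_THRESHOLD` value are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity_type: str        # "spot" or "on-demand"
    interruption_risk: float  # hypothetical predicted reclaim probability, 0..1
    free_slots: int           # how many more tasks the node can accept
    tasks: list[str] = field(default_factory=list)
    schedulable: bool = True


RISK_THRESHOLD = 0.5  # hypothetical cut-off for "at risk"


def migrate_at_risk(nodes: list[Node], threshold: float = RISK_THRESHOLD):
    """Cordon Spot nodes at or above the threshold and move their tasks away."""
    at_risk = [
        n for n in nodes
        if n.capacity_type == "spot" and n.interruption_risk >= threshold
    ]
    # Cordon first so no at-risk node is chosen as a migration target.
    for node in at_risk:
        node.schedulable = False

    moves: list[tuple[str, str, str]] = []
    for node in at_risk:
        for task in list(node.tasks):
            targets = [t for t in nodes if t.schedulable and t.free_slots > 0]
            if not targets:
                break  # no capacity left; remaining tasks stay where they are
            target = min(targets, key=lambda t: t.interruption_risk)
            node.tasks.remove(task)
            node.free_slots += 1
            target.tasks.append(task)
            target.free_slots -= 1
            moves.append((task, node.name, target.name))
    return moves


nodes = [
    Node("spot-a", "spot", 0.8, 0, ["etl-1", "etl-2"]),
    Node("spot-b", "spot", 0.1, 1),
    Node("od-a", "on-demand", 0.0, 4),
]
moves = migrate_at_risk(nodes)
print(moves)
```

With a long warning window, a move like this can happen well before the node is reclaimed, rather than inside the 2-minute notice.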
Optimizing Spark Task AZ Affinity Scheduling
In addition to Spot instance instability, the team faced performance bottlenecks due to network interactions in Spark computations. Since Spark tasks rely on multiple nodes for distributed computing, tasks spread across availability zones (AZs) generated cross-AZ traffic, which added latency and network transfer costs.
To address this, CloudPilot AI optimized the scheduling logic for Spark tasks so that they preferentially ran within the same availability zone. This reduced network transfer costs and improved data processing efficiency.
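The source does not say how CloudPilot AI expresses this preference. One common way to get a similar effect on Kubernetes is a soft pod affinity keyed on the well-known `topology.kubernetes.io/zone` node label, as in the sketch below. The `spark-app` label name, the `executor_pod_template` helper, and the `nightly-etl` value are hypothetical.

```python
import json

# Well-known Kubernetes node label that identifies the availability zone.
ZONE_TOPOLOGY_KEY = "topology.kubernetes.io/zone"


def executor_pod_template(app_label: str, weight: int = 100) -> dict:
    """Build a pod template that prefers the zone of pods sharing `app_label`.

    The `spark-app` label name is an assumption for this example; the driver
    and executors would both need to carry it for the affinity to match.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"labels": {"spark-app": app_label}},
        "spec": {
            "affinity": {
                "podAffinity": {
                    # "preferred" is a soft rule: the scheduler favors the same
                    # zone but can still place pods elsewhere if it is full.
                    "preferredDuringSchedulingIgnoredDuringExecution": [
                        {
                            "weight": weight,  # valid range is 1 to 100
                            "podAffinityTerm": {
                                "labelSelector": {
                                    "matchLabels": {"spark-app": app_label}
                                },
                                "topologyKey": ZONE_TOPOLOGY_KEY,
                            },
                        }
                    ]
                }
            }
        },
    }


template = executor_pod_template("nightly-etl")
print(json.dumps(template, indent=2))
```

Because JSON is also valid YAML, the printed document could be saved to a file and passed to Spark through `spark.kubernetes.executor.podTemplateFile`. A soft preference is used here so that jobs are not blocked when the preferred zone runs out of capacity.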
Intelligent Node Selection and Dynamic Resource Scheduling
To ensure the stable execution of compute tasks when Spot instance resources are scarce, CloudPilot AI offers intelligent node selection. The system analyzes Spot instance price trends and interruption rates in real time, automatically selecting the most cost-effective and stable instances.
If Spot capacity is insufficient for the compute tasks, the system automatically switches to on-demand instances, so tasks are not interrupted by resource shortages.
This allows Chuhaijiang to run primarily on Spot instances for cost savings while keeping task execution smooth when Spot capacity is scarce.
Source: spot.cloudpilot.ai. Instance prices can vary by 30% across availability zones, making it hard to manually select the most cost-effective and stable ones.
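To show why price alone is a poor selection criterion, here is a minimal scoring sketch that weighs price against interruption rate. The scoring formula, the `interruption_penalty` weight, and every price and rate below are hypothetical; the source does not describe how CloudPilot AI actually ranks instances.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpotCandidate:
    instance_type: str
    zone: str
    hourly_price: float       # USD per hour (hypothetical figures)
    interruption_rate: float  # fraction of instances reclaimed, 0..1 (hypothetical)


def risk_adjusted_price(c: SpotCandidate, interruption_penalty: float = 2.0) -> float:
    """Inflate the price by an assumed penalty for expected interruptions."""
    return c.hourly_price * (1 + interruption_penalty * c.interruption_rate)


def pick_best(candidates: list[SpotCandidate]) -> Optional[SpotCandidate]:
    """Return the candidate with the lowest risk-adjusted price, or None.

    None signals that no Spot option exists and the caller should fall
    back to on-demand capacity.
    """
    if not candidates:
        return None
    return min(candidates, key=risk_adjusted_price)


candidates = [
    SpotCandidate("m5.xlarge", "us-east-1a", 0.070, 0.05),
    SpotCandidate("m5.xlarge", "us-east-1b", 0.060, 0.20),  # cheapest, but unstable
    SpotCandidate("m5a.xlarge", "us-east-1a", 0.066, 0.10),
]
best = pick_best(candidates)
ranking = sorted(candidates, key=risk_adjusted_price)
for c in ranking:
    print(f"{c.instance_type} {c.zone}: {risk_adjusted_price(c):.4f}")
```

With these sample numbers the cheapest offer (us-east-1b) ranks last once its higher interruption rate is factored in, which is the kind of trade-off that is hard to evaluate by hand across many zones and instance types.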
"CloudPilot AI has helped us save 60% on AWS costs while maintaining business performance. In the past, we were concerned about the risk of Spot instance termination, but now, with CloudPilot AI’s 120-minute interruption prediction and intelligent migration, our services remain stable.
Additionally, CloudPilot AI supported the smooth migration of our big data operations to the cloud and optimized Spark tasks, reducing unnecessary data transfer costs. It not only eased our operational burden but also allowed us to focus more on business growth instead of being distracted by cost and resource management."
Wang Ruiheng, Infra Lead at Data Nebula
Next steps
With CloudPilot AI, Chuhaijiang successfully optimized cloud-based resource management, achieving both cost reduction and improved computing efficiency.
Intelligent resource management and automated scheduling let the team handle varying compute demands flexibly while lowering management costs and increasing resource utilization, leaving room for further business expansion.
Looking ahead, the Chuhaijiang team plans to extend CloudPilot AI's capabilities to more compute tasks, further optimizing their cost efficiency. They also anticipate greater support from CloudPilot AI in finer-grained resource scheduling and cost forecasting to enhance their business competitiveness.