One of the most common pitfalls for enterprises adopting AI is reducing the problem to "buy more GPUs." However, as models evolve from standard dense LLMs to Mixture-of-Experts (MoE) architectures, the inference bottleneck shifts from compute density to communication latency and memory bandwidth. Google Cloud shared a reference inference solution centered on A4X (GB200 NVL72) and NVIDIA Dynamo, which treats inference as a systems engineering challenge spanning an infrastructure layer, a serving layer, and an orchestration layer.
The most critical insights for "ERP + AI"
- AI is not a monolithic application: serving multiple business lines, calls from multiple systems, and growing concurrency requires deliberate engineering.
- Cost and performance must be traded off per scenario: batch processing and real-time Q&A have different throughput and latency requirements, which calls for a layered architecture.
- Platformization matters more than stacking hardware: orchestration (such as K8s/GKE), cache management, and observability determine whether large-scale reuse is possible.
Three-tier architecture (translated into enterprise terms)
- Infrastructure layer: Compute + Network + Storage (determines bandwidth, latency, stability).
- Serving layer: Model runtime/inference engine (KV cache, scheduling, parallel strategy).
- Orchestration layer: Resource lifecycle, scaling, disaster recovery, quotas, and scheduling policies.
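To make the serving layer's KV cache concern concrete, here is a rough sizing sketch. It assumes a standard attention layout in which every token stores one key vector and one value vector per layer per KV head. The function name and the model dimensions are hypothetical and chosen only for illustration; they are not taken from the Google Cloud post.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size in bytes.

    The leading factor 2 accounts for storing both a key and a value
    vector per token, per layer, per KV head. bytes_per_elem=2 assumes
    16-bit values (e.g. FP16/BF16).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, for illustration only.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=8192, batch_size=16)
print(f"{size / 2**30:.1f} GiB")  # prints "16.0 GiB"
```

Because the estimate grows linearly with both context length and concurrent requests, cache placement and memory bandwidth tend to become design constraints well before raw compute does.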
Implementation Advice (Get It Right First, Then Scale Up)
- Classify business scenarios by SLA: real-time (low latency)/near real-time/offline batch processing.
- Prioritize the design of "Data and Cache": context, vector database, KV cache, hot-cold tiering.
- Make observability the default: track cost, latency, and failure reasons for every inference call.
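The first and third points above can be sketched together: classify each request by its latency budget, then wrap every inference call so that cost, latency, and failure reason are always recorded. This is a minimal sketch under stated assumptions. The tier thresholds, the per-token price, and the `infer` callable (assumed to return the number of tokens consumed) are all hypothetical placeholders rather than part of any real serving API.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class SlaTier(Enum):
    REAL_TIME = "real_time"
    NEAR_REAL_TIME = "near_real_time"
    OFFLINE_BATCH = "offline_batch"


def classify_by_sla(max_latency_s: float) -> SlaTier:
    """Map a latency budget to a tier. Thresholds are illustrative assumptions."""
    if max_latency_s <= 2.0:
        return SlaTier.REAL_TIME
    if max_latency_s <= 60.0:
        return SlaTier.NEAR_REAL_TIME
    return SlaTier.OFFLINE_BATCH


@dataclass
class InferenceRecord:
    tier: SlaTier
    latency_s: float
    tokens: int
    cost_usd: float
    failure_reason: Optional[str]  # None means the call succeeded


def observed_call(infer: Callable[[str], int], prompt: str, tier: SlaTier,
                  usd_per_1k_tokens: float,
                  log: List[InferenceRecord]) -> InferenceRecord:
    """Run one inference call and always append a record, even on failure."""
    start = time.perf_counter()
    tokens, failure = 0, None
    try:
        tokens = infer(prompt)  # hypothetical backend: returns tokens consumed
    except Exception as exc:
        failure = f"{type(exc).__name__}: {exc}"
    latency = time.perf_counter() - start
    cost = tokens / 1000 * usd_per_1k_tokens
    record = InferenceRecord(tier, latency, tokens, cost, failure)
    log.append(record)
    return record


# Usage with a stand-in backend that counts whitespace-separated words.
def fake_infer(prompt: str) -> int:
    return len(prompt.split())


call_log: List[InferenceRecord] = []
rec = observed_call(fake_infer, "summarize open purchase orders",
                    classify_by_sla(1.5), usd_per_1k_tokens=0.5, log=call_log)
print(rec.tier.value, rec.tokens, rec.failure_reason)
```

Recording the tier alongside cost and latency makes it possible to check later whether each business scenario is actually meeting the SLA it was assigned.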
References
Google Cloud Blog: Scaling MoE inference with NVIDIA Dynamo on Google Cloud A4X
