Build Custom Large-Scale Generative AI Models | NVIDIA GTC
TL;DR
Adobe's CTO explains why the company chose to build proprietary generative AI models from scratch to ensure legal compliance and creative control, then details how they discovered that naive scaling approaches resulted in GPUs sitting idle 60-70% of the time due to coordination bottlenecks.
🎯 Strategic Decision to Build
Professional creatives reject prompt roulette
Off-the-shelf models produced random outputs unsuitable for Adobe's customers, who required precise iterative control rather than gambling with text prompts to achieve their specific vision.
Legal liability blocked enterprise adoption
Enterprise legal departments refused available models due to copyright and IP training risks, forcing Adobe to use fully licensed, human-moderated datasets with complete provenance tracing.
Differentiation justified massive investment
Adobe determined that control and legal compliance provided sufficient competitive differentiation to justify building custom frontier models despite requiring millions in GPU infrastructure.
🛠️ Initial Technical Architecture
Naive scaling with PyTorch Lightning
The team initially approached large-scale training as simply adding more GPUs to standard loops, using PyTorch Lightning and AWS S3 storage with thousands of NVIDIA A100s.
Simple data parallelism strategy
They implemented straightforward data parallelism that split petabytes of training data across GPUs, each GPU processing its shard independently before a manager node collated the updates.
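The pattern described above can be sketched in a few lines. This is a hypothetical toy illustration of the shard/compute/collate cycle, not Adobe's actual training code; the function names and the use of a simple mean as a stand-in gradient are assumptions for clarity.

```python
# Toy sketch of naive data parallelism: shard the data, let each
# "worker" compute a gradient independently, then have a manager
# node average the results into one synchronized update.

def shard(data, num_workers):
    """Split the dataset into one shard per worker (round-robin)."""
    return [data[i::num_workers] for i in range(num_workers)]

def worker_gradient(worker_shard):
    """Stand-in for a GPU's backward pass: here, just the shard mean."""
    return sum(worker_shard) / len(worker_shard)

def manager_step(gradients):
    """Manager node collates worker updates by averaging them."""
    return sum(gradients) / len(gradients)

data = list(range(32))                      # toy stand-in for the dataset
shards = shard(data, num_workers=4)
grads = [worker_gradient(s) for s in shards]
update = manager_step(grads)                # one synchronized update per step
print(update)
```

The key property, and the eventual weakness, is the single collation point: every worker must finish and report before the manager can produce the next update.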
Early validation masked inefficiency
The first working model validated the technical approach but exhibited typical early-generation artifacts while hiding severe resource underutilization that threatened cost viability.
⚠️ The Utilization Crisis
GPUs idle 60-70% of training time
Profiling revealed GPUs sat idle approximately two-thirds of the time, waiting for a manager node to gather and merge model updates, effectively wasting $600,000 of every $1 million spent.
Data parallelism fails beyond 16 GPUs
The straightforward data-parallelism approach creates coordination bottlenecks that worsen sharply when scaling beyond roughly 16 parallel processors, making it unsuitable for frontier model training.
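A simple cost model shows why adding workers stops helping. This is an illustrative model with made-up constants, not Adobe's profiling data: compute time shrinks with worker count, but a central manager that gathers updates serially adds a cost that grows with worker count, so utilization collapses once coordination dominates.

```python
# Illustrative (assumed) cost model for naive data parallelism with a
# central manager node. Per-step time = parallel compute + serial sync.

def step_time(workers, compute=160.0, per_worker_sync=1.0):
    compute_part = compute / workers        # parallel speedup
    sync_part = per_worker_sync * workers   # manager gathers one-by-one
    return compute_part + sync_part

def utilization(workers, compute=160.0, per_worker_sync=1.0):
    """Fraction of a step the GPUs spend computing rather than waiting."""
    compute_part = compute / workers
    return compute_part / step_time(workers, compute, per_worker_sync)

for n in (4, 16, 64):
    print(n, round(utilization(n), 2))
```

With these toy constants, 4 workers stay ~91% busy, but at 16 workers utilization falls below 40% (GPUs idle over 60% of the time, in line with the profiling result above), and at 64 workers the cluster is almost entirely coordination overhead.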
Storage and checkpointing bottlenecks
Loading petabytes from distributed storage and saving massive checkpoints as insurance against failures created constant stalls, compounded by CPU preprocessing delays and unnecessary GPU synchronization.
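The checkpoint stall can be demonstrated with a toy timing experiment. This is a minimal sketch under assumed delays, not Adobe's pipeline: a blocking checkpoint write halts every step, whereas handing the snapshot to a background writer keeps the training loop moving.

```python
import threading
import time

def write_checkpoint(snapshot, delay=0.05):
    time.sleep(delay)              # stand-in for a slow storage write

def train_step(delay=0.01):
    time.sleep(delay)              # stand-in for GPU compute

# Blocking: every checkpoint write stalls the training loop.
start = time.perf_counter()
for step in range(4):
    train_step()
    write_checkpoint({"step": step})
blocking = time.perf_counter() - start

# Overlapped: the write happens off the critical path in a thread.
start = time.perf_counter()
writers = []
for step in range(4):
    train_step()
    t = threading.Thread(target=write_checkpoint, args=({"step": step},))
    t.start()
    writers.append(t)
for t in writers:                  # drain pending writes at the end
    t.join()
overlapped = time.perf_counter() - start
```

With a 0.05 s write and a 0.01 s step, the blocking loop takes roughly 0.24 s against roughly 0.09 s overlapped; real pipelines use the same idea (asynchronous checkpointing) at far larger scale.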
Bottom Line
Scaling AI training requires fundamental pipeline architecture changes to eliminate coordination overhead and storage bottlenecks, not simply adding more GPUs, because standard data parallelism grows increasingly inefficient beyond small clusters.