Build Custom Large-Scale Generative AI Models | NVIDIA GTC

| Podcasts | April 08, 2026 | 1.68 Thousand views | 39:02

TL;DR

Adobe's CTO explains why the company chose to build proprietary generative AI models from scratch: to ensure legal compliance and creative control. The talk then details how the team discovered that naive scaling left GPUs idle 60-70% of the time because of coordination bottlenecks.

🎯 Strategic Decision to Build 3 insights

Professional creatives reject prompt roulette

Off-the-shelf models produced random outputs unsuitable for Adobe's customers, who required precise iterative control rather than gambling with text prompts to achieve their specific vision.

Legal liability blocked enterprise adoption

Enterprise legal departments rejected the available models because of copyright and IP risks in their training data, forcing Adobe to use fully licensed, human-moderated datasets with complete provenance tracing.

Differentiation justified massive investment

Adobe determined that control and legal compliance provided enough competitive differentiation to justify building custom frontier models, despite the millions of dollars in GPU infrastructure this required.

🛠️ Initial Technical Architecture 3 insights

Naive scaling with PyTorch Lightning

The team initially treated large-scale training as a matter of adding more GPUs to standard training loops, using PyTorch Lightning and AWS S3 storage with thousands of NVIDIA A100s.

Simple data parallelism strategy

They implemented straightforward data parallelism that split petabytes of training data across GPUs: each GPU processed its share of the data independently, then a manager node collated the resulting model updates.
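The pattern described here can be sketched in a few lines. This is a minimal illustration, not Adobe's implementation: the one-parameter model, the toy dataset, and the names `local_gradient` and `manager_step` are all assumptions made for the example, and the "workers" run sequentially in one process rather than on separate GPUs.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def local_gradient(weight: float, shard: List[Sample]) -> float:
    """Gradient of mean squared error for y = weight * x on one worker's shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


def manager_step(weight: float, shards: List[List[Sample]], lr: float = 0.01) -> float:
    """One synchronous step: each worker computes a gradient, the manager averages them."""
    # Hypothetical workers; in a real cluster these would run in parallel on separate GPUs.
    grads = [local_gradient(weight, shard) for shard in shards]
    # The manager cannot proceed until every worker has reported.
    avg_grad = sum(grads) / len(grads)
    return weight - lr * avg_grad


# Toy data following y = 3x, split evenly across 4 hypothetical workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

weight = 0.0
for _ in range(200):
    weight = manager_step(weight, shards)

print(f"learned weight: {weight:.6f}")
```

The point to notice is the averaging line: no worker can start its next step until the manager has heard from all of them, which is where waiting time can accumulate as the worker count grows.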

Early validation masked inefficiency

The first working model validated the technical approach but exhibited typical early-generation artifacts while hiding severe resource underutilization that threatened cost viability.

⚠️ The Utilization Crisis 3 insights

GPUs idle 60-70% of training time

Profiling revealed that GPUs sat idle approximately two-thirds of the time while waiting for a manager node to gather and merge model updates, effectively wasting at least $600,000 of every $1 million spent.
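The cost figure follows directly from the idle fraction. As a rough sketch, assuming spend is spread evenly over wall-clock GPU time (the function names below are illustrative, not from the talk):

```python
def wasted_spend(total_spend: float, idle_fraction: float) -> float:
    """Portion of the budget paid for time in which GPUs did no useful work."""
    if not 0.0 <= idle_fraction < 1.0:
        raise ValueError("idle_fraction must be in [0, 1)")
    return total_spend * idle_fraction


def cost_multiplier(idle_fraction: float) -> float:
    """How much more each useful GPU-hour costs compared with full utilization."""
    if not 0.0 <= idle_fraction < 1.0:
        raise ValueError("idle_fraction must be in [0, 1)")
    return 1.0 / (1.0 - idle_fraction)


for idle in (0.60, 0.70):
    print(
        f"idle {idle:.0%}: "
        f"wasted ${wasted_spend(1_000_000, idle):,.0f} per $1M, "
        f"useful work costs {cost_multiplier(idle):.2f}x"
    )
```

Under that assumption, the 60-70% idle range corresponds to $600,000 to $700,000 lost per $1 million, or equivalently each hour of useful compute costing about 2.5 to 3.3 times its list price.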

Data parallelism fails beyond 16 GPUs

The straightforward data parallelism approach creates exponentially worse coordination bottlenecks when scaling beyond roughly 16 parallel processors, making it unsuitable for frontier model training.
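A toy model helps show why a single manager node becomes the bottleneck. The sketch below assumes each GPU takes a fixed time to compute its update and that the manager merges updates one at a time; both timing constants are made up for illustration and are not measurements from the talk.

```python
def utilization(
    num_gpus: int,
    compute_seconds: float = 1.0,        # assumed per-step compute time on each GPU
    merge_seconds_per_gpu: float = 0.1,  # assumed time for the manager to merge one update
) -> float:
    """Fraction of each step that GPUs spend computing rather than waiting."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    wait_seconds = merge_seconds_per_gpu * num_gpus  # manager handles updates serially
    return compute_seconds / (compute_seconds + wait_seconds)


for n in (1, 4, 16, 64, 256):
    print(f"{n:>4} GPUs -> {utilization(n):.1%} busy")
```

In this simplified model the wait grows linearly with GPU count, so utilization keeps falling as the cluster grows. It does not capture network contention or straggling workers, which may make the degradation on a real cluster steeper.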

Storage and checkpointing bottlenecks

Loading petabytes from distributed storage and saving massive checkpoints as insurance against failures created constant stalls, compounded by CPU preprocessing delays and unnecessary GPU synchronization.
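One common way to hide loading and preprocessing latency is to prepare the next batches in the background while the current one is being processed. The sketch below shows that general idea with the standard library only. It is not drawn from the talk, and `prefetch` and `slow_loader` are hypothetical names, with `slow_loader` standing in for reads from remote storage.

```python
import queue
import threading
import time
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream


def prefetch(batches: Iterable[T], buffer_size: int = 2) -> Iterator[T]:
    """Yield batches in order while a background thread loads the next ones."""
    buf: "queue.Queue[object]" = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        try:
            for batch in batches:
                buf.put(batch)  # blocks while the buffer is full
        finally:
            # Always signal completion so the consumer never hangs.
            # Loader errors are not propagated in this sketch.
            buf.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item  # type: ignore[misc]


def slow_loader(num_batches: int) -> Iterator[List[int]]:
    """Stand-in for slow storage reads plus CPU preprocessing."""
    for i in range(num_batches):
        time.sleep(0.01)
        yield [i, i + 1]


consumed = list(prefetch(slow_loader(5)))
print(consumed)
```

This addresses only the input side of the pipeline. Checkpoint stalls and GPU synchronization are separate problems that this sketch does not touch.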

Bottom Line

Scaling AI training is not simply a matter of adding more GPUs. It requires fundamental changes to the pipeline architecture to eliminate coordination overhead and storage bottlenecks, because standard data parallelism becomes exponentially inefficient beyond small clusters.
