Stanford CS547 HCI Seminar | Winter 2026 | Visual and Algorithmic Interpretation for Responsible AI
Fine-tuning large language models risks sudden, catastrophic failure of safety guardrails: unlike capability metrics, which degrade gradually, safety alignment tends to break abruptly. The researchers demonstrate that dynamically segmenting training data into safe and unsafe portions, rather than applying a binary keep-or-discard filter, maintains both safety alignment and model performance.
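As a rough illustration of the contrast the abstract draws, consider this minimal Python sketch. The `Example` class, `safety_score` field, threshold, and mixing fraction are all hypothetical, since the abstract does not specify the segmentation mechanism; the point is only that segmentation retains both portions for separate, controllable treatment, whereas a binary filter discards the unsafe portion outright.

```python
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    safety_score: float  # hypothetical per-example safety score in [0, 1]

def binary_filter(data: list[Example], threshold: float = 0.5) -> list[Example]:
    """Baseline: keep-or-discard filtering. Everything below the
    threshold is thrown away and cannot inform fine-tuning."""
    return [ex for ex in data if ex.safety_score >= threshold]

def segment(data: list[Example],
            threshold: float = 0.5) -> tuple[list[Example], list[Example]]:
    """Segmentation: partition the data into safe and unsafe portions
    instead of discarding. Both portions remain available, so they can
    be weighted or scheduled separately during fine-tuning."""
    safe = [ex for ex in data if ex.safety_score >= threshold]
    unsafe = [ex for ex in data if ex.safety_score < threshold]
    return safe, unsafe

def build_mix(safe: list[Example],
              unsafe: list[Example],
              unsafe_fraction: float = 0.1) -> list[Example]:
    """One illustrative (assumed) use of the segments: blend a small,
    controlled share of unsafe data back in, rather than all or none."""
    k = int(unsafe_fraction * len(safe))
    return safe + unsafe[:k]
```

A dynamic variant might adjust `threshold` or `unsafe_fraction` over the course of training; the abstract leaves those details to the talk.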