- Problem
- Large language model (LLM) outputs needed rigorous, repeatable validation for truthfulness, robustness, bias, and strict instruction-following, a task directly analogous to model-risk review.
- Approach
- Evaluated model outputs against rule-based guidelines, authored the evaluation documentation, and refined the evaluation frameworks. Built scalable multilingual (German/English) NLP data-ingestion and ETL workflows in Databricks and Azure to feed the reviews.
- Result
- Improved human–AI alignment, explainability of judgments, and reproducibility of reviews across the evaluation programme (Jan–Sep 2025).
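
As an illustration of what evaluating outputs against rule-based guidelines can look like, the sketch below applies a small set of named checks to a model output and reports which ones pass. It is a minimal sketch and not the framework used in the programme: the `Rule` class, the `evaluate` and `failed_rules` helpers, and the three example rules are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """One named guideline check (hypothetical structure, for illustration)."""
    name: str
    check: Callable[[str], bool]  # returns True when the output passes


def evaluate(output: str, rules: list[Rule]) -> dict[str, bool]:
    """Apply every rule to one model output and record pass/fail per rule."""
    return {rule.name: rule.check(output) for rule in rules}


def failed_rules(output: str, rules: list[Rule]) -> list[str]:
    """Return the names of the rules the output violates, in rule order."""
    return [name for name, passed in evaluate(output, rules).items() if not passed]


# Example guidelines; these are assumptions, not the programme's actual rules.
RULES = [
    Rule("non_empty", lambda text: bool(text.strip())),
    Rule("max_50_words", lambda text: len(text.split()) <= 50),
    Rule("no_first_person", lambda text: " i " not in f" {text.lower()} "),
]

if __name__ == "__main__":
    print(failed_rules("I think the invoice total is 42 EUR.", RULES))
```

Keeping each guideline as a named, deterministic check means the same output always yields the same verdict, and a reviewer can see exactly which rule failed. That is one simple way to support the explainability and reproducibility goals described above.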