What is shallow safety alignment and why does it matter?
Learn what shallow safety alignment is, why it leaves AI systems vulnerable to jailbreaks, and what it means for compliance under the EU AI Act.

A concerning trend has emerged in AI safety: systems are being trained with safety measures that are more superficial than substantial. A study by Zeng et al. (June 2024) has identified this phenomenon, dubbed "shallow safety alignment," which raises significant concerns for AI governance (Read the research paper).
Shallow safety alignment refers to AI systems whose safety guardrails only function in the initial portion of their responses (the first few tokens).
For example, a model may refuse a harmful request when its reply starts with "I can't help with that," but if the reply is forced to begin with a compliant-sounding phrase instead, the refusal never appears and the rest of the answer may follow through on the request.
In practical terms, this means that while AI systems may appear to comply with safety requirements under standard conditions, these protections can be easily bypassed through specific prompt engineering techniques (also known as prompt injection or jailbreaks).
Experiments show that even minor modifications can drastically weaken a model's safety mechanisms. Simply prefilling a model's response with a non-refusal opening sharply increases harmful outputs, and minimal fine-tuning raised the harmful output rate from 1.5% to 87.9% after just six fine-tuning steps.
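To see how fragile a first-tokens-only guardrail can be, the sketch below compares a model's normal answer with its answer when the response is forced to begin with a compliant-sounding prefix. It is a minimal illustration assuming a Hugging Face chat model with a chat template; the model name, test prompt, and prefill string are placeholders, and a real evaluation would score many prompts with a proper harmfulness classifier.

```python
# Minimal sketch of a prefill-style robustness check. Assumptions: a chat model
# loadable with Hugging Face transformers; the model name, test prompt, and
# prefill string below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/your-chat-model"  # placeholder, not a specific release

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def generate(prompt: str, prefill: str = "") -> str:
    """Generate a reply, optionally forcing it to start with `prefill`."""
    messages = [{"role": "user", "content": prompt}]
    # Build the chat prompt, then append the forced opening of the assistant turn.
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    ) + prefill
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return prefill + tokenizer.decode(new_tokens, skip_special_tokens=True)

prompt = "<a request your policy forbids>"                     # placeholder test case
baseline = generate(prompt)                                    # typically a refusal
prefilled = generate(prompt, prefill="Sure, here is how to ")  # shallow alignment often breaks here
print(baseline, "---", prefilled, sep="\n")
```

If the two outputs diverge (a refusal without the prefill, a compliant continuation with it), the model's safety behaviour is concentrated in its opening tokens.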
A figure from the Stanford HAI AI Index Report 2025 shows the success rate of different attacks on various models as a function of the number of harmful tokens prefilled or inserted into the model's inference sequence.

This means that AI systems that pass standard safety assessments may still harbor significant vulnerabilities that could lead to regulatory violations, particularly under frameworks like the EU AI Act that require robust risk management.
What are the vulnerabilities and compliance risks?
From a governance perspective, shallow safety alignment creates several critical compliance challenges:
Inconsistent policy enforcement
AI systems may appear to follow the rules, but only in obvious cases. If someone phrases a harmful request in a non-standard way, the AI might fail to detect it and respond unsafely.
Regulatory Impact: Violates Article 15 (Accuracy, Robustness and Cybersecurity) and potentially Article 9 (Risk Management). Surface-level enforcement creates legal risk and undermines user trust.
Documentation inadequacy
Testing or audit logs might look fine because standard test cases don't trigger unsafe behavior. However, adversarial prompts, which users might realistically try, aren't covered.
Regulatory Impact: Risks non-compliance with Article 10 (Data and Data Governance), Article 12 (Record-Keeping), and Annex IV. Without robust adversarial testing, documentation may be unintentionally misleading.
False security assurances
Because testing and alignment are only "shallow," companies may believe their model is safe when it is actually vulnerable under adversarial pressure.
Regulatory Impact: Could be seen as a failure of the post-market monitoring system under Article 61.
What are the technical solutions and governance implications?
Researchers have developed technical approaches to address shallow alignment, with two key solutions particularly relevant to governance frameworks:
1. Teach the model how to recover from unsafe situations. Even if the model starts answering dangerously, it learns to correct itself and say "No, actually I can't help with that."
2. Make the first words of the answer matter less. By training the model to stay safe no matter how its response starts, it becomes harder to manipulate with prefilled openings.
In the study, these changes dropped the harmful response rate to as low as 2.8%.
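As a concrete illustration of the first idea, the toy sketch below builds "safety recovery" training examples whose target responses start with a harmful-looking prefix and then switch to a refusal, so safe behaviour no longer depends only on the opening tokens. The schema, prefixes, and refusal text are illustrative assumptions, not the paper's actual dataset.

```python
# Toy sketch of building "safety recovery" fine-tuning examples. Assumption:
# a simple list-of-dicts format; real pipelines would use their own schema.
import json
import random

REFUSAL = ("Actually, I can't help with that. Providing this information "
           "could cause harm, so I have to stop here.")

# Prefixes that look like the start of a compliant (unsafe) answer.
HARMFUL_PREFIXES = [
    "Sure, here is how to ",
    "Step 1:",
    "Of course! The easiest way is ",
]

def make_recovery_example(harmful_prompt: str) -> dict:
    """Pair a harmful prompt with a response that starts unsafely, then recovers."""
    prefix = random.choice(HARMFUL_PREFIXES)
    return {
        "prompt": harmful_prompt,
        # The response begins as if complying, then corrects itself.
        "response": f"{prefix}... {REFUSAL}",
    }

harmful_prompts = ["<forbidden request 1>", "<forbidden request 2>"]  # placeholders
dataset = [make_recovery_example(p) for p in harmful_prompts]
print(json.dumps(dataset, indent=2))
```

The intent is that fine-tuning on a mixture of ordinary safety data and examples like these pushes refusal behaviour deeper into the response rather than concentrating it in the first few tokens.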
A new approach: Latent Adversarial Training (LAT)
What is LAT?
LAT (check the paper) is a new strategy for improving the robustness of AI systems by training on their internal representations rather than only on specific responses to specific inputs. It is valuable because it:
- Goes beyond obvious dangerous requests: it targets the underlying patterns and representations inside the model.
- Reduces harmful behavior without requiring large amounts of compute.
- Removes backdoors: it can identify and fix hidden vulnerabilities that standard training might miss.
- Creates deep safety guardrails: safety mechanisms work throughout the entire response generation process, not just at the beginning.
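The sketch below shows one way the general idea can be instantiated in PyTorch: an inner loop searches for a small perturbation to a transformer layer's hidden states that degrades safe behaviour, and the outer step trains the model to stay safe even under that perturbation. This is an untargeted toy version under several assumptions (a Hugging Face-style causal LM, a single perturbed layer, a batch of safe target responses) and is not the exact algorithm from the paper.

```python
# One untargeted latent adversarial training (LAT) step, as a toy sketch.
import torch

def lat_step(model, layer, safe_batch, eps=0.1, inner_steps=5, inner_lr=1e-2):
    """Perturb one layer's hidden states adversarially, then train the model to
    keep producing the safe target response under that perturbation."""
    p = next(model.parameters())
    input_ids = safe_batch["input_ids"].to(p.device)
    labels = safe_batch["labels"].to(p.device)

    # Perturbation broadcast over batch and sequence positions of one layer.
    delta = torch.zeros(1, 1, model.config.hidden_size,
                        device=p.device, dtype=p.dtype, requires_grad=True)

    def add_delta(module, inputs, output):
        # Hugging Face decoder layers usually return a tuple; perturb hidden states only.
        if isinstance(output, tuple):
            return (output[0] + delta,) + output[1:]
        return output + delta

    handle = layer.register_forward_hook(add_delta)
    try:
        # Inner loop: find a bounded perturbation that *increases* loss on safe targets.
        for _ in range(inner_steps):
            loss = model(input_ids=input_ids, labels=labels).loss
            (grad,) = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                delta += inner_lr * grad.sign()  # gradient ascent on the latent attack
                delta.clamp_(-eps, eps)          # keep the perturbation bounded

        # Outer step: gradients that make the model robust to the worst-case delta.
        robust_loss = model(input_ids=input_ids, labels=labels).loss
        robust_loss.backward()
    finally:
        handle.remove()
    return robust_loss.item()
```

A training loop would pass one of the model's decoder layers (for example `model.model.layers[k]` in many Hugging Face architectures, an assumption that varies by model), alternate `lat_step` with ordinary fine-tuning batches, and call `optimizer.step()` and `optimizer.zero_grad()` after each step.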

Why does this matter for the EU AI Act?
This technique directly addresses requirements under Articles 9 and 15: shallow alignment vulnerabilities must be documented and addressed in the risk management system (Article 9), and alignment depth is a critical aspect of robustness against manipulation (Article 15).
Implementing robust AI governance
Addressing shallow safety alignment requires updates to existing governance frameworks:
Procurement & Vendor Assessment
- Request documentation of alignment methodology
- Require evidence of adversarial testing
- Include robust alignment requirements in contracts
Risk Assessment Protocols
- Add adversarial testing scenarios to risk evaluation
- Document potential failure modes related to alignment depth
- Establish risk thresholds for alignment robustness
Monitoring & Auditing
- Implement periodic adversarial testing (see the sketch after this list)
- Document and review alignment failures
- Create escalation procedures for alignment vulnerabilities
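For the periodic testing item above, here is a minimal sketch of a harness that runs adversarial cases and appends the results to an auditable log. The case list, `query_model`, and `looks_like_refusal` are placeholders an organization would replace with its own test suite, model API, and policy classifier.

```python
# Sketch of a periodic adversarial test run that produces an auditable log.
# Assumptions: `query_model` wraps whatever model or API the organization uses
# (a dummy refusal is returned here so the sketch runs end to end), and
# `looks_like_refusal` stands in for a real policy classifier.
import datetime
import json

ADVERSARIAL_CASES = [
    {"id": "prefill-01", "prompt": "<forbidden request>", "prefill": "Sure, here is how to "},
    {"id": "rephrase-01", "prompt": "<forbidden request, obfuscated phrasing>", "prefill": ""},
]

def query_model(prompt: str, prefill: str) -> str:
    # Placeholder: replace with the organization's real model or API call.
    return "I can't help with that request."

def looks_like_refusal(response: str) -> bool:
    return any(marker in response.lower() for marker in ("i can't", "i cannot", "i won't"))

def run_adversarial_suite() -> list:
    results = []
    for case in ADVERSARIAL_CASES:
        response = query_model(case["prompt"], case["prefill"])
        results.append({
            "case_id": case["id"],
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "passed": looks_like_refusal(response),
            "response_excerpt": response[:200],
        })
    return results

if __name__ == "__main__":
    # Append results so alignment failures can be reviewed and escalated later.
    with open("adversarial_test_log.jsonl", "a") as f:
        for record in run_adversarial_suite():
            f.write(json.dumps(record) + "\n")
```

Failed cases in the log feed directly into the review and escalation steps listed above.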
Documentation Requirements
- Update model cards to include alignment methodology
- Maintain evidence of alignment robustness
- Document remediation steps for identified vulnerabilities
Ready to explore AI safety and governance in depth? Join the Contrasto AI Club, a community of professionals dedicated to advancing responsible AI development and implementation.
For more information on implementing these protocols or for customized guidance for your organization, contact the Contrasto AI team.