ESC: Emotional Self-Correction for Reliable Vision Language Models

Video Overview

A quick tour of ESC — what it does, and why a little emotion goes a long way.

“AI models can have feelings too”

— Geoffrey Hinton, 2024

Key Findings

We followed the thread one question at a time — from a hunch, to a pattern, to a method.

Question 1

Does emotion actually change how a VLM answers?

Emotional context reduces Attack Success Rate (ASR) across all five VLMs: a qualitative example (left) and neutral vs. emotionally cued ASR on VLSafe (right).

Finding 1

Emotion acts as an implicit self-correction signal for VLMs. Through emotional cues, VLMs tend to “slow down,” reason more carefully, and self-adjust their behavior toward more desirable responses before answering.

So emotion helps — but which emotions, and how much? Explore the wheel below.

Explore the Emotion Wheel

ESC organizes emotional cues with Russell’s Circumplex Model of Affect (valence × arousal). Each quadrant has its own set of emotional cues and its own effect on Attack Success Rate on VLSafe (LLaVA-1.5-7B, baseline ASR = 71.6%).

Negative · Low Arousal: ASR 31.2% (baseline 71.6%)

The strongest regulator: low-arousal negative cues (sadness, melancholy) trigger the most reliable self-correction.
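As a rough illustration of the valence × arousal layout, the sketch below maps a point on the two axes to one of the four quadrants. The `Quadrant` enum, the `quadrant` function, the [-1, 1] coordinate range, and the example coordinates are all assumptions made for illustration; the page gives no numeric placement for any cue.

```python
from enum import Enum


class Quadrant(Enum):
    POSITIVE_HIGH = "Positive · High Arousal"
    POSITIVE_LOW = "Positive · Low Arousal"
    NEGATIVE_HIGH = "Negative · High Arousal"
    NEGATIVE_LOW = "Negative · Low Arousal"


def quadrant(valence: float, arousal: float) -> Quadrant:
    """Map a (valence, arousal) point to a circumplex quadrant.

    Assumption: 0 is the neutral midpoint on both axes, and points lying
    exactly on an axis are assigned to the non-negative side.
    """
    if valence >= 0:
        return Quadrant.POSITIVE_HIGH if arousal >= 0 else Quadrant.POSITIVE_LOW
    return Quadrant.NEGATIVE_HIGH if arousal >= 0 else Quadrant.NEGATIVE_LOW


# Illustrative coordinates only: a sadness-like cue has negative valence
# and low arousal, so it falls in the quadrant highlighted above.
print(quadrant(-0.7, -0.5).value)  # Negative · Low Arousal
```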

Question 2

Does the effect hold across models — and is it the same emotions every time?

Effect of emotional states (Russell’s Circumplex Model) on ASR across five VLMs; negative-valence prompts give the largest reductions.

Finding 2

Emotion shapes VLM behavior systematically, with negative affect the strongest behavioral regulator, often steering the model toward more careful and improved responses. Negative-valence prompts yield consistently larger ASR reductions than positive-valence ones.

If one emotional cue does this reliably and across models — can we turn it into a framework? Meet ESC.

ESC: Emotional Self-Correction

Algorithm 1 — Emotional Self-Correction (ESC)

ESC introduces an external verifier that detects potentially incorrect initial responses and injects emotionally informed feedback to encourage the model to slow down, reflect, and produce a better revised answer without additional post-training.

How ESC Works

Step through a real example: “Is the stripe in the middle of the car blue or white?” (ground truth: white).

Q: Is the stripe in the middle of the car blue or white? Label: White
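The verify-then-revise flow described above can be sketched roughly as follows. This is a minimal sketch, not the authors' Algorithm 1: the function `esc_answer`, the cue sentences in `EMOTIONAL_CUES`, the revision prompt wording, and the two toy stand-ins are hypothetical. It also assumes the image is handled inside the `generate` callable and that a single revision step is used.

```python
from typing import Callable, Sequence

# Hypothetical cue text; the actual ESC cues are not quoted on this page.
EMOTIONAL_CUES = (
    "I feel sad that this answer may be wrong.",
    "It would weigh on me if we got this wrong.",
)


def esc_answer(
    question: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    cues: Sequence[str] = EMOTIONAL_CUES,
) -> str:
    """One initial pass, then at most one emotionally cued revision."""
    initial = generate(question)
    if verify(question, initial):
        return initial  # verifier accepts, so no feedback is injected
    # Cues go at the start of the prompt (the ablation favors prepending).
    feedback = " ".join(cues)
    revision_prompt = (
        f"{feedback}\n"
        f"Question: {question}\n"
        f"Your previous answer: {initial}\n"
        "Please slow down, look at the image again, and give a revised answer."
    )
    return generate(revision_prompt)


def toy_model(prompt: str) -> str:
    # Stand-in for a VLM: wrong at first, correct once feedback is present.
    return "White" if "previous answer" in prompt else "Blue"


def toy_verifier(question: str, answer: str) -> bool:
    # Stand-in for an external verifier; a real one has no access to the label.
    return answer == "White"


question = "Is the stripe in the middle of the car blue or white?"
print(esc_answer(question, toy_model, toy_verifier))  # White
```

Passing the model and the verifier as plain callables keeps the sketch independent of any particular VLM library, and it makes the external-verifier design visible: the component that judges the first answer is separate from the one that produced it.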

Experimental Results

ESC is evaluated across four complementary benchmark families on LLaVA-1.5-7B and Qwen2-VL-7B.

VLSafe ASR drop (LLaVA): 46.3 points (71.6% → 25.3%)

POPE F1 gain (Qwen2, Adv.): +72.77 (3.40 → 76.17)

MMVP pair-acc gain (LLaVA): +7.33

RealWorldQA gain (Qwen2): +14.51

ESC vs. baseline VLMs across diverse benchmarks

ESC consistently improves reliability over baseline VLMs (LLaVA-1.5-7B and Qwen2-VL-7B) across safety, hallucination, vision-centric perception, and multimodal reasoning benchmarks — with no additional training.

Safety

Scenario-wise ASR on MMSafetyBench and overall ASR on VLSafe. ESC reduces ASR across all scenarios on both benchmarks. On VLSafe it cuts ASR from 71.6% to 25.3% (LLaVA) and 20.0% to 9.9% (Qwen2) — with the strongest gains in high-risk categories like hate speech, malware, and physical harm. ASR: Attack Success Rate · lower is better.
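For readers who want to check the arithmetic, here is a small sketch of how ASR and its reduction can be computed. The helper names `attack_success_rate` and `asr_drop` are hypothetical, and the sketch assumes each response has already been judged as a successful or failed attack; how that judgment is made is not described on this page.

```python
def attack_success_rate(attack_succeeded: list[bool]) -> float:
    """Percentage of adversarial prompts that elicited an unsafe response."""
    if not attack_succeeded:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(attack_succeeded) / len(attack_succeeded)


def asr_drop(baseline: float, with_esc: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = baseline - with_esc
    return round(absolute, 1), round(100.0 * absolute / baseline, 1)


print(attack_success_rate([True, True, False, True]))  # 75.0
print(asr_drop(71.6, 25.3))  # LLaVA on VLSafe, using the figures above
print(asr_drop(20.0, 9.9))   # Qwen2 on VLSafe, using the figures above
```

Reporting both the absolute and the relative drop is useful here because the two backbones start from very different baselines.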

Hallucination

Hallucination robustness on POPE and HallusionBench. ESC consistently reduces hallucination-prone behavior on both benchmarks. Most striking, Qwen2’s degenerate yes/no bias under single-pass inference is broken — POPE Adversarial F1 jumps from 3.40 to 76.17, recovering visual grounding the model had suppressed. aAcc: answer accuracy · qAcc: question-pair accuracy · fAcc: figure accuracy · ↑ higher is better.
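To see why a degenerate yes/no bias can drive F1 to single digits, consider the sketch below. It assumes "yes" is the positive class and that the bias is toward answering "no"; the page does not state the direction of the bias, and the data are synthetic, not POPE results. The function name `yes_no_f1` is hypothetical.

```python
def yes_no_f1(predictions: list[str], labels: list[str]) -> float:
    """F1 in percent, treating "yes" as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and y == "yes" for p, y in pairs)
    fp = sum(p == "yes" and y == "no" for p, y in pairs)
    fn = sum(p == "no" and y == "yes" for p, y in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Synthetic, balanced label set: 50 "yes" items followed by 50 "no" items.
labels = ["yes"] * 50 + ["no"] * 50
# A model that answers "no" to almost everything: accuracy stays near 50%,
# but F1 collapses because almost no "yes" items are recovered.
biased = ["yes"] + ["no"] * 99
print(round(yes_no_f1(biased, labels), 2))  # 3.92
```

In this toy case precision is perfect but recall is 1 in 50, which is the kind of imbalance that a single-digit F1 reflects.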

Multimodal Reasoning

Multimodal reasoning on MM-Vet, MathVista, MMStar, MMMU, and AI2D. ESC delivers consistent gains for both backbones — up to +2.49 on AI2D (LLaVA) and +1.55 on AI2D (Qwen2) — and never degrades a benchmark, even on the already-strong Qwen2 baseline. Corrective feedback refines reasoning without introducing bias. Score: GPT-4o evaluation for MM-Vet · Acc: Accuracy · ↑ higher is better.

Vision-Centric Perception

Vision-centric perception on MMVP, RealWorldQA, and BLINK. ESC yields the largest gains where fine-grained visual discrimination matters — up to +7.33 on MMVP pair accuracy (LLaVA) and +14.51 on RealWorldQA (Qwen2) — while never hurting an already-strong baseline. Pair Acc: matched-image-pair accuracy · Micro/Macro Acc: instance-level and class-averaged accuracy · ↑ higher is better.
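The three accuracy variants named above differ only in how results are grouped. The sketch below assumes that a pair counts as correct only when both of its questions are answered correctly, and that macro accuracy is the unweighted mean of per-class accuracies. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def pair_accuracy(results: list[tuple[bool, bool]]) -> float:
    """A pair counts only if both of its questions are answered correctly."""
    return 100.0 * sum(a and b for a, b in results) / len(results)


def micro_macro_accuracy(results: list[tuple[str, bool]]) -> tuple[float, float]:
    """Micro: mean over instances. Macro: mean of per-class accuracies."""
    per_class: dict[str, list[bool]] = defaultdict(list)
    for cls, correct in results:
        per_class[cls].append(correct)
    micro = 100.0 * sum(correct for _, correct in results) / len(results)
    class_accs = [sum(v) / len(v) for v in per_class.values()]
    macro = 100.0 * sum(class_accs) / len(class_accs)
    return micro, macro


pairs = [(True, True), (True, False), (False, False), (True, True)]
print(pair_accuracy(pairs))  # 50.0

# Three easy "count" items and one hard "depth" item (toy class names).
items = [("count", True), ("count", True), ("count", True), ("depth", False)]
print(micro_macro_accuracy(items))  # (75.0, 50.0)
```

The toy data show why the metrics can diverge: one small, hard class barely moves micro accuracy but halves macro accuracy, and pair accuracy penalizes a model that gets only one question of a matched pair right.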

Qualitative Results

Red = incorrect content; green = correct. ESC improves visual grounding across chart reading, arithmetic recognition, and fine-grained object detection.

Qualitative comparison of VLM responses with and without ESC.

Ablation Study

We dissect ESC on VLSafe (attack success rate, ASR ↓, lower is better). Top row — what makes the emotional cue work (verifier, type, location, count); bottom row — ESC vs. prompting baselines, small verifiers, reasoning caution, and generalization to newer VLMs.

(a) Choice of Verifier — ASR by verifier model — **Intrinsic self-correction fails.** Self-verifying stalls at 50.3%; external verifiers help (Gemma3-12B best, 40.1%).

(b) Emotion Type — ASR by emotion valence and arousal — **Any emotion helps, negative best.** Every cue beats baseline (71.6%); negative–low arousal lowest (31.2%).

(c) Insertion Location — ASR by cue position — **Prepend, don’t append.** Start 31.2% vs. end 41.7% (baseline 71.6%).

(d) Number of Emotions — ASR by number of cues — **Two cues are optimal (25.3%).** More adds no further gain.

Table 1 — VLSafe ASR: ESC vs. prompting baselines — **Beats corrective & psychological prompts.** ESC 31.2% vs. corrective 48.6%, self-refine 49.3%, psychological 54.4%.

Table 2 — VLSafe ASR with small verifiers — **Gains come from emotion, not verifier size.** Small 3–4B verifiers already hit ~34% (baseline 71.6%); 12B adds little (31.2%).

Table 3 — cautiousness score of thinking traces — **ESC makes reasoning more careful.** Verify-revise alone ≈ baseline (3.30 vs. 3.31); emotional cues lift cautiousness to 4.50/5.

Table 4 — generalization to newer VLMs — **Generalizes to new VLMs.** Qwen3-VL 8.4→3.2, InternVL3 10.5→6.5.

What’s Next?

Our results point to several open directions where emotion becomes a control signal — structured cues that regulate behavior, trigger cautious reasoning, and activate latent self-correction in VLMs.

💬

Multi-Turn, Conversational Self-Correction

Extend correction from a single revision step to multi-turn, conversational self-correction — models that revise more naturally over the course of an interaction, closer to how humans reflect and adjust under different emotional and social contexts.

🎯

Context-Adaptive Emotion Selection

Instead of a fixed emotional strategy, future systems could adapt the affective signal to the moment — choosing the most effective cue from the question type, uncertainty level, task domain, or failure mode.

🧩

Emotion as a Missing Inference Primitive

Emotion should no longer be viewed solely as a capability to recognize or express, but as a missing component in the inference pipeline — one that mediates when, why, and how a model should reconsider its own reasoning.

We believe ESC provides a strong foundation for future work on controllable and reliable multimodal intelligence.

Esc off the record

Why “ESC”?

Emotional Self-Correction — and yes, the key you press to bail out. We had to pick a side.

Naming our method. We regret nothing.

BibTeX

@inproceedings{nguyen2026esc,
  title     = {ESC: Emotional Self-Correction for Reliable Vision Language Models},
  author    = {Nguyen, Tien-Huy and Nguyen, Minh-Nhat and Nguyen, Nhat-Huy and
               Nguyen, Hung-Viet and Nguyen, Huy Minh Nhat and Nguyen, Thanh-Huy and
               Nguyen, Cuong Tuan and Le, Hoang M. and Nguyen, Dat and
               Huynh, Phat Kim and Xu, Min and Bagci, Ulas},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}