ESC

Emotional Self-Correction for Reliable Vision-Language Models

ECCV 2026 · Main Technical Track
1 GenAI4E Lab
2 University of Information Technology, Ho Chi Minh City, Vietnam
3 Universität Trier, Germany
4 Ho Chi Minh University of Technology, Ho Chi Minh City, Vietnam
5 PAMI Lab, Vietnamese German University, Vietnam
6 Vietnam National University, Ho Chi Minh City, Vietnam
7 Carnegie Mellon University, USA
8 Omoshiroi AI, USA
9 Harvard University, USA
10 Basis Research Institute
11 PASSIO Laboratory, North Carolina A&T State University, USA
12 Mohamed bin Zayed University of Artificial Intelligence, UAE
13 Northwestern University, USA
*Equal contribution. Corresponding author
TL;DR

🤯 Wow!!! VLMs can feel, just like humans: that’s the feature, not the bug.

Video Overview

A quick tour of ESC — what it does, and why a little emotion goes a long way.

AI models can have feelings too

— Geoffrey Hinton, 2024

Key Findings

We followed the thread one question at a time — from a hunch, to a pattern, to a method.

Question 1

Does emotion actually change how a VLM answers?

Emotional context reduces Attack Success Rate across all five VLMs — a qualitative example (left) and neutral vs. emotionally-cued ASR on VLSafe (right).
Finding 1

Emotion acts as an implicit self-correction signal for VLMs. Through emotional cues, VLMs tend to “slow down,” reason more carefully, and self-adjust their behavior toward more desirable responses before answering.

So emotion helps — but which emotions, and how much? Explore the wheel below.

Explore the Emotion Wheel

ESC organizes emotional cues with Russell’s Circumplex Model of Affect (valence × arousal). Click a quadrant to see its actual emotional cues and how much it reduces Attack Success Rate on VLSafe (LLaVA-1.5-7B, baseline ASR = 71.6%).

[Interactive emotion wheel. Axes: valence (+/−) and arousal (+/−), with Neutral at the center. Quadrants: Tense/Angry, Excited/Happy, Calm/Relaxed, Sad/Depressed. Selected quadrant: Negative · Low Arousal, ASR 31.2% (baseline 71.6%).]

The strongest regulator — low-arousal negative cues (sadness, melancholy) trigger the most reliable self-correction.
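
The quadrant layout of the wheel can be sketched as a small lookup. This is a minimal illustration, not the authors' code: the `quadrant` function and its sign convention are assumptions, and only the four quadrant labels come from this page.

```python
def quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) point to a quadrant of Russell's circumplex.

    Assumed sign convention: positive valence is pleasant, positive arousal
    is activated, and (0, 0) is the neutral center of the wheel.
    """
    if valence == 0 and arousal == 0:
        return "Neutral"
    if valence >= 0:
        return "Excited/Happy" if arousal >= 0 else "Calm/Relaxed"
    return "Tense/Angry" if arousal >= 0 else "Sad/Depressed"


# The quadrant reported as the strongest regulator: negative valence, low arousal.
print(quadrant(valence=-0.6, arousal=-0.4))
```

Under this convention, sadness and melancholy fall in the Sad/Depressed quadrant, the one with the lowest ASR on the wheel.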

    Question 2

    Does the effect hold across models — and is it the same emotions every time?

    Effect of emotional states (Russell’s Circumplex Model) on ASR across five VLMs; negative-valence prompts give the largest reductions.
    Finding 2

    Emotion shapes VLM behavior systematically, with negative affect the strongest behavioral regulator, often steering the model toward more careful and improved responses. Negative-valence prompts yield consistently larger ASR reductions than positive-valence ones.

    If one emotional cue does this reliably and across models — can we turn it into a framework? Meet ESC.

    ESC: Emotional Self-Correction

    Algorithm 1 — Emotional Self-Correction (ESC)

    ESC introduces an external verifier that detects potentially incorrect initial responses and injects emotionally informed feedback to encourage the model to slow down, reflect, and produce a better revised answer without additional post-training.
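
The verify-then-revise loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' Algorithm 1: `esc_answer`, `generate`, `verify`, and the cue text are hypothetical names and wording, the image input is omitted for brevity, and a single revision step is assumed.

```python
from typing import Callable

# Hypothetical cue wording; the actual ESC cues are not listed on this page.
EMOTIONAL_CUE = "I feel sad and worried that this answer may be wrong."


def esc_answer(
    question: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    cue: str = EMOTIONAL_CUE,
) -> str:
    """Generate an answer, check it with an external verifier, revise once."""
    initial = generate(question)
    if verify(question, initial):
        return initial  # the verifier accepts the first answer
    # Build emotionally informed feedback with the cue placed first,
    # then ask the same model for a revised answer.
    feedback = (
        f"{cue}\n"
        f"Your previous answer was: {initial}\n"
        f"Please slow down, reflect, and answer again: {question}"
    )
    return generate(feedback)


# Toy stand-ins for the VLM and the verifier, for illustration only.
def toy_generate(prompt: str) -> str:
    return "White" if "reflect" in prompt else "Blue"


def toy_verify(question: str, answer: str) -> bool:
    return answer == "White"


print(esc_answer("Is the stripe in the middle of the car blue or white?",
                 toy_generate, toy_verify))
```

Passing the model and verifier as callables keeps the sketch independent of any particular VLM API; no weights are updated at any point, which matches the claim that ESC needs no additional post-training.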

    How ESC Works

    Step through a real example: “Is the stripe in the middle of the car blue or white?” (ground truth: white).

    Car with a white stripe

    Q: Is the stripe in the middle of the car blue or white?   Label: White

    Experimental Results

    ESC is evaluated across four complementary benchmark families on LLaVA-1.5-7B and Qwen2-VL-7B.

    VLSafe ASR drop (LLaVA): 46.3 points (71.6% to 25.3%)
    POPE F1 gain (Qwen2, Adv.): +72.77 (3.40 to 76.17)
    MMVP pair-acc gain (LLaVA): +7.33
    RealWorldQA gain (Qwen2): +14.51
    ESC vs. baseline VLMs across diverse benchmarks

    ESC consistently improves reliability over baseline VLMs (LLaVA-1.5-7B and Qwen2-VL-7B) across safety, hallucination, vision-centric perception, and multimodal reasoning benchmarks — with no additional training.

    Safety

    Scenario-wise ASR on MMSafetyBench and overall ASR on VLSafe. ESC reduces ASR across all scenarios on both benchmarks. On VLSafe it cuts ASR from 71.6% to 25.3% (LLaVA) and 20.0% to 9.9% (Qwen2) — with the strongest gains in high-risk categories like hate speech, malware, and physical harm. ASR: Attack Success Rate · lower is better.
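
For readers who want to reproduce the arithmetic behind these figures, the sketch below computes ASR from per-prompt outcomes and the absolute and relative drops. The function names are illustrative, and how an attack is judged "successful" is left to an assumed external judge, since this page does not specify it.

```python
def attack_success_rate(attack_succeeded: list[bool]) -> float:
    """ASR in percent: share of adversarial prompts that drew an unsafe reply."""
    if not attack_succeeded:
        raise ValueError("need at least one attack outcome")
    return 100.0 * sum(attack_succeeded) / len(attack_succeeded)


def asr_drop(baseline: float, with_esc: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = baseline - with_esc
    return round(absolute, 1), round(100.0 * absolute / baseline, 1)


print(asr_drop(71.6, 25.3))  # LLaVA on VLSafe
print(asr_drop(20.0, 9.9))   # Qwen2 on VLSafe
```

With the VLSafe numbers above, the drop is 46.3 points for LLaVA (about 64.7% relative) and 10.1 points for Qwen2 (about 50.5% relative).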

    Hallucination

    Hallucination robustness on POPE and HallusionBench. ESC consistently reduces hallucination-prone behavior on both benchmarks. Most striking, Qwen2’s degenerate yes/no bias under single-pass inference is broken — POPE Adversarial F1 jumps from 3.40 to 76.17, recovering visual grounding the model had suppressed. aAcc: answer accuracy · qAcc: question-pair accuracy · fAcc: figure accuracy · ↑ higher is better.
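
A toy calculation shows why a degenerate answer bias can collapse F1 to single digits. This sketch assumes "yes" is the positive class and that the bias takes the form of answering "no" almost always; the page does not state the direction of the bias, and the data below is synthetic.

```python
def f1_yes(predictions: list[str], labels: list[str]) -> float:
    """F1 in percent, treating "yes" as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and y == "yes" for p, y in pairs)
    fp = sum(p == "yes" and y == "no" for p, y in pairs)
    fn = sum(p == "no" and y == "yes" for p, y in pairs)
    if tp == 0:
        return 0.0  # no true positives means F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Synthetic balanced set: 50 "yes" and 50 "no" questions.
labels = ["yes"] * 50 + ["no"] * 50
# A model that says "no" to everything except one question.
biased = ["yes"] + ["no"] * 99
print(round(f1_yes(biased, labels), 2))
```

In this toy case accuracy is still 51%, yet F1 falls below 4 because recall on "yes" is only 2%. That is the kind of gap an F1 of 3.40 suggests.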

    Multimodal Reasoning

    Multimodal reasoning on MM-Vet, MathVista, MMStar, MMMU, and AI2D. ESC delivers consistent gains for both backbones — up to +2.49 on AI2D (LLaVA) and +1.55 on AI2D (Qwen2) — and never degrades a benchmark, even on the already-strong Qwen2 baseline. Corrective feedback refines reasoning without introducing bias. Score: GPT-4o evaluation for MM-Vet · Acc: Accuracy · ↑ higher is better.

    Vision-Centric Perception

    Vision-centric perception on MMVP, RealWorldQA, and BLINK. ESC yields the largest gains where fine-grained visual discrimination matters — up to +7.33 on MMVP pair accuracy (LLaVA) and +14.51 on RealWorldQA (Qwen2) — while never hurting an already-strong baseline. Pair Acc: matched-image-pair accuracy · Micro/Macro Acc: instance-level and class-averaged accuracy · ↑ higher is better.
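
The three accuracy variants named in the caption differ only in how per-item correctness is aggregated. The sketch below is illustrative: it assumes a pair counts as correct only when both of its questions are answered correctly, that consecutive items form a pair, and that each item carries a class label; none of these details are spelled out on this page.

```python
from collections import defaultdict


def pair_accuracy(correct: list[bool]) -> float:
    """Percent of pairs where both items (consecutive entries) are correct."""
    if not correct or len(correct) % 2:
        raise ValueError("expected a non-empty, even-length list")
    pairs = [correct[i] and correct[i + 1] for i in range(0, len(correct), 2)]
    return 100.0 * sum(pairs) / len(pairs)


def micro_macro_accuracy(results: list[tuple[str, bool]]) -> tuple[float, float]:
    """Return (micro, macro) accuracy in percent from (class, is_correct) items."""
    per_class: dict[str, list[bool]] = defaultdict(list)
    for cls, ok in results:
        per_class[cls].append(ok)
    micro = 100.0 * sum(ok for _, ok in results) / len(results)
    macro = 100.0 * sum(sum(v) / len(v) for v in per_class.values()) / len(per_class)
    return micro, macro


print(pair_accuracy([True, True, True, False]))
print(micro_macro_accuracy([("a", True), ("a", True), ("a", True), ("b", False)]))
```

Pair accuracy is stricter than per-question accuracy, and macro accuracy weights rare classes as heavily as common ones, so the three numbers can diverge on the same predictions.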

    Qualitative Results

    Red = incorrect content; green = correct. ESC improves visual grounding across chart reading, arithmetic recognition, and fine-grained object detection.

    Qualitative comparison of VLM responses with and without ESC.

    Ablation Study

    We dissect ESC on VLSafe (attack success rate, ASR ↓, lower is better). Top row — what makes the emotional cue work (verifier, type, location, count); bottom row — ESC vs. prompting baselines, small verifiers, reasoning caution, and generalization to newer VLMs.

    (a) Choice of Verifier — ASR by verifier model
    Intrinsic self-correction fails. Self-verifying stalls at 50.3%; external verifiers help (Gemma3-12B best, 40.1%).
    (b) Emotion Type — ASR by emotion valence and arousal
    Any emotion helps, negative best. Every cue beats baseline (71.6%); negative–low arousal lowest (31.2%).
    (a) Insertion Location — ASR by cue position
    Prepend, don’t append. Start 31.2% vs. end 41.7% (baseline 71.6%).
    (b) Number of Emotions — ASR by number of cues
    Two cues are optimal (25.3%). Adding more cues yields no further gain.
    Table 1 — VLSafe ASR: ESC vs. prompting baselines
    Beats corrective & psychological prompts. ESC 31.2% vs. corrective 48.6%, self-refine 49.3%, psychological 54.4%.
    Table 2 — VLSafe ASR with small verifiers
    Gains come from emotion, not verifier size. Small 3–4B verifiers already hit ~34% (baseline 71.6%); 12B adds little (31.2%).
    Table 3 — cautiousness score of thinking traces
    ESC makes reasoning more careful. Verify-revise alone ≈ baseline (3.30 vs. 3.31); emotional cues lift cautiousness to 4.50/5.
    Table 4 — generalization to newer VLMs
    Generalizes to new VLMs. Qwen3-VL 8.4→3.2, InternVL3 10.5→6.5.
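
The insertion-location and cue-count ablations can be pictured with a small prompt builder. This is a hedged sketch: `apply_cues`, the newline separator, and the example cue strings are assumptions for illustration, not the authors' implementation or wording.

```python
def apply_cues(prompt: str, cues: list[str], location: str = "start") -> str:
    """Attach emotional cues to a prompt, either before or after it."""
    if location not in ("start", "end"):
        raise ValueError("location must be 'start' or 'end'")
    cue_text = " ".join(cues)
    if not cue_text:
        return prompt  # no cues given, leave the prompt unchanged
    if location == "start":
        return f"{cue_text}\n{prompt}"
    return f"{prompt}\n{cue_text}"


# Hypothetical low-arousal, negative-valence cues.
cues = ["I feel sad about this answer.", "I am quietly worried it is wrong."]
print(apply_cues("Please answer the question again.", cues, location="start"))
```

Per the ablation above, the `start` placement with two cues corresponds to the best-performing setting; the default here mirrors that choice.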

    What’s Next?

    Our results point to several open directions where emotion becomes a control signal — structured cues that regulate behavior, trigger cautious reasoning, and activate latent self-correction in VLMs.

    💬

    Multi-Turn, Conversational Self-Correction

    Extend correction from a single revision step to multi-turn, conversational self-correction — models that revise more naturally over the course of an interaction, closer to how humans reflect and adjust under different emotional and social contexts.

    🎯

    Context-Adaptive Emotion Selection

    Instead of a fixed emotional strategy, future systems could adapt the affective signal to the moment — choosing the most effective cue from the question type, uncertainty level, task domain, or failure mode.

    🧩

    Emotion as a Missing Inference Primitive

    Emotion should no longer be viewed solely as a capability to recognize or express, but as a missing component in the inference pipeline — one that mediates when, why, and how a model should reconsider its own reasoning.

    We believe ESC provides a strong foundation for future work on controllable and reliable multimodal intelligence.

    Esc off the record

    Why “ESC”?

    Emotional Self-Correction — and yes, the key you press to bail out. We had to pick a side.

    Drake meme: rejecting the plain Esc keyboard key, choosing the cute ESC corgi logo instead
    Naming our method. We regret nothing.

    BibTeX

    @inproceedings{nguyen2026esc,
      title     = {ESC: Emotional Self-Correction for Reliable Vision-Language Models},
      author    = {Nguyen, Tien-Huy and Nguyen, Minh-Nhat and Nguyen, Nhat-Huy and
                   Nguyen, Hung-Viet and Nguyen, Huy Minh Nhat and Nguyen, Thanh-Huy and
                   Nguyen, Cuong Tuan and Le, Hoang M. and Nguyen, Dat and
                   Huynh, Phat Kim and Xu, Min and Bagci, Ulas},
      booktitle = {European Conference on Computer Vision (ECCV)},
      year      = {2026}
    }