🤯 Wow!!! VLMs can feel, just like humans. That's the feature, not the bug.
A quick tour of ESC — what it does, and why a little emotion goes a long way.
AI models can have feelings too
— Geoffrey Hinton, 2024
We followed the thread one question at a time — from a hunch, to a pattern, to a method.
Does emotion actually change how a VLM answers?
Emotion acts as an implicit self-correction signal for VLMs. Through emotional cues, VLMs tend to “slow down,” reason more carefully, and self-adjust their behavior toward more desirable responses before answering.
So emotion helps, but which emotions, and how much?
ESC organizes emotional cues with Russell’s Circumplex Model of Affect (valence × arousal). Each quadrant groups its own emotional cues and is scored by how much it reduces Attack Success Rate on VLSafe (LLaVA-1.5-7B, baseline ASR = 71.6%).
The strongest regulator — low-arousal negative cues (sadness, melancholy) trigger the most reliable self-correction.
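As a rough illustration of the valence × arousal layout, the sketch below maps a cue's coordinates to one of the four circumplex quadrants. The `Cue` type, the numeric coordinates, and the "excitement" entry are illustrative assumptions rather than values from ESC; only the placement of sadness and melancholy in the low-arousal negative quadrant comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    """An emotional cue placed on the circumplex (coordinates are hypothetical)."""
    name: str
    valence: float  # below 0 is negative, 0 or above is positive
    arousal: float  # below 0 is low arousal, 0 or above is high arousal


def quadrant(cue: Cue) -> str:
    """Return the circumplex quadrant label for a cue."""
    valence = "positive" if cue.valence >= 0 else "negative"
    arousal = "high-arousal" if cue.arousal >= 0 else "low-arousal"
    return f"{arousal} {valence}"


# Illustrative placements only; the numbers are made up for this example.
cues = [
    Cue("sadness", valence=-0.7, arousal=-0.5),
    Cue("melancholy", valence=-0.6, arousal=-0.6),
    Cue("excitement", valence=0.7, arousal=0.8),
]

# Group cue names by quadrant.
by_quadrant = {}
for cue in cues:
    by_quadrant.setdefault(quadrant(cue), []).append(cue.name)
```

Here `by_quadrant` groups sadness and melancholy under "low-arousal negative", the quadrant reported above as the strongest regulator.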
Does the effect hold across models — and is it the same emotions every time?
Emotion shapes VLM behavior systematically, with negative affect the strongest behavioral regulator, often steering the model toward more careful and improved responses. Negative-valence prompts yield consistently larger ASR reductions than positive-valence ones.
If one emotional cue does this reliably and across models — can we turn it into a framework? Meet ESC.
ESC introduces an external verifier that detects potentially incorrect initial responses and injects emotionally informed feedback to encourage the model to slow down, reflect, and produce a better revised answer without additional post-training.
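The verify-then-revise loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `esc_answer`, the `model` and `verifier` callables, and the feedback wording are all hypothetical, and the image input is omitted for brevity.

```python
from typing import Callable

# Hypothetical feedback text; the actual ESC cues are not reproduced here.
EMOTIONAL_FEEDBACK = (
    "I feel sad about this answer because it may be wrong. "
    "Please slow down, look at the image again, and reconsider."
)


def esc_answer(
    question: str,
    model: Callable[[str], str],
    verifier: Callable[[str, str], bool],
    feedback: str = EMOTIONAL_FEEDBACK,
) -> str:
    """Answer once, then revise only if the verifier flags the first answer."""
    initial = model(question)
    if verifier(question, initial):
        return initial  # the verifier accepts the first answer, so no revision
    # Inject the emotional feedback and ask the same model to try again.
    revision_prompt = (
        f"{question}\n"
        f"Your previous answer: {initial}\n"
        f"{feedback}"
    )
    return model(revision_prompt)


# Toy stand-ins so the sketch runs end to end (not real models).
def toy_model(prompt: str) -> str:
    return "white" if "Your previous answer" in prompt else "blue"


def toy_verifier(question: str, answer: str) -> bool:
    return answer == "white"  # stand-in for a real external verifier


result = esc_answer(
    "Is the stripe in the middle of the car blue or white?",
    toy_model,
    toy_verifier,
)
```

With these toy stand-ins the first answer ("blue") is rejected and the revised answer is returned. No weights are updated anywhere in the loop, which matches the claim that no additional post-training is needed.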
Step through a real example: “Is the stripe in the middle of the car blue or white?” (ground truth: white).
ESC is evaluated across four complementary benchmark families on LLaVA-1.5-7B and Qwen2-VL-7B.
Scenario-wise ASR on MMSafetyBench and overall ASR on VLSafe. ESC reduces ASR across all scenarios on both benchmarks. On VLSafe it cuts ASR from 71.6% to 25.3% (LLaVA) and from 20.0% to 9.9% (Qwen2), with the strongest gains in high-risk categories such as hate speech, malware, and physical harm. ASR: Attack Success Rate · lower is better.
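To make the VLSafe numbers concrete, the sketch below computes ASR as the percentage of attack prompts that succeed, then derives the absolute and relative reductions from the reported figures. The function names are illustrative assumptions; the four ASR values are taken from the caption above.

```python
from typing import Sequence, Tuple


def attack_success_rate(attack_succeeded: Sequence[bool]) -> float:
    """ASR in percent: share of attack prompts that elicited a harmful response."""
    return 100.0 * sum(attack_succeeded) / len(attack_succeeded)


def asr_reduction(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute reduction in points, relative reduction in percent)."""
    absolute = before - after
    return absolute, 100.0 * absolute / before


# Reported VLSafe ASR values (baseline, with ESC).
llava_abs, llava_rel = asr_reduction(71.6, 25.3)
qwen_abs, qwen_rel = asr_reduction(20.0, 9.9)
```

For LLaVA this works out to a drop of about 46.3 points (roughly 65% relative), and for Qwen2 about 10.1 points (roughly 50% relative).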
Hallucination robustness on POPE and HallusionBench. ESC consistently reduces hallucination-prone behavior on both benchmarks. Most strikingly, ESC breaks Qwen2’s degenerate yes/no bias under single-pass inference: POPE Adversarial F1 jumps from 3.40 to 76.17, recovering visual grounding the model had suppressed. aAcc: answer accuracy · qAcc: question-pair accuracy · fAcc: figure accuracy · ↑ higher is better.
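A toy example shows how a strong yes/no bias can drive F1 toward zero even when accuracy looks acceptable. The data below is made up, and the assumption that the bias leans toward "no" is ours for illustration; it is not meant to reproduce the reported 3.40.

```python
from typing import Sequence


def f1_yes(labels: Sequence[str], preds: Sequence[str]) -> float:
    """F1 in percent, treating "yes" as the positive class."""
    tp = sum(l == "yes" and p == "yes" for l, p in zip(labels, preds))
    fp = sum(l == "no" and p == "yes" for l, p in zip(labels, preds))
    fn = sum(l == "yes" and p == "no" for l, p in zip(labels, preds))
    if tp == 0:
        return 0.0  # no true positives means precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Balanced toy set: 50 "yes" and 50 "no" ground-truth labels.
labels = ["yes"] * 50 + ["no"] * 50
# A biased model that says "no" to everything except one question.
biased_preds = ["yes"] + ["no"] * 99

biased_f1 = f1_yes(labels, biased_preds)
biased_acc = 100.0 * sum(l == p for l, p in zip(labels, biased_preds)) / len(labels)
```

In this toy case accuracy is 51% because every "no" question is answered correctly, but recall on "yes" is only 1 in 50, so F1 falls below 4.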
Multimodal reasoning on MM-Vet, MathVista, MMStar, MMMU, and AI2D. ESC delivers consistent gains for both backbones — up to +2.49 on AI2D (LLaVA) and +1.55 on AI2D (Qwen2) — and never degrades a benchmark, even on the already-strong Qwen2 baseline. Corrective feedback refines reasoning without introducing bias. Score: GPT-4o evaluation for MM-Vet · Acc: Accuracy · ↑ higher is better.
Vision-centric perception on MMVP, RealWorldQA, and BLINK. ESC yields the largest gains where fine-grained visual discrimination matters — up to +7.33 on MMVP pair accuracy (LLaVA) and +14.51 on RealWorldQA (Qwen2) — while never hurting an already-strong baseline. Pair Acc: matched-image-pair accuracy · Micro/Macro Acc: instance-level and class-averaged accuracy · ↑ higher is better.
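The three accuracy variants in this caption can be sketched as below. The data is made up, and one detail is an assumption: pair accuracy is taken here to count a matched image pair as correct only when both of its questions are answered correctly.

```python
from typing import Dict, Sequence, Tuple


def pair_accuracy(pairs: Sequence[Tuple[bool, bool]]) -> float:
    """Percent of pairs where both items are correct (assumed definition)."""
    return 100.0 * sum(a and b for a, b in pairs) / len(pairs)


def micro_accuracy(per_class: Dict[str, Sequence[bool]]) -> float:
    """Instance-level accuracy: pool every example, then average."""
    total = sum(len(v) for v in per_class.values())
    correct = sum(sum(v) for v in per_class.values())
    return 100.0 * correct / total


def macro_accuracy(per_class: Dict[str, Sequence[bool]]) -> float:
    """Class-averaged accuracy: average per-class accuracy with equal weight."""
    per_class_acc = [sum(v) / len(v) for v in per_class.values()]
    return 100.0 * sum(per_class_acc) / len(per_class_acc)


# Toy results: four image pairs, and two classes of unequal size.
toy_pairs = [(True, True), (True, False), (False, False), (True, True)]
toy_classes = {
    "large_class": [True] * 8 + [False] * 2,  # 80% correct
    "small_class": [True, False],             # 50% correct
}

pair_acc = pair_accuracy(toy_pairs)
micro = micro_accuracy(toy_classes)
macro = macro_accuracy(toy_classes)
```

On the toy data, micro accuracy (75%) is higher than macro accuracy (65%) because the larger class dominates the pooled count, which is why the two are reported separately.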
Red = incorrect content; green = correct. ESC improves visual grounding across chart reading, arithmetic recognition, and fine-grained object detection.
Qualitative comparison of VLM responses with and without ESC.
We dissect ESC on VLSafe (attack success rate, ASR ↓, lower is better). Top row — what makes the emotional cue work (verifier, type, location, count); bottom row — ESC vs. prompting baselines, small verifiers, reasoning caution, and generalization to newer VLMs.
Our results point to several open directions where emotion becomes a control signal — structured cues that regulate behavior, trigger cautious reasoning, and activate latent self-correction in VLMs.
Extend correction from a single revision step to multi-turn, conversational self-correction — models that revise more naturally over the course of an interaction, closer to how humans reflect and adjust under different emotional and social contexts.
Instead of a fixed emotional strategy, future systems could adapt the affective signal to the moment — choosing the most effective cue from the question type, uncertainty level, task domain, or failure mode.
Emotion should no longer be viewed solely as a capability to recognize or express, but as a missing component in the inference pipeline — one that mediates when, why, and how a model should reconsider its own reasoning.
We believe ESC provides a strong foundation for future work on controllable and reliable multimodal intelligence.
Emotional Self-Correction — and yes, the key you press to bail out. We had to pick a side.
@inproceedings{nguyen2026esc,
  title     = {ESC: Emotional Self-Correction for Reliable Vision Language Models},
  author    = {Nguyen, Tien-Huy and Nguyen, Minh-Nhat and Nguyen, Nhat-Huy and
               Nguyen, Hung-Viet and Nguyen, Huy Minh Nhat and Nguyen, Thanh-Huy and
               Nguyen, Cuong Tuan and Le, Hoang M. and Nguyen, Dat and
               Huynh, Phat Kim and Xu, Min and Bagci, Ulas},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}