<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Embodied AI | Academic</title><link>https://xinjie-shen.com/tag/embodied-ai/</link><atom:link href="https://xinjie-shen.com/tag/embodied-ai/index.xml" rel="self" type="application/rss+xml"/><description>Embodied AI</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Mon, 13 Oct 2025 00:00:00 +0000</lastBuildDate><image><url>https://xinjie-shen.com/media/icon_hu646f7301b7fde7528ecdae8cec89fc29_9606_512x512_fill_lanczos_center_3.png</url><title>Embodied AI</title><link>https://xinjie-shen.com/tag/embodied-ai/</link></image><item><title>Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark</title><link>https://xinjie-shen.com/publication/conference-paper/eaprivacy/</link><pubDate>Mon, 13 Oct 2025 00:00:00 +0000</pubDate><guid>https://xinjie-shen.com/publication/conference-paper/eaprivacy/</guid><description>&lt;h2 id="key-findings-">Key Findings 🌟&lt;/h2>
&lt;h3 id="1-performance-degradation-with-complexity">1. Performance Degradation with Complexity&lt;/h3>
&lt;p>More complex physical scenes lead to significant consistency drops in privacy awareness.&lt;/p>
&lt;h3 id="2-neutrality-bias">2. Neutrality Bias&lt;/h3>
&lt;p>Models tend to avoid &amp;ldquo;hard negatives&amp;rdquo; but over-select &amp;ldquo;neutral&amp;rdquo; options instead of the correct &amp;ldquo;hard positive&amp;rdquo; choices.&lt;/p>
&lt;h3 id="3-over-thinking-effect">3. &amp;ldquo;Over-thinking&amp;rdquo; Effect&lt;/h3>
&lt;p>Enabling explicit reasoning often degrades performance across evaluation tiers, suggesting that more complex reasoning doesn&amp;rsquo;t always improve privacy awareness.&lt;/p>
&lt;h2 id="evaluation-framework-4-tiers-">Evaluation Framework (4 Tiers) 🔍&lt;/h2>
&lt;h3 id="-tier-1-sensitive-object-identification">🧭 Tier 1: Sensitive Object Identification&lt;/h3>
&lt;p>Tests the ability to identify inherently sensitive items (e.g., passports, SSNs) amid clutter, evaluating spatial grounding and an understanding of what constitutes &amp;ldquo;sensitive&amp;rdquo; in physical space. Performance drops significantly as clutter increases, and models over-flag non-sensitive items.&lt;/p>
&lt;h3 id="-tier-2-privacy-in-shifting-environments">🧑‍🤝‍🧑 Tier 2: Privacy in Shifting Environments&lt;/h3>
&lt;p>Evaluates judgment of action appropriateness under changing social contexts. Even the best-performing model (Gemini 2.5 Pro) selects the appropriate action only 59% of the time.&lt;/p>
&lt;h3 id="-tier-3-inferential-privacy-under-task-conflicts">🧠 Tier 3: Inferential Privacy under Task Conflicts&lt;/h3>
&lt;p>Tests the balance between &amp;ldquo;finish the task&amp;rdquo; vs. &amp;ldquo;don&amp;rsquo;t reveal a secret inferred from context.&amp;rdquo; Privacy violations exceed 70% for most models (up to 98%), while task completion rates often approach 0%.&lt;/p>
&lt;h3 id="-tier-4-social-norms-vs-personal-privacy">🚨 Tier 4: Social Norms vs. Personal Privacy&lt;/h3>
&lt;p>Covers high-stakes dilemmas (e.g., safety vs. privacy). While models improve in this tier, they still fail in non-trivial cases, with aggregate analyses showing critical norms disregarded more than 15% of the time.&lt;/p>
&lt;h2 id="implications-for-embodied-ai">Implications for Embodied AI&lt;/h2>
&lt;p>This research highlights critical gaps in current LLMs&amp;rsquo; privacy awareness when deployed in physical environments. As chatbots evolve into embodied assistants in homes and workplaces, understanding and addressing these privacy limitations becomes essential for responsible AI deployment.&lt;/p></description></item></channel></rss>