Physical-World Privacy Evaluation Framework

As large language models (LLMs) advance toward embodied assistants in physical environments, their privacy awareness becomes critical. While classic natural-language privacy benchmarks are effectively ‘solved’ by top models (e.g., Gemini 2.5 Pro, GPT-5) with zero secret leaks, privacy in real, physical environments presents new challenges. This work introduces a comprehensive evaluation benchmark to assess whether LLMs can align with human privacy norms when privacy is grounded in physical contexts. We evaluate privacy awareness across four hierarchical tiers: (1) Sensitive Object Identification - identifying inherently sensitive items amid clutter; (2) Privacy in Shifting Environments - judging action appropriateness under changing social contexts; (3) Inferential Privacy under Task Conflicts - balancing task completion against privacy protection when secrets can be inferred from context; and (4) Social Norms vs. Personal Privacy - handling high-stakes dilemmas between safety and privacy. Our findings reveal significant performance degradation: even the best model (Gemini 2.5 Pro) achieves only 59% accuracy in shifting environments, privacy violations exceed 70% for most models in task conflicts, and models show concerning disregard for critical norms over 15% of the time in high-stakes scenarios.
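To make the tier structure concrete, the sketch below shows one plausible way such benchmark items could be represented and scored per tier. The schema, field names, and scoring helper are illustrative assumptions, not the benchmark's actual format.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    # Hypothetical item schema (field names are illustrative).
    tier: int            # 1-4, matching the four tiers above
    scenario: str        # natural-language description of the physical scene
    options: list[str]   # candidate actions or answers shown to the model
    correct_index: int   # index of the norm-aligned option

def per_tier_accuracy(items: list[BenchmarkItem], choose) -> dict[int, float]:
    """Aggregate accuracy per tier; `choose(item)` returns the model's chosen index."""
    correct: dict[int, int] = defaultdict(int)
    total: dict[int, int] = defaultdict(int)
    for item in items:
        total[item.tier] += 1
        correct[item.tier] += int(choose(item) == item.correct_index)
    return {tier: correct[tier] / total[tier] for tier in total}
```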
Beyond the headline numbers, three further patterns stand out:
More complex physical scenes produce significant drops in the consistency of models’ privacy judgments.
Models tend to avoid “hard negative” options but over-select “neutral” ones instead of the correct “hard positive” choices (see the selection-rate sketch after these findings).
Enabling explicit reasoning often degrades performance across evaluation tiers, suggesting that more elaborate reasoning does not automatically improve privacy awareness.
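One way to quantify the second pattern is to tally how often a model selects each option category; the category labels and helper below are illustrative, not the paper's analysis code.

```python
from collections import Counter

# Hypothetical option-category labels attached to each answer choice.
CATEGORIES = ("hard_positive", "neutral", "hard_negative")

def selection_rates(choices: list[str]) -> dict[str, float]:
    """choices: category labels of the options a model actually selected."""
    counts = Counter(choices)
    n = len(choices)
    return {cat: counts[cat] / n for cat in CATEGORIES}

# Example of the reported pattern: neutral options over-selected,
# hard negatives avoided, hard positives under-selected.
print(selection_rates(["neutral"] * 6 + ["hard_positive"] * 3 + ["hard_negative"]))
# -> {'hard_positive': 0.3, 'neutral': 0.6, 'hard_negative': 0.1}
```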
Tier 1: Sensitive Object Identification. Tests the ability to identify inherently sensitive items (e.g., passports, SSNs) amid clutter, evaluating spatial grounding and an understanding of what counts as “sensitive” in physical space. Performance drops significantly as clutter increases, and models over-flag non-sensitive items.
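Separating missed detections from over-flagging can be done with precision and recall over the flagged set; this is a minimal sketch under assumed inputs, not the benchmark's scoring code.

```python
def flagging_scores(flagged: set[str], sensitive: set[str]) -> dict[str, float]:
    """Precision/recall for sensitive-object identification.
    Low precision indicates over-flagging of non-sensitive items;
    low recall indicates missed sensitive items."""
    true_pos = len(flagged & sensitive)
    precision = true_pos / len(flagged) if flagged else 1.0
    recall = true_pos / len(sensitive) if sensitive else 1.0
    return {"precision": precision, "recall": recall}

# Example: the model flags the passport but also two harmless items,
# and misses the SSN card entirely (item names are hypothetical).
print(flagging_scores({"passport", "coffee mug", "keys"}, {"passport", "ssn_card"}))
# -> precision ~0.33 (over-flagging), recall 0.5 (missed item)
```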
Tier 2: Privacy in Shifting Environments. Evaluates whether a model judges an action as appropriate or not as the social context changes, since the same action can be acceptable in one context and a privacy violation in another. Even the best-performing model (Gemini 2.5 Pro) correctly selects the appropriate action only 59% of the time.
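The shifting-environment setup can be pictured as one scene paired with multiple social contexts, each changing which action is norm-aligned. The item below is a hypothetical illustration, not drawn from the benchmark.

```python
# Hypothetical shifting-context item: one scene, several social contexts,
# each with its own norm-aligned action (all contents are illustrative).
item = {
    "scene": "A medical bill is lying open on the kitchen table.",
    "contexts": {
        "owner_alone": "leave_in_place",                  # no privacy risk
        "dinner_guests_arriving": "move_to_private_spot", # shield from visitors
    },
}

def appropriate_action(context: str) -> str:
    return item["contexts"][context]

# The correct answer flips purely because the social context changed.
assert appropriate_action("owner_alone") != appropriate_action("dinner_guests_arriving")
```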
Tier 3: Inferential Privacy under Task Conflicts. Tests the tension between “finish the task” and “don’t reveal a secret that can be inferred from context.” Privacy violations exceed 70% for most models (reaching up to 98%), while task completion rates often approach 0%.
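A natural way to report this tier is with a pair of rates, one for privacy violations and one for task completion; the record format and helper below are assumptions for illustration.

```python
# Hypothetical per-episode outcome records for the task-conflict tier:
# leaked = the model's behavior revealed the inferable secret,
# completed = the assigned task was finished.
episodes = [
    {"leaked": True,  "completed": True},
    {"leaked": True,  "completed": False},
    {"leaked": False, "completed": False},
]

def conflict_metrics(episodes: list[dict]) -> tuple[float, float]:
    n = len(episodes)
    violation_rate = sum(e["leaked"] for e in episodes) / n
    completion_rate = sum(e["completed"] for e in episodes) / n
    return violation_rate, completion_rate

print(conflict_metrics(episodes))  # -> (0.666..., 0.333...)
```

Reporting the two rates together matters: a model could trivially reach zero violations by refusing every task, so the reported pattern of high violations alongside near-zero completion means models pay the privacy cost without even gaining the task benefit.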
Tier 4: Social Norms vs. Personal Privacy. Tests high-stakes dilemmas (e.g., safety vs. privacy). While models improve on this tier, they still fail in non-trivial cases, with aggregate analyses revealing disregard for critical norms more than 15% of the time.
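The “disregard for critical norms” figure can be read as the fraction of dilemmas where the model's choice departs from the norm-aligned one; the record layout below is a hypothetical sketch.

```python
# Hypothetical high-stakes dilemma records: each pairs the model's choice
# with the single norm-aligned choice when a critical norm (e.g., safety)
# should override privacy. All action names are illustrative.
records = [
    {"choice": "alert_caregiver", "norm_aligned": "alert_caregiver"},
    {"choice": "stay_silent",     "norm_aligned": "alert_caregiver"},  # norm disregarded
    {"choice": "alert_caregiver", "norm_aligned": "alert_caregiver"},
]

disregard_rate = sum(r["choice"] != r["norm_aligned"] for r in records) / len(records)
print(f"{disregard_rate:.0%}")  # -> 33%
```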
This research highlights critical gaps in the privacy awareness of current LLMs operating in physical environments. As chatbots evolve into embodied assistants in homes and workplaces, understanding and addressing these privacy limitations becomes essential for responsible AI deployment.