
Originally published at aimodels.fyi

AI Fail: LLMs Forget You! New Benchmark Exposes Personalization Gaps

This is a Plain English Papers summary of a research paper called AI Fail: LLMs Forget You! New Benchmark Exposes Personalization Gaps. If you like this kind of analysis, you should join AImodels.fyi or follow us on Twitter.

Personalized AI Assistants: How Well Do LLMs Remember You?

Current AI assistants struggle to deliver truly personalized experiences. Despite advances in large language models (LLMs), these systems often fail to remember user preferences over time or adapt to changing circumstances.

The new PersonaMem benchmark reveals that even leading models like GPT-4.5 and Gemini-2.0 achieve only about 50% accuracy when responding to users based on their evolving preferences and characteristics.

Overview of PersonaMem benchmark. Each benchmark sample is a user persona with static (e.g., demographic info.) and dynamic attributes (e.g., evolving preferences). Users engage with a chatbot in multi-session interactions across various topics such as food recommendations, travel planning, and therapy consultations. As the user's preferences evolve over time, the benchmark tests whether models can track and incorporate these changes into their responses.
Overview of PersonaMem benchmark showing how user profiles evolve over time through multiple interactions

The Challenge of Dynamic User Profiling

Imagine telling your AI assistant that you love Italian food, but later mentioning you've developed gluten intolerance and now prefer Mediterranean cuisine. Months later, when discussing restaurant options, will the AI remember your current preference or suggest pasta?

This real-world scenario highlights a fundamental challenge: LLMs struggle to track evolving user preferences over time. The PersonaMem benchmark systematically evaluates whether models can:

  1. Internalize a user's traits and preferences
  2. Track how these preferences evolve over time
  3. Generate personalized responses in new scenarios

The benchmark features over 180 simulated user-LLM interaction histories with up to 60 multi-turn conversations across 15 different tasks. Each history represents a detailed user persona whose characteristics evolve naturally over time.
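
To make the setup concrete, here is a minimal sketch of how one benchmark sample could be represented in code. The schema and field names (e.g. `preference_updates`, `answer_index`) are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import Dict, List

# Illustrative schema only; the released PersonaMem data format may differ.
@dataclass
class PreferenceUpdate:
    session_index: int   # when in the interaction history the change occurred
    topic: str           # e.g. "food", "travel"
    old_value: str       # e.g. "Italian cuisine"
    new_value: str       # e.g. "Mediterranean cuisine"
    reason: str          # e.g. "developed gluten intolerance"

@dataclass
class BenchmarkSample:
    persona_id: str
    static_attributes: Dict[str, str]           # demographics and other fixed traits
    preference_updates: List[PreferenceUpdate]  # dynamic attributes that evolve over time
    sessions: List[List[str]]                   # multi-turn conversations, one list of turns per session
    query: str                                  # the in-situ user query to answer
    query_type: str                             # one of the seven query types (see below)
    choices: List[str]                          # candidate responses in the multiple-choice setting
    answer_index: int                           # index of the personalization-consistent response
```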

The seven in-situ query types, with example user queries and chatbot responses:

  1. Recall user-shared facts: "User: I shared my playlist with my friends and they loved it. ... (later) User: What are some creative ways to share music? - Chatbot: Curating personalized playlists can be fun."
  2. Suggest new ideas: "User: ... (later) User: Suggest new restaurant I haven't ordered before."
  3. Acknowledge latest user preferences: "User: I am a big fan of Italian food. ... (later) User: I prefer Mediterranean cuisine now. ... (later) User: I went to an Italian restaurant yesterday - Chatbot: Nice to hear! Though I thought you prefer Mediterranean food now."
  4. Track full preference evolution: "User: ... (later) User: I am thinking about attending a cooking class. - Chatbot: I understand your perspective on cooking classes has evolved. Initially, you expressed a dislike, but later you mentioned your interest in them."
  5. Revisit reasons behind preference updates: "User: I am a big fan of Italian food. ... (later) User: I recently found I'm allergic to gluten, so I've shifted to other dishes. ... (later) User: I went to an Italian restaurant yesterday - Chatbot: How's your allergy been? I guess they must have a gluten-free pizza."
  6. Provide preference-aligned recommendations: "User: I prefer Mediterranean seafood now. ... (later) User: How can I make impactful lifestyle changes? - Chatbot: How about diving into the Mediterranean diet plan?"
  7. Generalize to new scenarios: "User: I prefer my yoga class because of its fixed weekly schedule. ... (later) User: I recently haven't had time to cook. - Chatbot: Since you seem to love routine would you be interested in weekly scheduled bento-box deliveries?"

Examples of the seven types of personalization queries used to evaluate LLM performance
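
As a rough sketch of how per-query-type accuracy (the metric reported in the results below) could be computed once each question is graded, the snippet below groups graded questions by query type. The string labels are hypothetical shorthand for the seven categories above, not identifiers from the paper.

```python
from collections import defaultdict

# Hypothetical shorthand labels for the seven query types listed above.
QUERY_TYPES = [
    "recall_facts", "suggest_new_ideas", "acknowledge_latest_preference",
    "track_preference_evolution", "revisit_update_reasons",
    "preference_aligned_recommendation", "generalize_to_new_scenarios",
]

def per_type_accuracy(results):
    """results: iterable of (query_type, is_correct) pairs from graded multiple-choice questions."""
    totals, correct = defaultdict(int), defaultdict(int)
    for qtype, is_correct in results:
        totals[qtype] += 1
        correct[qtype] += int(is_correct)
    return {qtype: correct[qtype] / totals[qtype] for qtype in totals}

# Example: per_type_accuracy([("recall_facts", True), ("suggest_new_ideas", False)])
```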

Creating Realistic User Profiles at Scale

Building a benchmark with long-term user interactions poses significant challenges. The researchers developed a scalable and cost-effective approach to generate realistic conversations that evolve over time.

An overview of the persona-oriented multi-session data curation process. We construct user personas, build time-stamped general and topic-specific personal histories, expand them into conversation sessions, and topologically concatenate sessions to create long conversation contexts—resulting in a scalable generation framework.
The data generation pipeline creates realistic user profiles and conversations at scale

The process begins by creating detailed user personas with demographic information and personal histories. These profiles include preferences that evolve naturally over time through specific events (like developing food allergies or changing career interests).

The system then generates conversations between these simulated users and AI assistants across multiple sessions and topics, carefully maintaining consistency with the evolving user profile. Human evaluators confirmed the high quality of these generated conversations, with over 97% rated as appropriate and relevant.

This efficient approach costs approximately $2 per persona per conversation topic, making it practical to create extensive evaluation datasets with long conversational contexts.
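
A minimal sketch of such a pipeline is below, assuming a generic `complete(prompt)` helper that wraps whatever LLM is used for generation. The function names and prompts are illustrative, not the authors' implementation.

```python
from typing import Callable, Dict, List

def build_persona(complete: Callable[[str], str], seed: Dict[str, str]) -> str:
    """Expand a demographic seed into a detailed user persona description."""
    return complete(f"Write a detailed user persona based on: {seed}")

def build_personal_history(complete: Callable[[str], str], persona: str, topic: str, n_events: int = 5) -> List[str]:
    """Generate time-stamped events (e.g. a new food allergy) that gradually change preferences."""
    text = complete(
        f"Given this persona:\n{persona}\n"
        f"List {n_events} dated life events related to {topic} that gradually change their preferences."
    )
    return [line for line in text.splitlines() if line.strip()]

def build_session(complete: Callable[[str], str], persona: str, events_so_far: List[str], topic: str) -> str:
    """Generate one multi-turn user-chatbot session consistent with the history so far."""
    return complete(
        f"Persona:\n{persona}\nEvents so far:\n{events_so_far}\n"
        f"Write a multi-turn conversation between this user and a chatbot about {topic}."
    )

def build_long_context(complete: Callable[[str], str], seed: Dict[str, str], topics: List[str]) -> str:
    """Concatenate generated sessions into one long conversation context
    (a simplification of the paper's topological ordering of sessions)."""
    persona = build_persona(complete, seed)
    sessions = []
    for topic in topics:
        events = build_personal_history(complete, persona, topic)
        for i in range(1, len(events) + 1):
            sessions.append(build_session(complete, persona, events[:i], topic))
    return "\n\n".join(sessions)
```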

How Well Do LLMs Remember You?

The research evaluated 15 leading language models including GPT-4.5, GPT-4o, Claude-3.7, Gemini-2.0, and Llama-4 on their ability to provide personalized responses based on user conversation history.

Evaluation results across different models on the 7 in-situ query types. Models perform reasonably well at recalling user facts and preferences, but struggle to provide novel suggestions or to apply users' preferences in new scenarios.
Performance of various LLMs across different personalization tasks

The results reveal several key insights:

  1. Even the best models struggle with personalization - GPT-4.5 and Gemini-1.5 achieved the highest scores but still only reached about 52% accuracy in a multiple-choice setting.

  2. Models are better at recall than adaptation - LLMs performed reasonably well at remembering static facts about users (60-70% accuracy) but struggled to incorporate users' latest preferences into their responses.

  3. Position matters for memory - Information located in the middle of long conversations was more likely to be forgotten than details mentioned at the beginning or end.

Model performance by number of sessions elapsed since the most recent preferences were mentioned in long context. Top: up to 20 sessions/128k tokens; bottom: up to 60 sessions/1M tokens. Long-context retrieval is important for personalization in practice.
Performance declines as more time passes since preferences were mentioned

  4. Generating new recommendations is challenging - Models performed worst on tasks requiring creative recommendations aligned with user preferences (30-50% accuracy).

  5. External memory helps - Adding retrieval-augmented generation (RAG) significantly improved performance, suggesting that current LLM architectures need support for better information retrieval (see the sketch after the figure below).

Performance on different question types for GPT-4o and GPT-4o-mini with 32k-token contexts. We compare vanilla models to the ones with Mem0 and RAG setups.
External memory modules significantly improve performance across different question types
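
As a minimal illustration of a retrieval-augmented setup (not the exact RAG or Mem0 configuration evaluated in the paper), one can embed past user turns, retrieve the ones most relevant to the current query, and prepend them to the prompt. The `embed` argument is a stand-in for any sentence-embedding model.

```python
import numpy as np
from typing import Callable, List

def retrieve_relevant_turns(
    embed: Callable[[str], np.ndarray],  # stand-in for any sentence-embedding model
    past_user_turns: List[str],
    query: str,
    k: int = 5,
) -> List[str]:
    """Return the k past user turns most similar to the current query (cosine similarity)."""
    q = embed(query)
    scores = []
    for turn in past_user_turns:
        t = embed(turn)
        scores.append(float(np.dot(q, t) / (np.linalg.norm(q) * np.linalg.norm(t) + 1e-8)))
    top = sorted(range(len(past_user_turns)), key=lambda i: scores[i], reverse=True)[:k]
    return [past_user_turns[i] for i in sorted(top)]  # keep retrieved turns in chronological order

def build_prompt(embed: Callable[[str], np.ndarray], past_user_turns: List[str], query: str) -> str:
    """Prepend retrieved user statements so the model sees the relevant (and latest) preferences."""
    memory = "\n".join(retrieve_relevant_turns(embed, past_user_turns, query))
    return f"Relevant things the user has said before:\n{memory}\n\nUser: {query}\nChatbot:"
```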

Implications for AI Assistants

This research demonstrates that personalization in LLMs remains a significant challenge. Current models struggle with what humans do naturally: remembering important details about people and adapting to their changing preferences over time.

The findings suggest several directions for improvement:

  1. Better memory mechanisms - Models need improved architectures for retaining and retrieving user information over long periods.

  2. Preference tracking systems - LLMs require explicit mechanisms to track how user preferences evolve and why they change (see the sketch after this list).

  3. Retrieval augmentation - External memory systems like RAG significantly improve personalization performance and should be integrated into AI assistants.

  4. Contextual reasoning - Models need to better generalize user preferences across different contexts and apply them to new situations.
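
One simple way to realize the preference-tracking point above is an explicit, timestamped log of preference changes that always resolves to the most recent entry per topic while keeping the reason behind each change. This is a hypothetical sketch, not a mechanism proposed in the paper.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

@dataclass
class PreferenceEntry:
    topic: str             # e.g. "cuisine"
    value: str             # e.g. "Mediterranean"
    reason: str            # e.g. "developed gluten intolerance"
    updated_at: datetime

class PreferenceTracker:
    """Keeps the full history of updates so both the latest preference and the 'why' are preserved."""

    def __init__(self) -> None:
        self._history: Dict[str, List[PreferenceEntry]] = {}

    def update(self, entry: PreferenceEntry) -> None:
        self._history.setdefault(entry.topic, []).append(entry)

    def current(self, topic: str) -> Optional[PreferenceEntry]:
        entries = self._history.get(topic)
        return max(entries, key=lambda e: e.updated_at) if entries else None

    def evolution(self, topic: str) -> List[PreferenceEntry]:
        """Full chronological history, useful for 'track preference evolution' style queries."""
        return sorted(self._history.get(topic, []), key=lambda e: e.updated_at)
```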

The PersonaMem benchmark provides a valuable framework for evaluating these capabilities as models improve. Alongside related work on learning to remember and AI personas, it establishes clear metrics for measuring progress toward truly personalized AI assistance.

As LLMs continue to evolve, the ability to maintain consistent, accurate, and up-to-date user models will become increasingly important for creating AI assistants that feel genuinely helpful rather than forgetful or generic.

Click here to read the full summary of this paper
