Summary
While current large language models (LLMs) hit a wall with extensive contexts, the human brain effortlessly organizes and retrieves experiences spanning a lifetime. Drawing inspiration from human cognition, we introduce EM-LLM (Episodic Memory LLM), an architecture that integrates key aspects of human episodic memory and event cognition into LLMs with no fine-tuning required: it works out of the box with any Transformer-based LLM. Beyond achieving state-of-the-art (SOTA) performance on long-context tasks, EM-LLM organizes information in a way that shows remarkable similarities to human memory patterns (our analysis below reveals strong correlations between EM-LLM's event segmentation and human-perceived events), suggesting we're on the right track in bridging artificial and biological information processing.
At its core, EM-LLM organizes incoming information into coherent episodic events, much as humans naturally segment their experiences. It does this through a combination of Bayesian surprise detection and graph-theoretic boundary refinement, operating in real time as information flows in. When needed, these events are retrieved through a two-stage memory process that mirrors human memory-access patterns, combining similarity-based search with temporal relationships.
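To make the surprise-based segmentation concrete, here is a minimal sketch, assuming a causal LM that exposes next-token logits. A token is treated as an event boundary when its surprise (negative log-likelihood) exceeds a running threshold over a recent window; the function name, window size, and `gamma` are illustrative choices rather than EM-LLM's exact parameters.

```python
import torch

def surprise_boundaries(logits, token_ids, window=128, gamma=1.0):
    """Mark event boundaries where token surprise exceeds a running
    threshold (mean + gamma * std of surprise over the preceding window).

    logits:    (seq_len, vocab_size) next-token logits from a causal LM
    token_ids: (seq_len,) the tokens that were actually observed
    Returns a boolean tensor of length seq_len marking boundary tokens.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    # Surprise of token t is -log p(x_t | x_<t); logits[t-1] predicts token t.
    surprise = -log_probs[:-1].gather(1, token_ids[1:, None]).squeeze(1)

    boundaries = torch.zeros(token_ids.shape[0], dtype=torch.bool)
    for t in range(1, surprise.shape[0]):
        past = surprise[max(0, t - window):t]
        threshold = past.mean() + gamma * past.std(unbiased=False)
        if surprise[t] > threshold:
            boundaries[t + 1] = True  # token t+1 starts a new event
    return boundaries
```

Because the threshold depends only on already-seen tokens, this kind of check can run online during inference, matching the streaming setting described above.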
Experiments on the LongBench and ∞-Bench benchmarks, which evaluate LLM performance on long and extremely long contexts, demonstrate EM-LLM's superior performance: it consistently outperforms InfLLM (Xiao et al., NeurIPS 2024), the SOTA long-context LLM architecture, across various baseline LLMs. In addition, EM-LLM outperforms a SOTA RAG setup, built on the NV-Embed-v2 retriever (ranked No. 1 on the Massive Text Embedding Benchmark, MTEB, as of Oct 25, 2024), in a wide range of tasks while requiring similar resources. Notably, EM-LLM even surpasses full-context models, which process the entire input at once without any chunking or retrieval, in most tasks, while successfully performing retrieval across 10M tokens (roughly 7,500 pages of text), a scale that is computationally infeasible for such models. Our analysis reveals strong correlations between EM-LLM's event segmentation and human-perceived events, suggesting a bridge between this artificial system and its biological counterpart and offering a novel computational framework for exploring human memory mechanisms.
Architecture
EM-LLM brings human-like memory capabilities to LLMs through three key innovations: (1) an initial segmentation of the context window into events based on a surprise metric, (2) the refinement of these event boundaries using graph-theoretic measures, and (3-4) a two-stage memory retrieval process. The complete EM-LLM architecture, showing all components and their interactions, is shown below; you can hover over each section to explore the individual components.
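As a rough illustration of the two-stage retrieval step (components 3-4 above), the sketch below first performs a similarity search over per-event representations and then adds each hit's temporal neighbours via a contiguity buffer. The event representations, top-k value, and neighbourhood size are assumptions for illustration; in EM-LLM itself, retrieval operates over cached key-value segments inside the attention mechanism rather than standalone vectors.

```python
import numpy as np

def retrieve_events(query_repr, event_reprs, k_sim=4, k_contig=2):
    """Two-stage retrieval: (1) similarity search over stored events,
    (2) a contiguity buffer that also pulls in temporal neighbours of
    the retrieved events.

    query_repr:  (d,) representation of the current query
    event_reprs: (n_events, d) one representation per stored event
    Returns the indices of selected events in their original (temporal) order.
    """
    # Stage 1: similarity-based retrieval (cosine similarity, top-k).
    q = query_repr / np.linalg.norm(query_repr)
    E = event_reprs / np.linalg.norm(event_reprs, axis=1, keepdims=True)
    similar = np.argsort(-(E @ q))[:k_sim]

    # Stage 2: contiguity buffer -- include events adjacent in time,
    # mirroring the temporal-contiguity effect seen in human free recall.
    contiguous = set()
    for idx in similar:
        for offset in range(1, k_contig + 1):
            for neighbour in (idx - offset, idx + offset):
                if 0 <= neighbour < len(event_reprs):
                    contiguous.add(int(neighbour))

    return sorted(set(int(i) for i in similar) | contiguous)
```

Returning the selected events in temporal order preserves the relative position of episodes when they are placed back into the context window.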
Performance Results
EM-LLM sets new benchmarks across multiple long-context tasks, consistently outperforming both current SOTA models and traditional RAG approaches. We evaluated EM-LLM on LongBench and ∞-Bench across a wide range of long-context tasks (including tasks with millions of tokens), comparing it to the current SOTA in both RAG retrieval and long-context architectures. Here's a quick look at how we did:
EM-LLM vs RAG and full-context models
On the left in the figure below, we compare EM-LLM, RAG (NV-Embed-v2 retriever), and full-context processing, with LLaMA-3.1-8B as the base LLM, evaluated on LongBench. On the right, we compare various long-sequence methods (sorted by context-window length) on an extended version of ∞-Bench's Retrieve.PassKey.
EM-LLM vs InfLLM
We also compared EM-LLM against the method in the literature that is closest to ours in architecture and is also the SOTA (at the time of writing) on long-context benchmarks. The LongBench results are shown in the figure below: EM-LLM shows consistent improvements, with standout performance (up to 40% improvement) on retrieval and QA tasks. The full result tables for both benchmarks are provided below.
Full benchmark tables
Performance on all LongBench tasks (SQA: single-document QA; MQA: multi-document QA; Sum: summarization; FSL: few-shot learning; Ret: retrieval; Cod: code completion). Suffixes indicate the EM-LLM variant: S uses surprise-based segmentation only, SM adds graph-theoretic boundary refinement, and +C adds the contiguity buffer.

| Base LLM | Method | SQA | MQA | Sum | FSL | Ret | Cod | Avg. |
|---|---|---|---|---|---|---|---|---|
| Mistral v2 | InfLLM (4k+2k) | 33.0 | 25.5 | 27.1 | 66.1 | 64.0 | 54.8 | 41.9 |
| | EM-LLM (SM+C) | 32.9 | 27.0 | 27.2 | 66.8 | 84.1 | 54.8 | 43.7 |
| LLaMA 3 | InfLLM (4k+4k) | 38.5 | 36.9 | 27.0 | 69.0 | 84.0 | 53.2 | 47.0 |
| | EM-LLM (S) | 39.3 | 37.7 | 27.0 | 69.2 | 87.5 | 50.3 | 47.2 |
| LLaMA 3.1 | InfLLM (4k+4k) | 41.4 | 40.7 | 29.0 | 69.0 | 97.0 | 64.2 | 51.1 |
| | EM-LLM (SM) | 41.2 | 41.3 | 29.2 | 69.1 | 98.5 | 64.1 | 51.3 |
| Phi 3 | InfLLM (1k+3k) | 28.4 | 24.9 | 25.6 | 52.9 | 7.5 | 57.0 | 34.5 |
| | EM-LLM (S) | 29.2 | 27.1 | 25.9 | 53.5 | 10.0 | 57.0 | 35.4 |
| Phi 3.5 | InfLLM (1k+3k) | 31.7 | 28.5 | 23.9 | 56.3 | 11.5 | 40.3 | 34.2 |
| | EM-LLM (S) | 31.8 | 31.9 | 24.5 | 55.5 | 13.0 | 39.5 | 34.9 |
Performance on ∞-Bench tasks (C.D: Code.Debug; M.F: Math.Find; MC: En.MC multiple-choice QA; R.KV: Retrieve.KV; R.P: Retrieve.PassKey; R.N: Retrieve.Number).

| Base LLM | Method | C.D | M.F | MC | R.KV | R.P | R.N |
|---|---|---|---|---|---|---|---|
| Mistral v2 | InfLLM (4k+2k) | 29.4 | 26.6 | 43.2 | 95.6 | 100.0 | 99.8 |
| | EM-LLM (SM+C) | 28.2 | 27.1 | 42.8 | 99.0 | 100.0 | 99.8 |
| LLaMA 3 | InfLLM (4k+4k) | 30.5 | 23.7 | 43.7 | 5.0 | 100.0 | 99.0 |
| | EM-LLM (S) | 31.7 | 16.9 | 40.6 | 4.2 | 100.0 | 99.6 |
| LLaMA 3.1 | InfLLM (4k+4k) | 22.6 | 33.7 | 46.7 | 81.0 | 100.0 | 100.0 |
| | EM-LLM (SM) | 22.6 | 34.0 | 47.6 | 90.2 | 100.0 | 100.0 |
Human-like Event Segmentation
Additionally, our analysis reveals strong correlations between EM-LLM's surprise-based event segmentation and human-perceived events, suggesting a bridge between these two systems. For example, consider the figure below:
These graphs present results from a study where participants listened to a podcast and indicated points they perceived as event boundaries. We then compared various AI segmentation methods, including EM-LLM, against these human annotations. The height of each bar represents how closely the method aligns with human judgments. Notably, our surprise-based approaches (S, SM, SC) consistently outperform fixed-interval methods (F, FM, FC), with EM-LLM closely mirroring human intuition. This alignment suggests that EM-LLM's event detection mechanism captures something fundamental about how humans naturally segment continuous experiences.
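For readers who want to run a similar comparison on their own data, the sketch below computes a toy alignment score: the average human boundary agreement at model-detected boundaries, measured against a permutation baseline with the same number of randomly placed boundaries. This is purely illustrative; the alignment metrics reported in the paper and the figure above are computed differently.

```python
import numpy as np

def human_alignment_score(model_boundaries, human_agreement,
                          n_shuffles=1000, seed=0):
    """Toy alignment score between model and human event boundaries.

    model_boundaries: (seq_len,) boolean array of model-detected boundaries
    human_agreement:  (seq_len,) fraction of participants who marked a
                      boundary at each position
    Returns (raw score, z-score relative to a random-boundary baseline).
    """
    rng = np.random.default_rng(seed)
    raw = human_agreement[model_boundaries].mean()

    # Permutation baseline: same number of boundaries, placed at random.
    n_bounds = int(model_boundaries.sum())
    null = np.empty(n_shuffles)
    for i in range(n_shuffles):
        positions = rng.choice(len(human_agreement), n_bounds, replace=False)
        null[i] = human_agreement[positions].mean()

    return raw, (raw - null.mean()) / null.std()
```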
Conclusion
EM-LLM represents a significant step forward in the development of language models with extended context-processing capabilities. By bridging insights from cognitive science with machine learning, our approach not only enhances the performance of LLMs on long-context tasks but also provides a scalable computational framework for testing hypotheses about human memory.
Cite Us
@inproceedings{fountas2025humaninspired,
title={Human-inspired Episodic Memory for Infinite Context {LLM}s},
author={Zafeirios Fountas and Martin Benfeghoul and Adnan Oomerjee and Fenia Christopoulou and Gerasimos Lampouras and Haitham Bou Ammar and Jun Wang},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=BI2int5SAC}
}