Generative Agents — Twenty-five Lives in Smallville

TL;DR

Why it matters. It is among the first papers to address, at the architectural level, why single LLM calls cannot hold long-horizon behavioral coherence.
What it proposes. A natural-language memory stream, retrieval weighted by recency, importance, and relevance, periodic reflection, and recursive planning, all wired into one loop.
Headline result. From a single seed, “one agent wants to throw a Valentine’s Day party,” word spreads until thirteen of the twenty-five inhabitants know about it; twelve are invited and five actually show up on time.
Limit. The cost is steep (days of wall time, thousands of dollars in tokens), dialogue skews overly polite from instruction tuning, and location drift grows as memories accumulate.
What to keep. Storing memories as plain text with timestamps, and letting reflections land in the same stream as observations, set a template that downstream agent-memory research still inherits.

Background

Human-like agents have been a shared target of games, simulations, and HCI for over four decades. Prior work has either hand-authored behavior with finite state machines and behavior trees or let reinforcement learning optimize a clear reward. LLMs made single-turn responses easy, but open-ended behavior that unfolds across days — where relationships accumulate, news travels, and schedules align — remained out of reach.

The paper narrows the problem cleanly. Compressing past experience into a fixed context window loses information. Summarizing the whole history flattens answers, so “what are you passionate about these days?” comes back generic. What is needed is dynamic retrieval that surfaces relevant memories per situation, together with a mechanism that promotes observations into stable self-understanding.

**Fig. 2.** Smallville and the interior of one house, the space agents inhabit. Source: Park et al. 2023, Fig. 2.

Core Idea

The architecture has three parts. The memory stream is a long list of natural-language observations, each with a creation and last-access timestamp. Reflection groups recent observations into a few salient questions and synthesizes them, with citations, into more abstract statements. Planning starts at a daily sketch and recursively decomposes into hourly and minute-level chunks. Reflections and plans are written back into the same stream, so the whole system is one record-retrieve-synthesize loop.
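The single-stream idea can be made concrete with a small sketch. The class and field names below (`MemoryRecord`, `MemoryStream`, `kind`, `cites`) are hypothetical and not taken from the paper's code; the sketch only assumes what the text states: every record is plain text with creation and last-access timestamps, and reflections and plans land in the same list as observations.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Iterable, List


@dataclass
class MemoryRecord:
    text: str                  # natural-language description
    kind: str                  # "observation", "reflection", or "plan" (hypothetical labels)
    created: datetime
    last_accessed: datetime
    cites: List[int] = field(default_factory=list)  # indices of supporting records


class MemoryStream:
    """One append-only list shared by observations, reflections, and plans."""

    def __init__(self) -> None:
        self.records: List[MemoryRecord] = []

    def add(self, text: str, kind: str, now: datetime, cites: Iterable[int] = ()) -> int:
        """Append a record and return its index in the stream."""
        self.records.append(MemoryRecord(text, kind, now, now, list(cites)))
        return len(self.records) - 1

    def quote(self, indices: Iterable[int], now: datetime) -> str:
        """Return records verbatim for a prompt and refresh their last-access time."""
        lines = []
        for i in indices:
            record = self.records[i]
            record.last_accessed = now
            lines.append(f"[{i}] {record.text}")
        return "\n".join(lines)
```

Because all three kinds share one shape, `quote` needs no per-kind branching, which is the property the loop relies on.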

Keeping everything in natural language is load-bearing. The next prompt can quote memories verbatim without a translation step.

Method

3.1 Memory Stream & Retrieval

The atomic unit is an observation. Lines like “Isabella Rodriguez is setting out the pastries” or “Maria Lopez is studying for a chemistry test” enter the stream. When a query arrives, retrieval scores each memory on three axes.

Recency: exponential decay over sandbox hours since last access (decay factor 0.995).
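Importance: the LLM is asked at creation time to rate poignancy on a 1–10 scale, where 1 is purely mundane (brushing teeth) and 10 is extremely poignant (a breakup); in the paper's examples, cleaning up a room gets a 2 and asking a crush out on a date gets an 8.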
Relevance: cosine similarity between the query embedding and each memory’s embedding.

The three are min-max scaled to $[0, 1]$ and combined as $\text{score} = \alpha_{\text{recency}} \cdot \text{recency} + \alpha_{\text{importance}} \cdot \text{importance} + \alpha_{\text{relevance}} \cdot \text{relevance}$, with all $\alpha$ set to 1. The top-ranked memories that fit the context window are pasted into the prompt. The scoring is simple, but the choice to let the LLM emit a numeric importance once at write time is what later memory systems keep copying.
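The scoring rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Memory` class, the `retrieve` signature, the fixed `k` cutoff (standing in for "whatever fits the context window"), and the handling of ties in `min_max` are all choices made for this example. Embeddings are passed in as plain lists of floats.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

DECAY = 0.995  # recency decay factor per sandbox hour, as stated above


@dataclass
class Memory:
    text: str
    importance: int            # 1-10, rated once at write time and cached
    hours_since_access: float  # sandbox hours since last retrieval
    embedding: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def min_max(values: List[float]) -> List[float]:
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: identical values carry no signal
    return [(v - lo) / (hi - lo) for v in values]


def retrieve(memories: List[Memory], query_embedding: Sequence[float],
             k: int = 3, weights=(1.0, 1.0, 1.0)) -> List[Memory]:
    """Rank memories by weighted recency + importance + relevance; return the top k."""
    if not memories:
        return []
    recency = min_max([DECAY ** m.hours_since_access for m in memories])
    importance = min_max([float(m.importance) for m in memories])
    relevance = min_max([cosine(query_embedding, m.embedding) for m in memories])
    w_rec, w_imp, w_rel = weights
    scored = [
        (w_rec * rec + w_imp * imp + w_rel * rel, m)
        for rec, imp, rel, m in zip(recency, importance, relevance, memories)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]
```

Note that importance is read from the record rather than recomputed, so a retrieval call costs one embedding lookup per memory and no LLM calls.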

3.2 Reflection

Observations alone make an agent pick “the person I bumped into most often” as a best friend. The paper triggers reflection when the summed importance of the most recently perceived events exceeds 150, which in practice happens two or three times per game day.

It runs in two stages. First, the LLM is shown the 100 most recent memories and asked “given only these statements, what are the 3 most salient high-level questions?” Those questions become retrieval queries, and the retrieved memories feed a second prompt asking for five high-level insights with citation indices. The output reads like “Klaus Mueller is dedicated to his research on gentrification (because of 1, 2, 8, 15)” and is stored with pointers back to the supporting memories. Reflections can stand on other reflections, so memory becomes a tree: observations at the leaves, increasingly abstract summaries higher up.
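The trigger and the resulting tree can be sketched without any LLM calls. The names below (`ReflectiveStream`, `Node`, `depth`) are hypothetical, and two details are assumptions of this sketch rather than facts from the text: importance is accumulated over memories added since the last reflection, and the counter resets to zero once a reflection is stored.

```python
from dataclasses import dataclass, field
from typing import Iterable, List

REFLECTION_THRESHOLD = 150  # importance threshold stated above


@dataclass
class Node:
    text: str
    importance: int
    cites: List[int] = field(default_factory=list)  # empty for observations


class ReflectiveStream:
    def __init__(self) -> None:
        self.nodes: List[Node] = []
        self.importance_since_reflection = 0

    def add_observation(self, text: str, importance: int) -> bool:
        """Append an observation; return True when a reflection is due."""
        self.nodes.append(Node(text, importance))
        self.importance_since_reflection += importance
        return self.importance_since_reflection > REFLECTION_THRESHOLD

    def add_reflection(self, text: str, importance: int, cites: Iterable[int]) -> int:
        """Store a reflection with pointers to its evidence and reset the counter."""
        self.nodes.append(Node(text, importance, list(cites)))
        self.importance_since_reflection = 0  # assumption: reset after reflecting
        return len(self.nodes) - 1

    def depth(self, index: int) -> int:
        """0 for observations, 1 + the deepest cited node for reflections."""
        node = self.nodes[index]
        if not node.cites:
            return 0
        return 1 + max(self.depth(c) for c in node.cites)
```

In a full system, the text passed to `add_reflection` would come from the two-stage prompt described above; here it is supplied by the caller.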

3.3 Planning & Reaction

Asking only “what next?” makes Klaus eat lunch at 12, 12:30, and 1. Moment-to-moment plausibility breaks long-horizon coherence. Planning goes top-down. A daily sketch of five to eight bullets is generated from the agent’s summary and yesterday’s log, then each block is decomposed into hour-long segments, then into five-to-fifteen-minute actions.
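The top-down decomposition can be illustrated as a recursion over granularities. In the paper each refinement is an LLM call; the `even_split` function below is a deterministic stand-in so the structure can run on its own, and `PlanItem`, `decompose`, and `leaves` are hypothetical names. The granularities of 60 and 15 minutes follow the hour-long and five-to-fifteen-minute levels described above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class PlanItem:
    start_min: int       # minutes since midnight
    duration_min: int
    description: str
    children: List["PlanItem"] = field(default_factory=list)


def even_split(item: PlanItem, chunk_min: int) -> List[PlanItem]:
    """Stand-in for the LLM: cut an item into chunks of at most chunk_min minutes."""
    parts = []
    t = item.start_min
    end = item.start_min + item.duration_min
    n = 1
    while t < end:
        length = min(chunk_min, end - t)
        parts.append(PlanItem(t, length, f"{item.description} (part {n})"))
        t += length
        n += 1
    return parts


def decompose(item: PlanItem, granularities: Tuple[int, ...] = (60, 15),
              split: Callable[[PlanItem, int], List[PlanItem]] = even_split) -> PlanItem:
    """Refine an item level by level, coarsest granularity first."""
    if not granularities:
        return item
    chunk, rest = granularities[0], granularities[1:]
    if item.duration_min > chunk:
        item.children = [decompose(child, rest, split) for child in split(item, chunk)]
    else:
        decompose(item, rest, split)  # already this fine; try the next level down
    return item


def leaves(item: PlanItem) -> List[PlanItem]:
    """Flatten the plan tree into its finest-grained actions, in time order."""
    if not item.children:
        return [item]
    return [leaf for child in item.children for leaf in leaves(child)]
```

Keeping every level in the tree, rather than only the leaves, is what lets a later step replace one branch without touching the rest of the day.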

At each time step, the action loop perceives the surroundings, writes observations into the stream, and asks whether to stay on plan or react. When a reaction is needed, two auxiliary queries build a context summary — the observer’s relationship with the target, and the target’s current status — and the plan is regenerated from that point forward. Dialogue is produced by each speaker retrieving memories about the other; the listener treats the utterance as an event and runs the same loop.
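Regenerating the plan "from that point forward" amounts to keeping the past and replacing the future. The sketch below assumes a flat list of `(start, duration, description)` steps and a caller-supplied `regenerate` function standing in for the LLM; both are illustrative choices, as is the decision to truncate an in-progress step at the current time.

```python
from typing import Callable, List, Tuple

Step = Tuple[int, int, str]  # (start minute, duration in minutes, description)


def replan_from(plan: List[Step], now_min: int,
                regenerate: Callable[[int], List[Step]]) -> List[Step]:
    """Keep what has already happened and regenerate everything after now_min."""
    kept: List[Step] = []
    for start, duration, description in plan:
        if start >= now_min:
            continue  # future step: dropped and left to regenerate()
        # past or in-progress step: keep it, cut off at the current time
        kept.append((start, min(duration, now_min - start), description))
    return kept + regenerate(now_min)
```

Only the tail is sent back to the model, so a reaction at 10:30 does not pay for replanning the morning.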

The environment is a tree of areas and objects rendered into natural language (“there is a stove in the kitchen”). Each agent only maintains the subgraph it has actually seen, and refreshes it on re-entry. Choosing where to act is itself recursive: the model is asked, starting at the root, which sub-area best fits the activity until it reaches a leaf.
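Both the rendering and the root-to-leaf descent can be sketched over a nested dictionary. The representation, the `describe` and `choose_location` names, and the `pick` callback (a stand-in for asking the model which sub-area fits the activity) are assumptions of this example, not the paper's data structures.

```python
from typing import Callable, Dict, List


def describe(name: str, tree: Dict[str, dict]) -> List[str]:
    """Render each parent-child edge as a sentence, depth first."""
    lines: List[str] = []
    for child, subtree in tree.items():
        lines.append(f"there is a {child} in the {name}")
        lines.extend(describe(child, subtree))
    return lines


def choose_location(tree: Dict[str, dict], activity: str,
                    pick: Callable[[str, List[str]], str]) -> List[str]:
    """Descend from the root, asking pick() for one child per level, until a leaf."""
    path: List[str] = []
    node = tree
    while node:  # an empty dict marks a leaf
        choice = pick(activity, list(node))
        path.append(choice)
        node = node[choice]
    return path
```

Each call to `pick` sees only the children of the current node, which keeps every prompt short even when the full world tree is large.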

Experiments

The evaluation has two halves.

Controlled evaluation asks each agent five categories of questions — self-knowledge, memory, plans, reactions, reflections — and has 100 participants rank the believability of responses from four architectural variants plus a crowdworker baseline. Translating ranks to TrueSkill gives:

| Condition | TrueSkill μ | σ |
| --- | --- | --- |
| Full architecture | 29.89 | 0.72 |
| No reflection | 26.88 | 0.69 |
| No reflection, no planning | 25.64 | 0.68 |
| Human crowdworker | 22.95 | 0.69 |
| No observation, planning, or reflection | 21.21 | 0.70 |

Cohen’s $d$ between the full condition and the fully ablated baseline, which stands in for the prior state of the art, is 8.16. Kruskal-Wallis reports $H(4) = 150.29$, $p < 0.001$, and Dunn post-hoc tests find every pairwise contrast significant except crowdworker versus fully ablated. The crowdworker ranking second-lowest is telling. Thirty minutes of roleplay cannot match an agent carrying two game days of specific memory.

End-to-end evaluation runs all twenty-five agents for two continuous game days. Three emergent signals appear. Awareness of Sam’s mayoral candidacy spreads from 1 agent (4%) to 8 (32%); knowledge of Isabella’s party from 1 (4%) to 13 (52%); the relationship graph thickens, with network density rising from 0.167 to 0.74. Across 453 responses about other agents, only 1.3% (n=6) were hallucinated. Twelve were invited to the party and five arrived on time; of the seven no-shows, three cited conflicts and four said they were interested but never actually planned to go.
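For readers unfamiliar with the network density figure, it is the fraction of possible pairs that are actually connected. The sketch below uses the standard undirected-graph definition, $2|E| / (|V|(|V| - 1))$; the function name and the edge-list input format are assumptions of this example, and the toy graph in the usage note is not data from the paper.

```python
from typing import Iterable, Tuple


def network_density(n_nodes: int, edges: Iterable[Tuple[str, str]]) -> float:
    """Share of possible undirected pairs that are connected."""
    # Treat (a, b) and (b, a) as the same edge and ignore self-loops.
    unique = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    possible = n_nodes * (n_nodes - 1) / 2
    return len(unique) / possible if possible else 0.0
```

With 25 agents there are 300 possible pairs, so the denominator is fixed and the rise in density reflects new acquaintances only. As a toy check, `network_density(4, [("a", "b"), ("c", "d")])` gives 2 of 6 possible pairs, about 0.33.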

Failure modes fall into three buckets. Retrieval misses, as when Tom told an interviewer he wasn’t sure the party existed while also saying he needed to discuss the election at it. Embellishment, as when Isabella invented a nonexistent announcement, or Yuriko confused her neighbor Adam Smith with the 18th-century author of Wealth of Nations. And instruction-tuning residue, which pushes dialogue toward stilted politeness and makes agents over-agreeable — Isabella absorbed party suggestions from others until she reported liking English literature.

Limitations

The authors are candid. Simulating twenty-five agents for two days took multiple days of wall time and thousands of dollars in tokens. Evaluation is bounded by a short timescale and an average-quality human baseline. As memories grow, location selection drifts: agents start treating a bar as a lunch venue because they have heard of it. Physical norms that are hard to phrase in text, like a one-person dorm bathroom or a 5 pm store closing time, fail to transfer. Robustness to prompt injection and “memory hacking” (convincing an agent through dialogue that a fabricated past actually occurred) is untested.

Sapiens Q Take

The first thing worth carrying from this paper is the decision to keep one storage format. Observations, reflections, and plans all land in the same stream in the same shape, so a single retrieval function handles all three and any later prompt can quote memories like footnotes. It is a concrete demonstration of how far you can go by treating agent memory as a text log rather than a structured state machine.

The second is the retrieval score. Combining recency, importance, and relevance is mathematically trivial, but the operational choice is that importance is rated by the LLM once at write time and cached. That one commitment — rate on write, not on read — holds the cost structure together. A sizable design space opens up simply by changing it to “rate on write, revise at reflection time,” and several follow-up systems effectively do exactly that.

The third is recursive planning. Decomposing a day into hour-blocks and then minute-blocks is more than “make the model write longer.” Because every level is stored in the stream, reactions can regenerate only the tail of the plan from the current time forward rather than replanning the whole day. That small rule is a good candidate for anyone building long-running agent simulations. Read against the limitations, the structural choices look durable: token cost and instruction-tuning politeness will ease with newer models, but the module boundaries between memory, retrieval, reflection, and planning have aged well and still show up, mostly intact, in systems shipping years later.

References

Original paper: Generative Agents: Interactive Simulacra of Human Behavior
Code: joonspk-research/generative_agents