
Generative Agents — Twenty-five Lives in Smallville

A close reading of the 2023 UIST paper that fused memory streams, reflection, and recursive planning into an LLM loop, with the numbers and failure modes kept intact.

Sapiens Q 7 min read

Background

Human-like agents have been a shared target of games, simulations, and HCI for over four decades. Prior work has either hand-authored behavior with finite state machines and behavior trees or let reinforcement learning optimize a clear reward. LLMs made single-turn responses easy, but open-ended behavior that unfolds across days — where relationships accumulate, news travels, and schedules align — remained out of reach.

The paper narrows the problem cleanly. Compressing past experience into a fixed context window loses information. Summarizing the whole history flattens answers, so “what are you passionate about these days?” comes back generic. What is needed is dynamic retrieval that surfaces relevant memories per situation, together with a mechanism that promotes observations into stable self-understanding.

Fig. 2. Smallville and the interior of one house — the space agents inhabit. Source: Park et al. 2023, Fig. 2

Core Idea

The architecture has three parts. The memory stream is a long list of natural-language observations, each with a creation and last-access timestamp. Reflection groups recent observations into a few salient questions and synthesizes them, with citations, into more abstract statements. Planning starts at a daily sketch and recursively decomposes into hourly and minute-level chunks. Reflections and plans are written back into the same stream, so the whole system is one record-retrieve-synthesize loop.

Keeping everything in natural language is load-bearing. The next prompt can quote memories verbatim without a translation step.

Method

3.1 Memory Stream & Retrieval

The atomic unit is an observation. Lines like “Isabella Rodriguez is setting out the pastries” or “Maria Lopez is studying for a chemistry test” enter the stream. When a query arrives, retrieval scores each memory on three axes.

  • Recency: exponential decay over sandbox hours since last access (decay factor 0.995).
  • Importance: the LLM is asked at creation time to rate poignancy on a 1–10 scale, where 1 is purely mundane (brushing teeth) and 10 is extremely poignant (a breakup).
  • Relevance: cosine similarity between the query embedding and each memory’s embedding.

The three are min-max scaled to $[0, 1]$ and combined as $\text{score} = \alpha_r \cdot r + \alpha_i \cdot i + \alpha_l \cdot l$, with all $\alpha$ set to 1. The top-ranked memories that fit the context window are pasted into the prompt. Simple, but the choice to let the LLM emit a numeric importance once at write time is what later memory systems keep copying.
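The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Memory` class, the `retrieve` function, the `top_k` cutoff, and the toy two-dimensional embeddings are all assumptions made for the example, and a real system would take embeddings from an embedding model and cut off by context-window budget rather than a fixed count.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    importance: float          # 1-10, rated once by the LLM at write time
    hours_since_access: float  # sandbox hours since the memory was last retrieved
    embedding: list[float]     # toy vectors here; hypothetical stand-in for real embeddings


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def retrieve(memories, query_embedding, top_k=3, decay=0.995, alphas=(1.0, 1.0, 1.0)):
    """Rank memories by alpha_r * recency + alpha_i * importance + alpha_l * relevance."""
    if not memories:
        return []
    recency = min_max([decay ** m.hours_since_access for m in memories])
    importance = min_max([m.importance for m in memories])
    relevance = min_max([cosine(query_embedding, m.embedding) for m in memories])
    a_r, a_i, a_l = alphas
    scored = [
        (a_r * r + a_i * i + a_l * l, m)
        for r, i, l, m in zip(recency, importance, relevance, memories)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]


stream = [
    Memory("brushing teeth", 1, 1.0, [1.0, 0.0]),
    Memory("a breakup", 9, 50.0, [0.0, 1.0]),
    Memory("studying for a chemistry test", 5, 2.0, [0.7, 0.7]),
]
ranked = retrieve(stream, query_embedding=[0.0, 1.0])
```

In this toy stream the chemistry memory outranks the breakup even though the breakup is both more important and more relevant to the query, because the breakup was last touched fifty sandbox hours ago. With equal weights, no single axis dominates.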

3.2 Reflection

With raw observations alone, an agent asked whom it would like to spend time with simply picks “the person I bumped into most often.” The paper triggers reflection when the summed importance of the agent’s most recently perceived events crosses 150, which in practice happens two or three times per game day.

It runs in two stages. First, the 100 most recent memories are handed to the LLM with the question “given only these statements, what are the 3 most salient high-level questions?” Those questions become retrieval queries, and the retrieved memories feed a second prompt asking for five high-level insights with citation indices. The output reads like “Klaus Mueller is dedicated to his research on gentrification (because of 1, 2, 8, 15)” and is stored with pointers back to the supporting memories. Reflections can stand on other reflections, so memory becomes a tree: observations at the leaves, increasingly abstract summaries higher up.
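A rough sketch of the trigger and the two-stage pass follows. The names `should_reflect`, `parse_insight`, and `reflect` are hypothetical, the prompt wording is paraphrased, and `ask_llm` and `retrieve` are placeholder callables the caller supplies (a model call returning a list of lines, and a memory lookup returning a list of strings). Nothing here is the paper's actual code.

```python
import re


def should_reflect(recent_importances, threshold=150):
    """True once the summed importance of recent memories exceeds the threshold."""
    return sum(recent_importances) > threshold


def parse_insight(line):
    """Split 'insight (because of 1, 2, 8)' into the insight text and cited indices."""
    match = re.search(r"\(because of ([\d,\s]+)\)\s*$", line)
    if not match:
        return line.strip(), []
    cited = [int(n) for n in re.findall(r"\d+", match.group(1))]
    return line[:match.start()].strip(), cited


def reflect(recent, ask_llm, retrieve, n_questions=3, n_insights=5):
    """Stage 1: derive salient questions. Stage 2: synthesize cited insights per question."""
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(recent, 1))
    questions = ask_llm(
        f"{numbered}\nGiven only the statements above, what are the "
        f"{n_questions} most salient high-level questions?"
    )[:n_questions]

    reflections = []
    for question in questions:
        evidence = retrieve(question)  # each question doubles as a retrieval query
        listing = "\n".join(f"{i}. {text}" for i, text in enumerate(evidence, 1))
        lines = ask_llm(
            f"{listing}\nWhat {n_insights} high-level insights can you infer? "
            "Format: insight (because of 1, 5, 3)"
        )[:n_insights]
        for line in lines:
            text, cited = parse_insight(line)
            # Keep pointers to the supporting memories, ignoring out-of-range indices.
            sources = [evidence[i - 1] for i in cited if 1 <= i <= len(evidence)]
            reflections.append({"text": text, "sources": sources})
    return reflections
```

Because each reflection carries its `sources`, writing it back into the stream and later citing it from another reflection yields the tree structure described above.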

3.3 Planning & Reaction

Asking only “what next?” at each step makes Klaus eat lunch at 12, again at 12:30, and again at 1: each choice is plausible in the moment, but the day as a whole is incoherent. Planning therefore goes top-down. A daily sketch of five to eight bullets is generated from the agent’s summary and yesterday’s log, then each block is decomposed into hour-long segments, then into five-to-fifteen-minute actions.
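The top-down decomposition can be illustrated with a small recursive structure. This is a sketch under stated assumptions: `PlanNode`, `expand`, `leaves`, and `even_split` are names invented for the example, and `even_split` is a mechanical stand-in for what would be an LLM call that proposes sub-activities for a block.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    description: str
    start_min: int      # minutes after midnight
    duration_min: int
    children: list["PlanNode"] = field(default_factory=list)


def expand(node, split, levels=(60, 15), depth=0):
    """Recursively break a block into chunks no longer than levels[depth] minutes."""
    if depth >= len(levels):
        return node
    limit = levels[depth]
    if node.duration_min <= limit:
        # Already short enough for this level; try the next, finer one.
        return expand(node, split, levels, depth + 1)
    start = node.start_min
    for description, duration in split(node, limit):
        child = PlanNode(description, start, duration)
        node.children.append(expand(child, split, levels, depth + 1))
        start += duration  # children tile the parent, so no time slot is planned twice
    return node


def leaves(node):
    """Return the finest-grained actions in chronological order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def even_split(node, limit):
    """Hypothetical stand-in for the LLM: cut the block into pieces of at most `limit` minutes."""
    pieces, remaining, part = [], node.duration_min, 1
    while remaining > 0:
        size = min(limit, remaining)
        pieces.append((f"{node.description} (part {part})", size))
        remaining -= size
        part += 1
    return pieces


block = expand(PlanNode("work on research paper", 13 * 60, 120), even_split)
actions = leaves(block)
```

Since every child occupies a distinct slice of its parent's time, an agent following the leaves cannot schedule lunch three times in the same afternoon. The coherence comes from the partition, not from the model's judgment at each step.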

At each time step, the action loop perceives the surroundings, writes observations into the stream, and asks whether to stay on plan or react. When a reaction is needed, two auxiliary queries build a context summary — the observer’s relationship with the target, and the target’s current status — and the plan is regenerated from that point forward. Dialogue is produced by each speaker retrieving memories about the other; the listener treats the utterance as an event and runs the same loop.

The environment is a tree of areas and objects rendered into natural language (“there is a stove in the kitchen”). Each agent only maintains the subgraph it has actually seen, and refreshes it on re-entry. Choosing where to act is itself recursive: the model is asked, starting at the root, which sub-area best fits the activity until it reaches a leaf.
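A sketch of the environment tree and the root-to-leaf descent follows. The nested-dict layout, the `describe` and `pick_location` helpers, and the scripted `choose` callback are assumptions for illustration; in the real system the choice at each level would come from prompting the model with the rendered options.

```python
def describe(name, children):
    """Render a subtree as natural-language lines, one per contained area or object."""
    lines = []
    for child, grandchildren in children.items():
        lines.append(f"there is a {child} in the {name}")
        lines.extend(describe(child, grandchildren))
    return lines


def pick_location(tree, activity, choose):
    """Walk from the root to a leaf, asking `choose` which child best fits the activity."""
    path, node = [], tree
    while node:  # an empty dict marks a leaf
        options = sorted(node)
        choice = choose(activity, path, options)
        if choice not in node:
            raise ValueError(f"{choice!r} is not one of {options}")
        path.append(choice)
        node = node[choice]
    return path


# Illustrative world; the layout is made up for this example.
world = {
    "Smallville": {
        "Lin family house": {
            "kitchen": {"stove": {}, "fridge": {}},
            "bedroom": {"bed": {}},
        },
        "Hobbs Cafe": {"counter": {}},
    }
}

# Scripted stand-in for the LLM: answers are looked up by depth in the tree.
script = {"cook breakfast": ["Smallville", "Lin family house", "kitchen", "stove"]}


def scripted_choice(activity, path, options):
    return script[activity][len(path)]


route = pick_location(world, "cook breakfast", scripted_choice)
kitchen_lines = describe("kitchen", world["Smallville"]["Lin family house"]["kitchen"])
```

Each agent would hold its own copy of `world` containing only the branches it has visited, so `pick_location` can only ever propose places the agent knows about.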

Experiments

The evaluation has two halves.

Controlled evaluation asks each agent five categories of questions — self-knowledge, memory, plans, reactions, reflections — and has 100 participants rank the believability of responses from four architectural variants plus a crowdworker baseline. Translating ranks to TrueSkill gives:

Condition                                  TrueSkill μ   σ
Full architecture                          29.89         0.72
No reflection                              26.88         0.69
No reflection, no planning                 25.64         0.68
Human crowdworker                          22.95         0.69
No observation, planning, or reflection    21.21         0.70

Cohen’s $d$ between the full condition and the fully ablated baseline, which stands in for the prior state of the art, is 8.16. Kruskal-Wallis reports $H(4) = 150.29$, $p < 0.001$, and Dunn post-hoc tests find every pairwise contrast significant except crowdworker versus fully ablated. The crowdworker ranking second-lowest is telling. Thirty minutes of roleplay cannot match an agent carrying two game days of specific memory.

End-to-end evaluation runs all twenty-five agents for two continuous game days. Three emergent signals appear. Awareness of Sam’s mayoral candidacy spreads from 1 agent (4%) to 8 (32%); knowledge of Isabella’s party from 1 (4%) to 13 (52%); the relationship graph thickens, with network density rising from 0.167 to 0.74. Across 453 responses about other agents, only 1.3% (n=6) were hallucinated. Twelve were invited to the party and five arrived on time; of the seven no-shows, three cited conflicts and four said they were interested but never actually planned to go.
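To make the density figures concrete, here is a short calculation assuming the standard definition for an undirected graph, density = 2|E| / (|V|(|V| − 1)). The function names are invented for the example, and the edge counts are back-calculated from the reported densities rather than taken from the paper.

```python
def network_density(n_vertices, n_edges):
    """Fraction of possible undirected pairs that are actually connected."""
    return 2 * n_edges / (n_vertices * (n_vertices - 1))


def edges_for_density(n_vertices, density):
    """Invert the formula: how many edges a given density implies."""
    return density * n_vertices * (n_vertices - 1) / 2


agents = 25
possible_pairs = agents * (agents - 1) // 2      # 300 possible relationships
start_edges = edges_for_density(agents, 0.167)   # roughly 50 pairs know each other
end_edges = edges_for_density(agents, 0.74)      # roughly 222 pairs know each other
```

Under that assumption, the town goes from about one in six possible pairs knowing each other to about three in four over the two simulated days.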

Failure modes fall into three buckets. Retrieval misses, as when Tom told an interviewer he wasn’t sure the party existed while also saying he needed to discuss the election at it. Embellishment, as when Isabella invented a nonexistent announcement, or Yuriko confused her neighbor Adam Smith with the 18th-century author of Wealth of Nations. And instruction-tuning residue, which pushes dialogue toward stilted politeness and makes agents over-agreeable — Isabella absorbed party suggestions from others until she reported liking English literature.

Limitations

The authors are candid. Simulating twenty-five agents for two days took multiple days of wall time and thousands of dollars in tokens. Evaluation is bounded by a short timescale and an average-quality human baseline. As memories grow, location selection drifts: agents start treating a bar as a lunch venue because they have heard of it. Physical norms that are hard to phrase in text, like a one-person dorm bathroom or a 5 pm store closing time, fail to transfer. Robustness to prompt injection and to “memory hacking” (convincing an agent through dialogue that a fabricated past actually occurred) is untested.

References