Simulating 1,000 People — Generative Agents From a Single Interview

TL;DR

Why it matters. Demographic-prompted “proxy humans” reproduce stereotypes. This paper seeds each agent with a two-hour qualitative interview of a specific real person and predicts that individual’s attitudes and behaviors directly.
What it proposes. An agent architecture that injects the full interview transcript into the LLM prompt and augments it with “expert reflections” generated by four domain personas — psychologist, behavioral economist, political scientist, demographer.
Headline. On the GSS, agents reach a normalized accuracy of 0.85 against participants’ own two-week self-replication rate, outperforming demographic-based (0.71) and persona-based (0.70) baselines by 14-15 points.
Limits. The full transcript must be prompt-loaded (expensive context), the bank is fixed to 1,052 US adults, and extrapolation to high-stakes decisions is not validated.
What to keep. Long interviews outperform demographic summaries on both accuracy and subgroup fairness; evaluation is more informative when it reports normalized accuracy together with a DPD split rather than a single headline number.

Background

Prior work using LLMs as “proxy humans” took two routes. One prompts on demographic attributes (age, gender, race, ideology) and asks “how would such a person answer”; the other attaches a short persona paragraph. Both are convenient, but a growing literature finds that they tend to flatten minority groups toward stereotypes and to match only average treatment effects. The agent-based-modeling tradition, in contrast, hand-specifies behaviors to preserve interpretability, at the cost of generality outside narrow domains.

This paper closes a specific gap. Social scientists have long used in-depth interviews to capture idiosyncrasies that closed-form surveys miss. The authors feed those transcripts verbatim to an LLM to target individual-level prediction. Evaluation also shifts: not “did we recover a population-average effect” but “did we predict this specific person’s response as well as they predict themselves two weeks later.”

Core Idea

**Fig. 1.** Pipeline from a two-hour interview to an agent, then to response comparisons with the human participant. The interview becomes the agent's memory; human and simulated responses are then compared across four batteries. Source: Park et al. 2024, Fig. 1

The architecture has three pieces. First, an AI interviewer runs a semi-structured protocol from the American Voices Project, collecting on average 6,491 words of transcript per participant (roughly two hours of voice). Second, that transcript becomes the agent’s memory stream. Third, at query time, the entire transcript plus relevant “expert reflections” is injected into the LLM prompt.

Expert reflection is the architecturally distinctive move. Per participant, GPT-4o is prompted four times, each time adopting a different domain-expert persona (psychologist, behavioral economist, political scientist, demographer), producing 5-20 observations or inferences per expert. At query time the model first classifies which expert is most relevant to the question, then appends that expert’s reflections to the transcript before generating the final prediction. It is a scaffold for latent traits a single chain-of-thought pass would skip.
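The offline reflection step can be sketched as follows. This is a minimal illustration, not the authors' code: the prompt wording, the `generate_reflections` name, and the `llm` callable (any function that maps a prompt string to a completion string) are assumptions introduced here.

```python
from typing import Callable, Dict, List

# The four expert personas named in the text.
PERSONAS = ("psychologist", "behavioral economist",
            "political scientist", "demographer")


def generate_reflections(transcript: str,
                         llm: Callable[[str], str]) -> Dict[str, List[str]]:
    """Prompt the model once per persona and cache the parsed observations.

    `llm` is a hypothetical stand-in for a real model call.
    """
    cache: Dict[str, List[str]] = {}
    for persona in PERSONAS:
        # Prompt wording is an assumption of this sketch.
        prompt = (
            f"You are a {persona}. Read the interview below and list your "
            "observations or inferences about the interviewee, one per line."
            f"\n\n{transcript}"
        )
        reply = llm(prompt)
        # Drop bullet markers and blank lines so each entry is one observation.
        lines = [ln.strip().lstrip("-*• ").strip() for ln in reply.splitlines()]
        cache[persona] = [ln for ln in lines if ln]
    return cache
```

Because the reflections depend only on the transcript, they can be computed once per participant and reused for every later question.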

Method

3.1 Interview protocol

The sample is 1,052 US residents recruited by Bovitz, stratified on nine axes: age, census division, education, ethnicity, gender, income, neighborhood, political ideology, and sexual orientation. Interviews are voice-to-voice in English, conducted by an AI interviewer pipelining Whisper, GPT-4, and TTS. The protocol ranges from “tell me the story of your life” to views on race relations and policing. A deliberate design choice: the interview topics do not overlap with the evaluation instruments — the specific items of the GSS, BFI-44, and economic games are never asked. High downstream accuracy therefore reflects generalization rather than prompt leakage.

3.2 Agent architecture

At query time the flow is: (1) load the transcript into memory; (2) retrieve pre-generated expert reflections (cached offline, four personas); (3) have the LLM pick the most relevant expert for the question and append those reflections after the transcript; (4) prompt GPT-4o with chain-of-thought to produce the prediction. Multi-step decision tasks add short textual summaries of prior stimuli and responses to preserve continuity. Two baselines anchor the comparison: a demographic-based agent seeded with age, gender, race, and ideology extracted from the participant’s GSS answers, and a persona-based agent using a self-written paragraph by the participant.
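The four query-time steps might look roughly like this. It is a sketch under stated assumptions: the function names, the prompt text, the fallback to the first persona, and the `llm` callable are all hypothetical, and the final model call on the returned prompt is left to the caller.

```python
from typing import Callable, Dict, List

EXPERTS = ("psychologist", "behavioral economist",
           "political scientist", "demographer")


def pick_expert(question: str, llm: Callable[[str], str]) -> str:
    """Step 3a: ask the model which expert persona best fits the question."""
    prompt = (
        "Which expert is most relevant to the question below? "
        f"Answer with one of: {', '.join(EXPERTS)}.\n\nQuestion: {question}"
    )
    answer = llm(prompt).strip().lower()
    # Falling back to the first persona on an unknown reply is an
    # assumption of this sketch, not something the paper specifies.
    return answer if answer in EXPERTS else EXPERTS[0]


def build_prediction_prompt(transcript: str,
                            reflections: Dict[str, List[str]],
                            question: str,
                            options: List[str],
                            llm: Callable[[str], str]) -> str:
    """Steps 1-4: transcript first, then the chosen expert's cached
    reflections, then the question with a chain-of-thought instruction."""
    expert = pick_expert(question, llm)
    notes = "\n".join(f"- {r}" for r in reflections.get(expert, []))
    numbered = "\n".join(f"{i + 1}. {o}" for i, o in enumerate(options))
    return (
        f"Interview transcript:\n{transcript}\n\n"
        f"Reflections from a {expert}:\n{notes}\n\n"
        f"Question: {question}\nOptions:\n{numbered}\n\n"
        "Think step by step about how this person would answer, "
        "then give the number of the chosen option."
    )
```

Only one expert's reflections are appended per question, which keeps the prompt overhead small next to the transcript itself.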

3.3 Evaluation battery

Four instruments. (i) GSS core module, 177 items (mean 3.70 options per question, chance accuracy 27.03%). (ii) Big Five, the 44-item BFI. (iii) Five economic games — dictator, trust (both roles), public goods, prisoner’s dilemma — played for real monetary incentives. (iv) Five experimental replications drawn from a recent large-scale replication effort (Ames & Fiske 2015; Cooney et al. 2016; Halevy & Halali 2015; Rai et al. 2017; Schilke et al. 2015). Participants complete the battery twice, two weeks apart. The main dependent variable is normalized accuracy: the agent’s prediction accuracy divided by the participant’s own two-week self-replication rate. A value of 1.0 means the agent predicts the participant as well as the participant predicts themselves.
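The normalized-accuracy definition above reduces to a ratio of two match rates. A minimal sketch, assuming answers are stored as equal-length lists of option labels and that the agent is scored against the participant's first-wave answers (both are assumptions of this example, as are the function names and the toy data):

```python
from typing import Hashable, Sequence


def accuracy(pred: Sequence[Hashable], truth: Sequence[Hashable]) -> float:
    """Fraction of items on which two answer lists agree."""
    if len(pred) != len(truth) or len(truth) == 0:
        raise ValueError("answer lists must be non-empty and equal in length")
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def normalized_accuracy(agent: Sequence[Hashable],
                        wave1: Sequence[Hashable],
                        wave2: Sequence[Hashable]) -> float:
    """Agent accuracy divided by the participant's own self-replication rate."""
    self_replication = accuracy(wave2, wave1)
    if self_replication == 0:
        raise ValueError("self-replication rate is zero; ratio undefined")
    return accuracy(agent, wave1) / self_replication


# Hypothetical toy data: ten items with answers coded 1-4.
wave1 = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]
wave2 = [1, 2, 3, 4, 1, 2, 3, 4, 2, 1]   # participant agrees on 8 of 10
agent = [1, 2, 3, 4, 1, 2, 4, 3, 2, 1]   # agent agrees on 6 of 10
print(round(normalized_accuracy(agent, wave1, wave2), 2))
```

Dividing by the self-replication rate sets the ceiling at what the participant can reproduce, so an agent is not penalized for items on which the person's own answers are unstable.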

Experiments

Configuration	GSS normalized accuracy	Big Five normalized correlation	Economic-games normalized correlation
Interview-based	0.85	0.80	0.66
Persona-based	0.70	0.75	—
Demographic-based	0.71	0.55	—

**Fig. 2.** Predictive accuracy (GSS) and normalized correlation (Big Five, economic games) with 95% confidence intervals, comparing Participants, Interview-Based, Demographic-Based, Persona-Based, and Random Baseline. Interview-based agents sit closest to the Participants baseline, while Demographic- and Persona-based agents cluster lower. Source: Park et al. 2024, Fig. 2

Raw GSS accuracy for interview-based agents is 68.85%, against participants’ own two-week self-replication of 81.25%; the ratio, 68.85 / 81.25 ≈ 0.85, is the normalized figure in the table. Interview-based agents lead both baselines by 14-15 normalized points, a gap confirmed by ANOVA and Tukey post-hoc tests. Big Five shows the same ordering. Economic games show no significant MAE difference across the three conditions. On the replication studies, human participants replicate four of five, and interview-based agents replicate the same four. Effect-size correlation between agents and humans is r = 0.98, essentially indistinguishable from participants’ own internal consistency (0.99).

The bias analysis is where the architecture’s second claim lands. Demographic Parity Difference (DPD), the performance gap between the best- and worst-performing subgroup, falls on the ideology axis from 12.35% (demographic agent) to 7.85% (interview agent) on GSS, from 0.165 to 0.063 on Big Five, and from 0.50 to 0.19 on economic games. Gaps across racial subgroups shrink somewhat as well. Richer individual context reduces between-group accuracy gaps.

Ablations reinforce the story. Randomly dropping 80% of the transcript (96 of 120 minutes) still yields 0.79 on GSS, and replacing the transcript with a bullet-point summary keeps GSS at 0.83. The driver appears to be information richness rather than linguistic signal.
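The DPD metric from the bias analysis, as defined in this summary (best subgroup minus worst subgroup), can be computed as below. This is a minimal sketch: the function name, the per-participant score layout, and the toy values are hypothetical, and the paper's exact aggregation may differ.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def demographic_parity_difference(
        scores: Sequence[float],
        groups: Sequence[Hashable]) -> Tuple[float, Dict[Hashable, float]]:
    """Return (best subgroup mean - worst subgroup mean, per-group means).

    `scores` holds one performance value per participant (e.g. accuracy);
    `groups` holds that participant's subgroup label on a single axis.
    """
    if len(scores) != len(groups) or len(scores) == 0:
        raise ValueError("scores and groups must be non-empty and aligned")
    by_group = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)
    means = {g: sum(v) / len(v) for g, v in by_group.items()}
    return max(means.values()) - min(means.values()), means


# Hypothetical toy values, not figures from the paper.
scores = [0.90, 0.80, 0.70, 0.60, 0.75, 0.75]
groups = ["liberal", "liberal", "conservative", "conservative",
          "moderate", "moderate"]
gap, per_group = demographic_parity_difference(scores, groups)
print(round(gap, 3), per_group)
```

Reporting the per-group means alongside the gap shows which subgroup drives it, which a single DPD number hides.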

Limitations

The main constraints are access policy and domain scope. Citing participant privacy, the authors withhold raw interviews and gate the agent bank behind a two-pronged system: open access to aggregated responses on fixed tasks, restricted API access to individualized responses on open tasks for approved researchers. That slows external replication. The sample is US adults only, and the economic-games result is not significantly better than the baselines — the architecture’s lift is task-dependent.

The authors also flag structural limits. Prompt-loading the full transcript is expensive and susceptible to long-context degradation (lost-in-the-middle). Results are tied to a single backbone (GPT-4o at the time of evaluation) and only five experimental replications, constraining external validity. Extrapolation to high-stakes decisions (markets, strategic negotiation, incentive-compatible reporting) is not guaranteed by these measurements.

Sapiens Q Take

Three takeaways. First, the cost-performance curve for personalization is surprisingly flat: dropping 80% of the transcript lowers GSS normalized accuracy only from 0.85 to 0.79, and a bullet summary keeps it at 0.83. A production path that caches expert reflections offline once and then prompts against compact summaries is economically plausible; full-transcript prompting is not the only operating point. Second, the evaluation frame is more useful than the numbers. Normalized accuracy against each subject’s own self-consistency, combined with a DPD split, surfaces issues a single headline accuracy would hide, and the pair is worth adopting as a default for any individual-level agent benchmark. Third, the paper empirically challenges the implicit assumption that demographic or short-persona prompting is a cheap stand-in for individuality: on the GSS and Big Five, long-form interview text outperforms both baselines (the economic games show no significant difference), and subgroup accuracy gaps narrow. For tasks that genuinely require individual fidelity, short personas are better read as an underpowered baseline than as a default.

References

Original paper: Generative Agent Simulations of 1,000 People
Code: joonspk-research/generative_agent
Prior architecture: Park et al. (2023) Generative Agents: Interactive Simulacra of Human Behavior; Argyle et al. (2023) Out of One, Many
Interview protocol: American Voices Project, Stanford Center on Poverty and Inequality