Measuring Psych Depth

Abstract

Evaluations of creative stories generated by large language models (LLMs) often focus on objective properties of the text, such as its style, coherence, and toxicity. While these metrics are indispensable, they do not speak to a story's subjective, psychological impact from a reader's perspective. We introduce the Psychological Depth Scale (PDS), a novel framework rooted in literary theory that measures an LLM's ability to produce authentic and narratively complex stories that provoke emotion, empathy, and engagement. We empirically validate our framework by showing that humans can consistently evaluate stories based on PDS (0.72 Krippendorff's alpha). We also explore techniques for automating the PDS to easily scale future analyses. GPT-4o, combined with a novel Mixture-of-Personas (MoP) prompting strategy, achieves an average Spearman correlation of 0.51 with human judgment while Llama-3-70B scores as high as 0.68 for empathy. Finally, we compared the depth of stories authored by both humans and LLMs. Surprisingly, GPT-4 stories either surpassed or were statistically indistinguishable from highly-rated human-written stories sourced from Reddit. By shifting the focus from text to reader, the Psychological Depth Scale is a validated, automated, and systematic means of measuring the capacity of LLMs to connect with humans through the stories they tell.

Overview of our approach to developing and validating the Psychological Depth Scale. We merge related metrics from an extensive survey of literary theory and reader-response analysis, then generate deep stories using LLMs, and finally compare annotations from both human evaluators and automated systems across five key dimensions: authenticity, narrative complexity, empathy, engagement, and emotion provocation.

Call to Humanity

Can AI-generated stories ever reach the emotional nuance and depth that characterizes truly excellent human-written stories, formed from real human experience? Or are there fundamental barriers that no AI can cross? We want to explore this profound question, and to do this, we need your help!

Who We Are Looking For

We are seeking writers and playwrights with a proven track record of success in engaging readers or audiences through emotionally compelling work. Whether you have experience in flash fiction, short stories, novelettes, screenplays, poems, magazine articles, or comedy scripts, we invite you to participate in our research.

Why Participate?

Gain early access to our creative writing and evaluation models.
Be part of exciting research and receive credit in our publications and website.

How to Get Involved

Reach out to us via either of the following options:

Fill out the Google form: [Click Here]
Email us at psychdepth@gmail.com with links to your published or performed works.

Please note that we primarily seek to identify research collaborators through this call, not collect published works.

BibTeX


        @misc{psychdepth,
          title={Measuring Psychological Depth in Language Models}, 
          author={Fabrice Harel-Canada and Hanyu Zhou and Sreya Mupalla and Zeynep Yildiz and Miryung Kim and Amit Sahai and Nanyun Peng},
          year={2024},
          eprint={2406.12680},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          url={https://arxiv.org/abs/2406.12680}, 
        }

Measuring Psychological Depth
in Language Models

Abstract

With an average PDS score of 4.48, this GPT-4 written story was the highest rated in our study.

With an average PDS score of 4.20 (and more than 1300 upvotes on Reddit), this was the highest rated human-written story in our study.

With an average PDS score of 3.44, this GPT-4 story received high entropy PDS ratings, highlighting the potential utility of targeted depth feedback.

With an average PDS score of 2.96, this Llama2-7B story received the highest entropy ratings in our study.

Call to Humanity

Who We Are Looking For

Why Participate?

How to Get Involved

BibTeX