Making AI More Trustworthy and Morally Aligned by Integrating Human Cognition

Abstract

People often mistrust the moral decisions of AI, in part because it uses opaque black-box processes that differ from human reasoning. We introduce a method—“cognitive” bottlenecks—for more trustworthy and transparent AI by aligning large language models (LLMs) with human moral cognition. Bottlenecks selectively focus AI categorization decisions on a small set of key features, and human moral judgments often similarly center on a small set of key psychological features, like perceived harm, agent intention, and victim vulnerability. We implement and test “cognitively aligned” bottleneck models across multiple LLMs and moral frameworks. Compared with standard end-to-end models, people rate bottleneck models as more transparent and trustworthy. Analyses show that narrowing LLMs’ “focus” to a few key features improves their ability to capture human moral judgments. Implementing cognitively aligned bottlenecks is simple, requiring no additional training or data. This work demonstrates the benefits of integrating psychological theory into AI and offers a scalable path to more morally aligned AI.
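
To make the two-stage "bottleneck" idea concrete, the sketch below shows one way such a pipeline could be wired up: an LLM first rates a scenario on a few psychological features, and the final moral judgment is conditioned only on those ratings. This is an illustration under stated assumptions, not the authors' implementation; the feature list, prompt wording, 0-100 scale, and the placeholder query_llm function are all hypothetical.

```python
from typing import Callable, Dict

# Psychological features named in the abstract; the exact set and wording
# used in the study may differ (assumption for this sketch).
FEATURES = ["perceived harm", "agent intention", "victim vulnerability"]


def query_llm(prompt: str) -> str:
    """Stand-in for any LLM call (chat API, local model, etc.)."""
    raise NotImplementedError("Wire this to an LLM provider of your choice.")


def rate_feature(scenario: str, feature: str, llm: Callable[[str], str]) -> float:
    """Stage 1: score a single feature on a 0-100 scale."""
    prompt = (
        f"Scenario: {scenario}\n"
        f"On a scale from 0 to 100, how much {feature} is present? "
        "Respond with a single number."
    )
    # A real pipeline would validate/parse the response more carefully.
    return float(llm(prompt).strip())


def bottleneck_judgment(scenario: str, llm: Callable[[str], str] = query_llm) -> Dict:
    """Two-stage bottleneck: the judgment step sees only the feature scores,
    never the raw scenario text, which is what keeps the decision inspectable."""
    scores = {f: rate_feature(scenario, f, llm) for f in FEATURES}
    judgment_prompt = (
        "Feature ratings (0-100): "
        + ", ".join(f"{name}={score:.0f}" for name, score in scores.items())
        + "\nBased only on these ratings, how morally wrong is the act, "
        "from 0 (not at all) to 100 (extremely)? Respond with a single number."
    )
    wrongness = float(llm(judgment_prompt).strip())
    return {"feature_scores": scores, "wrongness": wrongness}
```

Because both stages are plain prompts, a setup like this needs no additional training or data, which matches the scalability point made in the abstract; everything else about the sketch (scales, prompts, aggregation) is illustrative.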
