Human-AI Interaction · AI Safety · Research & Evaluation · AI Experience Design
I design and prototype AI experiences in Figma, then validate them with usability testing.
I focus on trust calibration, uncertainty communication, and verification cues for LLM and multimodal systems.
ColdFit: SUS 90 (guided), n = 8 · LLM Literacy Cards: large belief change (d = 0.89) and medium–large shifts in scenario-based responses (d = 0.71), n = 49
Featured Work
AI Decision-Support Experience
A multimodal AI clothing assistant that helps international students make confident, climate-appropriate clothing decisions in UK winter. I designed the interaction flow, recommendation language, visual-output strategy, and study plan, then measured how different interface structures affected autonomy, effort, trust, and decision confidence.
View Case Study
Educational Micro-Intervention
Seven myth → reality → habit cards targeting common LLM misconceptions. Built a three-layer LLM-as-a-judge evaluation pipeline and ran a Prolific study (n=49), showing large belief change (d=0.89) and medium–large shifts in scenario responses (d=0.71).
View Case Study
Case Study
Designing interaction patterns for an AI decision-support experience — a multimodal system supporting climate-adaptive clothing decisions.
Study: 2×2 mixed-methods design, n=8 (guided vs. free chat × text-dominant vs. image-dominant output)
SUS: 90 (guided) · 80 (free)
Key takeaway: Guided interaction improved speed and usability; free chat increased autonomy but required more effort.
Design implication: Hybrid approach recommended — structured onboarding with the option to switch to free input.
Problem
International students often misinterpret UK temperature forecasts, leading to discomfort and uncertainty. Numerical weather data doesn't translate into embodied understanding — knowing it's 8°C doesn't tell you what to wear if you've never experienced 8°C.
I designed an AI decision-support experience to translate abstract weather data into confidence-supporting, embodied clothing decisions.
My Role
Interface Design
AI Interaction Design Patterns
Rather than hiding the AI's decision-making, I designed interaction patterns that make its reasoning legible to users.
The system narrows ambiguity before recommending. Instead of guessing context, it asks targeted questions about activity type, exposure duration, and personal cold sensitivity — reducing hallucination risk.
When inputs fall outside the system's scope (e.g., medical advice, extreme weather), it explicitly declines rather than generating unreliable output. The refusal is designed to feel helpful, not dismissive.
Rather than presenting a single "correct" outfit, the system offers two structured options: Outdoor-Optimised and Indoor-Optimised. This helps users compare warmth, flexibility, and indoor comfort, while keeping the final decision with the user.
Each recommendation includes brief reasoning about why certain layers, outerwear, or accessories are suggested.
Interaction Logic
The system maps UK temperatures into three experience bands, each linked to a clothing strategy. The guided path walks users through structured questions; the free path parses natural language and flags missing context.
Error handling is designed to be graceful: unrecognised inputs trigger clarification requests rather than generic error messages, and the system explicitly distinguishes between "I need more information" and "this is outside my scope."
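The band-and-strategy logic above can be sketched in a few lines. This is an illustrative assumption, not the system's actual implementation: the thresholds, band names, and strategies here are hypothetical stand-ins for the three experience bands described.

```python
def experience_band(temp_c: float) -> str:
    """Map a UK temperature (°C) to one of three experience bands.
    Thresholds are illustrative assumptions, not the study's exact cut-offs."""
    if temp_c >= 10:
        return "cool"
    if temp_c >= 4:
        return "cold"
    return "very cold"

# Each band links to a clothing strategy (wording here is hypothetical).
STRATEGY = {
    "cool": "light layers with a windproof outer",
    "cold": "base layer plus insulated mid-layer",
    "very cold": "full layering: thermal base, insulation, shell",
}

# e.g. an 8°C forecast resolves to the "cold" band and its strategy
band = experience_band(8.0)
```

The guided path would then ask its structured questions (activity, exposure duration, cold sensitivity) before refining the band's default strategy into a recommendation.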
Constraints & Trade-offs
Evaluation & Validation
| Metric | Guided | Free |
|---|---|---|
| SUS Score | 90 | 80 |
| Perceived Autonomy | 4.61 | 5.19 |
| Cognitive Effort | 4.94 | 5.44 |

All metrics from the same 8 participants (2×2 mixed design).
MPhil Dissertation · Human-Inspired AI · Cambridge
Seven myth → reality → habit cards that address common misconceptions about large language models. I designed the cards, built a three-layer prompt evaluation pipeline to select the most effective prompt per card, and ran a Prolific pre/post study (n=49) showing large belief change and medium–large behavioural intention change.
Results from a pre/post study conducted as part of an MPhil dissertation at the University of Cambridge (2026). Full analysis and write-up in progress.
At a Glance
Final Design
Each card follows the same structure: a myth users commonly hold, the reality behind it, a habit to form, a tested prompt to copy, and triggers for when to apply it. Colour-coded headers group the misconceptions visually. Micro-icons reinforce meaning without adding cognitive load.
Problem
Many users over-rely on LLM outputs or misread what they are seeing — treating fluent text as factual, assuming citations imply verified sources, or believing the system has live web access or persistent memory. Baseline behaviour data from the study confirmed this: 35% of participants had used a chatbot answer as-is without checking, 35% assumed chatbots remember prior conversations, and 22% shared personal details without modification.
Design Decisions
Three-stage hierarchy. Myth is emotionally engaging and quoted in the user's voice. Reality is a single concise sentence. Habit is actionable and phrased as a rule of thumb, not a lecture.
One misconception per card. Reduces cognitive load and lets each card function as a standalone reference. Mobile-first layout with clear visual separation between stages.
A testable "Try this prompt" box. Every card includes a copyable prompt, not a slogan. The prompt text on each card is the version that won a three-layer empirical evaluation against alternatives.
Micro-icons only. Consistent iconography (warning for myth, check for reality, target for habit, clipboard for prompt) supports rapid scanning without decorative clutter.
"When to use this" triggers. Each card ends with concrete situational cues — turning the card from information into a decision aid.
Technical Evaluation
For each misconception I drafted a baseline prompt, a structured prompt A, and a lighter-touch prompt B. Rather than guessing which would work best, I ran all 21 prompts through an LLM-as-a-judge pipeline — 105 generated responses, 285 pairwise judgements, and 315 absolute scores — to make principled selections.
Layer 1 — Generic quality screen. All 105 responses scored on task responsiveness, clarity, and appropriateness. 105 of 105 passed.
Layer 2a — Pairwise comparison. Head-to-head comparison on misconception-specific criteria with position randomisation. Guided prompts beat baseline on all 7 misconceptions.
Layer 2b — Absolute adequacy. Each response scored individually. A prompt qualified only if its mean reached 2 ("adequate") on every criterion — a rubric-grounded threshold that avoids arbitrary cutoffs.
Layer 3 — Usability proxy. Among qualified prompts, the one with higher naturalness and lower burden was selected. This ensures prompts are not just effective but adoptable.
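The pairwise layer (2a) depends on position randomisation to control for the judge's position bias. A minimal sketch of that mechanism, assuming the judge is any callable that sees two responses in order and names a winner (the actual pipeline's judge prompts and criteria are not shown here):

```python
import random

def pairwise_judge(judge, response_a: str, response_b: str, rng=random) -> str:
    """One pairwise comparison with position randomisation.

    `judge` is a callable taking (first_shown, second_shown) and returning
    "first" or "second" — a stand-in for an LLM-as-a-judge call. Randomising
    which response appears first prevents a systematic first-position bias
    from contaminating win rates. Returns "A" or "B"."""
    if rng.random() < 0.5:
        # A shown first, B second
        return "A" if judge(response_a, response_b) == "first" else "B"
    # B shown first, A second; map the verdict back to the true labels
    return "B" if judge(response_b, response_a) == "first" else "A"
```

Aggregating "A"/"B" verdicts over repeated comparisons yields the win rates reported in the selection table (e.g. 60% vs 40%); a prompt "beats baseline" when its win rate exceeds 50%.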
Prompt Selection
| Card | Misconception | Selected | Reason |
|---|---|---|---|
| M1 | Confidence ≠ Correctness | Structured A | Only A beat baseline (60% vs 40%) |
| M2 | Citations ≠ Proof | Light B | Both qualified; B more usable |
| M3 | Not Live by Default | Light B | Both qualified; B more usable |
| M4 | Don't Overshare | Light B | Both qualified; B more usable |
| M5 | Advice ≠ Safe to Act On | Light B | Both qualified; B more usable |
| M6 | Memory ≠ Recall | Light B | Both qualified; B more usable |
| M7 | Neutral ≠ Unbiased | Structured A | Only A beat baseline (93% vs 20%) |
Emergent Finding
For localised corrections — adding a verification step, flagging a privacy concern, disclosing memory limits — the lighter prompt was sufficient: the model could produce its usual response and simply append the required element.
For distributed corrections — recalibrating certainty language throughout the response (M1), or restructuring how contested viewpoints are framed (M7) — only the structured prompt worked. The change had to happen in every sentence, not as a bolt-on section.
Human Study Results
Participants completed misconception belief items and scenario judgement tasks before and after viewing all seven cards. The study measured whether a single exposure to the cards shifted beliefs and intended behaviours.
Behaviour was measured using scenario-based decisions rather than self-report, capturing how participants would act under realistic conditions.
Misconception Beliefs (pre vs post)
| Misconception | Pre | Post | Change | Effect |
|---|---|---|---|---|
| M1: Confidence | 2.10 | 2.82 | +0.71* | d=0.65 |
| M2: Citations | 2.06 | 2.86 | +0.80* | d=0.74 |
| M3: Web access | 1.76 | 2.45 | +0.69* | d=0.54 |
| M5: Memory | 1.82 | 2.65 | +0.84* | d=0.64 |
| M6: Neutrality | 2.22 | 2.76 | +0.53* | d=0.46 |
| M8: High-stakes | 2.52 | 2.67 | +0.15 | d=0.14 |
| M9: Privacy | 2.27 | 3.33 | +1.06* | d=0.88 |
Scale: 1=Agree to 5=Disagree (higher = more literacy-aligned). * p<0.005. Wilcoxon signed-rank test.
Scenario-Level Behavioural Intentions
| Scenario | Pre | Post | Change | Effect |
|---|---|---|---|---|
| M1: Confidence + time pressure | 2.39 | 2.96 | +0.57* | d=0.53 |
| M2: Citations behaviour | 2.61 | 3.06 | +0.45* | d=0.45 |
| M3: Live data / freshness | 2.88 | 3.10 | +0.22* | d=0.31 |
| M4: Privacy behaviour | 2.49 | 2.65 | +0.16 | d=0.16 |
| M5: High-stakes trust | 3.43 | 3.59 | +0.16 | d=0.19 |
| M6: Memory reliability | 2.88 | 3.24 | +0.37* | d=0.31 |
| M7: Neutrality / perspective | 2.45 | 2.96 | +0.51* | d=0.46 |
Scale: 1-4 (higher = more literacy-aligned). Response options were scored using a card-aligned behavioural rubric. * p<0.01. Wilcoxon signed-rank test.
A single exposure produced large shifts in user beliefs (d = 0.89) and medium–large improvements in decision behaviour (d = 0.71), particularly for confidence calibration, citation checking, and neutrality awareness.
Get in Touch
Available from June 2026. Open to internships and entry-level roles.