Human-AI Interaction · AI Safety · Research & Evaluation · AI Experience Design
I design and prototype AI experiences in Figma, then validate them with usability testing.
I focus on trust calibration, uncertainty communication, and verification cues for LLM and multimodal systems.
ColdFit: SUS 90 (guided), n = 8 · LLM Literacy Cards: large belief change (d = 0.89) and medium–large shifts in scenario-based responses (d = 0.71), n = 49
Featured Work
AI Decision-Support Experience
A multimodal AI clothing assistant that helps international students make confident, climate-appropriate clothing decisions in UK winter. I designed the interaction flow, recommendation language, visual-output strategy, and study plan, then measured how different interface structures affected autonomy, effort, trust, and decision confidence.
View Case Study
Educational Micro-Intervention
Seven myth → reality → habit cards targeting common LLM misconceptions. Built a three-layer LLM-as-a-judge evaluation pipeline and ran a Prolific study (n=49), showing large belief change (d=0.89) and medium–large shifts in scenario responses (d=0.71).
View Case Study
Case Study
Designing interaction patterns for an AI decision-support experience — a multimodal system supporting climate-adaptive clothing decisions.
Study: 2×2 mixed-methods design, n=8 (guided vs. free chat × text-dominant vs. image-dominant output)
SUS: 90 (guided) · 80 (free)
Key takeaway: Guided interaction improved speed and usability; free chat increased autonomy but required more effort.
Design implication: Hybrid approach recommended — structured onboarding with the option to switch to free input.
Problem
International students often misinterpret UK temperature forecasts, leading to discomfort and uncertainty. Numerical weather data doesn't translate into embodied understanding — knowing it's 8°C doesn't tell you what to wear if you've never experienced 8°C.
I designed an AI decision-support experience to translate abstract weather data into confidence-supporting, embodied clothing decisions.
My Role
Interface Design
AI Interaction Design Patterns
Rather than hiding the AI's decision-making, I designed interaction patterns that make its reasoning legible to users.
The system narrows ambiguity before recommending. Instead of guessing context, it asks targeted questions about activity type, exposure duration, and personal cold sensitivity — reducing hallucination risk.
When inputs fall outside the system's scope (e.g., medical advice, extreme weather), it explicitly declines rather than generating unreliable output. The refusal is designed to feel helpful, not dismissive.
Rather than presenting a single "correct" outfit, the system offers two structured options: Outdoor-Optimised and Indoor-Optimised. This helps users compare warmth, flexibility, and indoor comfort, while keeping the final decision with the user.
Each recommendation includes brief reasoning about why certain layers, outerwear, or accessories are suggested.
Interaction Logic
The system maps UK temperatures into three experience bands, each linked to a clothing strategy. The guided path walks users through structured questions; the free path parses natural language and flags missing context.
Error handling is designed to be graceful: unrecognised inputs trigger clarification requests rather than generic error messages, and the system explicitly distinguishes between "I need more information" and "this is outside my scope."
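The band-and-strategy logic above can be sketched in a few lines. This is an illustrative assumption, not the system's actual implementation: the thresholds, band names, and strategies here are hypothetical stand-ins for the three experience bands described.

```python
def experience_band(temp_c: float) -> str:
    """Map a UK temperature (°C) to one of three experience bands.
    Thresholds are illustrative assumptions, not the study's exact cut-offs."""
    if temp_c >= 10:
        return "cool"
    if temp_c >= 4:
        return "cold"
    return "very cold"

# Each band links to a clothing strategy (wording here is hypothetical).
STRATEGY = {
    "cool": "light layers with a windproof outer",
    "cold": "base layer plus insulated mid-layer",
    "very cold": "full layering: thermal base, insulation, shell",
}

# e.g. an 8°C forecast resolves to the "cold" band and its strategy
band = experience_band(8.0)
```

The guided path would then ask its structured questions (activity, exposure duration, cold sensitivity) before refining the band's default strategy into a recommendation.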
Constraints & Trade-offs
Evaluation & Validation
| Metric | Guided | Free |
|---|---|---|
| SUS Score | 90 | 80 |
| Perceived Autonomy | 4.61 | 5.19 |
| Cognitive Effort | 4.94 | 5.44 |

All metrics from the same 8 participants (2×2 mixed design).
MPhil Dissertation · Human-Inspired AI · Cambridge
Seven myth → reality → habit cards that address common misconceptions about large language models. I designed the cards, built a three-layer prompt evaluation pipeline to select the most effective prompt per card, and ran a Prolific pre/post study (n=49) showing large belief change and medium–large behavioural intention change.
Results from a pre/post study conducted as part of an MPhil dissertation at the University of Cambridge (2026). Full analysis and write-up in progress.
At a Glance
Final Design
Each card follows the same structure: a myth users commonly hold, the reality behind it, a habit to form, a tested prompt to copy, and triggers for when to apply it. Colour-coded headers group the misconceptions visually. Micro-icons reinforce meaning without adding cognitive load.
Problem
Many users over-rely on LLM outputs or misread what they are seeing — treating fluent text as factual, assuming citations imply verified sources, or believing the system has live web access or persistent memory. Baseline behaviour data from the study confirmed this: 35% of participants had used a chatbot answer as-is without checking, 35% assumed chatbots remember prior conversations, and 22% shared personal details without modification.
Design Decisions
Three-stage hierarchy. Myth is emotionally engaging and quoted in the user's voice. Reality is a single concise sentence. Habit is actionable and phrased as a rule of thumb, not a lecture.
One misconception per card. Reduces cognitive load and lets each card function as a standalone reference. Mobile-first layout with clear visual separation between stages.
A testable "Try this prompt" box. Every card includes a copyable prompt, not a slogan. The prompt text on each card is the version that won a three-layer empirical evaluation against alternatives.
Micro-icons only. Consistent iconography (warning for myth, check for reality, target for habit, clipboard for prompt) supports rapid scanning without decorative clutter.
"When to use this" triggers. Each card ends with concrete situational cues — turning the card from information into a decision aid.
Technical Evaluation
For each misconception I drafted a baseline prompt, a structured prompt A, and a lighter-touch prompt B. Rather than guessing which would work best, I ran all 21 prompts through an LLM-as-a-judge pipeline — 105 generated responses, 285 pairwise judgements, and 315 absolute scores — to make principled selections.
Layer 1 — Generic quality screen. All 105 responses scored on task responsiveness, clarity, and appropriateness. 105 of 105 passed.
Layer 2a — Pairwise comparison. Head-to-head comparison on misconception-specific criteria with position randomisation. Guided prompts beat baseline on all 7 misconceptions.
Layer 2b — Absolute adequacy. Each response scored individually. A prompt qualified only if its mean reached 2 ("adequate") on every criterion — a rubric-grounded threshold that avoids arbitrary cutoffs.
Layer 3 — Usability proxy. Among qualified prompts, the one with higher naturalness and lower burden was selected. This ensures prompts are not just effective but adoptable.
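The pairwise layer (2a) depends on position randomisation to control for the judge's position bias. A minimal sketch of that mechanism, assuming the judge is any callable that sees two responses in order and names a winner (the actual pipeline's judge prompts and criteria are not shown here):

```python
import random

def pairwise_judge(judge, response_a: str, response_b: str, rng=random) -> str:
    """One pairwise comparison with position randomisation.

    `judge` is a callable taking (first_shown, second_shown) and returning
    "first" or "second" — a stand-in for an LLM-as-a-judge call. Randomising
    which response appears first prevents a systematic first-position bias
    from contaminating win rates. Returns "A" or "B"."""
    if rng.random() < 0.5:
        # A shown first, B second
        return "A" if judge(response_a, response_b) == "first" else "B"
    # B shown first, A second; map the verdict back to the true labels
    return "B" if judge(response_b, response_a) == "first" else "A"
```

Aggregating "A"/"B" verdicts over repeated comparisons yields the win rates reported in the selection table (e.g. 60% vs 40%); a prompt "beats baseline" when its win rate exceeds 50%.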
Prompt Selection
| Card | Misconception | Selected | Reason |
|---|---|---|---|
| M1 | Confidence ≠ Correctness | Structured A | Only A beat baseline (60% vs 40%) |
| M2 | Citations ≠ Proof | Light B | Both qualified; B more usable |
| M3 | Not Live by Default | Light B | Both qualified; B more usable |
| M4 | Don't Overshare | Light B | Both qualified; B more usable |
| M5 | Advice ≠ Safe to Act On | Light B | Both qualified; B more usable |
| M6 | Memory ≠ Recall | Light B | Both qualified; B more usable |
| M7 | Neutral ≠ Unbiased | Structured A | Only A beat baseline (93% vs 20%) |
Emergent Finding
For localised corrections — adding a verification step, flagging a privacy concern, disclosing memory limits — the lighter prompt was sufficient: the model could produce its usual response and simply append the required element.
For distributed corrections — recalibrating certainty language throughout the response (M1), or restructuring how contested viewpoints are framed (M7) — only the structured prompt worked. The change had to happen in every sentence, not as a bolt-on section.
Human Study Results
Participants completed misconception belief items and scenario judgement tasks before and after viewing all seven cards. The study measured whether a single exposure to the cards shifted beliefs and intended behaviours.
Behaviour was measured using scenario-based decisions rather than self-report, capturing how participants would act under realistic conditions.
Misconception Beliefs (pre vs post)
| Misconception | Pre | Post | Change | Effect |
|---|---|---|---|---|
| M1: Confidence | 2.10 | 2.82 | +0.71* | d=0.65 |
| M2: Citations | 2.06 | 2.86 | +0.80* | d=0.74 |
| M3: Web access | 1.76 | 2.45 | +0.69* | d=0.54 |
| M5: Memory | 1.82 | 2.65 | +0.84* | d=0.64 |
| M6: Neutrality | 2.22 | 2.76 | +0.53* | d=0.46 |
| M8: High-stakes | 2.52 | 2.67 | +0.15 | d=0.14 |
| M9: Privacy | 2.27 | 3.33 | +1.06* | d=0.88 |
Scale: 1=Agree to 5=Disagree (higher = more literacy-aligned). * p<0.005. Wilcoxon signed-rank test.
Scenario-Level Behavioural Intentions
| Scenario | Pre | Post | Change | Effect |
|---|---|---|---|---|
| M1: Confidence + time pressure | 2.39 | 2.96 | +0.57* | d=0.53 |
| M2: Citations behaviour | 2.61 | 3.06 | +0.45* | d=0.45 |
| M3: Live data / freshness | 2.88 | 3.10 | +0.22* | d=0.31 |
| M4: Privacy behaviour | 2.49 | 2.65 | +0.16 | d=0.16 |
| M5: High-stakes trust | 3.43 | 3.59 | +0.16 | d=0.19 |
| M6: Memory reliability | 2.88 | 3.24 | +0.37* | d=0.31 |
| M7: Neutrality / perspective | 2.45 | 2.96 | +0.51* | d=0.46 |
Scale: 1-4 (higher = more literacy-aligned). Response options were scored using a card-aligned behavioural rubric. * p<0.01. Wilcoxon signed-rank test.
A single exposure produced large shifts in user beliefs (d = 0.89) and medium–large improvements in decision behaviour (d = 0.71), particularly for confidence calibration, citation checking, and neutrality awareness.
Get in Touch
Available from June 2026. Open to internships and entry-level roles.