Human-AI Interaction · AI Safety · Research & Evaluation · AI Experience Design

I evaluate and improve human interaction with AI systems

I study how people understand, trust, and make decisions with AI, then design interventions that improve the interaction.

MPhil in Human-Inspired AI, University of Cambridge. Experience working on AI products and multimodal systems. I combine usability research, LLM evaluation, and interaction design.

ColdFit: SUS 90 (guided), n = 8 · LLM Literacy Cards: large belief change (d = 0.89) and medium–large shifts in scenario-based responses (d = 0.71), n = 49

What I Do

Three things, end to end

Research the interaction

Study how users interpret AI outputs, calibrate trust, handle uncertainty, and decide whether to act. Mixed-methods: usability testing, pre/post studies, think-aloud, surveys.

Evaluate the AI

Build rubric-based evaluation pipelines for LLM outputs: pairwise comparison, absolute scoring, position randomisation, inter-model agreement (a short sketch of one such check appears after these three items). Not just "does it work" but "does it work safely."

Design the fix

Turn research findings into interaction patterns, prompt structures, and user-facing interventions. Cards, conversation flows, verification cues, uncertainty framing.
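As one illustration of the inter-model agreement check mentioned above: a minimal sketch, assuming two judge models have already produced verdicts on the same comparison pairs. The verdict lists and the choice of scikit-learn's kappa function are illustrative, not code from any of the projects below.

```python
# Minimal sketch of an inter-model agreement check between two LLM judges.
# The verdict lists are illustrative; in practice they would come from
# running the same pairwise comparisons through two different judge models.
from sklearn.metrics import cohen_kappa_score

judge_a = ["A", "B", "A", "A", "B", "A", "B", "A"]
judge_b = ["A", "B", "A", "B", "B", "A", "B", "A"]

kappa = cohen_kappa_score(judge_a, judge_b)
raw_agreement = sum(x == y for x, y in zip(judge_a, judge_b)) / len(judge_a)

print(f"Raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```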

Methods & Tools

What I work with

Usability testing · Pre/post experimental design · Think-aloud protocol · Semi-structured interviews · Survey design (Qualtrics) · LLM-as-a-judge evaluation · Rubric-based scoring · Pairwise comparison · Prolific recruitment · SUS / Likert scales · Wilcoxon signed-rank · Cohen's d effect sizes · Figma prototyping · Wizard of Oz · Python (evaluation scripts) · Value Sensitive Design

Featured Work

Research & Evaluation

Study Conducted

MPhil Dissertation · Cambridge

LLM Literacy Cards: Can a short intervention change how people understand AI?

d = 0.89 · Belief effect (large)
p < 0.001 · Significant pre–post improvements
d = 0.71 · Scenario effect (medium–large)
94% · Found the cards useful
n = 49 · UK adults via Prolific

Designed seven “myth → reality → habit” cards targeting common LLM misconceptions: confidence, citations, web access, privacy, memory, high-stakes trust, and neutrality. Built a three-layer LLM-as-a-judge evaluation pipeline to select the strongest prompt format per card from 105 generated responses and 285 pairwise judgements. Then ran a Prolific pre/post study measuring belief change and behavioural intention.

What I found: A single viewing of the cards produced significant belief shifts on 6 of 7 misconceptions, with privacy (d = 0.88) and citations (d = 0.74) showing the largest effects. Behavioural intentions also shifted significantly on 5 of 7 scenario tasks (composite d = 0.71), with the strongest changes on confidence under time pressure (d = 0.53) and neutrality awareness (d = 0.46). Five misconceptions showed significant effects on both belief and scenario measures: confidence, citations, web access, memory, and neutrality. The technical evaluation suggested that misconceptions requiring a broader change in response style benefited from more structured prompts, while those requiring a specific added behaviour worked with lighter cues.
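For context on how paired pre/post effects like these are typically computed, a minimal sketch of the Wilcoxon signed-rank test plus a paired Cohen's d; the CSV file and column names are illustrative placeholders, not the study's actual data or analysis scripts.

```python
# Minimal sketch of a paired pre/post analysis.
# Assumes one belief rating per participant before and after the intervention;
# file and column names are hypothetical.
import pandas as pd
from scipy.stats import wilcoxon

df = pd.read_csv("prepost_ratings.csv")   # hypothetical Qualtrics export
pre = df["pre_belief"].to_numpy()
post = df["post_belief"].to_numpy()

# Non-parametric test of the paired pre/post shift
stat, p = wilcoxon(pre, post)

# One common paired-samples Cohen's d: mean difference / SD of the differences
diff = post - pre
d = diff.mean() / diff.std(ddof=1)

print(f"Wilcoxon W = {stat:.1f}, p = {p:.4f}, Cohen's d = {d:.2f}")
```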

What I built: The prompt evaluation pipeline (generation → rubric scoring → pairwise LLM-as-judge comparisons), implemented as seven Python scripts for evaluation and analysis; analytic rubrics with anchored descriptors; the Qualtrics pre/post survey, with randomised card order and varied response-option ordering; Prolific recruitment with screen-out handling; and the card intervention itself.
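To make the pairwise-judging step concrete, a minimal sketch of a position-randomised LLM-as-judge comparison, assuming an OpenAI-style chat API; the model name, rubric text, and function are illustrative, not the dissertation's actual pipeline code.

```python
# Minimal sketch of one pairwise LLM-as-judge comparison with position
# randomisation; client setup, model name, and rubric are placeholders.
import random
from openai import OpenAI

client = OpenAI()
RUBRIC = "Judge which card text better corrects the misconception. Answer A or B."

def judge_pair(candidate_1: str, candidate_2: str) -> str:
    # Randomise which candidate appears as "A" to control for position bias
    if random.random() < 0.5:
        a, b, flipped = candidate_1, candidate_2, False
    else:
        a, b, flipped = candidate_2, candidate_1, True

    resp = client.chat.completions.create(
        model="gpt-4o",  # hypothetical judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"A:\n{a}\n\nB:\n{b}"},
        ],
    )
    verdict = resp.choices[0].message.content.strip().upper()[:1]

    # Map the verdict back to the original (unflipped) candidate order
    if verdict not in ("A", "B"):
        return "tie"
    won_first = (verdict == "A") != flipped
    return "candidate_1" if won_first else "candidate_2"
```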

Results from a pre/post study conducted as part of an MPhil dissertation at the University of Cambridge (2026).

View design process · View all 7 cards
Study Conducted

HCI Module · Cambridge

ColdFit: How much structure should an AI recommendation give you?

2×2 mixed-methods design, n = 8 · SUS 90 (guided) / 80 (free) · Guided improved speed and usability; free chat increased autonomy but required more effort

Designed and evaluated a multimodal AI clothing assistant for international students navigating UK cold weather. Compared guided vs. free-chat interaction and text-dominant vs. image-dominant output across a 2×2 between-subjects design.

What I found: Interaction structure, not AI capability, was the primary driver of trust, effort, and decision confidence. Guided chat scored SUS 90 but users felt less autonomous. Free chat scored SUS 80 but users valued the control. Users wanted support, not prescription.
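For reference, SUS scores like the 90 and 80 above follow the standard 10-item scoring rule: odd items contribute (response − 1), even items contribute (5 − response), and the sum is scaled by 2.5 to a 0–100 range. A minimal sketch; the example responses are illustrative, not participant data.

```python
# Minimal sketch of standard SUS scoring (10 items, 1-5 Likert responses).
def sus_score(responses: list[int]) -> float:
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        # Odd-numbered items are positively worded, even-numbered items negatively
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# Example: a fairly positive response set
print(sus_score([5, 1, 5, 2, 4, 1, 5, 2, 4, 1]))  # 90.0
```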

What I built: Figma prototype, Wizard-of-Oz web setup (Firebase + Netlify), structured interview guide, task-based evaluation protocol, CHI-format report.

View full case study · Try the prototype

Industry · Petabyte eSports

Multimodal AI feature integration

Worked on integrating GPT, Whisper, and DALL-E into a multimodal AI system for an eSports platform. Contributed to feature design, prompt engineering, and output quality assessment across text, audio, and image modalities.

Relevance: Hands-on experience with production AI systems, multimodal pipelines, and the gap between model capability and user experience.

What I'm Looking For

Roles where research shapes the product

I'm looking for roles (internship or entry-level) where I can combine research, evaluation, and design to improve human interactions with AI systems.

UX Research (AI products)

User studies, trust calibration research, usability evaluation for AI-powered features and tools

AI Experience Design

Designing interaction patterns, conversation flows, and output framing for LLM and multimodal systems

AI Evaluation & Safety

Rubric-based LLM evaluation, red-teaming, output quality assessment, appropriate reliance testing

Research Internship / PhD

Human-AI interaction, AI literacy, trust and transparency, user mental models of AI systems

Get in Touch

Available from June 2026. Open to internships and entry-level roles in AI-focused UX research, AI experience design, and AI evaluation and safety. Also interested in research internships and PhD opportunities.