CORE: Full-Path Evaluation of LLM Agents Beyond Final State
A framework built on finite automata, with five metrics that score an agent's entire execution path, not just whether the final answer happens to be correct.
arxiv.org/abs/2509.20998
Forward Deployed Engineer at Wonderful, AI researcher the rest of the time. I ship agents into enterprises and write papers on telling whether they actually work or just look busy.
Also, I really like the sea.
Benchmarks nine agent methodologies on a real soybean farm and finds that agents trail expert human yields by 34% without expert context; long horizon performance hinges on it.
openreview.net
A compact ML runtime with integer matmul and neural network kernels targeting the Pi 5's VideoCore VII GPU for efficient edge inference.
yiannisha.github.io/qpu-xla
All agents are wrong, but some are useful.
after George Box
Design and deploy production AI agent systems for enterprise clients: voice agents, back office automation, and workflow orchestration, wiring LLM agents into CRM, ERP, telephony, and external APIs on AWS. I run engagements end to end, from scoping with executives to production rollout, including work on deployments worth millions.
Built the enterprise Data Quality Framework for a leading Swiss reinsurer: PySpark pipelines, a scalable Ontology architecture, a custom PySpark library for advanced data quality checks, and automated monitoring via TypeScript functions. Also shipped scheduling tool UI features using Vertex Graphs and Workshop.
Cofounded an AI hospitality startup. Designed and productionised a real time KNN recommendation engine that served personalised drink suggestions from taste preferences collected via QR menus, plus the full backend and cloud architecture. Launched across multiple venues.
Evaluated and optimised LLMs across code generation, reasoning, function calling, and instruction following through RLHF workflows, surfacing failure modes and hallucinations, and building high quality evaluation datasets and edge case tests.
I like the gap between research and production. It's usually where the interesting problems are.
A lot of my research is just being suspicious of agents: measuring the whole path one takes, instead of whether it fluked the right answer at the end. That suspicion bleeds into what I build. I'd rather know how something fails than pretend it won't.
Day to day, that's turning vague requirements into systems that hold up, and switching between talking to an executive and a compiler without too much whiplash. I've done the founding thing, the enterprise data thing, and the research thing, and I'm not in a hurry to pick one.
When I have something worth saying, it ends up on Medium: agents, physics, math and my occasional poetry.
If the work is interesting, I'm around.
Applied AI, agent evaluation, forward deployed engineering, or something I haven't thought of yet.