I build AI agents, and write papers about why they break.

Forward Deployed Engineer at Wonderful, AI researcher the rest of the time. I ship agents into enterprises, and write papers on telling whether they actually work or just look busy.

Also, I really like the sea.

Athens, Greece
Panos Michelakis paddling a SUP board on calm water off the coast
Field research.
Research

Selected publications

NeurIPS - LAW 2025

CORE: Full-Path Evaluation of LLM Agents Beyond Final State

P. Michelakis, Y. Hadjiyianni, D. Stamoulis

A framework built on finite automata, with five metrics that score an agent's entire execution path, not just whether the final answer happens to be correct.

arxiv.org/abs/2509.20998 ↗
ICML - AIWILD 2026

Full-Season Agent Evaluation in Soybean Farm Operations under Real-World Agricultural Process Dynamics

A. Qu, P. Michelakis, Y. Hadjiyianni, F. Li, J. Jiang, D. Stamoulis, J. Liu

Benchmarks nine agent methodologies on a real soybean farm and finds that, without expert context, agents trail expert human yields by 34%; long-horizon performance hinges on that context.

openreview.net ↗
MLSys - YPS 2026

QPU-first ML kernels for Raspberry Pi 5

Y. Hadjiyianni, P. Michelakis, D. Stamoulis

A compact ML runtime with integer matmul and neural network kernels targeting the Pi 5's VideoCore VII GPU for efficient edge inference.

yiannisha.github.io/qpu-xla ↗

All agents are wrong, but some are useful.

after George Box
Work

Experience

2026 to now
Forward Deployed Engineer
Wonderful · Athens (Hybrid)

Design and deploy production AI agent systems for enterprise clients: voice agents, back-office automation, and workflow orchestration, wiring LLM agents into CRM, ERP, telephony, and external APIs on AWS. I run engagements end to end, from scoping with executives to production rollout, including work on deployments worth millions.

2025 to 2026
Palantir Foundry Data Engineer
D ONE · Zurich (Remote)

Built the enterprise Data Quality Framework for a leading Swiss reinsurer: PySpark pipelines, a scalable Ontology architecture, a custom PySpark library for advanced data quality checks, and automated monitoring via TypeScript functions. Also shipped scheduling tool UI features using Vertex Graphs and Workshop.

2023 to 2025
Founding Machine Learning Engineer
Vino AI · Athens (Hybrid)

Cofounded an AI hospitality startup. Designed and productionised a real-time KNN recommendation engine that served personalised drink suggestions from taste preferences collected via QR menus, plus the full backend and cloud architecture. Launched across multiple venues.

2024 to 2025
LLM Quality Analyst
DataAnnotation · Remote

Evaluated and optimised LLMs across code generation, reasoning, function calling, and instruction following through RLHF workflows, surfacing failure modes and hallucinations, and building high-quality evaluation datasets and edge-case tests.

About

How I work

I like the gap between research and production. It's usually where the interesting problems are.

A lot of my research is just being suspicious of agents: measuring the whole path an agent takes, not just whether it fluked the right answer at the end. That suspicion bleeds into what I build. I'd rather know how something fails than pretend it won't.

Day to day, that means turning vague requirements into systems that hold up, and switching between talking to an executive and talking to a compiler without too much whiplash. I've done the founding thing, the enterprise data thing, and the research thing, and I'm not in a hurry to pick one.

When I have something worth saying, it ends up on Medium: agents, physics, math, and my occasional poetry.

Agents & ML: LLM agents, Skills, Function Calling, RAG, MCP, Evals, Context Engineering, PyTorch
Data & infra: PySpark, Palantir Foundry, AWS, Grafana, Prometheus, Dagster
Languages: Python, TypeScript, SQL
Contact

If the work is interesting, I'm around.

Applied AI, agent evaluation, forward deployed engineering, or something I haven't thought of yet.