I ship privacy-first software and measurable AI systems.

LLMs are powerful. They also fail in predictable ways.

I build LLM systems with measurable quality, not vibes.

What I offer

LLM evaluation systems (start here)

  • Build a test set that reflects real user inputs.
  • Define metrics (task success, faithfulness, safety, latency, cost).
  • Add regression tests so quality doesn't drift.

Deliverables: Test set + harness + metrics + regression gates + report + recommended next steps.
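The bullets above can be sketched in a few lines. This is a minimal, illustrative harness, not a real deliverable: `run_model` is a hypothetical stand-in for the system under test, and the test set and threshold are toy assumptions.

```python
# Minimal sketch of a regression-gated evaluation harness.
# `run_model` is a hypothetical placeholder for the LLM system under test.

def run_model(prompt: str) -> str:
    # Placeholder: in practice this calls the deployed LLM pipeline.
    return "42" if "answer" in prompt else ""

# Test set drawn from real user inputs: (input, expected output).
TEST_SET = [
    ("What is the answer?", "42"),
    ("Unrelated request", ""),
]

def task_success_rate(cases) -> float:
    # Fraction of cases where the model output matches expectations.
    passed = sum(run_model(q) == expected for q, expected in cases)
    return passed / len(cases)

def regression_gate(score: float, baseline: float = 0.95) -> bool:
    # Fail the build if quality drifts below the recorded baseline.
    return score >= baseline
```

In CI, `regression_gate` runs on every change, so quality regressions block the merge instead of reaching users.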

Fine-tuning and domain adaptation

  • SFT with LoRA/QLoRA.
  • Data synthesis constrained to source texts, with dedup and quality filtering.
  • Ablations to prove what actually improves performance.
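As one concrete piece of the data-synthesis step, here is a sketch of exact-match dedup plus length-based quality filtering for synthesized SFT examples. The record shape and the length thresholds are illustrative assumptions; production filters are usually stricter (near-dup detection, toxicity, source-grounding checks).

```python
# Sketch: dedup + quality filtering for synthesized SFT training examples.
# The `examples` record shape and length thresholds are assumptions.

def dedup_and_filter(examples, min_len=10, max_len=2000):
    seen = set()
    kept = []
    for ex in examples:
        # Normalize whitespace and case so trivial variants count as dups.
        key = " ".join(ex["response"].lower().split())
        if key in seen:
            continue  # drop exact-duplicate responses
        if not (min_len <= len(ex["response"]) <= max_len):
            continue  # drop responses too short or long to train on
        seen.add(key)
        kept.append(ex)
    return kept
```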

RAG (when the problem is knowledge, not reasoning)

  • Retrieval design, chunking strategy, and evaluation of groundedness.
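To make the chunking and groundedness points concrete, here is a deliberately crude sketch: fixed-size character chunks with overlap, and token-overlap as a groundedness proxy. Real systems use embedding retrieval and stronger faithfulness metrics; the sizes below are arbitrary.

```python
# Sketch: fixed-size chunking with overlap, plus a crude groundedness
# proxy (fraction of answer tokens present in the retrieved chunk).

def chunk(text: str, size: int = 200, overlap: int = 50):
    # Slide a window of `size` chars forward by `size - overlap` each step.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def groundedness(answer: str, chunk_text: str) -> float:
    # Share of answer tokens that appear in the source chunk.
    a = set(answer.lower().split())
    c = set(chunk_text.lower().split())
    return len(a & c) / len(a) if a else 0.0
```

A groundedness score well below 1.0 flags answers that assert things the retrieved context never said.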

"How not to use LLMs" reviews

  • Identify failure modes, privacy risks, and where classic ML or rules beat an LLM.

My default approach

1. Baseline: establish a non-LLM or prompt-only baseline.
2. Evaluate: build the evaluation harness before changing anything.
3. Improve: only then consider RAG or fine-tuning.
4. Ship: deploy with regression gates and monitoring.
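The "ship" step's monitoring half can be as simple as a rolling success-rate alert. This is a sketch with illustrative window and threshold values, not a production monitor.

```python
# Sketch of post-deploy monitoring: rolling success rate with an alert
# threshold. Window size and threshold are illustrative assumptions.

from collections import deque

class RollingMonitor:
    def __init__(self, window: int = 100, alert_below: float = 0.9):
        self.results = deque(maxlen=window)  # keeps only the last `window` outcomes
        self.alert_below = alert_below

    def record(self, success: bool) -> None:
        self.results.append(success)

    def should_alert(self) -> bool:
        # Alert when the rolling success rate drops below the threshold.
        if not self.results:
            return False
        return sum(self.results) / len(self.results) < self.alert_below
```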

Where LLMs are a bad fit

  • Simple rule-based logic or structured data extraction where regex/parsers are faster and more reliable.
  • High-frequency, low-latency operations where cost and speed matter more than reasoning.
  • Deterministic workflows where exact reproducibility and auditability are required.
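One example of the first and third bullets: extracting ISO dates from text. A regex is deterministic, auditable, microsecond-fast, and cannot hallucinate, so an LLM adds only cost and risk here. The pattern below is a simple illustration, not a full date validator.

```python
# Example task where a regex beats an LLM: extracting ISO-format dates.
# Deterministic and auditable; no model, no latency, no hallucination risk.

import re

DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def extract_dates(text: str) -> list[str]:
    return [m.group(0) for m in DATE_RE.finditer(text)]
```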

Tooling I'm comfortable with

  • Parameter-efficient fine-tuning, mixed precision, multi-node GPU training, deployable inference artifacts.
  • Azure ML + Databricks + distributed data pipelines.
  • Evaluation harnesses, CI/CD, and A/B testing.