I ship privacy-first software and measurable AI systems.
LLMs are powerful. They also fail in predictable ways.
I build LLM systems with measurable quality, not vibes.
What I offer
LLM evaluation systems (start here)
- Build a test set that reflects real user inputs.
- Define metrics (task success, faithfulness, safety, latency, cost).
- Add regression tests so quality doesn't drift.
Deliverables: Test set + harness + metrics + regression gates + report + recommended next steps.
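The three bullets above can be sketched as a minimal harness. Everything here is illustrative: the `TestCase` shape, the exact-match scorer, and the 0.80 gate threshold are placeholders, not a claim about any specific project.

```python
# Minimal evaluation-harness sketch: a test set, a metric, and a
# regression gate. All names and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str       # drawn from real user inputs
    reference: str    # expected answer

def exact_match(prediction: str, reference: str) -> float:
    """Toy task-success metric; real harnesses use richer scorers."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

def evaluate(model_fn, cases: list[TestCase]) -> dict:
    """Run the model over the test set and aggregate metrics."""
    scores = [exact_match(model_fn(c.prompt), c.reference) for c in cases]
    return {"task_success": sum(scores) / len(scores), "n": len(cases)}

def regression_gate(report: dict, baseline: float = 0.80) -> bool:
    """Fail CI if quality drifts below the last accepted baseline."""
    return report["task_success"] >= baseline

# Usage with a stubbed model:
cases = [TestCase("What is 2+2?", "4"), TestCase("Capital of France?", "Paris")]
report = evaluate(lambda p: "4" if "2+2" in p else "Paris", cases)
```

The same `regression_gate` call runs in CI on every change, which is what keeps quality from drifting silently.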
Fine-tuning and domain adaptation
- SFT with LoRA/QLoRA.
- Data synthesis constrained to source texts, with dedup and quality filtering.
- Ablations to prove what actually improves performance.
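The dedup and quality-filtering step above can be sketched with simple n-gram overlap. The similarity threshold and minimum-length heuristic below are illustrative defaults, not tuned values.

```python
# Sketch of the dedup + quality-filter pass over synthesized samples.
# Thresholds (0.8 Jaccard, 5-word minimum) are hypothetical placeholders.

def ngrams(text: str, n: int = 3) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def dedup_and_filter(samples: list[str], sim_threshold: float = 0.8,
                     min_words: int = 5) -> list[str]:
    kept, kept_grams = [], []
    for s in samples:
        if len(s.split()) < min_words:   # quality filter: too short to be useful
            continue
        g = ngrams(s)
        if any(jaccard(g, kg) >= sim_threshold for kg in kept_grams):
            continue                     # near-duplicate of an already-kept sample
        kept.append(s)
        kept_grams.append(g)
    return kept

raw = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",   # exact duplicate
    "a completely different sentence about model training data",
    "too short",                                     # fails length filter
]
clean = dedup_and_filter(raw)
```

Production pipelines typically swap in MinHash or embedding similarity for scale, but the gate structure stays the same.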
RAG (when the problem is knowledge, not reasoning)
- Retrieval design, chunking strategy, and evaluation of groundedness.
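A minimal sketch of two of those pieces: a fixed-size overlapping chunker (the simplest starting strategy) and a crude word-overlap groundedness proxy. Both are illustrative; production groundedness checks use NLI models or LLM judges, and the stopword list here is deliberately tiny.

```python
# Illustrative chunking + groundedness check for a RAG pipeline.
# Sizes, overlap, and the stopword list are hypothetical defaults.

def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Fixed-size character chunks with overlap between neighbors."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and"}

def content_words(text: str) -> set:
    return {w for w in text.lower().split() if w not in STOPWORDS}

def groundedness(answer: str, chunks: list[str]) -> float:
    """Fraction of the answer's content words found in retrieved chunks."""
    answer_words = content_words(answer)
    if not answer_words:
        return 0.0
    source_words = set().union(*(content_words(c) for c in chunks))
    return len(answer_words & source_words) / len(answer_words)
```

A low groundedness score flags answers the retriever never supported, which is usually a retrieval or chunking problem rather than a model problem.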
"How not to use LLMs" reviews
- Identify failure modes, privacy risks, and where classic ML or rules beat an LLM.
My default approach
1. Baseline: establish a non-LLM or prompt-only baseline.
2. Evaluate: build the evaluation harness first.
3. Improve: only then consider RAG or fine-tuning.
4. Ship: deploy with regression gates and monitoring.
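The "ship" step's gate can be sketched as a comparison between the last shipped baseline and a candidate. Metric names and tolerances below are illustrative, not a fixed contract.

```python
# Sketch of a deployment gate: block the release if any quality metric
# regresses past tolerance or any cost/latency metric blows its budget.
# Metric names and the 10% slack are hypothetical choices.

def passes_gates(baseline: dict, candidate: dict, tolerance: float = 0.01) -> bool:
    higher_is_better = {"task_success", "faithfulness"}
    lower_is_better = {"latency_p95_s", "cost_per_query_usd"}
    for m in higher_is_better:
        if candidate[m] < baseline[m] - tolerance:
            return False                 # quality regression
    for m in lower_is_better:
        if candidate[m] > baseline[m] * 1.10:
            return False                 # cost or latency budget exceeded
    return True

baseline = {"task_success": 0.85, "faithfulness": 0.90,
            "latency_p95_s": 1.2, "cost_per_query_usd": 0.01}
candidate = {"task_success": 0.86, "faithfulness": 0.90,
             "latency_p95_s": 1.25, "cost_per_query_usd": 0.01}
```

The same function runs in CI and again against live monitoring data, so drift shows up as a failed gate rather than a user complaint.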
Where LLMs are a bad fit
- Simple rule-based logic or structured data extraction where regex/parsers are faster and more reliable.
- High-frequency, low-latency operations where cost and speed matter more than reasoning.
- Deterministic workflows where exact reproducibility and auditability are required.
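The first bullet in concrete form: pulling ISO dates out of text is a regex job, not an LLM job. The pattern is deterministic, auditable, and runs in microseconds.

```python
# Structured extraction where a regex beats an LLM: ISO-8601 dates.
import re

ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def extract_dates(text: str) -> list[str]:
    """Return every YYYY-MM-DD date found in the text, in order."""
    return ["-".join(m) for m in ISO_DATE.findall(text)]

extract_dates("Invoices due 2024-05-01 and 2024-06-15.")
# → ["2024-05-01", "2024-06-15"]
```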
Tooling I'm comfortable with
- Parameter-efficient fine-tuning, mixed precision, multi-node GPU training, deployable inference artifacts.
- Azure ML + Databricks + distributed data pipelines.
- Evaluation harnesses, CI/CD, and A/B testing.