Data Engineer · ML · Analytics

Richard
Antoine

Writing about machine learning engineering, data systems, and whatever else I find interesting. 3+ years building production data pipelines at P&G. M.Sc. Applied Mathematics & Statistics.

3+ years
building enterprise data systems
$350M+
in business value enabled via pipelines
11
production ML models improved

Featured writing

All articles

From Prompt to Response in Under a Second: Optimizing a Serverless LLM Inference Pipeline

Benchmarking and tuning every layer of a SageMaker + Lambda + API Gateway stack — cold starts, autoscaling, output constraints, and throttling.

Building a RAG Application with LangChain: Query Your Own Knowledge Base

End-to-end walkthrough of a retrieval-augmented generation system — document loading, text splitting, vector embeddings, Chroma, and a RetrievalQA chain.

CareerPulse: Job Category Imputation Using KNN

How I handled missing category labels in the CareerPulse pipeline using TF-IDF vectorization and a KNN classifier — training, evaluation, and MLflow integration.

Monitoring ML Models in Production: Detecting Data Drift Before It Breaks Your Pipeline

PSI, KS tests, and a Databricks monitoring notebook that fires before accuracy degrades — because retraining schedules alone aren't enough.

Designing for Failure: Idempotency and Error Handling in PySpark Pipelines

Delta Lake MERGE, quarantine paths for bad records, checkpoint-based recovery, schema evolution, and structured logging — what production pipelines actually need.

Deploying an LLM to SageMaker: What the Docs Don't Tell You

Instance types, memory limits, cold start behavior — the gaps in AWS docs that cost me hours.

Serverless Inference with Lambda: Wiring It to SageMaker

How to connect a Lambda function to a SageMaker endpoint with least-privilege IAM and actually handle timeouts.

API Gateway: Turning Your Lambda Into a Public Endpoint

CORS, throttling, and API keys — finishing the serverless MLOps pipeline so the outside world can hit your model.

Coming soon

More articles on Databricks, Delta Lake, medallion architecture, and production pipeline design.

Coming soon

The CareerPulse end-to-end deep dive — architecture decisions, failures, and what I learned.

Coming soon

Essays on whatever I find interesting — outside the data world.

Notable projects

CareerPulse

End-to-end medallion lakehouse pipeline ingesting live job-posting data via REST APIs with incremental loading, feeding a downstream XGBoost forecasting model tracked in MLflow.

LLM Inference API

Deployed a 738M-parameter model to SageMaker and built a fully serverless inference pipeline via Lambda and API Gateway — optimized to sub-second warm latency.

P&G Data Platform Migration

Led end-to-end migration of legacy enterprise pipelines to Databricks with Delta Lake, PySpark, and ACID-compliant distributed processing at scale.

Building things
at the intersection
of data & ML.

I'm a Data Engineer with a background in Statistics who spent 3+ years at Procter & Gamble building production-grade data infrastructure. I write about what I'm building and learning — from MLOps and lakehouse architecture to whatever rabbit hole I fall into.

Currently seeking remote ML Engineering, Data Engineering, and Analytics Engineering roles and freelance projects.

Python PySpark SQL Databricks Delta Lake AWS Azure MLflow scikit-learn HuggingFace XGBoost LangChain