Data Engineer · ML · Analytics
Writing about machine learning engineering, data systems, and whatever else I find interesting. 3+ years building production data pipelines at P&G. M.Sc. in Applied Mathematics & Statistics.
Latest
Cold starts, payload constraints, autoscaling, and throttling — a benchmark-driven walkthrough of every optimization applied to a SageMaker + Lambda + API Gateway inference stack, with before/after numbers at each layer.
Archive
Benchmarking and tuning every layer of a SageMaker + Lambda + API Gateway stack — cold starts, autoscaling, payload constraints, and throttling.
End-to-end walkthrough of a retrieval-augmented generation system — document loading, text splitting, vector embeddings, Chroma, and a RetrievalQA chain.
How I handled missing category labels in the CareerPulse pipeline using TF-IDF vectorization and a KNN classifier — training, evaluation, and MLflow integration.
PSI, KS tests, and a Databricks monitoring notebook that fires before accuracy degrades — because retraining schedules alone aren't enough.
Delta Lake MERGE, quarantine paths for bad records, checkpoint-based recovery, schema evolution, and structured logging — what production pipelines actually need.
Instance types, memory limits, cold start behavior — the gaps in AWS docs that cost me hours.
How to connect a Lambda function to a SageMaker endpoint with least-privilege IAM and actually handle timeouts.
CORS, throttling, and API keys — finishing the serverless MLOps pipeline so the outside world can hit your model.
More articles on Databricks, Delta Lake, medallion architecture, and production pipeline design.
The CareerPulse end-to-end deep dive — architecture decisions, failures, and what I learned.
Essays on whatever I find interesting — outside the data world.
Work
End-to-end medallion lakehouse pipeline ingesting live job-posting data via REST APIs, with incremental loading and a downstream XGBoost forecasting model tracked in MLflow.
Deployed a 738M-parameter model to SageMaker and built a fully serverless inference pipeline via Lambda and API Gateway — optimized to sub-second warm latency.
Led end-to-end migration of legacy enterprise pipelines to Databricks with Delta Lake, PySpark, and ACID-compliant distributed processing at scale.
I'm a Data Engineer with a background in Statistics who spent 3+ years at Procter & Gamble building production-grade data infrastructure. I write about what I'm building and learning — from MLOps and lakehouse architecture to whatever rabbit hole I fall into.
Currently seeking remote ML Engineering, Data Engineering, and Analytics Engineering roles and freelance projects.