Data Engineer · ML · Analytics
Writing about machine learning engineering, data systems, and whatever else I find interesting. 3+ years building production data pipelines at P&G. M.Sc. in Applied Mathematics & Statistics.
Latest
Cold starts, payload constraints, autoscaling, and throttling — a benchmark-driven walkthrough of every optimization applied to a SageMaker + Lambda + API Gateway inference stack, with before/after numbers at each layer.
Archive
Benchmarking and tuning every layer of a SageMaker + Lambda + API Gateway stack — cold starts, autoscaling, payload constraints, and throttling.
End-to-end walkthrough of a retrieval-augmented generation system — document loading, text splitting, vector embeddings, Chroma, and a RetrievalQA chain.
How I handled missing category labels in the CareerPulse pipeline using TF-IDF vectorization and a KNN classifier — training, evaluation, and MLflow integration.
PSI, KS tests, and a Databricks monitoring notebook that fires before accuracy degrades — because retraining schedules alone aren't enough.
Delta Lake MERGE, quarantine paths for bad records, checkpoint-based recovery, schema evolution, and structured logging — what production pipelines actually need.
Instance types, memory limits, cold start behavior — the gaps in AWS docs that cost me hours.
How to connect a Lambda function to a SageMaker endpoint with least-privilege IAM and actually handle timeouts.
CORS, throttling, and API keys — finishing the serverless MLOps pipeline so the outside world can hit your model.
More articles on Databricks, Delta Lake, medallion architecture, and production pipeline design.
The CareerPulse end-to-end deep dive — architecture decisions, failures, and what I learned.
Essays on whatever I find interesting — outside the data world.
Work
End-to-end medallion lakehouse pipeline ingesting live job-posting data via REST APIs, with incremental loading and a downstream XGBoost forecasting model tracked in MLflow.
Deployed a 738M-parameter model to SageMaker and built a fully serverless inference pipeline via Lambda and API Gateway — optimized to sub-second warm latency.
Led end-to-end migration of legacy enterprise pipelines to Databricks with Delta Lake, PySpark, and ACID-compliant distributed processing at scale.
I'm a Data Engineer with a background in Statistics who spent 3+ years at Procter & Gamble building production-grade data infrastructure. I write about what I'm building and learning — from MLOps and lakehouse architecture to whatever rabbit hole I fall into.
Currently seeking remote ML Engineering, Data Engineering, and Analytics Engineering roles and freelance projects.