Data Engineer · ML · Analytics
Writing about machine learning engineering, data systems, and whatever else I find interesting. 3+ years building production data pipelines at P&G. M.Sc. Applied Mathematics & Statistics.
Latest
How to handle NULL job categories in a medallion lakehouse pipeline — building a KNN classifier on TF-IDF text features to impute missing labels and keep the downstream time series gap-free.
Archive
Imputing NULL job categories with a TF-IDF + KNN classifier — keeping the medallion pipeline's daily time series complete and forecast-ready.
Instance types, memory limits, cold start behavior — the gaps in AWS docs that cost me hours.
How to connect a Lambda function to a SageMaker endpoint with least-privilege IAM and actually handle timeouts.
CORS, throttling, and API keys — finishing the serverless MLOps pipeline so the outside world can hit your model.
More articles on Databricks, Delta Lake, medallion architecture, and production pipeline design.
Essays on whatever I find interesting — outside the data world.
Work
End-to-end medallion lakehouse pipeline that ingests live job-posting data via REST APIs with incremental loading and feeds a downstream XGBoost forecasting model tracked in MLflow.
Deployed a 738M-parameter model to SageMaker and built a fully serverless inference pipeline via Lambda and API Gateway.
Led end-to-end migration of legacy enterprise pipelines to Databricks with Delta Lake, PySpark, and ACID-compliant distributed processing at scale.
I'm a Data Engineer with a background in Statistics who spent 3+ years at Procter & Gamble building production-grade data infrastructure. I write about what I'm building and learning — from MLOps and lakehouse architecture to whatever rabbit hole I fall into.
Currently seeking remote ML Engineering, Data Engineering, and Analytics Engineering roles, as well as freelance projects.