This portfolio tells the story of my journey from a scientist making software
and using data to an ML engineer building reliable systems for industry.
Highlighted projects
AI4citations: AI-powered citation verification
- Checks the veracity of scientific claims against cited text
- Web app built with Gradio frontend, deployed on Hugging Face
- Built API integration tests with continuous integration (CI) in GitHub Actions to ensure a working product with each commit
- App collects user feedback into Hugging Face datasets
- Culmination of my ML engineering capstone project, demonstrating skills in model deployment, monitoring, and feedback collection for continuous improvement
pyvers: NLP data processing and model training
- Implemented data processing from multiple data sources with normalized labels
- Devised shuffled training method to achieve 7% improvement in F1 score over SOTA models (see blog post for details)
- Implemented in PyTorch Lightning for reproducible and scalable training
- This project laid the foundation for improved training on multiple datasets and used software frameworks that support model deployment locally or on cloud services, building my skills in data and software engineering
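To illustrate what combining data sources with normalized labels can look like, here is a minimal sketch. The label names, the `LABEL_MAP` table, and the `normalize_label` and `merge_sources` helpers are assumptions made up for this example. They are not taken from the pyvers code, which may organize this step differently.

```python
# Hypothetical mapping from source-specific label spellings to one shared
# vocabulary. The real datasets and label names may differ.
LABEL_MAP = {
    "SUPPORT": "SUPPORT",
    "SUPPORTS": "SUPPORT",
    "CONTRADICT": "REFUTE",
    "REFUTES": "REFUTE",
    "NEI": "NEI",
    "NOT ENOUGH INFO": "NEI",
    "NOT_ENOUGH_INFO": "NEI",
}


def normalize_label(raw):
    """Map a raw label string to the shared vocabulary, or fail loudly."""
    key = raw.strip().upper()
    if key not in LABEL_MAP:
        raise ValueError(f"Unknown label: {raw!r}")
    return LABEL_MAP[key]


def merge_sources(sources):
    """Combine records from several sources into one list with shared labels.

    `sources` maps a source name to a list of dicts that each have
    'claim', 'evidence', and 'label' keys (an assumed record layout).
    """
    merged = []
    for name, records in sources.items():
        for rec in records:
            merged.append(
                {
                    "source": name,
                    "claim": rec["claim"],
                    "evidence": rec["evidence"],
                    "label": normalize_label(rec["label"]),
                }
            )
    return merged


if __name__ == "__main__":
    example = {
        "dataset_a": [{"claim": "c1", "evidence": "e1", "label": "SUPPORTS"}],
        "dataset_b": [{"claim": "c2", "evidence": "e2", "label": "contradict"}],
    }
    for row in merge_sources(example):
        print(row["source"], row["label"])
```

Raising an error on an unknown label, rather than passing it through, keeps a new data source from silently adding an extra class to the training set.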
CHNOSZ: Enabling reproducible scientific workflows
- Developed open-source R package with reproducible workflows to model chemical systems and make intuitive visualizations
- Maintained on CRAN since 2009 and cited more than 200 times by researchers around the world
- Automated data consistency checks increase confidence in the community-driven thermodynamic database
- Massive documentation effort, including help pages, examples, demos, and vignettes
- API supports third-party contributions, including a Shiny frontend and a Python interface
Software projects and packages
- CRAN packages I maintain:
- orpML: Predicting oxidation-reduction potential from microbial abundances
- Supporting code for a research project in environmental microbiology
- Classical machine learning with scikit-learn and deep learning with PyTorch
- Improved performance of ML models by deriving features from thermodynamic models
- R-svg-intepreter: R script to visualize an SVG file with base R graphics
- Written with AI assistance using Cursor
- Posted on the R-help mailing list to answer a user’s request
PRs, issues, and discussions
- Created CHNOSZ Discussions forum on GitHub for user support and engagement
- Answered LangChain question: Getting top k documents for ParentDocumentRetriever
- Made sense of the documentation to show how to pass keyword arguments to the search function
- Answered LangChain issue: Using local Hugging Face pipeline
- Digested the error message and docs to correctly specify a missing component needed to build private chatbots
- Answered LangGraph question: Extract tool name during streaming
- Found a solution for displaying tool names used in a chatbot application (first answer, 6 months after the original post)
- Posted LangSmith issue: Fixes for experiment logging
- Updated evaluation notebooks for compatibility with current API (allows reproducing steps in LangSmith onboarding videos)
- Committed to Gradio: Fix ValueError in Controlling the Reload demo
- Fixed bug in documentation example to correctly handle pipeline output (PR accepted by Gradio maintainers)
Academic research
- Reviewer for software and ML papers in science journals
- Published 20+ first-author papers and 50+ coauthored papers
- Long-term commitment to reproducible research
- JMDplots package reproduces plots for 20 of my papers going back to 2006
- See my research overview for more info
- Transferable skills: scientific computing, project management, writing, critical thinking