This portfolio tells the story of my journey from a scientist writing software
and working with data to an ML engineer building reliable systems for industry.

Highlighted projects

AI4citations: AI-powered citation verification

AI4citations screenshot

  • Checks the veracity of scientific claims against cited text
  • Web app built with Gradio frontend, deployed on Hugging Face
  • Built API integration tests that run under continuous integration (CI) in GitHub Actions to ensure a working product with each commit
  • App collects user feedback into Hugging Face datasets
  • The culmination of my ML engineering capstone project, demonstrating skills in deploying and monitoring models and in collecting feedback for continuous improvement

pyvers: NLP data processing and model training

pyvers banner

  • Implemented data processing from multiple data sources with normalized labels
  • Devised a shuffled training method that achieved a 7% improvement in F1 score over SOTA models (see blog post for details)
  • Implemented in PyTorch Lightning for reproducible and scalable training
  • This project laid the foundation for improved training on multiple datasets and used software frameworks that support model deployment locally or on cloud services, building my skills in data and software engineering
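
As an illustration of what normalizing labels across data sources can look like, here is a minimal sketch. It is not the actual pyvers code: the source names (`dataset_a`, `dataset_b`), the label vocabularies, and the helper functions are all hypothetical and chosen only to show the idea of mapping each source's labels onto one shared set.

```python
# Hypothetical label vocabularies; the real data sources and label names may differ.
LABEL_MAPS = {
    "dataset_a": {"SUPPORT": "SUPPORT", "CONTRADICT": "REFUTE", "NOT_ENOUGH_INFO": "NEI"},
    "dataset_b": {"ACCURATE": "SUPPORT", "NOT_ACCURATE": "REFUTE", "IRRELEVANT": "NEI"},
}


def normalize_label(source: str, label: str) -> str:
    """Map a source-specific label onto the shared label set."""
    try:
        # Tolerate stray whitespace and inconsistent casing in the raw label.
        return LABEL_MAPS[source][label.strip().upper()]
    except KeyError as err:
        # Fail loudly instead of silently passing through an unmapped label.
        raise ValueError(f"Unknown label {label!r} for source {source!r}") from err


def normalize_records(source: str, records: list[dict]) -> list[dict]:
    """Return copies of the records with their 'label' field normalized."""
    return [{**rec, "label": normalize_label(source, rec["label"])} for rec in records]


if __name__ == "__main__":
    raw = [{"claim": "example claim", "label": "not_accurate"}]
    print(normalize_records("dataset_b", raw))
```

Raising an error on unmapped labels is a deliberate choice in this sketch: when several sources are combined for training, a silently mislabeled example is harder to detect than a failed preprocessing run.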

CHNOSZ: Enabling reproducible scientific workflows

CHNOSZ banner

  • Developed open-source R package with reproducible workflows to model chemical systems and make intuitive visualizations
  • Maintained on CRAN since 2009 and cited more than 200 times by researchers around the world
  • Automated data consistency checks increase confidence in the community-driven thermodynamic database
  • Massive documentation effort, including help pages, examples, demos, and vignettes
  • API supports third-party contributions, including a Shiny frontend and a Python interface

Software projects and packages

  • CRAN packages I maintain:
    • CHNOSZ: Thermodynamic calculations and diagrams for geochemistry
    • canprot: Chemical analysis of proteins
    • chem16S: Chemical features of microbial communities
  • orpML: Predicting oxidation-reduction potential from microbial abundances
    • Supporting code for a research project in environmental microbiology
    • Classical machine learning with scikit-learn and deep learning with PyTorch
    • Improved performance of ML models by deriving features from thermodynamic models
  • R-svg-intepreter: R script to visualize an SVG file with base R graphics

PRs, issues, and discussions

Academic research