Wangpeng An
Currently building VLM systems at Instacart

Hello! I'm Wangpeng An.

Full Stack Staff Machine Learning Engineer

I build multimodal LLM and VLM systems end-to-end. A decade in, I've learned the model is rarely the hard part — making inference cheap and boring enough to trust at scale is. From CVPR research to billions of videos a day.

Currently at Instacart (Caper Cart) | Previously at TikTok & Meta AI Research (FAIR) | CVPR Spotlight Author

About Me

My path started in control engineering, not ML, and that framing never left. My CVPR Spotlight work came from a simple reframe: training a deep network is a feedback-control problem, so a PID controller can drive its optimizer (later extended to IEEE TNNLS, IF 11.368). I did my Master's at Tsinghua on deep learning optimization, pose estimation, and face recognition, with publications in CVPR, IEEE TNNLS, and Pattern Recognition, plus a U.S. patent for automated-checkout tracking.

Across Instacart, TikTok, Meta FAIR, and AiFi, the recurring lesson is that production ML lives or dies below the model. At TikTok, collapsing ~800 narrow perception models into one multimodal LLM only became viable once INT8 quantization and tensor-parallel serving on Inf2 cut inference cost in half. At AiFi, going RGB-only, with no shelf sensors, forced detection and tracking to be good enough that the hardware could be cheap. I work the full stack: research, large-scale training and inference, deployment with Docker, Kubernetes, TensorRT, and edge devices, and the serving and product layers on top. I also review for CVPR, ECCV, ICCV, and IEEE TPAMI.

Research & Projects

01

PID Controller for Deep Learning

SGD with momentum only uses the present and the past — it's a PI controller missing its derivative term. Adding D to anticipate overshoot gave faster, more stable training. CVPR 2018 Spotlight, extended to IEEE TNNLS (IF 11.368). The idea came from treating optimization as control, not the reverse.

  • CVPR Spotlight
  • Optimization
  • PyTorch
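
To make the control analogy concrete, here is a minimal pure-Python sketch on a one-dimensional quadratic. It is an illustration under stated assumptions, not the published algorithm: the function name `pid_minimize`, the gains `lr`, `momentum`, and `kd`, the exponential smoothing of the derivative term, and its sign convention are all choices made for this example. The current gradient plays the role of P, the momentum buffer plays the role of I, and the smoothed change in gradient plays the role of D.

```python
def pid_minimize(grad, theta, lr=0.1, momentum=0.9, kd=0.0, steps=300):
    """Minimize a scalar function with a PID-style update (illustrative only).

    grad:  callable returning the gradient at theta
    kd:    derivative gain; kd=0 reduces this to SGD with momentum
    Returns the final theta and the full trajectory.
    """
    v = 0.0          # momentum buffer: accumulates past gradients (I-like term)
    d = 0.0          # smoothed gradient change (D-like term)
    prev_g = None    # gradient from the previous step
    history = [theta]
    for _ in range(steps):
        g = grad(theta)                      # present error signal (P-like term)
        v = momentum * v - lr * g
        if prev_g is not None:
            d = momentum * d + (1.0 - momentum) * (g - prev_g)
        prev_g = g
        # The D term opposes rapid gradient change, which damps overshoot.
        theta = theta + v - kd * d
        history.append(theta)
    return theta, history


# Toy objective f(theta) = theta**2, whose gradient is 2 * theta.
grad = lambda theta: 2.0 * theta
_, pi_path = pid_minimize(grad, 1.0, kd=0.0)   # momentum only, no D term
_, pid_path = pid_minimize(grad, 1.0, kd=1.0)  # with derivative damping
print("lowest point reached, kd=0:", min(pi_path))
print("lowest point reached, kd=1:", min(pid_path))
```

On this toy problem both runs start at 1.0 and the minimum is at 0, so any negative value is overshoot. Setting `kd` above zero shrinks that overshoot compared with plain momentum, which is the behavior the derivative term is meant to provide.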
02

Caper Cart Shopping Assistant

At Instacart: a VLM shopping assistant that runs on the Caper Cart itself. The cart sets fixed latency, power, and cost budgets — so the interesting decisions are about what has to run on-device versus off, not how big the model can be. I own it end-to-end, model through product.

  • VLM
  • Multimodal
  • Edge Inference
  • Full Stack
03

Multimodal LLM at Scale

At TikTok: one multimodal LLM that turns video into dense text and replaces ~800 narrow perception models. Consolidating that many models only pays off if the one is cheaper to run than the many — so the real work was INT8 quantization and tensor-parallel serving on Inf2, which cut inference cost in half and made it affordable at billions of videos a day. Later expanded to recommendation, ads, and search (AWS re:Invent 2024).

  • LLM
  • Multimodal
  • Inf2
  • re:Invent 2024
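
As a rough illustration of the INT8 step mentioned above, the sketch below shows symmetric per-tensor quantization in pure Python. This is an assumed, generic scheme chosen for clarity; the text does not say which quantization scheme or toolchain the production system used, and the names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization (illustrative sketch).

    Maps floats to integer codes in [-127, 127] using a single scale,
    so that the largest magnitude in `values` maps to +/-127.
    Returns (codes, scale).
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid division by zero
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from INT8 codes and their scale."""
    return [c * scale for c in codes]


# Hypothetical weights, only to show the round trip.
weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, worst_error)
```

Each value is stored in one byte instead of four (for FP32), and the round-trip error per value is bounded by half the scale. That trade of a small, bounded error for lower memory and bandwidth is the general reason quantization reduces serving cost.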
04

Autonomous Checkout System

At AiFi: RGB-only autonomous checkout (like Amazon Go, but no shelf weight sensors). Dropping every sensor except cameras makes the store cheap to build — and pushes all the difficulty onto vision: person and product detection + tracking that survives occlusion and crowds well enough to bill the right cart. Led those algorithms; U.S. Patent US11393213B2.

  • Object Detection
  • Tracking
  • Patent
05

PolyU-Face System

At HK PolyU: an end-to-end face pipeline — detection, alignment, recognition. Building the whole chain teaches what papers skip: alignment errors upstream cap recognition accuracy downstream far more than the recognition model itself. Also contributed to multi-person 2D pose estimation (ICME 2018 oral).

  • Face Recognition
  • Pose Estimation
  • Research

My Skills

ML Frameworks

  • PyTorch
  • TensorFlow
  • Keras
  • ONNX
  • TensorRT
  • OpenCV

Programming Languages

  • Python
  • C++

Deployment

  • Docker
  • Kubernetes

Web Development

  • TypeScript
  • React
  • Next.js
  • Tailwind

Tools

  • Git
  • PostgreSQL

My Experience

Instacart Inc. - Full Stack Staff ML Engineer

California, USA

Building a VLM-based shopping assistant on the Caper Cart, owning the system end-to-end across model, inference, and the on-cart product experience.

TikTok Inc. - Senior ML Engineer

San Jose, California, USA

Trust & Safety and Content Ads: Built a full-stack multimodal LLM system that converts videos into dense text and detects unsafe content, processing billions of videos daily on Inf2 and outperforming the ~800 traditional perception models it replaced. Expanded the multimodal LLM to recommendation, ads, and search (presented at AWS re:Invent 2024), and lifted audio moderation metrics from 5% to 10% by integrating Whisper.

Meta (Facebook) - CV Engineer at FAIR

Menlo Park, California, USA

Facebook AI Research: Worked on Explainable AI to open the 'black box' of computer vision models, and on Responsible AI for inclusivity and fairness in computer vision systems.

AiFi Inc. - Senior Research Engineer

Santa Clara, California, USA

Led computer vision detection algorithms for an RGB-only autonomous checkout solution (similar to Amazon Go). Developed customer detection + tracking and product detection + tracking systems. U.S. Patent holder.

HK Polytechnic University - Research Assistant

Hong Kong

Developed a PID optimizer to accelerate CNN training (CVPR 2018 Spotlight). Built the PolyU-Face system for face detection, alignment, and recognition.

Tsinghua University - Master's Degree

Beijing, China

Control Engineering, with a research focus on deep learning optimization, human pose estimation, object detection, and face recognition. Student instructor for a Big Data class.

Taiyuan Iron and Steel Corp. - Control Technician

Taiyuan, China

Focused on tuning PID controller parameters in steel milling machines. Applied control theory to industrial automation.

Kunming University of Science and Technology - Bachelor's Degree

Kunming, China

Instrument and Measurement Science with concentration in control theory. Foundation in PID control systems and signal processing.