Wangpeng An
Building VLM systems at Instacart

Hi, I'm Wangpeng An.

Full Stack Staff Machine Learning Engineer

I build multimodal LLM and VLM systems, from training through deployment. I've spent about ten years on this, and most of my time goes into making inference cheap and reliable enough to run at scale, not into the model itself. I've worked on everything from CVPR research to systems that process hundreds of millions of videos a day.

Currently at Instacart (Caper Cart) | Previously at TikTok & Meta AI Research (FAIR) | CVPR Spotlight Author

About Me

I started out in control engineering rather than ML, and it still shapes how I think. My CVPR Spotlight paper came out of that background: if you treat training a deep network as a feedback-control problem, you can use a PID controller to drive the optimizer (later extended to IEEE TNNLS, IF 11.368). I did my Master's at Tsinghua working on deep learning optimization, pose estimation, and face recognition, with papers at CVPR, IEEE TNNLS, and Pattern Recognition, and a U.S. patent for automated-checkout tracking.

Working at Instacart, TikTok, Meta FAIR, and AiFi taught me that most of what makes production ML succeed or fail happens around the model, not in it. At TikTok, replacing roughly 800 narrow perception models with a single multimodal LLM only worked once INT8 quantization and tensor-parallel serving on AWS Inf2 instances had cut inference cost in half. At AiFi, we used cameras only, with no shelf sensors, which kept the hardware cheap but put all the pressure on detection and tracking. I'm comfortable across the whole stack: research, large-scale training and inference, deployment with Docker, Kubernetes, and TensorRT (including on edge devices), and the serving and product layers on top. I also review for CVPR, ECCV, ICCV, and IEEE TPAMI.

Research & Projects

01

PID Controller for Deep Learning

SGD with momentum is essentially a PI controller: it uses the present gradient and the accumulated past, but no derivative term. Adding the D term lets the optimizer anticipate overshoot, which made training faster and more stable. CVPR 2018 Spotlight, extended to IEEE TNNLS (IF 11.368).
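The idea can be illustrated on a toy problem. The following is a minimal sketch, not the paper's exact update rule: the function name `minimize`, the objective f(theta) = theta**2, and all hyperparameter values are assumptions chosen for illustration. Setting `kd = 0.0` gives plain SGD with momentum; a positive `kd` adds a smoothed derivative term that pushes against rapid changes in the gradient.

```python
def minimize(kd, steps=500, lr=0.1, momentum=0.9, theta=5.0):
    """Minimize f(theta) = theta**2 and return every visited value of theta.

    kd = 0.0 is plain SGD with momentum; kd > 0 adds a derivative (D) term.
    All hyperparameters are illustrative assumptions, not tuned values.
    """
    v = 0.0           # momentum buffer: present gradient plus accumulated past
    d = 0.0           # smoothed change in the gradient (the D term)
    prev_grad = None  # no previous gradient exists on the first step
    path = [theta]
    for _ in range(steps):
        grad = 2.0 * theta                       # gradient of theta**2
        v = momentum * v - lr * grad             # standard momentum update
        diff = 0.0 if prev_grad is None else grad - prev_grad
        d = momentum * d + (1.0 - momentum) * diff
        prev_grad = grad
        theta = theta + v - kd * d               # D term damps the step
        path.append(theta)
    return path


sgd_path = minimize(kd=0.0)
pid_path = minimize(kd=0.5)
print("deepest overshoot, momentum only:", min(sgd_path))
print("deepest overshoot, with D term:  ", min(pid_path))
```

On this toy quadratic both runs converge to the minimum at zero, and the run with the D term overshoots it less on the way there.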

  • CVPR Spotlight
  • Optimization
  • PyTorch
02

Caper Cart Shopping Assistant

At Instacart: a VLM on the Caper Cart that confirms each item as a shopper scans it. It reaches 95% top-1 accuracy across more than 300K SKUs, and the serving and evaluation infra I built brought mis-identifications down by 30%. The cart runs on fixed latency, power, and cost budgets, so a lot of the work is deciding what runs on-device and what runs off it. I own this end-to-end, from the model to the product experience.

  • VLM
  • Multimodal
  • Edge Inference
  • Full Stack
03

Multimodal LLM at Scale

At TikTok: a single multimodal LLM that turns video into dense text and replaces about 800 narrow perception models. Replacing that many models is only worth it if the single model is cheaper to run than all of them combined, so most of the effort went into INT8 quantization and tensor-parallel serving on AWS Inf2 instances. That cut inference cost in half and made the system affordable at 300 million videos a day. Later expanded to recommendation, ads, and search (AWS re:Invent 2024).
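For readers unfamiliar with INT8 quantization, here is a minimal sketch of the basic idea in plain Python, assuming the simplest variant (symmetric, one scale per tensor). It is not the production pipeline described above; the helper names `quantize_int8` and `dequantize` and the sample weights are hypothetical. Each float is stored as a one-byte integer code plus a shared scale, at the cost of a small rounding error.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> int8 codes plus one scale."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude onto 127; guard against an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.01, 1.0]   # hypothetical example values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("largest rounding error:", max_error)
```

Because values are rounded to the nearest code, the error per weight stays within about half of `scale`. Real serving stacks typically use finer-grained scales (per channel or per group) and calibrate activations as well.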

  • LLM
  • Multimodal
  • Inf2
  • re:Invent 2024
04

Autonomous Checkout System

At AiFi: camera-only autonomous checkout, similar to Amazon Go but without the shelf weight sensors. Using cameras only keeps the store cheap to build, but it means the vision system has to do everything: person and product detection and tracking that holds up through occlusion and crowds, accurately enough to charge the right cart. I led those algorithms. U.S. Patent US11393213B2.

  • Object Detection
  • Tracking
  • Patent
05

PolyU-Face System

At HK PolyU: a full face pipeline covering detection, alignment, and recognition. Building the whole chain made something clear that papers tend to gloss over: alignment errors early in the pipeline limit final recognition accuracy much more than the recognition model does. Also contributed to multi-person 2D pose estimation (ICME 2018 oral).

  • Face Recognition
  • Pose Estimation
  • Research

My Skills

ML Frameworks

  • PyTorch
  • TensorFlow
  • Keras
  • ONNX
  • TensorRT
  • OpenCV

Programming Languages

  • Python
  • C++

Deployment

  • Docker
  • Kubernetes

Web Development

  • TypeScript
  • React
  • Next.js
  • Tailwind

Tools

  • Git
  • PostgreSQL

My Experience

Instacart Inc. - Full Stack Staff ML Engineer

California, USA

Built a Vision Language Model on the Caper Cart that confirms items as shoppers scan them: 95% top-1 accuracy across 300K+ SKUs, with a 30% drop in mis-identifications. I own the system end-to-end, across model, inference, and the on-cart product experience.

TikTok Inc. - Senior ML Engineer

San Jose, California, USA

Trust & Safety and Content Ads: Built a full-stack multimodal LLM system that converts videos into dense text and detects unsafe content, processing 300 million videos daily on Inf2 and outperforming 800 traditional perception models. Expanded the multimodal LLM to recommendation, ads, and search (presented at AWS re:Invent 2024), and lifted audio moderation metrics from 5% to 10% by integrating Whisper.

Meta (Facebook) - CV Engineer at FAIR

Menlo Park, California, USA

Facebook AI Research: Worked on Explainable AI to open the 'black box' of computer vision models, and on Responsible AI for inclusiveness and fairness in computer vision systems.

AiFi Inc. - Senior Research Engineer

Santa Clara, California, USA

Led the computer vision detection algorithms for an RGB-only autonomous checkout solution (similar to Amazon Go). Developed customer detection and tracking and product detection and tracking systems. U.S. Patent holder.

HK Polytechnic University - Research Assistant

Hong Kong

Developed a PID optimizer to accelerate CNN training (CVPR 2018 Spotlight). Built the PolyU-Face system for face detection, alignment, and recognition.

Tsinghua University - Master's Degree

Beijing, China

Control Engineering with a research focus on deep learning optimization, human pose estimation, object detection, and face recognition. Student instructor for a Big Data class.

Taiyuan Iron and Steel Corp. - Control Technician

Taiyuan, China

Focused on tuning PID controller parameters in steel milling machines. Applied control theory to industrial automation.

Kunming University of Science and Technology - Bachelor's Degree

Kunming, China

Instrument and Measurement Science with a concentration in control theory. Foundation in PID control systems and signal processing.