Plato runs a conference called Plato Elevate, and in early 2024 I had the opportunity to present a talk. The theme of the conference was along the lines of “How did company X overcome challenge Y” – and I presented the talk “How did we build an AI/ML Platform at DoorDash”.
In this talk, I address how to deconstruct a challenge such as building an ML Platform from the ground up. Some notes from the talk:
How has DoorDash used AI/ML to improve its platform?
- DoorDash has identified over 120 unique AI/ML needs.
- Some of the key areas where they are applying these technologies are:
- Predictions (e.g., estimated time of arrival, short/long-term forecasting)
- Search & Discovery (Recommendations, Personalization)
- Logistics & Operational Excellence (Dasher to task matching, batching, missing items)
- Fraud (Promotions, Payment)
- Customer Support (Support tickets, refunds)
- Robotics (Drone delivery, self vending)
What are the principles for scaling and evolving an ML platform?
Here is the three-step process I suggest for building an ML Platform roadmap.
- Know your game: Understand what AI/ML can do for your business. Define your problem space and solution space. (In DoorDash’s case, the categories and potential use cases are outlined in the answer to the previous question.)
- Dream Big. Start Small: Identify the best first use case (a “Hero Use Case”) that will enable adoption and scaling of the platform
- Get 1% Better Every Day: Continuously iterate on architecture and scale. Add more sophisticated use cases over time
The talk dives into how we handled each step with some stories and details.
What were some key decisions we made when building our ML platform?
Some examples of key decisions include “Build vs. Buy” decisions. ML is a vast solution space, and we wanted to ensure we had the flexibility to evolve and choose best-of-breed solutions. It’s not just about vendor-based vs. open source vs. in-house solutions; it’s also about making the right long-term choices within each of those options. For example, as a model training library/framework we could have gone with either PyTorch or TensorFlow. After a deliberate set of discussions we chose PyTorch, and in hindsight this was a good decision: the PyTorch community has thrived over the years, and many MLEs who join DoorDash have previously worked with PyTorch.
Another area was the choice in building an Inference service.
We decided on a centralized prediction service, meaning that all models are served from a single endpoint. While there are tradeoffs to this approach, it provides better CPU utilization and easier management, and it keeps the flexibility to move to other approaches later. A rough sketch of the idea is below.
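To make this concrete, here is a minimal sketch of what a centralized prediction service might look like: a single registry and entry point that routes each request to the model it names. All class, method, and model names here are hypothetical illustrations, not DoorDash’s actual service.

```python
from typing import Any, Dict

class CentralizedPredictionService:
    """Illustrative sketch: every model is served behind one endpoint.

    All names here are hypothetical; this is not DoorDash's actual API.
    """

    def __init__(self) -> None:
        # Registry mapping model IDs to loaded model objects.
        self._models: Dict[str, Any] = {}

    def register(self, model_id: str, model: Any) -> None:
        # New models are deployed by registering them with the shared
        # service, rather than standing up a separate service per model.
        self._models[model_id] = model

    def predict(self, model_id: str, features: Dict[str, float]) -> float:
        # A single /predict-style entry point: the request names the model,
        # and the service routes to it. Centralizing this call path is what
        # enables shared batching, caching, and utilization improvements.
        model = self._models[model_id]
        return model.predict(features)


# Usage: any model exposing a .predict(features) method can be served.
class EtaModel:
    def predict(self, features: Dict[str, float]) -> float:
        return 5.0 + 2.0 * features.get("distance_km", 0.0)

service = CentralizedPredictionService()
service.register("eta_v1", EtaModel())
print(service.predict("eta_v1", {"distance_km": 3.2}))  # -> 11.4
```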
Some of the other decisions and features we built along the way include:
- The ability to shadow models to help with testing and to collect prediction logs for training purposes (see the sketch after this list)
- DoorDash also made decisions to reduce prediction latency, including:
- Support for prediction batching (useful for recommendation use cases)
- Leveraging C++ layers in the serving path
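To illustrate two of these ideas together, here is a minimal sketch (with hypothetical names, not DoorDash’s actual implementation) of a predictor that scores requests in batches and mirrors the same traffic to a shadow model, whose outputs are only logged and never served.

```python
import logging
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

class ShadowingPredictor:
    """Illustrative sketch of shadowing + batching; hypothetical names."""

    def __init__(self, primary, shadow=None) -> None:
        self.primary = primary
        self.shadow = shadow  # Candidate model receiving mirrored traffic.

    def predict_batch(self, batch: List[Dict[str, float]]) -> List[float]:
        # Batching: score many feature rows in one call, which amortizes
        # per-request overhead for high-fanout use cases like recommendations.
        results = [self.primary.predict(row) for row in batch]

        if self.shadow is not None:
            # Shadowing: the candidate model sees identical traffic. Its
            # predictions are logged (e.g., for offline comparison or as
            # future training data) but never returned to callers. In a
            # production system this call would typically run asynchronously
            # so it adds no latency to the serving path.
            for row, served in zip(batch, results):
                log.info(
                    "shadow_prediction features=%s served=%.3f shadow=%.3f",
                    row, served, self.shadow.predict(row),
                )
        return results


# Usage with trivial stand-in models:
class ConstModel:
    def __init__(self, value: float) -> None:
        self.value = value
    def predict(self, features: Dict[str, float]) -> float:
        return self.value

predictor = ShadowingPredictor(primary=ConstModel(1.0), shadow=ConstModel(0.9))
print(predictor.predict_batch([{"x": 1.0}, {"x": 2.0}]))  # -> [1.0, 1.0]
```

Logging the served and shadow predictions against identical traffic is what makes these logs useful both for comparing models and, later, as training data.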
Plato also converted this talk into a blog post with the key points outlined; you can read it here. The post goes a bit deeper into some of the decisions made.
DoorDash, of course, is not the only company to face this challenge; many similar companies went through the same evolution as ML gained prominence over the years. This blog post does a decent job of indexing various companies and their ML Platform efforts.
If your company has gone through a similar exercise or if you are about to embark on one, please share your story as well!
