Table of Contents
- OpenAI’s Engineering Approach
- 💡 Summary – Key Lessons
- Reflection from an AI Tech Lead’s Perspective:
- 1. Monorepos Work — If Your Team Does
- 2. FastAPI + Pydantic = Power Without Complexity
- 3. Managed Cloud Is Not a Shortcut — It’s a Strategic Choice
- 4. Engineering Culture > Engineering Committees
- 5. Bias for Action Must Be Matched with Tooling Maturity
- 6. Your Architecture Reflects Your Product
- Source:
If you are a new reader, my name is Danar Mustafa. I write about product management with a focus on AI, tech, business, and agile management. You can visit my website here or visit my LinkedIn here. I am based in Sweden and the founder of AImognad.se, the leading AI Maturity Model Matrix. Get your free assessment here.
Want to Learn from OpenAI’s Engineering Team? Here’s the Secret Sauce Behind Their Success
Curious how OpenAI ships cutting-edge AI products at scale? Whether you’re an engineer looking to grow, a tech lead scaling your team, or exploring your next big career move — this post breaks down the real engineering practices behind OpenAI’s success.
From monorepos and FastAPI to team autonomy and CI challenges, here’s what makes their engineering culture work — and what we can all learn (and apply) from it.
OpenAI’s Engineering Approach
- Monorepo Strategy – A unified codebase (mainly Python) supports visibility and reuse but demands strong internal standards to maintain quality.
- Modern API Stack – Using FastAPI and Pydantic enables rapid development of robust, type-safe APIs, reducing friction between teams.
- Cloud Native at Scale – Running fully on Azure (AKS, CosmosDB, BlobStore) shows the effectiveness of managed services for speed and scalability.
- Talent Advantage – Hiring engineers from infrastructure-heavy companies like Meta accelerates adoption of proven, battle-tested practices.
- Customized Core Systems – Rebuilding critical infrastructure (like an in-house version of Meta's TAO graph data store) offers control and alignment with company needs.
- Use-Case-Driven Architecture – Designing systems around core user experiences (e.g., chat) leads to focused, scalable product delivery.
- Decentralized Decision-Making – Empowering teams to own decisions boosts velocity but requires clear communication and alignment.
- Bias for Action – Fast iteration is a cultural cornerstone, supporting rapid progress at the cost of occasional technical debt.
- CI/CD and Tooling Gaps – Rapid growth outpaced test infrastructure, highlighting the need to invest early in developer productivity tooling.
OpenAI balances autonomy, action, and alignment — a model that encourages innovation, but only works with strong foundations in tooling, hiring, and product-driven engineering.
Let’s dive into each section.
🗂 Monorepo & Python
What:
OpenAI uses a giant monorepo that’s mostly Python, although there’s some Rust and Go sprinkled in for specific use cases.
Takeaway:
A monorepo offers visibility and shared tooling, but without enforced style guides, things can get messy. If we move in this direction, we need strong code conventions.
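One concrete way to enforce shared conventions in a Python monorepo is a single top-level config file that every package inherits. The sketch below uses Ruff as the linter; the tool choice and settings are illustrative assumptions, not what OpenAI actually uses.

```toml
# Hypothetical top-level pyproject.toml fragment for monorepo-wide conventions.
# Every package under the repo picks these up automatically.
[tool.ruff]
line-length = 100
target-version = "py311"

[tool.ruff.lint]
# pycodestyle errors, pyflakes, and import sorting
select = ["E", "F", "I"]
```

Pairing a config like this with a pre-commit hook or a CI check is what turns implicit conventions into enforced ones.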
⚙️ API and Validation
What:
Most services are built with FastAPI for APIs and Pydantic for validation.
Takeaway:
FastAPI + Pydantic is a powerful combo for building well-structured and type-safe APIs. Worth considering for internal and external services.
☁️ Everything Runs on Azure
What:
OpenAI runs entirely on Azure, especially relying on:
- Azure Kubernetes Service (AKS)
- CosmosDB
- BlobStore
Takeaway:
Even large-scale companies use managed cloud services. Picking the right ones can significantly simplify infrastructure and operations.
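For a sense of what "running on AKS" looks like in practice, here is a minimal Kubernetes Deployment manifest for a hypothetical API service. The image name, registry, and replica count are illustrative assumptions, not OpenAI's actual configuration.

```yaml
# Hypothetical AKS deployment for a small API service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chat-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: chat-api
  template:
    metadata:
      labels:
        app: chat-api
    spec:
      containers:
        - name: chat-api
          image: myregistry.azurecr.io/chat-api:latest  # assumed ACR registry
          ports:
            - containerPort: 8000
```

The point of managed Kubernetes is that everything below this manifest (nodes, control plane, upgrades) is Azure's problem, not yours.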
🔁 Meta → OpenAI Pipeline
What:
There’s a strong talent pipeline from Meta to OpenAI, especially on the infrastructure side.
Takeaway:
Hiring experienced engineers from companies with similar scale and systems can bring major advantages in terms of best practices and velocity.
🧱 In-House TAO Implementation
What:
OpenAI built its own internal version of TAO, Meta's distributed graph data store (the system behind Facebook's social graph).
Takeaway:
When a data-access pattern is central to your platform, building a tailored storage layer may be worth the investment.
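For context, TAO (per Meta's public 2013 paper) models data as typed objects connected by typed associations. Here is a toy in-memory sketch of that idea; all names are illustrative, and real TAO is a distributed, heavily cached store, not a couple of dicts.

```python
# Toy sketch of TAO's "objects and associations" data model.
from collections import defaultdict

class GraphStore:
    def __init__(self):
        self._objects = {}                # id -> (type, data)
        self._assocs = defaultdict(list)  # (id1, assoc_type) -> [id2, ...]
        self._next_id = 1

    def add_object(self, otype, **data):
        """Create a typed object and return its id."""
        oid = self._next_id
        self._next_id += 1
        self._objects[oid] = (otype, data)
        return oid

    def add_assoc(self, id1, atype, id2):
        """Add a typed, directed edge from id1 to id2."""
        self._assocs[(id1, atype)].append(id2)

    def assoc_range(self, id1, atype, limit=10):
        """TAO-style query: the most recent `limit` edges of this type."""
        return self._assocs[(id1, atype)][-limit:]
```

The interesting design choice is the query shape: instead of arbitrary graph traversals, the API is built around a handful of cheap, cacheable operations like "give me the recent edges of type X from node Y".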
💬 Chat-Centric Codebase
What:
Much of the architecture is centered around the concept of chat messages and conversations (e.g., ChatGPT).
Takeaway:
Designing systems around your primary use case leads to more coherent, optimized solutions.
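A chat-centric domain model can be surprisingly small. Here is a hedged sketch of what "messages and conversations" might look like as core types; the field names are my own assumptions, not OpenAI's actual schema.

```python
# Sketch of a chat-centric domain model: conversations own an ordered
# list of role-tagged messages.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Message:
    role: str       # e.g. "user", "assistant", "system"
    content: str
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

@dataclass
class Conversation:
    id: str
    messages: list[Message] = field(default_factory=list)

    def append(self, role: str, content: str) -> Message:
        """Add a message to the conversation and return it."""
        msg = Message(role=role, content=content)
        self.messages.append(msg)
        return msg
```

When the whole codebase speaks in these terms, storage, APIs, and UI all line up around the same vocabulary, which is the "coherence" benefit the takeaway describes.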
🛠 Decentralized Code Decisions
What:
No central architecture or planning committees — teams that plan to do the work make the decisions.
Takeaway:
Empowers teams and speeds up delivery — but can result in duplicated or inconsistent code unless offset by shared understanding and documentation.
⚡ Bias for Action
What:
OpenAI encourages bias toward action — teams move fast and build things quickly.
Takeaway:
Great for speed and innovation, but can lead to tech debt if not balanced with code quality and refactoring time.
🧪 CI and Tooling Challenges
What:
Rapid team growth + weak tooling led to:
- CI breaking often on master
- Slow test suites (~30 minutes on GPU)
- "Dumping ground" backend repos
Takeaway:
Invest early in reliable CI/CD and developer tooling, especially during team expansion. Poor tooling can bottleneck productivity.
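One common mitigation for slow suites is to split fast and GPU-heavy tests with pytest markers, so CI can run the fast subset on every commit (`pytest -m "not gpu"`) and the full suite on a schedule. This is a generic pattern, not a description of OpenAI's setup, and the marker name is my own.

```python
# Tag expensive tests so CI can skip them on the hot path.
# The "gpu" marker would be registered in pytest.ini:
#   markers = gpu: requires a GPU
import pytest

def add(a, b):
    return a + b

def test_add_fast():
    # Cheap unit test: runs on every commit.
    assert add(1, 2) == 3

@pytest.mark.gpu
def test_model_forward_pass():
    # Placeholder for an expensive GPU test, run nightly or pre-release.
    assert True
```

The same split also helps the "30 minutes on GPU" problem directly: the slow tests still run, just not in the loop that blocks every merge.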
💡 Summary – Key Lessons
| Area | Key Insight |
|---|---|
| Monorepo | Works well with strong conventions |
| APIs | FastAPI + Pydantic = fast, safe APIs |
| Cloud Infra | Pick reliable managed services |
| Hiring | Bring in talent with large-scale experience |
| Code Ownership | Empower teams but coordinate standards |
| Test Infra | Robust CI/CD is not optional |
| Engineering Culture | Bias for action = speed, but manage tech debt |
Reflection from an AI Tech Lead’s Perspective:
Reading through OpenAI’s engineering choices isn’t just interesting — it’s grounding. It reinforces a few things I’ve come to believe as a tech lead, but also surfaces blind spots we all need to watch out for.
1. Monorepos Work — If Your Team Does
The monorepo strategy only works because OpenAI engineers trust each other to follow (implicit) conventions. That kind of discipline is earned, not assumed. For our team, it’s a reminder that scaling code means scaling habits — not just tools.
2. FastAPI + Pydantic = Power Without Complexity
It’s validating to see OpenAI lean on tools many of us use. Sometimes we over-engineer; OpenAI shows you can ship state-of-the-art systems with the same open-source stack everyone else has — if your team is sharp enough.
3. Managed Cloud Is Not a Shortcut — It’s a Strategic Choice
They didn’t invent their own infra stack — they used Azure, and went deep. That’s a huge lesson: focusing on your differentiator (the model, the experience) means letting go of unnecessary complexity. Not every problem needs a custom solution.
4. Engineering Culture > Engineering Committees
I’m struck by their decentralized decision-making. It’s bold. It means trusting teams to build what they need. But that only works with strong culture and clear guardrails. For us, it’s a call to level up our internal documentation, not our approval chains.
5. Bias for Action Must Be Matched with Tooling Maturity
They move fast — but they paid the price in broken CI and slow tests. That’s the trade-off. For my team, it’s a reminder: shipping fast is great, but the long-term cost of neglected infra is real and non-trivial.
6. Your Architecture Reflects Your Product
The fact that their system is built around “chat” concepts is brilliant. It’s not just API endpoints — it’s a domain model rooted in how users interact. It reminds me that architectural clarity starts with product empathy.
OpenAI’s setup isn’t magic. It’s a series of disciplined choices, made with deep product alignment and cultural intent. As a tech lead, I walk away reminded that our real leverage isn’t in tools or languages — it’s in clarity, culture, and keeping engineers close to the problem we’re solving.
Source:
This content started with this X post: https://x.com/deedydas/status/1945366936893972710/. I have since done some research and added my own ideas and thoughts.
Discover more from The Tech Society