Platform Engineering: Build vs Buy

Since I posted my Security Platform Engineering blog, one of the most common questions I’ve received is: “How do I know when I should build vs. buy?” In this blog, I aim to demystify some of the decision-making around this and, ideally, help you make better decisions if you’re faced with the same problem. This is my personal view, and you might need to tailor it to your company values and risk profile, but I’ll include specific examples from my past that have had input into my decision-making.

Retrospective

In the previous blog, I suggested avoiding a rebuild of a commodity. A commodity is an off-the-shelf vendor product that meets most of your needs. In most cases, developers’ time is better spent improving your core products or internal tooling than rebuilding something you could buy and deploy within a few weeks.

Platform engineering teams should spend their time building three things:

New and innovative services
Niche problems that don’t have a vendor ecosystem. In the past, I’ve called these sub-venture scale problems.
Glue services that connect various internal and vendor services.

You’re in a good spot if you’re working on things in these three categories. My only advice is to validate what you’re doing and do a proper analysis to ensure there isn’t a hidden market you’re unaware of. Of course, just because you’re working with internal customers doesn’t mean you shouldn’t follow the build, measure, and learn feedback loop. You should treat what you do as a product. Instead of revenue or growth, you might measure something like developer hours saved, services secured, or something less tangible.

Exceptions

Of course, there are always exceptions, and I want to dive deeper into some of them. These were:

Cases where you can build it cheaper than what a vendor is offering
When you have issues with the vendor’s security posture and want to keep it internal
The only vendors in that space are competitors you don’t want to give money to
Avoiding vendor lock-in either because you think they’ll increase prices or not build out future features you require
Where the solutions provided can’t scale to your needs

Cost

If someone tells me they can build something cheaper than a vendor, I’m immediately skeptical because I don’t think most people can accurately forecast the actual cost of maintenance in the long term. Anything you build needs to be maintained, get security patches, and get new features, and you usually need to fix broken APIs that connect to other vendors. That being said, I have seen some incredibly valid examples of people saving significant amounts of cash. A common example these days is replacing Datadog with open-source alternatives and hiring a few engineers to run it. There are many alternatives now, such as Grafana/Prometheus, and running them yourself is within the realm of possibility when you hear that companies such as Coinbase spend more than $65m on Datadog.

You need to scrutinize the public examples, though. The famous post from DHH comes to mind here, where things like reliability, hiring, and security were not factored into the costs. While running it yourself may still be cheaper, buying also has less tangible benefits, like being able to hire people who understand the tools on day one rather than needing to ramp up on a new stack.

My advice here is to evaluate two things:

Evaluate the feature set of the vendor you are using. Building yourself may be a viable option if you use one or two smaller features. If you’re using the entire suite, then you have more work to do.
Evaluate the price, both the short-term cost and the projected cost over several years. Remember that, in general, SaaS increases its price by about 10% per year.

Something you need to be acutely aware of is how much engineers cost. Again, you need to factor in some finger-in-the-air math for pay increases and promotions to make it an even comparison. I’ve seen too many projects get approved because of “cost savings” that would save 100k, only to require 400k+ of engineering resources to build.
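To make the finger-in-the-air math concrete, here is a minimal sketch of that comparison. The 10% yearly SaaS increase comes from the rule of thumb above; the 5% pay growth, the maintenance figure, and all function names are my own assumptions for illustration, so swap in your own numbers.

```python
def vendor_cost(annual_price: float, years: int, yearly_increase: float = 0.10) -> float:
    """Total vendor spend over `years`, with the price rising every year."""
    return sum(annual_price * (1 + yearly_increase) ** year for year in range(years))


def build_cost(initial_build: float, annual_maintenance: float, years: int,
               yearly_pay_growth: float = 0.05) -> float:
    """Up-front engineering cost plus maintenance that grows with pay rises.

    The 5% pay growth default is an assumption, not a benchmark.
    """
    maintenance = sum(
        annual_maintenance * (1 + yearly_pay_growth) ** year for year in range(years)
    )
    return initial_build + maintenance


def break_even_year(annual_price: float, initial_build: float,
                    annual_maintenance: float, horizon: int = 10):
    """First year in which building is cheaper in total, or None within the horizon."""
    for years in range(1, horizon + 1):
        if build_cost(initial_build, annual_maintenance, years) < vendor_cost(annual_price, years):
            return years
    return None


if __name__ == "__main__":
    # Hypothetical project: replace a 100k/year vendor with a 400k build
    # that needs 50k/year of maintenance afterwards.
    print(break_even_year(100_000, 400_000, 50_000))
```

With these made-up inputs, building only becomes cheaper in year six, which is longer than most vendor contracts. If the maintenance estimate is even slightly optimistic, the break-even point moves out further.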

Security Posture

I believe that SaaS is good for security overall (but admittedly, I’m biased here). One of the benefits of SaaS is that there is a security team that is fully dedicated to securing and alerting on malicious behavior in that one product. After all, most companies can’t hire 100+ people to secure each SaaS vendor.

On the flip side, however, every vendor that you use increases your risk surface area and ultimately increases the likelihood you’ll be involved in a serious data breach through one of your downstream vendors. The best approach here is good data lifecycle management and good IT practices to ensure you aren’t duplicating data elsewhere, but you may find yourself in a position where you have extremely sensitive customer data that you really want to lock down.

I can give you a concrete example from my past: the tool ngrok. Ngrok allows you to expose local services to the public internet, which is fantastic for development but terrible for security. They’ve improved security significantly in recent years, but this used to be a solo operation with basically no security features at all. If an attacker got hold of the hosted URL, they could access developer services and potentially workstations. This left me and many other security practitioners overwhelmed by bug bounty reports that cost us time, money, and effort to resolve.

We decided to build our own version of this, with SAML auth for users, token authentication for service accounts, full logging, and long randomized URLs for every instance. We had the additional benefit of hosting it ourselves, which was also cheaper, so it was a no-brainer at the time.
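As a rough illustration of two of those controls, here is a minimal sketch of generating long randomized hostnames and checking service account tokens. This is not the code we ran; the domain, function names, and token sizes are assumptions, and SAML and logging are left out entirely.

```python
import hashlib
import hmac
import secrets

# Hypothetical internal domain, used for illustration only.
TUNNEL_DOMAIN = "tunnels.example.internal"


def new_tunnel_hostname(num_bytes: int = 24) -> str:
    """Return an unguessable hostname for one tunnel instance.

    24 random bytes become 48 hex characters, which stays under the
    63-character limit for a single DNS label.
    """
    return f"{secrets.token_hex(num_bytes)}.{TUNNEL_DOMAIN}"


def issue_service_token() -> tuple[str, str]:
    """Create a service account token and the hash to store server-side."""
    token = secrets.token_urlsafe(32)
    return token, hashlib.sha256(token.encode()).hexdigest()


def verify_service_token(presented: str, stored_hash: str) -> bool:
    """Compare hashes in constant time so timing does not leak the token."""
    presented_hash = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(presented_hash, stored_hash)


if __name__ == "__main__":
    token, stored = issue_service_token()
    print(new_tunnel_hostname())
    print(verify_service_token(token, stored))
```

Only the hash of the token is stored, so a leaked database does not hand out working credentials. The random hostname is a second layer on top of authentication, not a replacement for it.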

Competitors

It’s a simple one. You want to avoid giving competitors money or any profound insight into your company. Storing your data with a competitor and paying them money to do it goes against basic company strategy.

In the past, I worked at Atlassian, which competed with Microsoft in areas such as docs (SharePoint/Confluence). Of course, we could only partially avoid Microsoft. You’d struggle to completely avoid Windows, Azure, Office, and everything else, but we did try to keep it to a minimum.

I have heard of cases where companies would instead build an internal service to avoid giving money to a competitor, but these are rarer cases, and I don’t have any real-world examples to provide from my past.

Avoiding Vendor Lock-In

Contracts these days tend to be long, with significant discounts given for 3-5 year contracts. This is good from a pricing perspective, but a lot can happen in three years. It’s not uncommon to sign a contract and then lose all of your leverage for the future, with prices increasing and the features you desperately need sitting in the to-do pile.

A strategy I often see people use is to build a layer on top of the vendor, acting as the controller. This lets you consume data from the vendor in question but have your own front end and logic that theoretically could be switched out with another.

At Shopify, we built a custom device posture tool that took data from our various MDMs and EDR tools. If we changed these tools, we could develop a new module in a couple of weeks and rip out the old vendor with very little work. This was a massive boon for flexibility at the cost of a little extra work. It acted as a hub-and-spoke model, where our internal tool was the hub and connected to many vendor spokes. A similar approach is often taken in the world of cloud dev platforms, with people building custom internal dev tools that reach out to many cloud vendors.
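The hub-and-spoke idea can be sketched in a few lines. This is an illustrative toy, not the tool we built: the class names, signal names, and the in-memory fake vendors are all hypothetical. The point is that the hub only knows about a small interface, so swapping a vendor means writing one new spoke.

```python
from typing import Protocol


class PostureSpoke(Protocol):
    """The only contract the hub relies on; each vendor gets one adapter."""

    name: str

    def signals(self, device_id: str) -> dict[str, bool]:
        ...


class FakeMdm:
    """Stand-in for an MDM vendor adapter (hypothetical, in-memory data)."""

    name = "mdm"

    def __init__(self, encrypted_devices: set[str]) -> None:
        self._encrypted = encrypted_devices

    def signals(self, device_id: str) -> dict[str, bool]:
        return {"disk_encrypted": device_id in self._encrypted}


class FakeEdr:
    """Stand-in for an EDR vendor adapter (hypothetical, in-memory data)."""

    name = "edr"

    def __init__(self, healthy_devices: set[str]) -> None:
        self._healthy = healthy_devices

    def signals(self, device_id: str) -> dict[str, bool]:
        return {"edr_running": device_id in self._healthy}


class PostureHub:
    """Merges signals from every spoke and applies one internal policy."""

    def __init__(self, spokes: list[PostureSpoke], required: set[str]) -> None:
        self._spokes = spokes
        self._required = required

    def is_compliant(self, device_id: str) -> bool:
        merged: dict[str, bool] = {}
        for spoke in self._spokes:
            merged.update(spoke.signals(device_id))
        # A missing signal counts as a failure, so removing a spoke fails closed.
        return all(merged.get(signal, False) for signal in self._required)


if __name__ == "__main__":
    hub = PostureHub(
        spokes=[FakeMdm({"laptop-1"}), FakeEdr({"laptop-1", "laptop-2"})],
        required={"disk_encrypted", "edr_running"},
    )
    print(hub.is_compliant("laptop-1"), hub.is_compliant("laptop-2"))
```

Replacing the EDR vendor in this sketch means adding one class with a `signals` method and changing the list passed to the hub; the policy and anything built on top of `is_compliant` stay untouched.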

This can be a good approach, and it worked for us for a good chunk of time, but eventually, vendors can catch up and become better hubs. In our case, ZTNA tooling caught up, so we started to evaluate the market to replace our internal tool and shifted to building additional spokes instead, which let us secure our most sensitive services with additional controls.

Scale

When I talk to my friends at huge enterprises and FAANG companies, they often build more than I do. Google is notorious for building many things internally and relying on vendors less than others. The reason for this tends to be scale, because the needs of a 200k-employee company with people in every corner of the world differ from those of a 100-person company in a single office.

Many companies I’ve worked at have had scale issues with these larger customers. A classic example from my past is that Confluence had a hard 10k-person limit for many years in our early cloud deployments, and performance dropped significantly beyond it. Building for scale is complex, and even multi-region, multi-AZ, multi-sharded setups can cause problems with your own downstream vendors.

Additionally, these companies have different expectations. If I were to slow a build by 1 second at Canva to give us significant improvements elsewhere, I’d probably not get too much pushback. If I were to suggest this at Google, my pull request would never make it to prod because 1 second is a huge number for them.

In these cases, you may decide to build instead to keep things internal. This gives you more control over your work and allows you to make these decisions rather than having a vendor determine it for you. This may go against the rest of the advice in this blog, as it’ll often cost more and may even be a commodity, but if the commodity doesn’t work at your scale, then it’s justified unless you can work with that vendor to remediate their issues quickly.
