Background
Update: This post was first published in 2021 on my previous blog and was based on a talk I gave in 2020. While migrating to a new platform I’ve gone through and applied a few updates for 2022, mostly focusing on new features available on the market and upcoming changes such as WebAuthn improvements with passwordless.
I spent the last few years building out a Zero Trust architecture as the Head of Corporate Security at Atlassian, and I figured it’s time to write a blog going into some of the design decisions we made and how we implemented the changes at enterprise scale. Back in 2017, when we started, there were sparse details around what Zero Trust was and what it really meant beyond the initial whitepaper from Google and a few people trying out new ideas.
- Part one of this blog will go into the building of the system and the decision-making process.
- Part two will detail the current state of the design and future improvements.
The Starting Point
To set the stage: back in 2017 Atlassian was primarily a server company. We used SaaS products, but we also had a huge on-prem footprint in a number of data centres located around the world. Most of our customers were on-prem and essentially dealt with security themselves; it was up to them to secure their instances appropriately. That being said, we had a number of cloud products which were starting to get customers, and the company’s focus was starting to shift to a cloud-first model.
- Company culture was very open: everyone had admin and everyone was considered a developer (pretty normal in developer-heavy environments).
- Heavy macOS presence but also some other desktop platforms.
- We used an IDP, but many applications were not set up with SAML.
- Employees used a VPN to access on-prem and IP-restricted SaaS services.
- Some services were not able to be IP restricted and as such these were just on the internet with basic auth.
- Multiple auth methods were present (Log In With Google, SAML, Basic Auth, etc.).
- 2FA was present and had good coverage but it allowed less-secure methods such as SMS & Call.
- Bring Your Own Device was common, with nothing to stop you from accessing resources from a personal mobile.
- Endpoint telemetry was limited, but network-based telemetry was serviceable.
- We had MDM and set controls on day one, but we allowed users to override them and posture dropped off over time.
Problem Statement
We had read the early details on Zero Trust and were intrigued. Boiled down, it really means that you can identify:
X User on Y Device accessing Z Service
Combine this with regular security posture checks and you get a model that we decided was much better than trying to rely on MDMs and user awareness to do the work. We got the IT, CorpSec and Networking Engineering teams together and looked at building out a solution internally that would work for us. We looked at available tooling, and at the time there was basically nothing, so we used the BeyondCorp whitepapers and general designs to drive our initial planning.
Identity
Our first focus was on identity as it was the starting point for everything. We identified three things we wanted to achieve:
- Strong Auth Everywhere
- All Apps In SSO
- Tiered Applications
Strong Auth Everywhere
At the time this meant Yubikeys, but these days it’s much simpler given the prevalence of native security methods like Touch ID and upcoming additions like Apple Passkeys.
We essentially rolled out Yubikeys to every employee, sending them to each office and doing mass rollouts. We used data from our 2FA tool to look at the numbers and set hard cutoff dates depending on the usage. Call was barely used by anyone, so it was easy to disable quickly. SMS was still used in some scenarios, such as people travelling, but these were cases that Yubikeys could fairly easily replace. This project wasn’t overly complex and just required a steady stream of communications to end users to pick up their keys and enroll them prior to the hard switchover date.
In the end, we had stronger authentication on every service in our SSO platform but did continue to use non-phishing-resistant push-based 2FA for many services. It’s hard to disable if you have strong mobile usage, since the Yubikey experience on mobile is limited at best. It’s likely that upcoming WebAuthn changes will fix this, however, with Passkeys and other passwordless mechanisms becoming more prevalent.
If you want to deploy Yubikeys yourself in 2022, I’d recommend going with Yubico’s Enterprise Delivery, which allows you to bulk purchase items and send them to remote employees, or simply sending one along with the laptop if you use a supplier such as CDW.
All Apps In SSO
This was an easy change, albeit one with a long lead time. We built a set of minimum security requirements and listed SAML as one of them. This applied to all SaaS applications coming into our procurement process and it was a hard blocker: if a tool didn’t support SAML, it didn’t get approved.
We re-evaluated vendors every year or so, and naturally the apps that didn’t support SAML either built support for us or we stopped using them. We handled integrations in a surprisingly manual way, asking end-users to raise tickets for each one, but as we developed the system further we were looking at firing off tickets to IT automatically for SAML integration every time an application was approved and purchased by procurement.
I can highly recommend working closely with your Procurement team to implement changes like this and communicating it to end-users. Some services sell SAML as an extra cost and there’s nothing worse than getting budget approval for a tool only to find out it costs more at the last moment.
Tiered Applications
We realized early on that because we had a heavy mobile and BYO presence we couldn’t disable it entirely. The solution we used here was to have our security team tier each application as it came through the procurement process, which meant categorizing each one by a set of criteria. We decided simplicity was key; some people have complex user security scores here, but we never wanted a user to not know why they couldn’t access a service.
Tier | Description |
---|---|
Open Tier | Completely open, accessible from any device whether it’s managed or not. Very limited apps were in this tier and it primarily included things like video conferencing apps and training tools. |
Low Tier | These apps would be accessible by BYO devices and Atlassian-issued devices. This would include collaboration tools like Slack, GSuite and Confluence. |
High Tier | Sensitive internal and customer data. Imagine things like Splunk, AWS, Data Lakes and others being in this category. |
This process started off manual, became automated, and then we saw a number of issues that made it only semi-automated. The biggest challenge was that when employees want to use low/no-code automation tools like Mulesoft and Workato to connect apps together, the tiering becomes a bit of a grey area. We implemented guidance that data should only flow down, but could flow up in very limited circumstances. Automated checks didn’t really take this into consideration, so we asked users to identify what information would be flowing through the tool and did some manual checks to verify it during the procurement pathway.
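To make the tiering idea concrete, here is a rough Python sketch of how a device class could map to the highest tier it can reach. It is an illustration rather than our production logic: the device class names, the `MAX_TIER` mapping and the `can_access` function are all made up for this example, and the assumption that High Tier apps need an Atlassian-issued device is inferred from how the tiers are described rather than stated in the table.

```python
from enum import IntEnum
from typing import Tuple


class Tier(IntEnum):
    """Application tiers from the table above, ordered by sensitivity."""
    OPEN = 0
    LOW = 1
    HIGH = 2


# Hypothetical device classes mapped to the highest tier they may reach.
# Assumption: High Tier apps require an Atlassian-issued, managed device.
MAX_TIER = {
    "unmanaged": Tier.OPEN,     # any device, managed or not
    "byo_enrolled": Tier.LOW,   # personal device enrolled in the BYO MDM
    "corporate": Tier.HIGH,     # Atlassian-issued, managed device
}


def can_access(device_class: str, app_tier: Tier) -> Tuple[bool, str]:
    """Return (allowed, reason) so the user always knows why they were blocked."""
    max_tier = MAX_TIER[device_class]
    if app_tier <= max_tier:
        return True, "ok"
    reason = (
        f"This is a {app_tier.name.title()} Tier app, but a '{device_class}' "
        f"device can only reach {max_tier.name.title()} Tier apps."
    )
    return False, reason
```

Returning a plain-language reason alongside the decision keeps with the goal of never leaving a user guessing why they couldn’t access a service.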
Devices
Devices were probably our biggest challenge, and there were a few unique areas we wanted to tackle:
- Clean up the device inventory
- Deal with Corporate Devices
- Deal with BYO devices
- Increase security posture
Device Inventory
Maybe unsurprisingly, we used Jira as our device inventory; we used Atlassian products to do pretty much all of our ITSM operations (a great talk here about that from the amazing Ryan and Warren).
We fell into the trap of having a device inventory but only updating it occasionally. Zero Trust forces you to fix your bad workflows, and of course in order to verify a valid device we needed each one to be assigned to a user in Jira. This was a large piece of manual work from our IT team, making sure that all devices were accounted for, assigned to the correct user and had the correct status. Once it was updated, however, maintenance was relatively easy and automated.
I don’t think I can stress enough the importance of having an up-to-date asset inventory here. I’ve seen others try to use MDM as their source of truth and it usually fails in the long run for a number of reasons, including lost/stolen devices, not being able to accurately assign users and problems with adding new operating systems over time.
Deploy Corporate Devices
Most of our userbase had a device issued by us, but there were a few exceptions. One example of this was partner sites like a company called Spartez (which we soon acquired). These folks did a lot of development and customer support type work, and functionally they were essentially a part of Atlassian. Until then we had a site-to-site VPN and did various audits, but going forward we wanted to simplify things, even if that came with an upfront dollar cost.
We deployed Macbooks to every Spartez employee, so they had two machines: one they could use for Atlassian work and one they could use for other Spartez work. We took the same approach with contractors, since without a corporate device you could not access sensitive systems like AWS or Splunk, which is quite limiting for developer-type contractors. Once we acquired Spartez we cycled out their old devices and they were back on a single Atlassian-issued device anyway.
Once every user had a device, we made sure that all of them were enrolled in our mobile device management tools like Jamf. This ensured that any future changes we made would be applied to those devices and gave us a good starting point to improve device posture.
Bring Your Own Device
BYO is the more interesting space here and there are a lot of tradeoffs you may have to make to enable usability in your org. In our case we took an employee privacy-centric approach but built it in a way where we could at least get things like DNS logs to investigate incidents. Some places with a less open culture may decide to ban BYO entirely or may decide to do something such as deploy company-issued mobiles to on-call staff.
We knew early on that everyone has a different opinion on this matter and you get people on all sides of the spectrum, but in general we wanted to give employees as much freedom as possible.
We tinkered with a few approaches and used some of the network telemetry and a peer comparison to drive our design, but ultimately we decided to give a stipend to anyone who enrolled in our corporate MDM. This was around the cost of a low-data mobile plan in each country and gave people the choice to either buy a cheap new phone to use exclusively for work or enroll their current device and pocket the extra money as data expenses. This was fully automated: if you had a valid device which met our security posture at the end of the month, a hook would be fired off to include a line item in your paycheck. It didn’t even require anyone to file expense reports.
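As a rough illustration of that end-of-month check, here is a small Python sketch. The `ByoDevice` record and `stipend_line_items` function are hypothetical names for this example, a single flat amount stands in for the per-country figure, and paying one stipend per person (rather than per device) is an assumption on my part.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ByoDevice:
    """Hypothetical end-of-month snapshot of one personal device."""
    owner: str
    enrolled_in_mdm: bool
    meets_posture: bool


def stipend_line_items(devices: List[ByoDevice], amount: float) -> Dict[str, float]:
    """Return {owner: amount} for every employee with a compliant BYO device.

    Assumption: one stipend per person, no matter how many devices they enroll.
    """
    items: Dict[str, float] = {}
    for device in devices:
        if device.enrolled_in_mdm and device.meets_posture:
            items[device.owner] = amount
    return items
```

The resulting mapping is what a payroll hook would consume, which is why nobody had to file an expense report.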
If you are planning to take a similar approach I think it’s important to consider the potential side effects of your decision. You don’t want to be in a scenario where a personal device is used to access corporate resources and not be able to investigate because the employee won’t comply with any litigation orders. We made sure that we only gave access to collaboration tools where we already had eDiscovery and data storage in place like Slack, Email and others where we knew we would not need to ever image the physical device.
Certificates
Certificates and PKI are complex and could probably have a blog all to themselves, so I’ll keep this section short. We spun up a PKI infrastructure and connected it with our MDM tools, which meant we would deploy two types of certificates to devices depending on the group they were in.
If an Atlassian-issued device was in our device store, assigned to a user, present in our MDM tool and met our security checks, it would get a high-tier certificate allowing it to access high-tier applications.
If the device was a BYO mobile enrolled in our BYO MDM and met our security posture checks, it would get a low-tier certificate to access collaboration tools.
These certificates would be deployed to the secure storage of the device, such as the TPM, and slowly over time we migrated to only using platforms which had secure storage, phasing out hardware that didn’t. We could then check for the presence of these certificates in various locations, primarily in our IDP and various proxies, to determine if the user had a valid device and combine that with strong authentication.
Certificates are one of the biggest gotchas in Zero Trust; there are lots of right and wrong ways to manage and deploy them. We tinkered with a few options and in the end we settled on deploying long-lived certificates with a suspension mechanism: we would suspend the certs if a device ever went against our security posture, such as if a user disabled encryption or if malware was detected. There are many ways you can fuck up cert management. I’ve seen people deploy certificates to user keychains where they can be easily exported, I’ve seen examples where people just used MDM to delete certificates instead of revoking them and I’ve seen more than a few insecure PKI infrastructures in my time. If you have to double down on security in one area it’s here, so make sure you absolutely spend your time on this in the design phase.
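The two issuance rules can be sketched in a few lines of Python. This is a simplified illustration and not the real issuance pipeline: the `DeviceRecord` fields and the `certificate_tier` function are hypothetical names, and in practice each field would be fed from the device inventory, the MDMs and the posture checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceRecord:
    """Hypothetical merged view of inventory, MDM and posture data."""
    corporate_issued: bool
    in_inventory: bool
    assigned_user: Optional[str]
    in_corporate_mdm: bool
    in_byo_mdm: bool
    meets_posture: bool


def certificate_tier(device: DeviceRecord) -> Optional[str]:
    """Return "high", "low" or None (no certificate) for a device."""
    if not device.meets_posture:
        return None  # failing posture checks blocks both certificate types
    if (
        device.corporate_issued
        and device.in_inventory
        and device.assigned_user
        and device.in_corporate_mdm
    ):
        return "high"
    if not device.corporate_issued and device.in_byo_mdm:
        return "low"
    return None
```

Note that a corporate device missing from the inventory, or with no assigned user, gets nothing at all, which is exactly why the inventory cleanup had to come first.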
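One way to picture suspension versus revocation is as a tiny state machine, sketched below in Python. The state names and the `next_state` function are hypothetical, and two behaviours are assumptions for illustration: that a suspended certificate is reinstated once the device is back in posture, and that revocation is a separate, permanent step used when a device is retired.

```python
from enum import Enum


class CertState(Enum):
    ACTIVE = "active"
    SUSPENDED = "suspended"  # temporarily untrusted, can be reinstated
    REVOKED = "revoked"      # permanent, never reinstated


def next_state(current: CertState, posture_ok: bool, retire: bool = False) -> CertState:
    """Compute the certificate state after a posture evaluation.

    `retire` is a hypothetical flag for devices being decommissioned.
    """
    if current is CertState.REVOKED or retire:
        return CertState.REVOKED
    return CertState.ACTIVE if posture_ok else CertState.SUSPENDED
```

The key property is that trust is withdrawn by changing state on the PKI side, never by simply deleting the certificate from the device.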
User Focused Security
Once we had the device and identity elements in place we wanted to start running checks prior to accessing services. You can and should run a check every time a certificate is deployed but security posture is going to degrade over time and part of this is making users keep up with patching and required controls.
We built a tool called “Posture” which was functionally similar to Netflix Stethoscope, taking a play from Jesse Kriss and his User Focused Security playbook. We did much more hard enforcement on controls than Netflix, though, as we had stronger demands to meet from both customer requests and compliance requirements such as SOC2, HIPAA and eventually FedRAMP. This tool was essentially a web front end for employees to see their posture and a backend which stored data from various MDMs and our device inventory. It pulled those data points on a regular cadence and the employee could force a manual sync to save them from having to wait for an automated one to happen. MDMs often have fairly strict API limits and will throttle or completely block you if you call them too often. Having this data go to a centralised store was incredibly important for us to negate this problem, although if you’re using on-prem tools you could probably scale them up to real time if you needed to.
Once we had Posture in place we started to pump up the settings on the various platforms, aligning to the CIS benchmarks as much as we could. This meant checking device versions, browser versions, encryption status, rooted status and others. We had significantly stronger checks on Atlassian-issued devices, which included things like SIP, EDR tool status, osquery status and more.
Really, you can include anything you like here. In some Zero Trust designs I’ve seen people build in things like yearly security training, more stringent software security checks like connections into Google Santa and various kinds of hardware security, like using a regular certificate check to confirm the device is valid from Apple and hasn’t been tampered with. One of the big benefits to building your own systems rather than using tools like Google BeyondCorp Enterprise (their GCP SaaS product) is that you can create whatever checks you want rather than relying on the limited ones provided by a tool and their partnerships.
We built out a device security policy that was viewable by all employees which stated what was required for Zero Trust and what our general recommendations were. Not every security check was built into Posture, however; certain controls were excluded due to their nature. One example was the macOS firewall. Enabling it with full block mode can cause some issues, such as blocking AirDrop. The workaround we developed was to allow users to override this setting and simply re-enable it every day with the MDM, limiting exposure.
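The central store idea can be sketched as a small cache in front of the MDM API. This is a simplified Python illustration, not how Posture was actually built: `PostureCache`, its parameters and the intervals are hypothetical, and it refreshes lazily on read once data is older than the cadence instead of running a scheduled job.

```python
import time
from typing import Callable, Dict


class PostureCache:
    """Keeps MDM data locally so reads never hit the rate-limited MDM API directly."""

    def __init__(
        self,
        fetch: Callable[[str], dict],       # hypothetical call into an MDM API
        refresh_seconds: float = 900.0,     # assumed automated sync cadence
        min_manual_interval: float = 60.0,  # assumed floor between manual syncs
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._refresh_seconds = refresh_seconds
        self._min_manual_interval = min_manual_interval
        self._clock = clock
        self._data: Dict[str, dict] = {}
        self._fetched_at: Dict[str, float] = {}

    def _pull(self, device_id: str) -> None:
        self._data[device_id] = self._fetch(device_id)
        self._fetched_at[device_id] = self._clock()

    def get(self, device_id: str) -> dict:
        """Return stored posture data, refreshing only when it is stale."""
        last = self._fetched_at.get(device_id)
        if last is None or self._clock() - last >= self._refresh_seconds:
            self._pull(device_id)
        return self._data[device_id]

    def manual_sync(self, device_id: str) -> bool:
        """User-triggered sync; refused if one happened too recently."""
        last = self._fetched_at.get(device_id)
        if last is not None and self._clock() - last < self._min_manual_interval:
            return False
        self._pull(device_id)
        return True
```

Throttling the manual sync matters as much as the cadence itself, since an impatient user hammering the button is the easiest way to burn through an MDM’s API limit.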
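A posture evaluation along these lines might look like the following Python sketch. The field names, the minimum versions and the `failed_checks` function are all hypothetical, chosen only to mirror the checks listed above; the point is that the result is a list of plain-language failures that can be shown straight to the employee.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical minimums; real values would come from the device security policy.
MIN_OS_VERSION = (12,)
MIN_BROWSER_VERSION = (100,)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "12.4.1" into (12, 4, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def failed_checks(device: Dict[str, Any], corporate: bool) -> List[str]:
    """Return human-readable reasons a device fails posture (empty list = pass)."""
    failures: List[str] = []
    if parse_version(device["os_version"]) < MIN_OS_VERSION:
        failures.append("Operating system is out of date")
    if parse_version(device["browser_version"]) < MIN_BROWSER_VERSION:
        failures.append("Browser is out of date")
    if not device["disk_encrypted"]:
        failures.append("Disk encryption is turned off")
    if device["rooted"]:
        failures.append("Device is rooted or jailbroken")
    if corporate:
        # Stronger checks that only apply to company-issued devices.
        extra_checks = (
            ("sip_enabled", "System Integrity Protection is disabled"),
            ("edr_running", "EDR agent is not running"),
            ("osquery_running", "osquery is not running"),
        )
        for key, message in extra_checks:
            if not device.get(key, False):
                failures.append(message)
    return failures
```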
Telemetry
Now, we couldn’t have done any of the work above without telemetry. We started off in a good situation since we had network logs from our VPN which we could edit to collect data on what devices were connecting, user agent strings, operating systems and such, which gave us a good indicator of how many devices we had and what types they were. We used this data to set an OKR that 100% of devices connecting to the system were managed or trusted (MDM enrolled or BYO respectively). Over time the measured figure grew to about 90%, at which point we pulled the trigger on our hard block enforcements. We threw in exception rules for the remaining 10% and dealt with them over time, either deploying devices or dealing with use cases one by one after the fact. These exceptions often take the bulk of the time, and enabling enforcement early meant we could at least reduce the risk on the majority of the fleet.
We rolled out software such as osquery to all of our workstations and stored the resulting logs in our SIEM. We also went through a process to identify high-value systems and get logs into our SIEM for those as well; the list included MDM for device logs, Winlogbeat for Windows system event logs and various SaaS tools related to Finance, HR and others. This gave our detections team the ability to write their own detections but also gave us historical data to inform our decisions during the project. A great example of this was when we looked into the firewall problem mentioned above, where we were able to see how often users were turning it off and what they were turning it off for. This allowed us to tinker with settings to give people the best user experience possible while maintaining a decent security posture in a remote work environment.
I don’t think I can stress enough how critical this data was in informing our choices. Every change we made was backed by data; without that we would have been totally blind to many of the problems, as we would have been operating with point-in-time data from MDM reports.
In 2022 I can highly recommend using your NGAV/EDR tool of choice. Crowdstrike, for example, comes with an add-on called Falcon Data Replicator (https://github.com/CrowdStrike/FDR) which gives you the raw events from your devices and saves running yet another agent that uses resources. Using this or something like osquery combined with a SIEM or data lake will give you the data you need to investigate issues and make informed decisions. If you are looking at osquery, I’d highly recommend FleetDM to manage your deployments.
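Tracking that OKR boils down to a coverage percentage over the distinct devices seen in the logs. Here is a minimal Python sketch, assuming each log record carries a `device_id` field and that you can build a set of managed or trusted device IDs from your inventory and MDMs; both of those, along with the function names, are assumptions for the example, and the 90% threshold is simply the figure from our rollout.

```python
from typing import Dict, Iterable, Set

ENFORCEMENT_THRESHOLD = 90.0  # the coverage at which we turned on hard blocks


def managed_coverage(connections: Iterable[Dict[str, str]], trusted_ids: Set[str]) -> float:
    """Percentage of distinct connecting devices that are managed or trusted."""
    seen = {record["device_id"] for record in connections}
    if not seen:
        return 0.0
    return 100.0 * len(seen & trusted_ids) / len(seen)


def ready_to_enforce(coverage: float) -> bool:
    """True once coverage is high enough to enable blocking with exceptions."""
    return coverage >= ENFORCEMENT_THRESHOLD
```

Counting distinct devices rather than raw connections stops one chatty laptop from skewing the number.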
Part Two
Part one focused on the journey and framed the decision-making process for the various challenges we encountered along the way. Part two is going to focus more on the technical implementation and how the overall design works.