Enterprise AI search tools are a simple concept: they take in the data from all of your productivity tools and give you a single pane of glass to search across your company’s entire corpus. This lets you search all of your documents, email, and chat regardless of which tool they are stored in. The productivity benefits are apparent here, because who hasn’t spent half an hour searching for that doc they made three years ago? Or, better yet, digging through company policies to work out whether you can install video games on your work laptop (asking for a friend here).

Several tools exist in this category, and the market is shifting rapidly. Glean, Atlassian Rovo, and Guru are the more prominent ones, but there are plenty of options in this space, both big and small. Right now, there is very much a race for the enterprise search market because of how dispersed information is within modern companies.

These tools, though, act like scarecrows to ward off security engineers. There’s nothing more terrifying to us than giving a single tool access to every data store in the entire company; doing so throws the ideas of segmentation and risk-surface reduction out the window. In an absolute worst-case scenario, a compromise of one of these tools could give a malicious actor access to every single data store in your company.

Having looked at a few tools like this over the past few years with the rise of AI, I decided to write a detailed guide to threat modeling them, covering topics such as:

  • How to make risk-based decisions when evaluating enterprise search.
  • Actions you can take to reduce risk to a tolerable level.
  • Hidden caveats you might not catch when doing an initial evaluation.

Productivity vs Security

First and foremost, I recommend genuinely considering the benefits of tools like this for yourself and your organization. No doubt, when you look at these tools, that inner security voice will whisper in your head: “This is bad”. You’ll have to wrestle with that demon and think about this in terms of business risk and business productivity. At the end of the day, security is there to reduce risk to a tolerable level, and your focus should be on communicating the risk, the tradeoffs, and the controls.

Decision One - Risk Analysis

The first decision you’ll need to make is whether to proceed with procuring one of these tools. As mentioned above, you must ensure these tools are viable in your organization and won’t get you into hot water with your customers or go against your compliance requirements.

Particular areas of focus would be:

  • Are these tools up to your compliance standards if you adhere to SOC 2, HIPAA, FedRAMP, or any advanced customer requirements like data residency?
  • What sort of operational cost will this have in the long term?
  • What investments in people and controls need to be made before going live?
  • Are there certain tools you can’t connect, which could limit the product’s effectiveness? (The reasons for this are explained in the later decisions in this post.)
  • Of course, you should also look at the viability of the product and whether the investment is worth the productivity outcomes, as you would with any other vendor.

The main deciding factor in this decision will be whether you are OK with centralizing your data into this single vendor. Before you make this decision, you should list all of the applications you plan to integrate and check the level of confidentiality of the data within each of these applications, which we’ll talk about in Decision Three below.

Decision Two - Cloud or On-Prem

Some of these vendors allow you to deploy an “on-prem” version, which is a self-hosted version of their tool in your cloud infrastructure. This comes with the benefit that you can maintain it yourself if required and limit access to company employees. A compromise of the vendor’s staff should have little downstream impact on you, but there are important considerations here, and self-hosted deployments can be less segmented than you think.

Operational Cost - Hosting your own version of anything carries a significant operational cost; there’s a reason we mostly use SaaS these days rather than doing everything ourselves. You’ll need to stand up the infrastructure, update it, maintain it, and more. You can have the vendor help set it up and maintain it, granting them access only on demand; this reduces the risk somewhat, since they won’t have long-lived standing access.

SaaS Connectors - If you use the cloud version, at least some people at the vendor will likely have access to your API keys unless there is some form of BYOK capability. That said, even with on-prem, these tools sometimes still need to contact their cloud infrastructure. Glean, for example, has a “scio-proxy” service that is required for certain vendors like Atlassian, so those API keys would be stored on Glean’s infrastructure rather than your on-prem service, limiting its effectiveness if you want to connect Atlassian tools.

Long-Term Designs - Many vendors do not like to support on-prem in the long term, preferring to solidify their cloud offering to the point where all customers use it. SaaS is very much the primary goal for most of these vendors due to recurring revenue, and there’s too much negative press along the lines of “X vendor hacked!” when it’s actually just an exposed on-prem customer instance that hasn’t been updated in ten years.

There is no right option here: on-prem can reduce your risk surface area with respect to the vendor, but you need to look deeply at whether it will actually reduce risk in the right areas for you. SaaS is likely the best bet in the long term, but some of these products may not meet your SaaS requirements yet, as it’s a new space.

You could also use some of these vendors’ components, or generic AI tools like Claude, to build a tool like this yourself. Still, you’ll likely need to invest significant resources into access control limitations and into creating new connectors if you have a large SaaS sprawl.

Decision Three - Connected SaaS Applications

You might have customer data in Jira tickets, legal contracts in DocuSign, or people’s DMs from Slack. If you plan to connect these applications to the search tool, you need to feel confident in the search vendor’s ability to store this level of confidential data.

The best plan of action for you is to look at:

  1. What internal applications do you have that can be integrated?
  2. What classification or risk is there for the data stored within those applications?
  3. What is the cost-benefit tradeoff in connecting these? Are people going to search that dataset significantly?

From here, you can decide on a list of applications you want to connect and ones where you feel the risk is too high.
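
One lightweight way to run this exercise is a scoring worksheet that weighs each application’s data classification against its expected search value. The sketch below is purely illustrative; the app names, classification levels, and connect threshold are assumptions, not recommendations:

```python
# Sketch: a crude cost-benefit worksheet for deciding which applications to
# connect. App names, classification levels, and the threshold are illustrative.
CLASSIFICATION_RISK = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

apps = [
    # (application, highest data classification, expected search value 0-3)
    ("Confluence", "internal", 3),
    ("Jira", "confidential", 2),
    ("Slack DMs", "restricted", 1),
    ("DocuSign", "restricted", 0),
]

for name, classification, search_value in apps:
    score = search_value - CLASSIFICATION_RISK[classification]
    verdict = "connect" if score >= 0 else "hold back for deeper review"
    print(f"{name:12} risk={CLASSIFICATION_RISK[classification]} value={search_value} -> {verdict}")
```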

These tools generally maintain access control (i.e., you cannot access documents you didn’t already have access to). Still, a significant access control bug or internal misconfiguration could cause issues here, however unlikely.

Control Levers

The three decisions above are what I would call the surface-level risk analysis. Ideally, this is done before you buy the tool, but of course, it’s not unheard of for people to buy first and ask questions later.

Once you’ve done that, you’ll want to conduct a more detailed threat modeling session and delve deeper into your particular setup. These tools tend to have few controls you can enable, but there are some levers you can tinker with to get to your desired level of security.

Access: You can access these tools through an application, browser extension, mobile app, Slack bot, or a custom portal you’ve built. Of course, you can also limit these as a form of control.

Integrations: As mentioned above, your primary control is what data you connect. If you have an ultra-sensitive data store, simply don’t connect it.

SaaS, On-Prem, or Self-Build: Again, you can control the deployment mechanism. You can always pivot partway through the project, but this will come at a cost.

On-Prem Customisation: With an on-prem model, you can control the vendor’s level of access, potentially locking down security groups and firewall rules in your cloud infrastructure and locking out vendor accounts. It is worth verifying how this works with the vendor in advance, as they all have different setup requirements. Do not assume there is no access; usually it’s timeboxed, limited access.
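
As a rough illustration of what timeboxing vendor access can look like, here’s a minimal sketch assuming an AWS deployment; the security group ID and vendor CIDR are hypothetical placeholders, and the same idea applies to any cloud’s firewall primitives:

```python
# Sketch: timebox vendor access by only opening their ingress rule during an
# approved support window. The security group ID and vendor CIDR below are
# hypothetical placeholders.
import boto3

VENDOR_SG_ID = "sg-0123456789abcdef0"  # SG guarding the on-prem search instance
VENDOR_CIDR = "203.0.113.0/24"         # vendor support IP range

ec2 = boto3.client("ec2")

def grant_vendor_access():
    """Open HTTPS to the vendor at the start of an approved support window."""
    ec2.authorize_security_group_ingress(
        GroupId=VENDOR_SG_ID,
        IpPermissions=[{
            "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
            "IpRanges": [{"CidrIp": VENDOR_CIDR,
                          "Description": "vendor support - timeboxed"}],
        }],
    )

def revoke_vendor_access():
    """Close the rule again when the window ends (run from a scheduler)."""
    ec2.revoke_security_group_ingress(
        GroupId=VENDOR_SG_ID,
        IpPermissions=[{
            "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
            "IpRanges": [{"CidrIp": VENDOR_CIDR}],
        }],
    )
```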

API Keys: You control the service accounts for the integrations, and these tools likely have a set of dedicated IPs. Using these service accounts, you can write detections looking for malicious activity in the absence of preventative controls.
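
For example, here’s a minimal detection sketch; the vendor IP range, log record shape, and service account name are all assumptions, and in practice you’d express this in your SIEM’s query language over the audit logs of each connected application:

```python
# Sketch: flag connector service-account activity originating outside the
# search vendor's published IP ranges. The IP range, account name, and log
# record shape are assumptions - adapt to your SIEM and audit log schema.
import ipaddress

VENDOR_IP_RANGES = [ipaddress.ip_network("198.51.100.0/24")]  # hypothetical
CONNECTOR_ACCOUNTS = {"svc-enterprise-search"}                # hypothetical

def is_vendor_ip(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in VENDOR_IP_RANGES)

def suspicious_events(auth_events):
    """auth_events: iterable of dicts like {"actor": ..., "source_ip": ...}."""
    for event in auth_events:
        if event["actor"] in CONNECTOR_ACCOUNTS and not is_vendor_ip(event["source_ip"]):
            yield event  # candidate alert: key possibly stolen or replayed

alerts = list(suspicious_events([
    {"actor": "svc-enterprise-search", "source_ip": "198.51.100.7"},  # expected
    {"actor": "svc-enterprise-search", "source_ip": "192.0.2.44"},    # alert
]))
```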

Threat Modeling

Below I’ve included two sections: a selection of what I perceive to be the top risks when deploying these systems, and a basic diagram of a typical deployment. The risk list isn’t exhaustive and will, of course, depend heavily on the tool, architecture, etc. I’m including it as a baseline of things you should consider, some of which are edge cases you could easily overlook.

Top Risks

Zero Trust Bypass (Personal Device Usage)

Threat: In zero trust designs, applications are often tiered into categories depending on risk, and controls are applied to each application when authenticating via SAML. Enterprise search tools crawl data from many sources, potentially giving a bypass to those controls: with a misconfiguration, employees could search sensitive data stores from an unmanaged device.

Note that these tools often provide browser extensions, mobile apps, and Slack/Teams connectors for chat. These should all be considered, as they can open up more risk here.

Mitigations:
  • Preventative: Subject these tools to the same rules as the most confidential application you connect to them.
  • Build a strategy around how people interact with these tools.

Supply Chain Compromise

Threat: A compromise of an employee or system at the vendor could give access to search data or, in a worst-case scenario, to stored API keys for connectors.

Mitigations:
  • Preventative: Use an on-prem architecture with locked-down ingress/egress controls.
  • Detective: Build detection rules that look for API key usage from unexpected locations (see the sketch under API Keys above).

Over Permissive Access

Threat: Enterprise search tools generally respect the permissions of the target application. The caveat is that these tools make data inherently more discoverable. If you have private documents in Google Drive open to the entire company, the risk of someone discovering them becomes more pronounced.

Mitigations:
  • Preventative: Complete a DLP audit of connected applications before going live (a minimal audit sketch follows this entry).
  • Consider implementing a blocklist of protected search terms in the enterprise search tool.
  • Detective: Have a response playbook to remove access to data when it is discovered and reported.

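A DLP audit will look different for every application, but as one example, this sketch lists Google Drive files discoverable by the whole domain using the google-api-python-client library; credential handling is elided and assumed to already exist:

```python
# Sketch: find Drive files discoverable by the entire domain before the search
# tool makes them trivially findable. Assumes google-api-python-client and an
# already-obtained admin credential object ("creds").
from googleapiclient.discovery import build

def list_domain_visible_files(creds):
    drive = build("drive", "v3", credentials=creds)
    query = "visibility = 'domainCanFind' or visibility = 'domainWithLink'"
    page_token = None
    while True:
        resp = drive.files().list(
            q=query,
            fields="nextPageToken, files(id, name, owners)",
            pageToken=page_token,
        ).execute()
        for f in resp.get("files", []):
            yield f  # review before the search tool indexes it
        page_token = resp.get("nextPageToken")
        if page_token is None:
            break
```
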
Privilege Escalation

Threat: One of the worst possible scenarios for tools like this is allowing people to access data they shouldn’t have. Generally, these tools build their own permission models, and in my personal testing they seem to hold up, but of course, bugs can always occur.

Mitigations:
  • Preventative: Cursory testing of each new integration’s permission model prior to connection (a test sketch follows this entry).
  • Ensure you use minimally scoped connectors, and follow best-practice guidance for each integration.

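A cursory test can be as simple as querying the search tool as two users with different entitlements and asserting that a restricted document only appears for the right one. The endpoint, token handling, and response shape below are hypothetical placeholders for your tool’s actual API:

```python
# Sketch: verify a restricted document is only searchable by users who can
# already access it in the source system. The /api/search endpoint, bearer
# tokens, and response shape are hypothetical placeholders.
import requests

SEARCH_URL = "https://search.example.internal/api/search"  # hypothetical

def search_as(user_token: str, query: str) -> set[str]:
    resp = requests.post(
        SEARCH_URL,
        headers={"Authorization": f"Bearer {user_token}"},
        json={"query": query},
        timeout=30,
    )
    resp.raise_for_status()
    return {doc["id"] for doc in resp.json()["results"]}

def check_restricted_doc(doc_id, query, allowed_token, denied_token):
    assert doc_id in search_as(allowed_token, query), \
        "authorized user should find the document"
    assert doc_id not in search_as(denied_token, query), \
        "privilege escalation: unauthorized user can find the document"
```
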
Synchronization Delays

Threat: These tools pull data from many different sources. If you remove access to a file, you want to ensure access is also removed promptly in the enterprise search tool.

Mitigations:
  • Preventative: Test the synchronization delay prior to connecting integrations (a measurement sketch follows this entry).

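Measuring the delay can reuse the hypothetical search_as() helper from the previous sketch: revoke access in the source application, then poll until the document stops appearing for the revoked user:

```python
# Sketch: after revoking a user's access in the source application, measure how
# long the document remains searchable for them. Reuses the hypothetical
# search_as() helper from the previous sketch.
import time

def measure_revocation_delay(doc_id, query, revoked_user_token,
                             poll_seconds=60, max_wait_seconds=86400):
    """Returns seconds until doc_id disappears from the revoked user's results."""
    start = time.monotonic()
    while time.monotonic() - start < max_wait_seconds:
        if doc_id not in search_as(revoked_user_token, query):
            return time.monotonic() - start
        time.sleep(poll_seconds)
    raise TimeoutError("document still searchable after revocation window")
```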

Privacy Assessment

Threat: Connectors for these tools pull in all data from the target. Depending on what the target application is, this could contain customer data, usernames, addresses, personal messages, and more. Even connecting Slack to these tools can expose private channels and DMs to the search tool, which can be incredibly sensitive. While this data is not usually directly accessible, even to the tool’s administrators, some information will, of course, be stored in the database backing search queries.

Mitigations:
  • Preventative: Conduct a full privacy assessment in advance. Select what data you are willing to connect and whether anything requires consent.
  • These tools often have the ability to block certain types of data from being ingested or, alternatively, offer an opt-in allowlist for things like personal messages and email.

Session Token Theft

Threat: This is an age-old classic, but the risk is elevated with enterprise search tools: an attacker doesn’t need to get into every tool, but can query all of them from just one if a personal API token for the enterprise search tool is leaked.

Mitigations:
  • Preventative: Implement sane session lengths based on the risk.
  • Detective: Build controls looking for abnormal behavior (a crude volume-based sketch follows this entry).
  • Build playbooks to revoke a session token in an emergency.

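“Abnormal behavior” is tool-specific, but even a crude per-user volume baseline can catch bulk exfiltration through a stolen token. A minimal sketch, assuming you can export hourly query counts per user from the tool’s audit log:

```python
# Sketch: flag users whose latest hourly query volume far exceeds their own
# baseline - a crude signal of a stolen token bulk-querying the index.
# Assumes query_counts maps user -> list of hourly counts from the audit log.
from statistics import mean, stdev

def flag_query_spikes(query_counts: dict[str, list[int]], sigma: float = 4.0):
    for user, counts in query_counts.items():
        if len(counts) < 24:  # need a day of baseline before judging
            continue
        baseline, spread = mean(counts[:-1]), stdev(counts[:-1])
        if counts[-1] > baseline + sigma * max(spread, 1.0):
            yield user, counts[-1], baseline
```
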
AI Training

Threat: The data searched and stored by these tools is generally more sensitive than what you would put into something like ChatGPT. The risk is that the enterprise search tool either trains its internal models on this data or potentially sends the data to a downstream vendor that does.

Mitigations:
  • Preventative: Build robust controls into your contract at the time of purchase to prevent this.



Diagram

Below is a very basic design of a cloud-hosted enterprise search model that shows the general flow of data in a typical search query. This, of course, depends heavily on the tool and is oversimplified for ease of understanding.

I haven’t included a self-built or on-prem design, but these would be more complex. You have much more in the way of potential controls since you can follow normal on-prem design with network segmentation, security groups, IP controls, and infrastructure hardening, which can reduce the risk of supply chain compromise and external access if done correctly.

[Threat model diagram showing the workflow of how enterprise AI search tools work]