04: The Vendor Concentration Risk No One Models

Written by Sarah Clarke | Jun 24, 2026 2:00:02 PM

This is post 4 of 6 in Avertium's "The Trust Problem in Enterprise Security" blog series.

By Sarah Clarke, Consultant - QSA and AI Architect

On October 20, 2025, AWS US-EAST-1 went down for roughly fifteen hours. The root cause was a DNS resolution failure in the DynamoDB service endpoint, which cascaded into IAM, EC2, Network Load Balancer, and dozens of other services. Netflix, Snapchat, and thousands of e-commerce sites went dark. Downdetector logged over seventeen million outage reports.

This wasn't only an availability event. For every organization whose authentication, data storage, fraud detection, and security telemetry all happen to run on the same hyperscaler, the outage exposed something a procurement matrix doesn't usually capture: The same vendor was supporting a dozen distinct security and compliance dependencies, and when that vendor went down, all of them went with it.

From the assessor side of the table, this is the concentration question I find security and compliance teams least prepared to answer. They can usually tell me how many third-party service providers they use, but they struggle when I ask which of those providers actually carries the weight of the scope.

What Concentration Actually Means in Compliance Terms

Most vendor risk programs treat concentration as a procurement metric: How much of our spend goes to one vendor, how many alternatives exist in the market, how locked in we are to a particular technology stack. Those are real questions, just not the security ones.

The security and compliance question is different: For a single critical vendor, how many distinct trust-bearing roles do they hold in your environment? When your cloud provider runs your application hosting, your database, your message queue, your CI/CD, your container registry, your identity broker for service-to-service auth, your secrets manager, and your monitoring telemetry — that's one vendor holding eight roles. Each role represents an attack surface, a failure mode, and often a separate compliance dependency.

When that vendor has an incident, you experience eight failures simultaneously, across systems your scope diagram treats as separate.

The Compliance Angle Almost Nobody Audits Cleanly

PCI DSS 4.0 has two requirements that directly govern third-party service providers (TPSP), and most organizations interpret them more narrowly than the standard intends.

Requirement 12.8 says you must:

Maintain a list of TPSPs
Document the services each one provides
Monitor their compliance status
Have written agreements acknowledging their responsibilities

Requirement 12.9 places matching obligations on the TPSP side: they must acknowledge in writing their responsibility for cardholder data they store, process, or transmit.

Reading 12.8 strictly is where the gap shows up: It requires documenting which services each TPSP provides. A hyperscaler may be your IaaS provider, your IdP host, your DDoS protection, your DLP backend, your SIEM platform, and your AI service vendor, but the 12.8 list usually shows them once, even though the actual concentration is much higher.
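One way to make the list reflect that concentration is to record one entry per service a TPSP provides rather than one entry per vendor. The sketch below is a hypothetical inventory structure, not a format PCI DSS prescribes. The vendor names, field names, and service labels are all illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TpspService:
    """One row per service a TPSP provides (hypothetical schema)."""
    vendor: str
    service: str
    in_cde_scope: bool  # assumption: a simple in/out-of-scope flag


# Illustrative entries only; both vendors are made up.
inventory = [
    TpspService("ExampleCloud", "IaaS hosting", True),
    TpspService("ExampleCloud", "IdP hosting", True),
    TpspService("ExampleCloud", "DDoS protection", True),
    TpspService("ExampleCloud", "SIEM platform", False),
    TpspService("ExamplePayments", "Payment gateway", True),
]


def roles_by_vendor(rows):
    """Group services by vendor so one vendor can show many roles."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.vendor].append(row.service)
    return dict(grouped)


roles = roles_by_vendor(inventory)
for vendor, services in roles.items():
    print(f"{vendor}: {len(services)} role(s) -> {', '.join(services)}")
```

Listed this way, a vendor that appears once on a per-vendor list shows up with every role it holds.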

HIPAA's analog is business associate (BA) concentration. When your business associate agreement (BAA) covers a hyperscaler that hosts your electronic health record (EHR), your patient communication app, your billing system, and your AI summarization service, you have a single BA holding four distinct ePHI-touching roles. That consolidation amplifies the compliance obligation: A single incident triggers multiple breach assessments simultaneously.

The AI Vendor Concentration Layer

The AI angle to this is recent enough that most concentration analyses don't include it yet. It also compounds the AI agent scope problem covered in the previous post: Each agent pulls scope on its own, and the model vendors running them are clustering on the same handful of hyperscalers.

In most enterprises, AI capabilities flow through two or three model vendors. The diversity looks real on a procurement chart, but it often disappears at the infrastructure layer. OpenAI runs on Azure, Anthropic on AWS and GCP, Google's models on GCP. An organization that signs contracts with two model vendors but routes inference for both through the same hyperscaler has added a contract, not a second path.
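A quick way to check this is to collapse each model contract down to the hyperscaler that actually serves its inference. The sketch below assumes you can establish that mapping for your own deployments. The vendor and cloud names are placeholders, and the mapping is something to verify per contract rather than take from this example.

```python
# Assumed mapping: model vendor -> hyperscaler serving inference for
# this organization. All names are placeholders.
inference_host = {
    "ModelVendorA": "CloudX",
    "ModelVendorB": "CloudX",
    "ModelVendorC": "CloudY",
}


def effective_diversity(hosting):
    """Return (number of model contracts, number of distinct hosts)."""
    contract_count = len(hosting)
    infra_count = len(set(hosting.values()))
    return contract_count, infra_count


contracts, infra = effective_diversity(inference_host)
print(f"{contracts} model contracts, {infra} underlying hyperscaler(s)")
```

If the second number is smaller than the first, some of the diversity on the procurement chart is not there at the infrastructure layer.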

There's a related issue at the model-vendor layer itself. Most AI applications inside an enterprise call out to one or two model endpoints. A model vendor outage, a policy change, or a price shock cascades into every product that depends on it. The blast radius scales with how broadly the model is adopted internally, and few organizations track this the way they would track an outage on a critical SaaS application.

For an AI architect designing for resilience, the practical question is whether the abstraction layer between your applications and your models is real or theatrical. If swapping vendors requires you to rewrite prompts, retrain evaluators, and reconfigure tool definitions across every product, the abstraction wasn't real; it was a single-vendor dependency with a fallback that fails under pressure.

How to Quantify This for a Board

Board-level conversations about vendor risk tend to live in two registers, financial exposure and contractual diversification, and neither captures what I'm describing. A useful third register is concentration of scope.

A simple version of the metric: For each critical service in your environment, list the vendor that provides it. For each vendor, count the distinct critical services they support. The vendor with the highest count is your concentration risk; the number itself is your headline figure.
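As a minimal sketch, the simple version is a count over a service-to-vendor map. The service and vendor names below are hypothetical, and a spreadsheet does the same job.

```python
from collections import Counter

# Hypothetical map of critical service -> vendor that provides it.
service_vendor = {
    "application hosting": "CloudX",
    "database": "CloudX",
    "secrets manager": "CloudX",
    "identity provider": "IdPCo",
    "payment gateway": "PayCo",
}


def concentration_headline(mapping):
    """Return the top vendor, its service count, and all counts.

    On a tie, Counter.most_common keeps the vendor seen first.
    """
    counts = Counter(mapping.values())
    vendor, count = counts.most_common(1)[0]
    return vendor, count, counts


top_vendor, headline, all_counts = concentration_headline(service_vendor)
print(f"{top_vendor} supports {headline} of {len(service_vendor)} critical services")
```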

A more rigorous version adds two dimensions. First, weight each service by the scope it sits in (CDE, ePHI environment, sensitive-data zones). A vendor running ten dev-tooling services has a different risk profile than a vendor running five CDE-adjacent services. Second, model blast radius: If the vendor fails for a defined duration, how many of your compliance obligations are temporarily unmet, and what's your time-to-recover for each?
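Both dimensions can be sketched in a few lines. Everything specific here is an assumption to calibrate against your own environment: the scope weights, the service list, the obligation labels, and the rule that an obligation's gap is the longer of its supporting services' gaps, capped at the outage duration.

```python
from dataclasses import dataclass

# Assumed weights per scope zone; placeholder numbers, not a standard.
SCOPE_WEIGHT = {"cde": 5, "ephi": 5, "sensitive": 3, "dev": 1}


@dataclass(frozen=True)
class Service:
    name: str
    vendor: str
    scope: str
    recover_hours: int   # estimated time to restore without the vendor
    obligations: tuple   # compliance obligations resting on this service


# Hypothetical services and obligation labels.
services = [
    Service("dev CI", "ToolCo", "dev", 2, ()),
    Service("dev registry", "ToolCo", "dev", 2, ()),
    Service("dev wiki", "ToolCo", "dev", 1, ()),
    Service("CDE database", "CloudX", "cde", 24,
            ("stored-data protection", "audit logging")),
    Service("CDE log pipeline", "CloudX", "cde", 8, ("audit logging",)),
]


def weighted_score(items, vendor):
    """Sum scope weights over the services a vendor supports."""
    return sum(SCOPE_WEIGHT[s.scope] for s in items if s.vendor == vendor)


def blast_radius(items, vendor, outage_hours):
    """Map each affected obligation to the hours it stays unmet."""
    gaps = {}
    for s in items:
        if s.vendor != vendor:
            continue
        gap = min(outage_hours, s.recover_hours)
        for obligation in s.obligations:
            gaps[obligation] = max(gaps.get(obligation, 0), gap)
    return gaps


print(weighted_score(services, "ToolCo"), weighted_score(services, "CloudX"))
print(blast_radius(services, "CloudX", outage_hours=15))
```

In this made-up data, three dev-tooling services score lower than two CDE services, which is the distinction the weighting is meant to surface.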

The output is a single artifact you can take into a board meeting: What percentage of regulated data flows depend on a single vendor, how much of your scope depends on one vendor's continued operation, and which three vendors have the highest cross-scope concentration. That's a conversation a board can act on.

What Compensating Architecture and Contracts Look Like

The architecture and contractual answers run in parallel, because the problem has both dimensions.

A few patterns worth considering:

Recognize when "multi-cloud" is theater. A multi-cloud strategy that runs primary workloads on one provider and "disaster recovery" on a second provider, with no live traffic, is concentration with extra accounting. Real diversification means cross-vendor failover that's exercised regularly enough to prove it works.
Identify your irreducible concentration points. Some categories of dependency are genuinely hard to diversify, such as your identity provider, your primary cloud, and your AI model vendor. For those, the answer is harder-edged contracts and tested incident plans rather than more vendors.
Negotiate scope-specific responsibility into the contract. PCI DSS 12.9 says the TPSP must acknowledge their responsibility in writing. The contract should push that further by specifying which scopes the vendor's services sit in, what their notification obligations are by scope, and what they'll provide for your assessor in the event of an incident.
Build a real model abstraction for AI. Separate the application layer from the model vendor layer with a translation interface, an eval suite that runs against multiple models, and prompt templates that don't bake in vendor-specific behavior. The goal is to be able to switch a model under pressure without rewriting the application.
Map dependency overlap, then design specific contingencies. For each top concentration vendor, write down what the first hour, first day, and first week of their outage would look like for your compliance posture. Most organizations have never written this down for the obvious top vendor, let alone the second or third.
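For the model abstraction pattern above, a minimal sketch might look like the following. It is an illustration under stated assumptions, not a reference design: `ModelClient`, the two stub adapters, the template, and the eval cases are all hypothetical, and real adapters would wrap each vendor's SDK behind the same interface.

```python
from typing import Callable, Protocol


class ModelClient(Protocol):
    """Vendor-neutral interface the application codes against."""

    def complete(self, prompt: str) -> str: ...


class StubVendorA:
    """Stand-in adapter; a real one would call vendor A's SDK."""

    def complete(self, prompt: str) -> str:
        return f"[A] {prompt.upper()}"


class StubVendorB:
    """Stand-in adapter for a second vendor."""

    def complete(self, prompt: str) -> str:
        return f"[B] {prompt.upper()}"


# Prompt template with no vendor-specific tokens or formatting.
SUMMARY_TEMPLATE = "Summarize in one sentence: {text}"


def summarize(client: ModelClient, text: str) -> str:
    """Application code depends only on the interface."""
    return client.complete(SUMMARY_TEMPLATE.format(text=text))


# Each eval case: an input and a check on the output (toy checks here).
EVAL_SUITE: list[tuple[str, Callable[[str], bool]]] = [
    ("outage report", lambda out: "OUTAGE REPORT" in out),
    ("vendor list", lambda out: len(out) > 0),
]


def run_evals(clients: dict[str, ModelClient]) -> dict[str, bool]:
    """Run the same suite against every configured model."""
    return {
        name: all(check(summarize(client, text)) for text, check in EVAL_SUITE)
        for name, client in clients.items()
    }


results = run_evals({"vendor_a": StubVendorA(), "vendor_b": StubVendorB()})
print(results)
```

The point of running one suite against every adapter is that a swap under pressure becomes a configuration change backed by evidence, not a rewrite.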

An Exercise for You: What to Do This Week

A vendor concentration exercise. No tooling required.

1. Build a matrix: Down one axis, list the critical services in your environment that:

Touch regulated data
Hold privileged trust, or
Could affect compliance scope

2. Across the top, list every vendor that supports any of those services.

3. Fill in the cells where a vendor provides a service.

Now look at the columns. The vendor with the most filled-in cells is your concentration link.

4. Beside their column, write three things:

How many of your scopes (PCI, HIPAA, SOC 2) they sit in
What your contractual recourse looks like during an incident
How long you've gone since last exercising a real failover
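Pen and paper or a spreadsheet is enough for this. If you would rather script it, the sketch below renders the same matrix as plain text and totals each vendor column. The services, vendors, and filled cells are hypothetical examples.

```python
# Hypothetical inputs: rows are critical services, columns are vendors.
services = ["auth", "cardholder DB", "fraud scoring", "SIEM"]
vendors = ["CloudX", "IdPCo", "PayCo"]
provides = {
    ("auth", "IdPCo"),
    ("auth", "CloudX"),
    ("cardholder DB", "CloudX"),
    ("fraud scoring", "PayCo"),
    ("SIEM", "CloudX"),
}


def render_matrix(rows, cols, cells):
    """Build a text grid with an X wherever a vendor provides a service."""
    width = max(len(r) for r in rows)
    lines = [" " * width + " | " + " | ".join(cols)]
    for r in rows:
        marks = [("X" if (r, c) in cells else "").center(len(c)) for c in cols]
        lines.append(r.ljust(width) + " | " + " | ".join(marks))
    return "\n".join(lines)


def column_totals(cols, cells):
    """Count filled cells per vendor column."""
    return {c: sum(1 for _, v in cells if v == c) for c in cols}


print(render_matrix(services, vendors, provides))
totals = column_totals(vendors, provides)
print(totals)
```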

For most organizations, the headline number is uncomfortable, the recourse documentation is sparse, and the failover hasn't been tested in years. That's the artifact a board should be seeing once a quarter, and the artifact an assessor will eventually ask about under PCI DSS 4.0's tightened third-party provisions.

Building it now is much cheaper than building it during the next outage.

Stay tuned for our fifth post in the series, “Where the Frameworks Fall Short,” which will publish on July 1, 2026 at 10:00 am EST.
