Blog: Dark Data, Dark Risks: What You Don’t See Can Hurt You

Ahmad Fida August 21, 2025

Most enterprises now run on content that doesn’t sit neatly in databases. IDC reports that ~90% of enterprise data is unstructured (documents, emails, images, transcripts, and so on). The same analysis notes that much of this unstructured data is “dark”: organizations don’t know what they have, why they have it, or how long to keep it.

At the same time, only about a third of the data available to enterprises is actually put to work. An IDC study for Seagate found that firms capture only ~56% of potentially available data and use just ~57% of what they capture. Net: 0.56 × 0.57 ≈ 0.32, so ~32% is used and ~68% goes unleveraged.

Why this matters in 2025: attackers increasingly pair encryption with extortion and data theft. Verizon’s 2024 DBIR shows that ransomware and other extortion account for ~32% of breaches and rank among the top threats in 92% of industries. Unseen, poorly governed data is exactly what adversaries look to steal or threaten to expose.

What Makes “Dark Data” So Risky?

Unknown sensitivity: If you haven’t discovered and classified it, you can’t prove control over PII, financials, or IP. IDC explicitly highlights that unstructured repositories often contain redundant or obsolete content and dark data that organizations have no visibility into.
AI amplifies the problem (or fixes it, if governed): IDC notes that retrieval-augmented generation (RAG) can reduce hallucinations and improve accuracy, but only if the underlying unstructured content is high-quality and governed. Poor-quality content means poor AI outcomes.
Growth outpaces control: IDC projects that unstructured data will grow at a ~30% CAGR over the next five years, which widens the governance gap if you don’t act.
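
To make the growth figure concrete, the short Python sketch below compounds a ~30% annual rate over five years. The 100 TB starting volume is a made-up illustration, not an IDC figure, and `project_volume` is a hypothetical helper.

```python
def project_volume(start_tb: float, cagr: float, years: int) -> list[float]:
    """Return the projected volume for year 0..years under compound annual growth."""
    return [start_tb * (1 + cagr) ** y for y in range(years + 1)]

# Hypothetical starting point: 100 TB of unstructured data today.
volumes = project_volume(start_tb=100.0, cagr=0.30, years=5)
for year, tb in enumerate(volumes):
    print(f"year {year}: {tb:7.1f} TB")

# Ratio of the final year to today, independent of the starting volume.
growth_multiple = volumes[-1] / volumes[0]
print(f"growth over 5 years: {growth_multiple:.2f}x")
```

At that rate the estate grows to roughly 3.7 times its current size in five years, so any backlog of undiscovered or unclassified content grows with it unless discovery keeps pace.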

A Practical Playbook

Discover & classify continuously
Treat unstructured stores as first-class citizens. Crawl shares, mailboxes, wikis, collaboration tools, and object storage; classify sensitivity and ownership. IDC’s guidance is clear: AI success depends on unstructured data quality—accuracy, completeness, recency, and context.
Correlate identity, data, and network signals
When someone touches old, sensitive archives for the first time, you want the who (identity behavior), the what (data movement), and the where (lateral paths/exfil channels) in one incident view. This is how you catch quiet staging before it becomes extortion—the pattern DBIR highlights.
Harden AI use with governed content
If you’re rolling out copilots or LLMs, ground them in curated, high-quality unstructured data via RAG, and guard against sensitive information disclosure, a risk called out in the OWASP Top 10 for LLM Applications.
Measure resilience, not just “blocks”
Track time to correlate identity+data+network into one case, time to contain the user/host/data path, and blast radius (datasets/identities affected). These metrics demonstrate control even when incidents occur.
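
The “Discover & classify” and “Harden AI use” items above can be sketched together: label each document’s sensitivity, then admit only cleared, owned documents into the corpus a copilot retrieves from. This is a minimal illustration, not any vendor’s method. The two regex patterns, the `Document` type, and the `classify` and `build_rag_corpus` helpers are all hypothetical, and real classifiers rely on far richer detection (validators, context, ML models) than two regular expressions.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Illustrative patterns only; production classifiers need validation and context.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

@dataclass
class Document:
    path: str
    text: str
    owner: Optional[str] = None          # None means nobody has claimed the content
    labels: set = field(default_factory=set)

def classify(doc: Document) -> Document:
    """Attach one label per pattern that matches the document text."""
    doc.labels = {name for name, rx in PATTERNS.items() if rx.search(doc.text)}
    return doc

def build_rag_corpus(docs: list, blocked: set) -> list:
    """Keep documents that have an owner and carry no blocked label."""
    return [d for d in map(classify, docs) if d.owner and not (d.labels & blocked)]

docs = [
    Document("wiki/onboarding.md", "Welcome! Read the handbook first.", owner="hr"),
    Document("share/payroll.txt", "Jane Doe, SSN 123-45-6789", owner="finance"),
    Document("archive/old_notes.txt", "Misc notes, author unknown."),  # dark: no owner
]
corpus = build_rag_corpus(docs, blocked={"us_ssn"})
print([d.path for d in corpus])
```

Only the onboarding page reaches the retrieval corpus here: the payroll file is excluded by its label, and the unowned archive file is excluded because nobody can vouch for it. Filtering before indexing, rather than at answer time, is one design choice among several; it keeps sensitive text out of the retrieval store entirely.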
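
The three resilience metrics above are straightforward to compute once each incident is recorded with a few timestamps. The sketch below assumes a hypothetical `Case` record; the field names and sample values are invented for illustration, and measuring containment from the first signal (rather than from correlation) is an assumption, since the metric can be defined either way.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Case:
    first_signal_at: datetime   # earliest identity, data, or network alert
    correlated_at: datetime     # when the signals were merged into one case
    contained_at: datetime      # when the user/host/data path was isolated
    datasets: set               # datasets touched during the incident
    identities: set             # accounts involved

    @property
    def time_to_correlate(self) -> timedelta:
        return self.correlated_at - self.first_signal_at

    @property
    def time_to_contain(self) -> timedelta:
        # Assumption: the clock starts at the first signal, not at correlation.
        return self.contained_at - self.first_signal_at

    @property
    def blast_radius(self) -> dict:
        return {"datasets": len(self.datasets), "identities": len(self.identities)}

# Invented sample incident.
case = Case(
    first_signal_at=datetime(2025, 8, 1, 9, 0),
    correlated_at=datetime(2025, 8, 1, 9, 12),
    contained_at=datetime(2025, 8, 1, 9, 47),
    datasets={"hr-archive-2017", "finance-share"},
    identities={"svc-backup"},
)
print(case.time_to_correlate, case.time_to_contain, case.blast_radius)
```

Tracking these values per incident and trending them over time gives a view of control that a raw count of blocked events cannot.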

Where LinkShadow Fits

Data Visibility (DSPM-style): Continuous discovery/classification to surface dark and sensitive data across on-prem, cloud, and SaaS.
Identity Analytics (ITDR): Spot unusual post-authentication behavior—e.g., a service account or user suddenly accessing an archive they’ve never touched.
Network Analytics (NDR): Detect staging and exfil paths (odd SMB/RDP bursts, suspicious egress) that often accompany data-theft-led extortion.
One correlated incident: Merge identity + data + network into a single storyline so you can isolate the right user/host/data path fast and limit impact—exactly what resilience demands.
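
As a vendor-neutral illustration of the correlation idea (not a description of how LinkShadow implements it), the sketch below joins hypothetical event feeds: it flags an identity that reads a sensitive dataset it has never accessed before, then attaches any large network egress from the same host within a time window to form one incident. Every name, field, and threshold here is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class AccessEvent:      # identity + data signal
    when: datetime
    identity: str
    host: str
    dataset: str

@dataclass(frozen=True)
class EgressEvent:      # network signal
    when: datetime
    host: str
    bytes_out: int

def correlate(accesses, egress, sensitive, history,
              window=timedelta(hours=1), min_bytes=50_000_000):
    """Return incidents: first-time sensitive access followed by large egress.

    `history` maps identity -> datasets seen before; it is updated in place.
    """
    incidents = []
    for a in sorted(accesses, key=lambda e: e.when):
        first_time = a.dataset not in history.get(a.identity, set())
        if a.dataset in sensitive and first_time:
            related = [n for n in egress
                       if n.host == a.host
                       and a.when <= n.when <= a.when + window
                       and n.bytes_out >= min_bytes]
            if related:
                incidents.append({"who": a.identity, "what": a.dataset,
                                  "where": a.host,
                                  "bytes_out": sum(n.bytes_out for n in related)})
        history.setdefault(a.identity, set()).add(a.dataset)
    return incidents

t0 = datetime(2025, 8, 1, 2, 0)
history = {"svc-backup": {"backups"}, "alice": {"hr-archive-2017"}}
accesses = [
    AccessEvent(t0, "svc-backup", "srv-07", "hr-archive-2017"),  # never touched before
    AccessEvent(t0, "alice", "lap-22", "hr-archive-2017"),       # routine access
]
egress = [EgressEvent(t0 + timedelta(minutes=20), "srv-07", 120_000_000)]
incidents = correlate(accesses, egress, sensitive={"hr-archive-2017"}, history=history)
print(incidents)
```

In this sample only the service account produces an incident, because it combines a first-time read of a sensitive archive with a large outbound transfer from the same host. The routine access by the second user is ignored.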