Business Reporter

Stop feeding AI junk

Kumar Goswami at Komprise proposes a systematic approach to unstructured data ingestion

 

It’s go-time for enterprise AI. A PagerDuty global survey of 1,000 IT and business executives found that 62% of companies using agentic AI expect an average return of 171%. Yet getting to ROI is no easy task: recent surveys show mixed results so far, with getting strategies right and making data “AI-ready” topping the list of barriers.

 

The challenge of preparing AI-ready data is especially acute with unstructured data, which comprises at least 80% of all data and sprawls across enterprises in the form of documents, PDFs, images, videos, emails, chats, machine data and more. Much of it has been accumulating for years without classification or curation. Feeding this data wholesale into AI systems only guarantees higher processing costs, wasted resources and lower accuracy.

 

The solution lies in building a systematic approach to unstructured data ingestion. Without it, AI will continue to consume junk, produce unreliable results and fall short of the ROI business executives expect.

 

 

The hidden cost of poor data ingestion

AI consumes compute and storage every time it processes information. If the majority of ingested data is irrelevant, duplicate or outdated, the system wastes a corresponding share of its processing capacity. This inefficiency translates directly into higher costs, whether the system runs in the cloud or in a data centre.
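The proportional relationship between junk data and wasted spend can be sketched as a back-of-the-envelope calculation. This is an illustrative simplification, not a Komprise formula; the function name and figures are hypothetical.

```python
# Hypothetical estimate: if a given share of the ingested corpus is junk
# (duplicate, outdated or irrelevant), roughly the same share of AI
# processing spend is wasted on it.
def wasted_spend(monthly_processing_cost: float, junk_fraction: float) -> float:
    """Return the portion of monthly AI processing cost consumed by junk data."""
    if not 0.0 <= junk_fraction <= 1.0:
        raise ValueError("junk_fraction must be between 0 and 1")
    return monthly_processing_cost * junk_fraction

# Example: a $50,000/month pipeline where 60% of ingested data is junk
print(wasted_spend(50_000, 0.6))  # → 30000.0
```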

 

Worse, bad data reduces accuracy. Poor-quality data not only adds noise but also leads to incorrect outputs that can erode trust in AI systems. The result is a double penalty: wasted money and poor performance. Enterprises must therefore treat data ingestion as a discipline in its own right, especially for unstructured data.

 

Many current ingestion methods are blunt instruments. They connect to a data source and pull in everything, or they rely on copy-and-sync pipelines that treat all data as equal. These methods may be convenient, but they lack the intelligence to separate useful information from irrelevant clutter. Such approaches create bloated AI pipelines that are expensive to maintain and impossible to fine-tune.  

 

 

The systematic approach to ingesting unstructured data

To unlock real ROI from AI, enterprises must embrace a deliberate and structured method for preparing and ingesting unstructured data. This involves five key steps that weed out irrelevant, outdated, duplicate and non-authoritative data, leaving only high-quality unstructured data:

 

1. Classification: Understand what unstructured data exists and where it resides. This requires tools that scan metadata across the entire data estate, not just within silos. Classification brings visibility and segmentation, identifying duplicate and orphaned data, sensitive data, and rarely accessed data that can be archived or deleted. Look for ways to auto-classify data and build metadata indexes; a manual approach becomes untenable at millions to billions of files and petabytes of data.

 

2. Curation: Once data is classified, the next step is to curate it. Not all data is equal. Some information may be outdated, irrelevant or contradictory. Curating data means deliberately filtering for quality and relevance before ingestion. This ensures that only useful content is fed to AI systems, saving compute cycles and improving accuracy. It also ensures that RAG and LLM solutions spend their context-window tokens on relevant data rather than irrelevant junk.

 

3. Tagging and metadata enrichment: Classification and curation become much more powerful when data is enriched with metadata. Adding context through tags using automation and content-scanning tools makes unstructured data searchable and verifiable. Custom metadata transforms raw files into usable assets that can be systematically routed to the right AI workflows.

 

4. Segmentation by use case: Generic ingestion pipelines often lump all data into a central bucket. A better approach is to segment data based on specific AI use cases. For instance, a customer support chatbot should ingest curated data relevant to policies, troubleshooting guides, and FAQs, while an HR assistant should focus on employment guidelines and internal communications. Tailoring ingestion to use cases not only improves accuracy but also makes it easier to monitor and refine each workflow.

 

5. Continuous monitoring and refinement: Data is never static. New documents, communications, and multimedia files are generated every day. A systematic approach requires ongoing monitoring to ensure that ingested data remains current and relevant. Continuous refinement helps prevent outdated or irrelevant information from creeping back into AI systems.
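The first four steps above can be sketched as a minimal pipeline. This is an illustrative toy, not Komprise's implementation: the `Doc` record, the content-hash deduplication and the keyword-based tagging are all simplified stand-ins for real classification and content-scanning tools, and step 5 (monitoring) amounts to rerunning the pipeline on a schedule.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Doc:
    path: str
    department: str          # used later for use-case segmentation
    modified_year: int
    content: str
    tags: set = field(default_factory=set)

def classify(docs: list) -> list:
    """Step 1: index the estate and drop exact duplicates by content hash."""
    seen, unique = set(), []
    for d in docs:
        h = hashlib.sha256(d.content.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(d)
    return unique

def curate(docs: list, min_year: int = 2022) -> list:
    """Step 2: filter out outdated material before it reaches the model."""
    return [d for d in docs if d.modified_year >= min_year]

def enrich(docs: list) -> list:
    """Step 3: add tags via (here, trivially simple) content scanning."""
    for d in docs:
        if "refund" in d.content.lower():
            d.tags.add("policy")
    return docs

def segment(docs: list) -> dict:
    """Step 4: route curated data into per-use-case buckets."""
    buckets: dict = {}
    for d in docs:
        buckets.setdefault(d.department, []).append(d)
    return buckets

# Step 5, monitoring, is simply rerunning this chain as new data arrives:
# buckets = segment(enrich(curate(classify(all_docs))))
```

The value of chaining the steps in this order is that each stage shrinks the corpus before the next, more expensive stage runs: deduplication before filtering, filtering before content scanning.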

 

 

The new role of IT and data teams

This systematic approach changes the role of IT and data teams. Traditionally, storage teams focused on infrastructure: uptime, capacity and performance. With AI, their responsibilities extend into data stewardship. They must work with data engineering and departmental analytics and research teams to classify unstructured files, identify sensitive information and provide curated data services to the business.

 

Systematic ingestion is not an incremental shift. It is a redefinition of IT’s value to the business. By curating unstructured data for AI, IT teams directly improve ROI, accuracy, and trust.

 

Another key aspect of systematic ingestion is designing AI workflows that are data-aware from the outset. Rather than defaulting to generic AI systems, enterprises should build specialised agents for distinct use cases. Each agent should be paired with carefully curated data that aligns with its purpose.

 

This granular design makes it easier to measure the effectiveness of each workflow, refine data inputs, and demonstrate ROI. When every agent has a clearly defined role and dataset, enterprises can pinpoint what works, what does not, and why.
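Pairing each agent with a clearly defined dataset can be as simple as a registry that fails loudly when an agent lacks one. The agent and dataset names below are hypothetical examples drawn from the use cases mentioned above, not a prescribed schema.

```python
# Hypothetical registry pairing each specialised agent with its curated
# datasets, so data inputs and effectiveness can be tracked per workflow.
AGENT_DATASETS = {
    "support_chatbot": ["policies", "troubleshooting_guides", "faqs"],
    "hr_assistant": ["employment_guidelines", "internal_comms"],
}

def datasets_for(agent: str) -> list:
    """Return the curated datasets for an agent, refusing undefined agents."""
    try:
        return AGENT_DATASETS[agent]
    except KeyError:
        raise ValueError(f"No curated dataset registered for agent '{agent}'") from None
```

Refusing to serve an unregistered agent enforces the principle that no workflow runs without a deliberately curated dataset behind it.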

 

 

From data chaos to data discipline

Enterprises that continue to ingest unstructured data indiscriminately will find themselves drowning in costs and disappointed by inaccurate results. AI cannot deliver value if it is trained and fed on junk. The shift from chaos to discipline requires a systematic approach to unstructured data ingestion that prioritises classification, curation, tagging, segmentation, and continuous monitoring.

 

The payoff is clear: lower costs, higher accuracy, and AI systems that deliver on their promises. Enterprise IT organisations that make the leap will stop feeding AI junk and start unlocking real data value. 

 


 

Kumar Goswami is the CEO and co-founder of Komprise

 

Main image courtesy of iStockPhoto.com and Just_Super

Business Reporter

Winston House, 3rd Floor, Units 306-309, 2-4 Dollis Park, London, N3 1HF

23-29 Hendon Lane, London, N3 1RT

020 8349 4363

© 2025, Lyonsdown Limited. Business Reporter® is a registered trademark of Lyonsdown Ltd. VAT registration number: 830519543