According to an August 2023 IDC report, 90% of the data organizations generated in 2022 was unstructured, and only 10% was structured. The report also revealed that only 58% of unstructured data is reused more than once. Organizations spend billions of dollars refining information into knowledge using less than 10% of their organizational brain. The other 90% is the actual “Voice of the Organization.”
Do you remember Brent from Gene Kim’s The Phoenix Project? In the fictional company, Parts Unlimited, he was the person everyone went to when they needed answers. Where did Brent’s information come from? He kept some of it in spreadsheets, notebooks and online text files, but most of it was in his head. Brent and Parts Unlimited may be fiction, but the narrative reflects reality. Organizations speak through tacit and often tribal information routes.
It is, however, just information rather than organizational knowledge. Turning this vast wasteland of information into knowledge was nearly impossible in the past. Large organizations have written millions of lines of scripts and robotic automation to mine a fraction of this information, and all of that scaffolding carries technical debt of its own.
A well-known secret in the AI community became a commodity for the rest of us around 2022: vector databases. They are commonly used as the retrieval layer in what is called Retrieval-Augmented Generation (RAG), where relevant passages are retrieved from your own data and handed to a large language model, for example, GPT-4 used by ChatGPT, as context for its answer. Vector databases are specialized databases that store and search embeddings: numeric vectors, produced by a neural network, that represent the meaning of a piece of text. A vector database offers significant advantages for organizations that want advanced natural language processing capabilities.
A vector database uses math and algorithms to compare the meaning of words and documents, rather than treating texts as simple strings of characters. Texts with similar meanings are stored as vectors that sit close together, so a search can match on intent even when the wording differs. Neural networks have been a part of AI for over 75 years. Only in the past few years have breakthroughs made solutions like ChatGPT available to the non-AI community.
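To make "close together" concrete, here is a minimal sketch of cosine similarity, one common way to compare two vectors. The three-dimensional "embeddings" are hand-made values chosen for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, and the function and variable names here are my own.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-made vectors. A real system would get these
# from an embedding model rather than typing them in.
embeddings = {
    "invoice": [0.9, 0.1, 0.0],
    "bill": [0.8, 0.2, 0.1],
    "firewall": [0.0, 0.2, 0.9],
}

# Related terms point in nearly the same direction (score close to 1).
sim_invoice_bill = cosine_similarity(embeddings["invoice"], embeddings["bill"])

# Unrelated terms point in different directions (score close to 0).
sim_invoice_firewall = cosine_similarity(
    embeddings["invoice"], embeddings["firewall"]
)
```

"Invoice" and "bill" share no characters in common, yet their vectors score as near neighbors. That is the property keyword search lacks.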
In a vector database, you load your information, a specific corpus of data, in this vectorized format, and it becomes searchable “knowledge.” Paired with a large language model, it’s like a ChatGPT but with your data instead. You do not have to train a model on your data; a pre-trained embedding model converts each document into vectors as you load it. You can ask questions in a ChatGPT-style format against your data and, if the database and models run inside your own environment, you don’t have to share the data outside of your organization. More importantly, because answers are grounded in retrieved documents, the conversations are likely to be less hallucinatory.
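The load-then-ask flow can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions, not how any particular product works: `embed` is a word-count stand-in for a real embedding model (so it matches shared words rather than true meaning), and `TinyVectorStore`, the document ids and the sample texts are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory store: embeds on load, ranks on query."""

    def __init__(self):
        self._rows = []  # list of (doc_id, text, vector)

    def add(self, doc_id, text):
        # No training step: the document is simply converted to a vector.
        self._rows.append((doc_id, text, embed(text)))

    def search(self, query, k=2):
        query_vector = embed(query)
        scored = [
            (cosine(query_vector, vector), doc_id, text)
            for doc_id, text, vector in self._rows
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]


store = TinyVectorStore()
store.add("policy-01", "Passwords must be rotated every 90 days.")
store.add("policy-02", "Customer data is encrypted at rest and in transit.")
store.add("runbook-07", "Restart the payment service after a failed deployment.")

results = store.search("How is customer data encrypted?", k=2)

# In a RAG setup, the retrieved passages would be passed to a language
# model as context; here we only assemble that context string.
context = "\n".join(f"[{doc_id}] {text}" for _, doc_id, text in results)
```

Because the language model answers from the retrieved passages rather than from memory alone, each answer can point back to the documents it came from.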
With these vector database features, organizations can build high-performance search and question-answer systems for security, compliance and customer service. Examples include:
- Extraction of accurate information from PDFs and images at scale
- Providing evidence retrieval capabilities for question-answering systems
- Responding automatically to customer requests and questionnaires
A vector database can handle millions of documents while maintaining millisecond-level query response times, typically by using approximate nearest-neighbor indexes rather than comparing a query against every stored vector. Database workloads can be isolated from vector workloads and scaled independently. MongoDB’s Atlas Vector Search is an example of a vector database that supports this kind of workload isolation.
Today, there are numerous vector database solutions available. I have worked on compliance, risk analysis and auditing use cases that require accuracy and audit trails; vector search provides traceable relevance ranking of this kind of information.
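As a rough illustration of what "traceable" can mean in practice, the sketch below turns a set of search hits into an audit record that captures the query, the time it was asked, and the rank and score of every document returned. The record layout, the `build_audit_record` helper and the sample ids and scores are assumptions made for this example, not the format of any specific product.

```python
import json
from datetime import datetime, timezone


def build_audit_record(query, hits, asked_at=None):
    """Build a JSON-serializable record of what was asked and what was returned.

    `hits` is a list of dicts with "doc_id" and "score" keys, as a
    vector search might return them (hypothetical shape).
    """
    asked_at = asked_at or datetime.now(timezone.utc)
    ranked = sorted(hits, key=lambda hit: hit["score"], reverse=True)
    return {
        "asked_at": asked_at.isoformat(),
        "query": query,
        "results": [
            {"rank": rank, "doc_id": hit["doc_id"], "score": round(hit["score"], 4)}
            for rank, hit in enumerate(ranked, start=1)
        ],
    }


# Hypothetical hits, deliberately out of order.
hits = [
    {"doc_id": "audit-2023-014", "score": 0.62},
    {"doc_id": "policy-02", "score": 0.91},
]

record = build_audit_record("Which controls cover encryption at rest?", hits)

# One line per query, suitable for appending to a log file.
audit_line = json.dumps(record, sort_keys=True)
```

An auditor reading such a log can see which documents informed an answer and how strongly each one matched.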
Vector databases can enable more intelligent text comprehension and unlock a range of benefits, including:
Greater Accuracy and Understanding:
A vector database allows questions to be addressed more precisely and contextually. Complex inquiries are interpreted based on underlying intent, not just keywords. As a result, systems can provide answers that are reliable and insightful.
Automatic Analysis of Large Volumes of Data:
For instant analysis, vector databases can index millions of documents, including scanned images and PDFs. Organizations can extract critical information from contracts, support documents, forms, and other unstructured data using this method.
Continuous Refinement and Improvement Over Time:
As more texts are indexed, the system has more relevant material to draw on, so answers become more complete. The underlying embedding models can also be retrained or fine-tuned over time to adapt to emerging risk, security, and compliance terminology.
In essence, vector databases bring cutting-edge natural language capabilities while removing the typical cost, speed, and scalability barriers to automating the analysis of unstructured data. This empowers practically any organization to benefit from AI-powered text analytics and drives tangible improvements in efficiency and decision-making quality. It turns unstructured information into knowledge.
In summary, vector databases open new search and relevance opportunities that significantly benefit organizations across many verticals. The scalability, speed and auditability provided make a wide range of search and automation use cases possible.