Semantic Search Engine Developed for Custom R&D Data Lake

The Problem:

Our client, a global consumer goods company, has many research and development groups working on new snacks, beverages, and health foods for customers around the world. As part of its broader digital transformation, the company had collected tens of thousands of R&D documents from disparate business systems and consolidated them into a single data lake in an Azure cloud environment. We were engaged to develop a custom search engine that lets researchers find relevant documents even when their search terms are incomplete.

 

The Data:

The client provided over 20,000 documents as a representative sample of the data lake's contents. The text required extensive cleaning to preserve meaningful information while removing erroneous Unicode characters and masking personally identifiable information (PII).
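
As an illustration of this kind of cleaning step, the following is a minimal sketch and not the pipeline used on the project. The function name clean_text, the two regular expressions, and the [EMAIL] and [PHONE] placeholders are assumptions made for the example; a production pipeline would need much broader PII coverage (names, addresses, IDs).

```python
import re
import unicodedata

# Hypothetical PII patterns for illustration only; real coverage would be broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def clean_text(raw: str) -> str:
    """Normalize Unicode, drop stray characters, and mask simple PII."""
    # Fold compatibility forms (ligatures, full-width characters) into plain ones.
    text = unicodedata.normalize("NFKC", raw)
    # Remove control, format, private-use, and unassigned characters
    # (Unicode categories starting with "C"), but keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Remove the replacement character left behind by failed decoding.
    text = text.replace("\ufffd", "")
    # Mask PII with fixed placeholders so the surrounding text stays readable.
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    # Collapse runs of spaces and tabs created by the removals above.
    return re.sub(r"[ \t]+", " ", text).strip()
```

Masking with placeholders rather than deleting the matches keeps sentence structure intact, which matters when the cleaned text is later used to train an embedding model.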

 

The Solution:

After cleaning the text, we applied NLP preprocessing and trained a word2vec model on the corpus to capture relationships between its tokens. To measure the model's accuracy, we built a custom web app in which our subject matter experts could review and score the model's results for a preselected list of search terms. Once validated, the model was deployed behind an API in the Azure environment. When a user enters a document search, the API returns a list of the most closely related words, which are included in the semantic search as well. For example, if a user searches for "almond", they may also see results for "peanut", "milk", or "extract".
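
The query expansion described above can be sketched as a nearest-neighbor lookup over word vectors. This is a minimal sketch under stated assumptions, not the deployed API: the EMBEDDINGS values are invented three-dimensional toy vectors standing in for the vectors a trained word2vec model would provide, and the names cosine, expand_query, and top_n are hypothetical.

```python
import math

# Toy vectors invented for illustration; in practice these would be looked up
# from the trained word2vec model and would have far more dimensions.
EMBEDDINGS = {
    "almond": [0.9, 0.8, 0.1],
    "peanut": [0.85, 0.75, 0.2],
    "milk": [0.7, 0.9, 0.3],
    "extract": [0.6, 0.7, 0.4],
    "packaging": [0.1, 0.2, 0.95],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def expand_query(term, embeddings, top_n=3):
    """Return the search term followed by its top_n most similar words."""
    term = term.lower()
    if term not in embeddings:
        # Out-of-vocabulary terms are searched as-is, with no expansion.
        return [term]
    query_vec = embeddings[term]
    scored = [
        (other, cosine(query_vec, vec))
        for other, vec in embeddings.items()
        if other != term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [term] + [word for word, _ in scored[:top_n]]


print(expand_query("almond", EMBEDDINGS))
```

With these toy vectors, a search for "almond" is expanded with "peanut", "milk", and "extract", while the unrelated "packaging" is left out. The expanded term list would then be passed to the underlying document search.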

 

The Impact:

By employing this method, we were able to accelerate the client's digital transformation. Their researchers were able to access all relevant documents with a speed and efficiency that was unattainable prior to the cloud migration.