CHAPTER 3

Build Predictive Models

One common application of text analytics is finding hidden patterns in the data. If you can identify the patterns, you can take corrective actions to resolve the issues. Here’s a dataset that lists different categories of factory reports with written descriptions of the events.

data = readtable("factoryReports.csv",'TextType','string');

head(data)

Once the data is imported and cleaned (similar to the previous section), the next step in building a model is converting the text into numeric form.

 

You can use the bag-of-words modeling approach to create a matrix of words with their frequencies.

% documents are tokenized and cleaned data

bag = bagOfWords(documents); 

 

% remove infrequent words

bag = removeInfrequentWords(bag,2); 

 

% remove empty documents

bag = removeEmptyDocuments(bag); 

Next, use the latent Dirichlet allocation (LDA) method to uncover hidden topics in the dataset. The 
fitlda
 function call will fit an LDA model to the matrix of words and their frequencies.

numTopics = 7; % assumed number of topics

mdl = fitlda(bag,numTopics);

In this example, assume there are seven topics in the data (often you might not know how many topics to expect). One approach to decide on the number of topics is to evaluate the goodness-of-fit of the model using perplexity of the validation set. 
See an example
.

Next, visualize the words for each topic with a word cloud. You should be able to see that some patterns are starting to emerge from the data.

Test the model with a new narrative and see if it correctly identified the topic.

newDocument = tokenizedDocument("Coolant is pooling underneath sorter.");

topicMixture = transform(mdl,newDocument);

figure

bar(topicMixture)

xlabel("Topic Index")

ylabel("Probability")

title("Document Topic Probabilities")

According to the model, the new narrative is a mixture of Topics 3 and 7 (related to coolant and sorter). By examining the Topic 3 and 7 word clouds, you can verify that the model’s prediction is correct. You can then automatically alert the response team about the issue that needs attention. This approach will eliminate the need for a person to manually inspect each incident to take corrective actions. If you have location data available and find a correlation between location and certain topics, that may alert you to a bigger problem with infrastructure.

Test your knowledge!
CORRECT!

When modeling with latent Dirichlet allocation, a document can belong to multiple topics.

INCORRECT!

When modeling with latent Dirichlet allocation, a document can belong to multiple topics.