CHAPTER 2

Access and Preprocess Text Data

Text Analytics Workflow

An end-to-end text analytics workflow involves the following four steps:

 
  1. Access data from databases, the web, and internal file repositories and explore by visualization.
  2. Preprocess data by eliminating extraneous information such as punctuation and stop words (common words such as “a” and “the”).
  3. Build predictive models by using machine or deep learning algorithms.
  4. Share insights and use predictive models in applications.

If this looks complicated, don’t worry. The following sections include examples featuring accessing data, preprocessing text, and building text analytics models. Here’s an example to get you started with accessing and exploring text data.

Get Started by Accessing and Exploring Text

You may have text data in various formats, such as Microsoft® Word® documents, PDFs, plain text files, Microsoft Excel spreadsheets, databases, and web pages. This example reads a page on artificial intelligence from the MathWorks website and uses a simple visualization to show which words it uses most.

url = "https://www.mathworks.com/discovery/artificial-intelligence.html";

% read the whole page
code = webread(url);

% look at the structure of the page
tree = htmlTree(code);

% find the paragraph elements
subtree = findElement(tree, "p");

% extract the text from paragraph elements
orgtext = extractHTMLText(subtree);

% break it down into words
text = tokenizedDocument(orgtext);

wordcloud(text)
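A word cloud gives a visual impression of word frequency, and the same information can be checked numerically. The following is a minimal sketch, assuming Text Analytics Toolbox is installed. The variable sampleText is a made-up stand-in for orgtext so that the snippet runs without network access.

```matlab
% Hypothetical sample text standing in for the downloaded page
sampleText = [
    "the cat sat on the mat"
    "the dog sat"
    ];

% Tokenize the text and count word occurrences across all documents
docs = tokenizedDocument(sampleText);
bag = bagOfWords(docs);

% List the three most frequent words with their counts
topWords = topkwords(bag, 3)
```

With the web page example, passing text to bagOfWords in place of docs should produce the counts behind the word cloud.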

Note: By default, wordcloud highlights the most common words in orange. However, the most common words do not always convey the most information.
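Web pages are only one of the formats listed earlier. For text stored in files, Text Analytics Toolbox provides extractFileText, which reads PDF, Microsoft Word, HTML, and plain text files. The sketch below is illustrative only: the file name and its contents are hypothetical, and the file is created on the fly so that the snippet is self-contained.

```matlab
% Create a small plain-text file so the example is self-contained
% (the file name and contents are hypothetical)
filename = fullfile(tempdir, "sample_notes.txt");
fid = fopen(filename, "w");
fprintf(fid, "Text analytics turns raw text into insight.\n");
fprintf(fid, "Cleaning the text usually comes first.\n");
fclose(fid);

% Read the file contents into a string
fileText = extractFileText(filename);

% Tokenize the text, as in the web page example
fileDocs = tokenizedDocument(fileText);
```

The same call should work for a PDF or Word document by pointing filename at that file instead.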

Preprocessing Text Data

You may have noticed some issues with the word cloud, in particular the inclusion of punctuation, symbols, and words like “and,” “the,” and “to” that are unlikely to add value. These low-information words are called stop words.

Data cleaning is an important step in extracting information from the data. After cleaning, plot the data as a word cloud again to see what effect the cleaning has had and whether further cleaning is necessary.

To clean the data and plot another word cloud, try:

% make text lower case (start from the extracted paragraph strings,
% because tokenizedDocument expects text rather than tokenized input)
cleanText = lower(orgtext);

% split text into individual words
cleanText = tokenizedDocument(cleanText);

% remove words that will not give us much information
commonWords = ["ai", "artificial", "intelligence", "matlab", "simulink", "mathworks"];
cleanText = removeWords(cleanText, commonWords);

% remove stop words such as "a", "the"
cleanText = removeStopWords(cleanText);

% remove punctuation
cleanText = erasePunctuation(cleanText);

% plot wordcloud again
wordcloud(cleanText)

% plot word clouds side-by-side for comparison (in a new figure)
figure
subplot(1, 2, 1), wordcloud(text); title('Raw data')
subplot(1, 2, 2), wordcloud(cleanText); title('Clean data')

The clean word cloud conveys more information about what is on the page. Data cleaning is often the most time-consuming part of data analytics, but it gets easier with experience and knowledge of common preprocessing techniques.
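To see what the cleaning steps do to individual tokens, it can help to run them on a single sentence and compare the word lists before and after. This is a minimal sketch, assuming Text Analytics Toolbox; the sentence is made up for illustration.

```matlab
% Hypothetical sentence used only for illustration
str = "The AI model is trained, and the model is then deployed!";

% Lowercase and tokenize
doc = tokenizedDocument(lower(str));
wordsBefore = string(doc);

% Remove stop words, then punctuation
doc = removeStopWords(doc);
doc = erasePunctuation(doc);
wordsAfter = string(doc);

% Compare the token lists
disp(wordsBefore)
disp(wordsAfter)
```

Stop words such as "the" and the punctuation tokens should be gone from the second list, while content words such as "model" remain.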

 

Learn more about common text data preprocessing techniques.


Sometimes, all you need are simple string manipulation techniques. MATLAB includes functions that let you:

   • Search for specific strings or characters (regexp and strfind)
   • Replace certain words (regexprep and replaceWords)
   • Look at certain sections of a document (extractBetween)
   • Compare strings (strcmp)
   • Count how many times certain words are mentioned in a document (count)
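
The functions listed above can be tried on a short sample string. The sketch below uses only base MATLAB functions; the sample sentence is hypothetical. It leaves out replaceWords, which operates on tokenizedDocument arrays from Text Analytics Toolbox, and adds strcmpi to show a case-insensitive comparison.

```matlab
% Hypothetical sample text
str = "MATLAB makes text analytics easier. Text analytics starts with clean text.";

% Search: starting indices of "text" (case-sensitive)
idx = strfind(str, "text");

% Search with a regular expression: words ending in "ics"
matches = regexp(str, "\w+ics", "match");

% Replace: swap "analytics" for "mining"
replaced = regexprep(str, "analytics", "mining");

% Extract the text between two markers
middle = extractBetween(str, "makes ", " easier");

% Compare: strcmp is case-sensitive, strcmpi is not
sameCase = strcmp("text", "Text");
sameIgnoringCase = strcmpi("text", "Text");

% Count occurrences, with and without case sensitivity
nExact = count(str, "text");
nAnyCase = count(str, "text", "IgnoreCase", true);
```

Here nExact is 2 and nAnyCase is 3, because the capitalized "Text" is counted only when case is ignored.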

Note: Character arrays store sequences of characters, and string arrays store pieces of text. Another difference is that to create a string, you use double quotes, and to create a character vector, you use single quotes.
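
The difference between the two types shows up in their sizes and in how concatenation behaves. This is a short illustrative sketch using base MATLAB only; the variable names are arbitrary.

```matlab
c = 'text';      % character vector: a 1-by-4 array of characters
s = "text";      % string scalar: a single 1-by-1 element

sizeC = size(c);          % [1 4], one element per character
sizeS = size(s);          % [1 1], one element for the whole text
lenS = strlength(s);      % 4, the number of characters in the string

% Concatenation behaves differently for the two types
joinedChars = [c ' data'];      % one longer character vector
joinedStrings = [s " data"];    % 1-by-2 string array
appended = s + " data";         % one string, "text data"
```

Square brackets merge character vectors into one but stack strings into an array, so the plus operator is the usual way to append to a string.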

Text Summarization

Another way to look at the text data is to extract a summary. To create a six-sentence summary, use:

% Join all the paragraphs and split into sentences
orgLines = splitSentences(join(orgtext));

% Tokenize the sentences
tok_orgLines = tokenizedDocument(orgLines);

% Look at a 6-sentence summary
summary = extractSummary(tok_orgLines, 'SummarySize', 6, 'OrderBy', "position");

 

Beyond automated driving , AI is also used in models that predict machine failure , indicating when they will require maintenance ; health and sensor analytics such as patient monitoring systems ; and robotic systems that learn and improve directly from experience .

Data preparation requires domain expertise , such as experience in speech and audio signals , navigation and sensor fusion , image and video processing , and radar and lidar .

In automated driving systems , AI for perception must integrate with algorithms for localization and path planning and controls for braking , acceleration , and turning .

AI models need to be deployed to CPUs , GPUs , and / or FPGAs in your final product , whether part of an embedded or edge device , enterprise system , or cloud .

Statistics and Machine Learning Toolbox ™ makes the hard parts of machine learning easy with apps for training and comparing models , advanced signal processing and feature extraction , classification , regression , and clustering algorithms for supervised and unsupervised learning .

For example , in an automated driving system , you use AI and simulation to design the controller for braking , acceleration , and turning .

Do these sentences describe what the page is about? Yes. These are somewhat long sentences, but they capture the main points of the article.
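
The summary returned by extractSummary is a tokenizedDocument array, so it can be converted back to plain strings with joinWords. That function joins tokens with spaces, which likely explains why the punctuation in the summary above is set apart by spaces. The sketch below assumes Text Analytics Toolbox and uses made-up sentences. It also requests the per-sentence scores as a second output, which is assumed here to be supported by the installed release.

```matlab
% Hypothetical sentences used only for illustration
sentences = [
    "AI models can predict machine failure."
    "Predictive maintenance relies on sensor data."
    "AI models need good sensor data."
    "The weather was pleasant yesterday."
    ];
docs = tokenizedDocument(sentences);

% Extract a two-sentence summary, keeping the original sentence order
[summary, scores] = extractSummary(docs, 'SummarySize', 2, 'OrderBy', "position");

% Convert the tokenized summary back to strings for display
summaryText = joinWords(summary)
```

Inspecting scores alongside the sentences is one way to judge why a sentence was or was not selected.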