
A Novel Solution to Cleaning Extremely Dirty Unstructured Text

source link: https://towardsdatascience.com/a-novel-solution-to-cleaning-extremely-dirty-unstructured-text-490d4ba934de?gi=7ae3c939d221


Photo by Hello I'm Nik on Unsplash



Jun 24 · 8 min read

After two whole weeks of trial and error and almost thinking it was impossible…

I did it.

I’d managed to turn this:

“Thsi wass a dificult prbl em to resol ve”

Into this:

“This was a difficult problem to resolve”

I thought I’d seen it all after working with the likes of Singlish or even Tweets.

But never would I have thought I’d come across unstructured text data so dirty.

I’ve obviously exaggerated the example above but the problem boils down to these two key points:

  • Misspelling detection and resolution
  • Handling random white spaces in-between words

As humans, we can easily decipher the statement above as:

“This was a difficult problem to resolve”

However, coming up with a methodology to resolve such a data issue proved more challenging than anticipated.

In this post, I will be going through a novel solution I took to clean up an extremely dirty unstructured text dataset.

Here’s a glimpse of what the data looked like:

Example of dirty unstructured text
Note: This is just an example of what the dirty data looked like, not the actual data.

There won’t be many code snippets today, but by the end of the post you will have an idea of the methodology needed to resolve such a data issue if you ever come across it.

Data Quality Issue 1: Misspelling Detection and Resolution

In one of my previous posts on handling misspellings, I used word vectors and a large number of translations to form a generalised translation vector.

In this post, I will be falling back on an algorithmic approach to handling misspellings.

There are two parts to this segment:

  1. Detect misspelled words
  2. Resolve misspelled words

I used SAS Viya’s out-of-the-box misspelling action, tpSpell, to do this.

Note: SAS works with something called Actionsets, which are analogous to Python packages. Within each Actionset, there are many Actions that can be performed.

Part 1: Detect Misspelled words

In this step, the tpSpell action performs what is known as Candidate Extraction.

Candidate Extraction groups words into two categories:

  • Correctly-spelled word candidates
  • Misspelled word candidates

A correctly-spelled word candidate is determined by a parameter (“minimum parents”) that is set before running the procedure.

This parameter specifies the minimum number of documents a term must appear in to be considered a correctly-spelled word candidate.

All correctly-spelled word candidates take the form <term>-<role>-<parent>.

For example, refer to figure 1 below:


Figure 1 — Correctly-spelled candidate example (source)

Notice how the potential candidate name is a concatenation of the columns “Term”, “Role” and “Parent” to form “algorithms-N-algorithm”.

Here’s the breakdown of the logic.

  1. If the potential candidate name, “algorithms-N-algorithm”, appears 5 times across 4 documents, then the number of documents it appears in is 4.
  2. If the predetermined parameter “minimum parents” is set to 3, then since 4 is greater than 3, the candidate “algorithms-N-algorithm” is added to the correctly-spelled candidate list.

Now, what about the misspelled word candidates list? How then do we create this table?

Just like the correctly-spelled word candidate list, there is another predetermined parameter to set.

This time, it is called “maximum children”.

Take a look at figure 2 below, for instance:


Figure 2 — Misspelled candidate example (source)

The logic is simple.

If the “maximum children” parameter is set to 3 and the number of documents a potential candidate name like “algoxxxthzs-N-algoxxxthzs” appears in is less than 3, then “algoxxxthzs-N-algoxxxthzs” is added into the misspelled word candidate list.

This logic is repeated for the whole table specified in figure 2.
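To make the two thresholds concrete, here is a minimal Python sketch of the candidate-extraction idea. It is an illustration only, not how tpSpell is implemented, and the (candidate, document) pairs are an assumed input format:

from collections import defaultdict

def extract_candidates(term_table, min_parents=3, max_children=6):
    """Split <term>-<role>-<parent> candidates into correctly-spelled
    and misspelled candidate lists based on document frequency.

    term_table: iterable of (candidate_name, doc_id) pairs, e.g.
                ("algorithms-N-algorithm", "doc1").  (assumed format)
    """
    # Count the number of distinct documents each candidate appears in
    docs_per_candidate = defaultdict(set)
    for candidate, doc_id in term_table:
        docs_per_candidate[candidate].add(doc_id)

    correctly_spelled, misspelled = [], []
    for candidate, docs in docs_per_candidate.items():
        if len(docs) >= min_parents:
            # Frequent enough to be trusted as a correct spelling
            correctly_spelled.append(candidate)
        if len(docs) < max_children:
            # Rare enough to be a potential misspelling
            misspelled.append(candidate)
    # Note: with these example thresholds a mid-frequency candidate can land
    # in both lists; tpSpell's actual bookkeeping is more involved.
    return correctly_spelled, misspelled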

Now that we have a correctly-spelled candidate list and a misspelled word candidate list, what follows is the logic for assigning the correct spelling to each misspelling.

Part 2: Misspelling Resolution

In this step, Candidate Comparisons are performed to resolve misspellings.

The algorithm will now check all words in the misspelled list against all words in the correctly-spelled list.

It compares a misspelled word candidate to every correctly spelled word candidate, and computes a distance between them.

Another predetermined parameter called “maximum spell distance” determines if the misspelled word has a correct spelling or not.

For instance, suppose the word “algoxxxthzs” is a misspelled word candidate, the word “algorithm” is a correctly spelled candidate, and the distance computed between them is 50.

If “maximum spell distance” were set to 20, then since 50 is greater than 20, “algorithm” would not be assigned as the correct spelling; a misspelled candidate is only resolved to a correctly spelled candidate whose computed distance falls within the “maximum spell distance” threshold.
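As a rough illustration of this comparison step, here is a Python sketch that assigns each misspelled candidate to its nearest correctly-spelled candidate. It uses plain Levenshtein edit distance as a stand-in for tpSpell’s spelling distance, which is scaled differently (hence the much smaller threshold below):

def edit_distance(a, b):
    """Plain Levenshtein distance, a stand-in for tpSpell's spelling distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def resolve_misspellings(misspelled, correctly_spelled, max_spell_dist):
    """Map each misspelled candidate to its closest correct candidate,
    but only if that distance is within max_spell_dist."""
    resolutions = {}
    for bad in misspelled:
        best = min(correctly_spelled, key=lambda good: edit_distance(bad, good))
        if edit_distance(bad, best) <= max_spell_dist:
            resolutions[bad] = best
    return resolutions

print(resolve_misspellings(["algoxxxthzs"], ["algorithm", "analysis"],
                           max_spell_dist=6))
# {'algoxxxthzs': 'algorithm'}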

There are also other advanced parameters you could set here to account for multi-term words like “fire engine”, “carry on” or “join in”. You can read the documentation here.

Although I made it sound really complicated above…

It is not.

Here’s how to run all of the above.

proc cas;
   textParse.tpSpell /
      table={name="pos", caslib="public"}
      minParents=3
      maxChildren=6
      maxSpellDist=15
      casOut={name="tpSpell_Out", replace=true};
   run;
quit;

The result looks like this:


Figure 3 — Misspelling Resolution Results

One thing I love about SAS programming is how easy it is to make use of complicated algorithms with just a few lines of code.

Yes, Python packages can do the same thing, but sometimes the packages are not as advanced as SAS’s.

A good example is how SAS does optimization for deep learning hyperparameters and the parameters within the layers. Anecdotally, it was really easy to do hyper-parameter tuning in SAS.

SAS combines Latin Hypercube with the Genetic Algorithm optimized with the Hyperband method to account for resource utilization. In addition, everything is automatically multi-threaded.

Coding what SAS does out of the box in Python would definitely have taken more time and effort, not to mention the knowledge needed to distribute processing in Python while accounting for resource allocation and utilization, something I think the majority of people aren’t familiar with.

Data Quality Issue 2: Handling Random White Spaces

This problem really had me testing multiple methodologies, all of which performed poorly except for the one I’ll be going through now.

To re fresh your m emry, thee probl em loked lik e thsi

There are a few data quality issues worth pointing out above:

  1. Space in-between the words i.e. “probl em”
  2. Missing characters i.e. “loked”
  3. Swapped characters i.e. “thsi”
  4. Double characters i.e. “thee”
  5. Combination of issues i.e. “m emry” — combination of 1 and 2.

I had to find a way to resolve the spaces in-between words while at the same time accounting for the misspellings (numbers 2, 3 and 4).

Thankfully, the misspellings can be taken care of relatively easily with tpSpell, as I have shown above.

But the complexity comes in with the combinations of issues (number 5).

For a combination problem like this, I first needed to resolve the white spaces before applying the misspelling resolution.

While thinking about the best solution for the job, I took inspiration from how encoder-decoder networks worked.

Could I somehow “encode” the sentences and “decode” them after?

Obviously I’m using the terms loosely here. By “encode”, I actually mean removing all white spaces between words, like so:

torefreshyourmemrytheeproblemlokedlikethsi

By “decode”, I mean breaking the string back into its individual words after resolving the misspellings:

to refresh your memory the problem looked like this

The Problem with “Decoding” Part 1

One of the biggest problems I had to resolve was when to insert a white space once a string had been “encoded”.

Take the string “torefresh”, for instance.

How would you know to insert a white space between “to” and “refresh”? Why not “to”, “re” and “fresh”?

Here’s how I did it and the logic of my SAS script.

Decoded words are based on the longest match found in the dictionary.

For example, given the string “helloworld”, I first look up the word “hell” in a predefined potential candidate list, i.e. a dictionary table.

When the next character “o” is read in, since “hell” + “o” = “hello” is also in the dictionary, “hello” becomes the better potential candidate.

I drop “hell” as the main candidate and keep “hello” as the new main candidate.

As the next character is read in, “hello” + “w” = “hellow”, and since “hellow” is not a word in the potential candidate list, the best candidate to split off is “hello”.

Once these checks are completed, I add a white space after “hello” and continue the same logic over “w”, “o”, “r”, “l” and “d” until I get “hello world”.

So what is this potential candidate list and how did I get this dictionary?

It is a dictionary I created from the correctly-spelled word list generated from the tpSpell action earlier.

In other words, all correctly-spelled words are placed into this dictionary table for me to look up as I “decode” each string.
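To tie the pieces together, here is a minimal Python sketch of this longest-match “decoding” idea, assuming the dictionary of correctly-spelled words is already built (the actual SAS script I used is not shown here):

def decode(encoded, dictionary, max_word_len=20):
    """Greedy longest-match segmentation of a whitespace-stripped string.

    encoded:    string with all white space removed, e.g. "helloworld"
    dictionary: set of correctly-spelled terms, e.g. {"hello", "world"}
    """
    words, i = [], 0
    while i < len(encoded):
        # Try the longest possible slice starting at i first, then shrink it
        for j in range(min(len(encoded), i + max_word_len), i, -1):
            if encoded[i:j] in dictionary:
                words.append(encoded[i:j])
                i = j
                break
        else:
            # No dictionary word starts here, so the character is discarded
            i += 1
    return " ".join(words)

dictionary = {"to", "refresh", "your", "memory", "the", "problem",
              "looked", "like", "this"}
print(decode("torefreshyourmemorytheproblemlookedlikethis", dictionary))
# to refresh your memory the problem looked like this

The input string above has already been spell-corrected for simplicity; in the workflow described in this post, the dictionary comes from the correctly-spelled list that tpSpell produced earlier.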

The Problem with “Decoding” Part 2

There were other issues that needed to be resolved to make the “decoder” work.

That is, I had to remove stop words and all forms of punctuation, lower-case all words, and keep only parent terms.

Why?

Because if I didn’t, a lot of issues would have surfaced.

Take this sentence for instance:

“the series of books”

After “encoding” it will be:

“theseriesofbooks”

Recall that the algorithm I created takes the longest match in the dictionary during the “decoding” process.

If I kept all stop words, the “decoder” would have parsed the first word as “these” and not “the”.

Because of this, the next characters “ries” (left over from “series”) would no longer form a word in my dictionary and would be discarded.

This means the “decoded” string would end up as “these of books”, which is obviously incorrect.
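With the decode() sketch from earlier, this failure mode is easy to reproduce, assuming the stop words “the” and “these” were left in the dictionary:

dictionary = {"the", "these", "series", "of", "books"}
print(decode("theseriesofbooks", dictionary))
# these of books   ("ries" is not in the dictionary and gets discarded)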

Another intricacy that affects the “decoder” looks something like this:

“bicycle sports” → “bicyclesports” → “bicycles ports” (wrong)

To counter this issue, I only kept parent terms in my dictionary table.

Example 1: bicycles → bicycle

Example 2: sports → sport

By only looking at parent terms, my output looked like this:

“bicycle sport” → “bicyclesport” → “bicycle sport” (correct)

Because “bicycles” is no longer in my dictionary and “bicycle” is, the decoder parses the string correctly and does not output “bicycles port”.
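Putting the two fixes together, the lookup dictionary might be built along these lines. This is an illustrative Python sketch; the "parent" and "term" keys are assumptions loosely based on the tables shown earlier, not the actual tpSpell output schema:

def build_dictionary(correct_candidates, stop_words):
    """Build the lookup dictionary for the "decoder" from the
    correctly-spelled candidate list, keeping only parent terms."""
    dictionary = set()
    for row in correct_candidates:
        # Keep the lower-cased parent term (e.g. "bicycles" -> "bicycle")
        parent = row["parent"].lower()
        if parent not in stop_words and parent.isalpha():
            dictionary.add(parent)
    return dictionary

rows = [{"term": "bicycles", "parent": "bicycle"},
        {"term": "sports", "parent": "sport"}]
print(decode("bicyclesport", build_dictionary(rows, stop_words={"the", "of", "these"})))
# bicycle sport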

So there you have it!

My very own “encoder-decoder” methodology to clean extremely dirty unstructured text! :wink:

It’s not a perfect solution, but it gets the job done well enough for me to start my downstream NLP task proper.

Seeing it all in Action!

Just to show you an example of what the output looks like after running the code, I ran it against a book review corpus with mocked-up data quality issues.

Here’s how the sample and the process look:


Figure 4 — Data Cleaning Process in Action

Ending Notes

Well, that’s it then!

I hope you found this post insightful! :smiley:

Despite spending close to 2 weeks trying out many methodologies, I had tons of fun with this!

Till next time, farewell!

LinkedIn Profile: Timothy Tan

