Regular expressions are sequences of characters that define patterns in text. They are useful for searching, matching, and manipulating text data.

Character sets in regular expressions match any one of the characters inside square brackets []. For example, the pattern con[sc]en[sc]us will match 'consensus', 'concensus', 'consencus', and 'concencus'.
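As a quick check with Python's re module (the loop and sample words are just for illustration):

import re

# [sc] accepts either 's' or 'c', so all four spellings match
for word in ['consensus', 'concensus', 'consencus', 'concencus']:
    print(word, bool(re.fullmatch(r'con[sc]en[sc]us', word)))
# All four lines print True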
An optional character in regular expressions is marked with a question mark ?. It means the character can appear once or not at all. For example, the pattern humou?r matches both 'humour' and 'humor'.
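A minimal sketch with re.findall (the sample sentence is made up):

import re

# The ? makes the preceding 'u' optional
print(re.findall(r'humou?r', 'British humour, American humor'))
# Output: ['humour', 'humor']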
Literals in regular expressions are exact characters you want to match. For instance, the pattern monkey will match 'monkey' exactly, and also match 'monkey' within 'The monkeys like to eat bananas.'
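For example, with re.search:

import re

# A literal pattern also matches inside longer words like 'monkeys'
match = re.search(r'monkey', 'The monkeys like to eat bananas.')
print(match.group())
# Output: monkey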
Fixed repetitions in regular expressions are indicated by curly braces {}. They specify how many times a character or group should appear. For example, roa{3}r matches 'roaaar' (exactly three a's), and roa{3,6}r matches 'roaaar', 'roaaaar', and so on, up to six a's.
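A quick sketch verifying the counts:

import re

print(bool(re.fullmatch(r'roa{3}r', 'roaaar')))     # True: exactly three a's
print(bool(re.fullmatch(r'roa{3}r', 'roaaaar')))    # False: four a's
print(re.findall(r'roa{3,6}r', 'roaaar roaaaaar'))  # Output: ['roaaar', 'roaaaaar']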
Alternation, shown with a pipe symbol |, allows matching either of two options. For example, baboons|gorillas matches both 'baboons' and 'gorillas'.
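For example, extracting both alternatives from a sample string with re.findall:

import re

print(re.findall(r'baboons|gorillas', 'I love baboons and gorillas'))
# Output: ['baboons', 'gorillas']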
Anchors like ^ and $ are used to match text at the beginning and end of a string. For example, ^Monkeys: my mortal enemy$ matches 'Monkeys: my mortal enemy' exactly, but not 'Spider Monkeys: my mortal enemy'.
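A small sketch where the anchors do the work:

import re

pattern = r'^Monkeys: my mortal enemy$'
print(bool(re.search(pattern, 'Monkeys: my mortal enemy')))         # True
print(bool(re.search(pattern, 'Spider Monkeys: my mortal enemy')))  # False: ^ rejects the 'Spider ' prefix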
Wildcards in regular expressions are represented by a period . and match any single character. For example, ......... (nine periods) matches any 9-character text, like 'orangutan' or 'marsupial'.
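For instance, checking lengths with re.fullmatch:

import re

# Nine periods match any nine characters
print(bool(re.fullmatch(r'.........', 'orangutan')))  # True: 9 characters
print(bool(re.fullmatch(r'.........', 'marsupial')))  # True: 9 characters
print(bool(re.fullmatch(r'.........', 'gibbon')))     # False: only 6 characters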
Character ranges in regular expressions specify sets of characters. For instance, [A-Z] matches any uppercase letter, and [0-9] matches any digit.
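For example, pulling matches out of made-up sample strings with re.findall:

import re

print(re.findall(r'[A-Z]', 'The Quick Brown Fox'))  # Output: ['T', 'Q', 'B', 'F']
print(re.findall(r'[0-9]', 'Room 404, floor 3'))    # Output: ['4', '0', '4', '3']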
Shorthand character classes make writing regular expressions easier. For example, \w matches any word character, \d matches digits, and \W matches anything that's not a word character.
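A minimal illustration over a sample string:

import re

text = '3 monkeys!'
print(re.findall(r'\w', text))  # Output: ['3', 'm', 'o', 'n', 'k', 'e', 'y', 's']
print(re.findall(r'\d', text))  # Output: ['3']
print(re.findall(r'\W', text))  # Output: [' ', '!']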
The Kleene star (*) matches the preceding character 0 or more times, and the Kleene plus (+) matches it 1 or more times. For example, meo*w matches 'mew', 'meow', and 'meooow', while meo+w matches 'meow' but not 'mew'.
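For example, checked with re.fullmatch:

import re

for word in ['mew', 'meow', 'meooow']:
    print(word, bool(re.fullmatch(r'meo*w', word)))  # All True: * allows zero or more o's
print(bool(re.fullmatch(r'meo+w', 'mew')))           # False: + requires at least one o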
Grouping in regular expressions, using parentheses (), limits the scope of alternation. For instance, I love (baboons|gorillas) matches both 'I love baboons' and 'I love gorillas'.
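A short sketch showing that the parentheses confine the alternation to the animal names:

import re

pattern = r'I love (baboons|gorillas)'
print(bool(re.fullmatch(pattern, 'I love baboons')))   # True
print(bool(re.fullmatch(pattern, 'I love gorillas')))  # True
print(bool(re.fullmatch(pattern, 'I love parrots')))   # False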
Text preprocessing involves cleaning and preparing text data for analysis. In Python, libraries like NLTK and re are commonly used for this task.
Removing unwanted characters, or noise removal, involves cleaning text by stripping out unnecessary formatting or characters. For example, re.sub can strip punctuation:
import re

text = "Five fantastic fish flew off to find faraway functions. Maybe find another five fantastic fish? Find my fish with a function please!"

# Remove punctuation
result = re.sub(r'[\.\?\!\,\:\;\"]', '', text)
print(result)
# Output: Five fantastic fish flew off to find faraway functions Maybe find another five fantastic fish Find my fish with a function please
Tokenization is the process of splitting text into smaller pieces, called tokens, which can be words, phrases, or other elements.
from nltk.tokenize import word_tokenize

# Requires: nltk.download('punkt')
text = "This is a text to tokenize"
tokenized = word_tokenize(text)
print(tokenized)
# Output: ['This', 'is', 'a', 'text', 'to', 'tokenize']
Text normalization includes various tasks such as converting text to lowercase, stemming, lemmatizing, and removing common words like 'the'.
Stemming is the process of chopping off the ends of words to get their base forms. This helps in reducing different forms of a word to a common base.
from nltk.stem import PorterStemmer

tokenized = ["So", "many", "squids", "are", "jumping"]
stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokenized]
print(stemmed)
# Output: ['so', 'mani', 'squid', 'are', 'jump']
# Note: PorterStemmer lowercases by default, and stems are not always real words ('mani')
Lemmatization is the process of reducing words to their root form. It uses a dictionary to return the base form of a word.
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download('wordnet')
tokenized = ["So", "many", "squids", "are", "jumping"]
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]
print(lemmatized)
# Output: ['So', 'many', 'squid', 'are', 'jumping']
# Note: lemmatize() treats words as nouns unless told otherwise;
# lemmatizer.lemmatize('are', pos='v') returns 'be'
Stopword removal involves getting rid of common words like 'is', 'and', or 'the', which don't add much meaning to the text.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires: nltk.download('stopwords') and nltk.download('punkt')
# Define set of English stopwords
stop_words = set(stopwords.words('english'))

# Tokenize a sample sentence, then drop any token that is a stopword
word_tokens = word_tokenize("the monkeys like to eat bananas")
statement_no_stop = [word for word in word_tokens if word not in stop_words]
print(statement_no_stop)
# Output: ['monkeys', 'like', 'eat', 'bananas']
Part-of-speech tagging assigns a label to each word in a sentence, indicating whether it's a noun, verb, adjective, etc. This can help improve text analysis.
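As a sketch, NLTK's pos_tag can tag a tokenized sentence (the sample sentence is ours, and the tagger model must be downloaded first):

from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
tokens = word_tokenize("The monkeys like to eat bananas")
print(pos_tag(tokens))
# Output (approximately): [('The', 'DT'), ('monkeys', 'NNS'), ('like', 'VBP'),
#                          ('to', 'TO'), ('eat', 'VB'), ('bananas', 'NNS')]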