Regular expressions are sequences of characters that define patterns in text. They are useful for searching, matching, and manipulating text data.

Character sets in regular expressions match any one of the characters inside square brackets []. For example, the pattern con[sc]en[sc]us will match 'consensus', 'concensus', 'consencus', and 'concencus'.
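As a quick check with Python's re module (the loop and sample words are just for illustration):

import re

# [sc] accepts either 's' or 'c', so all four spellings match
for word in ['consensus', 'concensus', 'consencus', 'concencus']:
    print(word, bool(re.fullmatch(r'con[sc]en[sc]us', word)))
# All four lines print True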
An optional character in regular expressions is marked with a question mark ?. It means the character can appear once or not at all. For example, the pattern humou?r matches both 'humour' and 'humor'.
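A minimal sketch with re.findall (the sample sentence is made up):

import re

# The ? makes the preceding 'u' optional
print(re.findall(r'humou?r', 'British humour, American humor'))
# Output: ['humour', 'humor']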
Literals in regular expressions are exact characters you want to match. For instance, the pattern monkey will match 'monkey' exactly, and also match 'monkey' within 'The monkeys like to eat bananas.'
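For example, with re.search:

import re

# A literal pattern also matches inside longer words like 'monkeys'
match = re.search(r'monkey', 'The monkeys like to eat bananas.')
print(match.group())
# Output: monkey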
Fixed repetitions in regular expressions are indicated by curly braces {}. They specify how many times a character or group should appear. For example, roa{3}r matches 'roaaar' (exactly three a's), and roa{3,6}r matches 'roaaar', 'roaaaar', and so on, up to six a's.
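A quick sketch verifying the counts:

import re

print(bool(re.fullmatch(r'roa{3}r', 'roaaar')))     # True: exactly three a's
print(bool(re.fullmatch(r'roa{3}r', 'roaaaar')))    # False: four a's
print(re.findall(r'roa{3,6}r', 'roaaar roaaaaar'))  # Output: ['roaaar', 'roaaaaar']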
Alternation, shown with a pipe symbol |, allows matching either of two options. For example, baboons|gorillas matches both 'baboons' and 'gorillas'.
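For example, extracting both alternatives from a sample string with re.findall:

import re

print(re.findall(r'baboons|gorillas', 'I love baboons and gorillas'))
# Output: ['baboons', 'gorillas']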
Anchors like ^ and $ are used to match text at the beginning and end of a string. For example, ^Monkeys: my mortal enemy$ matches 'Monkeys: my mortal enemy' exactly, but not 'Spider Monkeys: my mortal enemy'.
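A small sketch where the anchors do the work:

import re

pattern = r'^Monkeys: my mortal enemy$'
print(bool(re.search(pattern, 'Monkeys: my mortal enemy')))         # True
print(bool(re.search(pattern, 'Spider Monkeys: my mortal enemy')))  # False: ^ rejects the 'Spider ' prefix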
Wildcards in regular expressions are represented by a period . and match any single character. For example, ......... (nine periods) matches any 9-character text, like 'orangutan' or 'marsupial'.
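For instance, checking lengths with re.fullmatch:

import re

# Nine periods match any nine characters
print(bool(re.fullmatch(r'.........', 'orangutan')))  # True: 9 characters
print(bool(re.fullmatch(r'.........', 'marsupial')))  # True: 9 characters
print(bool(re.fullmatch(r'.........', 'gibbon')))     # False: only 6 characters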
Character ranges in regular expressions specify sets of characters. For instance, [A-Z] matches any uppercase letter, and [0-9] matches any digit.
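For example, pulling matches out of made-up sample strings with re.findall:

import re

print(re.findall(r'[A-Z]', 'The Quick Brown Fox'))  # Output: ['T', 'Q', 'B', 'F']
print(re.findall(r'[0-9]', 'Room 404, floor 3'))    # Output: ['4', '0', '4', '3']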
Shorthand character classes make writing regular expressions easier. For example, \w matches any word character, \d matches digits, and \W matches anything that's not a word character.
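A minimal illustration over a sample string:

import re

text = '3 monkeys!'
print(re.findall(r'\w', text))  # Output: ['3', 'm', 'o', 'n', 'k', 'e', 'y', 's']
print(re.findall(r'\d', text))  # Output: ['3']
print(re.findall(r'\W', text))  # Output: [' ', '!']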
The Kleene star (*) matches the preceding character 0 or more times, and the Kleene plus (+) matches it 1 or more times. For example, meo*w matches 'mew', 'meow', and 'meooow', while meo+w matches 'meow' but not 'mew'.
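For example, checked with re.fullmatch:

import re

for word in ['mew', 'meow', 'meooow']:
    print(word, bool(re.fullmatch(r'meo*w', word)))  # All True: * allows zero or more o's
print(bool(re.fullmatch(r'meo+w', 'mew')))           # False: + requires at least one o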
Grouping in regular expressions, using parentheses (), limits the scope of alternation. For instance, I love (baboons|gorillas) matches both 'I love baboons' and 'I love gorillas'.
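A short sketch showing that the parentheses confine the alternation to the animal names:

import re

pattern = r'I love (baboons|gorillas)'
print(bool(re.fullmatch(pattern, 'I love baboons')))   # True
print(bool(re.fullmatch(pattern, 'I love gorillas')))  # True
print(bool(re.fullmatch(pattern, 'I love parrots')))   # False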
Text preprocessing involves cleaning and preparing text data for analysis. In Python, libraries like NLTK and re are commonly used for this task.
Removing unwanted characters, or noise removal, involves cleaning text by stripping out unnecessary formatting or characters. For example, re.sub can strip punctuation:
import re

text = "Five fantastic fish flew off to find faraway functions. Maybe find another five fantastic fish? Find my fish with a function please!"

# Remove punctuation
result = re.sub(r'[\.\?\!\,\:\;\"]', '', text)
print(result)
# Output: Five fantastic fish flew off to find faraway functions Maybe find another five fantastic fish Find my fish with a function please
Tokenization is the process of splitting text into smaller pieces, called tokens, which can be words, phrases, or other elements.
from nltk.tokenize import word_tokenize

# Requires: nltk.download('punkt')
text = "This is a text to tokenize"
tokenized = word_tokenize(text)
print(tokenized)
# Output: ['This', 'is', 'a', 'text', 'to', 'tokenize']
Text normalization includes various tasks such as converting text to lowercase, stemming, lemmatizing, and removing common words like 'the'.
Stemming is the process of chopping off the ends of words to get their base forms. This helps in reducing different forms of a word to a common base.
from nltk.stem import PorterStemmer

tokenized = ["So", "many", "squids", "are", "jumping"]
stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokenized]
print(stemmed)
# Output: ['so', 'mani', 'squid', 'are', 'jump']
# Note: PorterStemmer lowercases by default, and stems are not always real words ('mani')
Lemmatization is the process of reducing words to their root form. It uses a dictionary to return the base form of a word.
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download('wordnet')
tokenized = ["So", "many", "squids", "are", "jumping"]
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]
print(lemmatized)
# Output: ['So', 'many', 'squid', 'are', 'jumping']
# Note: lemmatize() treats words as nouns unless told otherwise;
# lemmatizer.lemmatize('are', pos='v') returns 'be'
Stopword removal involves getting rid of common words like 'is', 'and', or 'the', which don't add much meaning to the text.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires: nltk.download('stopwords') and nltk.download('punkt')
# Define set of English stopwords
stop_words = set(stopwords.words('english'))

# Tokenize a sample sentence, then drop any token that is a stopword
word_tokens = word_tokenize("the monkeys like to eat bananas")
statement_no_stop = [word for word in word_tokens if word not in stop_words]
print(statement_no_stop)
# Output: ['monkeys', 'like', 'eat', 'bananas']
Part-of-speech tagging assigns a label to each word in a sentence, indicating whether it's a noun, verb, adjective, etc. This can help improve text analysis.
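As a sketch, NLTK's pos_tag can tag a tokenized sentence (the sample sentence is ours, and the tagger model must be downloaded first):

from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
tokens = word_tokenize("The monkeys like to eat bananas")
print(pos_tag(tokens))
# Output (approximately): [('The', 'DT'), ('monkeys', 'NNS'), ('like', 'VBP'),
#                          ('to', 'TO'), ('eat', 'VB'), ('bananas', 'NNS')]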