Extract text
Pull specific parts of a text column, such as words or character positions.
Example Data
Follow along with right out of the box example data. Copy following data in the information request of the agent you are working in.
Text columns often contain more structure than meets the eye. For example, a company name might include a legal suffix like "S.r.l." or "Inc.", and we might want to extract just that part. Or we might want the first word of a description, or the last few characters of a code. These operations help turn messy strings into useful features.
In this section, we’ll learn how to extract specific words or character positions from text using Python.
Extract characters from the start or end
Excel
In Excel, you use =LEFT()
or =RIGHT()
to extract characters from the beginning or end of a string.
t0 Prompt
Get the first 5 letters of the seller name
Extract the last 4 characters from the buyer column
Show me the beginning of each company name
Code
The python code looks as follows:
Extract words by position
Excel
In Excel, you use =MID()
or =TEXTSPLIT()
to get words by position, often with helper formulas like FIND()
.
t0 Prompt
Extract the first word of the seller column
Give me the second word in buyer
Get just the legal suffix like Inc., AB, or S.r.l.
Code
The python code looks as follows:
Extract text using custom rules
Excel
In Excel, complex extractions often require nested formulas, like =MID(A1,FIND(" ",A1)+1,...)
.
t0 Prompt
Get everything after "Solara Dynamics"
Remove the first word from the name
Extract custom text pattern X
Code
The python code looks as follows:
Function | Description |
---|---|
str[:N] | Extract first N characters |
str[-N:] | Extract last N characters |
str.split().str[i] | Get the i-th word (0 = first) |
str.replace("x", "") | Replace or remove specific text |