Preprocessing text data is the key step that turns raw text into input suitable for machine learning or deep learning models. The following is a series of common text preprocessing operations:
First, collect text data from various sources, such as web pages, documents, and databases. If the data is scattered across multiple files or data sources, it needs to be consolidated into a single unified dataset.
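As a minimal sketch of this consolidation step (the file names below are placeholders, not from the original text), several plain-text files can be read into one list of documents:

from pathlib import Path

# Placeholder paths for illustration; replace with your own data sources
source_files = [Path("data/part1.txt"), Path("data/part2.txt")]

documents = []
for path in source_files:
    # Each file is read as one document and appended to the combined dataset
    documents.append(path.read_text(encoding="utf-8"))

print(len(documents))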
A common cleaning step is removing punctuation and other special characters, for example with a regular expression:

import re

text = "Hello! How are you?"
# Keep only word characters and whitespace
cleaned_text = re.sub(r'[^\w\s]', '', text)
print(cleaned_text)  # Hello How are you
If the text contains HTML tags, you can use the BeautifulSoup library to strip them. For example:

from bs4 import BeautifulSoup

html_text = "<p>Hello, <b>world</b>!</p>"
soup = BeautifulSoup(html_text, "html.parser")
# get_text() returns only the text content, with all tags removed
cleaned_text = soup.get_text()
print(cleaned_text)  # Hello, world!
Stop words are high-frequency words (such as "the", "is", "a") that carry little meaning; they can be removed using the stop-word list in the nltk library. Example code:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

text = "This is a sample sentence."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)
# Keep only tokens that are not in the stop-word list
filtered_text = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered_text)  # ['sample', 'sentence', '.']
Tokenization splits text into individual words or tokens. The method differs by language: English text can simply be split on whitespace, while Chinese text can be segmented with a library such as jieba. For example:

import jieba

chinese_text = "我爱自然语言处理"
# lcut() returns the segmented words as a list
words = jieba.lcut(chinese_text)
print(words)
Stemming reduces words to their stem (root form); you can use the PorterStemmer from the nltk library. Example:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runs", "run"]
# All three forms are reduced to the same stem
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)  # ['run', 'run', 'run']
Lemmatization maps words to their dictionary (base) form; you can use the WordNetLemmatizer from the nltk library. Example:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
words = ["better", "running", "mice"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)  # ['better', 'running', 'mouse']
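Note that lemmatize() treats words as nouns by default, which is why "better" and "running" come back unchanged above. Passing the part of speech through the pos argument gives the expected lemmas:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# 'v' marks a verb, 'a' an adjective; the default pos is 'n' (noun)
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good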
To convert words into numerical vectors (word embeddings), you can use the gensim library to train a Word2Vec model. Example:

from gensim.models import Word2Vec

sentences = [["hello", "world"], ["goodbye", "world"]]
model = Word2Vec(sentences, min_count=1)
# Look up the learned vector for a single word
vector = model.wv['hello']
print(vector)
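The next step refers to the "encoded" text data; as one common way to build such an encoding (a hedged sketch, not something specified in the original article), each sentence can be represented by the average of its words' Word2Vec vectors:

import numpy as np
from gensim.models import Word2Vec

sentences = [["hello", "world"], ["goodbye", "world"]]
model = Word2Vec(sentences, min_count=1)

def sentence_vector(tokens, model):
    # Average the vectors of the in-vocabulary tokens; fall back to zeros if none
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

print(sentence_vector(["hello", "world"], model).shape)  # (100,)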
Split the encoded text data into training, validation, and test sets, usually in a 70%-15%-15% or 80%-10%-10% ratio. In Python, you can use the train_test_split function from the sklearn library. Example:

from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 1, 0, 1])
# Hold out 20% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train, y_train, X_test, y_test)
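The call above yields only a train/test split; to also carve out the validation set mentioned earlier (for example 80%-10%-10%), a common approach is to call train_test_split twice. A minimal sketch, not from the original article:

from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# First split off 20% of the data as a temporary hold-out set
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
# Then split the hold-out set in half: 10% validation, 10% test
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 8 1 1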
With the steps above, you can preprocess raw text data into a format suitable for machine learning or deep learning models.