NLP Assignment 02: Course Design Report

0222-带蓝子学生 2023-06-26 21:48:02

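Assignment header

Course this assignment belongs to: Natural Language Processing

Where the assignment requirements are: <insert assignment link>

My goals for this course: background knowledge of Chinese sentiment analysis; analysis steps and workflow; text preprocessing methods such as Jieba word segmentation and stop-word removal, and their application; vector representations of text data

How this assignment helps me reach those goals: <insert specific aspect>

References: Project 2 reference document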

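1. Design purpose

Through this course design exercise, students deepen their understanding and command of the theory and practical skills of natural language processing, learn to combine them in designing and developing a real engineering project, and see how NLP algorithms are actually applied in such projects. This lays a foundation for later developing AI products independently or in support of engineers. Carrying out an integrated application project also develops teamwork and communication skills and the ability to analyze and solve complex engineering problems with modern tools; it guides students to understand and practice professional ethics and standards, and cultivates law-abiding, dedicated, honest, and innovative professional character and habits.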

2. Design requirements

2.1 Equipment and environment

A computer running 64-bit Windows.
Python 3.8.5.
The Jupyter Notebook editor.
The jieba and sklearn libraries.

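2.2 Design requirements

The course design consists of two main parts: the project itself and the written report. The project covers solution design, implementation in code, and testing. The report summarizes the theoretical design, the implementation process, and the test results in full, relating the practical work back to the underlying theory.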

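3. Design content

The goal is to identify spam messages in SMS text data. The model is built on SMS text data provided by a telecom operator. The required data is first extracted from the raw dataset and then preprocessed (duplicate removal, Chinese word segmentation, and stop-word filtering); word clouds are drawn to check the segmentation results; finally, a Naive Bayes classification model is built to identify spam messages accurately.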

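4. Design process

4.1 Background

With the rapid development of mobile communications and the spread of smartphones, people routinely receive many kinds of SMS messages, a large share of which are spam. Spam messages are typically sent for advertising, fraud, or scams. They take up storage on the user's phone and consume network resources, and they also harass users and expose them to security risks. Identifying and filtering spam effectively therefore matters both for protecting users and for improving the quality of the communication service.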

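4.2 Task analysis

Spam detection based on SMS text content consists of the following 6 steps.

Data collection and organization: collect and organize sample messages from various sources, including both spam and normal messages, and label them. Labeling is usually either single-label or multi-label.
Data preprocessing: preprocess the collected text, including Chinese word segmentation, stop-word removal, and text feature extraction. Commonly used tools include the jieba and NLTK libraries.
Feature representation: convert the preprocessed text into a form a computer can work with, for example by turning it into vectors with TF-IDF.
Model selection and training: choose a classifier suited to the task, such as Naive Bayes, a support vector machine, or a decision tree. Train it on the training set and tune it with methods such as cross-validation.
Model testing and evaluation: test the trained classifier on the test set, compute metrics such as the confusion matrix, accuracy, recall, and F1-score, and adjust and optimize the model as needed.
Model application: apply the trained classifier in a real setting and keep updating and optimizing it as needed.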
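Steps 4 and 5 above depend on keeping separate training and test sets. The report's own code uses `train_test_split` from sklearn for this; the sketch below only illustrates the idea in plain Python, splitting each class separately so that spam and normal messages keep the same proportions in both sets. The function name `stratified_split` and the toy data are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_ratio=0.2, seed=123):
    """Split (sample, label) pairs so each label keeps the same share in both sets."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label, items in by_label.items():
        items = items[:]  # copy before shuffling so the input is not modified
        rng.shuffle(items)
        n_test = int(round(len(items) * test_ratio))
        test += [(s, label) for s in items[:n_test]]
        train += [(s, label) for s in items[n_test:]]
    return train, test


# Toy data: 10 normal (label 0) and 10 spam (label 1) placeholder messages
samples = [f"msg{i}" for i in range(20)]
labels = [0] * 10 + [1] * 10
train, test = stratified_split(samples, labels)
print(len(train), len(test))  # 16 4
```

With a 0.2 test ratio, two messages from each class go to the test set and the remaining sixteen are used for training.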

4.3 Flowchart

4.4 Code implementation

4.4.1 Reading the data

4.4.2 Data preprocessing

4.4.2.1 Data deduplication
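The report's code removes repeated rows with the pandas `drop_duplicates` method. As a rough illustration of what that step does, the plain-Python sketch below keeps the first occurrence of each (label, message) pair and preserves the original order. The helper name `drop_duplicate_messages` and the sample rows are invented for the example.

```python
def drop_duplicate_messages(rows):
    """Keep the first occurrence of each (label, message) pair, preserving order."""
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(rows))


rows = [
    (1, "恭喜您中奖，请点击链接领取"),
    (0, "明天下午三点开会"),
    (1, "恭喜您中奖，请点击链接领取"),  # exact repeat of the first row
]
unique_rows = drop_duplicate_messages(rows)
print(len(rows), len(unique_rows))  # 3 2
```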

4.4.2.3 Loading and removing stop words
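A minimal sketch of this step, assuming a stop-word file with one word per line. Here `io.StringIO` stands in for the opened file, and the file contents and token list are hypothetical. Storing the stop words in a set rather than a list keeps each membership check fast when many messages are filtered.

```python
import io


def load_stopwords(lines):
    """Read one stop word per line into a set, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not stop words, in their original order."""
    return [t for t in tokens if t not in stopwords]


# io.StringIO stands in for an opened stop-word file (hypothetical contents)
fake_file = io.StringIO("的\n了\n是\n")
stopwords = load_stopwords(fake_file) | {"【", "】"}  # add custom entries

tokens = ["【", "银行", "】", "您", "的", "账单", "是", "100", "元"]
print(remove_stopwords(tokens, stopwords))
# ['银行', '您', '账单', '100', '元']
```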

4.4.2.4 Drawing word clouds
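The word clouds in this report are drawn from per-class word frequencies, which the main code passes to `generate_from_frequencies` as a dictionary. The sketch below shows one way to build such a dictionary with `collections.Counter`; the function name `words_count_by_label` and the toy token lists are assumptions made for the example.

```python
from collections import Counter


def words_count_by_label(token_lists, labels, label):
    """Count how often each word appears in the messages that carry `label`."""
    counter = Counter()
    for tokens, y in zip(token_lists, labels):
        if y == label:
            counter.update(tokens)
    return dict(counter)


token_lists = [["免费", "领取", "免费"], ["明天", "开会"], ["领取", "奖品"]]
labels = [1, 0, 1]
spam_freq = words_count_by_label(token_lists, labels, label=1)
print(spam_freq)  # {'免费': 2, '领取': 2, '奖品': 1}
```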

4.4.3 Bayes model

4.4.3.1 Importing the libraries needed for model training

4.4.3.2 Building the word-frequency matrix
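To make the word-frequency matrix and the TF-IDF weighting concrete, here is a small plain-Python sketch. It assumes the smoothed IDF formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is the default behaviour documented for sklearn's `TfidfTransformer`. Unlike `CountVectorizer`, this simplified version keeps single-character tokens. The function names and the three toy documents are illustrative only.

```python
import math
from collections import Counter


def count_matrix(docs):
    """docs: space-joined token strings. Returns (vocabulary, rows of word counts)."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {tok: j for j, tok in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, n in Counter(doc.split()).items():
            row[index[tok]] = n
        rows.append(row)
    return vocab, rows


def tfidf(rows):
    """Weight counts by smoothed IDF, then scale each row to unit L2 norm."""
    n_docs = len(rows)
    n_terms = len(rows[0])
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    out = []
    for row in rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0  # avoid dividing by zero
        out.append([v / norm for v in weighted])
    return out


docs = ["免费 领取 奖品", "明天 开会", "免费 开会"]
vocab, rows = count_matrix(docs)
weights = tfidf(rows)
print(vocab)
for row in weights:
    print([round(v, 3) for v in row])
```

Words that occur in fewer documents receive a larger IDF, so within a row they end up with a higher weight than words shared by many messages.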

4.4.3.3 Printing the confusion matrix and model accuracy
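As a reminder of how the reported metrics follow from the confusion matrix, the sketch below counts the four cells for a binary problem and derives accuracy, precision, recall, and F1 from them. Spam (label 1) is treated as the positive class; the label vectors are made-up values, not results from this project.

```python
def binary_confusion(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) for a binary classification result."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def scores(tn, fp, fn, tp):
    """Return (accuracy, precision, recall, f1) computed from the four cells."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Made-up labels: 1 = spam, 0 = normal
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
cells = binary_confusion(y_true, y_pred)
print(cells)  # (5, 1, 1, 3)
print(scores(*cells))
```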

4.4.4 Building the classifier, training, and predicting
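The design calls for a Naive Bayes classifier. The sketch below is a hand-written multinomial Naive Bayes with add-one (Laplace) smoothing over already-segmented messages, meant only to show what such a classifier computes. It is not the sklearn implementation, and the class name `TinyMultinomialNB` and the toy training data are invented for illustration.

```python
import math
from collections import Counter


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists with add-one smoothing."""

    def fit(self, token_lists, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for toks in token_lists for tok in toks}
        self.word_counts = {c: Counter() for c in self.classes}
        for toks, y in zip(token_lists, labels):
            self.word_counts[y].update(toks)
        class_docs = Counter(labels)
        # log P(class), estimated from the share of training messages per class
        self.log_prior = {c: math.log(class_docs[c] / len(labels)) for c in self.classes}
        # total number of word occurrences seen in each class
        self.total = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def predict_one(self, tokens):
        v = len(self.vocab)
        best, best_score = None, -math.inf
        for c in self.classes:
            score = self.log_prior[c]
            for tok in tokens:
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[c][tok] + 1) / (self.total[c] + v))
            if score > best_score:
                best, best_score = c, score
        return best


# Toy training data: 1 = spam, 0 = normal
train_tokens = [["免费", "领取", "奖品"], ["点击", "链接", "领取"],
                ["明天", "开会"], ["晚上", "回家", "吃饭"]]
train_labels = [1, 1, 0, 0]
model = TinyMultinomialNB().fit(train_tokens, train_labels)
print(model.predict_one(["免费", "领取"]))  # 1
print(model.predict_one(["明天", "吃饭"]))  # 0
```

Working in log space avoids multiplying many small probabilities, and the smoothing term keeps a word that is absent from one class from forcing that class's probability to zero.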

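5. Design summary

This experiment set out to train a model on message text to distinguish spam from normal SMS messages, and to compare how different models perform at spam detection. We used TF-IDF for feature extraction, trained three algorithms (Naive Bayes, decision tree, and support vector machine), and evaluated them on the test set.

The results show that Naive Bayes performed better on precision and F1, while the support vector machine did best on recall. We also used ROC curves and AUC values for a finer-grained evaluation, which showed the decision tree to be slightly weaker at identifying spam.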

Main code

import re  # used by the text-cleaning step below

import pandas as pd
import matplotlib.pyplot as plt
import jieba

# Read the raw SMS data (no header row; the first column is used as the index)
message = pd.read_csv('C:/Users/LENOVO/Desktop/pc222/message80W1.csv', header=None, index_col=0)
message.columns = ['label', 'message']
message.head()

# Count the messages in each class (label 0 = normal, label 1 = spam)
fracs = [message['label'].value_counts()[0], message['label'].value_counts()[1]]
fracs

# Pie chart of the class distribution
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese text
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly
labels = 'Normal SMS', 'Spam SMS'
plt.axes(aspect=1)
explode = [0, 0.1]
plt.title('Distribution of spam and normal SMS messages')
plt.pie(
    x=fracs,
    labels=labels,
    explode=explode,
    autopct='%1.1f%%',
    shadow=True,
    labeldistance=1.1,
    startangle=90,
    pctdistance=0.6,
    radius=1
)
plt.show()
plt.close()

# Sample 1,000 messages from each class
pos_sample = message[message['label'] == 0].sample(1000, random_state=123)
neg_sample = message[message['label'] == 1].sample(1000, random_state=123)

# Combine the samples
data_sample = pd.concat([pos_sample, neg_sample], axis=0)

# Remove duplicate rows
data_clean = data_sample.drop_duplicates()

# Print the result
print(f"Sample size after removing duplicates: {data_clean.shape[0]}")
print(f"Sample size before removing duplicates: {data_sample.shape[0]}")

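# Remove special characters from the deduplicated messages:
# the placeholder x, anything outside the common Chinese range, digits, and whitespace
data_clean = data_clean['message'].astype('str').apply(lambda x: re.sub('x|[^\u4E00-\u9FD5]|[0-9]|\\s|\\t', '', x))
data_clean.tail()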
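The cleaning pattern used above can be checked on a single string. The short sketch below applies the same regular expression to a made-up message; the helper name `clean_message` is hypothetical. Note that the negated character class `[^\u4E00-\u9FD5]` already matches `x`, digits, and whitespace, so the other alternatives only spell out the intent.

```python
import re

# Same pattern as in the report's cleaning step
PATTERN = 'x|[^\u4E00-\u9FD5]|[0-9]|\\s|\\t'


def clean_message(text):
    """Keep only characters in the range U+4E00 to U+9FD5."""
    return re.sub(PATTERN, '', str(text))


sample = '您的验证码是xxxx，请于5分钟内输入 '
print(clean_message(sample))  # 您的验证码是请于分钟内输入
```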

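# Segment each cleaned message into a list of words with jieba
data_cut = data_clean.apply(lambda x: list(jieba.cut(x)))

# Read the stop-word list
stopword_file = 'C:/Users/LENOVO/Desktop/pc222/stopword.txt'
# The separator string never occurs in the file, so every line is read as one field
stopword = pd.read_csv(stopword_file, sep='bingrong', encoding='gbk', header=None, engine='python')[0]

# Custom stop words
my_stopwords = [' ', ',', '会', '的', '】', '【', '月', '日']

# Combine the two stop-word lists
stopwords = list(stopword) + my_stopwords

# Remove stop words
data_delstop = data_cut.apply(lambda x: [i for i in x if i not in stopwords])
data_delstop.head()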

def data_process(file='C:/Users/LENOVO/Desktop/pc222/message80W1.csv'):
    """Read, sample, clean, segment, and stop-word-filter the SMS data."""
    message = pd.read_csv(file, header=None, index_col=0)
    message.columns = ['label', 'message']  # name the columns: label and message text

    # Sample 5,000 messages from each class
    pos = message[message['label'] == 0].sample(5000, random_state=123)
    neg = message[message['label'] == 1].sample(5000, random_state=123)
    new_data = pd.concat([pos, neg], axis=0)  # combine the samples

    # Preprocessing: clean, segment, and remove stop words
    data_clean = new_data['message'].astype('str').apply(lambda x: re.sub('x|[^\u4E00-\u9FD5]|[0-9]|\\s|\\t', '', x))
    jieba.load_userdict('C:/Users/LENOVO/Desktop/pc222/newdic1.txt')  # load the custom dictionary
    data_cut = data_clean.astype('str').apply(lambda x: list(jieba.cut(x)))
    stopwords_file = 'C:/Users/LENOVO/Desktop/pc222/stopword.txt'
    stopwords = pd.read_csv(stopwords_file, sep='bingrong', encoding='gbk', header=None)[0].tolist()
    stopwords += [' ', ',', '会', '的', '】', '【', '月', '日']
    data_delstop = data_cut.apply(lambda x: [i for i in x if i not in stopwords])

    # Print some useful information
    print('Samples in the raw dataset:', len(message))
    print('Samples after sampling:', len(new_data))
    print('Samples after removing unwanted characters:', len(data_clean))
    print('Samples after segmentation:\n', data_cut.head())
    print('Samples after stop-word removal:\n', data_delstop.head())

    labels = new_data.loc[data_delstop.index, 'label']  # re-align the labels with the processed index
    # Returns: space-joined text, token lists, labels
    return data_delstop.apply(lambda x: ' '.join(x)), data_delstop, labels

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

# Load the data with the custom preprocessing function data_process()
# (defined in the word segmentation and stop-word removal script)
adata, data_delstop, labels = data_process()

# Word-frequency count for one class; label is 0 or 1
def words_count(label=0):
    word_dict = {}
    for item in list(data_delstop[labels == label]):
        for i in item:
            if i not in word_dict:  # first occurrence of the word
                word_dict[i] = 1
            else:
                word_dict[i] += 1
    return word_dict

# print(words_count())

def WordCloud_plot(mask_picture='C:/Users/LENOVO/Desktop/pc222/duihuakuan.jpg'):
    p1 = plt.figure(figsize=(20, 10), dpi=80)  # canvas size
    image = Image.open(mask_picture)  # outline image
    graph = np.array(image)  # read it as a pixel array
    wc = WordCloud(background_color='White',  # background color
                   mask=graph,  # mask image that shapes the cloud
                   max_words=2000,  # maximum number of words shown
                   stopwords=STOPWORDS,  # stop words
                   font_path='C:/Users/LENOVO/Desktop/pc222/simhei.ttf',  # font file
                   max_font_size=100,  # largest font size
                   random_state=30)  # random seed, so the layout and colors are repeatable
    # Draw the word clouds for the label-0 and label-1 samples
    for i in [0, 1]:
        p1.add_subplot(1, 2, i + 1)
        wc.generate_from_frequencies(words_count(i))  # feed in the word frequencies
        plt.imshow(wc)  # draw
        plt.axis("off")  # hide the axes
    plt.show()  # display the figure

WordCloud_plot()

# Load libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, multilabel_confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Prepare the data with the custom data_process() function,
# which returns (space-joined text, token lists, labels)
dataAlready, _, labels = data_process()
train_data, test_data, train_y, test_y = train_test_split(dataAlready, labels, test_size=0.2)  # split the data

# Text feature extraction: word counts
vectorizer = CountVectorizer()
train_cv = vectorizer.fit_transform(train_data)  # fit the vocabulary on the training set
test_cv = vectorizer.transform(test_data)  # reuse the training vocabulary for the test set
print('Training count matrix shape:', train_cv.shape)
print('Test count matrix shape:', test_cv.shape)

# Convert the counts to TF-IDF weights
transformer = TfidfTransformer()
tfidf_train = transformer.fit_transform(train_cv)  # learn the IDF weights from the training set
tfidf_test = transformer.transform(test_cv)  # apply the same weights to the test set
print('Training TF-IDF matrix shape:', tfidf_train.shape)
print('Test TF-IDF matrix shape:', tfidf_test.shape)

# Build the classifier, then train and predict
clf = OneVsRestClassifier(LinearSVC())
clf.fit(tfidf_train, train_y)
res = clf.predict(tfidf_train)
print('Training accuracy:', clf.score(tfidf_train, train_y))
print('Test accuracy:', clf.score(tfidf_test, test_y))
print(classification_report(test_y, clf.predict(tfidf_test)))

# Confusion matrix on the training set: one 2x2 matrix per label, printed as a DataFrame
cm = multilabel_confusion_matrix(train_y, res)
for i, label in enumerate(clf.classes_):
    print('Confusion matrix for label:', label)
    print(pd.DataFrame(cm[i], columns=['Predicted negative', 'Predicted positive'], index=['Actual negative', 'Actual positive']))

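Reflections

Completing this practical project on content-based spam SMS detection gave me a deeper understanding of, and hands-on experience with, text classification and machine learning algorithms. I learned to use several machine learning algorithms and text preprocessing techniques in Python, such as TF-IDF feature extraction, the Naive Bayes classifier, and support vector machines, and how to implement them with the sklearn library. I also learned how to evaluate a classifier with a confusion matrix and a performance report, and how to tune algorithm parameters to improve the model. In practice I found that text classification is not easy: it requires a thorough understanding of the dataset and its labels, along with solid text and data processing skills. A task like spam detection also needs a large enough corpus to train on, and the model has to be refined and updated continually as the data changes. In addition, because classifiers in real use face problems such as skewed data, class imbalance, and overfitting, suitable preprocessing and model tuning are needed to improve accuracy, generalization, and robustness. This project laid a foundation for my future studies and also made me keenly aware of my own shortcomings.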
