How to count the number of Chinese characters

ijkh007 2006-01-23 02:16:49

A question suddenly occurred to me today:
How do you count the number of Chinese characters (hanzi) in a file?
How does a program like Word do it?

31 replies

losedxyz 2006-02-01

mark

睡在床板下_ 2006-01-31

mark

ikiki 2006-01-29

You don't necessarily have to single out Chinese characters. I think counting ASCII and wide characters is enough. My counting code is attached:

unsigned int ascii = 0, wide = 0, cr = 0, lf = 0, other = 0;
TCHAR * prev = text, * cur = text; // text is the input string
cur = CharNext(cur);
while (*prev)
{
    // The distance from prev to cur is the size in bytes of the current character
    if ((BYTE *)cur - (BYTE *)prev == sizeof(char)) {
        if ('\r' == *prev)
            cr++;
        else if ('\n' == *prev)
            lf++;
        else
            ascii++;
    }
    else if ((BYTE *)cur - (BYTE *)prev == sizeof(WCHAR)) {
        // Two bytes: a wide character, or an ASCII character stored as a WCHAR
        if (*reinterpret_cast<WCHAR *>(prev) < 0x80) {
            if ('\r' == *prev)
                cr++;
            else if ('\n' == *prev)
                lf++;
            else
                ascii++;
        }
        else
            wide++;
    }
    else
        other++;

    prev = cur;
    cur = CharNext(cur);
}
TCHAR buf[128];
// _stprintf matches TCHAR in both ANSI and Unicode builds; plain sprintf does not
_stprintf(buf, _T("Character count:\t\nSingle-byte: %u\nDouble-byte: %u\nCarriage returns: %u\nLine feeds: %u\nOther: %u\nTotal: %u"), ascii, wide, cr, lf, other, ascii + wide + cr + lf + other);

fireinsnow 2006-01-26

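const int INT_SIZE = 0xFFFF;
const int EIGHTBIT = 0x00FF;
/*
Name: GetLowEight
Pre: Takes an integer and returns an integer.
Post: Inline function; returns the low eight bits of the argument's binary representation.
*/
inline int GetLowEight(int num)
{
    //num += INT_SIZE; // Originally I added this first to make the number positive, which is why negative values ran into the 0
    //return num / EIGHTBIT;
    //return num & INT_SIZE & EIGHTBIT; // This line is simply wrong
    return num & EIGHTBIT;
}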

fireinsnow 2006-01-26

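Sorry, I had changed the code at one point. At first I was not using & to take the low 8 bits, which is why the code looked like this:
else
{ // Between the negative and the positive numbers there is a 0, while the codes after 127 are in fact contiguous,
  // i.e. 127 is followed by 128 rather than by a negative number, so add 1 to skip the 0
    int areaCode = GetLowEight(charater[n]) - CODE_VAL + 1;
    int placeCode = GetLowEight(charater[n+1]) - CODE_VAL + 1;
    code = areaCode * 100 + placeCode;
    //cout << charater[n] << charater[n+1] << ": " << code << "\t";
    ++ count;
    n += 2;
}
Now that the code uses & to take the low eight bits, the correct version should be:
else
{ // The comments that were here are no longer needed at all.
    int areaCode = GetLowEight(charater[n]) - CODE_VAL; // The +1 in these two places
    int placeCode = GetLowEight(charater[n+1]) - CODE_VAL; // is no longer needed either.
    code = areaCode * 100 + placeCode;
    //cout << charater[n] << charater[n+1] << ": " << code << "\t";
    ++ count;
    n += 2;
}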

fireinsnow 2006-01-26

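To LYH_Studio (李业华):

Thank you for the correction; you make a good point.

I wrote this code two months ago. The idea came from a program a friend had written in VB to look up zone/position codes (区位码), which I redid in C++. Then another friend wanted to count the characters in his graduation thesis and did not have Word, heh, so I adapted it for him. At the time I only knew that char could be positive or negative and had not thought of using unsigned char, so I spent a long time with a calculator before I came up with taking the low eight bits. Looking at it now, the code really is easier to understand written your way.

Thank you very much!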
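The two approaches compared in this post, casting to unsigned char and masking the low eight bits, can be put side by side in a short sketch. The helper names `ByteValue` and `LowEightBits` are hypothetical and used only for illustration; the second mirrors what `GetLowEight` does.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): reads one byte of a
// std::string as a value in 0..255, whether or not plain char is
// signed on the current compiler.
inline int ByteValue(const std::string& s, std::size_t i)
{
    return static_cast<unsigned char>(s[i]);
}

// The masking approach, equivalent to GetLowEight in the thread:
// keep only the low eight bits of the (possibly negative) value.
inline int LowEightBits(int num)
{
    return num & 0xFF;
}
```

For a byte such as 0xB9, both `ByteValue(s, i)` and `LowEightBits(s[i])` give 185, so the cast simply makes the intent visible without any bit arithmetic.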

du51 2006-01-25

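If the text contains only Chinese characters and English:
#include<stdio.h>
#include<stdlib.h>
int main()
{
    const char *file="我爱1212中12asdf国asdf我也爱C加加",*p;
    int i=0;
    p=file;
    /* Count bytes with the high bit set. Assumes a double-byte encoding
       such as GB2312, where both bytes of a Chinese character are >= 0x80.
       The cast avoids depending on whether char is signed. */
    while(*p){if((unsigned char)*p>=0x80)i++;p++;}
    printf("Number of Chinese characters: %d",i/2);
    system("PAUSE");
    return 0;
}
and that's all it takes.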

LYH_Studio 2006-01-25

(Below is a short summary of my own after reading everyone's posts above.)
To fireinsnow (喜欢蓝色):

int GetCharCode(string & charater)
{
    int n = 0;
    int code;
    int count = 0;

    while (charater[n] != '\0')
    {
        if (charater[n] >= 0)
        {
            code = charater[n];
            //cout << charater[n] << ": " << code << "\t";
            ++ count;
            ++ n;
        }
        else
        { // Between the negative and the positive numbers there is a 0, while the codes after 127 are in fact contiguous,
          // i.e. 127 is followed by 128 rather than by a negative number, so add 1 to skip the 0
            int areaCode = GetLowEight(charater[n]) - CODE_VAL + 1;
            int placeCode = GetLowEight(charater[n+1]) - CODE_VAL + 1;
            code = areaCode * 100 + placeCode;
            //cout << charater[n] << charater[n+1] << ": " << code << "\t";
            ++ count;
            n += 2;
        }
    }
    //cout << endl;
    return count;
}

In your line while (charater[n] != '\0'), I think changing charater[n] != '\0' to
const short GBCH = 0x80;
const short CODE_VAL = 0xA0;

static_cast<unsigned char/* or wchar_t */>(charater[n]) > GBCH
might be better in terms of readability ^_^.

int areaCode = GetLowEight(charater[n]) - CODE_VAL + 1;
int placeCode = GetLowEight(charater[n+1]) - CODE_VAL + 1;
can also be written as:
int areaCode = static_cast<unsigned char/* or wchar_t */>(charater[n]) - CODE_VAL;
int placeCode = static_cast<unsigned char/* or wchar_t */>(charater[n+1]) - CODE_VAL;

If there is anything wrong with the above, I hope the experts will point it out!

fireinsnow 2006-01-24

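That doesn't seem right. The two bytes that make up a Chinese character are greater than 128, and in fact greater than 160. The zone/position code (区位码) is a code in one-to-one correspondence with Chinese characters, written as four digits: the first two, from 01 to 94, are the zone code, and the last two, from 01 to 94, are the position code. The first half of a Chinese character is the byte whose value is 160 (0xA0) + zone code, and the second half is the byte whose value is 160 (0xA0) + position code. In other words, subtract 160 (0xA0) from each of the two bytes and combine the results to get the zone/position code. For example, for "国", subtracting 160 (0xA0) from its two bytes gives 25 and 90, so the zone/position code of "国" is 2590.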
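The arithmetic described in this post can be sketched in a few lines of C++. This is only an illustration, assuming the text is GB2312-encoded; the function name `QuWeiCode` is hypothetical and does not come from the thread.

```cpp
// Hypothetical helper (not from the thread): computes the four-digit
// zone/position code from the two bytes of one GB2312-encoded character.
// Assumes both bytes lie in the range 0xA1..0xFE.
int QuWeiCode(unsigned char first, unsigned char second)
{
    const int kOffset = 0xA0;         // 160, as described in the post
    int zone = first - kOffset;       // zone code, 01..94
    int position = second - kOffset;  // position code, 01..94
    return zone * 100 + position;
}
```

For "国", whose GB2312 bytes are 0xB9 and 0xFA, `QuWeiCode(0xB9, 0xFA)` evaluates to 2590, matching the example in the post.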

hudaojin 2006-01-24

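Count the bytes whose value is above 128. If two in a row are above 128, count them as one, because the first byte of a Chinese character is always greater than 128, while the second byte may be either greater than or less than 128.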
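One way to read the rule in this post is: whenever a byte above 0x80 is found, treat it and the byte after it as one character, whatever the second byte's value. The sketch below follows that reading. It assumes a GBK-like double-byte encoding, and the name `CountDoubleByteChars` is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): counts double-byte
// characters in a string assumed to use a GBK-like encoding, where a
// byte above 0x80 starts a two-byte character and the second byte may
// be above or below 0x80.
std::size_t CountDoubleByteChars(const std::string& text)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < text.size())
    {
        unsigned char byte = static_cast<unsigned char>(text[i]);
        if (byte > 0x80 && i + 1 < text.size())
        {
            ++count;  // lead byte plus one trailing byte
            i += 2;   // skip the trailing byte whatever its value
        }
        else
        {
            ++i;      // single-byte character, or a stray lead byte at the end
        }
    }
    return count;
}
```

Note that this counts every double-byte character, including full-width punctuation, so it is an upper bound on the number of actual Chinese characters.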

fireinsnow 2006-01-24

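I once wrote some code that was originally meant to compute the zone/position code of Chinese characters and later ended up being used for counting characters. I'm a beginner, so it may not be entirely correct; corrections from the experts are welcome.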

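// GB2312.h
/*
Name: GetGB2312
Copyright: (c) 2005
Author: Melody
Date: 14-12-05 21:13
Description: The zone/position code is a code in one-to-one
correspondence with Chinese characters, written as four digits.
The first two, from 01 to 94, are the zone code; the last two,
from 01 to 94, are the position code.
The first half of a Chinese character is the byte with value
160 (0xA0) + zone code, and the second half is the byte with value
160 (0xA0) + position code.
In other words, subtract 160 (0xA0) from each of the two bytes
and combine the results to get the zone/position code.
For example, for "国", subtracting 160 (0xA0) gives 25 and 90,
so the zone/position code of "国" is 2590.
Each half of a Chinese character has a value above 127, which is
read back as a negative number, so we work on its two's-complement
bit pattern to recover the correct value above 127.
*/
#ifndef GB2312_H_
#define GB2312_H_
#include <string>

const int EIGHTBIT = 0x00FF;
const int CODE_VAL = 0x00A0;

using std::string;

/*
Name: GetLowEight
Pre: Takes an integer and returns an integer.
Post: Inline function; returns the low eight bits of the argument's binary representation.
*/
inline int GetLowEight(int num)
{
    return num & EIGHTBIT;
}

int GetCharCode(string &); // declaration of the function that displays the GB code

#endif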

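// GB2312.cpp
#include "GB2312.h"
/*
Name: GetCharCode
Pre: Takes a string; returns the number of characters counted.
Post: If the byte value is below 128, displays the character's ASCII code and moves on to the next byte.
If the byte value is above 127, processes two consecutive bytes: the zone code is computed
from the first byte and the position code from the second, and the zone/position code (GB code)
of the Chinese character is displayed. Then moves on to the byte after the Chinese character.
*/
int GetCharCode(string & charater)
{
    int n = 0;
    int code;
    int count = 0;

    while (charater[n] != '\0')
    {
        if (charater[n] >= 0)
        {
            code = charater[n];
            //cout << charater[n] << ": " << code << "\t";
            ++ count;
            ++ n;
        }
        else
        { // Between the negative and the positive numbers there is a 0, while the codes after 127 are in fact contiguous,
          // i.e. 127 is followed by 128 rather than by a negative number, so add 1 to skip the 0
            int areaCode = GetLowEight(charater[n]) - CODE_VAL + 1;
            int placeCode = GetLowEight(charater[n+1]) - CODE_VAL + 1;
            code = areaCode * 100 + placeCode;
            //cout << charater[n] << charater[n+1] << ": " << code << "\t";
            ++ count;
            n += 2;
        }
    }
    //cout << endl;
    return count;
}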

//main.cpp
/*
Name: CountWords
Copyright: (c) 2005
Author: Melody
Date: 15-12-05 22:50
Description: Reads a text file and counts its characters
(full-width and half-width characters each count as one),
its lines, its whitespace characters, and its characters
excluding whitespace. The results are shown on screen and
also appended to the file CountWords.log.

*/

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <cctype>
#include "GB2312.h"
using namespace std;

int main(int argc, char *argv[])
{
    int countHaveBlank = 0;
    int countNoneBlank = 0;
    int countLine = 0;
    int countBlank = 0;
    string words;

    if (argc != 2) // exactly one file name is required; argv[1] is null when argc == 1
        cout << "Specify the file you want to open\n";
    else
    {
        ifstream inFile;
        inFile.open(argv[1]);
        if (!inFile.is_open())
        {
            cerr << "Failed to open the file\n";
            exit(1);
        }
        // Testing getline directly avoids counting an extra line at end of file
        while (getline(inFile, words))
        {
            for (int num = 0; words[num] != '\0'; ++ num)
            {
                switch (words[num])
                {
                case ' ' : case '\t' : case '\n' :
                    countBlank ++;
                    break;
                default :
                    break;
                }
                // if (isspace(words[num])) // The text contains Chinese characters, so this
                //     countBlank ++;       // function is not suitable here.
            }
            ++ countLine;
            countHaveBlank += GetCharCode(words);
        }
        countNoneBlank = countHaveBlank - countBlank;
        inFile.close();
        ofstream outFile;
        outFile.open("CountWords.log", ios_base::out | ios_base::app);
        outFile << argv[1] << " opened successfully\n";
        outFile << "====================================================\n";
        outFile << argv[1] << " character count\n";
        outFile << "Total:\t" << countLine << "\tlines\n";
        outFile << "Total:\t" << countHaveBlank << "\tcharacters\n";
        outFile << "Total:\t" << countBlank << "\twhitespace characters\n";
        outFile << "Total:\t" << countHaveBlank << "\tcharacters (with spaces)\n";
        outFile << "Total:\t" << countNoneBlank << "\tcharacters (without spaces)\n";
        outFile << "====================================================\n\n";
        outFile.close();

        cout << argv[1] << " character count\n";
        cout << "Total:\t" << countLine << "\tlines\n";
        cout << "Total:\t" << countHaveBlank << "\tcharacters\n";
        cout << "Total:\t" << countBlank << "\twhitespace characters\n";
        cout << "Total:\t" << countHaveBlank << "\tcharacters (with spaces)\n";
        cout << "Total:\t" << countNoneBlank << "\tcharacters (without spaces)\n";
    }

    return 0;
}

Kid4you 2006-01-24

Using 0x80 works well.

iawenll 2006-01-24

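I implemented it with wide characters. Take a look and see whether it is usable:
#include <fstream>
#include <iostream>
#include <vector>
using namespace std;

int main( )
{
    wifstream infile("words.txt"); // wide-character file stream
    if(!infile){
        cerr<<"Error: open file error!\n";
        return -1;
    }

    wchar_t words;
    unsigned txt_cnt=0;
    vector<wchar_t> vec; // the container is optional; it is kept here for displaying the text later
    while(infile>>words){
        vec.push_back(words);
        if(words>128)
            txt_cnt++;
    }

    wcout<<L"Chinese characters: "<<txt_cnt/2<<endl; // For some reason each Chinese character is
                                                      // still read as 2 bytes, so remember to divide by 2 here

    return 0;
}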

chenzhichao2008 2006-01-24

A small change: if( (unsigned char)(*p) > 0x80 )

fireinsnow 2006-01-24

Friends, a question: in memory, is it the high 8 bits or the low 8 bits of a character that represent the address?

逸学堂 2006-01-24

On a Unicode system, Chinese characters and other characters alike are two bytes each.
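If the text is already held as two-byte (UTF-16) units, as this post suggests, counting Chinese characters becomes a range check on each unit. The sketch below is one possible approach, not something taken from the thread: it assumes the text has been decoded into a `std::u16string`, counts only the CJK Unified Ideographs block (U+4E00 to U+9FFF), and uses the hypothetical name `CountCjkIdeographs`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): counts UTF-16 code units
// that fall in the CJK Unified Ideographs block, U+4E00..U+9FFF.
// Characters outside that block (extension blocks, full-width
// punctuation) are deliberately not counted in this sketch.
std::size_t CountCjkIdeographs(const std::u16string& text)
{
    std::size_t count = 0;
    for (char16_t unit : text)
    {
        if (unit >= 0x4E00 && unit <= 0x9FFF)
            ++count;
    }
    return count;
}
```

Unlike the byte-based approaches, no division by two and no lead-byte test is needed, because each character in that block occupies exactly one 16-bit unit.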

chenzhichao2008 2006-01-24

int count( char *str )
{
int num = 0;
char *p = str;
while( *p )
{
if( *p > 0x80 )
{
p+=2;
num++;
}
else
{
p++;
}
return num;
}
}

chenzhichao2008 2006-01-24

The high byte of every Chinese character is greater than 0x80.
This condition can be used to tell Chinese characters apart from ASCII.

chenzhichao2008 2006-01-24

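int count( char *str )
{
    int num = 0;
    char *p = str;
    while( *p )
    {
        // Cast first: plain char may be signed, so bytes above 0x7F would compare as negative.
        // The p[1] check keeps p from stepping past the terminator on a stray lead byte.
        if( (unsigned char)(*p) > 0x80 && p[1] )
        {
            p+=2;
            num++;
        }
        else
        {
            p++;
        }
    }
    return num;
}

// Annoying, I actually put the return inside the while loop.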

oosky2004 2006-01-24