Getting text from a Word document file (suitable for Word v.10)

sumtec 2001-11-14 10:14:23
If you only want the implementation, please read the reply to this note. Here I want to describe the principle.

Because a Word document file (.doc, .dot) is saved in a "page format" (1 page = 512 bytes), text that gets broken into pieces is split into pieces 512 (or 510) bytes long.

Anyone who tries to pick the text out of a .doc file should be aware that the text is sometimes broken into many pieces. So there must be a table that tells you how to re-join these pieces into the complete text.

If you read the documentation of the .doc structure published by MS, you will find something called the "piece table". Unfortunately, it is too complex for us to understand easily.

To find something useful, look for the page that contains 0x0101 at position 509+n*512. (n is the page number, counting from 0; inside that page the mark sits at position 509, counting from 0 at the beginning of the page.)
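The search described above can be sketched as follows. This is only a minimal illustration: the helper name find_marker_page is hypothetical, and it assumes the whole file has already been loaded into memory and that the 0x0101 mark is stored as two 0x01 bytes at positions 509 and 510 of the page.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the index n of the first 512-byte page whose
// bytes 509 and 510 are both 0x01, or -1 if no complete page carries the mark.
long find_marker_page(const std::vector<unsigned char>& file)
{
    const std::size_t kPage = 512;
    for (std::size_t n = 0; (n + 1) * kPage <= file.size(); ++n) {
        const std::size_t base = n * kPage;   // first byte of page n
        if (file[base + 509] == 0x01 && file[base + 510] == 0x01)
            return static_cast<long>(n);
    }
    return -1;
}
```

Only complete pages are checked, so a truncated last page is ignored instead of being read past the end of the buffer.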

Once you find this page, the number of elements in it is the byte at position 511 (the last byte of the page).
You do not need to care what the elements actually mean. Each element (a 4-byte number) is the start position of a piece of text (main text, footnote, header and so on).
Before using an element from this page, add 512 to it, because the .doc file begins with a file header that is 512 bytes long.
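As a rough sketch of these three rules, the function below decodes one such page. The name read_start_offsets is hypothetical. It assumes the elements are stored from the beginning of the page as little-endian 4-byte values (the note does not state the byte order, but that is how an x86 program would read them) and that they never run into the mark at position 509.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: decodes the start positions stored in a marker page
// and converts them to offsets in the .doc file.
std::vector<std::uint32_t> read_start_offsets(const std::vector<unsigned char>& page)
{
    assert(page.size() == 512);                 // exactly one page is expected
    const std::size_t count = page[511];        // element count: last byte of the page
    std::vector<std::uint32_t> offsets;
    for (std::size_t k = 0; k < count && (k + 1) * 4 <= 509; ++k) {
        const unsigned char* p = &page[k * 4];
        // Assemble the 4-byte value byte by byte (little-endian assumption).
        const std::uint32_t v = static_cast<std::uint32_t>(p[0])
                              | static_cast<std::uint32_t>(p[1]) << 8
                              | static_cast<std::uint32_t>(p[2]) << 16
                              | static_cast<std::uint32_t>(p[3]) << 24;
        offsets.push_back(v + 512);             // skip the 512-byte file header
    }
    return offsets;
}
```

Building the value from single bytes avoids depending on the byte order of the machine that runs the code.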

Now you have a starting point. Here is what the elements mean, shown on an example layout of such a page:

... (Pages before it)


0x00000400 0x00000413 0x0000045F 0x00000813 0x00000829 0xFFFFFFFF (This DWORD might be something else, such as 0xFEFEFEFE or 0xFDFDFDFD. It does not matter.)
... (Invalid values. Not useful, or undocumented) ... 0x0101 ElementCount (1-byte number)
(End of this page)

... (Pages after it)


0x00000400 means something starts at position 0x00000600 in the .doc file.
0x00000413 means another thing starts at position 0x00000613 in the .doc file.
0x0000045F means something starts at position 0x0000065F.
0x00000813 means the previous text was broken at position 0x000006FF and continued at 0x00000A00, ending at 0x00000A12. At that point, something new starts at 0x00000A13.
0x00000829 means all the text ends at 0x00000A28 (and there is a return character at 0x00000A29).
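To check the arithmetic in this example, here is a small sketch that turns the element values into the 512-byte pages of the file that hold the text. The function name text_pages is hypothetical, and rounding each offset down to the start of its page is my reading of the example, not something the MS documentation is quoted as saying.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: maps element values to the file offsets of the
// 512-byte pages they fall in, dropping consecutive duplicates.
std::vector<std::uint32_t> text_pages(const std::vector<std::uint32_t>& elements)
{
    std::vector<std::uint32_t> pages;
    for (std::uint32_t e : elements) {
        // Add the 512-byte file header, then round down to the page start.
        const std::uint32_t page = (e + 512) / 512 * 512;
        if (pages.empty() || pages.back() != page)
            pages.push_back(page);
    }
    return pages;
}
```

For the five values in the example (0x400, 0x413, 0x45F, 0x813, 0x829) this yields two pages, 0x600 and 0xA00, which matches the positions where the text starts and where it continues.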

--- End of Note
sumtec 2001-11-14
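//demo only
//The text is not passed outside; please modify the code if you want to return it from the function.
#include <fstream>
#include <iostream>
#include <vector>
using namespace std;

void get_the_text(void)
{
    fstream f("1.doc",ios_base::binary|ios_base::in);
    long i,textpage,lastpage=-1;
    unsigned char pagemark[2],elementcount=0;
    unsigned int textpos=0; //assumes a 32-bit little-endian unsigned int, as on x86
    vector<long> pos;       //file offsets of the 512-byte pages that hold text
    bool found=false;
    size_t k;

    //look for the page whose bytes 509 and 510 hold the 0x0101 mark
    for (i=0+512-3;f.seekg(i,ios_base::beg)&&f.read((char*)pagemark,2);i+=512)
    {
        if (pagemark[0]==0x01&&pagemark[1]==0x01) {found=true;break;}
    }

    if (!found)
    {
        cout<<"Error!"<<endl;
        return; // deal with the error
    }

    f.read((char*)&elementcount,1); //byte 511 is the element count
    i-=509;                         //go back to the start of this page
    f.seekg(i,ios_base::beg);
    for (k=0;k<elementcount;k++)
    {
        f.read((char*)&textpos,4);
        //add the 512-byte file header, then round down to the start of the page
        textpage=(long)((textpos+512)/512*512);
        if (textpage!=lastpage)
        {
            pos.push_back(textpage);
            lastpage=textpage;
        }
    }

    //read each text page in order and join the pieces
    vector<char> AllText(pos.size()*512+2,0); //zero-filled, so the text ends with two 0 bytes
    for (k=0;k<pos.size();k++)
    {
        f.clear(); //a short read at the end of the file must not block the next seek
        f.seekg(pos[k],ios_base::beg);
        f.read(&AllText[k*512],512);
    }
    return;
}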
