Searching PDF content with iText: finding a keyword at a specified position

nohyes 2008-04-10 04:37:43

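How can I use iText to search the content of a PDF?

In other words, I want to read the PDF and find a keyword at a specified position.

For example:
find the two characters "金融" (finance) on the first page of a PDF file.

Can iText do this? If so, please advise, ideally with an example.

There are plenty of examples online of writing PDF files with iText, but none of reading them.

Thanks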

12 replies

rocbond 2012-04-26

Good resource, bookmarking it...

For_suzhen 2008-04-16

"but the Chinese text cannot be read"
=================
I wonder whether getBytes could help here.

nohyes 2008-04-16

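I tested this for quite a while.
The PDF content returned by the getPageContent method
seems to support only English, and only when it follows certain rules.
The text inside the () is the English characters in the PDF, but the Chinese text cannot be read.

Does anyone else know how to do this? I need to be able to read Chinese, as in the example in my original post: in a PDF document that contains the keyword "金融", I want to be able to read out the two characters "金融".

Thanks, everyone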
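
One likely reason the Chinese text does not show up: CJK text in a PDF is commonly drawn with a composite font, so the string in the content stream holds multi-byte character codes (often written as a hex string such as <12AB34CD>) rather than readable characters. Those codes only become Unicode through the font's ToUnicode mapping, which is why new String(...) or getBytes() on the raw page content cannot recover them. Below is a minimal sketch of that lookup, assuming 2-byte codes; the class name CidTextSketch, the codes 0x12AB and 0x34CD, and the hand-built map are all made up for illustration and are not read from any real PDF or iText API.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: decode a hex string operand through a ToUnicode-style lookup table. */
class CidTextSketch {

    /**
     * Decodes an operand such as "<12AB34CD>", assuming every character code
     * is exactly two bytes (four hex digits). Codes that are missing from the
     * map become the Unicode replacement character.
     */
    static String decode(String hexOperand, Map<Integer, String> toUnicode) {
        // Drop the angle brackets and any whitespace inside the hex string.
        String hex = hexOperand.replaceAll("[<>\\s]", "");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            int code = Integer.parseInt(hex.substring(i, i + 4), 16);
            String mapped = toUnicode.get(code);
            out.append(mapped != null ? mapped : "\uFFFD");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical mapping; in a real PDF it comes from the font's ToUnicode CMap.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x12AB, "\u91D1"); // first character of the keyword
        toUnicode.put(0x34CD, "\u878D"); // second character of the keyword

        String keyword = "\u91D1\u878D";
        String decoded = decode("<12AB34CD>", toUnicode);
        System.out.println(decoded.contains(keyword)); // prints: true
    }
}
```

In practice this mapping has to be read from each font used on the page, which is the job of a text-extraction library rather than of string handling on the page bytes.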

KingNE 2008-04-15

Bump.

nohyes 2008-04-15

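I tried the method from post #2:
getPageContent(1)
But what comes out is baffling English-looking text, the same result I got in my earliest attempts.

Here is an excerpt of the content:
BT
/F2 1 Tf
23.998 0 0 23.998 230.9807 479.0661 Tm
0 0 0 rg
/GS1 gs
-0.0001 Tc
0 Tw
[(Console)-251.1(G)2.9(uide)]TJ

.......the rest is omitted. It is all a jumble that I cannot make sense of.

Can anyone explain what this means?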
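
What getPageContent returns is the page's raw content stream, that is, drawing operators rather than plain text. In the excerpt, BT begins a text object, Tf selects font /F2, Tm sets the text matrix (scale and position), rg sets the fill color, gs applies a graphics state, Tc and Tw set character and word spacing, and the last line shows an array of string pieces separated by kerning adjustments. The visible text is only the parts in parentheses. Here is a rough sketch of pulling those parts out of one such array; the class name, the -200 word-gap threshold, and the simplified escape handling are assumptions for illustration, and it only applies to simple single-byte fonts.

```java
/** Sketch: pull the literal strings out of the operands of one text-showing operator. */
class ContentStreamTextSketch {

    // Assumed threshold: an adjustment more negative than this (in thousandths
    // of the font size) is wide enough to be treated as a gap between words.
    static final double WORD_GAP = -200.0;

    /** Extracts the text from operands such as "[(Console)-251.1(G)2.9(uide)]". */
    static String extractText(String operands) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < operands.length()) {
            char c = operands.charAt(i);
            if (c == '(') {
                // Literal string: copy characters up to the matching ')'.
                int depth = 1;
                i++;
                while (i < operands.length() && depth > 0) {
                    char d = operands.charAt(i);
                    if (d == '\\' && i + 1 < operands.length()) {
                        // Simplified: keeps the escaped character as-is, which is
                        // right for escaped parentheses and backslashes only.
                        out.append(operands.charAt(i + 1));
                        i += 2;
                        continue;
                    }
                    if (d == '(') {
                        depth++;
                    }
                    if (d == ')') {
                        depth--;
                        if (depth == 0) {
                            i++;
                            break;
                        }
                    }
                    out.append(d);
                    i++;
                }
            } else if (c == '-' || c == '.' || Character.isDigit(c)) {
                // Number between strings: a kerning adjustment (input assumed well-formed).
                int start = i;
                while (i < operands.length() && "+-.0123456789".indexOf(operands.charAt(i)) >= 0) {
                    i++;
                }
                double adjustment = Double.parseDouble(operands.substring(start, i));
                if (adjustment < WORD_GAP) {
                    out.append(' ');
                }
            } else {
                i++; // brackets and whitespace carry no text
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Console Guide
        System.out.println(extractText("[(Console)-251.1(G)2.9(uide)]"));
    }
}
```

The large negative number (-251.1) shifts the next glyph to the right by about a quarter of the font size, which is how this PDF draws the space between "Console" and "Guide"; the small 2.9 is ordinary kerning.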

nohyes 2008-04-15

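Sorry, I just got back.
I'll try the method from post #2 first. If it works, I'll award the points.

One question, though.
This method probably doesn't work for Chinese, does it? Does it also need iText's Far East (CJK) language pack?
If so, how do I use that pack when reading Chinese text?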

llpgy 2008-04-11

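Here is some code I wrote a while ago. It replaces a special placeholder string on the first page and on the last two pages of a PDF with the PDF's total page count.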

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfStamper;
import com.lowagie.text.pdf.PdfWriter;

public class ggg
{
    // Placeholder written into the PDF and later replaced by the page count
    static String PAGESTRING = "#$%@*";

    // Creates a test PDF with the placeholder near the start and near the end
    public void creatpdf(String filepath, int num)
    {
        // Write at least 1999 paragraphs so the document spans many pages
        if (num < 1999)
        {
            num = 1999;
        }
        // Create a Document object
        Document document = new Document();
        try
        {
            PdfWriter.getInstance(document, new FileOutputStream(filepath));

            // Add some metadata to the PDF document
            document.addTitle("Hello World example");
            document.addAuthor("Bruno Lowagie");
            document.addSubject("This example explains how to add metadata.");
            document.addKeywords("iText, Hello World, step 3, metadata");
            document.addCreator("My program using iText");
            // Open the document; content can be written from here on
            document.open();
            for (int i = 0; i < num; i++)
            {
                if (i == 2 || i == num - 2)
                {
                    document.add(new Paragraph(PAGESTRING));
                }
                document.add(new Paragraph("Hello World!===== " + i));
            }
        }
        catch (DocumentException de)
        {
            System.err.println(de.getMessage());
        }
        catch (IOException ioe)
        {
            System.err.println(ioe.getMessage());
        }

        // Close the document
        document.close();
    }

    // Replaces the first occurrence of the placeholder on the given page, if any
    private void replacePlaceholder(PdfReader reader, int pageNo, String pageNum) throws IOException
    {
        if (pageNo < 1)
        {
            return; // happens for p - 1 when the PDF has a single page
        }
        // ISO-8859-1 maps each byte to one char, so the stream round-trips unchanged
        String s = new String(reader.getPageContent(pageNo), StandardCharsets.ISO_8859_1);
        int pos = s.indexOf(PAGESTRING);
        if (pos != -1)
        {
            String ss = s.substring(0, pos) + pageNum + s.substring(pos + PAGESTRING.length());
            reader.setPageContent(pageNo, ss.getBytes(StandardCharsets.ISO_8859_1));
        }
    }

    public void editpdf(String sourFilePath, String destFilePath) throws IOException
    {
        PdfReader reader = new PdfReader(sourFilePath);
        try
        {
            int p = reader.getNumberOfPages();
            // Right-pad the page count with spaces to the placeholder's length
            StringBuilder pageNum = new StringBuilder(String.valueOf(p));
            while (pageNum.length() < PAGESTRING.length())
            {
                pageNum.append(' ');
            }
            // Check the first page and the last two pages
            replacePlaceholder(reader, 1, pageNum.toString());
            replacePlaceholder(reader, p - 1, pageNum.toString());
            replacePlaceholder(reader, p, pageNum.toString());
            // Write the modified pages out to the destination file
            PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(destFilePath));
            stamper.close();
        }
        catch (DocumentException de)
        {
            System.err.println(de.getMessage());
        }
        catch (IOException ioe)
        {
            System.err.println(ioe.getMessage());
        }
        finally
        {
            reader.close();
        }
    }

    // Deletes the source file and renames the destination file to the source name
    private void renameAndDelPdf(String soureFilePath, String destFilename)
    {
        File f = new File(soureFilePath);
        File df = new File(destFilename);
        if (f.exists() && f.isFile() && df.exists() && df.isFile())
        {
            if (f.delete())
            {
                if (!df.renameTo(new File(soureFilePath)))
                {
                    System.err.println("file rename error");
                }
            }
            else
            {
                System.err.println("file delete error");
            }
        }
        else
        {
            System.err.println("file does not exist or is not a regular file");
        }
    }

    public static void main(String[] args) throws IOException, DocumentException
    {
        ggg m = new ggg();

        // Time each step in milliseconds
        long t = System.currentTimeMillis();
        m.creatpdf("HelloWorld-old.pdf", 200000);
        long n = System.currentTimeMillis();
        System.out.println(n - t);

        t = System.currentTimeMillis();
        m.editpdf("HelloWorld-old.pdf", "HelloWorld-new.pdf");
        n = System.currentTimeMillis();
        System.out.println(n - t);

        t = System.currentTimeMillis();
        m.renameAndDelPdf("HelloWorld-old.pdf", "HelloWorld-new.pdf");
        n = System.currentTimeMillis();
        System.out.println(n - t);
    }
}

nohyes 2008-04-10