How can I read the text in a PDF document from a Java program?

leonzhao 2002-04-06 02:39:36
Some code snippets would be ideal.
24 replies
thales 2002-04-17
Following this thread. I'm currently looking into FOP as well.
leonzhao 2002-04-17
To badkid2001:

It worked. Awarding the points now!
charysun 2002-04-16
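I once worked on a project that generated PDF documents in Java, with support for multiple languages. To do this kind of project you have to read and understand the
PDF Specification. My advice is to go and read it; some things can't be rushed.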
yuhan 2002-04-16
I have written files in the PDF format before, and I have read them back as well. I read them as streams.
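The post above gives no details. If the streams it mentions are PDF stream objects, their data is often compressed with the FlateDecode filter, which is the zlib/deflate format that java.util.zip can read. Here is a minimal sketch, assuming the raw bytes between the stream and endstream keywords have already been cut out of the file. FlateDemo, inflate and deflate are names made up for this illustration; deflate exists only so that the example can produce its own test data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for illustration; not part of any PDF library.
final class FlateDemo {

    /** Decompresses zlib/deflate data such as the body of a FlateDecode stream. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and nothing left to feed in: the data is incomplete.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Compresses data the same way, only so the demo has something to inflate. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "BT (Hello) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        byte[] restored = inflate(deflate(original));
        System.out.println(new String(restored, StandardCharsets.ISO_8859_1));
    }
}
```

This only covers FlateDecode with no predictor. Streams that use other filters, such as LZWDecode, would need different handling.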
jimjxr 2002-04-16
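Hmm, PJ has very little documentation and I've run out of time to experiment, so I still don't know how to get at the body text. Sorry. PJ's API appears to be modeled on the structure of a PDF file, so if you understand that structure you could study the API documentation.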
badkid2001 2002-04-16
I came across this a while ago; not sure whether it's of any use.
Re: How to read a PDF file's content using Etymon PJ?

Body: try this code:

import java.io.*;
import java.util.*;
import com.etymon.pj.*;
import com.etymon.pj.object.*;
import com.etymon.pj.exception.*;

/**
 * This is a wrapper for the Pj PDF parser
 */
public class PjWrapper {

    Pdf pdf;
    PjCatalog catalog;
    PjPagesNode rootPage;

    public PjWrapper(String pdfFileName) throws IOException, PjException {

        pdf = new Pdf(pdfFileName);

        // hopefully the catalog can never be a reference...
        catalog = (PjCatalog) pdf.getObject(pdf.getCatalog());

        // root node of pages tree is specified by a reference in the catalog
        rootPage = (PjPagesNode) pdf.resolve(catalog.getPages());
    }

    public static void main(String[] args) throws IOException, PjException {

        if (args.length < 1) {
            System.err.println("usage: java PjWrapper <file.pdf>");
            return;
        }

        PjWrapper testWrapper = new PjWrapper(args[0]);

        // getAllText() prints the text of each stream as it goes
        LinkedList textList = testWrapper.getAllText();
        System.out.println(textList.size() + " content stream(s) contained text");
    }

    /**
     * Returns as much text as we can extract from the PDF.
     * This currently includes the literal strings, written in
     * parentheses, found in each page's content streams.
     *
     * NOTE: Pj does not support LZW, so some text in some PDFs may not
     * be indexable
     */
    public LinkedList getAllText() throws PjException {

        LinkedList stringList = new LinkedList();
        Iterator streamIter = getAllContentsStreams().iterator();
        PjStream stream;
        String streamData;
        String streamText;
        int textStart, textEnd;

        //System.out.println("Going through streams...");

        while (streamIter.hasNext()) {

            //System.out.println("Getting next stream");
            stream = (PjStream) streamIter.next();

            //System.out.println("Adding text from stream with filter: " + getFilterString(stream));
            stream = stream.flateDecompress();

            //System.out.println("Adding text from stream with filter after decompress: " + getFilterString(stream));
            streamData = new String(stream.getBuffer());

            streamText = new String();

            // start at -1 so that the first search begins at index 0
            textEnd = -1;

            // naive scan: pairs each '(' with the next ')', so escaped or
            // nested parentheses inside a string are not handled
            while (true) {

                if ((textStart = streamData.indexOf('(', textEnd + 1)) < 0) {
                    break;
                }

                if ((textEnd = streamData.indexOf(')', textStart + 1)) < 0) {
                    break;
                }

                try {
                    streamText += PjString.decodePdf(streamData.substring(textStart, textEnd + 1));
                } catch (Exception e) {
                    System.out.println("malformed string: " + streamData.substring(textStart, textEnd + 1));
                }
            }

            //if(streamText.equals("inserted text"))
            System.out.println(streamText);

            if (streamText.length() > 0)
                stringList.add(streamText);
        }

        return stringList;
    }

    public static String getFilterString(PjStream stream) throws PjException {

        String filterString = new String();
        PjObject filter;
        //System.out.println("getting filter from dictionary");

        if ((filter = stream.getStreamDictionary().getFilter()) == null) {
            //System.out.println("Got null filter");
            return "";
        }
        //System.out.println("got it");

        // filter should either be a name or an array of names
        if (filter instanceof PjName) {

            //System.out.println("getting filter string from simple name");
            filterString = ((PjName) filter).getString();
        } else {

            //System.out.println("getting filter string from array of names");
            Iterator nameIter;
            Vector nameVector;

            if ((nameVector = ((PjArray) filter).getVector()) == null) {
                //System.out.println("got null vector for list of names");
                return "";
            }

            nameIter = nameVector.iterator();

            while (nameIter.hasNext()) {

                filterString += ((PjName) nameIter.next()).getString();

                if (nameIter.hasNext())
                    filterString += " ";
            }
        }

        //System.out.println("got filter string");

        return filterString;
    }

    /**
     * Performs a post-order traversal of the pages tree
     * from the root node and gets all of the contents streams
     * @return a list of all the contents of all the pages
     */
    public LinkedList getAllContentsStreams() throws InvalidPdfObjectException {

        return getContentsStreams(getAllPages());
    }

    /**
     * Get contents streams from the list of PjPage objects
     * @return a list of all the contents of the pages
     */
    public LinkedList getContentsStreams(LinkedList pages) throws InvalidPdfObjectException {

        LinkedList streams = new LinkedList();
        Iterator pageIter = pages.iterator();
        PjObject contents;

        while (pageIter.hasNext()) {
            contents = pdf.resolve(((PjPage) pageIter.next()).getContents());

            // should only be a stream or an array of streams (or refs to streams)

            if (contents instanceof PjStream)
                streams.add(contents);
            else {
                Iterator streamsIter = ((PjArray) contents).getVector().iterator();

                while (streamsIter.hasNext())
                    streams.add(pdf.resolve((PjObject) streamsIter.next()));
            }
        }

        return streams;
    }

    /**
     * Performs a post-order traversal of the pages tree
     * from the root node.
     * @return a list of all the PjPage objects
     */
    public LinkedList getAllPages() throws InvalidPdfObjectException {

        LinkedList pages = new LinkedList();
        getPages(rootPage, pages);
        return pages;
    }

    /**
     * Performs a post-order traversal of the pages tree
     * from the node passed to it, adding every PjPage object
     * found under node to the pages list.
     */
    public void getPages(PjObject node, LinkedList pages) throws InvalidPdfObjectException {

        PjPagesNode pageNode = null;

        // let's hope pdf's don't have pointers to pointers

        if (node instanceof PjReference)
            pageNode = (PjPagesNode) pdf.resolve(node);
        else
            pageNode = (PjPagesNode) node;

        if (pageNode instanceof PjPage) {
            pages.add(pageNode);
            return;
        }

        // kids better be an array and not a reference to one

        Iterator kidIterator = ((PjArray) ((PjPages) pageNode).getKids()).getVector().iterator();

        while (kidIterator.hasNext()) {
            getPages((PjObject) kidIterator.next(), pages);
        }
    }

    public Pdf getPdf() {
        return pdf;
    }
}
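
A note on the string scan in getAllText(): it pairs each '(' with the next ')', which goes wrong when a string contains escaped or nested parentheses. The PDF format allows both inside a literal string, along with backslash escapes such as \n and octal codes such as \101. Here is a minimal sketch of a stricter scanner in plain Java. PdfLiteralStrings and extract are names made up for this illustration and are not part of PJ, and the input is assumed to be an already-decompressed content stream held in a String.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of PJ.
final class PdfLiteralStrings {

    /** Returns the contents of every (...) literal string, with escapes resolved. */
    static List<String> extract(String content) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < content.length()) {
            if (content.charAt(i) != '(') {
                i++;
                continue;
            }
            i++;                              // skip the opening '('
            int depth = 1;                    // unescaped parentheses may nest
            StringBuilder text = new StringBuilder();
            while (i < content.length() && depth > 0) {
                char c = content.charAt(i++);
                if (c == '\\' && i < content.length()) {
                    i = readEscape(content, i, text);
                } else if (c == '(') {
                    depth++;
                    text.append(c);
                } else if (c == ')') {
                    depth--;
                    if (depth > 0) text.append(c);
                } else {
                    text.append(c);
                }
            }
            if (depth == 0) result.add(text.toString());   // drop unterminated strings
        }
        return result;
    }

    /** Decodes the escape whose first character is at index i; returns the next index. */
    private static int readEscape(String s, int i, StringBuilder out) {
        char e = s.charAt(i++);
        if (e == 'n') out.append('\n');
        else if (e == 'r') out.append('\r');
        else if (e == 't') out.append('\t');
        else if (e == 'b') out.append('\b');
        else if (e == 'f') out.append('\f');
        else if (e >= '0' && e <= '7') {
            // octal character code, one to three digits
            int value = e - '0';
            int digits = 1;
            while (digits < 3 && i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '7') {
                value = value * 8 + (s.charAt(i++) - '0');
                digits++;
            }
            out.append((char) (value & 0xFF));
        } else if (e == '\r' || e == '\n') {
            // backslash at end of line: the line break is not part of the string
            if (e == '\r' && i < s.length() && s.charAt(i) == '\n') i++;
        } else {
            out.append(e);                    // covers \( \) \\ and unknown escapes
        }
        return i;
    }

    public static void main(String[] args) {
        for (String s : extract("BT (Hello \\(PDF\\)) Tj (a(b)c) Tj ET")) {
            System.out.println(s);
        }
    }
}
```

The sketch ignores hex strings written as <...> and does not apply font encodings, so what it returns is the raw string content and may not be readable text for every PDF. Binary data embedded in a content stream could also contain stray parentheses.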

lithium 2002-04-16
Following.
javaservlet 2002-04-16
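This will probably be hard to solve.
I once wanted to write my own PDF-parsing library too,
but after reading through the PDF file format once, I gave up.
It would take too long, and there wasn't much point.
If you're interested, contact me: xueaihua@chinaren.com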
leonzhao 2002-04-16
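To yuhan:

Could you describe how you did the reading?

To everyone:

If the answer is that I should read the format specification and work it out myself, why would I be asking here? I would have just gone and read it.

If anyone knows, please say so.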
wes109 2002-04-14
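I'm working with PDF too, but on the writing side. What I'm actually studying is FOP, which comes with Java source code. I don't know whether the source is complete, though. If anyone has looked into it, please tell me whether the code in the src folder is the full source. Anyone interested is welcome to discuss it with me: QQ 79389412, e-mail wes109cn@yahoo.com.cn.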
leonzhao 2002-04-13
I can't believe nobody knows. Bumping until this is solved!
javaservlet 2002-04-12
This is really hard...
leonzhao 2002-04-12
Does nobody know??!!
jimjxr 2002-04-10
This is code that uses PJ to print a PDF's document information:

import com.etymon.pj.*;
import com.etymon.pj.object.*;

public class GetPDFInfo {
    public static void main(String[] args) {
        try {
            Pdf pdf = new Pdf(args[0]);
            System.out.println("# of pages is " + pdf.getPageCount());
            int y = pdf.getMaxObjectNumber();
            // scan every object and print the fields of any info dictionary found
            for (int x = 1; x <= y; x++) {
                PjObject obj = pdf.getObject(x);
                if (obj instanceof PjInfo) {
                    PjInfo info = (PjInfo) obj;
                    System.out.println("Author: " + info.getAuthor());
                    System.out.println("Creator: " + info.getCreator());
                    System.out.println("Subject: " + info.getSubject());
                    System.out.println("Keywords: " + info.getKeywords());
                }
            }
        } catch (java.io.IOException ex) {
            System.out.println(ex);
        } catch (com.etymon.pj.exception.PjException ex) {
            System.out.println(ex);
        }
    }
}
leonzhao 2002-04-10
What I'm hoping for is a code snippet, whichever library it uses.

Does anyone have one?
leonzhao 2002-04-10
I mean the text inside the PDF file, the body content.
change 2002-04-08
Even for reading, you still need to know the PDF format.
jimjxr 2002-04-08
Adobe has a Java Acrobat Viewer that you could take a look at:
http://www.adobe.com/products/acrviewer/main.html
There is also this one, which reads PDFs as well:
http://www.etymon.com/pj/
pengji 2002-04-08
In theory, once you know the PDF file format you can read it! That generally requires Adobe's technical documentation, though:
http://partners.adobe.com/asn/developer/acrosdk/docs.html
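To give a feel for what knowing the format means at the outermost level: a PDF file starts with a header line such as %PDF-1.3, and near the end it has the keyword startxref followed by the byte offset of the cross-reference table, then %%EOF. A reader normally starts from that offset to locate the objects. Here is a minimal sketch in plain Java that picks out just those two items; PdfSkeleton, version and startXref are names made up for this illustration, and the 1024-byte tail window is an assumption rather than a rule from the specification.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of any PDF library.
final class PdfSkeleton {

    /** Returns the version from the "%PDF-x.y" header, or null if there is no header. */
    static String version(byte[] file) {
        String head = new String(file, 0, Math.min(file.length, 16), StandardCharsets.ISO_8859_1);
        if (!head.startsWith("%PDF-")) {
            return null;
        }
        int end = 5;
        while (end < head.length()
                && (Character.isDigit(head.charAt(end)) || head.charAt(end) == '.')) {
            end++;
        }
        return head.substring(5, end);
    }

    /** Returns the number written after the last "startxref" keyword, or -1 if absent. */
    static long startXref(byte[] file) {
        // Assumption: the keyword sits within the last 1024 bytes of the file.
        int tailStart = Math.max(0, file.length - 1024);
        String tail = new String(file, tailStart, file.length - tailStart, StandardCharsets.ISO_8859_1);
        int k = tail.lastIndexOf("startxref");
        if (k < 0) {
            return -1;
        }
        int i = k + "startxref".length();
        while (i < tail.length() && Character.isWhitespace(tail.charAt(i))) {
            i++;
        }
        int j = i;
        while (j < tail.length() && Character.isDigit(tail.charAt(j))) {
            j++;
        }
        return j > i ? Long.parseLong(tail.substring(i, j)) : -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java PdfSkeleton <file.pdf>");
            return;
        }
        byte[] file = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("version:   " + version(file));
        System.out.println("startxref: " + startXref(file));
    }
}
```

Everything after this step (parsing the cross-reference table, resolving object references, walking the page tree, decoding content streams) is where the specification becomes necessary.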
leonzhao 2002-04-07
Bumping again. I don't want to write PDFs, I want to read them!!