Asking for help with text filtering in Java

qiuchen 2013-11-28 04:13:11
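I'm new to text processing.

The file I need to process is a log of products sold on Amazon.
A single product record has the format shown below. The whole log contains several hundred thousand such records (the file is 1 GB in total).
I want to read this file in Java and keep only each record's Id (e.g. 15) and its corresponding group (e.g. Book),
then write the Id (e.g. 15) and its group (Book) to a new file.

I don't know how to approach this. Any guidance would be much appreciated.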

Id: 15
ASIN: 1559362022
title: Wake Up and Smell the Coffee
group: Book
salesrank: 518927
similar: 5 1559360968 1559361247 1559360828 1559361018 0743214552
categories: 3
|Books[283155]|Subjects[1000]|Literature & Fiction[17]|Drama[2159]|United States[2160]
|Books[283155]|Subjects[1000]|Arts & Photography[1]|Performing Arts[521000]|Theater[2154]|General[2218]
|Books[283155]|Subjects[1000]|Literature & Fiction[17]|Authors, A-Z[70021]|( B )[70023]|Bogosian, Eric[70116]
reviews: total: 8 downloaded: 8 avg rating: 4
2002-5-13 cutomer: A2IGOA66Y6O8TQ rating: 5 votes: 3 helpful: 2
2002-6-17 cutomer: A2OIN4AUH84KNE rating: 5 votes: 2 helpful: 1
2003-1-2 cutomer: A2HN382JNT1CIU rating: 1 votes: 6 helpful: 1
2003-6-7 cutomer: A2FDJ79LDU4O18 rating: 4 votes: 1 helpful: 1
2003-6-27 cutomer: A39QMV9ZKRJXO5 rating: 4 votes: 1 helpful: 1
2004-2-17 cutomer: AUUVMSTQ1TXDI rating: 1 votes: 2 helpful: 0
2004-2-24 cutomer: A2C5K0QTLL9UAT rating: 5 votes: 2 helpful: 2
2004-10-13 cutomer: A5XYF0Z3UH4HB rating: 5 votes: 1 helpful: 1
11 replies
qiuchen 2013-11-30
Thanks to reply #10 and to everyone. I finally got it working.
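acefr 2013-11-30
Quoting reply #9 by qiuchen:
Thank you. My final output still has no Id numbers. The format is: Id: group: Book Id: group: Music Id: group: Book. There is no number after Id, and I don't know why.
Is your text really in the standard format? How many spaces follow Id: and group:? There is one space after group:, right? In your original post there is also one space after Id:. If the number of spaces is different, the value certainly won't be picked up, and the regular expression has to be changed to this: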
String re="(Id|group): [\\s\\d\\w]*";
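A minimal sketch of why the extra `\\s` matters, assuming a hypothetical input line `Id:   15` with three spaces after the colon (the class and method names are made up for illustration). With the stricter pattern the match stops right after the first space, which would produce exactly the "Id: " with no number described in the quoted reply.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSpacingDemo {
    // Returns the first match of the regex in the line, or null if there is none.
    static String firstMatch(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String strict = "(Id|group): [\\d\\w]*";     // value must start right after one space
        String lenient = "(Id|group): [\\s\\d\\w]*"; // extra whitespace before the value is allowed

        String oneSpace = "Id: 15";
        String threeSpaces = "Id:   15"; // hypothetical line, not taken from the thread

        System.out.println("[" + firstMatch(strict, oneSpace) + "]");     // [Id: 15]
        System.out.println("[" + firstMatch(strict, threeSpaces) + "]");  // [Id: ]
        System.out.println("[" + firstMatch(lenient, threeSpaces) + "]"); // [Id:   15]
    }
}
```

The brackets in the printed output only make the trailing space visible.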
浪漫小和 2013-11-29
For the matching, you should use a regular expression.
acefr 2013-11-29
Complete test code, for reference:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    public static void main(String[] args) {
        File file = new File("c:\\Test.txt");
        File file2 = new File("c:\\demo.txt");
        if (!file.isFile()) {
            System.out.println("Cannot find the specified file!");
            return;
        }
        // Compile the pattern once instead of once per line.
        Pattern p = Pattern.compile("(Id|group): [\\d\\w]*");
        // try-with-resources closes the reader and the writer; closing also flushes the writer.
        try (BufferedReader reader = new BufferedReader(new FileReader(file));
             BufferedWriter writer = new BufferedWriter(new FileWriter(file2))) {
            String lineTXT;
            while ((lineTXT = reader.readLine()) != null) {
                Matcher m = p.matcher(lineTXT);
                // Write every "Id: ..." or "group: ..." match on its own line.
                while (m.find()) {
                    writer.write(m.group() + "\r\n");
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
acefr 2013-11-29
Quoting reply #3 by qiuchen (the program, its output, and the sample data, posted in full in that reply):
Use this regular expression: String re="(Id|group): [\\d\\w]*"; The test code is below. Test.txt contains the text from your original post. Writing the result to a file is something you can work out yourself. Keep at it, you can do it.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    public static void main(String[] args) {
        File file = new File("c:\\Test.txt");
        if (!file.isFile()) {
            System.out.println("Cannot find the specified file!");
            return;
        }
        // Compile the pattern once instead of once per line.
        Pattern p = Pattern.compile("(Id|group): [\\d\\w]*");
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String lineTXT;
            while ((lineTXT = reader.readLine()) != null) {
                Matcher m = p.matcher(lineTXT);
                // Print every "Id: ..." or "group: ..." match found in the line.
                while (m.find()) {
                    System.out.println(m.group());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
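qiuchen 2013-11-29
Thank you. My final output still has no Id numbers. The format is: Id: group: Book Id: group: Music Id: group: Book. There is no number after Id, and I don't know why.
Quoting reply #7 by acefr (the complete test code, posted in full in that reply):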
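qiuchen 2013-11-29
Thank you, but after running this program the output file data2.txt is still empty. The regular expression still doesn't seem to match anything.
Quoting reply #4 by woshilianglin (the sample data and program, posted in full in that reply):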
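woshilianglin 2013-11-28
I'm not sure whether every Id in your log file has a corresponding group. If it does, assume the source data file contains the following:
Id: 1
ASIN: 0827229534
title: Patterns of Preaching: A Sermon Sampler
group: Book
salesrank: 396585
Id: 2
ASIN: 0738700797
title: Candlemas: Feast of Flames
group: Book
salesrank: 168596
similar: 5 0738700827 1567184960 1567182836 0738700525 0738700940
Id: 3
ASIN: 0486287785
title: World War II Allied Fighter Planes Trading Cards
group: Book
salesrank: 1270652
similar: 0
The rest is omitted for brevity. The data is stored in data.txt on drive D. The program is then:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The original post showed only main(); the class name here is arbitrary.
public class TextFilter {
    public static void main(String[] args) throws IOException {
        File inFile = new File("D:" + File.separator + "data.txt");
        File outFile = new File("D:" + File.separator + "data2.txt");
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outFile)));
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(inFile)));
        // The whole trimmed line must be "Id: <word>" or "group: <word>".
        Pattern pattern = Pattern.compile("(Id:){1}\\s*\\w+|(group:)\\s*\\w+");
        String str = "";
        Matcher matcher;
        while ((str = reader.readLine()) != null) {
            matcher = pattern.matcher(str.trim());
            if (matcher.matches()) {
                if (str.contains("Id")) {
                    // Write the Id value followed by a tab.
                    String[] idStrings = str.trim().split(":\\s*");
                    writer.write(idStrings[idStrings.length - 1] + "\t");
                } else if (str.contains("group")) {
                    // Write the group value and end the output line.
                    String[] groupStrings = str.split(":\\s*");
                    writer.write(groupStrings[groupStrings.length - 1] + "\n");
                }
            }
        }
        reader.close();
        writer.flush();
        writer.close();
        System.out.println("Text filtering finished");
    }
}
The result you want will be written to data2.txt.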
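If some records turn out to have an Id but no group (the "discontinued product" record with Id 0 in the data posted in reply #3 looks like one), the tab/newline pairing above would drift. A rough sketch of one way to guard against that is below. It is not the code from this reply: the class name, method name, and pattern are assumptions made for illustration, and it uses find() rather than matches(), because matches() succeeds only when the entire line fits the pattern.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdGroupPairs {
    // Leading spaces and any amount of whitespace after the colon are tolerated.
    private static final Pattern FIELD = Pattern.compile("^\\s*(Id|group):\\s*(\\S+)");

    // Returns one "id<TAB>group" string per record that has both fields.
    static List<String> pairs(BufferedReader in) throws IOException {
        List<String> result = new ArrayList<>();
        String pendingId = null; // an Id whose group has not been seen yet
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = FIELD.matcher(line);
            if (!m.find()) {
                continue; // not an Id or group line
            }
            if ("Id".equals(m.group(1))) {
                pendingId = m.group(2); // new record; an earlier Id without a group is dropped
            } else if (pendingId != null) {
                result.add(pendingId + "\t" + m.group(2));
                pendingId = null;
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String sample = String.join("\n",
                "Id: 0", "ASIN: 0771044445", "discontinued product",
                "Id: 1", "ASIN: 0827229534", "group: Book");
        for (String pair : pairs(new BufferedReader(new StringReader(sample)))) {
            System.out.println(pair); // prints "1<TAB>Book"; Id 0 has no group and is skipped
        }
    }
}
```

Because only one pending Id is kept in memory, the same loop should work on the 1 GB file when the StringReader is swapped for a file reader.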
qiuchen 2013-11-28
This is the program I wrote. I have never written a regular expression before, and what I wrote seems to be completely wrong. Could someone take a look and help me fix it? Many thanks.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Main {
    public static void main(String[] args) throws IOException {
        String file="/Users/csdn/Desktop/test.rtf";
        BufferedReader br;
        try {
            br = new BufferedReader(new FileReader(file));
            String line;
            String re1=".*?"; // Non-greedy match on filler
            String re2="((?:[I-z][d-z]+))"; // ID
            String re3="((?:[c-z][a-z]+))"; // Category
            Pattern p = Pattern.compile(re1+re2+re3,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
            Matcher m = p.matcher(file);
            while((line=br.readLine())!=null){
                m=p.matcher(line);
                if (m.find()) {
                    String day1=m.group(1);
                    String word1=m.group(2);
                    System.out.print(" "+day1.toString()+" "+" "+word1.toString()+" "+"\n");
                }
            }
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
            System.out.println("fail");
        }
    }
}
The output is:
\font tbl color tbl ar gl ardir natural ardir natural AS IN dis continued AS IN tit le gro up ales rank simi lar ategori es Boo ks Boo ks revie ws cutom er cutom er AS IN tit le gro up ales rank simi lar ategori es Boo ks Boo ks revie ws cutom er cutom er ...... cutom er cutom er AS IN tit le ......
instead of what I hoped to get:
1 Book
2 Book
3 Book
Content of the txt file:
Id: 0
ASIN: 0771044445
discontinued product
Id: 1
ASIN: 0827229534
title: Patterns of Preaching: A Sermon Sampler
group: Book
salesrank: 396585
similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
reviews: total: 2 downloaded: 2 avg rating: 5
2000-7-28 cutomer: A2JW67OY8U6HHK rating: 5 votes: 10 helpful: 9
2003-12-14 cutomer: A2VE83MZF98ITY rating: 5 votes: 6 helpful: 5
Id: 2
ASIN: 0738700797
title: Candlemas: Feast of Flames
group: Book
salesrank: 168596
similar: 5 0738700827 1567184960 1567182836 0738700525 0738700940
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Witchcraft[12486]
reviews: total: 12 downloaded: 12 avg rating: 4.5
2001-12-16 cutomer: A11NCO6YTE4BTJ rating: 5 votes: 5 helpful: 4
2002-1-7 cutomer: A9CQ3PLRNIR83 rating: 4 votes: 5 helpful: 5
2002-1-24 cutomer: A13SG9ACZ9O5IM rating: 5 votes: 8 helpful: 8
2002-1-28 cutomer: A1BDAI6VEYMAZA rating: 5 votes: 4 helpful: 4
2002-2-6 cutomer: A2P6KAWXJ16234 rating: 4 votes: 16 helpful: 16
2002-2-14 cutomer: AMACWC3M7PQFR rating: 4 votes: 5 helpful: 5
2002-3-23 cutomer: A3GO7UV9XX14D8 rating: 4 votes: 6 helpful: 6
2002-5-23 cutomer: A1GIL64QK68WKL rating: 5 votes: 8 helpful: 8
2003-2-25 cutomer: AEOBOF2ONQJWV rating: 5 votes: 8 helpful: 5
2003-11-25 cutomer: A3IGHTES8ME05L rating: 5 votes: 5 helpful: 5
2004-2-11 cutomer: A1CP26N8RHYVVO rating: 1 votes: 13 helpful: 9
2005-2-7 cutomer: ANEIANH0WAT9D rating: 5 votes: 1 helpful: 1
Id: 3
ASIN: 0486287785
title: World War II Allied Fighter Planes Trading Cards
group: Book
salesrank: 1270652
similar: 0
categories: 1
|Books[283155]|Subjects[1000]|Home & Garden[48]|Crafts & Hobbies[5126]|General[5144]
reviews: total: 1 downloaded: 1 avg rating: 5
2003-7-10 cutomer: A3IDGASRQAW8B2 rating: 5 votes: 2 helpful: 2
小天天1234 2013-11-28
Combine regular expressions with some of the String class methods.
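One possible reading of that suggestion, sketched under assumptions: a regular expression decides whether a line is an Id or group line, and plain String methods (indexOf, substring, trim) pull out the value. The class and method names are hypothetical.

```java
class FieldParser {
    // Returns the value of an "Id:" or "group:" line, or null for any other line.
    static String fieldValue(String line) {
        // String.matches must match the whole line, hence the trailing ".*".
        if (!line.matches("\\s*(Id|group):.*")) {
            return null;
        }
        // Everything after the first colon, with surrounding whitespace removed.
        return line.substring(line.indexOf(':') + 1).trim();
    }

    public static void main(String[] args) {
        System.out.println(fieldValue("Id:   15"));      // 15
        System.out.println(fieldValue("  group: Book")); // Book
        System.out.println(fieldValue("title: Wake Up and Smell the Coffee")); // null
    }
}
```

String.matches recompiles the pattern on every call, so for a file with hundreds of thousands of records a precompiled Pattern would likely be the faster choice.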
zk3389 2013-11-28
Use Pattern and Matcher to find what you want, then write it to a new file. Wouldn't that do it?
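A minimal sketch of that idea, assuming the log is UTF-8 text and the input and output paths are passed on the command line (the class name, method name, and pattern are made up for illustration). It reads one line at a time, so the 1 GB file never has to fit in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternFileFilter {
    // Compiled once; group 1 is the field name, group 2 is its value.
    private static final Pattern FIELD = Pattern.compile("^\\s*(Id|group):\\s*(\\S+)");

    // Writes "name value" for every Id/group line of in to out; returns the number of lines written.
    static int filterFile(Path in, Path out) throws IOException {
        int written = 0;
        // UTF-8 is an assumption; a file in another encoding needs a different charset here.
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = FIELD.matcher(line);
                if (m.find()) {
                    writer.write(m.group(1) + " " + m.group(2));
                    writer.newLine();
                    written++;
                }
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the input log, args[1] is the file to create.
        int n = filterFile(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(n + " lines written");
    }
}
```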