How should a heterogeneous Hadoop cluster be optimized and configured?

小小小超人 2014-01-20 03:07:18
Type A machines: 7.5 GB RAM, 2 CPUs
Type B machines: 1.7 GB RAM, 1 CPU

The query being executed:
insert overwrite table archive_seller_by_geo_per_day partition(partDate, shard)
select sellerTokenID, partDate archiveDate, countryCode, country, countryState, countryCity,
       count(*) pv, count(distinct ip) uv, partDate, shard
from page_visit
where partDate >= ${THREE_DAY[1]} and partDate <= ${THREE_DAY[3]}
group by sellerTokenID, countryCode, country, countryState, countryCity, partDate, shard
having sellerTokenID > 0;

The page_visit table holds close to 40 million rows, and the query takes nearly 800 seconds.

Running with all six slaves (3 A + 3 B) is about as fast as running with the 3 A machines alone.
What optimizations and configuration changes can be made for a heterogeneous cluster like this?
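One common approach here (a sketch, assuming Hadoop 1.x, where mapred.tasktracker.* and mapred.child.java.opts are read from each TaskTracker's own local mapred-site.xml) is to stop using one cluster-wide slot/heap setting and give each machine type its own. On the B machines (1.7 GB RAM, 1 CPU), 2 map + 2 reduce slots likely oversubscribe the box, and a slow B task at the end of a wave stalls the whole job, which can help explain why 3A+3B is no faster than 3A alone. The property names below are standard Hadoop 1.x ones; the concrete numbers are illustrative assumptions, not measured values:

<!-- mapred-site.xml on each B machine (1.7 GB RAM, 1 CPU): fewer, smaller tasks -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>   <!-- one map slot: a single core with 1.7 GB cannot feed two -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx300m</value>   <!-- leave headroom for the DataNode and TaskTracker daemons -->
</property>

Because these are TaskTracker-local settings in Hadoop 1.x, the A and B machines can legitimately run different values in the same cluster; only job-level settings such as mapred.reduce.tasks stay shared.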

撸大湿 2014-01-22
The config made my head spin; I'll go through it properly when I have time. First, a quick question: is your Hadoop cluster on a 100 Mbps or a 1000 Mbps (gigabit) network?
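If the answer turns out to be 100 Mbps, the shuffle between six slaves is a plausible bottleneck for a group-by over ~40 million rows. Map output compression is already enabled in the posted mapred-site.xml with the default codec; one small thing worth trying (a sketch, assuming the Hadoop build actually ships the Snappy native library, which not every 1.x installation does) is to name a fast codec for the intermediate data explicitly:

<!-- mapred-site.xml: use a fast codec for map output to cut shuffle traffic -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <!-- assumption: native Snappy is installed; otherwise keep the default codec -->
</property>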
小小小超人 2014-01-20
Here is the current configuration:

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/var/lib/hadoop/dfs/name</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
  </property>
  <property>
    <name>dfs.name.edits.dir</name>
    <value>/var/lib/hadoop/dfs/edits</value>
    <description>Determines where on the local filesystem the DFS name node should store the transaction (edits) file. If this is a comma-delimited list of directories then the transaction file is replicated in all of the directories, for redundancy. Default value is same as dfs.name.dir.</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/lib/hadoop/dfs/data</value>
    <description>Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value> <!-- 134217728 = 128M, 268435456 = 256M -->
    <description>The default block size for new files.</description>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>dc-notification:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>100</value>
    <description>The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.</description>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
    <description>Should the outputs of the maps be compressed before being sent across the network. Uses SequenceFile compression.</description>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m</value>
    <description>Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc. The configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child processes.</description>
  </property>
  <property>
    <name>mapred.job.shuffle.merge.percent</name>
    <value>0.8</value>
    <description>The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapred.job.shuffle.input.buffer.percent.</description>
  </property>
  <property>
    <name>mapred.job.shuffle.input.buffer.percent</name>
    <value>0.9</value>
    <description>The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.</description>
  </property>
  <property>
    <name>mapred.job.reduce.input.buffer.percent</name>
    <value>0.5</value>
    <description>The percentage of memory - relative to the maximum heap size - to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.</description>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
    <description>The maximum number of map tasks that will be run simultaneously by a task tracker.</description>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
    <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>18</value>
    <description>The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when mapred.job.tracker is "local".</description>
  </property>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
    <description>How many tasks to run per jvm. If set to -1, there is no limit.</description>
  </property>
</configuration>
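For comparison, the A machines (7.5 GB RAM, 2 CPUs) are held back by the same conservative cluster-wide values: 2 map + 2 reduce slots with a 200 MB child heap leave most of the 7.5 GB idle, and io.sort.mb = 100 is already close to the 200 MB heap ceiling, so avoiding spills during the map-side sort requires raising the heap first. A sketch of what the A machines' local mapred-site.xml could look like instead (the numbers are assumptions to be checked against real daemon memory use, not measured values):

<!-- mapred-site.xml on each A machine (7.5 GB RAM, 2 CPUs) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>3</value>   <!-- slight oversubscription of the 2 cores, since map tasks are partly I/O bound -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>   <!-- 5 slots x 1 GB still leaves ~2.5 GB for DataNode, TaskTracker and the OS -->
</property>
<property>
  <name>io.sort.mb</name>
  <value>256</value>   <!-- larger sort buffer so a 128 MB input split spills less; must fit inside the task heap -->
</property>

Note also that mapred.reduce.tasks = 18 exceeds the current reduce capacity of 6 nodes x 2 slots = 12, so reduces run in two waves; after any change to the per-node slot counts, that number is worth recomputing from the new total capacity.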
