How should a heterogeneous Hadoop cluster be optimized and configured?

小小小超人 2014-01-20 03:07:18
Type A machines: 7.5 GB RAM, 2 CPUs
Type B machines: 1.7 GB RAM, 1 CPU

The query being executed:
insert overwrite table archive_seller_by_geo_per_day partition(partDate, shard)
select sellerTokenID, partDate archiveDate, countryCode, country, countryState, countryCity,
       count(*) pv, count(distinct ip) uv, partDate, shard
from page_visit
where partDate >= ${THREE_DAY[1]} and partDate <= ${THREE_DAY[3]}
group by sellerTokenID, countryCode, country, countryState, countryCity, partDate, shard
having sellerTokenID > 0;

The page_visit table holds close to 40 million rows, and the query takes nearly 800 seconds.

Running with all six slaves (3 A + 3 B) is about as fast as running with the 3 A machines alone.
What optimizations and configuration changes can be made for a heterogeneous cluster like this?
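One common approach here (a sketch, assuming Hadoop 1.x, where mapred.tasktracker.* and mapred.child.java.opts are read from each TaskTracker's own local mapred-site.xml) is to stop using one cluster-wide slot/heap setting and give each machine type its own. On the B machines (1.7 GB RAM, 1 CPU), 2 map + 2 reduce slots likely oversubscribe the box, and a slow B task at the end of a wave stalls the whole job, which can help explain why 3A+3B is no faster than 3A alone. The property names below are standard Hadoop 1.x ones; the concrete numbers are illustrative assumptions, not measured values:

<!-- mapred-site.xml on each B machine (1.7 GB RAM, 1 CPU): fewer, smaller tasks -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>   <!-- one map slot: a single core with 1.7 GB cannot feed two -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx300m</value>   <!-- leave headroom for the DataNode and TaskTracker daemons -->
</property>

Because these are TaskTracker-local settings in Hadoop 1.x, the A and B machines can legitimately run different values in the same cluster; only job-level settings such as mapred.reduce.tasks stay shared.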

撸大湿 2014-01-22
The config made my head spin; I'll go through it properly when I have time. First, a quick question: is your Hadoop cluster on a 100 Mbps or a 1000 Mbps (gigabit) network?
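If the answer turns out to be 100 Mbps, the shuffle between six slaves is a plausible bottleneck for a group-by over ~40 million rows. Map output compression is already enabled in the posted mapred-site.xml with the default codec; one small thing worth trying (a sketch, assuming the Hadoop build actually ships the Snappy native library, which not every 1.x installation does) is to name a fast codec for the intermediate data explicitly:

<!-- mapred-site.xml: use a fast codec for map output to cut shuffle traffic -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <!-- assumption: native Snappy is installed; otherwise keep the default codec -->
</property>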
小小小超人 2014-01-20
Here is the current configuration:

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/var/lib/hadoop/dfs/name</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
  </property>
  <property>
    <name>dfs.name.edits.dir</name>
    <value>/var/lib/hadoop/dfs/edits</value>
    <description>Determines where on the local filesystem the DFS name node should store the transaction (edits) file. If this is a comma-delimited list of directories then the transaction file is replicated in all of the directories, for redundancy. Default value is same as dfs.name.dir.</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/lib/hadoop/dfs/data</value>
    <description>Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value> <!-- 134217728 = 128M, 268435456 = 256M -->
    <description>The default block size for new files.</description>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>dc-notification:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>100</value>
    <description>The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.</description>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
    <description>Should the outputs of the maps be compressed before being sent across the network. Uses SequenceFile compression.</description>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m</value>
    <description>Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc. The configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child processes.</description>
  </property>
  <property>
    <name>mapred.job.shuffle.merge.percent</name>
    <value>0.8</value>
    <description>The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapred.job.shuffle.input.buffer.percent.</description>
  </property>
  <property>
    <name>mapred.job.shuffle.input.buffer.percent</name>
    <value>0.9</value>
    <description>The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.</description>
  </property>
  <property>
    <name>mapred.job.reduce.input.buffer.percent</name>
    <value>0.5</value>
    <description>The percentage of memory - relative to the maximum heap size - to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.</description>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
    <description>The maximum number of map tasks that will be run simultaneously by a task tracker.</description>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
    <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>18</value>
    <description>The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when mapred.job.tracker is "local".</description>
  </property>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
    <description>How many tasks to run per jvm. If set to -1, there is no limit.</description>
  </property>
</configuration>
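For comparison, the A machines (7.5 GB RAM, 2 CPUs) are held back by the same conservative cluster-wide values: 2 map + 2 reduce slots with a 200 MB child heap leave most of the 7.5 GB idle, and io.sort.mb = 100 is already close to the 200 MB heap ceiling, so avoiding spills during the map-side sort requires raising the heap first. A sketch of what the A machines' local mapred-site.xml could look like instead (the numbers are assumptions to be checked against real daemon memory use, not measured values):

<!-- mapred-site.xml on each A machine (7.5 GB RAM, 2 CPUs) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>3</value>   <!-- slight oversubscription of the 2 cores, since map tasks are partly I/O bound -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>   <!-- 5 slots x 1 GB still leaves ~2.5 GB for DataNode, TaskTracker and the OS -->
</property>
<property>
  <name>io.sort.mb</name>
  <value>256</value>   <!-- larger sort buffer so a 128 MB input split spills less; must fit inside the task heap -->
</property>

Note also that mapred.reduce.tasks = 18 exceeds the current reduce capacity of 6 nodes x 2 slots = 12, so reduces run in two waves; after any change to the per-node slot counts, that number is worth recomputing from the new total capacity.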
