Guide questions
1. What statement does Hive use to compute the traffic totals?
2. Why are external tables recommended in production?
3. What role does the DataWritable class play in the Hadoop MapReduce version?
4. Why do we need to create DataWritable at all?
5. How is the per-phone traffic statistic actually computed?
6. How do the Hive and MapReduce approaches to this problem compare?
1. Counting mobile phone traffic with Hive
Many companies use Hive to process their data.
Hive is a member of the Hadoop family: a framework that parses SQL-like statements. It wraps common MapReduce jobs so that you can operate on tables stored in HDFS as if you were running SQL.
Hive tables come in two kinds: internal (managed) tables and external tables.
When Hive creates an internal table, it moves the data into the path its warehouse points to; when it creates an external table, it only records where the data lives and leaves the data in place.
When a table is dropped, an internal table's metadata and data are deleted together, while dropping an external table removes only the metadata and leaves the data untouched. This makes external tables comparatively safer, more flexible to organize, and convenient for sharing source data.
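For example, here is a minimal sketch of an external table; the table name flow_ext and the path /data/flow are hypothetical. Only the location is recorded, so dropping the table leaves the files on HDFS:
- create external table flow_ext(reportTime string, msisdn string, upPayLoad bigint, downPayLoad bigint)
- row format delimited fields terminated by '\t'
- location '/data/flow';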
Beyond internal and external tables, Hive also offers partitions, which avoid full-table scans and speed up retrieval; a later article will cover them in depth.
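As a quick preview, a hedged sketch of a partitioned table (the dt partition column is hypothetical); each partition lives in its own directory, so a query that filters on dt scans only the matching directories:
- create table flow_part(msisdn string, upPayLoad bigint, downPayLoad bigint)
- partitioned by (dt string)
- row format delimited fields terminated by '\t';
- -- a select ... where dt = '20130313' then reads only that day's directory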
With that background, let's get started on counting phone traffic with Hive.
Raw data:
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200
1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 2 4 132 1512 200
1363154400022 13926251106 5C-0E-8B-8B-B1-50:CMCC 120.197.40.4 4 0 240 0 200
1363157993044 18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99 iface.qiyi.com 视频网站 15 2 1527 2106 200
1363157995074 84138413 5C-0E-8B-8C-E8-20:7DaysInn 120.197.40.4 122.72.52.12 20 16 4116 1432 200
1363157993055 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
1363157995033 15920133257 5C-0E-8B-C7-BA-20:CMCC 120.197.40.4 sug.so.360.cn 信息安全 20 20 156 2936 200
Steps:
- # Once Hive is configured, start the Hive shell with the hive command. Hive lazy-loads on startup, so this takes a while
- hive
- # List the available databases with show databases
- hive> show databases;
- OK
- default
- hive
- Time taken: 3.389 seconds
- # Switch to a database with use; the hive database here was created earlier
- use hive;
- # Create the table. This is an internal (managed) table: when it loads a file from HDFS, the file is moved (cut) into the warehouse path.
- # External tables do not move your data, which is why external tables are recommended in production.
- create table ll(reportTime string,msisdn string,apmac string,acmac string,host string,siteType string,upPackNum bigint,downPackNum bigint,upPayLoad bigint,downPayLoad bigint,httpStatus string) row format delimited fields terminated by '\t';
- # Load the data. Here we load from HDFS; to load from the local Linux filesystem instead, add the local keyword (a sketch follows the results below)
- load data inpath '/HTTP_20130313143750.dat' into table ll;
- # Once the load finishes, the source file has been moved out of its original HDFS location into the table's warehouse directory
- # Run Hive's SQL-like statement to compute the per-phone totals
- select msisdn,sum(uppacknum),sum(downpacknum),sum(uppayload),sum(downpayload) from ll group by msisdn;
The output looks like this:
- hive> select msisdn,sum(uppacknum),sum(downpacknum),sum(uppayload),sum(downpayload) from ll group by msisdn;
- Total MapReduce jobs = 1
- Launching Job 1 out of 1
- Number of reduce tasks not specified. Estimated from input data size: 1
- In order to change the average load for a reducer (in bytes):
- set hive.exec.reducers.bytes.per.reducer=<number>
- In order to limit the maximum number of reducers:
- set hive.exec.reducers.max=<number>
- In order to set a constant number of reducers:
- set mapred.reduce.tasks=<number>
- Starting Job = job_201307160252_0006, Tracking URL = http://hadoop0:50030/jobdetails.jsp?jobid=job_201307160252_0006
- Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=hadoop0:9001 -kill job_201307160252_0006
- Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
- 2013-07-17 19:51:42,599 Stage-1 map = 0%, reduce = 0%
- 2013-07-17 19:52:40,474 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:41,690 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:42,693 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:43,698 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:44,702 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:45,707 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:46,712 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:47,715 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:48,721 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:49,758 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:50,763 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:51,772 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.0 sec
- 2013-07-17 19:52:52,775 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.0 sec
- 2013-07-17 19:52:53,779 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.0 sec
- MapReduce Total cumulative CPU time: 50 seconds 0 msec
- Ended Job = job_201307160252_0006
- MapReduce Jobs Launched:
- Job 0: Map: 1 Reduce: 1 Cumulative CPU: 50.0 sec HDFS Read: 2787075 HDFS Write: 16518 SUCCESS
- Total MapReduce CPU Time Spent: 50 seconds 0 msec
- OK
- 13402169727 171 108 11286 130230
- 13415807477 2067 1683 169668 1994181
- 13416127574 1501 1094 161963 802756
- 13416171820 113 99 10630 32120
- 13417106524 160 128 18688 13088
- 13418002498 240 256 22136 86896
- 13418090588 456 351 98934 67470
- 13418117364 264 152 29436 49966
- 13418173218 37680 48348 2261286 73159722
- 13418666750 22432 26482 1395648 39735552
- 13420637670 20 20 1480 1480
- ......
- Time taken: 75.24 seconds
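As mentioned in the load step above, here is a sketch of the local-filesystem variant (the local path is hypothetical); with the local keyword Hive copies the file instead of moving it:
- load data local inpath '/home/hadoop/HTTP_20130313143750.dat' into table ll;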
2. Counting mobile phone traffic with Hadoop MapReduce
First, define a custom Writable:
- package cn.maoxiangyi.hadoop.wordcount;
- import java.io.DataInput;
- import java.io.DataOutput;
- import java.io.IOException;
- import org.apache.hadoop.io.Writable;
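- // A custom Writable that bundles the four traffic counters, so each phone number maps to a single value in the shuffle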
- public class DataWritable implements Writable {
- private int upPackNum;
- private int downPackNum;
- private int upPayLoad;
- private int downPayLoad;
- public DataWritable() {
- super();
- }
- public DataWritable(int upPackNum, int downPackNum, int upPayLoad,
- int downPayLoad) {
- super();
- this.upPackNum = upPackNum;
- this.downPackNum = downPackNum;
- this.upPayLoad = upPayLoad;
- this.downPayLoad = downPayLoad;
- }
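- // Serialization: write the four counters in a fixed order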
- @Override
- public void write(DataOutput out) throws IOException {
- out.writeInt(upPackNum);
- out.writeInt(downPackNum);
- out.writeInt(upPayLoad);
- out.writeInt(downPayLoad);
- }
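- // Deserialization: read the counters back in exactly the order write() emitted them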
- @Override
- public void readFields(DataInput in) throws IOException {
- upPackNum = in.readInt();
- downPackNum = in.readInt();
- upPayLoad = in.readInt();
- downPayLoad =in.readInt();
- }
- public int getUpPackNum() {
- return upPackNum;
- }
- public void setUpPackNum(int upPackNum) {
- this.upPackNum = upPackNum;
- }
- public int getDownPackNum() {
- return downPackNum;
- }
- public void setDownPackNum(int downPackNum) {
- this.downPackNum = downPackNum;
- }
- public int getUpPayLoad() {
- return upPayLoad;
- }
- public void setUpPayLoad(int upPayLoad) {
- this.upPayLoad = upPayLoad;
- }
- public int getDownPayLoad() {
- return downPayLoad;
- }
- public void setDownPayLoad(int downPayLoad) {
- this.downPayLoad = downPayLoad;
- }
- @Override
- public String toString() {
- return " " + upPackNum + " "
- + downPackNum + " " + upPayLoad + " "
- + downPayLoad;
- }
- }
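This is why DataWritable exists (guide questions 3 and 4): Hadoop can only move map outputs across the shuffle if the value type implements Writable's write/readFields serialization, and bundling the four counters into one object lets each phone number carry a single composite value instead of four separate ones.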
The MapReduce job:
- package cn.maoxiangyi.hadoop.wordcount;
- import java.io.IOException;
- import org.apache.hadoop.conf.Configuration;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.io.IntWritable;
- import org.apache.hadoop.io.LongWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapreduce.Job;
- import org.apache.hadoop.mapreduce.Mapper;
- import org.apache.hadoop.mapreduce.Reducer;
- import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
- import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
- public class DataTotalMapReduce {
- public static void main(String[] args) throws Exception {
- Configuration configuration = new Configuration();
- Job job = new Job(configuration);
- job.setJarByClass(DataTotalMapReduce.class);
- job.setMapperClass(DataTotalMapper.class);
- job.setReducerClass(DataTotalReducer.class);
- job.setOutputKeyClass(Text.class);
- job.setOutputValueClass(DataWritable.class);
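- // Summing is associative, so the reducer can safely double as a combiner to cut shuffle traffic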
- job.setCombinerClass(DataTotalReducer.class);
- Path inputDir = new Path("hdfs://hadoop0:9000/HTTP_20130313143750.dat");
- FileInputFormat.addInputPath(job, inputDir);
- Path outputDir = new Path("hdfs://hadoop0:9000/dataTotal");
- FileOutputFormat.setOutputPath(job, outputDir);
- job.waitForCompletion(true);
- }
- }
- /**
-  * Sample input records (tab-separated):
-  * 1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
-  * 1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200
-  * 1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 2 4 132 1512 200
-  */
- class DataTotalMapper extends Mapper<LongWritable, Text, Text, DataWritable> {
- @Override
- protected void map(LongWritable key, Text value, Context context)
- throws IOException, InterruptedException {
- String lineStr = value.toString();
- String[] strArr = lineStr.split("\t");
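- // Assumes every record carries the full set of tab-separated fields shown in the sample above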
- String phone = strArr[1];
- String upPackNum = strArr[6];
- String downPackNum = strArr[7];
- String upPayLoad = strArr[8];
- String downPayLoad = strArr[9];
- context.write(
- new Text(phone),
- new DataWritable(Integer.parseInt(upPackNum), Integer
- .parseInt(downPackNum), Integer.parseInt(upPayLoad),
- Integer.parseInt(downPayLoad)));
- }
- }
- class DataTotalReducer extends Reducer<Text, DataWritable, Text, DataWritable> {
- @Override
- protected void reduce(Text k2, Iterable<DataWritable> v2, Context context)
- throws IOException, InterruptedException {
- int upPackNumSum = 0;
- int downPackNumSum = 0;
- int upPayLoadSum = 0;
- int downPayLoadSum = 0;
- for (DataWritable dataWritable : v2) {
- upPackNumSum += dataWritable.getUpPackNum();
- downPackNumSum += dataWritable.getDownPackNum();
- upPayLoadSum += dataWritable.getUpPayLoad();
- downPayLoadSum += dataWritable.getDownPayLoad();
- }
- context.write(k2, new DataWritable(upPackNumSum, downPackNumSum, upPayLoadSum, downPayLoadSum));
- }
- }
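Packaged into a jar and submitted with the usual hadoop jar command, the job writes the per-phone totals to the /dataTotal output directory configured in the driver; the excerpt below matches the Hive results above.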
Excerpt of the results:
- 13402169727 171 108 11286 130230
- 13415807477 2067 1683 169668 1994181
- 13416127574 1501 1094 161963 802756
- 13416171820 113 99 10630 32120
- 13417106524 160 128 18688 13088
- 13418002498 240 256 22136 86896
- 13418090588 456 351 98934 67470
- 13418117364 264 152 29436 49966
- 13418173218 37680 48348 2261286 73159722
- 13418666750 22432 26482 1395648 39735552
- 13420637670 20 20 1480 1480
- 13422149173 40 32 4000 3704
- 13422311151 465 535 33050 661790
- 13424077835 84 72 15612 9948
- 13424084200 765 690 60930 765675
- 13428887537 43892 44830 2925330 65047620
- 13430219372 454 352 33792 192876
- 13430234524 27852 39056 1767220 52076614
- 13430237899 1293 1165 166346 808613
- 13430258776 4681 4783 350511 6609423
- 13430266620 10544 9377 11600817 5728002
- 13432023893 40 0 2400 0
Source: http://www.aboutyun.com/forum.php?highlight=hive&mod=viewthread&tid=7455