
Essay Word Count Reducer Strips

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.

In short, you can run a Hadoop MapReduce using SQL-like statements with Hive.
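For instance, with a hypothetical table named docs that has a string column word (made-up names, just for illustration), a single shell command like the following is compiled into a MapReduce job behind the scenes:

hive -e "select word, count(*) from docs group by word;"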

Here is a WordCount example I did using Hive. The example first shows how to do it on your local machine, then how to do it using Amazon EMR.

Local

1. Install Hive.

First you need to install Hadoop on your local machine; here is a post on how to do it. After you have installed Hadoop, you can follow the official Hive tutorial.
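As a rough sketch of what the setup looks like once both are installed (the paths below are placeholders for wherever you unpacked Hadoop and Hive):

export HADOOP_HOME=/path/to/hadoop
export HIVE_HOME=/path/to/hive
export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin
hive   # opens the Hive CLI; try 'show tables;' to confirm it can talk to Hadoop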

*2. This step may not be needed. If you hit an error saying the IP address cannot be accessed, go to your Hadoop folder, edit the configuration file where fs.default.name is set (core-site.xml), and change fs.default.name from the IP address to your hostname.

3. Write the mapper & reducer for our WordCount example. Here I use Python; you can use any scripting language you like.

Mapper: (word_count_mapper.py)

#!/usr/bin/python
import sys

# read lines from stdin, split into words, and emit (word, 1) pairs
for line in sys.stdin:
    line = line.strip()
    words = line.split(" ")
    # write the tuples to stdout
    for word in words:
        print '%s\t%s' % (word, "1")

Reducer: (word_count_reducer.py)

#!/usr/bin/python
import sys

# maps words to their counts
word2count = {}

for line in sys.stdin:
    line = line.strip()
    # parse the input we got from the mapper
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        continue
    try:
        word2count[word] = word2count[word] + count
    except KeyError:
        word2count[word] = count

# write the tuples to stdout
# Note: they are unsorted
for word in word2count.keys():
    print '%s\t%s' % (word, word2count[word])
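Before plugging these into Hive, you can sanity-check them from the shell by piping some sample text through the pair (the sample sentence is made up, and sort stands in for Hive's shuffle phase):

echo "hello hive hello hadoop" | python word_count_mapper.py | sort | python word_count_reducer.py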

4. Write the Hive script word_count.hql. Note: you can also run the following statements line by line in the Hive console.

drop table if exists raw_lines;

-- create table raw_lines and read all the lines in '/user/inputs', a path on your local HDFS
create external table if not exists raw_lines(line string)
ROW FORMAT DELIMITED
stored as textfile location '/user/inputs';

drop table if exists word_count;

-- create table word_count, the output table, which will be written to '/user/outputs' on your local HDFS as a text file
create external table if not exists word_count(word string, count int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
lines terminated by '\n'
STORED AS TEXTFILE LOCATION '/user/outputs/';

-- add the mapper & reducer scripts as resources; please change your/local/path
add file your/local/path/word_count_mapper.py;
add file your/local/path/word_count_reducer.py;

from (
  from raw_lines
  map raw_lines.line
  -- call the mapper here
  using 'word_count_mapper.py'
  as word, count
  cluster by word) map_output
insert overwrite table word_count
reduce map_output.word, map_output.count
  -- call the reducer here
  using 'word_count_reducer.py'
  as word, count;

5. Put some text files on HDFS '/user/inputs/' using the Hadoop command line (see the example below).
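For example, assuming your input file is called sample.txt (a made-up name):

hadoop fs -mkdir /user/inputs
hadoop fs -put sample.txt /user/inputs/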

6. Run your script!

hive -f word_count.hql

The script creates two tables, reads the input data into the raw_lines table, adds the mapper & reducer scripts as resources, runs the MapReduce job, and stores the result in the word_count table; you can find the output text file in '/user/outputs'.
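Once the job finishes, you can list and read the output directly from HDFS:

hadoop fs -ls /user/outputs
hadoop fs -cat /user/outputs/*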

In case you meet the safe mode error, you can leave safe mode manually:

hadoop dfsadmin -safemode leave
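You can also check whether the NameNode is still in safe mode before (or after) turning it off:

hadoop dfsadmin -safemode get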

IMPORTANT:
In your script files, PLEASE do not forget to add "#!/usr/bin/python" as the first line. I forgot to add it and got the error below, which cost me half an hour to figure out…

Starting Job = job_201206131927_0006, Tracking URL = http://domU-12-31-39-03-BD-57.compute-1.internal:9100/jobdetails.jsp?jobid=job_201206131927_0006
Kill Command = /home/hadoop/bin/hadoop job -Dmapred.job.tracker=10.249.190.165:9001 -kill job_201206131927_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2012-06-13 20:56:15,119 Stage-1 map = 0%,  reduce = 0%
2012-06-13 20:57:10,489 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201206131927_0006 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201206131927_0006_m_000002 (and more) from job job_201206131927_0006
Exception in thread "Thread-120" java.lang.RuntimeException: Error while reading from task log url
    at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
    at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
    at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: http://10.248.42.34:9103/tasklog?taskid=attempt_201206131927_0006_m_000000_2&start=-8193
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
    at java.net.URL.openStream(URL.java:1010)
    at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
    ... 3 more
Counters:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

Amazon EMR

Running a Hive script on EMR is actually very simple. The steps below show how I did it.

Here is the code I modified for EMR:

create external table if not exists raw_lines(line string)
ROW FORMAT DELIMITED
stored as TEXTFILE LOCATION '${INPUT}';

create external table if not exists word_count(word string, count int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
lines terminated by '\n'
STORED AS TEXTFILE LOCATION '${OUTPUT}';

from (
  from raw_lines
  map raw_lines.line
  using '${SCRIPT}/word_count_mapper.py'
  as word, count
  cluster by word) map_output
insert overwrite table word_count
reduce map_output.word, map_output.count
  using '${SCRIPT}/word_count_reducer.py'
  as word, count;

Note that in the script I use the INPUT, OUTPUT, and SCRIPT variables. INPUT and OUTPUT are set by EMR automatically in step (2) below; SCRIPT is set by me in the Extra args.

All files are stored in S3.
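For example (a made-up bucket layout; your S3 paths will differ), the Extra args field in step (2) might pass the scripts folder in the same -d key=value form that EMR uses for INPUT and OUTPUT:

-d SCRIPT=s3://your-bucket/wordcount/scripts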

1. Create EMR job:

2. Set the Hive script path and arguments

3. Set instances

4. Set Log path

5. Run the Job!


This entry was posted in MapReduce and tagged EMR, Hive, MapReduce, Wordcount by purplechun.

As the writer of an essay, imagine yourself crossing a river, guiding a troop of avid readers. You bring an armful of stones to lay down and step on as you go; each stone is a sentence or paragraph that speaks to and develops the essay's thesis, or central question. If you find yourself in the middle of the river with another mile to shore but only a few more stones, you can't finesse such a situation. You can't ask your readers to follow you and jump too broad a span.

In such a case, stop. Ask yourself if you need more stones—more sentences or paragraphs—or if perhaps you have already used ones that more properly belong ahead. On a second look, you may decide that the distance between stones is not that great, after all; perhaps your reader only needs a hand of assistance to get from one stone, or paragraph, to the next. In an essay, such assistance can be offered in the form of a "furthermore" or "in addition to" or "therefore." These are called transitional words and phrases.

Transitional words or phrases sometimes will be precisely what you need to underscore for your readers the intellectual relationship between sentences or paragraphs—to help them navigate your essay. Very often, such transitions

  • address an essential similarity or dissimilarity (likewise, in the same way, on the other hand, despite, in contrast);
  • suggest a meaningful ordering, often temporal (first, second, at the same time, later, finally) or causal (thus, therefore, accordingly, because);
  • in a longer paper, remind the reader of what has earlier been argued (in short, as has been said, on the whole).

Keep in mind that although transitional words and phrases can be useful, even gracious, they never should be applied to force a vagrant paragraph into a place where it does not, structurally, belong. No reader will be fooled by such shoddy craft, which is designed to help the writer finesse the essay's flaws, rather than to illuminate for the reader the connections among the essay's ideas and textual evidence. A strip of Velcro on a cracked wall will not fool us into thinking we are standing somewhere safe; neither will a Velcro transition persuade an essay's readers that they are in the hands of a serious writer with something serious to say. In the absence of genuine intellectual connection, such efforts at transition all sound manufactured. The human voice has been drained off, and what's left is hollow language.

Velcro transitions insult and bore the reader by pointing out the obvious, generally in a canned and pompous way. Here are some examples:

  • It is also important to note that ...
  • Thus, it can be said that ...
  • Another important aspect to realize is that ...
  • Also, this shows that ...

This is not to say that such phrases never can be used in an essay. Of course they can, mostly for summary. Just don't use them indiscriminately. Be careful, and be honest. Don't talk down to the reader. If you tell a reader that something "is important to note," make sure there's a very good chance the reader would not have realized this if you hadn't pointed it out. And never overdo such phrases; after all*, everything in your essay ought to be important to note. In other words, be aware that, in a well-crafted essay, every sentence is a transitional sentence.

This shouldn't be as intimidating as it might at first sound. Rather, this is another way of saying that transitions are important not simply between paragraphs. Instead, the necessity to transition occurs among the sentences within a paragraph, and from paragraph to paragraph. A paragraph ought to follow logically from the one preceding, and move the argument towards the paragraph that follows. Again, this is no cause for alarm on the part of the writer. It's simply another way of saying that, just as the sentence itself has internal logic and coherence, so does the paragraph; and so does the essay as a whole.

Tips for Transitioning

Quite often, if you are having a terrible time figuring out how to get from one paragraph to the next, it may be because you shouldn't be getting from one paragraph to the next quite yet, or even ever; there may be something crucial missing between this paragraph and its neighbors—most likely an idea or a piece of evidence or both. Maybe the paragraph is misplaced, and logically belongs elsewhere. The reason you can't come up with a gracious connective sentence is that there's simply too large an intellectual span to cross, or that you've gone off in the wrong direction.

Before you can go on, some causality needs first to be explicated, some other piece of evidence offered. You have to guide the reader safely to the next idea by making certain that everything that should have been discussed by this point has in fact been thoroughly discussed. While it is true that an essay is a conversation between a writer and a reader, in which the reader's questions and concerns are internalized and addressed by the writer at the appropriate times, it is also true that even the most committed reader cannot read your mind. You have to guide your reader.

As has been discussed above, it is also useful to note that** transitions between paragraphs that really do belong where they are in the essay can be strengthened by the repetition or paraphrasing of one paragraph's key words into the next. Such repetition or paraphrasing of key words, however, can be little more than Velcro** if the writer really has nothing more to say, as is now the case.

* Underlined words and phrases function as transitions. Try reading without them; you'll see that the ideas remain in logical order. Such words and phrases, however, make life easier for the reader. They never substitute for intellectual coherence.

** Ick! Velcro—beware!

Copyright 1998, Maxine Rodburg, for the Writing Center at Harvard University
