WebMapReduce

CS 121 (CS1)

wmr

Additional problems (PDF)

  1. tallies -- [HC] Simple tallies. WebMapReduce (WMR), as described in the CSinParallel module http://selkie.macalester.edu/csinparallel/modules/IntroWMR/build/html/index.html, has great potential for "big data" computations, since it computes using the open-source software Hadoop, the most common choice for computation at truly enormous scale over data that may have no rigid, regular structure. Some examples of such data include: corpora (plural of "corpus"), which could be literary or could equally well represent conversations, research data, or web pages; graphical information such as maps and their annotations; and relationship data such as "friendship" or "likes" on social media.

    One of the first skills to develop in map-reduce problem solving is performing simple tallies that describe a dataset. The following questions develop these skills using WMR, starting from the "word count" example discussed in the module: the goal is to produce a list of words in a corpus together with each word's frequency in that corpus. The following solution is taken from the module, with documentation added.

    Mapper
    
    # IN keys and values:
    #   key: holds lines of text from the corpus   value: is empty
    # OUT keys and values:
    #   key: one word from that text  value: a string representing the number 1
    
    def mapper(key, value):
        words = key.split()
        for word in words:
            Wmr.emit(word, '1')
    
    Reducer
    
    # IN keys and values:
    #   key: one word from a corpus  value: a string representing the number 1
    #   NOTE:  the values are packed into an iterator  iter
    # OUT keys and values:  
    #   key: that one word from a corpus  value: the sum of the 1's from iter
    
    def reducer(key, iter):
        total = 0
        for s in iter:
            total = total + int(s)
        Wmr.emit(key, str(total))
    

    Comments

    • The basic structure of map-reduce computing is to

      1. break the given data up into labelled pieces, using a mapper() function, then to

      2. gather those pieces according to label and combine all the pieces that have the same label, using a reducer() function.

    • We have highlighted comments about IN and OUT keys and values. These comments describe how labelled pieces (technically called key-value pairs) are generated from the original data and how those pieces are combined for each label.

      Writing down IN and OUT not only describes the map-reduce computation, but also helps with problem solving, because it specifies the goals for the mapper and reducer functions.

    • All keys and values in map-reduce computation are actually strings.

      • The Wmr.emit() method will automatically convert numerical arguments to strings, but we have used string argument values '1' and str(total) directly in calls of Wmr.emit() to emphasize that those arguments must be strings.

      • The conversion int(s) in reducer() is necessary, since s is a string and we want to add its numerical value to the accumulator total.

    • The arguments for a mapper() are a key and a value. This supports using the result of one map-reduce computation as the input for another one, since map-reduce computations produce key-value pairs.

      This particular mapper assumes that the lines of the given input text are located entirely in the key argument of mapper(), with an empty string as value. This allows us to count words in ordinary text files.

      The mapper() is called once for each line of input in WMR (whether that line has key-value structure or only a key).

    • The reducer() operates on all values having the same key. Therefore the arguments of a reducer() are

      1. a single copy of the key string (i.e., the "label"), and

      2. an iterator that contains all the values for that key that were produced by all calls of the mapper().

      The iterator argument iter makes it convenient to examine all values using a for loop, as seen in this reducer(). An iterator is used instead of a list because WMR supports "big-data" computations, in which there could be billions or trillions of values for a given key -- too many values to be held in a list.
    DO THIS: Carry out the following exercises using WebMapReduce.


    1. --
    2. wmr -- [C] Getting started with WMR. Explore the features of WebMapReduce (WMR) with the mapper() and reducer() above, by performing the following steps.
      1. Access WMR, by visiting the site http://www.cs.stolaf.edu/wmr in a browser.

        Note: This site is only accessible on the St. Olaf campus network. If you are off-campus, you can connect your computer to the campus network using VPN.

      2. Use your St. Olaf username and password to log in to WMR. Your account will be recognized automatically -- there is no need to register for an account.

      3. This will present you with the WMR job submission form.

        • In the Job info section, choose a job name (e.g., word-count), and make sure the language is Python3. Leave the two task counts blank (Hadoop will choose these automatically) and leave the sorting on the default Alphabetic.

        • In the Input section, choose Direct Input, and enter two or three lines of text in the text box, such as

          The cat in the hat
          wore the hat
          to the cat hat party.
          

        • In the Mapper section, enter the mapper() code above. Include the comments (feel free to cut and paste) as well as the code (retype this short function to get a feel for WMR code input).

        • Likewise, in the Reducer section, enter the reducer() code above, including the IN and OUT comments.

        • Click on the Test button. This performs a test run of your job, useful for making sure your code works correctly with test input.

          • The data size is limited for Test executions, but it otherwise behaves similarly to actual Hadoop computations.

          • The Test option is faster for small data than an actual Hadoop job.

          • A successful Test computation shows the key-value pairs emitted from a mapper(), which is useful for verifying your code and for debugging if something is wrong. The actual Hadoop interface does not show these intermediate key-value pairs.

          Always use the Test option with suitable test data to check your mapper() and reducer() code, since the Hadoop Submit option takes much longer and delivers less debugging information if something goes wrong.

          Note: for the "cat-hat" test data above, the intermediate key-value pairs are

                  The	1
                  cat	1
                  in	1
                  the	1
                  hat	1
                  wore	1
                  the	1
                  hat	1
                  to	1
                  the	1
                  cat	1
                  hat	1
                  party.	1
          
          The final output from the cat-hat example input is
                  cat	2
                  hat	3
                  in	1
                  party.	1
                  the	3
                  The	1
                  to	1
                  wore	1
          

      4. If the Test job indicates any errors, you can use a Resubmit Job link or your browser's back button to return to the Job Submission page and correct the error. Then, Test your job again, repeating as necessary until your job works correctly with the Test interface.

      5. Next, click on a Resubmit Job link or use your browser's back button to return to the New Job page. (If you use Resubmit Job, note that your input has been stored as a data set in the Hadoop system.) This time, use the Submit button to run the job on Hadoop instead of the test system.

        • Even though your data is small, a Hadoop run takes a long time. This is because Hadoop takes time to gear up for very large data sets. These lengthy preparations pay off when there are billions or trillions of bytes of data, but they extend the time required for a job when there are only a few dozen bytes of data.

        • Your job should run correctly without errors, since you tested it using the Test option.

        • You may observe that the sort order is slightly different with Hadoop Submit than with the Test option. Hadoop produces

                  The	1
                  cat	2
                  hat	3
                  in	1
                  party.	1
                  the	3
                  to	1
                  wore	1
          
          The difference is with the word The, since Hadoop sorting places capitalized words before words that begin with a lower-case letter.

          The Test option allows you to check the correctness of your code, but this sorting difference shows that the Test system doesn't necessarily produce results that are identical with Hadoop's results.

      6. After your successful Hadoop (Submit) run, return to the New Job page and select the Save link at the bottom of the page. This saves the data set and code for your job under the job name you chose above (e.g., word-count). In the future, you can load this job again by choosing Saved Configurations in the left menu bar.

        In practice, you probably don't want to save all of your successful runs, since the Saved Configuration menu might become overly large. However, some configurations such as this word-count example can be good starting points for other WMR jobs.

      7. Finally, run your job with larger input, e.g., a large book. For example:

        • Enter the Cluster Path /shared/gutenberg/CompleteShakespeare.txt to use a copy of Project Gutenberg's digitization of the Complete Works of William Shakespeare.

        • Likewise, the Cluster Path /shared/gutenberg/WarAndPeace.txt is a copy of a translation of Tolstoy's famous novel.

          Note: Each of these works may also be accessible using the drop-down menu labelled Cluster Dataset. Use one of the choices without line numbers for this mapper() and reducer().

        To process these large files, use the Submit button, because the data is too big for the Test option. (When data is too big for the Test option, only a small portion of that data is used.)

        Here are the first few lines of output for CompleteShakespeare.txt:

                "	241
                "'Tis	1
                "A	4
                "AS-IS".	1
                "Air,"	1
                "Alas,	1
                "Amen"	2
                "Amen"?	1
                "Amen,"	1
                "And	1
                "Aroint	1
                "B	1
                "Black	1
                "Break	1
                "Brutus"	1
                "Brutus,	2
                "C	1
                "Caesar"?	1
        	...
        
      To submit your work on this question, copy your mapper(), your reducer(), and the first few lines of your output for your final run. Use copy and paste for all code, not a screenshot.


    3. wc_value -- [C] Word count with line numbers. Modify the code in the previous problem to work when each text line has a line number. This means changing the IN/OUT documentation for the mapper() (only), to
          # IN keys and values:
          #   key: holds a line number   value: holds a line of text from the corpus
          # OUT keys and values:
          #   key: one word from that text  value: a string representing the number 1
      
      The descriptions of the mapper's IN key and value are what need to change. A sketch of the modified mapper() appears after the hints below.

      Hints:

      • You can start with your saved configuration from the exercise above, and modify the mapper(). Only the mapper() needs to change, since the IN/OUT specifications for the reducer() are unchanged.

      • Here is a test data set:

        1	The cat in the hat
        2	wore the hat
        3	to the cat hat party.
        
        You can enter this data interactively in the Direct Input box in the Input portion of the New Job page. Separate the line number from the text by a TAB character in that form, because Hadoop uses the first TAB character in a line to separate the key from the value for that line.

      • The Test computation should show the same intermediate key-value pairs and result key-value pairs as the original cat-hat example data produced.

      • At the end, test your modified code with a large dataset that has line numbers, such as /shared/gutenberg/CompleteShakespeare.txt_n or /shared/gutenberg/WarAndPeace.txt_n (these may be available as options with line numbers in Cluster Datasets).
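
      Here is one possible modified mapper() (a minimal sketch; the reducer() is unchanged from the word-count example):

          # IN keys and values:
          #   key: holds a line number   value: holds a line of text from the corpus
          # OUT keys and values:
          #   key: one word from that text  value: a string representing the number 1

          def mapper(key, value):
              words = value.split()    # the text now arrives in value, not key
              for word in words:
                  Wmr.emit(word, '1')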



    4. sort -- [C] Sorting with the shuffle stage. The previous two problems probably produce results that are sorted by word in dictionary order (lexicographic order). This is because Hadoop sorts all key-value pairs according to key in the "shuffle" phase, which occurs between the calls of mapper() and the calls of reducer().
             mapper() calls  --->   shuffle  --->   reducer() calls
      
      The primary purpose of the shuffle phase is to move and organize the key-value pairs produced by mapper()s in order to prepare for calls of reducer(), each of which must receive all values for a particular key. Hadoop performs this organization by sorting according to key, and because reducer() is called in that key-sorted order, the results of map-reduce can take advantage of that sorted order.

      If your results do not appear in sorted order, then Hadoop is using more than one reducer task. A reducer task is a program that calls reducer() for keys. If Hadoop uses more than one reducer task, the keys will be dealt out among the reducer tasks equally; each reducer task will produce results from the sorted list of keys that it receives, but the combined results from all reducer tasks will be out of order.

      DO THIS:





    5. 2cycle -- [HC] Ordering by frequency. The previous problems produce word-frequency results that are sorted according to word, but a linguist or humanist might want to know about the most frequently occurring words instead of the frequency of every word. This requires sorting by frequency instead of sorting by word.

      Since the word frequency values cannot be known until after the reducer() calls have computed those frequencies, we will use a second map-reduce cycle to sort the word-frequency results according to frequency, using the steps below. This follow-on WMR job may produce the following results for the cat-hat data.

              1	The
              1	in
              1	party.
              1	to
              1	wore
              2	cat
              3	hat
              3	the
      
      Note: words with the same frequency (e.g., hat and the) may appear in different orders than above, but the frequencies (e.g., 3) will appear in sorted order.


        --
      1. -- [C] Start with the "Job Succeeded" results page from a map-reduce job computed using Hadoop (the "Submit" button), not the "Test" option.

        At the bottom of the "Job Succeeded" page, click on the "Use Output" link. This will proceed to the "New Job" page, but with a predefined data set that will consist of the key-value pairs (sorted by word) produced by the original word-count reducer().



      2. -- [HC] Now, fill in the rest of the form for this second map-reduce cycle.
        • In the "Job Info" section, choose a new job name, such as word-count2

          Also, choose one reducer, to ensure that a single sorted data set will be produced at the end of this second map-reduce cycle.

          Finally, for "Sort", choose "Numeric", which will sort keys according to numerical value instead of sorting lexicographically as strings. For example, in "Numeric" order the key "2" should come before the key "10", but in lexicographic (dictionary) order "10" would come before "2".

        • For mapper():

              # IN key-value pairs
              #   key: holds a word   value: holds an integer, frequency of that word
              # OUT key-value pairs
              #   key: holds an integer, frequency of a word   value: holds that word
          

        • For reducer():

              # IN key-value pairs
              #   key: holds an integer, frequency of a word   value: holds that word
              #   NOTE:  the values are packed into an iterator  iter
              # OUT key-value pairs
              #   key: holds an integer, frequency of a word   value: holds that word
          

        Hints:
        • This mapper() doesn't change the key or the value. It only needs to emit those strings in the opposite roles (i.e., emit the original value as a new key, and the original key as a new value).

        • This reducer() also doesn't change the key or the value. It only emits the key and each value in the same roles, unchanged. This is called the identity reducer.

        • The results will be sorted by frequency because of the shuffle phase.

        • Test and debug your mapper() and reducer() using the "Test" option. Then, use the "Submit" option for your final run.

        • This might be a good job configuration to "Save", since it includes an identity reducer. You might add a comment with "Identity reducer" to the code for the reducer().
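
        Putting these hints together, here is a minimal sketch of the two functions (the mapper() swaps key and value; the reducer() is an identity reducer):

              def mapper(key, value):
                  # swap roles: the frequency (old value) becomes the key,
                  # and the word (old key) becomes the value
                  Wmr.emit(value, key)

              def reducer(key, iter):
                  # identity reducer: emit each (frequency, word) pair unchanged
                  for word in iter:
                      Wmr.emit(key, word)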





    6. strip -- [HC] Removing capitals and punctuation. The original mapper() and reducer() for counting word frequencies count capitalized or punctuated forms of a word separately from that word; for example, the frequencies of The and "the and the are all tallied separately. These distinctions are undesirable in many cases for corpus analysis. For example, in the complete Shakespeare example above, we would most likely want to combine the frequencies of "Amen", "Amen"?, "Amen,", amen, etc., into a single frequency for amen.

      Modify those original mapper() and reducer() functions to ignore capitalization and punctuation before or after a word. The resulting functions will satisfy these IN/OUT specs (the mapper's OUT spec is what changes):

      Mapper

      # IN keys and values:
      #   key: holds lines of text from the corpus   value: is empty
      # OUT keys and values:
      #   key: one word from that text, after lowering letter cases
      #      and removing punctuation before and after each word  
      #   value: a string representing the number 1
      

      Reducer

      # IN keys and values:
      #   key: one word from a corpus  value: a string representing the number 1
      #   NOTE:  the values are packed into an iterator  iter
      # OUT keys and values:  
      #   key: that one word from a corpus  value: the sum of the 1's from iter
      

      Hints

      • As the changes suggest, the same reducer() can be used; only the mapper() needs to be modified.

      • Apply lower() and strip() to each word before emitting it.

        Note: The method strip() is better than removing punctuation characters throughout each word (e.g., with replace()), because strip() only removes characters from the beginning or end of each word, and we don't want to remove internal apostrophes from contractions (like don't).

      • Use the "Test" interface first with a small data set, such as the cat-hat data (including a final period after the word party.), as always.

      • Start with your best guess about punctuation marks to strip away, and check your guess using an actual corpus, such as the complete works of Shakespeare. Scroll through the results of a Hadoop ("Submit") run to see if additional punctuation marks should be removed.

        Also, remember to escape a quote character within quote characters. E.g., the strip() argument ".,?\"'" removes periods, commas, question marks, double-quote characters ", and single-quote/apostrophe characters '.
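
      For example, here is one possible mapper() (a sketch; the set of punctuation characters passed to strip() is a first guess that you should refine against a real corpus, as the last hint suggests):

          def mapper(key, value):
              words = key.split()
              for word in words:
                  cleaned = word.lower().strip(".,?!;:\"'")
                  if cleaned != '':        # skip tokens that were all punctuation
                      Wmr.emit(cleaned, '1')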




  2. oview -- Overview of map-reduce techniques; context forwarding. The exercises above have presented several basic techniques of map-reduce computing, including the following:
    • the essential elements of problem-solving with the map-reduce pattern, summarized as

      Basic idea of map-reduce computing:

      1. Break data up into labelled pieces (mapper())

      2. Gather and combine pieces that have the same label (reducer())

      where the "labelled pieces" are called key-value pairs;
    • using features of WMR including

      • basic job configuration features provided by WMR, such as selecting a job name (used when saving a configuration), choice of programming language for mappers and reducers, requesting a certain number of reducers (use a single reducer in order to produce results that preserve the shuffler's sorting order), and choice of sorting (alphabetic vs. numeric),

      • specifying input data for a map-reduce job, which may originate as a preset cluster dataset, be specified using a cluster path, be uploaded from your own computer, or be entered directly into a text box,

      • entering definitions of the mapper() and reducer() functions that determine the map-reduce computation to perform on the input data,

      • performing a "Test" computation to try a mapper() and reducer() with small data, in order to verify and debut a map-reduce computation,

      • performing a "Submit" computation to run a map-reduce computation using Hadoop, which takes more time than "Test" but could be performed with billions, trillions, or even (given a large enough cluster) quadrillions of bytes of data, and

      • saving a configuration including input data, mapper(), and reducer() for later retrieval;

    • using map-reduce computation to perform operations involving simple tallies, such as word frequency counting, finding mean ratings, or constructing an index;

    • using IN/OUT specs to describe the mapper and reducer functions needed for a map-reduce computation;

    • taking advantage of sorting during the shuffle phase of Hadoop computation to order the key-value pairs of a result;

    • using multiple Hadoop cycles in order to perform multi-stage map-reduce computations, as required when sorting word frequencies by frequency rather than by word;

    • using an identity mapper() or an identity reducer() when needed to solve a problem;

    • grouping data values together under a single data value, for example, treating "Amen", "Amen"?, "Amen,", etc., all as occurrences of the word amen (we implemented this grouping by lowering letter case and stripping punctuation); and

    • pre-tallying with available values within a call to a mapper().

    We will now proceed to explore additional techniques of map-reduce computing, namely:
    • context forwarding, in which information about a key-value pair (labelled piece of data) is included with that pair;

    • structured values and structured keys, for packing multiple components of information into a key-value pair;

    • multi-case mappers and multi-case reducers, which enable multiple kinds of input data or key-value pairs to be processed within a single map-reduce cycle; and

    • broadcasting data values, which enables known elements of information to be distributed to many reduce() operations.

    We will start with context forwarding.


    1. --
    2. concordance -- [HC] Constructing a concordance. One of the most useful tools for humanists and linguists studying a corpus of text is a concordance or keyword-in-context (KWIC) index, which lists words in a corpus together with the contexts in which that word appears in that corpus. For example, here is a concordance for the cat-hat test data set.
      The	The cat in the hat
      cat	The cat in the hat
      cat	to the cat hat party.
      hat	The cat in the hat
      hat	to the cat hat party.
      hat	wore the hat
      in	The cat in the hat
      party.	to the cat hat party.
      the	The cat in the hat
      the	to the cat hat party.
      the	wore the hat
      to	to the cat hat party.
      wore	wore the hat
      
      Write mapper() and reducer() functions to create a concordance, sorted by word, for an input corpus, where the context presented for each word consists of the entire line containing that word.

      Note: this is an example of the context-forwarding technique for map-reduce computing identified in the overview of map-reduce techniques.

      Hints:

      • Here are IN/OUT specs:

        Mapper

        # IN keys and values:
        #   key: holds lines of text from the corpus   value: is empty
        # OUT keys and values:
        #   key: one word from that corpus   value: the entire line containing that word
        

        Reducer

        # IN keys and values:
        #   key: one word from a corpus   value: an entire line containing that word
        # OUT keys and values:  
        #   key: that same word from that corpus  value: that same line of context
        

      • As the IN/OUT spec indicates, the reducer() should be an identity reducer.
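
      A minimal sketch satisfying these specs (assuming, as in the word-count example, that each line of text arrives in the key with an empty value):

          def mapper(key, value):
              line = key                 # the entire line serves as the context
              for word in line.split():
                  Wmr.emit(word, line)   # forward the context with each word

          def reducer(key, iter):
              for context in iter:       # identity reducer
                  Wmr.emit(key, context)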



    3. multibook -- [HC] Word frequencies per book. Whether comparing multiple works by a single author or analyzing books by different authors, being able to analyze the content of a multi-book corpus book by book is an essential skill for many research projects in the humanities and linguistics. We will start considering this kind of analysis by counting word frequencies per book for a corpus consisting of multiple books.

      Implement these IN/OUT specs for a map-reduce job:

      Mapper
      # IN keys and values:
      #   key: holds a one-word code for a book title   value: one line of that book
      # OUT keys and values:
      #   key: one word from a book, followed by a space and that book's one-word code
      #   value: a string '1'
      
      Reducer
      # IN keys and values:
      #   key: one word from a book, followed by a space and that book's one-word code
      #   value: a string '1'
      # OUT keys and values:  
      #   key: one word  value: a one-word code, followed by a space and the
      #			  frequency of that word in that book
      
      Example data:
              cathat	The cat in the hat
              cathat	wore the hat
              cathat	to the cat hat party.
              OwlCat	The owl and the pussy cat went to sea
      
      Intermediate key-value pairs
              The cathat	1
              cat cathat	1
              in cathat	1
              the cathat	1
              hat cathat	1
              ...
              The OwlCat	1
              owl OwlCat	1
              and OwlCat	1
              the OwlCat	1
              ...
      
      Final key-value pairs emitted from reducer:
              The	OwlCat 1
              The	cathat 1
              and	OwlCat 1
              cat	cathat 2
              cat	OwlCat 1
              hat	cathat 3
              in	cathat 1
              owl	OwlCat 1
              sea	OwlCat 1
              the	OwlCat 1
              the	cathat 3
              ...
      
      Hints:
      • Only one map-reduce cycle is needed.

      • A reducer() call will have two arguments: a key containing a space, such as 'The cathat', and an iterator of values '1'. Add up the '1's just as you would for the word-count reducer, but emit something different from what the word-count reducer emits:

        • use split() to break the two-word key into two pieces;

        • the key for your emit() call should only be the first piece from splitting;

        • the value argument for your emit() call should be the second part from splitting (book code) followed by a space, then the sum of the '1's.
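
      Combining these hints, one possible sketch:

          def mapper(key, value):
              book = key                            # one-word book code
              for word in value.split():
                  Wmr.emit(word + ' ' + book, '1')

          def reducer(key, iter):
              total = 0
              for s in iter:
                  total = total + int(s)
              word, book = key.split()              # two-word key: word, book code
              Wmr.emit(word, book + ' ' + str(total))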




  3. struct -- Structured values and structured keys. In map-reduce computing, we solve problems using key-value pairs to hold data. If our data only involves one component, we can use an empty key or an empty value to represent the data. For example, for a corpus without line numbers or other labels, the text may exist entirely within the key argument of a mapper() call, and the value argument of that mapper() may always be empty.

    What if the data has three or more components? In that case, we can pack multiple fields of data into a key and/or a value by using separator characters or other strategies within that key or value. The Netflix dataset provides an example of a multi-field key: commas separate the four fields (movie id, reviewer id, rating, and date of review) of the data. Using multiple fields is an example of structured keys and values. Other structuring strategies could be used besides fields.

    The following exercises use structured keys and/or values.



    1. --
    2. 2values -- [HC] Two fields within a value. For Netflix data, compute both the average rating per movie and the latest date of rating for each movie, using a single map-reduce cycle.

      Mapper

      # IN keys and values:
      #   key: holds one line of Netflix data   value: empty
      
      Reducer
      # OUT keys and values:  
      #   key: a movie id  
      #   value: mean of that movie's ratings to 2 dec. places, followed by a space, 
      #	   followed by the latest date when that movie was rated
      
      For example, if the Netflix data lines are
              1,1596531,5,2004-01-23
              3,2318004,4,2004-02-05
              6,110641,5,2003-12-15
              8,1447639,4,2005-03-27
              8,2557899,5,2005-04-16
              6,52076,4,2004-10-05
              1,13651,3,2004-06-16
              1,1374216,2,2005-09-23
      
      then the reducer should produce the following key-value pairs:
              1	3.33 2005-09-23
              3	4.00 2004-02-05
              6	4.50 2004-10-05
              8	4.50 2005-04-16
      
      Hints
      • The IN/OUT spec above only indicates the input for the mapper and the output from the reducer. What should the intermediate key-value pairs be?

        From each line of Netflix input, we need to extract the movie id, the movie rating, and the date of that rating. Since we want to organize the computation by movie id, the key should be the movie id for intermediate key-value pairs. The other two data elements can be packed into the value, separated by a space.

        Using this strategy for the example lines of data, the intermediate key-value pairs would be

                1	5 2004-01-23
                3	4 2004-02-05
                6	5 2003-12-15
                8	4 2005-03-27
                8	5 2005-04-16
                6	4 2004-10-05
                1	3 2004-06-16
                1	2 2005-09-23
        
        We can use this strategy to complete the key-value specs for the mapper and reducer:

        Mapper

        # IN keys and values:
        #   key: holds one line of Netflix data   value: empty
        # OUT keys and values:
        #   key: a movie id  value: a rating, then a space, then a date for that movie
        
        
        Reducer
        
        # IN keys and values:
        #   key: a movie id  value: a rating, then a space, then a date for that movie
        # OUT keys and values:  
        #   key: a movie id  
        #   value: mean of that movie's ratings to 2 dec. places, followed by a space, 
        #	   followed by the latest date when that movie was rated
        

      • Use split() in the mapper() (with comma as a separator) to obtain the four fields from the input in Netflix format.

        Also use split() in the reducer() to separate each value's rating from its date.

      • Use three accumulators in the reducer(): two for computing the mean rating (count and sum) and one for computing the latest date. Note: Netflix dates are formatted so they can be compared using simple lexicographic order (< or > as str objects).
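
      Putting the hints together, a minimal sketch (the '{:.2f}' format string is one way to get two decimal places; any equivalent formatting works):

          def mapper(key, value):
              movie, reviewer, rating, date = key.split(',')
              Wmr.emit(movie, rating + ' ' + date)

          def reducer(key, iter):
              count = 0
              total = 0
              latest = ''                      # dates compare correctly as strings
              for val in iter:
                  rating, date = val.split()
                  count = count + 1
                  total = total + int(rating)
                  if date > latest:
                      latest = date
              Wmr.emit(key, '{:.2f} {}'.format(total/count, latest))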



    3. manyvalues -- [HC] Many fields within a value. For Netflix data, compute for each movie the average rating, the number of ratings, the highest and lowest ratings, and the earliest and latest dates of rating, using a single map-reduce cycle.

      Mapper

      # IN keys and values:
      #   key: holds one line of Netflix data   value: empty
      
      Reducer
      # OUT keys and values:  
      #   key: a movie id  
      #   value: the following fields for a movie, separated by spaces:  
      #            mean rating to 2 dec. places; number of ratings;  
      #            lowest rating; highest rating; 
      #            earliest date rated;  latest date rated
      
      For example, if the Netflix data lines are
              1,1596531,5,2004-01-23
              3,2318004,4,2004-02-05
              6,110641,5,2003-12-15
              8,1447639,4,2005-03-27
              8,2557899,5,2005-04-16
              6,52076,4,2004-10-05
              1,13651,3,2004-06-16
              1,1374216,2,2005-09-23
      
      then the reducer should produce the following key-value pairs:
              1	3.33 3 2 5 2004-01-23 2005-09-23
              3	4.00 1 4 4 2004-02-05 2004-02-05
              6	4.50 2 4 5 2003-12-15 2004-10-05
              8	4.50 2 4 5 2005-03-27 2005-04-16
      
      Hints:
      • Fill in the IN/OUT specs first, in order to determine the intermediate key-value pairs you will need.

      • Use numerous accumulators in the reducer() to compute all the desired values.
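
      As a starting point, here is one possible reducer() sketch, assuming the same intermediate key-value pairs as in the previous problem (key: a movie id; value: a rating, a space, and a date):

          def reducer(key, iter):
              count = 0
              total = 0
              lowest = 6             # ratings run from 1 to 5
              highest = 0
              earliest = '9999'      # dates compare correctly as strings
              latest = ''
              for val in iter:
                  ratingstr, date = val.split()
                  rating = int(ratingstr)
                  count = count + 1
                  total = total + rating
                  lowest = min(lowest, rating)
                  highest = max(highest, rating)
                  earliest = min(earliest, date)
                  latest = max(latest, date)
              Wmr.emit(key, '{:.2f} {} {} {} {} {}'.format(
                  total/count, count, lowest, highest, earliest, latest))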




  4. trace -- [H] Tracing map-reduce. A trace of a WebMapReduce computation describes the computational process that a Hadoop map-reduce job performs, for a given input, mapper(), and reducer(). (See class notes for an example.)

    Note: No specs are needed for a trace of a WMR computation, just as no specs are needed for tracing a Python computation such as a recursive function call.

    DO THIS: perform traces of the following WMR jobs, given input, mapper() and reducer().



    1. --
    2. index_trace -- [H] Trace the following WMR computation (see example above), which produces a list of all line numbers containing a particular word in a corpus.
      • Input:

                1	I am Sam.
                2	Sam I am, okay? 
        

      • Mapper:

        def mapper(key, value):
            lis = value.split()
            for word in lis:
                Wmr.emit(word.strip('.,!?;-'), key)
        

      • Reducer:

        def reducer(key, iter):
            lines = ''
            for linenum in iter:
                lines = lines + linenum + ' '
            Wmr.emit(key, lines)
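
      As a start (a partial sketch covering only the first input line; completing the trace is the exercise), the call mapper('1', 'I am Sam.') computes lis = ['I', 'am', 'Sam.'] and emits the intermediate pairs

                I	1
                am	1
                Sam	1

      (note that strip('.,!?;-') removes the period from Sam., and that the emitted value is the line-number key '1').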
        



    _____
  5. multicase -- Multi-case reducers and mappers. Exercises above with multi-field values show how to compute multiple values in a single map-reduce cycle by packing multiple quantities into the value of a key-value pair. Another strategy for computing multiple values during a single map-reduce cycle is to have multiple kinds of keys, and for a reducer to have multiple cases in order to perform different computations for different kinds of keys. The following exercises explore this useful technique.


    1. --
    2. localglobal -- Per-book and overall frequencies. The exercise about word frequencies in multi-book data sets above computes word frequencies per book, but not the total word frequencies among all books. Intermediate key-value pairs are chosen to have the form
              word bookcode	1
      
      in that problem, with a two-part key, in order to have a reducer call for each word appearing in a book.

      We could also compute the total frequency of a word over all books, by using intermediate key-value pairs

              word	1
      

      If we need to know both a word's frequency per book and that word's total frequency over all books, we can compute both of those frequencies per word by including both kinds of key-value pairs. Perform this computation by implementing the following IN/OUT spec.

      Mapper

      # IN keys and values:
      #   key: holds a one-word code for a book title   value: one line of that book
      # OUT keys and values (2 formats):
      # 1 key: one word from a book  value: string '1'
      # 2 key: one word from a book, a space, and a one-word book code  value: string '1'
      

      Reducer

      # IN keys and values (2 formats):
      #   key: one word from a book  value: string '1'
      #   key: one word from a book, a space, and a one-word book code  value: string '1'
      # OUT keys and values (2 formats):
      #   key: one word  value: the total frequency of that word for all books
      #   key: one word  value: a book code, a space, and the frequency of
      #			  that word in that book
      

      Suggested test data set:

              cathat	The cat in the hat
              cathat	wore the hat
              cathat	to the cat hat party.
              OwlCat	The owl and the pussy cat went to sea
      
      Expected output for that test data set:
              The	2
              The	OwlCat 1
              The	cathat 1
              and	1
              and	OwlCat 1
              cat	3
              cat	cathat 2
              cat	OwlCat 1
              hat	3
              hat	cathat 3
              in	1
              in	cathat 1
              owl	1
              owl	OwlCat 1
              sea	1
              sea	OwlCat 1
              the	4
              the	OwlCat 1
              the	cathat 3
              ...
      
      Here, the overall count is indicated by output pairs such as The   2 that do not mention a book code, and the other key-value pairs indicate per-book frequencies.

      Hints:

      • In your mapper, emit two pairs for each word, one that includes both that word and the book name in the key and one whose key is only that word.

      • In your reducer, apply split() to the key in order to produce a list of either one or two strings. When it comes time to emit a value from your reducer, use that list's length to determine whether to emit a two-field value (with book code and frequency) or a one-field value (frequency only).
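
      Combining these hints, a minimal sketch:

          def mapper(key, value):
              book = key
              for word in value.split():
                  Wmr.emit(word, '1')                  # for the overall count
                  Wmr.emit(word + ' ' + book, '1')     # for the per-book count

          def reducer(key, iter):
              total = 0
              for s in iter:
                  total = total + int(s)
              parts = key.split()
              if len(parts) == 1:                      # one-word key: overall case
                  Wmr.emit(parts[0], str(total))
              else:                                    # two-word key: per-book case
                  Wmr.emit(parts[0], parts[1] + ' ' + str(total))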



    3. accum -- [HC] Accumulator keys. The standard word-count map-reduce computation computes the frequency of each word in a corpus, but not the total count of all words appearing in that corpus. For example, in the cathat.txt test set, the word-count computation determines that the word cat occurs three times, but does not report that there are 13 words total in that data set. Of course, knowing the total word count would be important for corpus analysis: finding three occurrences of cat among 13 total words is quite different than finding three occurrences out of 10,000 words.

      We can compute the total word frequency on the same map-reduce cycle as the individual word frequencies by using an accumulator key, i.e., a special-purpose key for computing that total word frequency, and also using individual words for keys. By choosing a key that can't possibly be a word as the accumulator key, that special-purpose key can be handled specially in the reducer().

      DO THIS: Modify the standard word-count mapper() and reducer() to count the total number of words as well as the frequency per word, by satisfying the following specs.

      Mapper

          # IN keys and values:
          #   key: holds lines of text from the corpus   value: is empty
          # OUT keys and values (2 formats):
          #   1. key: one word from that text, after lowering case and stripping
          #           punctuation from each end of each word
          #      value: a string representing the number 1
          #   2. key: AAAA  value: a string representing the number 1
      

      Reducer

          # IN keys and values (2 formats):
          #   key: one word from a corpus  value: a string representing the number 1
          #   key: AAAA  value: a string representing the number 1
          # OUT keys and values (2 cases):  
          #   key: that one word from a corpus  value: the frequency of that word
          #   key: the string "TOTAL WORDS"  value: total number of words in the corpus
      

      Suggested test data set:

          The cat in the hat
          wore the hat
          to the cat hat party.
      
      Expected output for that test data set:
              TOTAL WORDS	13
              cat	2
              hat	3
              in	1
              party	1
              the	4
              to	1
              wore	1
      

      Hints

      • We chose 'AAAA' as the accumulator key because it is not a word, and because it will appear early in sorted order.

      • Give your WMR job a name such as density-1 in the "Name" field of the "Job info" section, since it represents the first phase of your solution to this problem.

      • In the mapper(), lower the case of each word and strip punctuation from it, then emit each (modified) word and '1' as usual, but also emit AAAA and '1' for each word.

      • In the reducer(), add up the '1' values in iter as for all keys, including the special accumulator key AAAA. (Of course, this will require an accumulator variable inside the reducer.) Then, when you're ready to emit the frequency, emit TOTAL WORDS and the sum if the reducer key was AAAA, and emit the reducer key unchanged and the sum for all other keys.
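
      Combining these hints, a minimal sketch (the punctuation characters passed to strip() are a first guess, as in the earlier stripping exercise):

          def mapper(key, value):
              for word in key.split():
                  cleaned = word.lower().strip(".,?!;:\"'")
                  if cleaned != '':
                      Wmr.emit(cleaned, '1')    # tally this individual word
                      Wmr.emit('AAAA', '1')     # tally the accumulator key, once per word

          def reducer(key, iter):
              total = 0
              for s in iter:
                  total = total + int(s)
              if key == 'AAAA':                 # accumulator case: total word count
                  Wmr.emit('TOTAL WORDS', str(total))
              else:                             # ordinary case: per-word frequency
                  Wmr.emit(key, str(total))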



    _____
  6. tmp -- tmp.


    input
    =mapper IN
    cat	The cat in the hat
    cat	wore the cat hat
    Owl	The owl and the cat
    
    mapper()
    def mapper(key, value):
        lis = value.split()
        for word in lis:
            Wmr.emit(word.lower(), key)
    
    (A) For first input line:
    key_____
    value_______________
    lis_______________
    word_____
    mapper OUT
    (B)
    
    
    
    
    
    
    
    Shuffle
    reducer IN
    (C)
    
    
    
    
    
    
    reducer()
    def reducer(key, iter):
        count = {'cat': 0, 'Owl': 0}
        for val in iter:
            count[val] = count[val] + 1
        Wmr.emit(key, 'cat ' + str(count['cat']) + 
                      ', Owl ' + str(count['Owl']))
    
    (D) For call reducer('cat', ...):
    key__________
    FILL IN FOR OTHER VARIABLES HERE
    reducer OUT
    and	cat 0, Owl 1
    cat	cat 2, Owl 1 
    (E) CONTINUE:
    
    
    
    
    
    


  7. index -- Making an index. Index...


  8. netflix -- Netflix computations. Netflix problems...