Project 5: Web Benchmark and Spark

Overview

An important part of Web application development is provisioning our Web site for the expected user load. To do this, we need to be able to measure how many concurrent user requests a single Web server can handle, so that we can estimate how many server instances are needed to handle the expected load. It is also important to be able to run complex data-processing tasks on a cluster of machines available to us, so that we can feed near-real-time insights from our user data back into the application.

In Part A, we focus on measuring the number of concurrent requests that a system can handle. More precisely, we will measure how many users receive a response within a reasonable time period.

In Part B, we learn how to use the popular Apache Spark engine to run complex data-processing tasks on a cluster of machines in parallel.

Development Environment

The main development for Project 5 will be done in a new Docker container named “spark”, created from the “junghoo/cs144-spark” image. Use the following command to create and start this new container:

$ docker run -it -p4040:4040 -p8089:8089 -v {host_shared_dir}:/home/cs144/shared --name spark junghoo/cs144-spark

Make sure to replace {host_shared_dir} with the name of the shared directory on your host. The above command creates a docker container named spark with appropriate port forwarding and directory sharing.

Our new container has Locust (v0.9.0), JDK (v1.8.0), Scala (v2.12.2), and Spark (v2.2.0) pre-installed. You can check their versions with the following commands:

$ locust --version
$ java -version
$ scala -version
$ spark-shell --version

As before, the default username inside the container is “cs144” with password “password”.

When you perform the tasks in Parts A and B of Project 5, you also need to run the servers you developed in Projects 2 and 3. Make sure that your servers in the “tomcat” and “mean” containers are still available and running well by executing the following sequence of commands inside the host:

$ docker start tomcat
$ docker start mean
$ docker exec -d mean npm start --prefix {your_blog_server_dir}

Replace {your_blog_server_dir} with the path to your Project 3 directory inside the mean container, e.g., /home/cs144/shared/project3/blog-server. Make sure that your Tomcat server is available at http://localhost:8888 and your node server is available at http://localhost:3000 on your host machine.

By default, each container runs in its own isolated environment and can reach other containers only if it knows their IP addresses. In this project, we want to make the containers reachable by their container names, not just by IPs, so that we can send HTTP requests between containers more easily. This can be done by creating a custom bridge network and connecting our three containers to the bridge network. Run the following command inside the host to create a custom bridge network, named cs144-net:

$ docker network create cs144-net

Once a bridge network is created, any running container can be connected to it through the docker network connect command:

$ docker network connect cs144-net spark
$ docker network connect cs144-net tomcat
$ docker network connect cs144-net mean

The above sequence of commands will connect our three containers, “spark”, “tomcat”, and “mean”, to the bridge network “cs144-net”. (Make sure that the three containers are already running before you execute the above commands.)

Now the three containers can reach each other through their names. To verify, execute a shell in the “spark” container and run curl commands to issue HTTP requests to other containers like the following:

$ docker exec --user cs144 -it spark /bin/bash
cs144@e6a142ac3bf9:~$ curl http://mean:3000
cs144@e6a142ac3bf9:~$ curl http://tomcat:8080

The first docker exec command runs a shell in the spark container, so that the next commands are executed from the container, not from the host. The next two curl commands, therefore, issue HTTP requests from the spark container to the mean and tomcat containers, respectively. As long as the mean and tomcat containers are up and running, you should see correct responses from their servers. Make sure that this is the case by checking the responses. From now on, you should use http://mean:3000 and http://tomcat:8080 from within the spark container to access the other containers.

Once you finish checking the containers and their network connectivity, you can stop them using the following commands:

$ docker stop spark
$ docker stop mean
$ docker stop tomcat

From now on, all three containers can talk to each other through their names!

Part A: Testing Performance of Server(s)

In this part, you are required to (1) learn how to use Locust to test the performance of your server, (2) write customized files that represent user behavior, (3) record the results of your test cases, and (4) find the maximum number of users that the server can handle under certain performance requirements.

Learn to Use Locust

You must go over the following Locust tutorial, step-by-step, before moving on:

Before moving forward, make sure you KILL the simple server from the tutorial so that the requests you send later are not answered by that fake server.

Mock Data Preparation

Before we start load testing our servers with Locust, we need to load some mock data into our databases. We have prepared two scripts, mock_data_tomcat.sh and mock_data_node.sh, for this purpose. Download the tomcat script inside the “tomcat” container and execute it as follows:

$ wget http://oak.cs.ucla.edu/classes/cs144/project5/mock_data_tomcat.sh
$ bash ./mock_data_tomcat.sh

Make sure that the MySQL and Tomcat servers are up and running inside the container before you run the above script.

Do the same for the node script inside the “mean” container:

$ wget http://oak.cs.ucla.edu/classes/cs144/project5/mock_data_node.sh
$ bash ./mock_data_node.sh

Again, make sure that the MongoDB server is running inside the container before you run the above script.

Now our databases have been populated with 500 fake blog posts by the user cs144.

Write Locust Files

Once we have all the test data inserted, it is time for you to write Locust files to load test the servers. In particular, you are required to write 6 files to perform the following tests (a minimal skeleton is sketched after the lists below):

Locust File for Tomcat Server

  1. read_tomcat.py

  2. write_tomcat.py

  3. mixed_tomcat.py

Locust File for Node Server

  1. read_node.py

  2. write_node.py

  3. mixed_node.py
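
In case you are unsure how to structure these files, below is a minimal skeleton of a read-only test written against the Locust 0.9 API. The endpoint path, wait times, and class names are placeholders for illustration only, not values required by this project; replace them with the pages and API calls that your own servers actually expose.

# read_tomcat.py -- minimal sketch only; the endpoint below is a placeholder.
from locust import HttpLocust, TaskSet, task

class ReadTasks(TaskSet):
    @task
    def read_post(self):
        # GET a page served by your server; replace the path as needed.
        self.client.get("/some/read/endpoint")

class ReadUser(HttpLocust):
    task_set = ReadTasks
    # Simulated users wait 1-2 seconds between tasks (placeholder values).
    min_wait = 1000
    max_wait = 2000

The write files would issue POST requests in their tasks instead, and the mixed files can use task weights (for example, @task(9) on the read task and @task(1) on the write task) to approximate a 90%-read/10%-write load. You would then run such a file without the web UI with a command along the lines of locust -f read_tomcat.py --host=http://tomcat:8080 --no-web -c 100 -r 10 -t 60s --only-summary; the exact flags are covered in the Locust tutorial, so double-check them with locust --help.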

Run Locust Files and Save Results

Part 1

Once you finish writing the six Locust files, run Locust without the web UI using the following four files: read_tomcat.py, write_tomcat.py, read_node.py, and write_node.py. Once they finish running, fill in the performance.json file with your results, replacing the fields whose values are currently 0. More precisely, fill in the “Number of Users” field according to the configuration parameters you set in the command, and the “Total Number Of Req”, “Total RPS”, and “response time for 98% of the requests” fields with the numbers that you get from the final summary of your tests.

Note: The filled values should be numbers, not strings. For example, if your total RPS is 160 req/s, just put the number 160, like “Total RPS”: 160, NOT “Total RPS”: “160 req/s”. Our grading script will not correctly recognize any value other than a number. Do NOT change any other part of the performance.json file.

Part 2

After filling in performance.json, you should now be familiar with the testing procedure. Let us try to find the maximum number of users that each server can handle under a 10%-write and 90%-read load using the mixed_tomcat.py and mixed_node.py files. In particular, we require that the servers return at least 98% of responses in less than 1000 ms for all requests except login requests. That is, each URL name group except /login?... should return 98% of its responses in less than 1000 ms.

Your task is to find the maximum number of users (in units of hundreds, like 300, if the number is higher than 100; in units of tens, like 60, if it is higher than 10; and in units of 1 if it is below 10) that the servers can handle under this requirement. Use summary_tomcat.txt and summary_node.txt to save the results from the tests in which each server met this requirement with the largest number of users. Please save only the summary in your result files by using --only-summary.

After Part A is done, you should have 6 test files (read_tomcat.py, write_tomcat.py, mixed_tomcat.py, read_node.py, write_node.py, mixed_node.py), 1 JSON file (performance.json) and 2 summary files (summary_tomcat.txt, summary_node.txt).

Note 1: Sometimes you might see failures caused by an error like ConnectionError(ProtocolError('Connection aborted.', error(104, 'Connection reset by peer')),). This is very likely an indication that the number of users is more than your server can handle, not necessarily that something is wrong in your implementation.

Note 2: There are no right or wrong results for the performance stats in Part A as long as your test files are correctly implemented. You just need to submit the results that you get from your performance benchmark.

Note 3: Now that you are done with Part A, think about the performance of the two servers in each scenario. Which one exhibited better performance, Tomcat or Node.JS? Note that our comparison may be a bit unfair to the Node.JS server because (1) Tomcat did not “render” markdown to HTML when a blog post was “opened” and (2) Node.JS had to perform the extra authentication-verification step for the “write” tasks. You may want to keep these differences in mind when you interpret your results. You do not need to provide an answer to this question in your submission; it is only meant to encourage you to think more deeply about what you observed from load testing.

Part B: Apache Spark

In this part, you will learn how to use the popular Apache Spark engine to perform a (potentially heavy) computational task on a large number of machines using the Map-Reduce framework. In particular, you will identify the most “popular” users on the Twitter network, measured by the number of their followers. Our new “spark” container has the Spark engine preinstalled. We also provide a snapshot of the follower-following graph of Twitter. Your job is to write (simple) code on Spark that returns the IDs of the users with high follower counts.

Twitter Graph File and Our Task

Download the twitter.edges file that contains a snapshot of the follower-following graph structure of Twitter:

Each line of the file represents the “following” edges from a particular Twitter user in the format below:

user1: user2,user3,...,userk

The above line indicates that user1 is “following” user2 through userk. Note that each user in the file is represented as a unique random integer. For example, the first line of the file:

138134695: 62902439,15224867

indicates that the user “138134695” is following two other users, 62902439 and 15224867.

Given this file, it is relatively straightforward to find the user who follows the largest number of users: we simply need to identify the line with the largest number of user IDs after the colon. Unfortunately, our task is more complex. We need to identify the users who are followed by a large number of other users (more precisely, by more than 1,000 other users). While our dataset is reasonably small – it is only 21 MB in size – you can imagine that such a dataset can potentially be huge, so we want to implement this task on the Apache Spark engine, so that we can perform it in parallel on as many machines as are available.
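
To make the file format and the goal concrete, here is a tiny plain-Python illustration (not a Spark program, and not something to submit) that parses lines of this form and counts followers with an in-memory dictionary. Your actual solution must express the same idea with Spark transformations, in Scala, so that it can run in parallel on a cluster.

# Plain-Python sketch of the counting logic, for illustration only.
# The real solution must use Spark RDD transformations instead.
from collections import defaultdict

follower_count = defaultdict(int)

with open("twitter.edges") as f:
    for line in f:
        user, sep, followed_list = line.partition(":")
        if not sep:
            continue  # skip malformed or empty lines
        # Every user listed after the colon gains one follower (namely, 'user').
        for followed in followed_list.strip().split(","):
            if followed:
                follower_count[followed] += 1

popular = [(u, c) for u, c in follower_count.items() if c > 1000]
print(len(popular))  # 177 for the provided snapshot (see the sample results below)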

Apache Spark

Writing a program that runs on a large number of machines in parallel can be a daunting task. To make this easier, a number of distributed software infrastructures have been developed. In particular, as we learned in class, the Map-Reduce framework asks the programmer to provide just the core computational logic of the given task as a set of Map and Reduce functions. Given these core functions, the Map-Reduce framework takes care of the rest, including data distribution, parallel execution, process monitoring, and result shuffling and collection. Apache Spark is popular open-source software that provides a Map-Reduce-style programming environment on a cluster with a large number of machines.

Once you are inside our spark container, you can run the Spark interactive shell by executing the spark-shell command:

$ spark-shell
Spark context Web UI available at http://172.17.0.2:4040
Spark context available as 'sc' (master = local[*], app id = local-1518992867666).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Inside the Spark shell, you can execute any Scala command using the Spark API. You can exit from the Spark interactive shell by pressing “Control+D”.

Now that we know how to start the Apache Spark shell, it is time to learn how to use it by going over one of the widely available Apache Spark tutorials on the Internet. For example, the official Quick Start Tutorial provides a ten-minute introduction to the essential basics. The file wordCount.scala also contains the example code that we went over in class:

val lines = sc.textFile("input.txt")
val words = lines.flatMap(line => line.split(" "))
val word1s = words.map(word => (word, 1))
val wordCounts = word1s.reduceByKey((a,b) => a+b)
wordCounts.saveAsTextFile("output")
System.exit(0)

Again, the above code computes the frequency of each word in the text file input.txt and generates (word, frequency) pairs as the output. When saveAsTextFile("output") is called, the program creates a new subdirectory named output, where part-xxxxx file(s) are generated that contain the (word, frequency) output pairs in wordCounts. Note the last line in the wordCount.scala script: System.exit(0). Calling the system exit function ensures that once the script finishes, the interactive shell exits as well. You can execute this file with the Spark shell using the following command:

$ spark-shell -i wordCount.scala
...
$ ls -l output/
total 8
-rwxr-xr-x 1 root root    0 Mar 16 03:18 _SUCCESS
-rwxr-xr-x 1 root root 2131 Mar 16 03:18 part-00000
-rwxr-xr-x 1 root root 2056 Mar 16 03:18 part-00001
$ head output/part-00000
(country,2)
(House,2)
(its,1)
(previously,1)
(countries,,1)
(have,2)
(policy,1)
(travel,,1)
(order,7)
(Trump’s,2)

Note that the provided wordCount.scala code is written to be executed in an interactive shell only. If we want to properly run our program on multiple machines in a Spark cluster, we need to “wrap” this code within an object with a “main” function. We also need to create proper “Spark Configuration” and “Spark Context” in which our program will run. For example, wordCountFull.scala shows an extended version of our wordCount program that can be properly compiled into an executable “package”, say word-count-project.jar using a tool chain such as sbt. Once packaged, it can be submitted to a Spark cluster for parallel execution using a command like the following:

$ spark-submit --class edu.ucla.cs144.WordCount --master spark://23.195.26.187:7077 --deploy-mode cluster word-count-project.jar

Since the goal of this project is to introduce you to the main programming paradigm/API of Spark, not the nitty gritty details of the Spark packaging tool chain and job submission interface, in this project we will assume that your code will be executed through a Spark interactive shell using the command spark-shell -i.

While Spark supports multiple languages, Scala is the most popular (and syntactically cleanest) language for programming on Spark, so we use Scala for this project. The basic Scala needed to complete this project is easy to learn, and your code will be much cleaner than if you used another language. There exist many quick online tutorials on Scala, such as this one. Fortunately, this project can be completed without using any advanced Scala constructs like class inheritance. In particular, we note that anonymous functions in Scala are similar to “arrow functions” in JavaScript and can be used to pass the “Map-Reduce” functions.

Writing Your Code

Now that you have the basics, it is time to write code. Your code must read the text file twitter.edges located in the current directory, parse it to obtain the Twitter follower-following graph, perform the necessary computation, and return the list of all (userid, follower_count) pairs for the users with more than 1000 followers. The output from your code should contain many lines of (userid, follower_count) pairs like:

(40981798,8569)
(43003845,7621)
...

The first two lines of the above output, for example, indicate that the users 40981798 and 43003845 have 8569 and 7621 followers, respectively. The output (userid, follower_count) pairs should be saved as a (set of) text file(s) in the “output” directory using the Spark saveAsTextFile() function. The output (userid, follower_count) pairs may appear in any order and need not be sorted.

In writing your code, you may find the list of Spark transformation functions helpful. Also, if you need a “hint” on parsing the provided twitter.edges file, you may find this question and answer at StackOverflow helpful (local mirror is available here).

Before we finish, we reiterate the essential requirements of your code.

Code Requirements

  1. Your code should read the twitter graph from the file “twitter.edges” located in the current directory.
  2. The output from your code should be the list of (userid, follower_count) pairs for all users whose follower count is larger than 1000. The output does not have to be sorted.
  3. Your code should save the output in the “output” subdirectory within the current working directory using saveAsTextFile() of Spark RDD.
  4. Your code should compute the final results using Map-Reduce-style programming by applying a series of Spark transformation functions to the input dataset.

Testing Your Code

Before submitting your code, thoroughly test your code so that it computes the correct results. To help you ensure that your code produces the correct output, here are a few sample results from our dataset:

User ID      Follower Count
40981798     8569
3359851      3905
88323281     2315
18742444     1585
9451052      1184
302847930    1182
12925072     1002

In total, there are 177 users with more than 1000 followers.

What to Submit

For this project, you need to submit a single zip file named project5.zip that has the following packaging structure.

project5.zip
 +- read_tomcat.py, write_tomcat.py, mixed_tomcat.py
 +- read_node.py, write_node.py, mixed_node.py
 +- performance.json
 +- summary_tomcat.txt, summary_node.txt
 +- topUsers.scala
 +- TEAM.txt
 +- README.txt (optional)

Each file is described below:

  1. read_tomcat.py, write_tomcat.py, mixed_tomcat.py, read_node.py, write_node.py, mixed_node.py: These are the six Locust files used to load test your Tomcat and Node servers.
  2. performance.json: This is the JSON file that contains the load test results.
  3. summary_tomcat.txt, summary_node.txt: These are the summary tables from the two mixed benchmark tests of Part A, Part 2.
  4. topUsers.scala: this is the main Scala code that you wrote to compute the top Twitter users. This code should be executable simply by typing “spark-shell -i topUsers.scala”. Please DO NOT submit any input or output files for your code. Just submit your main Scala script.
  5. TEAM.txt: This file must include the 9-digit university ID (UID) of every team member, one UID per line. No spaces or dashes; just one 9-digit UID per line. If you are working on your own, include just your UID.
  6. README.txt: This optional file includes any comments you find worth noting, regarding your code structure, etc.

All files should be contained directly under the project5.zip (without any enclosing folders).

Testing Your Zip File

To ensure the correct packaging of your submission, we have made a grading script p5_test for Project 5, which can be executed like:

$ ./p5_test project5.zip

(Add the appropriate path to project5.zip if needed. You may need to run “chmod +x p5_test” if you get a permission error.)

You MUST test your submission using the script to minimize the chance of an unexpected error during grading. When everything runs properly, you will see an output similar to the following from the script:

Executing your Spark code.....
Spark context Web UI available at http://10.0.2.15:4040
Spark context available as 'sc' (master = local[*], app id = local-1488946788877).
Spark session available as 'spark'.
Loading topUsers.scala...

...

(20,1010)
(99,1010)
(10,1010)

SUCCESS! We finished testing your zip file integrity.

Once your work is properly packaged as a zip file, submit your zip file via our submission page at CCLE.

You may submit as many times as you like; however, only the latest submission will be saved, and that is what we will use for grading your work and determining late penalties.

Grading Criteria

The overall grading breakdown is as below: