The primary goal of this project is to learn how to use common software tools to run two important theoretical methods that we learned in class: (1) low-rank matrix approximation and (2) Latent Dirichlet Allocation (LDA). In Task 1, you will learn how to perform a rank-k approximation of a matrix using Python. In Task 2, you will learn how to perform LDA using MALLET.
The first task of Project 4 is to "compress" an image through a rank-k matrix approximation using Singular Value Decomposition (SVD). In the past, Matlab (and its open-source alternative Octave) was the most popular software tool for matrix manipulation, but Python with its numpy package has recently taken over the community thanks to its ease of use and the wide availability of related tools.
Given its wide adoption in the data-science and machine-learning community, we will use Python for Task 1. Python is an easy-to-learn, powerful programming language with efficient high-level data structures and a simple yet effective approach to object-oriented programming. Basic Python is easy to learn, and your code will be much clearer than equivalent code in many other languages. Python 3 is preinstalled in our virtual machine and can be invoked as follows:
cs246@cs246:~$ python3
Python 3.5.2 (default, Sep 14 2017, 22:51:06)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
If you are not familiar with Python, go over one of the many online tutorials on Python, such as this one. While Python has many advanced programming constructs, the linked tutorial is likely to teach you what you need to complete this project.
Since we need to perform linear algebraic operations on matrix-like data for Task 1, we will have to use the numpy Python package. If you are not familiar with numpy, go over this tutorial on Python numpy. (Note: You do not need to go over the "Matlab files" and "Matplotlib" sections of the linked tutorial since we won't be using them in this task.)
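As a quick warm-up, the kinds of numpy operations Task 1 relies on look like the following. This is only an illustrative sketch; none of these variable names appear in the provided code:

```python
import numpy as np

# Build a small matrix and exercise the basic operations we will need.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

B = A @ A        # matrix multiplication
col = A[:, 0]    # select the first column
sub = A[:1, :]   # keep only the first row (slicing removes rows/columns)

print(B)         # the matrix product of A with itself
```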
In the project4.zip file, we included a task1.py file that you have to update to implement the code for Task 1. The current code takes two command-line parameters and simply prints out the task that it is supposed to do, like the following:
cs246@cs246:~$ ./task1.py lion-small.png 10
This program should read lion-small.png file, perform rank 10 approximation, and save the results in lion-small.10.png and lion-small.10.npy files.
In the project4.zip file, we also included two grayscale PNG files, lion-small.png and lion-big.png, that you can use to develop and test your code.
Your updated code should perform the following operations:
1. Read the input PNG file into a matrix using the scipy.misc.imread function.
2. Perform the rank-k approximation of the matrix, where k is the rank given as the second command-line parameter of task1.py.
3. Save the approximated matrix as (1) an npy file (with the name from the getOutputNpyName function call) using the numpy.save function and (2) a PNG file (with the name from the getOutputPngName function call) using the scipy.misc.imsave function.

If you need to know how to perform basic matrix operations on a numpy array (like multiplying matrices, removing rows and columns, etc.), read the linked tutorial on Python numpy. If you need to know how to perform advanced linear algebraic operations in Python, go over the documentation on numpy.linalg.
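The core rank-k approximation step can be sketched with numpy.linalg.svd as follows. The helper name rank_k_approx is our own, not part of the provided task1.py:

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation of A (in the least-squares sense) via SVD."""
    # Thin SVD: A = U @ diag(s) @ Vt, with singular values in decreasing order
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Sanity check: a rank-1 matrix is reproduced exactly by its rank-1 approximation.
A = np.outer([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
A1 = rank_k_approx(A, 1)
```

For an image matrix, the same call applies unchanged; just cast the pixel data to a floating-point array first.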
Once you finish developing your code, run it on the PNG files lion-small.png and lion-big.png with varying ranks --- say 1, 10, and 100 --- to see how the approximated image changes depending on the rank.
Note: The lion-big.png image is of dimension 2560x1600, so it may take a while for the SVD and matrix multiplication operations to finish.
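A quick back-of-the-envelope check of why rank-k approximation counts as "compression": storing the rank-k factors of an m-by-n matrix takes k(m+n+1) numbers instead of mn. A small sketch for the 2560x1600 case (the helper name is ours):

```python
def rank_k_storage(m, n, k):
    # Values needed for the factors: U_k (m*k), s_k (k), and Vt_k (k*n)
    return k * (m + n + 1)

m, n = 2560, 1600            # dimensions of lion-big.png per the handout
full = m * n                 # 4,096,000 values for the raw matrix
ratio = rank_k_storage(m, n, 10) / full
print(ratio)                 # rank 10 stores only about 1% as many values
```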
Latent Dirichlet Allocation (LDA) is a probabilistic topic model that can extract latent topics from a document corpus automatically without any user input. Once the user specifies the desired number of topics to extract, LDA takes care of the rest. As Task 2 of this project, we will run LDA on small corpora and explore the results.
For this task, you will have to use the topic modeling libs in MALLET to perform LDA. MALLET is a Java-based machine learning library for text analysis, including document classification, clustering, topic modeling, and information extraction. The MALLET topic modeling toolkit contains an efficient, sampling-based implementation of LDA, which allows you to perform LDA on a large text corpus by writing just a few lines of shell script.
In our project4.zip file, we included two datasets that can be used for this task: (1) NSF grants and (2) AP articles. The NSF grants dataset, located in the data/nsf/ folder, consists of 1,166 abstracts of NSF grant proposals sampled from the NSF Research Awards Corpus between 2000 and 2003. The AP articles dataset, located in the data/ap/ folder, consists of 1,085 random Associated Press articles. Take a look at the files in the two folders to see what the texts look like.
Now read the MALLET Topic Modeling page to learn how to use MALLET to perform LDA on a document corpus. Roughly, you use the MALLET topic model package in two steps: the first step is to import text files into MALLET's internal format, which can be done with the import-dir command. Once you have imported documents into MALLET format, the next step is to run the topic model, which can be done using the train-topics command with proper options.
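Concretely, the two steps might look like the following. This is only a sketch: the option names come from the MALLET topic-modeling documentation, and the output file name nsf.mallet is our own choice; check the options against mallet --help before relying on them.

```shell
# Step 1: import the raw text files into MALLET's internal format
mallet import-dir --input data/nsf --output nsf.mallet \
       --keep-sequence --remove-stopwords

# Step 2: train an LDA topic model on the imported corpus
mallet train-topics --input nsf.mallet --num-topics 20
```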
Note: Within our virtual machine, MALLET is installed in the /usr/local/mallet/ folder. Also, the PATH variable has already been set, so you can execute MALLET just by typing mallet in the command line.
Now run MALLET on the NSF grants dataset for 20 topics. If executed correctly, MALLET will print out a sequence of output similar to the following:
Mallet LDA: 20 topics, 5 topic bits, 11111 topic mask
Data loaded.
max tokens: 426
total tokens: 162501
<10> LL/token: -9.65293
...
0   0.25  cell molecular genes proteins protein genetic cells biology function gene dna plant plants important research expression specific involved identify development
1   0.25  results research international network work behavior study field important model conference service fields models investigators economic general states ideas large
2   0.25  data sites research carbon marine soil project time north global provide support pacific ecosystems equipment water results ecology change isotopic
...
18  0.25  project information computer provide software digital technologies develop resource engineering courses computing developed design techniques work experiments modeling simulation development
19  0.25  technology project learning research industry curriculum environmental develop laboratory methods include manufacturing applications phase small process opportunities including industrial cost
...
<1000> LL/token: -8.91212
Total time: 16 seconds
When executed, MALLET runs multiple iterations of Gibbs sampling on the input corpus and prints out the extracted topics and their top-20 associated words every 50 iterations until convergence. For example, the following output from MALLET

0   0.25  cell molecular genes proteins protein genetic cells biology function gene dna plant plants important research expression specific involved identify development

indicates that the words most frequently associated with the first topic (topic 0) are cell, molecular, genes, ..., and development.
Note: Please remember that the LDA result is produced through random sampling, so you are unlikely to get the same result across multiple runs on the same dataset. Don't panic if your results change in each run or your output looks very different from our sample results here.
The final result from LDA can be saved as a set of text files. Run MALLET again with the appropriate command-line options so that it saves the "topic-keys" file (the file that contains the topic numbers and their top-20 most frequent words) as topic-words.txt and the "doc-topics" file (the file that contains the document ids and their topic-association probabilities) as doc-topics.txt. If you are not sure of the MALLET command-line options, use the --help option to learn about them and their usage.
Once saved, look at the files to investigate whether the extracted topics make sense. Do they look reasonable? Are the top-20 words of each topic coherent? How noisy are the results? Now repeat the same experiments with different numbers of topics --- say, 10, 50, and 100 --- and on the AP articles dataset. Check how the results change and whether you still like them.
Now that you have learned how to use MALLET to run LDA on a document collection, finish Task 2 by doing the following:
1. Update the task2.sh script in project4.zip so that it does the following:
   - It takes two command-line parameters: <corpus dir> <number of topics>
   - It imports the text files in <corpus dir> and runs LDA with <number of topics>.
   - It saves two output files:
     - topic-words.txt: This is the "topic-keys" file produced by MALLET, whose lines consist of the topic number followed by its top-20 associated words.
     - doc-topics.txt: This is the "doc-topics" file produced by MALLET, whose lines start with the document id and file name, followed by the topic probabilities.
2. Run task2.sh on the data/nsf dataset with 20 topics to produce the topic-words.txt and doc-topics.txt files.
3. Go over the topic-words.txt file to come up with an appropriate "label" for each of the 20 topics. Then create a file named topic-labels.txt, whose lines consist of the topic number followed by its label. For example, if you decide that the first topic (topic 0) is about "cellular biology", the first line of the topic-labels.txt file should be:

0: cellular biology
Submit task2.sh, topic-words.txt, doc-topics.txt, and topic-labels.txt as part of your project 4 submission.
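One possible shape for task2.sh is sketched below. The output options --output-topic-keys and --output-doc-topics come from the MALLET documentation, and corpus.mallet is our own name for the intermediate file; verify the options with mallet train-topics --help in the VM before using them.

```shell
#!/bin/sh
# Usage: ./task2.sh <corpus dir> <number of topics>
CORPUS_DIR=$1
NUM_TOPICS=$2

# Import the corpus into MALLET's internal format
mallet import-dir --input "$CORPUS_DIR" --output corpus.mallet \
       --keep-sequence --remove-stopwords

# Run LDA and save the topic-keys and doc-topics files
mallet train-topics --input corpus.mallet --num-topics "$NUM_TOPICS" \
       --output-topic-keys topic-words.txt \
       --output-doc-topics doc-topics.txt
```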
In project 4, you need to submit the following files:

- task1.py: your completed code for Task 1
- task2.sh: your completed script for Task 2
- topic-words.txt: obtained from the data/nsf dataset using 20 topics
- doc-topics.txt: obtained from the data/nsf dataset using 20 topics
- topic-labels.txt: your labels for the 20 topics
Your project must be submitted electronically before the deadline through CCLE. Navigate to Projects on the left of the page, and click on the Project 4 submission section. If you submit multiple times, we will grade only the latest submission.
Your submission must be a single file named project4.zip, created using a zip compression utility (e.g., using the "zip -r project4.zip *" command in the VM).
You should submit this single file project4.zip that has the following packaging structure.
project4.zip
 |
 +- task1.py
 |
 +- task2.sh
 |
 +- topic-words.txt
 |
 +- doc-topics.txt
 |
 +- topic-labels.txt
 |
 +- README.txt (optional)
cs246@cs246:~$ ./p4_test.sh project4.zip

(If your project4.zip file is not located in the current directory, you need to add the path to the zip file before project4.zip. You may need to run "chmod +x p4_test.sh" if there is a permission error.)
You MUST test your submission using the script before your final submission to minimize the chance of an unexpected error during grading. When everything runs properly, you will see an output similar to the following from the grading script:
Testing Task 1...
Finished testing Task 1, SUCCESS!
Testing Task 2...
Finished testing Task 2, SUCCESS!
Whether the submitted topic-words.txt, doc-topics.txt, and topic-labels.txt files are reasonable. -- 10%