The primary goal of this project is to learn how to use common software tools to run two important theoretical methods that we learned in class: (1) low-rank matrix approximation and (2) Latent Dirichlet Allocation (LDA). In Task 1, you will learn how to perform a rank-k approximation of a matrix using Python. In Task 2, you will learn how to perform LDA using MALLET.
The first task of Project 4 is to "compress" an image through a rank-k matrix approximation using Singular Value Decomposition (SVD). In the past, Matlab (and its open-source alternative Octave) was the most popular software tool for matrix manipulation, but Python with its numpy package has recently taken over the community thanks to its ease of use and the wide availability of related tools.
Given its wide adoption in the data-science and machine-learning community, we will use Python for Task 1. Python is an easy-to-learn, powerful programming language with efficient high-level data structures and a simple yet effective approach to object-oriented programming. Basic Python is easy to learn, and your code will be much clearer than equivalent code in many other languages. Python 3 is preinstalled in our virtual machine and can be invoked as follows:
cs246@cs246:~$ python3
Python 3.5.2 (default, Sep 14 2017, 22:51:06)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
If you are not familiar with Python, go over one of the many online tutorials on Python, such as this one. While Python has many advanced programming constructs, the linked tutorial is likely to teach you what you need to complete this project.
Since we need to perform linear algebraic operations on matrix-like data for Task 1, we will have to use the numpy Python package. If you are not familiar with numpy, go over this tutorial on Python numpy. (Note: You do not need to go over the "Matlab files" and "Matplotlib" sections of the linked tutorial since we won't be using them in this task.)
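As a quick warm-up, the kinds of numpy operations Task 1 relies on look like the following. This is only an illustrative sketch; none of these variable names appear in the provided code:

```python
import numpy as np

# Build a small matrix and exercise the basic operations we will need.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

B = A @ A        # matrix multiplication
col = A[:, 0]    # select the first column
sub = A[:1, :]   # keep only the first row (slicing removes rows/columns)

print(B)         # the matrix product of A with itself
```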
In the project4.zip file, we included a task1.py file that you have to update to implement the code for Task 1. The current code takes two command-line parameters and simply prints out the task that it is supposed to do, like the following:
cs246@cs246:~$ ./task1.py lion-small.png 10
This program should read lion-small.png file, perform rank 10 approximation, and save the results in lion-small.10.png and lion-small.10.npy files.
In the project4.zip file, we also included two grayscale PNG files, lion-small.png and lion-big.png, that you can use to develop and test your code.
Your updated code should perform the following operations:
1. Read the input PNG file into a matrix using the scipy.misc.imread function.
2. Perform the rank-k approximation of the matrix, where k is the rank given as the second command-line parameter of task1.py.
3. Save the approximated matrix as (1) an npy file (with the name from the getOutputNpyName function call) using the numpy.save function and (2) a PNG file (with the name from the getOutputPngName function call) using the scipy.misc.imsave function.

If you need to know how to perform basic matrix operations on a numpy array (like multiplying matrices, removing rows and columns, etc.), read the linked tutorial on Python numpy. If you need to know how to perform advanced linear algebraic operations in Python, go over the documentation on numpy.linalg.
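The core rank-k approximation step can be sketched with numpy.linalg.svd as follows. The helper name rank_k_approx is our own, not part of the provided task1.py:

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation of A (in the least-squares sense) via SVD."""
    # Thin SVD: A = U @ diag(s) @ Vt, with singular values in decreasing order
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Sanity check: a rank-1 matrix is reproduced exactly by its rank-1 approximation.
A = np.outer([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
A1 = rank_k_approx(A, 1)
```

For an image matrix, the same call applies unchanged; just cast the pixel data to a floating-point array first.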
Once you finish developing your code, run it on the PNG files lion-small.png and lion-big.png with varying ranks --- say 1, 10, and 100 --- to see how the approximated image changes depending on the rank.
Note: The lion-big.png image is of dimension 2560x1600, so it may take a while for the SVD and matrix multiplication operations to finish.
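A quick back-of-the-envelope check of why rank-k approximation counts as "compression": storing the rank-k factors of an m-by-n matrix takes k(m+n+1) numbers instead of mn. A small sketch for the 2560x1600 case (the helper name is ours):

```python
def rank_k_storage(m, n, k):
    # Values needed for the factors: U_k (m*k), s_k (k), and Vt_k (k*n)
    return k * (m + n + 1)

m, n = 2560, 1600            # dimensions of lion-big.png per the handout
full = m * n                 # 4,096,000 values for the raw matrix
ratio = rank_k_storage(m, n, 10) / full
print(ratio)                 # rank 10 stores only about 1% as many values
```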
Latent Dirichlet Allocation (LDA) is a probabilistic topic model that can extract latent topics from a document corpus automatically without any user input. Once the user specifies the desired number of topics to extract, LDA takes care of the rest. As Task 2 of this project, we will run LDA on small corpora and explore the results.
For this task, you will have to use the topic modeling libs in MALLET to perform LDA. MALLET is a Java-based machine learning library for text analysis, including document classification, clustering, topic modeling, and information extraction. The MALLET topic modeling toolkit contains an efficient, sampling-based implementation of LDA, which allows you to perform LDA on a large text corpus by writing just a few lines of shell script.
In our project4.zip file, we included two datasets that can be used for this task: (1) NSF grants and (2) AP articles. The NSF grants dataset, located in the data/nsf/ folder, consists of 1,166 abstracts of NSF grant proposals sampled from the NSF Research Awards Corpus between 2000 and 2003. The AP articles dataset, located in the data/ap/ folder, consists of 1,085 random Associated Press articles. Take a look at the files in the two folders to see what the texts look like.
Now read the MALLET Topic Modeling page to learn how to use MALLET to perform LDA on a document corpus. Roughly, you use the MALLET topic model package in two steps: the first step is to import text files into MALLET's internal format, which can be done with the import-dir command. Once you have imported documents into MALLET format, the next step is to run the topic model, which can be done using the train-topics command with proper options.
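Concretely, the two steps might look like the following. This is only a sketch: the option names come from the MALLET topic-modeling documentation, and the output file name nsf.mallet is our own choice; check the options against mallet --help before relying on them.

```shell
# Step 1: import the raw text files into MALLET's internal format
mallet import-dir --input data/nsf --output nsf.mallet \
       --keep-sequence --remove-stopwords

# Step 2: train an LDA topic model on the imported corpus
mallet train-topics --input nsf.mallet --num-topics 20
```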
Note: Within our virtual machine, MALLET is installed in the /usr/local/mallet/ folder. Also, the PATH variable has already been set, so you can execute MALLET just by typing mallet in the command line.
Now run MALLET on the NSF grants dataset for 20 topics. If executed correctly, MALLET will print out a sequence of output similar to the following:
Mallet LDA: 20 topics, 5 topic bits, 11111 topic mask
Data loaded.
max tokens: 426
total tokens: 162501
<10> LL/token: -9.65293
...
0   0.25  cell molecular genes proteins protein genetic cells biology function gene dna plant plants important research expression specific involved identify development
1   0.25  results research international network work behavior study field important model conference service fields models investigators economic general states ideas large
2   0.25  data sites research carbon marine soil project time north global provide support pacific ecosystems equipment water results ecology change isotopic
...
18  0.25  project information computer provide software digital technologies develop resource engineering courses computing developed design techniques work experiments modeling simulation development
19  0.25  technology project learning research industry curriculum environmental develop laboratory methods include manufacturing applications phase small process opportunities including industrial cost
...
<1000> LL/token: -8.91212
Total time: 16 seconds
When executed, MALLET runs multiple iterations of Gibbs sampling on the input corpus and prints out the extracted topics and their top-20 associated words every 50 iterations until convergence. For example, the following output from MALLET

0   0.25  cell molecular genes proteins protein genetic cells biology function gene dna plant plants important research expression specific involved identify development

indicates that the words most frequently associated with the first topic (topic 0) are cell, molecular, genes, ..., and development.
Note: Please remember that the LDA result is produced through random sampling, so you are unlikely to get the same result across multiple runs on the same dataset. Don't panic if your results change in each run or your output looks very different from our sample results here.
The final result from LDA can be saved as a set of text files. Run MALLET again with the appropriate command-line options so that it saves the "topic-keys" file (the file that contains the topic numbers and their top-20 most frequent words) as topic-words.txt and the "doc-topics" file (the file that contains the document ids and their topic-association probabilities) as doc-topics.txt. If you are not sure of the MALLET command-line options, use the --help option to learn about them and their usage.
Once saved, look at the files to investigate whether the extracted topics make sense. Do they look reasonable? Are the top-20 words of each topic coherent? How noisy are the results? Now repeat the same experiments with different numbers of topics --- say, 10, 50, and 100 --- and on the AP articles dataset. Check how the results change and whether you still like them.
Now that you have learned how to use MALLET to run LDA on a document collection, finish Task 2 by doing the following:
1. Update the task2.sh script in project4.zip so that it does the following:
   - It takes two command-line parameters: <corpus dir> <number of topics>
   - It imports the text files in <corpus dir> and runs LDA with <number of topics>.
   - It saves two output files:
     - topic-words.txt: This is the "topic-keys" file produced by MALLET, whose lines consist of the topic number followed by its top-20 associated words.
     - doc-topics.txt: This is the "doc-topics" file produced by MALLET, whose lines start with the document id and file name, followed by the topic probabilities.
2. Run task2.sh on the data/nsf dataset with 20 topics to produce the topic-words.txt and doc-topics.txt files.
3. Go over the topic-words.txt file to come up with an appropriate "label" for each of the 20 topics. Then create a file named topic-labels.txt, whose lines consist of the topic number followed by its label. For example, if you decide that the first topic (topic 0) is about "cellular biology", the first line of the topic-labels.txt file should be:

0: cellular biology
Submit task2.sh, topic-words.txt, doc-topics.txt, and topic-labels.txt as part of your project 4 submission.
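One possible shape for task2.sh is sketched below. The output options --output-topic-keys and --output-doc-topics come from the MALLET documentation, and corpus.mallet is our own name for the intermediate file; verify the options with mallet train-topics --help in the VM before using them.

```shell
#!/bin/sh
# Usage: ./task2.sh <corpus dir> <number of topics>
CORPUS_DIR=$1
NUM_TOPICS=$2

# Import the corpus into MALLET's internal format
mallet import-dir --input "$CORPUS_DIR" --output corpus.mallet \
       --keep-sequence --remove-stopwords

# Run LDA and save the topic-keys and doc-topics files
mallet train-topics --input corpus.mallet --num-topics "$NUM_TOPICS" \
       --output-topic-keys topic-words.txt \
       --output-doc-topics doc-topics.txt
```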
In project 4, you need to submit the following files:

- task1.py: your completed code for Task 1
- task2.sh: your completed script for Task 2
- topic-words.txt: obtained from the data/nsf dataset using 20 topics
- doc-topics.txt: obtained from the data/nsf dataset using 20 topics
- topic-labels.txt: your labels for the 20 topics
Your project must be submitted electronically before the deadline through CCLE. Navigate to Projects on the left of the page, and click on the Project 4 submission section. If you submit multiple times, we will grade only the latest submission.
Your submission must be a single file named project4.zip, created using a zip compression utility (e.g., using the "zip -r project4.zip *" command in the VM).
You should submit this single file project4.zip that has the following packaging structure.
project4.zip
 |
 +- task1.py
 |
 +- task2.sh
 |
 +- topic-words.txt
 |
 +- doc-topics.txt
 |
 +- topic-labels.txt
 |
 +- README.txt (optional)
cs246@cs246:~$ ./p4_test.sh project4.zip

(If your project4.zip file is not located in the current directory, you need to add the path to the zip file before project4.zip. You may need to run "chmod +x p4_test.sh" if there is a permission error.)
You MUST test your submission using the script before your final submission to minimize the chance of an unexpected error during grading. When everything runs properly, you will see an output similar to the following from the grading script:
Testing Task 1...
Finished testing Task 1, SUCCESS!
Testing Task 2...
Finished testing Task 2, SUCCESS!
Whether the submitted topic-words.txt, doc-topics.txt, and topic-labels.txt files are reasonable. -- 10%