Please see Single Machine Usage for information on running the individual commands. This page describes how to run those commands on multiple machines, so the details of the individual commands are not repeated here.
Using Hadoop
Run Y!LDA on the corpus:
Assuming you have a homogeneous setup, install Y!LDA on one machine.
Run make jar to create the LDALibs.jar file with all the required libraries and binaries.
Copy LDALibs.jar to HDFS.
Copy Formatter.sh, LDA.sh, functions.sh and runLDA.sh to a gateway. A gateway is any machine with access to the grid; think of it as the machine from which you run your hadoop commands.
Figure out the maximum memory allowed per map task for your cluster and pass the same value to the script via the max-mem parameter. This can be done by looking at any job configuration (job.xml) and searching for the value of the "mapred.cluster.max.map.memory.mb" property (see the example below).
Run runLDA.sh 1 "flags" [train|test] "queue" "organized-corpus" "output-dir" "max-mem" "number_of_topics" "number_of_iters" "full_hdfs_path_of_LDALibs.jar" "number_of_machines" ["training-output"]
Training example: runLDA.sh 1 "" train default "/user/me/organized-corpus" "/user/me/lda-output" 6144 100 100 "LDALibs.jar" 10
Testing example: runLDA.sh 1 "" test default "/user/me/organized-corpus" "/user/me/lda-output" 6144 100 100 "LDALibs.jar" 10 "/user/me/lda-output"
This starts two Map-Reduce streaming jobs. The first job runs the formatter in the reducer, using the supplied number_of_machines as the number of reduce tasks: your corpus is split into number_of_machines parts and formatted into the protobuffer format files needed by the second job. The second job is a map-only job that takes this formatted input and runs on number_of_machines separate machines; each task starts the DM_Server on its machine and then runs Y!LDA on its chunk of the corpus.
For testing, use the test flag and provide the directory storing the training output.
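For instance, the max-mem value used in the examples above (6144) would be found by searching a job configuration for the property mentioned earlier. A minimal sketch, where the job.xml path is illustrative:
# print the property and the line that follows it (which holds the value)
grep -A 1 "mapred.cluster.max.map.memory.mb" /path/to/job.xml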
Output generated
Creates <number_of_machines> folders in <output-dir>, one for each client.
Each of these directories holds the same output as the single-machine case, but from different clients.
<output-dir>/<client-id>/learntopics.WARNING contains the output written to stderr by client <client-id>
<output-dir>/<client-id>/lda.docToTop.txt contains the topic proportions assigned to the documents in the portion of the corpus allotted to client <client-id>
<output-dir>/<client-id>/lda.topToWor.txt contains the salient words learned for each topic. This remains almost the same across clients, so you can pick any one of these as the salient words per topic for the full corpus.
<output-dir>/<client-id>/lda.ttc.dump contains the actual model. Like the salient words, this is almost the same across clients, and any one of them can be used as the model for the full corpus.
<output-dir>/global contains the dump of the global dictionary and the partitioned global topic counts table. These are generated in the training phase and are critical for the test option to work.
Viewing progress
The stderr output of the code is redirected to the Hadoop logs, so you can check the task logs from the tracking URL displayed in the output of runLDA.sh to see what is happening.
Failure Recovery
We provide a check-pointing mechanism to handle recovery from failures. The current scheme works in local mode and, for distributed mode, only with Hadoop, because the distributed check-pointing uses HDFS to store the check-points. The process is as follows:
The scripts use this mechanism to detect failures and re-run the task from the previous check-point. Since learntopics checks whether check-point metadata is available in the working directory and, if so, starts off from there, a separate restart option is not needed.
As a by-product you get the ability to do incremental runs, that is, to run say 100 iterations, check the output and run the next 100 iterations if needed. The scripts detect this condition and ask whether you want to start off from where you left off or restart from the beginning.
The scripts are designed so that all of this happens transparently to the user. This information is mainly for developers and for cases where the recovery mechanism could not handle the failure within the specified number of attempts. Check the stderr logs to see the reason for the failure. Most of the time it is due to wrong usage, which results in unrecoverable aborts. If you think it is because of a flaky cluster, try increasing the number of attempts. If nothing works and you think there is a bug in the code, please let us know.
Using SSH - Assume you have 'm' machines
If you have a homogeneous setup, install Y!LDA on one machine, run make jar and copy LDALibs.jar to all the other machines in the setup. Otherwise, install Y!LDA on all machines.
Split the corpus into 'm' parts and distribute them to the 'm' machines
Run formatter on each split of the corpus on every machine.
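A minimal sketch of splitting and distributing for m=2, assuming the raw corpus is a single text file corpus.txt with one document per line (the file and host names are illustrative); the formatter is then run on each machine as described in Single Machine Usage:
# split the corpus into 2 line-based chunks named corpus.part.aa and corpus.part.ab
split -n l/2 corpus.txt corpus.part.
scp corpus.part.aa server_0:/tmp/corpus/
scp corpus.part.ab server_1:/tmp/corpus/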
Run the Distributed_Map Server on each machine as a background process using nohup:
nohup ./DM_Server <model> <server_id> <num_clients> <host:port> --Ice.ThreadPool.Server.SizeMax=9 &
model: an integer that represents the model. Set it to 1 for Unigram_Model
server_id: a number that denotes the index of this server in the list of servers provided to 'learntopics'. If server1 has host:port 10.1.1.1:10000 and is assigned id 0, and server2 has host:port 10.1.1.2:10000 and is assigned id 1, then the list of servers provided to 'learntopics' has to be 10.1.1.1:10000,10.1.1.2:10000 and not the other way around.
num_clients: a number that denotes the number of clients that will access the Distributed Map. This is usually equal to 'm'. It is used to implement a barrier.
host:port: the IP address and port on which the server must listen.
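For example, in a two-machine setup with Unigram_Model servers on 10.1.1.1 and 10.1.1.2 (the IP addresses, port and log file name are illustrative):
# on 10.1.1.1 (server_id 0, 2 clients)
nohup ./DM_Server 1 0 2 10.1.1.1:10000 --Ice.ThreadPool.Server.SizeMax=9 > dm_server.log 2>&1 &
# on 10.1.1.2 (server_id 1, 2 clients)
nohup ./DM_Server 1 1 2 10.1.1.2:10000 --Ice.ThreadPool.Server.SizeMax=9 > dm_server.log 2>&1 &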
Run Y!LDA on the corpus:
On every machine run
learntopics --topics=<topics> --iter=<iter> --servers=<list-of-servers> --chkptdir="/tmp" --chkptinterval=10000
<list-of-servers>: the comma-separated list of ip:port values of the servers involved in the setup. The order of the ip:port entries must match the server_id parameters used when starting the servers.
chkptdir & chkptinterval: these are currently used only with the Hadoop setup. Set chkptdir to a dummy value. To ensure the check-pointing code does not execute, set chkptinterval to a very large value or to a number greater than the number of iterations.
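Continuing the illustrative two-server example above, with 100 topics and 500 iterations (chkptinterval is set larger than the number of iterations so the check-pointing code never runs):
learntopics --topics=100 --iter=500 --servers=10.1.1.1:10000,10.1.1.2:10000 --chkptdir="/tmp" --chkptinterval=10000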
Create Global Dictionary - Run the following on the server with id 0, assuming learntopics was run in the folder /tmp/corpus:
mkdir -p /tmp/corpus/global_dict; cd /tmp/corpus/global_dict;
scp server_i:/tmp/corpus/lda.dict.dump lda.dict.dump.i
where 'i' is the server_id of each server; repeat this for every server in the setup (see the loop sketch after this step).
Merge Dictionaries
Merge_Dictionaries --dictionaries=m --dumpprefix=lda.dict.dump
mkdir -p ../global; mkdir -p ../global/topic_counts; cp lda.dict.dump ../global/;
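Put together, a minimal sketch of this step for m=2, assuming the machines are reachable as server_0 and server_1 (host names are illustrative):
mkdir -p /tmp/corpus/global_dict; cd /tmp/corpus/global_dict;
# copy the local dictionary dump from every server, suffixing it with that server's id
for i in 0 1; do scp server_$i:/tmp/corpus/lda.dict.dump lda.dict.dump.$i; done
# merge the per-server dictionaries into one global dictionary
Merge_Dictionaries --dictionaries=2 --dumpprefix=lda.dict.dump
mkdir -p ../global; mkdir -p ../global/topic_counts; cp lda.dict.dump ../global/;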
Create a sharded Global Word-Topic Counts dump - Run the following on every machine in the setup:
mkdir -p /tmp/corpus/global_top_cnts; cd /tmp/corpus/global_top_cnts;
scp server_0:/tmp/corpus/global/lda.dict.dump lda.dict.dump.global
Merge_Topic_Counts --topics=<topics> --clientid=<server-id> --servers=<list-of-servers> --globaldictionary="lda.dict.dump.global"
scp lda.ttc.dump server_0:/tmp/corpus/global/topic_counts/lda.ttc.dump.<server_id>
where <server_id> is the id of the machine the command is run on.
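For example, on the machine with server_id 1 in the illustrative two-server setup above (topics and addresses as before):
mkdir -p /tmp/corpus/global_top_cnts; cd /tmp/corpus/global_top_cnts;
scp server_0:/tmp/corpus/global/lda.dict.dump lda.dict.dump.global
Merge_Topic_Counts --topics=100 --clientid=1 --servers=10.1.1.1:10000,10.1.1.2:10000 --globaldictionary="lda.dict.dump.global"
scp lda.ttc.dump server_0:/tmp/corpus/global/topic_counts/lda.ttc.dump.1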
Copy the parameters dump file to the global dump - Run the following on server_0:
cd /tmp/corpus; cp lda.par.dump global/topic_counts/lda.par.dump
This completes training; the model is available at server_0:/tmp/corpus/global.
Running Y!LDA in test mode: Run the following on server_0, assuming the test corpus is in /tmp/test_corpus:
cd /tmp/test_corpus;
cp -r ../corpus/global .
learntopics -teststream --dumpprefix=global/topic_counts/lda --numdumps=m --dictionary=global/lda.dict.dump --maxmemory=2048 --topics=<topics>
Feed all your documents, in the same format that 'formatter' expects, to the above command's stdin.
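For example, with the two-machine model trained above (100 topics, so --numdumps=2) and the test documents concatenated into a file named test_docs.txt (the file name is hypothetical):
cd /tmp/test_corpus;
cp -r ../corpus/global .
# pipe the test documents into learntopics in streaming test mode
cat test_docs.txt | learntopics -teststream --dumpprefix=global/topic_counts/lda --numdumps=2 --dictionary=global/lda.dict.dump --maxmemory=2048 --topics=100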