Multi-Machine Setup

See Single Machine Usage for details on running the individual commands. This section describes how to run those commands across multiple machines, so the per-command details are not repeated here.

  1. Using Hadoop

    1. Run Y!LDA on the corpus:

      1. Assuming you have a homogeneous setup, install Y!LDA on one machine.

      2. Run make jar to create LDALibs.jar, which bundles all the required libraries and binaries

      3. Copy LDALibs.jar to HDFS
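
        For example (a hedged illustration; the HDFS destination is arbitrary, use whatever path you will later pass as full_hdfs_path_of_LDALibs.jar):

          hadoop fs -put LDALibs.jar /user/me/LDALibs.jar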

      4. Copy Formatter.sh, LDA.sh, functions.sh and runLDA.sh to a gateway. A gateway is any machine with access to the grid; think of it as the machine from which you run your hadoop commands.

      5. Determine the maximum memory allowed per map task on your cluster and pass it to the script via the max-mem parameter. You can find this by looking at any job configuration (job.xml) and searching for the value of the "mapred.cluster.max.map.memory.mb" property.
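
        For example, if you have a local copy of some job's job.xml (the file path is illustrative), you could locate the property with:

          grep -A 2 "mapred.cluster.max.map.memory.mb" job.xml    # the neighbouring <value> element holds the limit in MB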

      6. Run runLDA.sh 1 "flags" [train|test] "queue" "organized-corpus" "output-dir" "max-mem" "number_of_topics" "number_of_iters" "full_hdfs_path_of_LDALibs.jar" "number_of_machines" ["training-output"]

        Training example: runLDA.sh 1 "" train default "/user/me/organized-corpus" "/user/me/lda-output" 6144 100 100 "LDALibs.jar" 10

        Testing example: runLDA.sh 1 "" test default "/user/me/organized-corpus" "/user/me/lda-output" 6144 100 100 "LDALibs.jar" 10 "/user/me/lda-output"

      7. This starts two Map-Reduce streaming jobs. The first job runs the formatter in its reducers, using the supplied number_of_machines as the number of reduce tasks, so your corpus is split into number_of_machines parts and formatted into the protobuffer files required by the second job. The second job is map-only: it takes the formatted input, runs on number_of_machines separate machines, starts the DM_Server on each of them, and then runs Y!LDA on each machine, with each task working on one chunk of the corpus.

      8. For testing, use the test flag and provide the directory storing the training output.

    2. Output generated

      1. Creates <number_of_machines> folders in <output-dir>, one for each client.

      2. Each of these directories holds the same output as in the single machine case, but from a different client.

      3. <output-dir>/<client-id>/learntopics.WARNING contains the output written to stderr by client <client-id>

      4. <output-dir>/<client-id>/lda.docToTop.txt contains the topic proportions assigned to the documents in the portion of the corpus allotted to client <client-id>

      5. <output-dir>/<client-id>/lda.topToWor.txt contains the salient words learned for each topic. This is almost the same across clients, so you can pick any one of them as the salient words per topic for the full corpus.

      6. <output-dir>/<client-id>/lda.ttc.dump contains the actual model. Like the salient-words file, it is almost the same across clients, and any one of them can be used as the model for the full corpus.

      7. <output-dir>/global contains the dump of the global dictionary and the partitioned global topic counts table. These are generated in the training phase and are critical for the test option to work.

    3. Viewing progress

      1. The stderr output of the code is redirected into the Hadoop logs, so you can check the task logs from the tracking URL displayed in the output of runLDA.sh to see what is happening.

    4. Failure Recovery

      We provide a check-pointing mechanism to recover from failures. The current scheme works in local mode and, for distributed mode, only with Hadoop, because the distributed check-pointing uses HDFS to store the check-points. The process is as follows:

      1. The formatter task is run on the inputs and the formatted input is stored in a temporary location.
      2. The learntopics task is run with the temporary location as its input and the specified output directory as its output. Care is taken to start the same number of mappers as number_of_machines for the learntopics tasks; the job's input is a dummy directory structure with as many dummy directories as the number_of_machines parameter supplied by the user.
      3. Each learntopics task copies its portion of the formatted input to local disk with a dfs copy-to-local of the folder corresponding to its mapred_task_partition (sketched after this list).
      4. Each task then runs learntopics with the temporary directory containing the formatted input as its check-point directory, so all the information needed to restart learntopics from the previous check-pointed iteration is available locally, and any progress made is written back to the temporary input directory.
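
      A minimal sketch of what each learntopics map task effectively does; the variable names and paths are illustrative, and the actual logic is handled by the supplied scripts:

        # fetch this task's chunk of the formatted corpus from the temporary HDFS location
        hadoop fs -copyToLocal "$temp_formatted_dir/$mapred_task_partition" .
        # run learntopics, pointing the check-point directory at the same temporary location,
        # so an existing check-point is picked up and new progress is written back to it
        learntopics --topics=$topics --iter=$iters --servers=$servers --chkptdir="$temp_formatted_dir/$mapred_task_partition" --chkptinterval=$chkpt_interval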

      The scripts use this mechanism to detect failures and re-run the task from the previous checkpoint. Since learntopics checks whether check-point metadata is available in its working directory and, if so, resumes from there, a separate restart option is unnecessary.

      As a by-product, you get incremental runs: run, say, 100 iterations, check the output, and run the next 100 iterations if needed. The scripts detect this condition and ask whether you want to continue from where you left off or restart from the beginning.

      The scripts are designed so that all of this happens transparently to the user. This information is intended for developers and for cases where the recovery mechanism could not handle the failure within the specified number of attempts. Check the stderr logs to find the reason for the failure; most of the time it is incorrect usage, which results in unrecoverable aborts. If you think it is because of a flaky cluster, try increasing the number of attempts. If nothing works and you think there is a bug in the code, please let us know.

  2. Using SSH - Assume you have 'm' machines

    1. If you have a homogeneous setup, install Y!LDA on one machine, run make jar, and copy LDALibs.jar to all the other machines in the setup. Otherwise, install Y!LDA on every machine.
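
      For example, a hedged sketch that copies the jar to machines named server_1 through server_(m-1); the hostnames, destination path and shell variable m are illustrative:

        for i in $(seq 1 $((m-1))); do scp LDALibs.jar server_$i:/tmp/; done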

    2. Split the corpus into 'm' parts and distribute them to the 'm' machines
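
      A hedged sketch, assuming the corpus is one document per line in corpus.txt and the machines are reachable as server_0 through server_(m-1):

        lines=$(( ($(wc -l < corpus.txt) + m - 1) / m ))     # lines per chunk, rounded up
        split -d -l $lines corpus.txt corpus.part.           # GNU split; produces corpus.part.00, corpus.part.01, ...
        for i in $(seq 0 $((m-1))); do
          ssh server_$i "mkdir -p /tmp/corpus"
          scp corpus.part.$(printf "%02d" $i) server_$i:/tmp/corpus/
        done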

    3. Run formatter on each split of the corpus on every machine.

    4. Run the Distributed_Map Server on each machine as a background process using nohup:

      nohup ./DM_Server <model> <server_id> <num_clients> <host:port> --Ice.ThreadPool.Server.SizeMax=9 &

      1. model: an integer that represents the model. Set it to 1 for Unigram_Model

      2. server_id: a number that denotes the index of this server in the list of servers provided to 'learntopics'. For example, if server1 has host:port 10.1.1.1:10000 and is assigned id 0, and server2 has host:port 10.1.1.2:10000 and is assigned id 1, then the list of servers provided to 'learntopics' must be 10.1.1.1:10000,10.1.1.2:10000 and not the other way around.

      3. num_clients: a number that denotes the number of clients that will access the Distributed Map. This is usually equal to 'm' and is used to implement a barrier.

      4. host:port: the IP address and port on which the server must listen
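
      For example, the first of two servers (addresses, ports and the log file name are illustrative):

        nohup ./DM_Server 1 0 2 10.1.1.1:10000 --Ice.ThreadPool.Server.SizeMax=9 > dm_server.log 2>&1 &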

    5. Run Y!LDA on the corpus:

      1. On every machine run

        learntopics --topics=<topics> --iter=<iter> --servers=<list-of-servers> --chkptdir="/tmp" --chkptinterval=10000

        1. <list-of-servers>: the comma-separated list of ip:port addresses of the servers in the setup. The order of the ip:port entries must match the server_id parameter used when starting each server

        2. chkptdir & chkptinterval: these are currently used only with the Hadoop setup. Set chkptdir to a dummy value, and set chkptinterval to a very large value or some number greater than the number of iterations so that the check-pointing code does not execute
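
        For example, for the two-server setup above, with 100 topics and 100 iterations (all values are illustrative):

          learntopics --topics=100 --iter=100 --servers=10.1.1.1:10000,10.1.1.2:10000 --chkptdir="/tmp" --chkptinterval=10000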

      2. Create the global dictionary - run the following on the server with id 0, assuming learntopics was run in the folder /tmp/corpus (a consolidated sketch follows these steps):

        1. mkdir -p /tmp/corpus/global_dict; cd /tmp/corpus/global_dict;

        2. scp server_i:/tmp/corpus/lda.dict.dump lda.dict.dump.i for every server, where 'i' is that server's server_id.

        3. Merge Dictionaries
          Merge_Dictionaries --dictionaries=m --dumpprefix=lda.dict.dump

        4. mkdir -p ../global; mkdir -p ../global/topic_counts; cp lda.dict.dump ../global/;
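
        A consolidated sketch of steps 1-4 above, assuming the machines are reachable as server_0 through server_(m-1) and the shell variable m holds the number of machines:

          mkdir -p /tmp/corpus/global_dict; cd /tmp/corpus/global_dict
          for i in $(seq 0 $((m-1))); do scp server_$i:/tmp/corpus/lda.dict.dump lda.dict.dump.$i; done
          Merge_Dictionaries --dictionaries=$m --dumpprefix=lda.dict.dump
          mkdir -p ../global/topic_counts; cp lda.dict.dump ../global/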

      3. Create a sharded Global Word-Topic Counts dump - run the following on every machine in the setup (a concrete example follows these steps):

        1. mkdir -p /tmp/corpus/global_top_cnts; cd /tmp/corpus/global_top_cnts;

        2. scp server_0:/tmp/corpus/global/lda.dict.dump lda.dict.dump.global

        3. Merge_Topic_Counts --topics=<topics> --clientid=<server-id> --servers=<list-of-servers> --globaldictionary="lda.dict.dump.global"

        4. scp lda.ttc.dump server_0:/tmp/corpus/global/topic_counts/lda.ttc.dump.<server_id> where <server_id> is this machine's server id
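
        For example, on the machine with server_id 1 in the two-server setup above (topic count and addresses are illustrative):

          mkdir -p /tmp/corpus/global_top_cnts; cd /tmp/corpus/global_top_cnts
          scp server_0:/tmp/corpus/global/lda.dict.dump lda.dict.dump.global
          Merge_Topic_Counts --topics=100 --clientid=1 --servers=10.1.1.1:10000,10.1.1.2:10000 --globaldictionary="lda.dict.dump.global"
          scp lda.ttc.dump server_0:/tmp/corpus/global/topic_counts/lda.ttc.dump.1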

      4. Copy the parameters dump file to global dump - Run on server_0

        1. cd /tmp/corpus; cp lda.par.dump global/topic_counts/lda.par.dump

      5. This completes training and the model is available on server_0:/tmp/corpus/global

    6. Running Y!LDA in test mode: run on server_0, assuming the test corpus is in /tmp/test_corpus

      1. cd /tmp/test_corpus;

      2. cp -r ../corpus/global .

      3. learntopics -teststream --dumpprefix=global/topic_counts/lda --numdumps=m --dictionary=global/lda.dict.dump --maxmemory=2048 --topics=<topics>

      4. cat all your documents, in the same format that 'formatter' expects, to the above command's stdin
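
        For example, assuming the test documents are in /tmp/test_corpus/test_docs.txt (the file name is illustrative), two machines were used for training and 100 topics were learnt:

          cat test_docs.txt | learntopics -teststream --dumpprefix=global/topic_counts/lda --numdumps=2 --dictionary=global/lda.dict.dump --maxmemory=2048 --topics=100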
