/*****************************************************************************
The contents of this file are subject to the Mozilla Public License
Version 1.1 (the "License"); you may not use this file except in
compliance with the License. You may obtain a copy of the License at
http://www.mozilla.org/MPL/

Software distributed under the License is distributed on an "AS IS"
basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
License for the specific language governing rights and limitations
under the License.

The Original Code is Copyright (C) by Yahoo! Research.

The Initial Developer of the Original Code is Shravan Narayanamurthy.

All Rights Reserved.
******************************************************************************/
/** \page multi_machine_usage Multi-Machine Setup
 * <P>Please take a look at \ref single_machine_usage "Single Machine Usage" for details on running
 * the individual commands. Here we describe how to run those commands on multiple machines,
 * so the details of the individual commands are not repeated.</P>
 * <OL>
 * <LI><P STYLE="margin-bottom: 0cm"><B>Using Hadoop</B></P>
 * <OL>
 * <LI><P STYLE="margin-bottom: 0cm">Run Y!LDA on the corpus:</P>
 * <OL>
 * <LI><P STYLE="margin-bottom: 0cm">Assuming you have a homogeneous
 * setup, install Y!LDA on one machine.</P>
 * <LI><P STYLE="margin-bottom: 0cm">Run <CODE>make jar</CODE> to create the
 * LDALibs.jar file with all the required libraries and binaries.</P>
 * <LI><P STYLE="margin-bottom: 0cm">Copy LDALibs.jar to HDFS.</P>
 * <LI><P STYLE="margin-bottom: 0cm">Copy Formatter.sh, LDA.sh, functions.sh and
 * runLDA.sh to a gateway. A gateway is any machine with access to the grid;
 * think of it as a machine from which you run your hadoop commands.</P>
 * <LI><P STYLE="margin-bottom: 0cm">Figure out the maximum memory
 * allowed per map task for your cluster and pass that value to the
 * script via the maxmem parameter. You can find it by looking at
 * any job configuration (job.xml) and searching for the value of the
 * "mapred.cluster.max.map.memory.mb" property.</P>
 * <LI><P STYLE="margin-bottom: 0cm">Run <CODE>runLDA.sh
 * 1 "flags" [train|test] "queue"
 * "organized-corpus" "output-dir" "max-mem"
 * "number_of_topics" "number_of_iters"
 * "full_hdfs_path_of_LDALibs.jar" "number_of_machines" ["training-output"]
 * </CODE><BR/><BR/>
 * Training example: <CODE>runLDA.sh
 * 1 "" train default
 * "/user/me/organized-corpus" "/user/me/lda-output" 6144 100 100
 * "LDALibs.jar" 10
 * </CODE><BR/><BR/>
 * Testing example: <CODE>runLDA.sh
 * 1 "" test default
 * "/user/me/organized-corpus" "/user/me/lda-output" 6144 100 100
 * "LDALibs.jar" 10 "/user/me/lda-output"
 * </CODE>
 * </P>
 * <LI><P STYLE="margin-bottom: 0cm">This starts two Map-Reduce streaming jobs.
 * The first job does the formatting and the second job
 * starts a map-only script on each machine. This script starts the
 * DM_Server on all the machines and then runs Y!LDA on each machine,
 * with one chunk of the corpus as its input. The first job runs the formatter
 * in the reducer, using the supplied number_of_machines as the number of reduce tasks.
 * So your corpus is split into number_of_machines parts and formatted into protobuffer-format
 * files as needed by the second job. The second job uses this formatted input
 * and also runs on number_of_machines separate machines, with each task
 * working on one chunk. A sketch of how to stage the jar and inspect the
 * resulting output is given below.</P>
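 * <P>The following is a minimal sketch of staging the jar on HDFS (step 3 above) and of
 * inspecting the per-client output once both jobs finish; it only uses the standard
 * <CODE>hadoop fs</CODE> commands. The paths follow the training example above and the
 * numeric client directory name is an assumption, so adjust both to your own setup. The
 * files themselves are described under "Output generated" below.</P>
 * <P><CODE># stage the jar on HDFS (path is illustrative)<BR/>
 * hadoop fs -put LDALibs.jar /user/me/LDALibs.jar<BR/>
 * # after runLDA.sh finishes, list the per-client output directories<BR/>
 * hadoop fs -ls /user/me/lda-output<BR/>
 * # peek at the salient words learnt by one client (client-id 0 assumed here)<BR/>
 * hadoop fs -cat /user/me/lda-output/0/lda.topToWor.txt | head</CODE></P>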
 * <LI><P STYLE="margin-bottom: 0cm">For
 * testing, use the test flag and provide the directory storing the
 * training output.</P>
 * </OL>
 * <LI><P STYLE="margin-bottom: 0cm">Output
 * generated</P>
 * <OL>
 * <LI><P STYLE="margin-bottom: 0cm">Creates <number_of_machines> folders
 * in <output-dir>, one for each client.
 * </P>
 * <LI><P STYLE="margin-bottom: 0cm">Each of these directories holds
 * the same output as the single-machine case, but from a different
 * client.
 * </P>
 * <LI><P STYLE="margin-bottom: 0cm"><output-dir>/<client-id>/learntopics.WARNING
 * contains the output written to stderr by client <client-id>.</P>
 * <LI><P STYLE="margin-bottom: 0cm"><output-dir>/<client-id>/lda.docToTop.txt
 * contains the topic proportions assigned to the documents in the
 * portion of the corpus allotted to client <client-id>.</P>
 * <LI><P STYLE="margin-bottom: 0cm"><output-dir>/<client-id>/lda.topToWor.txt
 * contains the salient words learned for each topic. This remains
 * almost the same across clients, so you can pick any one of these as the
 * salient words per topic for the full corpus.</P>
 * <LI><P STYLE="margin-bottom: 0cm"><output-dir>/<client-id>/lda.ttc.dump
 * contains the actual model. Like the salient words, this is
 * almost the same across clients, and any one of them can be used as the model
 * for the full corpus.</P>
 * <LI><P STYLE="margin-bottom: 0cm"><output-dir>/global
 * contains the dump of the global dictionary and the partitioned
 * global topic-counts table. These are generated in the training
 * phase and are critical for the test option to work.</P>
 * </OL>
 * <LI><P STYLE="margin-bottom: 0cm">Viewing
 * progress</P>
 * <OL>
 * <LI><P STYLE="margin-bottom: 0cm">The
 * stderr output of the code is redirected into the hadoop logs, so
 * you can check the task logs from the tracking URL displayed in the
 * output of runLDA.sh to see what is happening.</P>
 * </OL>
 * <LI><P STYLE="margin-bottom: 0cm">Failure Recovery</P>
 * <P>We provide a check-pointing mechanism to handle recovery from failures.
 * The current scheme works in local mode and in distributed mode using Hadoop,
 * because the distributed check-pointing uses HDFS to store
 * the check-points. The process is as follows:</P>
 * <OL>
 * <LI>The formatter task is run on the inputs and the formatted input is stored in a temporary location.
 * <LI>The learntopics task is run using the temporary location as its input and the specified output
 * directory as its output. Care is taken to start the same number of mappers as number_of_machines for
 * the learntopics tasks: the input is a dummy directory structure with as many dummy directories as the
 * number_of_machines parameter supplied by the user.
 * <LI>Each learntopics task copies its portion of the formatted input to local disk (a DFS
 * copy-to-local of the folder corresponding to its mapred_task_partition).
 * <LI>It then runs learntopics with the temporary directory containing the formatted input
 * as the check-point directory. So all information needed to start learntopics from
 * the previously check-pointed iteration is available locally, and any progress made
 * is written back to the temporary input directory. A sketch of what each task
 * effectively runs is shown after this list.
 * </OL>
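 * <P>As an illustration only, each map task effectively performs something like the
 * following; the shell variables ($formatted_input, $topics, $iters, $servers, $interval)
 * and the use of the mapred_task_partition environment variable are placeholders used for
 * this sketch, not the exact contents of the scripts:</P>
 * <P><CODE># copy this task's chunk of the formatted input from HDFS to the local working directory<BR/>
 * hadoop fs -copyToLocal "$formatted_input/$mapred_task_partition" .<BR/>
 * # run Y!LDA; check-point metadata found in the working directory is picked up automatically<BR/>
 * learntopics --topics=$topics --iter=$iters --servers=$servers --chkptdir="$formatted_input" --chkptinterval=$interval</CODE></P>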
 * <P>This mechanism is used by the scripts to detect failures
 * and to attempt to re-run the task from the previous check-point. Since learntopics is
 * designed to check whether check-point metadata is available in the working directory and
 * to start off from there if it is, a separate restart option is not needed.</P>
 * <P>As a by-product one gets the facility of doing incremental runs, that is,
 * running say 100 iterations, checking the output and running the next 100 iterations if needed.
 * The scripts detect this condition and ask whether you want to start off from where you left off
 * or restart from the beginning.</P>
 * <P>The scripts are designed so that all of this
 * happens transparently to the user. This information is for developers and for cases
 * where the recovery mechanism could not handle the failure in the specified number of
 * attempts. Check the stderr logs to see what the reason for the failure is. Most of the time it is
 * wrong usage, which results in unrecoverable aborts. If you think it's because of a flaky
 * cluster, try increasing the number of attempts. If nothing works and you think there
 * is a bug in the code, please let us know.</P>
 * </OL>
 * </OL>
 * <OL START=2>
 * <LI><P STYLE="margin-bottom: 0cm"><B>Using SSH - Assume you have
 * 'm' machines</B></P>
 * <OL>
 * <LI><P STYLE="margin-bottom: 0cm">If you have a homogeneous set-up,
 * install Y!LDA on one machine, run <CODE>make jar</CODE> and copy LDALibs.jar to
 * all the other machines in the set-up. Otherwise install Y!LDA on all
 * machines.</P>
 * <LI><P STYLE="margin-bottom: 0cm">Split the corpus into 'm' parts
 * and distribute them to the 'm' machines.</P>
 * <LI><P STYLE="margin-bottom: 0cm">Run the formatter on each split of
 * the corpus on every machine.</P>
 * <LI><P STYLE="margin-bottom: 0cm">Run the Distributed_Map Server on
 * each machine as a background process using nohup (a concrete example
 * follows the parameter descriptions below):</P>
 * <P STYLE="margin-bottom: 0cm"><CODE>nohup
 * ./DM_Server <model> <server_id> <num_clients>
 * <host:port> --Ice.ThreadPool.Server.SizeMax=9 &</CODE></P>
 * <OL>
 * <LI><P STYLE="margin-bottom: 0cm">model: an integer that
 * represents the model. Set it to 1 for Unigram_Model.</P>
 * <LI><P STYLE="margin-bottom: 0cm">server_id: a number that denotes
 * the index of this server in the list of servers provided
 * to 'learntopics'. If server1 has host:port 10.1.1.1:10000 and is
 * assigned id 0, and server2 has host:port 10.1.1.2:10000 and is assigned
 * id 1, the list of servers provided to 'learntopics' has to
 * be 10.1.1.1:10000,10.1.1.2:10000 and not the other way around.</P>
 * <LI><P STYLE="margin-bottom: 0cm">num_clients: a number that
 * denotes the number of clients that will access the Distributed
 * Map. This is usually equal to 'm'. It is used to provide a
 * barrier implementation.</P>
 * <LI><P STYLE="margin-bottom: 0cm">host:port: the IP address and
 * port on which the server must listen.</P>
 * </OL>
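 * <P>For example, with m = 2 machines the two servers could be started as follows;
 * the model value 1 (Unigram_Model), the IP addresses and the port are the
 * illustrative values used above:</P>
 * <P><CODE># on the machine chosen as server 0<BR/>
 * nohup ./DM_Server 1 0 2 10.1.1.1:10000 --Ice.ThreadPool.Server.SizeMax=9 &<BR/>
 * # on the machine chosen as server 1<BR/>
 * nohup ./DM_Server 1 1 2 10.1.1.2:10000 --Ice.ThreadPool.Server.SizeMax=9 &</CODE></P>
 * <P>The corresponding list of servers passed to 'learntopics' would then be
 * 10.1.1.1:10000,10.1.1.2:10000, in that order.</P>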
 * <LI><P STYLE="margin-bottom: 0cm">Run Y!LDA on the corpus:</P>
 * <OL>
 * <LI><P STYLE="margin-bottom: 0cm">On every machine run</P>
 * <P><CODE>learntopics
 * --topics=<#topics> --iter=<#iter>
 * --servers=<list-of-servers> --chkptdir="/tmp"
 * --chkptinterval=10000</CODE></P>
 * <OL>
 * <LI><P STYLE="margin-bottom: 0cm"><list-of-servers>: the
 * comma-separated list of ip:port numbers of the servers involved
 * in the set-up. The order of the ip:port entries must match
 * the server_id parameter used in starting the servers.</P>
 * <LI><P STYLE="margin-bottom: 0cm">chkptdir and chkptinterval:
 * these are currently used only with the Hadoop set-up. Set
 * chkptdir to something dummy and, to ensure that the check-pointing code
 * does not execute, set chkptinterval to a very large value or to
 * some number greater than the number of iterations.</P>
 * </OL>
 * <LI><P STYLE="margin-bottom: 0cm">Create the Global Dictionary - run
 * the following on the server with id 0, assuming learntopics was run
 * in the folder /tmp/corpus:</P>
 * <OL>
 * <LI><P STYLE="margin-bottom: 0cm"> <CODE>mkdir
 * -p /tmp/corpus/global_dict; cd /tmp/corpus/global_dict;</CODE></P>
 * <LI><P STYLE="margin-bottom: 0cm"> <CODE>scp
 * server_i:/tmp/corpus/lda.dict.dump lda.dict.dump.i </CODE>
 * where the variable 'i' is the same as the server_id.</P>
 * <LI><P STYLE="margin-bottom: 0cm">Merge the dictionaries:<BR/>
 * <CODE>Merge_Dictionaries
 * --dictionaries=m --dumpprefix=lda.dict.dump</CODE></P>
 * <LI><P STYLE="margin-bottom: 0cm"> <CODE>mkdir
 * -p ../global; mkdir -p ../global/topic_counts; cp lda.dict.dump
 * ../global/;</CODE>
 * </P>
 * </OL>
 * <LI><P STYLE="margin-bottom: 0cm">Create a sharded Global
 * Word-Topic Counts dump - run on every machine in the set-up:</P>
 * <OL>
 * <LI><P STYLE="margin-bottom: 0cm"><CODE>mkdir
 * -p /tmp/corpus/global_top_cnts; cd /tmp/corpus/global_top_cnts;</CODE></P>
 * <LI><P STYLE="margin-bottom: 0cm"><CODE>scp
 * server_0:/tmp/corpus/global/lda.dict.dump lda.dict.dump.global</CODE></P>
 * <LI><P STYLE="margin-bottom: 0cm"><CODE>Merge_Topic_Counts
 * --topics=<#topics> --clientid=<server-id>
 * --servers=<list-of-servers>
 * --globaldictionary="lda.dict.dump.global"</CODE></P>
 * <LI><P STYLE="margin-bottom: 0cm"><CODE>scp
 * lda.ttc.dump
 * server_0:/tmp/corpus/global/topic_counts/lda.ttc.dump.$server-id</CODE></P>
 * </OL>
 * <LI><P STYLE="margin-bottom: 0cm">Copy the parameters dump file to
 * the global dump - run on server_0:</P>
 * <OL>
 * <LI><P STYLE="margin-bottom: 0cm"><CODE>cd
 * /tmp/corpus; cp lda.par.dump global/topic_counts/lda.par.dump</CODE></P>
 * </OL>
 * <LI><P STYLE="margin-bottom: 0cm">This completes training; the
 * model is available in server_0:/tmp/corpus/global. A consolidated sketch of
 * the dictionary-merging and topic-count-sharding steps is given below.</P>
 * </OL>
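 * <P>The following sketch strings the dictionary-merging and topic-count-sharding steps
 * together for m = 2 machines; the hostnames server_0/server_1, the /tmp/corpus paths,
 * the server list and the topic count of 100 are just the illustrative values used above:</P>
 * <P><CODE># on server_0: build the global dictionary<BR/>
 * mkdir -p /tmp/corpus/global_dict; cd /tmp/corpus/global_dict<BR/>
 * scp server_0:/tmp/corpus/lda.dict.dump lda.dict.dump.0<BR/>
 * scp server_1:/tmp/corpus/lda.dict.dump lda.dict.dump.1<BR/>
 * Merge_Dictionaries --dictionaries=2 --dumpprefix=lda.dict.dump<BR/>
 * mkdir -p ../global/topic_counts; cp lda.dict.dump ../global/<BR/>
 * <BR/>
 * # on every machine: build that machine's shard of the global word-topic counts<BR/>
 * # (shown for server_0; on server_1 use --clientid=1 and copy to lda.ttc.dump.1)<BR/>
 * mkdir -p /tmp/corpus/global_top_cnts; cd /tmp/corpus/global_top_cnts<BR/>
 * scp server_0:/tmp/corpus/global/lda.dict.dump lda.dict.dump.global<BR/>
 * Merge_Topic_Counts --topics=100 --clientid=0 --servers=10.1.1.1:10000,10.1.1.2:10000 --globaldictionary="lda.dict.dump.global"<BR/>
 * scp lda.ttc.dump server_0:/tmp/corpus/global/topic_counts/lda.ttc.dump.0</CODE></P>
 * <P>Once the shards from all machines have been copied, server_0:/tmp/corpus/global
 * holds the complete model used by the test mode described next.</P>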
 * <LI><P STYLE="margin-bottom: 0cm">Running Y!LDA in test mode: run
 * on server_0, assuming the test corpus is in /tmp/test_corpus</P>
 * <OL>
 * <LI><P STYLE="margin-bottom: 0cm"><CODE>cd
 * /tmp/test_corpus;</CODE>
 * </P>
 * <LI><P STYLE="margin-bottom: 0cm"><CODE>cp
 * -r ../corpus/global .</CODE></P>
 * <LI><P STYLE="margin-bottom: 0cm"><CODE>learntopics
 * -teststream --dumpprefix=global/topic_counts/lda --numdumps=m
 * --dictionary=global/lda.dict.dump --maxmemory=2048
 * --topics=<#topics></CODE></P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">cat all your test
 * documents, in the same format that 'formatter' expects, to the
 * above command's stdin.</P>
 * </OL>
 * </OL>
 * </OL>
 */