00001 /***************************************************************************** 00002 The contents of this file are subject to the Mozilla Public License 00003 Version 1.1 (the "License"); you may not use this file except in 00004 compliance with the License. You may obtain a copy of the License at 00005 http://www.mozilla.org/MPL/ 00006 00007 Software distributed under the License is distributed on an "AS IS" 00008 basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the 00009 License for the specific language governing rights and limitations 00010 under the License. 00011 00012 The Original Code is Copyright (C) by Yahoo! Research. 00013 00014 The Initial Developer of the Original Code is Shravan Narayanamurthy. 00015 00016 All Rights Reserved. 00017 ******************************************************************************/ 00018 /** \page single_machine_usage Single Machine Setup 00019 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">First step is to set 00020 * LD_LIBRARY_PATH. This can be done by sourcing the script 00021 * setLibVars.sh</P> 00022 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">Assume Y!LDA is installed 00023 * in $LDA_HOME.</P> 00024 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">cd $LDA_HOME</P> 00025 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">source ./setLibVars.sh</P> 00026 * <H3 CLASS="western"> Basic Usage: 00027 * </H3> 00028 * <OL> 00029 * <LI>\ref learning_model</LI> 00030 * <LI>\ref using_model</LI> 00031 * <LI>\ref generated_output</LI> 00032 * <LI>\ref customizations</LI> 00033 * </OL> 00034 * \section learning_model Learning a Model 00035 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"> The process of learning 00036 * a model has four steps:</P> 00037 * <OL> 00038 * <LI>\ref tokenize_format </LI> 00039 * <LI>\ref learntopics </LI> 00040 * <LI>\ref word_mix </LI> 00041 * <LI>\ref topic_mix </LI> 00042 * </OL> 00043 * \subsection tokenize_format Tokenization and Formatting 00044 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"> The tokenizer converts 00045 * text into tokens that undergone some basic normalizations. 00046 * Most likely you want to write your own. A Java 00047 * version is provided for convenience more than anything else. This is 00048 * a simple java class that tokenizes the file by splitting the stream 00049 * around non-character[^a-zA-Z] boundaries into word tokens. So, it 00050 * ignores numbers & punctuation currently. The raw text corpus is 00051 * assumed to have the following format. Its a single file containing 00052 * documents on every line. Each document has the format:</P> 00053 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"> doc-id<space>aux-id<space>word1[<space>word2]*</P> 00054 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">The tokenizer just writes 00055 * the tokens appended to the <STRONG>doc-id</STRONG> & <STRONG>aux-id</STRONG> 00056 * to stdout.</P> 00057 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"> The 'formatter' converts 00058 * the text into GoogleProtocolBuffer messages. This is done to ensure 00059 * that we have a very small file footprint on disk. It takes the 00060 * tokenized corpus supplied as input and formats it into the internal 00061 * format needed for learntopics. The primary output is the documents in 00062 * the internal format. The program by default removes stop words 00063 * supplied statically in the src/commons/constants.h file. It then 00064 * creates the dictionary from the remaining words.</P> 00065 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">The following is an 00066 * illustration of this step. It uses the taining set that is available 00067 * with the package (ut_out/ydir_1k.txt)</P> 00068 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><BR> 00069 * </P> 00070 * <CODE> 00071 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">$ 00072 * cd $LDA_HOME/ut_out</P> 00073 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">$ 00074 * cp ../Tokenizer.java .</P> 00075 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">$ 00076 * javac Tokenizer.java</P> 00077 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">$ 00078 * ls -al ydir_1k.txt</P> 00079 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">-rw-r--r-- 00080 * 1 shravanm shravanm 1818848 2011-04-17 17:34 ydir_1k.txt 00081 * </P> 00082 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">$ 00083 * cat ydir_1k.txt | java Tokenizer | ../formatter</P> 00084 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00085 * 17:35:44.967370 19605 Controller.cpp:83] 00086 * ---------------------------------------------------------------------- 00087 * 00088 * </P> 00089 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00090 * 17:35:44.967788 19605 Controller.cpp:84] Log files are being stored 00091 * at /home/shravanm/workspace/LDA_Refactored/ut_out/unigram/formatter.* 00092 * 00093 * </P> 00094 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00095 * 17:35:44.967808 19605 Controller.cpp:85] 00096 * ---------------------------------------------------------------------- 00097 * 00098 * </P> 00099 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00100 * 17:35:44.968111 19605 Controller.cpp:91] Assuming that corpus is 00101 * being piped through stdin. Reading from stdin... 00102 * </P> 00103 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00104 * 17:35:45.652430 19605 Controller.cpp:107] Formatting Complete. 00105 * Formatted document stored in lda.wor. You have used lda as the output 00106 * prefix. Make sure you use the same as the input prefix for 00107 * learntopics 00108 * </P> 00109 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00110 * 17:35:45.652497 19605 Controller.cpp:113] Dumping dictionary for 00111 * later use by learntopics into lda.dict.dump 00112 * </P> 00113 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00114 * 17:35:45.674201 19605 Controller.cpp:117] Finished dictionary dump 00115 * </P> 00116 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00117 * 17:35:45.674236 19605 Controller.cpp:118] Formatting done 00118 * </P> 00119 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00120 * 17:35:45.674253 19605 Controller.cpp:122] Total number of unique 00121 * words found: 17208 00122 * </P> 00123 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00124 * 17:35:45.674270 19605 Controller.cpp:123] Total num of docs found: 00125 * 900 00126 * </P> 00127 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00128 * 17:35:45.674288 19605 Controller.cpp:124] Total num of tokens found: 00129 * 182525</P> 00130 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">$ 00131 * ls -1 lda*</P> 00132 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00133 * 1 shravanm shravanm 204655 2011-04-17 17:35 lda.dict.dump 00134 * </P> 00135 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00136 * 1 shravanm shravanm 397983 2011-04-17 17:35 lda.wor 00137 * </P> 00138 * </CODE> 00139 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><BR> 00140 * </P> 00141 * \subsection learntopics Learning the topic mixtures 00142 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"> 'learntopics' learns a topic model from a corpus which 00143 * has been formatted using 'formatter'. The input to this are the files 00144 * generated by 'formatter'. So you need to run this in the same 00145 * directory that contains these files. The output from this step 00146 * contains two types of files: binary files that are used by other 00147 * modes of operation like batch and stream mode testing and text files 00148 * which are human readable and give a sense of what has happened. The 00149 * generated output will be explained in detail in the next sections. 00150 * But the primary outputs are the topic assignments to the documents in 00151 * the corpus and the word mixtures that represent each topic that has 00152 * been learnt.</P> 00153 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">Continuing with the 00154 * illustration of step 1 we do the following to learn the model: 00155 * Assuming that PWD=$LDA_HOME/ut_out</P> 00156 * <CODE> 00157 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">$ 00158 * ../learntopics --topics=100 --iter=500</P> 00159 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">Log 00160 * file created at: 2011/04/17 17:44:04 00161 * </P> 00162 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">Running 00163 * on machine: offerenjoy 00164 * </P> 00165 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">Log 00166 * line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg 00167 * </P> 00168 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00169 * 17:44:04.082424 19628 Controller.cpp:68] 00170 * ---------------------------------------------------------------------- 00171 * 00172 * </P> 00173 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00174 * 17:44:04.082855 19628 Controller.cpp:69] Log files are being stored 00175 * at 00176 * /home/shravanm/workspace/LDA_Refactored/ut_out/unigram/learntopics 00177 * </P> 00178 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00179 * 17:44:04.082871 19628 Controller.cpp:70] 00180 * ---------------------------------------------------------------------- 00181 * 00182 * </P> 00183 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00184 * 17:44:04.083418 19628 Controller.cpp:81] You have chosen single 00185 * machine training mode 00186 * </P> 00187 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00188 * 17:44:04.083976 19628 Unigram_Model_Training_Builder.cpp:43] 00189 * Initializing Dictionary from lda.dict.dump 00190 * </P> 00191 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00192 * 17:44:04.137420 19628 Unigram_Model_Training_Builder.cpp:45] 00193 * Dictionary Initialized 00194 * </P> 00195 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00196 * 17:44:04.174685 19628 Unigram_Model_Trainer.cpp:24] Initializing 00197 * Word-Topic counts table from docs lda.wor, lda.top using 17208 words 00198 * & 100 topics. 00199 * </P> 00200 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00201 * 17:44:04.250243 19628 Unigram_Model_Trainer.cpp:26] Initialized 00202 * Word-Topic counts table 00203 * </P> 00204 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00205 * 17:44:04.250313 19628 Unigram_Model_Trainer.cpp:29] Initializing 00206 * Alpha vector from Alpha_bar = 50 00207 * </P> 00208 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00209 * 17:44:04.250356 19628 Unigram_Model_Trainer.cpp:31] Alpha vector 00210 * initialized 00211 * </P> 00212 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00213 * 17:44:04.250375 19628 Unigram_Model_Trainer.cpp:34] Initializing Beta 00214 * Parameter from specified Beta = 0.01 00215 * </P> 00216 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00217 * 17:44:04.250395 19628 Unigram_Model_Trainer.cpp:37] Beta param 00218 * initialized 00219 * </P> 00220 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00221 * 17:44:04.251711 19628 Training_Execution_Strategy.cpp:35] Starting 00222 * Parallel training Pipeline 00223 * </P> 00224 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00225 * 17:44:04.590452 19628 Training_Execution_Strategy.cpp:53] Iteration 1 00226 * done. Took 0.00564301 mins 00227 * </P> 00228 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00229 * 17:44:04.700546 19628 Unigram_Model.cpp:92] Average num of topics 00230 * assigned per word = 4.17091 00231 * </P> 00232 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00233 * 17:44:04.700599 19628 Training_Execution_Strategy.cpp:58] >>>>>>>>>> 00234 * Log-Likelihood (model, doc, total): -1.47974e+06 , -815572 , 00235 * -2.29531e+06 00236 * </P> 00237 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00238 * 17:44:04.703927 19628 Unigram_Model_Trainer.cpp:478] Restarting IO 00239 * </P> 00240 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00241 * 17:44:04.992218 19628 Training_Execution_Strategy.cpp:53] Iteration 2 00242 * done. Took 0.00477359 mins 00243 * </P> 00244 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00245 * 17:44:04.995560 19628 Unigram_Model_Trainer.cpp:478] Restarting IO 00246 * </P> 00247 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00248 * 17:44:05.261360 19628 Training_Execution_Strategy.cpp:53] Iteration 3 00249 * done. Took 0.00439862 mins 00250 * </P> 00251 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00252 * 17:44:05.264669 19628 Unigram_Model_Trainer.cpp:478] Restarting IO 00253 * </P> 00254 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00255 * 17:44:05.516330 19628 Training_Execution_Strategy.cpp:53] Iteration 4 00256 * done. Took 0.00416395 mins 00257 * </P> 00258 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">...</P> 00259 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">...</P> 00260 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00261 * 17:45:59.021239 19628 Unigram_Model_Trainer.cpp:478] Restarting IO 00262 * </P> 00263 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00264 * 17:45:59.242962 19628 Training_Execution_Strategy.cpp:53] Iteration 00265 * 498 done. Took 0.00366346 mins 00266 * </P> 00267 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00268 * 17:45:59.246363 19628 Unigram_Model_Trainer.cpp:478] Restarting IO 00269 * </P> 00270 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00271 * 17:45:59.469266 19628 Training_Execution_Strategy.cpp:53] Iteration 00272 * 499 done. Took 0.00368553 mins 00273 * </P> 00274 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00275 * 17:45:59.472687 19628 Unigram_Model_Trainer.cpp:478] Restarting IO 00276 * </P> 00277 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00278 * 17:45:59.513748 19628 Training_Execution_Strategy.cpp:53] Iteration 00279 * 500 done. Took 0.000653982 mins 00280 * </P> 00281 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00282 * 17:45:59.616693 19628 Unigram_Model.cpp:92] Average num of topics 00283 * assigned per word = 1.85268 00284 * </P> 00285 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00286 * 17:45:59.616768 19628 Training_Execution_Strategy.cpp:58] >>>>>>>>>> 00287 * Log-Likelihood (model, doc, total): -1.04167e+06 , -425400 , 00288 * -1.46707e+06 00289 * </P> 00290 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00291 * 17:45:59.620058 19628 Unigram_Model_Trainer.cpp:478] Restarting IO 00292 * </P> 00293 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00294 * 17:45:59.622022 19628 Training_Execution_Strategy.cpp:66] >>>>>>>>>>> 00295 * Check Pointing at iteration: 500 00296 * </P> 00297 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00298 * 17:45:59.625076 19628 Training_Execution_Strategy.cpp:72] Parallel 00299 * training Pipeline done 00300 * </P> 00301 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00302 * 17:45:59.625128 19628 Controller.cpp:128] Model has been learnt 00303 * </P> 00304 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00305 * 17:45:59.627476 19628 Unigram_Model.cpp:105] Saving model for test 00306 * pipeline in lda.ttc.dump and lda.par.dump 00307 * </P> 00308 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00309 * 17:45:59.695773 19628 Unigram_Model.cpp:108] Model saved 00310 * </P> 00311 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00312 * 17:45:59.695843 19628 Unigram_Model.cpp:117] Writing top words 00313 * identified per topic into lda.topToWor.txt 00314 * </P> 00315 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00316 * 17:45:59.701156 19628 Unigram_Model.cpp:139] Word statistics per 00317 * topic written 00318 * </P> 00319 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00320 * 17:45:59.701257 19628 Unigram_Model_Training_Builder.cpp:126] Saving 00321 * document to topic-proportions in lda.docToTop.txt 00322 * </P> 00323 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00324 * 17:45:59.701279 19628 Unigram_Model_Training_Builder.cpp:127] Saving 00325 * word to topic assignment in lda.worToTop.txt 00326 * </P> 00327 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00328 * 17:45:59.911658 19628 Unigram_Model_Training_Builder.cpp:193] 00329 * Document to topic-proportions saved in lda.docToTop.txt 00330 * </P> 00331 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00332 * 17:45:59.911726 19628 Unigram_Model_Training_Builder.cpp:194] Word to 00333 * topic assignment saved in lda.worToTop.txt 00334 * </P> 00335 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">$ 00336 * ls -al lda*</P> 00337 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00338 * 1 shravanm shravanm 4 2011-04-17 17:45 lda.chk 00339 * </P> 00340 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00341 * 1 shravanm shravanm 204655 2011-04-17 17:35 lda.dict.dump 00342 * </P> 00343 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00344 * 1 shravanm shravanm 168223 2011-04-17 17:45 lda.docToTop.txt 00345 * </P> 00346 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00347 * 1 shravanm shravanm 816 2011-04-17 17:45 lda.par.dump 00348 * </P> 00349 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00350 * 1 shravanm shravanm 230304 2011-04-17 17:45 lda.top 00351 * </P> 00352 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00353 * 1 shravanm shravanm 38415 2011-04-17 17:45 lda.topToWor.txt 00354 * </P> 00355 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00356 * 1 shravanm shravanm 267631 2011-04-17 17:45 lda.ttc.dump 00357 * </P> 00358 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00359 * 1 shravanm shravanm 397983 2011-04-17 17:35 lda.wor 00360 * </P> 00361 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00362 * 1 shravanm shravanm 2213071 2011-04-17 17:45 lda.worToTop.txt 00363 * </P> 00364 * </CODE> 00365 * \subsection word_mix Viewing the word mixtures for each topic 00366 * <CODE> 00367 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">$ 00368 * cat lda.topToWor.txt</P> 00369 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">Topic 00370 * 0: (center,0.171509) (diving,0.110265) (scuba,0.0857668) 00371 * (equipment,0.0686183) (mark,0.0637188) (olympic,0.0465703) 00372 * (rescue,0.0441205) (aquatic,0.039221) (ymca,0.039221) 00373 * (family,0.0367712) (ruiz,0.0343214) (dive,0.0318716) 00374 * (sports,0.0294219) (divers,0.0294219) (safety,0.0294219) 00375 * (minnesota,0.0294219) (orlando,0.0294219) (advanced,0.0294219) 00376 * (training,0.0269721) (international,0.0245223) 00377 * </P> 00378 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">Topic 00379 * 1: (beach,0.138171) (resort,0.0760789) (florida,0.0714219) 00380 * (world,0.062108) (center,0.0543465) (experience,0.0527942) 00381 * (vacation,0.0481372) (fl,0.0465849) (located,0.0450326) 00382 * (offers,0.0450326) (holiday,0.0403757) (south,0.0403757) 00383 * (rates,0.0403757) (beautiful,0.0388233) (location,0.0341664) 00384 * (sea,0.0341664) (activities,0.0341664) (resorts,0.0326141) 00385 * (accommodations,0.0326141) (perfect,0.0326141) 00386 * </P> 00387 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">...</P> 00388 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">Topic 00389 * 25: (surf,0.381466) (surfing,0.0727078) (surfboards,0.0573127) 00390 * (surfboard,0.0547468) (board,0.0530363) (beach,0.0367858) 00391 * (shop,0.0367858) (malibu,0.0350753) (longboard,0.0325094) 00392 * (boards,0.0325094) (gear,0.0299435) (bags,0.022246) (travel,0.022246) 00393 * (racks,0.0213907) (wetsuits,0.0205354) (clothing,0.0205354) 00394 * (california,0.0196801) (lessons,0.0188248) (ocean,0.016259) 00395 * (surfers,0.0154037) 00396 * </P> 00397 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">...</P> 00398 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">Topic 00399 * 35: (credit,0.101053) (card,0.0893533) (money,0.0680813) 00400 * (orders,0.0670177) (make,0.060636) (mail,0.0574452) (cards,0.0563816) 00401 * (valley,0.055318) (paypal,0.0521272) (accept,0.0510636) 00402 * (secure,0.0468092) (purchase,0.0468092) (time,0.0372368) 00403 * (apple,0.0351095) (checks,0.0351095) (email,0.0351095) 00404 * (address,0.0287279) (due,0.0276643) (special,0.0255371) 00405 * (people,0.0234099)</P> 00406 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">...</P> 00407 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">Topic 00408 * 43: (racquet,0.12544) (racquets,0.107418) (wilson,0.101651) 00409 * (head,0.094442) (tennis,0.0908377) (prince,0.0821871) 00410 * (shoes,0.0648861) (babolat,0.0367719) (pro,0.0367719) (bags,0.030284) 00411 * (string,0.0266796) (men,0.0259588) (tour,0.0237961) (mp,0.0223544) 00412 * (rackets,0.0223544) (ncode,0.0223544) (grips,0.0223544) 00413 * (women,0.0223544) (dunlop,0.0216335) (nike,0.0194709)</P> 00414 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">...</P> 00415 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">Topic 00416 * 81: (gear,0.151097) (water,0.0828022) (swim,0.0786631) 00417 * (floatation,0.0621068) (html,0.0600373) (pro,0.0579677) 00418 * (exercise,0.0538286) (mask,0.0496896) (swimming,0.04762) 00419 * (aqua,0.0414114) (watergearwg,0.0372724) (aquajoggeraj,0.0372724) 00420 * (kids,0.0352028) (goggles,0.0331333) (zoggszoggs,0.0310637) 00421 * (fins,0.0310637) (hydro,0.0289942) (snorkel,0.0269247) 00422 * (snorkeling,0.0269247) (caps,0.0269247) 00423 * </P> 00424 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">...</P> 00425 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">Topic 00426 * 96: (horse,0.190855) (horses,0.126676) (riding,0.101343) 00427 * (training,0.0996538) (farm,0.0785425) (equestrian,0.0667202) 00428 * (dressage,0.0472978) (equine,0.0405421) (boarding,0.0380088) 00429 * (show,0.0337865) (stables,0.0270309) (lessons,0.0194308) 00430 * (michigan,0.0185864) (tack,0.0185864) (riders,0.0185864) 00431 * (ranch,0.016053) (farms,0.0152086) (pony,0.0152086) 00432 * (ponies,0.0143641) (sale,0.0135197)</P> 00433 * </CODE> 00434 * \subsection topic_mix Viewing the topic assignments 00435 * <CODE> 00436 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">$ 00437 * cat lda.worToTop.txt</P> 00438 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">www.teddybears.com/ recreation/toys (teddy,36) 00439 * (bears,36) (enjoy,66) (teddy,36) (bears,36) (enjoy,66) (featuring,61) 00440 * (teddy,36) (bears,36) (teddy,36) (bear,36) (related,77) 00441 * (information,66) (everyday,38) (fun,44) (enjoy,66) (learn,2) 00442 * (enter,28) (love,61) (lord,51) (god,51) (heart,48) (soul,77) 00443 * (strength,80) (commandments,2) (give,13) (today,28) (hearts,61) 00444 * (impress,61) (house,80) (gates,60) (deuteronomy,36) (site,66) 00445 * (sponsored,77) (brown,48) (brehm,36) (bears,36) (teddy,36) (bear,36) 00446 * (artists,63) (teddy,36) (bears,36) (teddy,36) (bear,36) 00447 * (classifieds,61) (teddy,36) (bear,36) (clubs,77) (teddy,36) (bear,36) 00448 * (events,77) (teddy,36) (bear,36) (retailers,66) (teddy,36) (bear,36) 00449 * (magazines,61) (teddy,36) (bear,36) (books,66) (teddy,36) (bear,36) 00450 * (history,28) (web,66) (page,66) (design,16) (graphic,61) 00451 * (elements,45) (embedded,61) (html,81) (coding,48) (created,16) 00452 * (copyrighted,22) (kelly,6) (brown,48) (brehm,36) (rights,16) 00453 * (reserved,16) 00454 * </P> 00455 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">www.bearsbythesea.com/ recreation/toys (teddy,36) 00456 * (bear,36) (store,20) (pismo,18) (beach,25) (california,25) 00457 * (specialize,56) (muffy,14) (store,20) (complete,20) (collections,87) 00458 * (checkout,91) (web,66) (site,66) (muffy,14) (muffy,14) 00459 * (interested,66) (information,56) (price,3) (guides,77) (forums,28) 00460 * (newsletters,78) (follow,56) (information,66) (link,78) 00461 * (interested,66) (purchasing,4) (items,4) (follow,56) (online,20) 00462 * (store,20) (link,78) (online,20) (store,20) (information,66) 00463 * </P> 00464 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">www.the-toybox.com recreation/toys (party,0) 00465 * (supplies,38) (wiggles,56) (licensed,19) (characters,83) (jay,0) 00466 * (jay,0) (jet,83) (cabbage,49) (patch,49) (play,61) (tents,59) 00467 * (activity,91) (master,0) (roll,57) (building,90) (toy,83) 00468 * (building,90) (racing,50) (beanie,19) (babies,13) (personalized,19) 00469 * (toys,83) (requires,85) (minimum,12) (order,19) (enter,91) 00470 * (coupon,90) (code,90) (good,61) (coupons,90) (mail,56) (order,19) 00471 * (info,57) (order,13) (options,78) (return,4) (policy,9) (shipping,13) 00472 * (rates,1) (email,56) (wiggles,0) (party,19) (supplies,38) (toys,83) 00473 * (accessories,13) (jay,0) (jay,0) (jet,83) (plane,91) (party,19) 00474 * (supplies,90) (toys,83) (accessories,13) (toys,83) (party,19) 00475 * (supplies,90) (licensed,19) (characters,83) (toys,83) (sell,56) 00476 * (free,13) (internet,85) (premier,78) (shopping,83) (network,61) 00477 * (discount,13) (shopping,4) (internet,19) (click,78) (visit,66) 00478 * </P> 00479 * </CODE> 00480 * \section using_model Using the Model 00481 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm"> Once the model has been 00482 * trained, you can provide the binary files as a parameter to 00483 * 'learntopics' to infer topic mixtures on new documents. The following 00484 * illustration uses the test set that is available with the package 00485 * (ut_test/ydir_1k.tst.txt).</P> 00486 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"> As explained in the overview, 00487 * there are two modes in which the above learnt model can be used:</P> 00488 * <OL> 00489 * <LI>\ref batch_mode</LI> 00490 * <LI>\ref stream_mode</LI> 00491 * </OL> 00492 * \subsection batch_mode Batch Mode 00493 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">Assuming that PWD=$LDA_HOME/ut_test.</P> 00494 * <CODE> 00495 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">First Tokenize and format 00496 * the test data.</P> 00497 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">$ 00498 * cp ../ut_out/Tokenizer.class .</P> 00499 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">$ 00500 * cat ydir_1k.tst.txt | java Tokenizer | ../formatter 00501 * --dumpfile=../ut_out/unigram/lda.dict.dump</P> 00502 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00503 * 22:16:41.834889 20929 Controller.cpp:83] 00504 * ---------------------------------------------------------------------- 00505 * 00506 * </P> 00507 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00508 * 22:16:41.835304 20929 Controller.cpp:84] Log files are being stored 00509 * at /home/shravanm/workspace/LDA_Refactored/ut_test/formatter.* 00510 * </P> 00511 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00512 * 22:16:41.835325 20929 Controller.cpp:85] 00513 * ---------------------------------------------------------------------- 00514 * 00515 * </P> 00516 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00517 * 22:16:41.835626 20929 Controller.cpp:91] Assuming that corpus is 00518 * being piped through stdin. Reading from stdin... 00519 * </P> 00520 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00521 * 22:16:41.835649 20929 Controller.cpp:97] Will use the dictionary dump 00522 * ../ut_out/unigram/lda.dict.dump to load the global dictionary. 00523 * </P> 00524 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00525 * 22:16:41.836802 20929 Unigram_Test_Data_Formatter.cpp:14] 00526 * Initializing Dictionary from ../ut_out/unigram/lda.dict.dump 00527 * </P> 00528 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00529 * 22:16:41.900626 20929 Unigram_Test_Data_Formatter.cpp:16] Num of 00530 * unique words: 17208 00531 * </P> 00532 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00533 * 22:16:42.145593 20929 Controller.cpp:107] Formatting Complete. 00534 * Formatted document stored in lda.wor. You have used lda as the output 00535 * prefix. Make sure you use the same as the input prefix for 00536 * learntopics 00537 * </P> 00538 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00539 * 22:16:42.145661 20929 Controller.cpp:115] Induced local dictionary 00540 * being dumped to lda.dict.dump 00541 * </P> 00542 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00543 * 22:16:42.150445 20929 Controller.cpp:117] Finished dictionary dump 00544 * </P> 00545 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00546 * 22:16:42.150481 20929 Controller.cpp:118] Formatting done 00547 * </P> 00548 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00549 * 22:16:42.150507 20929 Controller.cpp:122] Total number of unique 00550 * words found: 3506 00551 * </P> 00552 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00553 * 22:16:42.150533 20929 Controller.cpp:123] Total num of docs found: 00554 * 100 00555 * </P> 00556 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00557 * 22:16:42.150559 20929 Controller.cpp:124] Total num of tokens found: 00558 * 16394 00559 * </P> 00560 * </CODE> 00561 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">Note the --dumpfile flag. 00562 * Here is where you provide the dictionary of words that the model 00563 * knows about. Only those words in the test data set are recognized and 00564 * the rest are ignored.</P> 00565 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm"><BR> 00566 * </P> 00567 * <CODE> 00568 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">$ 00569 * ../learntopics -test --dumpprefix=../ut_out/unigram/lda --topics=100</P> 00570 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00571 * 22:22:00.932081 20947 Controller.cpp:68] 00572 * ---------------------------------------------------------------------- 00573 * 00574 * </P> 00575 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00576 * 22:22:00.932518 20947 Controller.cpp:69] Log files are being stored 00577 * at /home/shravanm/workspace/LDA_Refactored/ut_test/learntopics.* 00578 * </P> 00579 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00580 * 22:22:00.932539 20947 Controller.cpp:70] 00581 * ---------------------------------------------------------------------- 00582 * 00583 * </P> 00584 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00585 * 22:22:00.933087 20947 Controller.cpp:92] You have chosen single 00586 * machine testing mode 00587 * </P> 00588 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00589 * 22:22:00.933671 20947 Unigram_Model_Training_Builder.cpp:43] 00590 * Initializing Dictionary from lda.dict.dump 00591 * </P> 00592 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00593 * 22:22:00.945226 20947 Unigram_Model_Training_Builder.cpp:45] 00594 * Dictionary Initialized 00595 * </P> 00596 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00597 * 22:22:00.998955 20947 Unigram_Model_Tester.cpp:33] Initializing 00598 * Word-Topic counts table from dump ../ut_out/unigram/lda.ttc.dump 00599 * using 3506 words & 100 topics. 00600 * </P> 00601 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00602 * 22:22:01.029986 20947 Unigram_Model_Tester.cpp:51] Initialized 00603 * Word-Topic counts table 00604 * </P> 00605 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00606 * 22:22:01.030045 20947 Unigram_Model_Tester.cpp:55] Initializing Alpha 00607 * vector from dumpfile ../ut_out/unigram/lda.par.dump 00608 * </P> 00609 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00610 * 22:22:01.030118 20947 Unigram_Model_Tester.cpp:57] Alpha vector 00611 * initialized 00612 * </P> 00613 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00614 * 22:22:01.030144 20947 Unigram_Model_Tester.cpp:60] Initializing Beta 00615 * Parameter from specified Beta = 0.01 00616 * </P> 00617 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00618 * 22:22:01.030186 20947 Unigram_Model_Tester.cpp:63] Beta param 00619 * initialized 00620 * </P> 00621 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00622 * 22:22:01.036779 20947 Testing_Execution_Strategy.cpp:24] Starting 00623 * Parallel testing Pipeline 00624 * </P> 00625 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00626 * 22:22:01.329864 20947 Testing_Execution_Strategy.cpp:36] Iteration 0 00627 * done. Took 0.00488273 mins 00628 * </P> 00629 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00630 * 22:22:01.354588 20947 Unigram_Model.cpp:92] Average num of topics 00631 * assigned per word = 3.37222 00632 * </P> 00633 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00634 * 22:22:01.354617 20947 Testing_Execution_Strategy.cpp:41] >>>>>>>>>> 00635 * Log-Likelihood (model, doc, total): -1.93551e+06 , -57496.6 , 00636 * -1.99301e+06 00637 * </P> 00638 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00639 * 22:22:01.358059 20947 Testing_Execution_Strategy.cpp:49] Parallel 00640 * testing Pipeline done 00641 * </P> 00642 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00643 * 22:22:01.360312 20947 Unigram_Model.cpp:105] Saving model for test 00644 * pipeline in lda.ttc.dump and lda.par.dump 00645 * </P> 00646 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00647 * 22:22:01.375068 20947 Unigram_Model.cpp:108] Model saved 00648 * </P> 00649 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00650 * 22:22:01.375098 20947 Unigram_Model.cpp:117] Writing top words 00651 * identified per topic into lda.topToWor.txt 00652 * </P> 00653 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00654 * 22:22:01.380234 20947 Unigram_Model.cpp:139] Word statistics per 00655 * topic written 00656 * </P> 00657 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00658 * 22:22:01.380290 20947 Unigram_Model_Training_Builder.cpp:126] Saving 00659 * document to topic-proportions in lda.docToTop.txt 00660 * </P> 00661 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00662 * 22:22:01.380311 20947 Unigram_Model_Training_Builder.cpp:127] Saving 00663 * word to topic assignment in lda.worToTop.txt 00664 * </P> 00665 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00666 * 22:22:01.401484 20947 Unigram_Model_Training_Builder.cpp:193] 00667 * Document to topic-proportions saved in lda.docToTop.txt 00668 * </P> 00669 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00670 * 22:22:01.401512 20947 Unigram_Model_Training_Builder.cpp:194] Word to 00671 * topic assignment saved in lda.worToTop.txt 00672 * </P> 00673 * </CODE> 00674 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm"><BR> 00675 * </P> 00676 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">Note the following flags:</P> 00677 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">1. test: This indicates the 00678 * batch test mode</P> 00679 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">2. dumpprefix: This flag 00680 * gives the prefix for the binary files storing the model in our case, 00681 * ../ut_out/lda. 'lda' is the default prefix used by the framework when 00682 * none is specified. If you have specified a different prefix here is 00683 * the place to use it. This will add different suffixes to fetch the 00684 * various binary files to fetch the model saved and use them to infer 00685 * the topic mixture for your new documents.</P> 00686 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">3. topics: This flag 00687 * indicates the number of topics that your trained model contained. The 00688 * same number should be used for your new documents.</P> 00689 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm"><BR> 00690 * </P> 00691 * <CODE> 00692 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">$ 00693 * ls -al lda*</P> 00694 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00695 * 1 shravanm shravanm 40059 2011-04-17 22:16 lda.dict.dump 00696 * </P> 00697 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00698 * 1 shravanm shravanm 204655 2011-04-17 22:16 lda.dict.dump.global 00699 * </P> 00700 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00701 * 1 shravanm shravanm 18947 2011-04-17 22:22 lda.docToTop.txt 00702 * </P> 00703 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00704 * 1 shravanm shravanm 816 2011-04-17 22:22 lda.par.dump 00705 * </P> 00706 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00707 * 1 shravanm shravanm 62380 2011-04-17 22:22 lda.top 00708 * </P> 00709 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00710 * 1 shravanm shravanm 38389 2011-04-17 22:22 lda.topToWor.txt 00711 * </P> 00712 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00713 * 1 shravanm shravanm 83799 2011-04-17 22:22 lda.ttc.dump 00714 * </P> 00715 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00716 * 1 shravanm shravanm 36374 2011-04-17 22:16 lda.wor 00717 * </P> 00718 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">-rw-r--r-- 00719 * 1 shravanm shravanm 200546 2011-04-17 22:22 lda.worToTop.txt 00720 * </P> 00721 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm"><BR> 00722 * </P> 00723 * </CODE> 00724 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">This is pretty much the same 00725 * output that you see when you learnt the model but for your new 00726 * documents.</P> 00727 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm"><BR> 00728 * </P> 00729 * \subsection stream_mode Streaming Mode 00730 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm"> Again assuming that PWD=$LDA_HOME/ut_test</P> 00731 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">In this there is no need for 00732 * formatting the documents but you need to provide the dictionary as an 00733 * additional parameter. You also need to specify the maximum memory in 00734 * MBs that you can allocate to store the model. You can tokenize and 00735 * directly pipe the raw text documents through 'learntopics'. 00736 * </P> 00737 * <CODE> 00738 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">$ 00739 * java Tokenizer | ../learntopics -teststream 00740 * --dumpprefix=../ut_out/unigram/lda --topics=100 00741 * --dictionary=../ut_out/unigram/lda.dict.dump --maxmem=128</P> 00742 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00743 * 23:06:24.342099 21004 Controller.cpp:68] 00744 * ---------------------------------------------------------------------- 00745 * 00746 * </P> 00747 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00748 * 23:06:24.342605 21004 Controller.cpp:69] Log files are being stored 00749 * at /home/shravanm/workspace/LDA_Refactored/ut_test/learntopics.* 00750 * </P> 00751 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00752 * 23:06:24.342627 21004 Controller.cpp:70] 00753 * ---------------------------------------------------------------------- 00754 * 00755 * </P> 00756 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00757 * 23:06:24.343189 21004 Controller.cpp:92] You have chosen single 00758 * machine testing mode 00759 * </P> 00760 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00761 * 23:06:24.343723 21004 Unigram_Model_Streaming_Builder.cpp:20] 00762 * Initializing global dictionary from ../ut_out/unigram/lda.dict.dump 00763 * </P> 00764 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00765 * 23:06:24.398543 21004 Unigram_Model_Streaming_Builder.cpp:23] 00766 * Dictionary initialized and has 17208 00767 * </P> 00768 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00769 * 23:06:24.398643 21004 Unigram_Model_Streaming_Builder.cpp:49] 00770 * Estimating the words that will fit in 128 MB 00771 * </P> 00772 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00773 * 23:06:24.486726 21004 Unigram_Model_Streaming_Builder.cpp:52] 17208 00774 * will fit in 1.05881 MB of memory 00775 * </P> 00776 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00777 * 23:06:24.486809 21004 Unigram_Model_Streaming_Builder.cpp:53] 00778 * Initializing Local Dictionary from ../ut_out/unigram/lda.dict.dump 00779 * with 17208 words. 00780 * </P> 00781 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00782 * 23:06:24.562101 21004 Unigram_Model_Streaming_Builder.cpp:82] Local 00783 * Dictionary Initialized. Size: 34416 00784 * </P> 00785 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00786 * 23:06:24.565176 21004 Unigram_Model_Streamer.cpp:27] Initializing 00787 * Word-Topic counts table from dump ../ut_out/unigram/lda.ttc.dump 00788 * using 17208 words & 100 topics. 00789 * </P> 00790 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00791 * 23:06:24.608408 21004 Unigram_Model_Streamer.cpp:45] Initialized 00792 * Word-Topic counts table 00793 * </P> 00794 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00795 * 23:06:24.608469 21004 Unigram_Model_Streamer.cpp:49] Initializing 00796 * Alpha vector from dumpfile ../ut_out/unigram/lda.par.dump 00797 * </P> 00798 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00799 * 23:06:24.608543 21004 Unigram_Model_Streamer.cpp:51] Alpha vector 00800 * initialized 00801 * </P> 00802 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00803 * 23:06:24.608571 21004 Unigram_Model_Streamer.cpp:54] Initializing 00804 * Beta Parameter from specified Beta = 0.01 00805 * </P> 00806 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00807 * 23:06:24.608608 21004 Unigram_Model_Streamer.cpp:57] Beta param 00808 * initialized 00809 * </P> 00810 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00811 * 23:06:24.615538 21004 Testing_Execution_Strategy.cpp:24] Starting 00812 * Parallel testing Pipeline 00813 * </P> 00814 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">www.sauritchsurfboards.com/ 00815 * recreation/sports/aquatic_sports watch out jeremy sherwin is here 00816 * over the past six months you may have noticed this guy in every surf 00817 * magazine published jeremy is finally getting his run more.. copyright 00818 * surfboards 2004 all rights reserved june 6 2004 new launches it s new 00819 * and improved site you can now order custom surfboards online more 00820 * improvements to come.. top selling models middot rocket fish middot 00821 * speed egg middot classic middot squash 00822 * </P> 00823 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">www.sauritchsurfboards.com/ recreation/sports/aquatic_sports (watch,83) 00824 * (past,86) (months,77) (noticed,15) (guy,93) (surf,35) (magazine,86) 00825 * (published,92) (finally,49) (run,21) (copyright,62) (surfboards,27) 00826 * (rights,90) (reserved,59) (june,63) (launches,26) (improved,40) 00827 * (site,26) (order,72) (custom,36) (surfboards,11) (online,68) 00828 * (improvements,67) (top,29) (selling,82) (models,30) (middot,62) 00829 * (rocket,23) (fish,67) (middot,35) (speed,29) (egg,2) (middot,22) 00830 * (classic,58) (middot,69) (squash,67) 00831 * </P> 00832 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">www.semente.pt 00833 * recreation/sports/aquatic_sports por desde de 1999 para este site com 00834 * o para browsers 4 ou superior de prefer ecirc ncia o internet 00835 * explorer aqui gr aacute tis este site foi e pela think pink 00836 * multimedia 2000 surfboards all rights reserved 00837 * </P> 00838 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">www.semente.pt recreation/sports/aquatic_sports (por,83) 00839 * (de,86) (para,77) (site,15) (para,93) (browsers,35) (superior,86) 00840 * (de,92) (prefer,49) (internet,21) (explorer,62) (aqui,27) (gr,90) 00841 * (aacute,59) (site,63) (pink,26) (multimedia,40) (surfboards,26) 00842 * (rights,72) (reserved,36)</P> 00843 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00844 * 23:06:38.769309 21004 Testing_Execution_Strategy.cpp:36] Iteration 0 00845 * done. Took 0.235894 mins 00846 * </P> 00847 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00848 * 23:06:38.875191 21004 Unigram_Model.cpp:92] Average num of topics 00849 * assigned per word = 1.85268 00850 * </P> 00851 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00852 * 23:06:38.875252 21004 Testing_Execution_Strategy.cpp:41] >>>>>>>>>> 00853 * Log-Likelihood (model, doc, total): -2.52084e+06 , -194.288 , 00854 * -2.52103e+06 00855 * </P> 00856 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">W0417 00857 * 23:06:38.875350 21004 Testing_Execution_Strategy.cpp:49] Parallel 00858 * testing Pipeline done 00859 * </P> 00860 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm"><BR> 00861 * </P> 00862 * </CODE> 00863 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">Note the flags used:</P> 00864 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">1. teststream: Indicates the 00865 * streaming test mode</P> 00866 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">2. dictionary: The 00867 * dictionary of words that the model knows about. Only those words in 00868 * the test data set are recognized and the rest are ignored.</P> 00869 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">3. maxmemory: In MB. Denotes 00870 * the amount of memory you allocate to store the model.</P> 00871 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">The rest are same as batch 00872 * mode.</P> 00873 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm"><BR> 00874 * </P> 00875 * <P ALIGN=LEFT STYLE="margin-bottom: 0cm">Here there is no other 00876 * output that you need to look for. The topic assignments are dumped 00877 * back to stdout. The word mixtures for the topics are the same as that 00878 * in the model.</P> 00879 * \section generated_output Output Generated 00880 * <OL> 00881 * <LI><P ALIGN=LEFT STYLE="margin-bottom: 0cm">lda.wor: This file 00882 * generated by 'formatter' is the document in the protobuffer format 00883 * with words replaced by their indices</P> 00884 * <LI><P ALIGN=LEFT STYLE="margin-bottom: 0cm">lda.top: This file 00885 * generated by 'learntopics' contains the current topic assignments 00886 * for all the documents in the protobuffer format</P> 00887 * <LI><P ALIGN=LEFT STYLE="margin-bottom: 0cm">lda.dict.dump: This 00888 * file is a binary dump of the dictionary generated by the 00889 * 'formatter' program. This will be implicitly used by the 00890 * 'learntopics' program</P> 00891 * <LI><P ALIGN=LEFT STYLE="margin-bottom: 0cm">lda.ttc.dump: This 00892 * file generated by 'learntopics' is the binary dump of the 00893 * word-topic counts table. This is the state at the end of all the 00894 * iterations and essentially represents all the training that has 00895 * happened through all the iterations. It is very important as this 00896 * is a necessary input to the test pipeline.</P> 00897 * <LI><P ALIGN=LEFT STYLE="margin-bottom: 0cm">lda.par.dump: This 00898 * file generated by 'learntopics' is the binary dump of the 00899 * parameters specified by the user which might be modified by an 00900 * optimization step</P> 00901 * <LI><P ALIGN=LEFT STYLE="margin-bottom: 0cm">lda.chk: This file 00902 * generated by 'learntopics' is a checkpoint keeping the current 00903 * iteration and some metadata</P> 00904 * <LI><P ALIGN=LEFT STYLE="margin-bottom: 0cm">lda.*.txt: These are 00905 * for human consumption and there are three of these generated by 00906 * 'learntopics':</P> 00907 * <OL> 00908 * <LI><P ALIGN=LEFT STYLE="margin-bottom: 0cm">lda.topToWor.txt – 00909 * The word mixtures for each topics</P> 00910 * <LI><P ALIGN=LEFT STYLE="margin-bottom: 0cm">lda.worToTop.txt – 00911 * The topic assignments for every document on a per word basis</P> 00912 * <LI><P ALIGN=LEFT STYLE="margin-bottom: 0cm">lda.docToTop.txt – 00913 * The topic proportions for every document</P> 00914 * </OL> 00915 * </OL> 00916 * \section customizations Customization 00917 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">In its simplest form, all 00918 * the parameters to 'formatter' and 'learntopics' have sensible default 00919 * values which practically allows them to be used without any 00920 * arguments. However, if you need to customize your setup then the 00921 * following options are available:</P> 00922 * <UL> 00923 * <LI><P ALIGN=JUSTIFY>I/O Customization: By default all files output 00924 * by the formatter, which are also the input to the learnTopcis 00925 * program and all files output by the learntopics program use the 00926 * prefix <STRONG>lda</STRONG>. This can be easily customized using the 00927 * <STRONG>outputprefix</STRONG> & <STRONG>inputprefix</STRONG> 00928 * flags respectively. But while this customization is used, one needs 00929 * to be careful about keeping the <STRONG>outputprefix</STRONG> & 00930 * <STRONG>inputprefix</STRONG> same. 00931 * </P> 00932 * <LI><P ALIGN=JUSTIFY>Parameter Customization: The weights of the 00933 * Dirichlet Conjugates for both topics(alpha) & words(beta) can be 00934 * changed by the <STRONG>alpha</STRONG> & <STRONG>beta</STRONG> 00935 * flags. 00936 * </P> 00937 * <LI><P>Diagnostics & Optimization: By default, the code performs 00938 * alpha optimization every 25 iterations after burn in and also prints 00939 * the log likelihood every 25 iterations. These are customizable using 00940 * <STRONG>optimizestats</STRONG>, <STRONG>printloglikelihood</STRONG> 00941 * & <STRONG>burnin</STRONG> flags. 00942 * </P> 00943 * <LI><P>Initialization Options: By default, we do random initialization 00944 * for the topic assignments in the first iteration. However, we can be a 00945 * bit smarter about this. Instead of random assignments, we start out 00946 * with no topic assignments and an empty word-topic counts tables. 00947 * The sampling depends entirely on the smoothing mass for the first few 00948 * documents and subsequent documents use the counts table built so far. 00949 * This is similar in style to sequential monte carlo. You can use the 00950 * \'online\' flag to signal this. We have seen that this leads to faster 00951 * convergence. 00952 * </UL> 00953 */