/*****************************************************************************
 The contents of this file are subject to the Mozilla Public License
 Version 1.1 (the "License"); you may not use this file except in
 compliance with the License. You may obtain a copy of the License at
 http://www.mozilla.org/MPL/

 Software distributed under the License is distributed on an "AS IS"
 basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
 License for the specific language governing rights and limitations
 under the License.

 The Original Code is Copyright (C) by Yahoo! Research.

 The Initial Developer of the Original Code is Shravan Narayanamurthy.

 All Rights Reserved.
******************************************************************************/
/** \mainpage Y!LDA Topic Modelling Framework
 * <H1 CLASS="western">What is Topic Modelling?</H1>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">
 * Topic modelling is a Machine Learning technique for categorizing data.
 * If we have to group the home pages of Singapore Airlines, the National
 * University of Singapore & Chijmes, a restaurant in Singapore, we can group
 * them all as belonging to Singapore. Now, if we have more pages to group,
 * let's say United Airlines, the Australian National University and a
 * restaurant in Berkeley, then the combined set of pages can be grouped in
 * multiple ways: by country, by type of business and so on. Choosing one of
 * these groupings is hard because each plays a different role depending on
 * the context. In a document that talks about nations' strengths, grouping
 * by nationality works well, while in a document that talks about
 * universities it is apt to group by type of business. So the only
 * alternative is to assign or tag each page with all the categories it
 * belongs to: we tag the United Airlines page as an airline company in the
 * US, and so on.</P>
 *
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">So what is the big difference
 * between grouping objects, or clustering, & Topic Models? The following
 * example clarifies the distinction. Consider objects of different colors.
 * Clustering them means finding that there are three prototypical colors,
 * R, G & B, and grouping each object by its primary color; that is, we group
 * objects by prototypes. With topic models, on the other hand, we try to
 * find the composition of R, G & B in the color of each object; that is, we
 * say that this color is composed of 80% R, 9% G & 11% B. So topic models
 * are definitely richer, in the sense that any color can be decomposed into
 * the prototypical colors, but not every color can be unambiguously assigned
 * to a single group.</P>
 *
 * <H1 CLASS="western">What is Y!LDA topic modelling framework?</H1>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">Though conceptually this
 * sounds simple, making it work on hundreds of millions of pages, with
 * thousands of topics to infer & no editorial data, is a very hard problem.
 * The state of the art can only handle corpora that are 10 to 100 times
 * smaller. We would also like the solution to scale with the number of
 * computers, so that we can add more machines and solve bigger problems.</P>
 *
 * <P>One way of solving the Topic Modelling problem is called
 * <a href="http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf">Latent Dirichlet Allocation</a> (LDA).
 * This is a statistical model which specifies a probabilistic procedure to
 * generate data. It defines a topic as a probability distribution over
 * words. Essentially, think of topics as having a vocabulary of their own,
 * with a preference over words specified as a probability distribution.</P>
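 *
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">To make the color analogy
 * concrete in standard LDA notation (the symbols below are conventional and
 * are not identifiers from this code base): each topic \f$k\f$ is a
 * distribution \f$\phi_k\f$ over the vocabulary and each document \f$d\f$
 * has mixing proportions \f$\theta_d\f$ over the topics, so the probability
 * of seeing word \f$w\f$ in document \f$d\f$ is the admixture</P>
 * \f[
 *   p(w \mid d) \;=\; \sum_{k=1}^{K} \theta_{d,k} \, \phi_{k,w},
 * \f]
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">which is exactly the
 * "80% R, 9% G, 11% B" decomposition from the color example above, with the
 * topics playing the role of the prototypical colors.</P>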
 *
 * <P>We have implemented a framework that solves the topic modelling problem
 * using LDA and works at very large scale. Considerable effort has also been
 * spent on an architecture for the framework that is flexible enough to let
 * the infrastructure be reused for implementing fancier models and
 * extensions. One of the main aims is that scaling a new model should take
 * minimal effort. For more details please take a look at
 * <a href="http://portal.acm.org/citation.cfm?id=1920931">An Architecture for Parallel Topic Models</a>.
 * </P>
 *
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><BR>
 * </P>
 * <H1 CLASS="western">What does the framework provide?</H1>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">It provides a fast C++
 * implementation of the inferencing algorithm which can use both multi-core
 * parallelism and multi-machine parallelism using a Hadoop cluster. It can
 * infer about a thousand topics on a million-document corpus, running a
 * thousand iterations, on an eight-core machine in one day.</P>
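 *
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">The inferencing algorithm is a
 * collapsed Gibbs sampler (see the architecture paper linked above). As a
 * rough illustration only — the names below are made up for this sketch and
 * do not correspond to Y!LDA classes or functions, and the real
 * implementation is heavily optimized and multi-threaded — the standard
 * per-token update
 * \f$ p(z = k) \propto (n_{dk} + \alpha)\,(n_{kw} + \beta) / (n_k + V\beta) \f$
 * can be coded as:</P>
 * \code
 * #include <cstdlib>
 * #include <vector>
 *
 * // Sample a new topic for one word occurrence, given counts from which that
 * // occurrence's current assignment has already been removed.
 * //   n_dk[k]: tokens in this document currently assigned to topic k
 * //   n_kw[k]: occurrences of this word currently assigned to topic k
 * //   n_k[k] : total tokens currently assigned to topic k
 * int sample_topic(const std::vector<int>& n_dk,
 *                  const std::vector<int>& n_kw,
 *                  const std::vector<int>& n_k,
 *                  double alpha, double beta, int vocab_size) {
 *   const int K = static_cast<int>(n_k.size());
 *   std::vector<double> cdf(K);
 *   double sum = 0.0;
 *   for (int k = 0; k < K; ++k) {
 *     sum += (n_dk[k] + alpha) * (n_kw[k] + beta) / (n_k[k] + vocab_size * beta);
 *     cdf[k] = sum;
 *   }
 *   const double u = sum * (std::rand() / (RAND_MAX + 1.0));
 *   int k = 0;
 *   while (k < K - 1 && cdf[k] < u) ++k;
 *   return k;
 * }
 * \endcode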
 *
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><BR>
 * </P>
 * <H1 CLASS="western">What are the requirements?</H1>
 * <H2 CLASS="western">Hardware Requirements:</H2>
 * <P><B>Small corpus, of the order of thousands of documents:</B> Dual-core
 * machines with 4-8 GB of RAM should be sufficient. Of course you can run
 * the code on larger document sets, but you will have to wait longer or cut
 * down on the number of iterations.</P>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><B>Large corpus, of the order
 * of millions of documents:</B> 50 to 100 multi-core machines (quad cores,
 * dual quad cores, etc.) with 8 to 16 GB of RAM give good performance.</P>
 * <H2 CLASS="western">Software Requirements:</H2>
 * <P>The code has mainly been tested on the Linux platform. If you want to
 * install on other platforms, the sources for the code and its libraries are
 * provided, so you can try compiling them there and let us know if it
 * works.</P>
 * <H3 CLASS="western">Dependencies:</H3>
 * <P>The code depends on a number of libraries. To facilitate distribution
 * we have hosted the libraries separately. The install script shipped with
 * the code fetches the sources for these libraries and builds them on the
 * developer's machine. The following is the list:<BR>
 * <OL>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><B>Ice-3.4.1.tar.gz </B><BR>
 * An efficient inter-process communication framework, used for the
 * distributed storage of the (topic, word) tables.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><B>cppunit-1.12.1.tar.gz </B><BR>
 * C++ unit testing framework. We use this for unit tests.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><B>mcpp-2.7.2.tar.gz </B><BR>
 * C++ preprocessor.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><B>boostinclude.tar.gz </B><BR>
 * Boost libraries (various datatypes).</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><B>gflags-1.2.tar.gz </B><BR>
 * Google's flag-processing library (used for command-line options).</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><B>protobuf-2.2.0a.tar.gz </B><BR>
 * Protocol Buffers, Google's serialization library (used for serializing
 * data to disk and as the internal key data structure).</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><B>bzip2-1.0.5.tar.gz </B><BR>
 * Data compression.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><B>glog-0.3.0.tar.gz </B><BR>
 * Google's logging library (log-file generation).</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><B>tbb22_20090809oss.tar.gz </B><BR>
 * Intel Threading Building Blocks, a multithreaded processing library that
 * is much easier to use than pthreads. We use the pipeline class.</P>
 * </OL>
 * <P>All the libraries except Ice should install without problems. Ice has a
 * lot of dependencies of its own, and our automated build script might not
 * work in your set-up. If so, please install Ice manually and copy the
 * required includes & libs into the Yahoo_LDA directory.</P>
 * <H1 CLASS="western">How do you install?</H1>
 * <OL>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">Run make from the
 * directory where you have checked out the source.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">This first installs all
 * the required libraries locally in the same directory. It then compiles the
 * source code to generate the binaries described next.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">If you see compilation or
 * linkage errors, first check that the libraries installed properly. There
 * may be issues with the installation of Ice. Care has been taken that the
 * install works on most well set-up machines, but if it fails we recommend
 * installing Ice manually and copying its include files and libraries to the
 * include & lib directories in the Y!LDA install path. After that the
 * compilation should go through fine.</P>
 * </OL>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><BR>
 * </P>
 * <H1 CLASS="western">What are the binaries that get installed and what
 * do you use them for?</H1>
 * <OL>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">formatter: Used to format
 * the raw text corpus into a binary format on which learntopics can be run.
 * This is a preprocessing step that allows one to run learntopics many times
 * on the same corpus. It also decreases the on-disk size of the raw text
 * corpus.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">learntopics: Used to
 * learn/infer the topics from the corpus and represent the documents in the
 * corpus as mixtures of these topics.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">DM_Server: The server that
 * implements a distributed hash table, used to store the global counts table
 * while running the multi-machine version of the code (see the conceptual
 * sketch after this list).</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">Merge_Dictionaries: Used
 * to build the global dictionary by merging the local dictionaries.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">Merge_Topic_Counts: Used
 * to dump the global counts table to disk.</P>
 * </OL>
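 *
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">The "global counts table"
 * served by DM_Server and dumped by Merge_Topic_Counts is, conceptually, the
 * per-word topic counts used by the sampler (the n_kw values in the sketch
 * above), aggregated over the whole corpus. The type below is only a
 * single-machine illustration of that idea; it is not the actual Y!LDA data
 * structure and does not show the Ice-based distributed interface.</P>
 * \code
 * #include <string>
 * #include <unordered_map>
 * #include <vector>
 *
 * // Conceptual view of the global (topic, word) counts table: for each word
 * // in the dictionary, how many times it is currently assigned to each topic.
 * struct GlobalCountsTable {
 *   explicit GlobalCountsTable(int num_topics) : num_topics_(num_topics) {}
 *
 *   // Apply a (possibly negative) count update for (word, topic).
 *   void add(const std::string& word, int topic, int delta) {
 *     std::vector<int>& counts = word_topic_counts_[word];
 *     if (counts.empty()) counts.resize(num_topics_, 0);
 *     counts[topic] += delta;
 *   }
 *
 *   int num_topics_;
 *   std::unordered_map<std::string, std::vector<int>> word_topic_counts_;
 * };
 * \endcode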
 *
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><BR>
 * </P>
 * <H1 CLASS="western">What are the scripts available and what do you do
 * with them?</H1>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">Scripts are provided for the
 * build system and for the multi-machine setup with Hadoop. The build system
 * uses the bin/create*.sh scripts to locate files & directories. The
 * multi-machine setup has the runLDA.sh, Formatter.sh & LDA.sh scripts,
 * which launch the hadoop-streaming jobs that do the distributed inference.
 * Another script, splitter.sh, helps you organize your corpus on HDFS.</P>
 * <P/>
 * <H1 CLASS="western">How do I use Y!LDA?</H1>
 * \ref usage
 * <P/>
 * <H1 CLASS="western">Where to find the developer documentation?</H1>
 * \ref architecture & the code documentation generated by Doxygen
 * <P/>
 */