/** \mainpage User Manual for Y!LDA Topic Modelling framework
 * <H1 CLASS="western">What is Topic Modelling?</H1>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">It is a statistical
 * technique for representing documents (or similar collections of
 * tokens) as a mixture of constituents. For example, a document on
 * basketball and betting can be represented as containing a mix of
 * words from each of the constituent topics basketball & betting,
 * just as a given color, say orange, can be represented as a mixture
 * of its constituent colors red, green & blue. This is quite
 * different from a clustering model, which assigns each document to
 * exactly one of many classes; a clustering model would represent the
 * color orange by its closest color, red.</P>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><BR>
 * </P>
 * <H1 CLASS="western">What is the Y!LDA topic modelling framework?</H1>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">It is Yahoo!'s
 * implementation of topic modelling, used to infer the mixture of
 * constituents given the documents. The algorithm is fully
 * unsupervised. The constituents are called topics: each topic is a
 * distribution over words, and each document is represented as a
 * distribution over topics.</P>
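 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">In the standard LDA
 * notation (the symbols below are the usual ones from the LDA
 * literature, not identifiers from the Y!LDA code): if \f$\theta_d\f$
 * is the topic distribution of document \f$d\f$ and \f$\phi_k\f$ is
 * the word distribution of topic \f$k\f$, the probability of seeing
 * word \f$w\f$ in document \f$d\f$ decomposes as a mixture over the
 * \f$K\f$ topics:</P>
 * \f[
 * p(w \mid d) = \sum_{k=1}^{K} p(w \mid z=k)\, p(z=k \mid d)
 *             = \sum_{k=1}^{K} \phi_{k,w}\, \theta_{d,k}
 * \f]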
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><BR>
 * </P>
 * <H1 CLASS="western">What does the framework provide?</H1>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">It provides a fast C++
 * implementation of the inferencing algorithm that can exploit both
 * multi-core parallelism and, via a Hadoop cluster, multi-machine
 * parallelism. It can infer about a thousand topics on a
 * million-document corpus, running a thousand iterations on an
 * eight-core machine, in one day.</P>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><BR>
 * </P>
 * <H1 CLASS="western">What are the requirements?</H1>
 * <P><B>Small corpus, of the order of thousands of documents:</B>
 * dual-core machines with 4-8 GB of RAM might be sufficient. You can
 * of course run the code on larger document sets, but you will have to
 * wait longer or cut down on the number of iterations.</P>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><B>Large corpus, of the
 * order of millions of documents:</B> 50 to 100 multi-core (quad-core,
 * dual quad-core, etc.) machines with 8 to 16 GB of RAM can give good
 * performance.</P>
 * <H1 CLASS="western">How do you install?</H1>
 * <OL>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">Run make from the
 * directory where you have checked out the source.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">This will first
 * install all the required libraries locally in the same directory. It
 * will then compile the source code to generate the binaries described
 * next.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">If you see
 * compilation or linkage errors, first check that the libraries were
 * installed properly. There may be issues with the installation of
 * Ice. Care has been taken to make the install work on most well
 * set-up machines, but if it fails, we recommend installing Ice
 * manually and copying its include files and libraries to the include
 * & lib directories in the Y!LDA install path (see the example
 * session at the end of this page). After this the compilation should
 * go through fine.</P>
 * </OL>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><BR>
 * </P>
 * <H1 CLASS="western">What are the binaries that get installed and what
 * do you use them for?</H1>
 * <OL>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">formatter: Used to
 * format the raw text corpus into a binary format on which learntopics
 * can be run. This is a preprocessing step that allows one to run
 * learntopics many times on the same corpus. It also decreases the
 * on-disk size of the raw text corpus.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">learntopics: Used to
 * learn/infer the topics from the corpus and represent the documents
 * in the corpus as a mixture of these topics.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">DM_Server: The
 * server that implements the distributed hash table used to store the
 * global counts table while running the multi-machine version of the
 * code.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">Merge_Dictionaries:
 * Used to build the global dictionary by merging the local
 * dictionaries.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">Merge_Topic_Counts:
 * Used to dump the global counts table to disk.</P>
 * </OL>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">The example session at
 * the end of this page sketches the order in which formatter and
 * learntopics are typically invoked.</P>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><BR>
 * </P>
 * <H1 CLASS="western">What are the scripts available and what do you do
 * with them?</H1>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">Scripts are provided for
 * the build system and for the multi-machine setup with Hadoop. The
 * build system uses the bin/create*.sh scripts to locate files &
 * directories. The multi-machine setup uses the runLDA.sh, Formatter.sh
 * & LDA.sh scripts, which launch the hadoop-streaming jobs that
 * perform the distributed inference. Another script, splitter.sh,
 * helps you organize your corpus on HDFS.</P>
 * <P/>
 * <H1 CLASS="western">How do I use Y!LDA?</H1>
 * \ref usage
 * <P/>
 * <H1 CLASS="western">Where do I find the developer documentation?</H1>
 * See \ref architecture & the code documentation generated through
 * Doxygen.
 * <P/>
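 * <H1 CLASS="western">Example: building and running on a single machine</H1>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">For concreteness, here
 * is a sketch of a typical single-machine session. The directory names
 * are placeholders for your own setup, and the command-line flags of
 * formatter and learntopics are elided rather than guessed at; see
 * \ref usage for the actual options.</P>
 * <PRE>
 * # 1. Build: installs the required libraries locally, then compiles
 * #    the binaries (formatter, learntopics, DM_Server, ...).
 * cd /path/to/yida-checkout          # placeholder path
 * make
 *
 * # If the automatic Ice install fails, install Ice manually, copy its
 * # headers and libraries into the Y!LDA install path, and rerun make.
 * # (ICE_HOME and the install path below are placeholders.)
 * #   cp -r $ICE_HOME/include/* /path/to/ylda-install/include/
 * #   cp -r $ICE_HOME/lib/*     /path/to/ylda-install/lib/
 * #   make
 *
 * # 2. Format the raw text corpus once; the binary output is smaller
 * #    on disk, and learntopics can be rerun on it many times.
 * ./formatter ...                    # flags elided; see \ref usage
 *
 * # 3. Learn the topics and the per-document topic mixtures.
 * ./learntopics ...                  # flags elided; see \ref usage
 * </PRE>
 */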