/*****************************************************************************
 The contents of this file are subject to the Mozilla Public License
 Version 1.1 (the "License"); you may not use this file except in
 compliance with the License. You may obtain a copy of the License at
 http://www.mozilla.org/MPL/

 Software distributed under the License is distributed on an "AS IS"
 basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
 License for the specific language governing rights and limitations
 under the License.

 The Original Code is Copyright (C) by Yahoo! Research.

 The Initial Developer of the Original Code is Shravan Narayanamurthy.

 All Rights Reserved.
******************************************************************************/
/**
 * \page usage Using Y!LDA
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">The main purpose of the
 * Y!LDA framework is to let you infer the set of constituents (topics)
 * that can be used to represent your documents as mixtures of those
 * inferred constituents. This is also called learning a model for your
 * corpus, and it takes a significant amount of time.</P>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">Once a model has been
 * learnt, you can also use it to find out what topics a new document is
 * made up of by running LDA in test mode. In this mode, you provide the
 * learnt model as a parameter and use that model to infer the topic
 * mixture for new documents. This is very fast compared to learning the
 * model.
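 * </P>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">As a rough illustration of
 * why test mode is so much cheaper than learning, consider the following
 * hypothetical sketch (this is not Y!LDA's actual sampler; the function
 * and parameter names are illustrative assumptions). The learnt model is
 * summarised as fixed per-topic word distributions, and only the small
 * per-document topic mixture has to be estimated:</P>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch, not Y!LDA's actual inference code: estimate the
// topic mixture theta of ONE new document, holding the learnt model
// phi[k][w] (per-topic word distributions, rows summing to 1) fixed.
std::vector<double> inferTopicMixture(
        const std::vector<std::vector<double>>& phi, // K x V learnt model
        const std::vector<int>& doc,                 // word ids of the document
        int iterations = 50) {
    const std::size_t K = phi.size();
    std::vector<double> theta(K, 1.0 / K); // start from a uniform mixture
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> counts(K, 0.0);
        for (int w : doc) {
            // Responsibility of each topic for this token: theta[k] * phi[k][w].
            std::vector<double> r(K);
            double norm = 0.0;
            for (std::size_t k = 0; k < K; ++k) {
                r[k] = theta[k] * phi[k][w];
                norm += r[k];
            }
            for (std::size_t k = 0; k < K; ++k) counts[k] += r[k] / norm;
        }
        // Re-estimate the mixture from the soft topic counts.
        for (std::size_t k = 0; k < K; ++k)
            theta[k] = counts[k] / static_cast<double>(doc.size());
    }
    return theta;
}
```

 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">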
This can be done in two ways:</P>
 * <OL>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">Batch mode: if you
 * have a batch of new documents for which you want to infer the topic
 * mixture, use this mode.</P>
 * <LI><P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">Streaming mode: if
 * you have documents which are to be streamed through the binary, with
 * the topic mixture generated instantly, then this mode is for you.</P>
 * </OL>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">The main difference
 * between the two modes is that in batch mode the model is loaded for
 * every batch of documents, whereas in streaming mode the model is
 * loaded once and you can keep streaming documents through. In batch
 * mode you wait until all the documents in the batch have been assigned
 * topic mixtures, whereas in streaming mode the topic mixture is
 * assigned on a per-document basis. In streaming mode, only a part of
 * the model is loaded, depending on how much memory you allocate for
 * storing it. However, we ensure that the model for the most important
 * words (words that occur more frequently) is loaded before the model
 * for less frequent words, until the allocated memory is used up. The
 * model loading time can be significant if the model is big; hence the
 * case for a streaming mode, for those of you who do not want to bear
 * the cost of loading a big model every time you want to infer topic
 * mixtures for new documents.</P>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">Each of these modes is
 * described in detail, with an illustration, in the next subsection.</P>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">The other dimension to
 * consider is the size of the data.
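 * </P>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">The frequency-ordered
 * partial model loading used by streaming mode, as described above, can
 * be sketched as follows. This is only an illustrative sketch, not
 * Y!LDA's actual loader; the function name, data layout, and sizes are
 * assumptions:</P>

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch, not Y!LDA's actual loader: choose which words'
// topic counts to keep in memory, most frequent words first, until the
// user-allocated model budget is used up.
std::vector<std::string> selectWordsToLoad(
        std::vector<std::pair<std::string, long>> wordFreqs, // (word, corpus frequency)
        std::size_t bytesPerWord,   // assumed size of one word's topic-count row
        std::size_t memoryBudget) { // bytes allocated for storing the model
    // More frequent (more important) words come first.
    std::sort(wordFreqs.begin(), wordFreqs.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    std::vector<std::string> loaded;
    std::size_t used = 0;
    for (const auto& wf : wordFreqs) {
        if (used + bytesPerWord > memoryBudget) break; // budget exhausted
        loaded.push_back(wf.first);
        used += bytesPerWord;
    }
    return loaded;
}
```

 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">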
If your corpus has a couple of
 * million documents and you are willing to wait a couple of days, then
 * all you need to know is the single-machine setup, where all the data
 * resides on a single (usually multi-core) machine and you run LDA on
 * that machine.</P>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm">If your corpus is larger
 * than that, you have no choice but to resort to the multi-machine
 * setup. Here you split your data amongst multiple machines (each
 * possibly a multi-core one) and run LDA on all of them. The Y!LDA
 * framework currently provides a Hadoop-based distributed LDA mechanism
 * that uses a custom-implemented distributed hash table. The framework
 * provides hadoop-streaming scripts that do the entire thing (splitting,
 * training, testing) in a distributed fashion on the Hadoop cluster,
 * backed by HDFS as the distributed file system where your data is
 * stored. If you do not have a Hadoop cluster at your disposal, we
 * currently do not provide any scripts to run things automatically. You
 * will have to distribute the data manually, copy the binary version if
 * it is a homogeneous setup (else install the framework), run it on all
 * the nodes containing data, and finally execute binaries to merge the
 * distributed parts of the model. We will only detail the steps to
 * achieve this. Perhaps someone can contribute some scripts back so
 * that this can be done automatically!</P>
 * <P ALIGN=JUSTIFY STYLE="margin-bottom: 0cm"><BR>
 * </P>
 * <H2 CLASS="western">Single Machine Setup:</H2>
 * \ref single_machine_usage "Single Machine Usage"
 * <H2 CLASS="western">Multi-Machine Setup:</H2>
 * \ref multi_machine_usage "Multi Machine Usage"
 */