/*****************************************************************************
The contents of this file are subject to the Mozilla Public License
Version 1.1 (the "License"); you may not use this file except in
compliance with the License. You may obtain a copy of the License at
http://www.mozilla.org/MPL/

Software distributed under the License is distributed on an "AS IS"
basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
License for the specific language governing rights and limitations
under the License.

The Original Code is Copyright (C) by Yahoo! Research.

The Initial Developer of the Original Code is Shravan Narayanamurthy.

All Rights Reserved.
******************************************************************************/
/**
 * \page architecture Y!LDA Architecture
 * \section intro Introduction
 * Please refer to the Main Page for an introduction.
 *
 * \section goals Goals
 * The approach to topic modelling is to define a graphical model that
 * represents the generative assumptions the user makes about the corpus.
 * A graphical model is a probabilistic model representing the joint
 * distribution of the random variables involved, with a graph denoting
 * the conditional independence assumptions amongst them. Solving the
 * model means inferring its parameters from the actual data. This
 * inference is the hardest part of the approach.
 *
 * \subsection new_models Adding new Models
 * There are many variations on the basic LDA model, and with each
 * variation the inferencing logic changes: the parameters and the
 * sufficient statistics that need to be maintained will all be slightly
 * different. One of the main goals of this framework is to make the job
 * of adding new models simpler.
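 * To make the "parameters plus sufficient statistics" idea concrete, here
 * is a minimal sketch of how a model-specific statistics object might be
 * kept behind a small interface. All names here (Token, SufficientStats)
 * are hypothetical illustrations, not the framework's actual API.
 *
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: a model keeps its sufficient statistics behind a
// small interface so the surrounding infrastructure (storage, pipeline,
// scheduling) can stay model-agnostic.
struct Token { std::size_t word; std::size_t topic; };

class SufficientStats {
public:
    explicit SufficientStats(std::size_t num_topics)
        : topic_counts_(num_topics, 0) {}
    void add(const Token& t)    { ++topic_counts_[t.topic]; }  // on assignment
    void remove(const Token& t) { --topic_counts_[t.topic]; }  // on resampling
    long count(std::size_t topic) const { return topic_counts_[topic]; }
private:
    std::vector<long> topic_counts_;  // a variant model would differ here
};
```
 * A variant of LDA would swap in different statistics here while reusing
 * the rest of the machinery unchanged.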
 *
 * \subsection infrastructure Common Infrastructure
 * One task that is mostly common across models is the infrastructure
 * needed to store documents, load them, and create a pipeline that can
 * optimally utilize multi-core parallelism. In the framework we aim to
 * standardize on proven infrastructure that is known to provide efficient
 * implementations, so that the model writer only worries about adding the
 * parts that are relevant to the inference.
 *
 * \subsection scalability Scalability
 * Another main aspect of this framework is to substantially increase the
 * scale of the state of the art by utilizing parallelism, both multi-core
 * and multi-machine.
 *
 * \section components Main components of the System
 * Y!LDA uses the
 * <a href="http://dx.doi.org/10.1073%2Fpnas.0307752101">Collapsed Gibbs Sampling</a>
 * approach. There are four main components:
 * <OL>
 * <LI><B>Model:</B><BR/>
 * Encapsulates the parameters of the model and the sufficient
 * statistics that are necessary for inference.
 * <LI><B>Model_Refiner:</B><BR/>
 * Encapsulates the logic needed to refine the initial model: streaming
 * the documents from disk, sampling new topic assignments, updating the
 * model, performing diagnostics and optimization, and writing the
 * documents back to disk.
 * <LI><B>Pipeline:</B><BR/>
 * As can be seen above, the refiner performs a sequence of operations on
 * every document of the corpus. Some of them have to run serially, while
 * others can run in parallel. To exploit multi-core parallelism, the
 * Pipeline is composed of a set of operations called filters, each of
 * which can be declared to run either serially or in parallel.
 * The Pipeline comes with a scheduler that assigns the threads available
 * on the machine to run these filters in an optimal fashion.
 * <LI><B>Execution_Strategy:</B><BR/>
 * Encapsulates the strategy that decides which filters the pipeline is
 * composed of and how many times the documents are passed through it.
 * </OL>
 * \subsection builder Builder Pattern
 * The Builder pattern fits this approach very well. We implement a
 * Model_Builder that builds the last three components depending on which
 * model is needed and which mode the model is supposed to operate in.
 *
 * The Model_Builder creates an initial Model and creates the required
 * Model_Refiner by passing it the Model (or the necessary components of
 * the Model). It then creates a Pipeline and an Execution_Strategy as per
 * the mode of operation.
 *
 * The Director is straightforward. It directs the given Model_Builder to
 * create the necessary components and executes the defined
 * Execution_Strategy. This refines the initial Model created by the
 * builder into one whose parameters are tuned to the corpus on which it
 * was refined. The Model is then stored on disk for testing.
 *
 * \section multi-machine Distributed Set Up
 * To cater to the scalability goals, as detailed in
 * <a href="http://portal.acm.org/citation.cfm?id=1920931">An Architecture for Parallel Topic Models</a>,
 * the framework implements a distributed-memory, multi-machine setup that
 * exploits multi-machine parallelism to the fullest. The main idea is that
 * the inferencing happens locally while the state variables are kept
 * up-to-date with a global copy stored in a Distributed HashTable. Coming
 * up with an efficient distributed setup is difficult, and we definitely
 * do not want people to reinvent the wheel here.
 * So the framework tries to abstract the mechanism of distribution, the
 * implementation of an efficient distributed HashTable, and the mechanism
 * needed for Synchronization.
 *
 * \subsection distributed_map Distributed_Map
 * The framework implements a Distributed_Map interface using Ice, a very
 * efficient middleware. It provides both a Server and a Client
 * implementation.
 * <OL>
 * <LI>
 * <B>DM_Server:</B><BR/>
 * The server hosts a chunk of the distributed hash table and supports the
 * usual map operations. It also supports three special operations:
 * <UL>
 * <LI>put: accumulates the values instead of replacing them</LI>
 * <LI>waitForAll: a barrier implementation using AMD</LI>
 * <LI>putNGet: an asynchronous call that accumulates the passed value
 * into the existing one and returns the final value to the caller
 * through a callback mechanism</LI>
 * </UL>
 * </LI>
 * <LI>
 * <B>DM_Client:</B><BR/>
 * A client that presents a single hash-table view of the distributed
 * system. The client transparently supports rate-limited, sliding-window
 * based Asynchronous Method Invocation for putNGet, a very useful
 * operation for effective Synchronization. Refer to the VLDB paper for
 * more information.
 * </LI>
 * </OL>
 *
 * For most models, one need not worry about modifying the above; they can
 * simply be used as-is, without bothering much about their implementation.
 *
 * \subsection synchronizer Synchronizer
 * The framework provides a default implementation of the Synchronization
 * strategy detailed in [2]. The Synchronizer runs as a background thread
 * separate from the main threads that do the inferencing.
 * The actual task of synchronization is left to the implementation of a
 * Synchronizer_Helper class. The Synchronizer only creates slots for
 * synchronization and asks the helper to synchronize in those slots. It
 * also takes care of running the Synchronization only until the
 * inferencing is done.
 *
 * However, there is a strong assumption that the synchronization proceeds
 * in a linear fashion: the structures being synchronized are linear and
 * can be synchronized one after the other. This is implicit in the
 * Synchronizer's creation of slots.
 *
 * \subsection synchronizer_helper Synchronizer_Helper
 * Every model only has to provide the Synchronizer_Helper implementation,
 * which supplies the logic for synchronizing the model's relevant
 * structures, maintains copies of them where needed, and provides the
 * callback function for the AMI putNGet.
 *
 * \section default_impl Default Implementations provided
 * The framework provides default implementations of the Pipeline
 * interface and the Execution_Strategy interface.
 * <OL>
 * <LI><B>TBB_Pipeline:</B><BR/>
 * Uses Intel's Threading Building Blocks to provide the Pipeline
 * interface.
 * <LI><B>Training_Execution_Strategy:</B><BR/>
 * The default implementation of Execution_Strategy for LDA training.
 * Assembles the following pipeline for data flow:
 * \image html data_flow.png
 * \image latex data_flow.eps
 * <LI><B>Synchronized_Training_Execution_Strategy:</B><BR/>
 * The default implementation of Execution_Strategy that extends
 * Training_Execution_Strategy and adds Synchronization capability.
 * <LI><B>Testing_Execution_Strategy:</B><BR/>
 * The default implementation of Execution_Strategy for LDA testing.
 * </OL>
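 * The filter/strategy split above can be sketched in a few lines. This is
 * a toy illustration only: the Doc, Filter, and run names are hypothetical,
 * and the real framework delegates scheduling of parallel filters to
 * Intel TBB via TBB_Pipeline rather than looping sequentially.
 *
```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy sketch of a pipeline of filters, in the spirit of the
// Pipeline/Execution_Strategy split. Names are illustrative only.
struct Doc { std::string text; int topic = -1; };

struct Filter {
    bool parallel;                    // hint a real scheduler (TBB) would use
    std::function<void(Doc&)> apply;  // per-document operation
};

// A trivial strategy: pass every document through every filter, in order.
// A real scheduler would run 'parallel' filters on many docs concurrently.
void run(std::vector<Doc>& corpus, const std::vector<Filter>& filters) {
    for (Doc& d : corpus)
        for (const Filter& f : filters)
            f.apply(d);
}
```
 * An Execution_Strategy corresponds to choosing which filters go into the
 * list (e.g. read, sample, update, write) and how many passes to make.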
 *
 * \section unigram Unigram Model
 * The framework also provides the Unigram_Model implementations of the
 * various common interfaces. This is the basic LDA model with the
 * bag-of-words assumption. Please take a look at how the various
 * interfaces are implemented. The main implementations needed are Model
 * and Model_Refiner. Additionally, it implements efficient sparse data
 * structures to store the sufficient statistics.
 *
 * \section new_model Adding a new Model
 * Please use the Unigram_Model implementation as an example when
 * implementing new models.
 *
 * \section chkpt Checkpoints
 * The framework also provides checkpointing functionality for the
 * multi-machine setup in order to provide failure recovery. This is
 * implemented by an external object that knows how to do three things:
 * (a) serialize metadata to disk, (b) load previously serialized metadata
 * on request, and (c) serialize the data structures to disk.
 *
 * An appropriate checkpointer is passed as an argument while creating an
 * Execution_Strategy. The strategy uses the checkpointer to checkpoint at
 * regular intervals. At startup, it also checks whether any checkpoints
 * are available and, if so, starts up from that checkpoint.
 *
 * Different checkpointers are needed for different setups. For example,
 * the framework uses the Local Checkpointer when running in
 * single-machine mode, which only involves writing the iteration number
 * as metadata; all other data needed for restart is already being
 * serialized. For the multi-machine setup, a different mechanism is
 * needed, and a Hadoop Checkpointer is implemented.
 *
 * This is an ongoing effort and we will keep adding to both the code and
 * the documentation. We definitely need your help and contributions to
 * make this better.
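 * The three-operation contract above might look like the following
 * sketch. The class and method names are hypothetical (the real Local and
 * Hadoop checkpointers differ), and an in-memory map stands in for disk
 * or HDFS.
 *
```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch of the three-operation checkpointer contract.
class Checkpointer {
public:
    virtual ~Checkpointer() = default;
    virtual void save_metadata(const std::string& meta) = 0;  // (a)
    virtual std::string load_metadata() = 0;                  // (b)
    virtual void serialize(const std::string& state) = 0;     // (c)
};

// Toy stand-in: a map plays the role of files on disk.
class ToyLocalCheckpointer : public Checkpointer {
public:
    void save_metadata(const std::string& meta) override { store_["meta"] = meta; }
    std::string load_metadata() override { return store_["meta"]; }
    void serialize(const std::string& state) override { store_["state"] = state; }
private:
    std::map<std::string, std::string> store_;
};
```
 * A strategy would call save_metadata with, say, the iteration number at
 * regular intervals, and at startup call load_metadata to decide whether
 * to resume from a checkpoint.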
 *
 * Here is an initial set of TODOs:
 *
 * \todo Add unit tests to make the code more robust
 * \todo Add more code documentation for the Unigram_Model components
 * \todo Implement fancier models in later versions
 * \todo Implement extensions to the LDA model in later versions
 *
 * These are in no particular order and we might re-prioritize later.
 * Please mail me if you are interested in contributing.
 *
 * We shall use the git pull request (fork + pull) model for collaborative
 * development.
 */