Description |
The size and dynamism of the Web pose challenges for all its stakeholders, which include producers and consumers of content, and advertisers who want to place advertisements next to relevant content. A critical piece of information for these stakeholders is the demographics of the consumers who are likely to visit a given web site. However, predicting those demographics, while essential, remains a challenging task. Hence, in this dissertation we ask the following questions: Is it possible to deduce the audience demographics of a web site based solely on local cues such as the design or the content of the web site? If so, is it design, content, or a combination of the two that provides a good predictive model? In addition to the design and the content, is it also possible to use the semantics embedded within the content to further improve prediction performance? We explore these questions with statistical analyses as well as predictive models built using various modeling schemes. From the results, we find that it is indeed possible to effectively predict the demographics of a web site's consumers using cues embedded in the design or the content of its homepage. In addition, we build and evaluate an ensemble classifier that combines the predictions from both design and content cues. An analysis of the ensemble suggests that combining the two kinds of cues may yield better prediction. In addition to the classification-based predictive model, which predicts a discrete demographic class (e.g., female) within each demographic dimension (e.g., gender), we also explore a regression-based predictive model that predicts the demographic composition (e.g., 63.5% female) of a web site, which is a continuous dependent variable. We show that this model also works effectively, with good estimation performance. 
Finally, we suggest a feature selection approach using the Latent Dirichlet Allocation (LDA) method and show that semantics extracted from web site content with this method can also be used to achieve competitive prediction performance while significantly improving prediction efficiency. The approaches in this study serve as low-burden complements to the more intrusive and costly registration- and cookie-based techniques. |