Managing Diverse Sentiments at Large Scale

Mikalai Tsytsarau, Themis Palpanas

The large-scale aggregation and analysis of user opinions is becoming increasingly relevant to a variety of applications, from detecting social mood on political topics to tracking sentiment changes related to events. The analysis of diverse sentiments is another important application, made possible by the ability of modern methods to capture sentiment polarity on various topics with high precision and at an ever-growing scale. Therefore, there is a need for a scalable way to aggregate sentiments along the time dimension that stores enough information to preserve diversity and allows statistically accurate analysis of sentiment trends and opinion shifts.
In this paper, we focus on the novel problem of aggregating diverse sentiments at a large scale, based on data sources that are continuously updated. First, we develop a theoretical framework that models sentiment diversity (contradiction) and defines two types of contradictions, depending on the distribution of sentiments over time. Second, we introduce novel measures that capture sentiment diversity from aggregated sentiment statistics. Third, we develop robust and scalable indexing and storage methods for diverse sentiments. Finally, we propose an adaptive approach for identifying contradictions at different time scales. The experimental evaluation demonstrates the effectiveness of the proposed method in capturing contradictions and its superiority over relational databases in real-world scenarios.

Source Code

You may freely use this code for research purposes, provided that you properly acknowledge the authors using the following reference:

Mikalai Tsytsarau, Themis Palpanas. Managing Diverse Sentiments at Large Scale. IEEE Transactions on Knowledge and Data Engineering (TKDE), 28(11), 2016.

Zip file with source code for all the algorithms used in the paper.

Synthetic Datasets

To evaluate the accuracy and performance of our method, the provided source code can generate a synthetic dataset containing time series of sentiments that follow an artificial trend with opinion shifts, contradictions, and a controlled amount of noise. To create this dataset, the algorithm simulates a large volume of sentiments with timestamps following a Poisson distribution with an average rate of 1 to 10 sentiments per day, and with polarities sampled from normal distributions. A fraction of the generated sentiments follows a planted trend with dispersion 0.125, while the rest, controlled by the noise parameter, are distributed randomly with dispersion 0.5 and mean 0.0. The relative amount of noise sentiments ranges from 0% to 40% in steps of 10%. In the paper, we generated 1000 sentiment trends and stored the corresponding original time series (with 0% random noise) in the CTree, also duplicating them and adding noise at each of the above levels. Overall, we stored 5000 time series in the CTree (1000 trends at each of the five noise levels).
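
The generation procedure described above can be illustrated with a minimal Python sketch. It is not the code from the zip file, and it rests on several assumptions: timestamps are drawn as a Poisson process (exponential gaps between sentiments), "dispersion" is interpreted as a standard deviation, and the function names, the seed handling, and the example trend with a single opinion shift at day 50 are all hypothetical. Noisy series are generated directly here rather than by duplicating a clean series, which is a simplification.

```python
import random


def generate_series(days, rate, noise_fraction, trend, seed=0):
    """Return a list of (timestamp_in_days, polarity) pairs.

    rate           -- average number of sentiments per day (1 to 10 in the text)
    noise_fraction -- share of sentiments that ignore the planted trend (0.0 to 0.4)
    trend          -- hypothetical callable mapping a timestamp to the trend mean
    """
    rng = random.Random(seed)
    series = []
    # Assumption: exponential inter-arrival gaps, i.e. Poisson arrivals.
    t = rng.expovariate(rate)
    while t < days:
        if rng.random() < noise_fraction:
            # Noise sentiment: mean 0.0, dispersion 0.5 (taken as std dev).
            polarity = rng.gauss(0.0, 0.5)
        else:
            # Trend sentiment: planted mean, dispersion 0.125 (taken as std dev).
            polarity = rng.gauss(trend(t), 0.125)
        series.append((t, polarity))
        t += rng.expovariate(rate)
    return series


def shifting_trend(t):
    # Hypothetical planted trend: an opinion shift from +0.5 to -0.5 at day 50.
    return 0.5 if t < 50 else -0.5


# One clean series plus one series per noise level (10% to 40%).
clean = generate_series(days=100, rate=5.0, noise_fraction=0.0, trend=shifting_trend)
noisy = {
    level: generate_series(100, 5.0, level / 100, shifting_trend, seed=level)
    for level in (10, 20, 30, 40)
}
```

Repeating this for 1000 different trends would give the 1000 clean and 4000 noisy series mentioned above. Polarities are left unclipped in this sketch because the text does not state a polarity range.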

Real Datasets

Our method of contradiction detection was tested on four real datasets.

The first dataset consists of drug reviews collected from the DrugRatingz website. It contains 2701 positive, 352 neutral and 1616 negative reviews for 477 drugs. These reviews are provided by people who took a specific drug. They describe their personal experience with the drug, including contraindications that occurred.
The second dataset is derived from comments on YouTube videos, collected at L3S. It contains approximately 6 million comments, with an average of 500 comments per video.
The third dataset contains comments on postings from Slashdot, provided for the CAW2 workshop. Slashdot is a popular website for people interested in reading about and discussing technology and its ramifications. It publishes short story posts, which often incite many readers to comment and provoke discussions that may continue for hours or even days. The dataset contains about 140,000 comments under 496 articles, covering the period from August 2005 to September 2006.
The fourth dataset was created by selecting 30 trending topics from Twitter, which featured the most prominent events over a period of half a year, from June 2009 to December 2009.

The above datasets are contained in the following two files:

Zip file with comments from the first three data sources, with sentiment annotations and model training data.
Zip file with the Twitter dataset of time series for 30 contradictory topics.