Social media today plays a very important role in the politics of almost every nation. In India, there has been a boom in social media political campaigns, most marked in the 2014 General Elections. All major political parties used social media to promote their manifestos and to interact with the public in a one-to-one manner. They analysed the different views of the people and worked on them. The Home Minister, in a seminar after the 2014 general elections, also said that “Through social media, the government is outlining its plan, its vision”. Since assuming power, the NDA government has been using Twitter, Facebook and blogs to outline its plans and vision, showcase its progress, and share other updates.
One of the major reasons for this was to involve, and cater to the interests of, the youth of the country, who are active on most social media websites. Owing to this, political parties have invested huge amounts of funds in social media campaigning, thereby also increasing their reach.
Our earlier work on the 2014 General Elections involved heavy analysis of Twitter data. The work was supplemented by a portal that captured the intricacies of the political campaign leading up to the election, as mirrored on social media.
To kick off the next round of analysis for the upcoming general elections of 2019, we decided to revisit the data gathered in 2014. We had collected over 21 million posts by roughly a million handles in the round of analysis that ran in 2014. An inspection of the handles led us to find that only 31.64% of them are still active (defining activity as the handle having posted at least one tweet in 2018). Deleted and suspended handles constitute 15.65% and 19.81% of the total users, respectively.
We are building a portal to analyze the 2019 data and help make sense of the data being generated on social media. Stay tuned as we publish further analysis for the forthcoming election cycle. Below are some images from the portal landing page. We hope to do more analysis of the data that we are collecting from the 2019 elections.
If you have any questions for us to answer, please drop an email to pk[at]iiitd[dot]ac[dot]in. We will be happy to answer them and credit you for them.
With increased online participation on social media, privacy concerns have risen to unprecedented levels. It has become extremely important to give individuals full control of their private information. Popular mobile applications integrated with Online Social Networks (OSNs) allow the networks to access users’ private information, such as their contact lists. This might allow OSNs to create shadow profiles of non-users from the data of existing users. We test this hypothesis for the first time on Twitter, and further evaluate how well the location and biographical vector of a user can be predicted from the information given by friends who created a Twitter profile before that user.
To get an unbiased dataset, we collected 1,017 random Twitter users, which we call ego users, using the random digit search method. We obtained their metadata and filtered out spam users and celebrities by keeping only accounts whose follower-to-friend ratio lay between 0.1 and 10. To maintain homogeneity, we kept only users whose Twitter account language is English, and obtained their timelines (up to 3,200 tweets). We identified the users mentioned at least 4 times by the ego users and used these links as an approximation of the underlying social network between Twitter users that is revealed when users share their contact lists through mobile phone apps. This gave us a dataset of 68,447 alter users.
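The two filters described above can be sketched in a few lines of Python. This is only an illustration: the function names, the handling of accounts with zero friends, and the sample mentions are our assumptions, not the code used in the study.

```python
from collections import Counter


def is_regular_user(followers, friends, low=0.1, high=10.0):
    """Keep accounts whose follower-to-friend ratio lies in [low, high]."""
    if friends == 0:
        # The ratio is undefined here; dropping such accounts is an assumption.
        return False
    return low <= followers / friends <= high


def alters_from_mentions(mentions, min_mentions=4):
    """Return the handles mentioned at least `min_mentions` times by one ego user."""
    counts = Counter(mentions)
    return {handle for handle, n in counts.items() if n >= min_mentions}


# Hypothetical mentions extracted from one ego user's timeline.
mentions = ["@a"] * 4 + ["@b"] * 3 + ["@c"] * 6
print(is_regular_user(followers=250, friends=100))
print(sorted(alters_from_mentions(mentions)))
```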
We identified the location of our users from their geotagged tweets and location provided by them in their Twitter Bio. We normalized these locations using Google’s Geocoding API and identified the City, State, and Country pertaining to each location. This way, we were able to locate 630 ego users and 38,936 alter users in our dataset. We further mapped each location to a unique set of geo-coordinates.
Figure 1 shows the locations of users in the dataset, illustrating that users come from a wide variety of countries but are generally located in countries where Twitter adoption is high and the users of similar locations are more associated with each other.
We processed the Twitter Bio of each user by removing stop words and converting the tokens into stems. We considered only those users who had at least 3 tokens in their bio, which gave us 49,576 alters and 676 ego users. To this text we applied a pre-trained 100-dimensional Doc2Vec model, and then reduced each vector to its two most informative dimensions with Principal Component Analysis.
The Twitter API also provides the source of each tweet, which identifies how the tweet was produced. We mark all alters that produced at least one tweet with the source “Twitter for iPhone” or “Twitter for Android” as “disclosing alters”, since they used a mobile application that accessed their mobile contact lists. This way we obtain 934 ego users and 53,724 alters, which amounts to 78% of our dataset.
Shadow Profile Problem
For each ego user, we identified the preceding alters that had joined Twitter before ego user. Some of the alters disclosed their contact lists (red) and others did not (blue); see Figure 2. The shadow profile problem consists of the inference of personal information of the ego user based only on the information given by disclosing preceding alters, ignoring all data from non-disclosing preceding alters and alters that joined Twitter after the ego user.
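A minimal sketch of this selection step, assuming each user record carries a join date and the set of tweet sources seen in their timeline. The `User` class and its field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Tweet sources treated as evidence of contact-list disclosure (named in the text).
MOBILE_SOURCES = {"Twitter for iPhone", "Twitter for Android"}


@dataclass
class User:
    handle: str
    joined: date
    sources: frozenset  # tweet sources observed in the user's timeline


def is_disclosing(user):
    """A user is 'disclosing' if at least one tweet came from a mobile app."""
    return bool(MOBILE_SOURCES & user.sources)


def disclosing_preceding_alters(ego, alters):
    """Alters that joined before the ego user and are disclosing."""
    return [a for a in alters if a.joined < ego.joined and is_disclosing(a)]


ego = User("ego", date(2012, 5, 1), frozenset({"Twitter Web Client"}))
alters = [
    User("a", date(2010, 1, 1), frozenset({"Twitter for iPhone"})),   # kept
    User("b", date(2011, 1, 1), frozenset({"Twitter Web Client"})),   # not disclosing
    User("c", date(2013, 1, 1), frozenset({"Twitter for Android"})),  # joined later
]
print([u.handle for u in disclosing_preceding_alters(ego, alters)])
```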
To predict the location of ego users, we took the locations of all disclosing alters and identified the most frequent city among them, i.e., the modal predictor. We used this location as the unsupervised prediction, to be compared against the ground-truth location of the ego user. We evaluated the quality of the prediction by measuring the Haversine distance in Km between the predicted point and the ground truth, which is our error distance. We predicted the biographical vector of each ego user as the average vector of its disclosing alters and evaluated this prediction through the cosine similarity of the predicted and ground-truth vectors. A high similarity therefore means a high accuracy of the predictor.
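The predictors and error measures described above are standard, and a rough Python sketch may help make them concrete. The function names and the Earth radius of 6371 Km are our choices for illustration; the study's own implementation may differ.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def haversine_km(p, q):
    """Great-circle distance in Km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def modal_city(cities):
    """Most frequent city among the disclosing alters (the modal predictor)."""
    return Counter(cities).most_common(1)[0][0]


def mean_vector(vectors):
    """Component-wise average of the alters' biographical vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

For an ego user, the location error would then be `haversine_km(coords[modal_city(alter_cities)], ego_coords)`, where `coords` is a hypothetical mapping from city name to coordinates, and the biographical score would be `cosine_similarity(mean_vector(alter_vectors), ego_vector)`.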
We evaluated both the predictors against a Random Null Model which took a uniform random sample of all users for prediction. For each projection, we generated 100 instances of Null Model and took the average result over those 100 predictions.
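One way to read this null model is sketched below: for each instance, a user is drawn uniformly at random from the whole dataset and its value is used as the prediction, and the error is averaged over the instances. This reading, the function name, and the pluggable `distance` argument are assumptions made for illustration.

```python
import random
from statistics import mean


def null_model_error(truth, all_values, distance, n_instances=100, seed=0):
    """Average error of predictions drawn uniformly at random from all users.

    `distance` is any error function, e.g. a Haversine distance for locations.
    """
    rng = random.Random(seed)
    errors = [distance(rng.choice(all_values), truth) for _ in range(n_instances)]
    return mean(errors)


# Toy one-dimensional example with absolute difference as the error.
print(null_model_error(5.0, [0.0, 10.0], lambda a, b: abs(a - b)))
```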
In Figure 3, the left panel shows the Cumulative Distribution Function (CDF) of the prediction error of user locations when using only the data of disclosing alters. Black lines indicate empirical errors and the blue line depicts the errors in the Null Model, revealing that empirical errors (median = 68.7 Km) are much lower than the Null Model errors (median = 6308.9 Km). The right panels show the regression profile of the empirical error versus the number of disclosing alters in Twitter. The line shows the model estimate and the shaded area its standard error. Prediction error decreases with the number of disclosing alters in Twitter.
To make a stronger case for a realistic scenario, we used the fact that not all Twitter users have the Twitter mobile application installed or have granted it access to their contacts. We therefore made predictions given a probability ρ that a user will share his/her data. For each alter, we picked a random number in the range of 0 to 1 and compared it with our selected probability ρ, keeping the alter only if the number fell below ρ. This left us with roughly ρ×100% of the alters for prediction at a particular ρ.
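The sampling step can be sketched as follows. The function name and the seeded generator are illustrative assumptions; the rule of keeping an alter when its random draw falls below ρ is what yields about ρ×100% of the alters.

```python
import random


def sample_alters(alters, rho, rng):
    """Keep each alter independently with probability rho."""
    return [a for a in alters if rng.random() < rho]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
alters = list(range(10000))  # stand-in for alter identifiers
kept = sample_alters(alters, 0.3, rng)
print(len(kept) / len(alters))  # close to 0.3
```

Repeating this for many samples at each ρ, and recomputing the prediction on the kept alters each time, gives a distribution of errors per ρ.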
In Figure 4, the left panel shows the median error of location prediction in 1000 samples for each value of ρ∈[0.1,0.9]. The median error approaches the value of the error when ρ=1, using all alters, which is 72 Km. The inset shows the error of the Null Model, which is several orders of magnitude larger than the error of shadow profiles. The right panel shows stratified regression lines of median error as a function of the number of alters in the samples, revealing that error decreases with the number of alters for the different values of ρ.
Biographic Vector Prediction
In Figure 5, the left panel shows the median cosine similarity of predictions and the Null Model in 1000 samples for each value of ρ. The cosine similarity of predictions outperformed the Null Model for ρ>0.2 and increased with ρ. The right panel shows the regression analysis of cosine similarity versus the number of friends on Twitter, revealing a trend of growing similarity with the number of friends.
The error level for shadow profiles of location (68.7 Km) is comparable to error levels using full information, which are typically between 57.2 Km and 28.3 Km.
Our results demonstrate that even if as few as 30% of the people in your network disclosed their information, your private information could be inferred with significant accuracy.
Limitations of our study:
Historical audit using future data as ground truth
Using mentions network to determine friendship link
Biographical vectors don’t allow the straightforward interpretation of user interests
The implications of our results are clear: individuals do not have full control over their privacy, and the decisions of other people mediate the decision not to share information with online services, which means that we cannot conceive online privacy as a purely individual phenomenon that can be reduced to the choices of a person.
Please find the full paper, accepted at EPJ Data Science Journal 2018, here for a detailed description of our work. This is joint work with Dr. David Garcia, Amod Agrawal, and PK.
Back in 2014, when I came to know about Dr. PK, he was associated with Backpack, FindAWay, IDEA and other cool things that were going on around the campus. It was very intriguing, because I did not know much about him apart from that and the courses he taught. A little later, I found out about Precog, the research group that he runs at IIITD. For me, Precog was this intimidating elite group that I would never be able to be a part of. But oh, how wrong I was, and so are you if you ever felt that way. Trust me, I am an insider. 😛
Fast forward to 2015: I saw many of my seniors going for HCI, and very soon after that I realised this was the direction I wanted to work in or be closely related to. I have fallen more in love with the field ever since, so there was no question when DHCS was offered by Dr. PK in Winter 2016! I was more than excited, and that continued all through the course! Dr. PK is such an amazing professor. He makes sure that lectures are interactive and interesting, and there are surprise activities too, giving you plenty of reasons to get out of bed in the morning. He builds your assignments up to your final project, and helps students get feedback from each other through critiques, and from himself too! He makes a lot of effort to ensure students are learning hands-on, which is commendable. It was a one-of-a-kind course at IIITD for me, at least before I graduated.
I really wanted to work on an HCID project over the summer, and I started interacting with PK from time to time about it. What is great about Dr. PK is that he is so helpful: he will guide you on your interests, tell you about resources where you could find opportunities, and even offer you the chance to apply for an internship at Precog. I could not believe it when he did that, but a task and a conversation later, when I was in, I really could.
Being a part of Precog gives you a sense of belonging, and the pillars (the scholars of the lab) help Dr. PK provide a learning ground for everyone in the lab! It is always fascinating to listen to him, and if you decide to act on anything you learn from him, it has the potential to work wonders! There are several good things that are part of the Precog culture. One of them is the mailing lists! Even though they have the potential to overshadow all your other emails on some days, I think those discussions, and looking at everything with a ‘what can I do with this’ eye, make you critical of the things going on around you. It is just one of the really helpful things I have learnt and take forward from Precog to everywhere I go.
On certain days the lab feels like a festival, while on others you’ll see people working hard at their desks and in the CERC lounge, where even a peep will sound like screaming in a crowd. 😛 Now, I know I have painted a certain picture here, but believe me, it is not all that rosy. Being a part of Precog is certainly an adventure in itself. You get to have a lot of fun, but the people here work so hard that it sometimes amazed me. I have had a stretch of time where I was afraid of working on a certain thing and I procrastinated. It only led to guilt, because I could not contribute to WhatsUp (the weekly update meetings of the whole group), which pushes you to finally try what you fear and get better.
If you have known Dr. PK for even a little time, you’ll know he loves to be vague 😛 To be honest, I felt put off by that approach at first and I got intimidated, but with time I have not only accepted that methodology, I am now trying to apply it to my own life. There are so many little things that you will learn from PK if you become a part of Precog or ever interact with him, little things that will go a long way if you listen closely. Precog is not just this but a lot more, something that cannot be put into words in this post. The best part about all this is that for you, Precog will be totally different; it will be what you make of it.
India is going digital in a big way; from banking to manufacturing to agriculture, each field is seeing the penetration of technology. Police organizations have also started using technology for effective policing. Most police organisations now have an official website, a Facebook page and a Twitter handle. Police use these new media services not only to showcase their organisation but also to interact with citizens very regularly. Police posts on Facebook or tweets on Twitter cover a variety of topics, ranging from traffic advisories to awareness creation to bragging about their achievements. Similarly, the growing technology-savvy population of India is using these media to share their grievances, concerns, etc. with the police. With a handful of police officers serving 1.25 billion people, it is no surprise that a lot of posts/tweets by citizens go unnoticed by the police. Even features like tagging police commissioners and police accounts do not always yield the expected response, causing a sense of resentment. The police too find themselves helpless given the multitude of things they must handle.
With our continued interest in empowering police organizations with technology which can help them in their day-to-day activities, we have been working in the space of online social media and policing for some time now. For our research publications in this space, please visit here. For effective communication between the citizens and police, it is necessary for the police to understand the vast amount of content generated on their social media accounts. In this direction, we started thinking about how to break up the content into important versus unimportant, urgent versus non-urgent, etc. Our main aim in this research was to help police identify ‘serviceable’ content which can be served quickly and efficiently. Requests to which police should respond, evaluate or take action are considered as serviceable requests.
We analyzed 85 official Facebook pages of police organizations in India and studied the nature of posts that citizens share on police Facebook pages. Not all posts require the same amount of attention from the police: in some cases immediate action needs to be taken, while others can wait. Based on this analysis, we came up with six textual attributes that can identify serviceable posts, i.e., posts that need some kind of police response. We find that such posts are marked by high negative emotions and by more factual and objective content, such as the location and time of incidents.
We identify four types of response that citizens may get on their posts:
(a) Forward: Posts which had enough information and could be forwarded to appropriate authorities for action. For instance, a resident posted, Date : 4/11/2015 (Wednesday), Time : 10:17 pm, Number : [withheld], Location : [withheld], Violations : Crossing line by way too much obstructing the vehicles which were coming from [withheld] entrance later he jumped the signal ……..
(b) Give Solution: Posts mostly included queries by residents to police that could be answered without any detail; resident asks, Admin !! Can U Explain to Me How Two Challans On Same Date Same Time in Just 5 Minutes Gap !! How Its Possible ?? Any Thing Wrong ??
(c) Acknowledge with thanks: Posts to which the police wrote “thanks for sharing the information” or “thanks for the appreciation.” For instance, resident remarks, Chennai City Traffic Police a humble salute from a fellow Chennaiite for the commendable job in such rains!!
(d) Need more details: For these residents’ posts, police asked for more details so that action could be taken, e.g., a resident asks, Cops driving wrong side [of road] near XXX hotel .. what action will be taken against them ? This post lacks information such as the time and date when the incident happened.
To enhance response to serviceable posts, we propose a request – response identification framework. The approach followed in the paper is shown below:
Understanding Requests from Citizens:
Residents often use different language styles in posts while expressing their concerns and asking queries to police. Our approach includes the following six categories of features to characterize serviceable posts: Emotional Attributes, Cognitive and Interpersonal Attributes, Linguistic Attributes, Question Asking Attributes, Entity-Based Attributes, and Topical Attributes. These include both handcrafted features and LDA / NMF based features that help automatically discover the latent dimensions and induce semantic features in our data.
Our analysis shows some intriguing results:
Serviceable requests show significantly higher values of negative emotional states, i.e. “anger” (+15.38%), “disgust” (+47.8%), “fear” (+60%), and “sadness” (+10%), in comparison to non-serviceable requests. The most frequent topic includes queries / questions posed to police (the Complaints topic represents complaints against cops’ incorrect decisions).
Comparing serviceable sub-types, we observe that 93.10% of posts in the Thanks sub-type did not receive a response from police. Posts in the Forward sub-type received the maximum number of responses from police (63.6%, 182 posts). Table 1 below summarizes the number of posts that did not receive police responses.
Table 1: Number of posts that received responses (N of Events) and censored events, i.e., posts that did not get a response from the police.
Automated Classifier for Serviceability:
Our work explores a series of statistical models to predict serviceable posts and their different types. The models make use of content-based measures: emotions, cognitive attributes, linguistic attributes, questions posed, entities and topical attributes. We explore five different classification algorithms, Random Forest (RF), Logistic Regression (LR), Decision Trees (DT), Adaptive Boosted Decision Trees (ADT), and Gradient Boosting Classifier (GBC), using balanced class weights. Table 2 below reports the performance of different algorithms to correctly identify serviceable posts.
Table 2: Mean Performance after 10-fold CV of different algorithms to correctly identify serviceable posts.
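To illustrate what “balanced class weights” usually means, here is a small sketch of one common definition, the one scikit-learn documents for `class_weight="balanced"`: each class is weighted by n_samples / (n_classes × class count). Whether the paper used exactly this scheme is our assumption, and the labels below are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced label set: 2 serviceable posts, 8 others.
labels = ["serviceable"] * 2 + ["other"] * 8
print(balanced_class_weights(labels))
```

With weights like these, errors on the rarer class cost more during training, which keeps a classifier from simply predicting the majority class.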
Through our work, we believe technological interventions can help increase the interactions between police and citizens and thereby increase the trust people have in the police. The police too may gain a more directed, cost- and labour-efficient mechanism for dealing with any law and order situation reported on their Facebook page. This will increase the overall well-being and safety of society.
Full citation & link to the paper: Sachdeva, N., and Kumaraguru, P. Call for Service: Characterizing and Modeling Police Response to Serviceable Requests on Facebook. Accepted at the ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW), 2017. PDF
College students’ mental health concerns are a persistent issue; psychological distress in the form of depression, anxiety and other mental health challenges among college students is a growing health concern. However, very few university students actually seek help for mental illness. This is due to various barriers, such as limited knowledge of available psychiatric services and social stigma. Further, there is a dearth of accurate, continuous and multi-campus data on mental well-being, which presents significant challenges to intervention and mitigation strategies on college campuses.
Recent advances in HCI and social computing show that content shared on social media can enable accurate inference, tracking and understanding of the mental health concerns of users. There has also been work showing that college students appropriate social media for self-disclosure, support seeking and social connectedness. These facts, coupled with the pervasiveness of social media among college students, motivated us to examine the potential of social media as a “measure” for quantifying the mental well-being in a college population. Specifically, we focused on the following research goals:
Building and validating a machine learning model to identify mental health expressions of students in online communities
Analysing the linguistic and temporal characteristics of the identified mental health content
Developing an index for the collective mental well-being in a campus, and examining its relationship with university attributes like academic prestige, enrollment size and student body demographics
We obtained a list of 150 ranked major universities in the US by crawling the US News website. We also obtained university metadata like gender distribution, tuition/fee during this crawl. Next, we crawled the Wikipedia pages for these 150 universities for extracting the student enrollment, type of university (public/private) and the setting (city/urban/suburban/rural) at each institute. Lastly, we obtained information on the racial diversity at each university from an article on Priceonomics. We study these universities in our work and use the metadata in our analysis.
For social media data, we focus on Reddit. Reddit is known to be a widely used online forum and social media site among the college student demographic. Its forum structure allows the creation of public online communities (known as “subreddits”), including many dedicated to specific college campuses. This allowed us to collect a large sample of posts shared by university students in one place. Although Facebook is likely more popular among students, the content shared there is largely private, which makes it difficult to obtain data at this scale for such studies. Further, the semi-anonymous nature of Reddit enables candid self-disclosure around stigmatized topics like mental health.
After a manual search for subreddits for each university, we were able to identify public subreddit pages for 146 of the 150 universities. Next, we focused on correcting the “under-adoption” bias in subreddits. Subreddits which had a small fraction of Reddit users (as compared to university enrollment) were filtered out as under-represented. This left us with 109 universities with adequate Reddit representation. We leveraged the data on Google BigQuery (combined with some additional data collection) to get all posts ranging from June 2011 to February 2016. The final dataset used for our analysis included 446,897 posts from 152,834 unique users.
Since Reddit data does not contain any gold standard information on whether a post in a university subreddit is a mental health expression, our first goal was to use an inductive transfer learning approach to build a model to identify such content in a university subreddit. First, we include (as ground truth data) Reddit posts made on various mental health support communities. Prior work has established that, in these communities, individuals explicitly self-disclose a variety of mental health challenges. We use these posts as the “positive” posts and, in parallel, we use another set of Reddit posts, made on generic subreddits unrelated to mental health, as “negative” posts. We obtain 21,734 posts for each category, which we use as the positive and negative class for building a classifier. We observed a validation accuracy of 93% and an accuracy of 97% on a test set of 500 unseen, expert-annotated posts from our university subreddit data. We then used this classifier to label the 446,397 other posts across the 109 university subreddits. Our classifier identified 13,914 posts (3.1%) as mental health expressions, whereas the remaining 432,483 posts were marked as not about the topic. These posts came from 9,010 unique users out of a total of 152,834.
Next, we looked at the linguistic characteristics of the posts identified to be mental health expressions by conducting a qualitative examination of the top n-grams uniquely occurring in these posts. We found that students appropriate the Reddit communities to converse on a number of college, academic, relationship, and personal life challenges that relate to their mental well-being (“go into debt”, “doing poorly in”, “only one homework”, “up late”, “the jobs i”). The n-grams also indicated that certain posts contained explicit mentions of mental health challenges (“psychiatric”, “depression”, “killing myself”, “suicidal thoughts”), as well as the difficulties students face in their lives due to these experiences (“life isnt”, “issues with depression”, “was doing great”, “ruin”, “cheated”). Some of the top n-grams were also used in the context of seeking support (“need help”, “i really need”, “could help me”).
For the temporal analysis of mental health content, we first study the proportion of posts with mental health expression across the years. The figure below shows the content per year (along with a least squares line fit). We observed that the proportion of posts with mental health expressions has been on the rise — there is a 16% increase in 2015, compared to that in 2011.
We then looked at how this trend varies over the course of an academic year. The plots below show the trend separately for universities following the semester system and the quarter system. Between August and April, for the universities in the semester system, we observed an 18.5% increase in mental health expression; between September and May, for those in the quarter system, the increase was much higher, at 78%. On the other hand, we observed a reverse trend in mental health content during the summer months, for both semester and quarter system universities.
Lastly, as a part of our third research goal, we formulated an index we refer to as the Mental Well-Being Index (MWI), as a measure of the collective mental well-being in a university subreddit, based on the posts labelled as mental health related by the classifier. We then computed the MWI metric for all 109 subreddits and examined its relationship with the university attributes.
By visualising these relationships (as above), we gleaned several interesting observations. We found:
Universities with larger student bodies (enrollment) as well as greater proportion of undergraduates in their student bodies tend to be associated with lower MWI
MWI of the 66 public universities we consider is lower than that of the 43 private universities, by 332%
MWI is lower in the 7 rural and 33 suburban universities by 40-266% compared to others, while it is the highest in the 31 universities categorized to be in cities (by 29-77%)
Universities with higher academic prestige (or low absolute value rank) and higher tuition tend to be associated with higher MWI
MWI tends to be lower in universities with more females (or sex ratio, male to female <= 1) by 850%
Further, although our data shows a marginally lower MWI in universities with greater racial diversity, we did not find statistical significance to support this claim.
Our work here (the complete paper accepted at CHI 2017) further details our analysis in depth. Below is an infographic for our work.