bitly could do better!

Recently, we got our hands dirty with some URL data from bitly, which consisted of suspicious URLs clicked by Internet users in October. We thank bitly, and particularly Brian David Eoff (senior data scientist) and Mark Josephson (CEO), for sharing this data with us. We analysed 269,973 URLs marked “suspicious” by bitly to understand how these links are posted and clicked. Figure 1 shows a graph of some of the most common domains for which multiple suspicious URLs were shortened using bitly. Domains like , , and had more than 1,000 URLs each which were marked suspicious by bitly.
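For readers who want to reproduce this kind of per-domain tally, here is a minimal sketch. It assumes the expanded (long) URLs are available as plain strings; the function name and the sample URLs are hypothetical and are not taken from the bitly dataset.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(long_urls):
    """Count how many URLs fall under each hostname."""
    counts = Counter()
    for url in long_urls:
        host = urlparse(url).hostname  # None when the URL has no network location
        if host:
            counts[host] += 1
    return counts


# Hypothetical sample data, for illustration only.
sample = [
    "http://example.com/a",
    "http://example.com/b",
    "http://EXAMPLE.com/c",
    "http://example.org/x",
]
domain_counts = count_domains(sample)
print(domain_counts.most_common(2))
```

Sorting the counter with most_common gives the kind of ranking a graph like Figure 1 could be drawn from.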

bitly uses real-time spam detection services like Google Safe Browsing and SURBL. However, there do not seem to be many measures in place against bitly users who spam. Many registered bitly users shorten spam links regularly. From the 269,973 suspicious URLs, we extracted 4,469 registered bitly users who have posted one or more of these links. After some data crunching, we found that 4,457 bitly users have posted at least 113 suspicious bitly URLs (Figure 2). If we analyse the history of URLs shortened by these users, we may find more spam links in their profiles. We plan to do this in the future. These users are allowed to stay on bitly even though they regularly post spam links, which are also heavily clicked by other Internet users through various media like emails, blogs and online social networks.

We looked closely at the top 20 users who posted the most suspicious URLs in our dataset and observed that the highest number of suspicious links posted by a single user is as large as 500 (Figure 3). Shortened URLs constitute a large fraction of spam on the Internet. Sixty-five percent of URLs targeting social media users are shortened URLs [1].
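The per-user counts discussed above can be sketched as follows. This is only an illustration under assumptions: the input is taken to be (username, short URL) pairs, and the function names and sample records are hypothetical rather than the format of the data bitly shared.

```python
from collections import Counter


def suspicious_urls_per_user(records):
    """Count distinct suspicious URLs per user from (username, short_url) pairs."""
    seen = set()
    counts = Counter()
    for user, url in records:
        if (user, url) not in seen:  # ignore repeated (user, URL) pairs
            seen.add((user, url))
            counts[user] += 1
    return counts


def users_with_at_least(counts, threshold):
    """Return the users who posted at least `threshold` suspicious URLs."""
    return sorted(user for user, n in counts.items() if n >= threshold)


# Hypothetical sample records, for illustration only.
records = [
    ("alice", "bit.ly/1"),
    ("alice", "bit.ly/2"),
    ("alice", "bit.ly/1"),
    ("bob", "bit.ly/3"),
]
counts = suspicious_urls_per_user(records)
top_users = counts.most_common(20)  # the 20 most prolific posters
```

On the real data, users_with_at_least would give the kind of threshold count reported alongside Figure 2, and most_common(20) the ranking behind Figure 3.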

We believe that if bitly suspends the registered users who constantly spread spam, or publicly marks them as malicious, it would discourage the use of bitly as a spamming service and deter malicious URLs from being shortened and spread on the Internet. We hope to see more features from bitly in the future that would help curb spam and malicious links to a greater extent.

We are investigating this data in more detail to develop more insights. One of the students is pursuing her Master's thesis on this topic. If you are interested in knowing more or want to give suggestions, please write to

Did you tweet your BBM PIN? We know about it

BlackBerry officially released BBM for Android and iOS on 22nd Oct, 2013, and within 24 hours the total downloads of the BBM app had hit 10 million. People are sharing their BBM PINs all over the Internet, the most popular places being Facebook, Twitter and blogs. Several hashtags related to BBM have been trending on Twitter at various locations, and this is what caught our eye: people sharing their PINs recklessly, seeking friends, bragging about finally getting a PIN before others or after a long wait! Do you think there is no harm in posting your PIN publicly? Then go through the several posts on BlackBerry forums about users being spammed through PIN-to-PIN messages. Publicly posting your PIN is a definite invitation to spam.

When such a huge amount of BBM PIN sharing caught our attention, Mayank and I (with some inputs from PK) quickly reused some of our old scripts and put together this website, which displays the most recent tweets where users have shared their BBM PINs in plain text.

There might be a few false positives or a couple of encoding errors here and there, but we have tried to keep the errors as low as possible. Have a look at the volume of tweets with BBM PINs. In a span of less than a day, we collected about 79K tweets talking about BBM and about 24K (and constantly increasing) tweets with BBM PINs in plain text. Some users are even sharing their PINs via Instagram images and screenshots of their phone screen.
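To give a flavour of how PINs in plain text can be picked out of tweets, here is a minimal sketch. It is not the script behind the website; it simply assumes that a BBM PIN is an 8-character hexadecimal string and that a tweet sharing one also mentions "BBM" or "PIN". The function name and sample tweets are hypothetical.

```python
import re

# Assumption: the tweet must mention BBM or PIN somewhere.
KEYWORD_RE = re.compile(r"\b(?:bbm|pins?)\b", re.IGNORECASE)
# Assumption: a PIN is a standalone token of exactly 8 hexadecimal characters.
CANDIDATE_RE = re.compile(r"\b[0-9A-Fa-f]{8}\b")


def extract_pins(tweet):
    """Return candidate BBM PINs (upper-cased) found in a tweet."""
    if not KEYWORD_RE.search(tweet):
        return []
    return [pin.upper() for pin in CANDIDATE_RE.findall(tweet)]


pins = extract_pins("Finally got BBM on Android! Add me, PIN: 7a2b3c4d")
```

A pattern this simple also matches things like 8-digit numbers in a tweet that happens to mention BBM, which is one way false positives of the kind mentioned above can creep in.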

If you are interested in knowing more, please write to pk [at] iiitd [dot] ac [dot] in.

A Day at Google, Bangalore, winning the Anita Borg Memorial Scholarship 2012

Amongst 30 unread emails in my inbox, there was one about the ‘Google Anita Borg Memorial Scholarship 2012’ application, which I quickly opened and closed without really reading the content. I had heard about the scholarship before and knew that it was awarded to very few students, mostly from top-notch institutes like the IITs. There was no chance of me getting through; I would rather read the other emails, I thought! But after a lot of persuasion from my father, I grumpily filled out the application form. The application asked for my academic details, achievements (I have very few!) and a couple of essays, and it took me 8 long hours to complete. Just a week later, the new semester started and I forgot all about the scholarship.

Around the end of March, I received a phone call from the Google office, and voilà! I had been declared a finalist for the scholarship and invited to their Bangalore office for a one-day conclave. And now, I was excited! Eager to see what happens at the conclave and hoping to get a lot of Google goodies!

I arrived at the Google office on 20th April and soon realized that it was going to be much more awesome than I had expected! The day kicked off with a quick, delicious breakfast followed by a talk by Yolanda Mangolini, the chief of the Google Diversity Program. She talked about various opportunities and programs at Google which bring together people from across the globe and also help them hold on to their identity in their own way. She briefly talked about the facilities and the work culture at Google, some of which was really amazing, like the flexible work hours and the easy transition from one project to another. The next talk, ‘Making Magic at Google’, spanned some of the most successful projects at Google and the journeys of the people involved in those projects. The next two talks were fairly technical: ‘Map Maker’ by Rachna Agarwal and ‘Android 4.0’ by Rajdeep Dua. Map Maker is one of the most brilliant and successful crowd-sourcing projects, in which the local people of a place built the ‘geographical map’ of their region or country because no high-quality satellite images were available for those areas. This is how you actually see accurate Google maps for many places in India and countries like Pakistan. Rajdeep gave us some insightful tips on how to design a user-friendly UI for mobile applications and some quick tips on building an Android application.

All the speakers encouraged us to ask questions and made their talks engaging in their own way, but what struck me the most about these Google researchers and engineers was their passion for their projects. The talks were followed by lunch at Google’s food court. There we got to talk with a lot of Google employees; we were free to roam around their offices and talk to just anybody! I had interesting chit-chat with a few people who told me about their projects, how they got into Google and what they plan next. Overall, it had been quite a satisfying day so far; little did we know that there was more to come!

Just after lunch was the ‘Icebreaker Session’. The 16 finalists were divided into four teams, and each team was given a stick, a cord and a ball and was asked to design and build an automated catapult within the next hour! We had to solve puzzles like sudoku, crosswords and the Soma cube to buy items to build the catapult. The team that could throw the ball farthest was to be declared the winner. Phew! The one-hour activity really showed us how important teamwork is. Finally, our team won! After this, the last social activity of the day was a career panel discussion. On the panel were the researchers and engineers we had interacted with previously and Keerthana Mohan from HR. They gave us tips on how to achieve one’s dream job or career. The discussion was less about Google; the speakers shared their experiences of doing a PhD, working, and pursuing research in industry.

The eventful day ended with the award distribution ceremony, where they declared the winners. I was least expecting to get the award and was taken aback when my name was announced. All of the participants at the conclave received a lot of Google goodies, much more than I had expected!

The whole experience at Google was exhilarating. Interacting with students from various institutes, the talks by Google researchers, and the chit-chat sessions with other Google employees taught me a lot and were a very effective way to learn about the work being done at other places. I would highly encourage others to apply for this scholarship and try to attend the conclave.

Below is a picture of all the participants (finalists) at the conclave.


PhishAri: Real-Time Phishing Detection on Twitter

We, at PreCog, not only do research but also try to build products based on our work for end-users. More often than not, developing scalable, real systems can be a challenging task; much more than just developing the underlying algorithm. It feels good to be part of a research group which has given me perspective to understand the need to create a bridge between research and real-world solutions. Here goes my first PreCog blog entry on one such product we (where I’m the lead) are developing, which aims to detect phishing on Twitter.

There has been a lot of research and many publications on spam detection on online social media, but there are not many real-world products which use these intelligent solutions. When we started with detection of phishing on Twitter, we decided to build a real-time system for Internet users based on our research, which we named PhishAri. Before we move on to how we built PhishAri, any guesses on what the name means? Well, it’s a combination of two words: Phish + Ari. “Phish” is short for “phishing” and “Ari” means “enemy” in Sanskrit; PhishAri combats phishing by detecting phishing URLs spread through Twitter.

From our previous studies and some prior work in this area, we identified various features which we could use for phishing detection on Twitter. These features include attributes of the URL, properties of the tweet, and properties of the Twitter user who posts the tweet. We thought that the best way to reach most Internet users would be a browser extension. So now, after someone installs the PhishAri browser extension, whenever they log on to Twitter they see a small color-coded indicator in front of any URL in the tweets in their timeline or Twitter search results; green indicates that the URL is safe and red indicates a phishing URL. Since this solution is built seamlessly into the browser, it is hassle-free and requires no additional software or packages other than the browser you use and the PhishAri extension. Currently, the PhishAri extension is available only for the Chrome browser, but we’ll soon launch it for Firefox and other browsers too.
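To make the three feature groups concrete, here is a rough sketch of how a feature vector could be assembled. The individual features below (URL length, hashtag count, follower ratio and so on) are illustrative assumptions, not the actual feature set used by PhishAri, and all function names are hypothetical.

```python
import re
from urllib.parse import urlparse


def url_features(url):
    """Illustrative attributes of the URL itself."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "host_is_ip": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "num_path_segments": len([s for s in parsed.path.split("/") if s]),
    }


def tweet_features(text):
    """Illustrative properties of the tweet text."""
    return {
        "tweet_length": len(text),
        "num_hashtags": len(re.findall(r"#\w+", text)),
        "num_mentions": len(re.findall(r"@\w+", text)),
    }


def user_features(followers, following):
    """Illustrative properties of the user who posted the tweet."""
    return {
        "followers": followers,
        "following": following,
        "follower_ratio": followers / following if following else 0.0,
    }


def feature_vector(url, text, followers, following):
    """Merge the three groups into one numeric vector with a fixed key order."""
    merged = {}
    merged.update(url_features(url))
    merged.update(tweet_features(text))
    merged.update(user_features(followers, following))
    return [float(merged[key]) for key in sorted(merged)]


vec = feature_vector(
    "http://192.168.0.1/login/verify",
    "Win a prize #free @you http://t.co/x",
    followers=10,
    following=200,
)
```

Sorting the keys keeps the position of each feature stable, which matters when the same vector layout is fed to a classifier at training and at prediction time.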

Now, let’s dive into the nitty-gritty of PhishAri. The browser extension (written in JavaScript) is the front-end of the entire system; it does very little processing and only shows the appropriate indicator beside every URL. Now comes the meat of the solution: a web application hosted on a separate server, which the extension uses to decide which indicator to show in front of each URL. The web application is written in Python using a framework and is hosted on an Apache server. The extension takes the URL and the tweet id from the tweet and sends them to the web application as a GET request. The web application uses the URL and the tweet id to create a feature vector from the attributes of the URL and the tweet that are used for phishing detection. It then uses machine learning classification to classify the URL as phishing or legitimate. The extension makes another GET request to the web application to receive a JSON object, a string indicating the class of the URL; accordingly, the extension shows a red indicator if the class is ‘phishing’ and a green indicator if it is ‘legitimate’.
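The request flow described above can be sketched on the server side roughly as follows. This is not PhishAri's code: it uses a bare WSGI callable from the Python standard library as a stand-in for the framework, the parameter names url and tweet_id and the JSON key "class" are assumptions, the classifier is a placeholder rule rather than a trained model, and the two GET requests are collapsed into one for brevity.

```python
import json
from urllib.parse import parse_qs


def classify(url, tweet_id):
    """Placeholder for the trained classifier; this rule is purely illustrative."""
    return "phishing" if "login" in url.lower() else "legitimate"


def _json_response(start_response, status, payload):
    """Send a JSON body with the given HTTP status line."""
    body = json.dumps(payload).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]


def application(environ, start_response):
    """WSGI entry point: read url and tweet_id from the query string, reply with the class."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    url = params.get("url", [""])[0]
    tweet_id = params.get("tweet_id", [""])[0]
    if not url or not tweet_id:
        return _json_response(
            start_response, "400 Bad Request",
            {"error": "url and tweet_id are required"},
        )
    return _json_response(
        start_response, "200 OK", {"class": classify(url, tweet_id)}
    )
```

On the client side, the extension would then only need to map the returned class to a colour: red for ‘phishing’ and green for ‘legitimate’.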

Currently, PhishAri works with an accuracy of 87.2%, and we are still in the process of making it stronger and more effective. The extension can be downloaded easily from the Chrome Web Store. We are trying to add more features and strengthen the underlying classifier to make PhishAri more efficient. Any feedback is warmly welcome. If you use Twitter, do give it a try!