The following is a transcription of the interview of Hilary Mason, chief scientist of Bit.ly, a leading service for shortening links for inclusion in tweets, postings, and e-mails, on Siliconangle.tv from the 2011 Strata Conference. The transcription was prepared in part by the Siliconangle.com staff.
John Furrier (JF): Next up, the Chief Scientist from bit.ly, a Big Data company.
Dave Vellante (DV): So, everybody knows bit.ly, right? It takes big, clunky URLs and makes them tiny. Hi Hilary, how are you? Dave Vellante of Wikibon.
Hilary Mason (HM): I’m well. Thank you.
JF: John Furrier with SiliconAngle, nice to meet you!
JF: We love bit.ly. We know about your product.
DV: I use it all the time.
HM: I’m glad to hear it.
DV: Sometimes, I feel I can’t live without it.
JF: One of our contributors used to work there back in the day. Rex Dixon.
HM: Oh Yeah. Rex is amazing. He handles all our support email. Email support at bit.ly, Rex will reply.
JF: And we’ve been following you guys for a long time; I’ve been following the URL shortening business for quite some time. You guys just have phenomenal growth. So, bit.ly, for the folks out there who don’t know, is a service that shortens URLs down from these big long URLs to short ones so you can put them into Twitter or Facebook, and it essentially creates the redirection to the actual Web page, which is essentially an abstraction layer between the DNS URL and the click. With that, you get massive amounts of cool data.
JF: So, you know the kind of contents flowing around the Web. Your adoption rate has been pretty high. Talk about the challenges that you’re seeing with bit.ly right now and with data. How big is it? What are some of the stats?
HM: Sure. So, we use 301 redirects, which are part of the HTTP standard. That means when you click on a bit.ly link it goes through this permanent redirect and takes you to the long URL. So, we see those events where people share URLs through bit.ly, and a lot of those, about 80%, come through our API, which means that when you are using a Twitter client like Tweetdeck you are still using bit.ly behind the scenes. And then we see all of those clicks on bit.ly links, so we’re able to see how people share content, how people consume content and then how they re-share that content. And, we look at the content itself as well. So, we actually pull all that content and do some analysis of our own. So, some stats: We're doing hundreds of millions of clicks a day on probably, again, hundreds of millions of new URLs a day – haven’t checked today. And we’re able to see the kinds of content people look at. Right now, obviously, the content around the conflict in Egypt is huge. And, if you saw that graph we released last night, you could see that we saw traffic in Egypt go almost to zero in the last few days, only to spring up on this amazing curve just yesterday. It seems completely amazing to me that we can see those kinds of world events and phenomena reflected in how people are clicking on links.
DV: And what you were saying in the keynote is that interest from the outside world went through the roof when the internet shut down.
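Editor's illustration: the redirect mechanism described here can be sketched in a few lines of Python. This is a minimal sketch, not bit.ly's implementation; the `LINKS` table, the `abc123` key, and the example.com URL are all made up for illustration. It shows only the core idea: look up the short key and answer with HTTP status 301 and a `Location` header holding the long URL.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory table (assumption); a real service would use a datastore.
LINKS = {"abc123": "https://example.com/a/very/long/article-url"}

def resolve(path, links=LINKS):
    """Return (status, headers) for a short-link request path."""
    key = path.lstrip("/")
    if key in links:
        # 301 Moved Permanently: the Location header carries the long URL.
        return 301, {"Location": links[key]}
    return 404, {}

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, headers = resolve(self.path)
        # A click event could be recorded here before the response is sent.
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()

def serve(port=8000):
    """Start a local demo server; call this manually to try it in a browser."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

Calling `resolve("/abc123")` returns the 301 status together with the long URL, which is the point at which a shortening service gets to observe each click.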
HM: Absolutely. It’s that one event. It caused people around the world just to be clicking on all of these news stories about Egypt.
JF: Your growth at bit.ly has been really driven by the whole megatrend of Twitter and Facebook, because of the social side of it. And also with cloud computing, you guys could deploy your servers pretty quickly when you were a start-up. And as for mobility, screen real estate is at a premium. So, having shortened URLs is key. So, we were talking yesterday and today we coined the phrase “data is the heartbeat of cloud, social and mobile.” You guys are a real living example of that. What is the direction for bit.ly, because the market is really in your favor? The business of bit.ly, you guys do all the service. How do you guys operate your business through distribution piece of people? How does bit.ly’s business work?
Bit.ly Products and Features
HM: The business side of bit.ly, first a small disclaimer: that’s not my area of specialty, I focus mainly on the data and the math. But, the business side of bit.ly, our current product is bit.ly Pro, which is a white-labeled solution for people who want to understand how their brands distribute socially on the Internet, and that gives people their own short URL that’s powered through bit.ly, and they get to see analytics about how the content that they share spreads, but also about how other people on the Internet are sharing their content, without them in the middle of it all. And so, people can really learn some interesting stuff from that. But, we're also building some new products on our data set, and I really do believe that it is our opportunity to take the data we see from other people and return it to them in a way that will really help them explore, discover, and learn things faster. So, one product that we’re releasing soon is called News.me, which is an iPad app for reading the news, and we’re hoping to build more of those types of things in the near future.
JF: Because you get all the data, you can see trending information at any level, not just the most popular. You could see down to the micro level and then sections like a newspaper. Are you guys doing that kind of thing with the… like a newspaper app? Is that what you’re saying?
HM: It’s not exactly the newspaper app. But, you’ll have to wait and see what exactly it will be.
The Dark Side of Links
JF: On the data side, you guys have massive amounts of data. What things… First of all, we’re big fans of what you guys are doing with massive amounts of data. The good thing is that you guys can explore all this gestural data and real click data. So, the user experience. Talk about the user experience that you guys are enabling, and then we’re gonna talk about the bad side, which is, you know, spam. Twitter has a lot of spam, as you know. How do you detect the bad guys, phishing, spam? Because a lot of that is going on, and the communities are trying to police that. Talk about the user experience you are enabling with the data, and talk about the dark side of the data.
HM: So, we see, I love the bit.ly user experience. I think our product design and front-end team is amazing. And, we’ve created a site where you can really easily share the content you wanna share and you can push it to other networks as well. Only recently, we released bit.ly bundles, where you can take several pieces of content, put them behind one short link and share them together, which is pretty cool. We’re focusing on providing that kind of easy, frictionless user experience in order to help people share the data they wanna share and get the data back. One thing I should mention is that you can take any bit.ly link, whether it’s yours or someone else’s, add a plus sign to the end of it and you can see all of the global statistics for that link. That’s completely public. You can see how many people clicked on it, where in the world they came from, and which websites they were referred from.
JF: So, after the bit.ly link, you put space plus…?
HM: No space; just a plus right after that.
JF: So, no space?
HM: No, bit.ly slash some letters and numbers – that’s the bit.ly hash – and the plus sign at the end, then hit enter. Very simple.
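Editor's illustration: the rule described in this exchange (short link, then a plus sign, no space) can be written as a tiny helper. The function name `stats_url` and the `abc123` hash are hypothetical; only the "append a plus sign" convention comes from the interview.

```python
def stats_url(short_url: str) -> str:
    """Return the public statistics address for a short link by appending '+'."""
    # Drop stray whitespace so no space ends up between the hash and the plus.
    url = short_url.strip().rstrip("/")
    # Avoid doubling the plus if the caller already added one.
    return url if url.endswith("+") else url + "+"
```

For example, `stats_url("http://bit.ly/abc123")` gives `http://bit.ly/abc123+`.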
JF: So, you guys have a rich data set and the data world is all about how big the data has to get for some core process to work off a data set. Some people want small data and might not see the big picture. But, as you guys amass these massive amounts of user data, and transactional data, distribution data, you probably see the patterns emerging on the dark side. Can you comment about what’s going on there? That's a big problem on Twitter. We all know spam’s out there, these phishing attacks. You guys are in a good spot to go after that or look at that. Can you share?
HM: Yes. That’s a project that we work on. We spend a lot of resources on highlighting, finding the spam and malware in the bit.ly data, and preventing people from clicking through. You might have had the experience where you click through a bit.ly link and you get this page with a big stop sign on it, and I think a little angry puffer fish on it, that says you should be really careful about clicking through this. But, we do show you where that URL goes, so you make your own… We trust humans more than any automated system. You can make your own judgment as to whether you wanna click through that. But, the way that system works is that whenever a link comes into our system, we pull the content of that link, we analyze the traffic patterns around that link, and we make sure it doesn’t resemble anything that we know to be malicious content. Now, of course there are things that are really on the line, especially when shared socially. Somebody might just be a little bit too excited about a marketing campaign. So, we do have a human in the mix.
JF: Twitter has actually done that, where there are too many follows from an actual human and they shut down your account.
HM: And we do work closely with social networks that rely on bit.ly to make sure that nothing damaging is…
JF: So, this is a top priority. You guys are all over it. And you guys have to harness the data to kind of pull that out.
HM: So, our spam detection system has two main parts. The first part is that we partner with a lot of security companies, and we use Google's Safe Browsing list. Anything they know to be spam or phishing will be blocked automatically. The second part is a home-grown technology. We found that at bit.ly, we see the data a little bit before anybody else, about 6 hours. And a lot of clicks can happen in that initial 6-hour window. And so, we take the things we understand to be malicious and use them to train a statistical classifier that makes a judgment to say we believe that there is an 85% chance that this new thing might be spam, and if the probability is above a high enough threshold, we’ll just block it automatically.
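Editor's illustration: the interview does not say which statistical classifier bit.ly trains, so the sketch below swaps in a tiny naive Bayes model over word tokens purely to show the "estimate a spam probability, then block above a threshold" pattern. The class and function names, the training phrases, and the default threshold of 0.85 (borrowed from the 85% figure mentioned above) are all assumptions.

```python
import math
from collections import Counter

class NaiveBayesSpam:
    """Tiny multinomial naive Bayes over word tokens (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Add one labeled example; label must be 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | text); needs at least one example of each label."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus Laplace-smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_score[label] = score
        # Turn the two log scores into a probability (fine for short texts).
        return 1.0 / (1.0 + math.exp(log_score["ham"] - log_score["spam"]))

def should_block(model, text, threshold=0.85):
    """Block automatically only when the estimated spam probability is high."""
    return model.spam_probability(text) >= threshold
```

Anything scoring below the threshold would not be blocked automatically, which leaves room for the human review mentioned earlier in the conversation.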
Moving toward Real-Time Analysis
DV: You were talking in your keynote about how the state of the data union is very strong and things are good. Your piece, self-proclaimed nerd – you are not nerdy by the way. Basically, your math friends are doing well, right? The wind is at their back. There’s cheap infrastructure, so start-ups can do their thing. So, it’s all good. But then you talked about how it’s not all good, we’ve got some challenges here. One of them is real-time. I wanted to talk about that a little bit. Help us understand: is that a math problem, is this a physics problem? How do we solve that? What do you mean by sort of real-time? How do we get there?
HM: That’s a great question. It is both a math problem, an infrastructure problem and a problem of the applications that we want to address. So, it is a product problem. In data analysis, historically, our conceit has been that you have all your data in a nice little package, and you can look at it as many times as you want. You can iterate through it, you can try different schemes in your algorithms and see which ones come out best. That could take hours to run or days, or sometimes longer. But when you work in a real-time environment, you have to be able to make the same high quality calculations immediately. That is, with milliseconds of latency. That means, we have to make some compromises; both mathematically and the kind of infrastructure that we use.
JF: What does that do for you guys on the security side? You mentioned spam and real-time has been a very big challenge. So how does that relate to the user experience. Can you give some examples around that latency, specifically?
HM: So we like to...instead of using the term “real time” we often use the phrase “relatively recent time” because real time means different things to different applications. If you're a hedge fund trading in microseconds, that is a little different than if you're just shortening a link for Twitter. So our goal is to prevent things like spam from getting through the system in about 30 seconds. And we do this by having our infrastructure set up so every time a new content item comes in we put it on a queue, and that queue is processed as quickly as possible. And because we use a lot of cloud machines we are able to spin up new machines quickly if we get flooded.
JF: Do you guys have relationships with some of these real-time search companies? Because real-time search about a year-and-a-half ago was just about the hottest thing. Topsy was one, Collecta was one. They did some good work, but it never materialized. People really aren't searching in real time. Who wants to stare at the screen and watch things going by? Where you guys seem to have a better angle is on the discovery side. As you get data you have more knowledge around semantic analysis, moving from a request, a kind of search query if you will, to discovery and navigation. Are you guys looking at that area at all?
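Editor's illustration: the queue-based setup described here can be approximated on one machine with Python's standard `queue` and `threading` modules. This is only a sketch under stated assumptions: worker threads stand in for cloud machines, `is_suspicious` is a placeholder for the real content analysis, and `num_workers` is a hypothetical knob for adding capacity when traffic spikes.

```python
import queue
import threading

def is_suspicious(url):
    # Placeholder check (assumption); stands in for real content analysis.
    return "malware" in url

def worker(tasks, flagged, lock):
    """Pull items off the shared queue until a None sentinel arrives."""
    while True:
        url = tasks.get()
        if url is None:
            tasks.task_done()
            break
        if is_suspicious(url):
            with lock:
                flagged.append(url)
        tasks.task_done()

def process(urls, num_workers=2):
    """Queue every incoming URL and let the workers drain the queue."""
    tasks = queue.Queue()
    flagged, lock = [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(tasks, flagged, lock))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # One sentinel per worker so each one stops.
    tasks.join()
    for t in threads:
        t.join()
    return sorted(flagged)
```

Raising `num_workers` is the single-machine analogue of adding machines under load: the queue stays the same and only the number of consumers changes.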
HM: I think we have to change the way we think of the word “search”. So we have this idea search is you go to a Web site and it's got a box on it, you type a query in and you get back a list of results. And this is an old metaphor for search. So when we think about real-time search, we're trying to think about helping you discover the information you will want to know as soon as possible. And that might not take the form of something where you just type in a query and get back results. If you're a logged-in user and we see the types of content that you like to click on and like to share, we might be able to alert you. But we are working on it, we have some infrastructure behind it, and we're able to use it to show things like: I had a slide in my talk yesterday showing the images coming out of Egypt in real time. But we haven't figured out what the product manifestation of that will be.
JF: The search phenomenon, coming out of Google if you will, is outdated – I'll say that, Google's outdated. It probably has some value if you want to get some things here or there, but the notion is to save people time. People use search because there's a lot of stuff to figure out, and they want something, and they want to get it fast. And/or they're discovering and browsing. In the social Web there's a lot of different ways to get that. There's a sea of information. How are you guys saving time for users? Have you thought about that piece of it? Obviously with Bit.ly you shorten the link, you get something faster, but in the aggregate, I want to look at Egypt, there's so much to look at. How do I know what's relevant?
HM: I think the real opportunity we have is to take the massive amount of data that's coming through your streams already and to help you filter that. And I think that's one of the biggest open problems in the tech industry right now, is not how do we get more data into the stream but how do we take what's there and help you find the most important things when it's important. And if there isn't anything important, help you to not waste your time just reading things....
DV: You're essentially giving users incentives to participate and allow you to collect data about them and in return give them services and capabilities they can't get anywhere else. And that's sort of a real flip on the way we think about access to data and your personal information, isn't it?
What is a Data Scientist?
JF: My final question is about you personally. You're in the data business. The term “data scientist” is being kicked around, and a lot of people are really interested in math and science and are looking at career changes, whether they are in their 40s or they're coming out of MIT, Stanford, or whatever institution, or high school for that matter. What do you see as the profile of data scientists? Is it pure comp sci, is it a little bit of cognitive, is it physics, is it social science? It seems to be kind of a mashup.
HM: If you're thinking about it, and you're excited about it, and you can understand the math, the logic, then yes, you can do it. I see data science as a combination of maths, computer science so you can code things that actually function, statistics, and finally just hacking. And I think that last one is by far the most important. If you're the kind of person who can say, “I have some cool data. I really am curious about some questions about that data. I'm going to figure this out.” Then yes, you can do it.
DV: So my last question is also of a personal nature. I want to know what these species are that you discovered. Tell us more about that.
HM: That was the first scientific adventure I ever had. In high school I was privileged to participate in a research expedition to Costa Rica. And we discovered high up in the canopy in the rain forest a kind of nematode and two bacteria that had never been identified before, living in these plants in the top of the rain forest.
DV: So given your statistical background, what are the odds of that?
HM: I believe the odds are very high. I think the rain forest is very full of things we just don't yet know about.
JF: You guys live in a start-up. What is your take on the start-up community. Bit.ly's out of the East Coast? Why are companies handing out money to people? There's a lot of money flowing around, a lot of creativity. What white spaces do you see out there that might be an opportunity for a young entrepreneur to develop around data?
HM: I think there are amazing opportunities right now, and as you mentioned the start-up community in New York has become powerful and very strong and very well connected in the last few years. A lot of those opportunities are in taking the systems that already exist – and we're doing a very good job now of solving the problems we had solved 20 years ago and 10 years ago quickly and efficiently, and as I said yesterday you can now do it for $100 at home in your underwear on your computer – but we still need to figure out what the new capacities are that we have to solve problems that we haven't been able to address. And I think there are huge opportunities around data management, data cleaning, helping people make better decisions from their personal data, sort of quantifying things about your life and understanding it in a very easy, frictionless way. And I hope we see a lot more of that, especially in New York, in the near future.
DV: So it's a good time to be a math geek and an even better time to be a hacker.