Big Data in Small Bytes: The Best Big Data Takeaways from Day One Keynotes at Strata 2012

Strata Conference 2012: Making Data Work. Three days of in-depth discussions, tutorials, and sessions on building a data-driven business.

How much data is big data? Consider this: #TheCube alone generates as much as a quarter of a terabyte every day, and that does not include viewership multiples (thousands of viewers per day).

Wednesday’s keynotes brought together thought leaders from across the information technology space. The potential explosion in collective knowledge they bring to the table feels like it could blow the roof off of the Santa Clara Convention Center. With each keynote scheduled to last just ten minutes, the pace was brisk and energizing, much like the Big Data storm ahead of us.

Here are some of the key sound bites and points from presenters, for our audience looking to harness big data and deliver the latest innovations in the information technology space.

Edd Dumbill (O’Reilly Media, Inc.) and Alistair Croll (Bitcurrent)

  • You know a given technology is at an early stage when the technological innovation is represented by cartoon character names like Hadoop and Mongo. However, in a very short time, Hadoop has become an industry standard.
  • Data science is a collaborative discipline incorporating math, science and visualization – among other parts.
  • Data visualization is the last mile between the computer and our brains – it is how business owners will make decisions.

The Apache Hadoop Ecosystem – Doug Cutting (Cloudera)

  • Fitting intro from Edd Dumbill: “Anything you are working on, [Doug Cutting] has probably already thought of.”
  • There is exponential growth in CPU performance and storage, but traditional software systems have not scaled as well. It is time for a new approach.
  • The old school of thought was proprietary hardware (exotic, central servers, RAID, SAN), with an emphasis on reliability, but much more expensive and unable to scale well.
  • Big data technologists accept unreliability in hardware and build reliability into software, which makes it more cost effective and scalable.
  • In the software space, what was monolithic is now distributed (storage and compute nodes). Because storage is so inexpensive, data can now be saved dynamically, and we are moving from proprietary solutions to open source software.
  • While Hadoop has become the kernel – essentially the distributed operating system for Big Data – no one uses Hadoop alone.
  • There is a collection of supporting projects at the Apache Foundation, and the components (including Hadoop itself) are all meant to be interchangeable.
  • Why open source at Apache? No strategic agenda, community-based, allows competing projects, a loose federation of projects, and insurance against vendor lock-in.
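Cutting’s point about accepting unreliable hardware and building reliability into software can be sketched in miniature. The toy Python model below is not actual Hadoop/HDFS code – the node class and function names are invented for illustration – but it shows the core idea: replicate each block across several cheap nodes so reads survive individual failures.

```python
import random

# Toy sketch (not Hadoop/HDFS code): store each block on several cheap,
# unreliable nodes so the *system* stays reliable even when hardware fails.
REPLICATION = 3

class Node:
    def __init__(self, name):
        self.name = name
        self.blocks = {}   # block_id -> data held on this node
        self.alive = True  # simulate hardware failure by flipping this

def write_block(nodes, block_id, data, replication=REPLICATION):
    """Replicate a block onto `replication` randomly chosen nodes."""
    targets = random.sample(nodes, replication)
    for node in targets:
        node.blocks[block_id] = data
    return targets

def read_block(nodes, block_id):
    """Read from any live replica; dead nodes are simply skipped."""
    for node in nodes:
        if node.alive and block_id in node.blocks:
            return node.blocks[block_id]
    raise IOError("all replicas lost")

cluster = [Node(f"n{i}") for i in range(5)]
write_block(cluster, "blk-1", b"hello")

# Kill two random nodes: with 3 replicas on 5 nodes, at least one survives.
for node in random.sample(cluster, 2):
    node.alive = False

print(read_block(cluster, "blk-1"))  # prints b'hello'
```

With a replication factor of 3 across 5 nodes, any two simultaneous failures still leave at least one live replica – the cost-effective reliability model that HDFS popularized.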

Do We Have The Tools We Need To Navigate The New World Of Data? – Dave Campbell (Microsoft)

  • Pushed Microsoft to embrace Hadoop and lived to tell the tale.
  • It is important to look at how data can be used as a platform.
  • Refining Data: Signal –> Data –> Information –> Knowledge –> Insights and Action
  • The goal in navigating big data should be to reduce time to insight.
  • We need to get better at taking friction out of the process and at combining data (for example, why ask for city, state, AND ZIP code?).
  • It is not just content but the services and models once data can be refined and combined.
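Campbell’s city/state/ZIP example can be made concrete: when one field determines the others, derive them instead of asking for all three. A minimal sketch – the lookup table is a made-up sample and the function name is hypothetical, not any real service:

```python
# Hypothetical illustration of removing friction from data collection:
# a ZIP code determines city and state, so asking for all three is redundant.
# The lookup table below is a tiny made-up sample, not a real data source.
ZIP_TO_PLACE = {
    "95054": ("Santa Clara", "CA"),
    "10001": ("New York", "NY"),
}

def complete_address(zip_code):
    """Derive city and state from the ZIP code instead of asking the user."""
    city, state = ZIP_TO_PLACE[zip_code]
    return {"zip": zip_code, "city": city, "state": state}

print(complete_address("95054"))
# → {'zip': '95054', 'city': 'Santa Clara', 'state': 'CA'}
```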

Decoding the Great American ZIP Myth – Abhishek Mehta (Tresata)

  • Big Data: This thing is really big.
  • We are a data rich but information poor society because technology does not (yet) allow us to solve problems through our data as best as we could.
  • Remember that even though we all can have access, individually we still want different things. We are all equally different.
  • A new economic framework can be developed across industries, and there are enough people working their hardest to do so. We need to recognize that with common platforms to store, process, and analyze data.
  • We need to use data to solve problems at an individual level, but across broad circumstances.

Learning Analytics: What Could You Do With Five Orders of Magnitude More Data About Learning? – Steve Schoettler (Junyo)

  • Educational levels are not improving because teachers are not able to quickly leverage and respond to feedback to make faster improvements.
  • We have the opportunity to change the learning experience at the individual level because the facets of human learning are different (cognitive abilities, multiple intelligences, knowledge, personality, background).
  • We are already using this type of big data analysis in social gaming – why not education?

A Big Data Imperative: Driving Big Action – Avinash Kaushik (Market Motive)

  • A Favorite Quote: “Information is powerful – but it is how we use it that will define us.”
  • We create Data Princelings – we don’t manage the execution of big data effectively on all levels (analysis, presentation, actionable results #inefficientandsucky)
  • Three understandings: known knowns, known unknowns, unknown unknowns. We need to find the unknown unknowns to take better action and more effectively address known unknowns.

Guns, Drugs and Oil: Attacking Big Problems with Big Data – Mike Olson (Cloudera)

  • Interesting and significant global problems are being solved through the collection, analysis, and presentation of big data.
  • Genome analysis is being developed on the foundation of Hadoop and the Apache ecosystem.
  • Predictive Policing with machine learning and social networking can be applied to drug trafficking and terrorism.
  • In oil exploration, we can combine seismic data acquired in the ocean to build more meaningful visualizations of subsurface structures and reservoir maps.
  • Remember that truth and badness can be found with big data analysis.

90% of Your Big Data Problem is Not Big Data – Flavio Villanustre (HPCC)

  • The Big Data Value Chain: Collection –> Ingestion –> Discovery/Cleansing –> Integration –> Analysis –> Delivery
  • A fully parallel set of Machine Learning algorithms on Big Data gives you full insight.
  • Outliers matter, especially when those outliers are the exact reason for the discovery effort.
  • Dimensionality reduction can cause information loss: why risk losing valuable information?
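Villanustre’s point about outliers can be illustrated with a simple z-score filter that flags outliers rather than silently dropping them – a toy sketch with invented data and an illustrative threshold, not a production method:

```python
import statistics

# Toy sketch: in many analyses the outlier is the very thing you're
# looking for, so flag it instead of discarding it. Data and the z-score
# threshold here are invented for illustration.
def flag_outliers(values, z_threshold=2.0):
    """Split values into (inliers, outliers) by z-score."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    outliers = [v for v in values if abs(v - mean) / stdev > z_threshold]
    inliers = [v for v in values if v not in outliers]
    return inliers, outliers

readings = [10, 11, 9, 10, 12, 95]  # 95 may be the discovery, not noise
inliers, outliers = flag_outliers(readings)
print(outliers)  # prints [95]
```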

The Information Architecture of Medicine is Broken – Ben Goldacre (Bad Science)

  • We did not synthesize pharmaceutical trial data effectively enough to create good medicine.

#TheCube Coverage of Strata Conference 2012

We’re running complete coverage of Strata Conference 2012 live. Follow #TheCube and our Twitter profiles @SiliconANGLE and @Wikibon for ongoing conference updates, analysis, and perspectives from Strata 2012. See all of the articles and videos from SiliconANGLE and Wikibon’s coverage here.
