ACM Data Mining Camp Silicon Valley, report


Since I moved to San Francisco, I have been writing reports on what I learn at different events for an internal Zemanta audience. But then I realized, why not just post them to our tech blog?

So here’s the report from today’s ACM Data Mining Camp Silicon Valley. This is not exactly live blogging, but it is not deep analysis either, so do further research and form your own opinions.

Dr. DJ Patil, chief scientist at LinkedIn:

  • a lot of LinkedIn runs on data mining, everything from connection recommendations to the ranking of groups. It’s really deep and strategic
  • you don’t need a PhD to work on LinkedIn’s data mining team; they need many different types of talents and skills (Google’s Dr. Rajan Patel: “I agree absolutely”)
  • LinkedIn is open sourcing a lot of their tech in Project Voldemort, and more is coming, including the reporting layer (if I understood correctly)
  • “data mining is moving from backend technology to frontend, becoming the product by itself”
  • LinkedIn is hiring
[Image: Ken Krugler, via CrunchBase]

I learned about Cascading and Bixo. Cascading is a data processing workflow layer for Hadoop and the like, while Bixo is a data mining toolkit that works with Cascading and Hadoop. When I talked with Ken Krugler from Bixo Labs, he mentioned that we (at Zemanta) are not the only ones missing a good meta-management solution for dealing with data workflows, triggers, metadata about processing, etc. Everybody is missing it. He mentioned a project that might see the light of day in the future (Krugler is actually the guy who created Krugle, the search engine for code, for those who remember it).

Greg Makowski (also one of the #dmcamp organizers) gave a presentation on Netflix challenge algorithms. The interesting tidbit is that Netflix is planning another competition, but this time it will be time-limited (18 months) rather than performance-based (the last one required a 10% improvement). It’s going to be fun to watch.
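
For readers who don’t remember the details: the 10% target in the first Netflix Prize was a 10% reduction in RMSE (root mean squared error) compared to Netflix’s own recommender. Here is a minimal sketch of that arithmetic. The ratings and function names are hypothetical and were not shown in the talk.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline_rmse, candidate_rmse):
    """Fraction by which the candidate lowers the baseline RMSE."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse


# Hypothetical ratings, invented purely for illustration.
actual = [4, 3, 5, 2, 4]
baseline = [3.0, 3.0, 4.0, 3.0, 3.0]    # stand-in for the existing system
candidate = [3.5, 3.0, 4.5, 2.5, 3.5]   # stand-in for a contest entry

base = rmse(baseline, actual)
cand = rmse(candidate, actual)
gain = relative_improvement(base, cand)

print(f"baseline RMSE {base:.4f}, candidate RMSE {cand:.4f}")
print(f"relative improvement {gain:.1%}, meets 10% target: {gain >= 0.10}")
```

A performance-based contest ends when some entry crosses a threshold like this, while a time-limited one simply picks the best entry at the deadline.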

[Image: Hadoop logo, via CrunchBase]

Apache Hadoop seems to be the talk of the day here. However, some claim Apache Mahout is not mature enough. I like what Ted Dunning (one of the people behind Mahout) is advocating: just do it. He says, “Tell us what people need done and we’ll help them. And if you have data to share, that’s the fastest way to get people excited about the problem you have.”

The semantic web session wasn’t very popular when it was proposed, but some people still showed up, which is not surprising. More surprisingly, at the session about open source datasets (that could be used for machine learning), Freebase was something new to people.

All in all, ACM Data Mining Camp Silicon Valley was a pretty good event, although some presentations could use some work and it was pretty noisy in some rooms.

Internal memo: While writing all this, I have been fighting with Zemanta’s insertion of the Ken Krugler image and the broken HTML layout it caused. We have some fixing to do.

  • gregmakowski

    Hello,
    I am starting to organize another Data Mining Camp, and I am seeking constructive suggestions to improve the next event (likely on March 6th or 20th).

    * encourage setting expectations for each topic as “beginner”, “advanced” or “level not rated”. Some people wanted to have advanced discussions with experienced practitioners in an area, but many of the questions were at a “getting started” level.

    * we are looking for other locations, in case the next event has more people, and are seeking introductions or suggestions.

  • andraz

    Hi Greg,
    great to have you reading this blog :)

    As far as topics are concerned, I think that most of them last time were for beginners and there were very few advanced topics. Also, it would be great to motivate people to prepare better for their presentations, since completely ad hoc is not the best format.

    Oh, the opening panel absolutely rocked! It would be great if everybody on the panel had their own presentation afterwards (in the regular program).

    Since I am from abroad, I have no ideas about locations, sorry :)