Baking a Zemanta Pie

Zemanta tech blog

LOD2 Webinar Series: Virtuoso Universal Server

Posted by mateja, under Uncategorized on December 19th, 2011

The time has come to announce another webinar in the LOD2 webinar series. As a reminder, the series offers a monthly webinar on Linked (Open) Data tools and services around the LOD2 project, the LOD2 Stack and the Linked Open Data life cycle, including third-party tools.

While the previous webinar in November was an introduction to the LOD2 Stack of tools, this one covers Virtuoso Universal Server as the Linked Data ingestion, storage and retrieval component of the LOD2 Stack. You’ll get an overview of the tool and its role in the Linked Data life cycle, including live demos of its usage by OpenLink Software, UK.

If you are interested in Open (Linked) Data, I suggest you take part in this free webinar on December 20, 2011 at 4:00pm CET. More information and registration are available at https://www2.gotomeeting.com/register/523785698.

I should also mention this:

The LOD2 webinar series is powered by LOD2 – Creating Knowledge out of Interlinked Data (http://lod2.eu), organised & produced by Semantic Web Company (http://www.semantic-web.at), Austria. We are looking forward to your registration and to meeting you at the LOD2 webinar on 20.12.2011!

Comments: (0)

Linked Open Data 2, a sequel

Posted by mateja, under Uncategorized on November 24th, 2011

Until recently Open Data didn’t ring any bells for me, but joining Zemanta changed everything. Since this September Zemanta has been a partner in the LOD2 consortium (http://lod2.eu), and now I have the opportunity to work alongside a bunch of cool people doing cool stuff with Open Data in the world of the Semantic Web. There, according to Tim Berners-Lee, it’s not only about putting data on the web, but also about linking it and making it machine-readable so we can actually do something useful with it, for example find other, related data.

Nowadays it’s quite easy to find bits and pieces of freely available data. The only problem is that these data are just that, more or less meaningful bits and pieces, until you connect them with other data. It’s almost like baking a cake: until you mix and combine all the ingredients and bake the darn thing, you don’t have one. If data are the ingredients, the information or knowledge gained from linked data is the cake.

Some institutions, companies and governments have already grasped the idea of creating, sharing and linking data (e.g. Data.gov, Freebase, DBpedia, Infochimps), while others still need some encouragement, and this is what the EU project Creating Knowledge out of Interlinked Data (LOD2) is all about. LOD2 builds on the results of the Linking Open Data (LOD) project, hence the sequel (in case you wondered).
One of the main goals of the LOD2 project is to provide the tools needed throughout the life cycle of Linked Data, from extraction, authoring and enrichment through interlinking to visualization and… sharing.

A few days ago the LOD2 consortium announced the first release of the LOD2 Stack of tools (available at http://stack.lod2.eu). The official launch is on November 29, 2011. If you want to build the Future Web (or is it the future of the Web?) with us, step in, grab a tool, link some data and contribute. If you are a professional working in government or industry, even better. Spread the word, join the Semantic Web crew, provide Open Data.

If this seems too PR-ish for you, the least you can do is register for the free LOD2 Stack webinar. The LOD2 team has prepared a one-hour webinar in which the LOD2 Stack will be presented and demonstrated. It starts on Tuesday, November 29, 2011 at 4:00pm CET and is part of the LOD2 webinar series. Once more: it’s free.

Further information and registration at http://lod2.eventbrite.com/ .

Comments: (0)

Our recruiting experiences

Posted by andraz, under Uncategorized on September 15th, 2011

Lately at Zemanta we went into intensive recruiting mode on all fronts: UX, frontend, backend, research, system infrastructure and administration. We tried a lot of different recruiting methods, so why not share the experience? What worked and what didn’t?

While we were also looking for UX, design and project management talent, this post is about looking for tech talent. Maybe we’ll share the other experiences in later posts.

First the non-news: using general local job boards gets you lots of candidates. We used the Moje Delo free offer (a week-long ad), which resulted in about 15 applications. We did phone interviews with about half of them, but decided to pass on all. The mix of candidates was really wide: some were very good engineers but with interests in very different fields, and some were pretty awful engineers.

We also went to a job fair at the Faculty of Computer Science in Ljubljana. This had zero effect. I speculate there were two reasons. One is that the organizers didn’t know how to promote the event to the right students; they really need to learn a lot. On top of that, the most talented students don’t really hang out at the university (they are already working somewhere, a situation specific to Slovenia due to how higher education works). For next year we promised to help the organizers with some advice.

Posting on our company blog got us a good response: the applicants we got were interesting to interview, ranging from almost-graduated students to applicants from abroad, plus a lot of internship inquiries. We were pretty happy to get to know some really talented students through that. One is just finishing his undergrad degree and is joining the Zemanta team in the fall. We are actually industry co-mentors for his graduation thesis on content extraction techniques.

Unsurprisingly, the most effective approach is tips and referrals from friends. Whenever we give presentations at tech events we let people know we are hiring. That doesn’t generally result in any applications, because people just don’t apply on their own. But they do encourage friends who they think should apply. This has resulted in many good hires. And sometimes, if someone is recommended to us directly, we just pick up the phone and ask.

We are also vigilant about other opportunities. Bostjan saw a posting on the NYC NewTech mailing list from a young techie looking to work abroad in Europe. After a bunch of Skype calls and researching the work visa process we decided to bring him over to Slovenia. It was a bit risky, but it paid off. Sam didn’t just bring his great engineering skills and “get the job done” ethos; he also brought a more international perspective to the team. We got a taste of that and want more of it. So we started searching internationally.

We first tried LinkedIn and it didn’t work. We posted on a few start-up job boards and also got no response. Then we found Stack Overflow Careers 2.0. We advertised a frontend JavaScript position there, making clear that candidates would be expected to move to Ljubljana, Slovenia. All we can say is that it really works. Top-notch people looking for new opportunities got in touch with us, and we’re relocating someone just now.

At about the same time I was reading Hacker News and saw the “Who’s hiring” topic. I posted on impulse. We got just three inquiries, but all of them were again top-notch.

So here are our lessons learned:

  • let everybody know what you are looking for
  • post on your blog, spread the message
  • be on the lookout for unexpected opportunities
  • and if you have a bit of money to spend, put $350 on Stack Overflow Careers 2.0

Up next: what the interview process looks like and what we test for.

Comments: (3)

Precision vs. NDCG

Posted by tom, under Uncategorized on August 24th, 2011

Here at Zemanta, we use different metrics and benchmarks to continuously improve the article recommendations for our users.

Up until now, the main metric used to evaluate the accuracy of our recommendations was the precision metric.

Precision works by summing the manually assigned relevance (0 – 3) of the top 10 articles as determined by our recommendation engine. However, precision is the same regardless of the ranking of the top 10 articles – if the most relevant article is moved from the first place to the last (10th) place, the precision doesn’t change.

To get more accurate feedback not only on the accuracy of our recommendations, but on the ranking as well, we implemented the measure known as Normalized Discounted Cumulative Gain (NDCG). To calculate NDCG, we first need to calculate the Discounted Cumulative Gain (DCG):


where rel_i is the manually assigned relevance of the article at rank i. Then, we calculate the NDCG as

$$\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{iDCG}}$$

where iDCG refers to the idealized DCG, the DCG of the perfect ranking of the result. Since we assume that there are always at least 10 perfectly relevant articles in our index, we calculate the iDCG using the maximum relevance for all 10 articles.
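
To make the difference between the two metrics concrete, here is a minimal Python sketch. It assumes the common log2(i + 1) discount for DCG, which may differ from the exact variant we used; the function names and the sample relevance lists are made up for illustration.

```python
import math

MAX_RELEVANCE = 3  # relevance grades run from 0 to 3
TOP_N = 10         # only the top 10 recommendations are scored


def precision(relevances):
    """Sum of the relevance grades of the top-N results (ignores their order)."""
    return sum(relevances[:TOP_N])


def dcg(relevances):
    """Discounted cumulative gain; the log2(i + 1) discount is an assumption."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:TOP_N], start=1))


def ndcg(relevances):
    """DCG divided by the ideal DCG, taken as maximum relevance at every rank."""
    ideal = dcg([MAX_RELEVANCE] * TOP_N)
    return dcg(relevances) / ideal


# The same ten grades, once with the best articles first and once reversed.
good_first = [3, 2, 2, 1, 1, 0, 0, 0, 0, 0]
good_last = list(reversed(good_first))

print(precision(good_first), precision(good_last))  # identical sums
print(ndcg(good_first) > ndcg(good_last))           # ranking matters for NDCG
```

Both orderings get the same precision, while NDCG rewards the ordering that puts the most relevant articles first.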

When we calculated NDCG scores for different versions of Zemanta’s recommendation engine, we noticed that precision and NDCG were almost perfectly correlated and that the difference between them was almost constant. We were surprised at first, but later realized that this was expected. As our engine’s recommendations improved, the more relevant results ranked higher, and as a result the top 10 recommendations became more relevant, so both the precision and NDCG improved.

Therefore, we concluded that despite its increased complexity, NDCG was not a significantly better measure of our engine’s accuracy. This result is similar to the findings of an experimental study [1] which compared different measures of the quality of Google’s search results and discovered only a weak correlation between users’ satisfaction and NDCG (albeit calculated using a slightly different formula), while Cumulative Gain (equivalent to our precision) was correlated very strongly.

[1] The Relationship between IR Effectiveness Measures and User Satisfaction

Comments: (0)

New Zemanta developer portal launched

Posted by andraz, under Uncategorized on January 19th, 2011

The Zemanta team is happy to present the heavily refreshed Zemanta developer portal!

In the last half a year many of you have observed that the API documentation was getting out of date. At the time we needed to change how the API works on the backend, which required a rewrite of the developer portal (long story). Then we decided to also redesign it visually and refresh all the documentation.

So now the best API out there for keyword extraction, entity extraction and disambiguation, related articles search, related images search, in-text link suggestions, and automatic categorization also has some pretty good documentation. The mailing list is also getting more and more lively. Let us know if you have questions.

We plan to keep working on the API and make it even better at ‘understanding text on the web’. Let us know if you have any questions!

[caveat: due to the port to the new platform, you will unfortunately have to change your password the next time you log in]

Comments: (7)

Python, gdb and a very large core dump

Posted by andraz, under Uncategorized on January 16th, 2011

Over the weekend I’ve been trying to track down a memory leak in a piece of our software. I learned a lot about the state of the art in Python debugging, so I am sharing it here.

So, the problem: approximately every two months one of our daemon processes goes berserk, allocating memory left and right. Usually the machine has to be rebooted or the kernel’s out-of-memory killer does its job. This time Jure (our beloved admin) caught it. Now we had a 30GB core dump file.

The question becomes: what’s allocating all that memory? Or, more precisely, how do we get something out of that core dump?

A few more details: Zemanta’s production servers use Python 2.5, the stock Debian Lenny version. We don’t run python-dbg builds in production.

After opening the dump with gdb the first thing I saw was that there were no debugging symbols. This can be solved by installing python-gdb: gdb knows how to load symbols from separate files under /usr/lib/debug, even when loading core dumps that were generated without symbols. Naturally this works only if the distribution gives you exactly the right symbols for the binary at hand.

When symbols are loaded after the fact, this is what you should see when loading a core dump into gdb:

Reading symbols from /lib/libc.so.6…Reading symbols from /usr/lib/debug/lib/libc-2.7.so…done.

Ok, so now we have symbols. Let’s look at the backtrace first (this is thread #2; luckily we only have two threads in this daemon):

#0 update_refs (generation=2) at ../Modules/gcmodule.c:242
#1 collect (generation=2) at ../Modules/gcmodule.c:789
#2 0x00000000004be35b in collect_generations (basicsize=) at ../Modules/gcmodule.c:897
#3 _PyObject_GC_Malloc (basicsize=) at ../Modules/gcmodule.c:1336
#4 0x000000000045f569 in PyType_GenericAlloc (type=0x728a20, nitems=0) at ../Objects/typeobject.c:454
#5 0x0000000000426ddc in BaseException_new (type=0x2, args=, kwds=0x7f1505978010) at ../Objects/exceptions.c:35
#6 0x000000000046ddf3 in type_call (type=0x2, args=0x7f18d379d050, kwds=0x0) at ../Objects/typeobject.c:422
#7 0x00000000004187b3 in PyObject_Call (func=0x2, arg=0x740060, kw=0x7f1505978010) at ../Objects/abstract.c:1861
#8 0x000000000048b262 in PyEval_CallObjectWithKeywords (func=0x728a20, arg=0x7f18d379d050, kw=0x0) at ../Python/ceval.c:3444
#9 0x00000000004a011e in PyErr_NormalizeException (exc=0x7fffffff93d8, val=0x7fffffff93d0, tb=0x7fffffff93c8) at ../Python/errors.c:175
#10 0x0000000000491160 in do_raise (f=0x2b58190, throwflag=) at ../Python/ceval.c:3077
#11 PyEval_EvalFrameEx (f=0x2b58190, throwflag=) at ../Python/ceval.c:1632
#12 0x000000000049313d in PyEval_EvalCodeEx (co=0x1f46468, globals=, locals=, args=0x2939698, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at ../Python/ceval.c:2838
#13 0x00000000004dd612 in function_call (func=0x1e8ec08, arg=0x2939680, kw=0x0) at ../Objects/funcobject.c:517
#14 0x000000000041e0eb in PyObject_Call (callable=0x1e8ec08) at ../Objects/abstract.c:1861
#15 PyObject_CallFunctionObjArgs (callable=0x1e8ec08) at ../Objects/abstract.c:2091
#16 0x000000000048920d in builtin_getattr (self=, args=) at ../Python/bltinmodule.c:737
#17 0x00000000004919c2 in call_function (f=0x2b09dd0, throwflag=) at ../Python/ceval.c:3575
#18 PyEval_EvalFrameEx (f=0x2b09dd0, throwflag=) at ../Python/ceval.c:2274
#19 0x000000000049313d in PyEval_EvalCodeEx (co=0xa34cd8, globals=, locals=, args=0x3, argcount=1, kws=0x7f18c809bf98, kwcount=0, defs=0x9c28d8, defcount=2, closure=0x0) at ../Python/ceval.c:2838
#20 0x0000000000491663 in fast_function (f=0x7f18c809bdd0, throwflag=) at ../Python/ceval.c:3671
#21 call_function (f=0x7f18c809bdd0, throwflag=) at ../Python/ceval.c:3596
#22 PyEval_EvalFrameEx (f=0x7f18c809bdd0, throwflag=) at ../Python/ceval.c:2274
#23 0x000000000049313d in PyEval_EvalCodeEx (co=0x1f4d210, globals=, locals=, args=0x3, argcount=2, kws=0x7f18c809bd90, kwcount=0, defs=0x2038228, defcount=1, closure=0x0) at ../Python/ceval.c:2838
#24 0x0000000000491663 in fast_function (f=0x7f18c809bbf0, throwflag=) at ../Python/ceval.c:3671
#25 call_function (f=0x7f18c809bbf0, throwflag=) at ../Python/ceval.c:3596
#26 PyEval_EvalFrameEx (f=0x7f18c809bbf0, throwflag=) at ../Python/ceval.c:2274
#27 0x0000000000492a52 in fast_function (f=0x29a9670, throwflag=) at ../Python/ceval.c:3661
#28 call_function (f=0x29a9670, throwflag=) at ../Python/ceval.c:3596
#29 PyEval_EvalFrameEx (f=0x29a9670, throwflag=) at ../Python/ceval.c:2274
#30 0x0000000000492a52 in fast_function (f=0x7f18c809b9e0, throwflag=) at ../Python/ceval.c:3661
#31 call_function (f=0x7f18c809b9e0, throwflag=) at ../Python/ceval.c:3596
#32 PyEval_EvalFrameEx (f=0x7f18c809b9e0, throwflag=) at ../Python/ceval.c:2274
#33 0x0000000000492a52 in fast_function (f=0x2ebbce0, throwflag=) at ../Python/ceval.c:3661
#34 call_function (f=0x2ebbce0, throwflag=) at ../Python/ceval.c:3596
#35 PyEval_EvalFrameEx (f=0x2ebbce0, throwflag=) at ../Python/ceval.c:2274
#36 0x0000000000492a52 in fast_function (f=0x2e90db0, throwflag=) at ../Python/ceval.c:3661
#37 call_function (f=0x2e90db0, throwflag=) at ../Python/ceval.c:3596
#38 PyEval_EvalFrameEx (f=0x2e90db0, throwflag=) at ../Python/ceval.c:2274
#39 0x0000000000492a52 in fast_function (f=0x3ca8090, throwflag=) at ../Python/ceval.c:3661
#40 call_function (f=0x3ca8090, throwflag=) at ../Python/ceval.c:3596
#41 PyEval_EvalFrameEx (f=0x3ca8090, throwflag=) at ../Python/ceval.c:2274
#42 0x0000000000492a52 in fast_function (f=0x4217550, throwflag=) at ../Python/ceval.c:3661
#43 call_function (f=0x4217550, throwflag=) at ../Python/ceval.c:3596
#44 PyEval_EvalFrameEx (f=0x4217550, throwflag=) at ../Python/ceval.c:2274
#45 0x000000000049313d in PyEval_EvalCodeEx (co=0x1dc7288, globals=, locals=, args=0x2f04f50, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at ../Python/ceval.c:2838
#46 0x00000000004dd612 in function_call (func=0x1dd48c0, arg=0x2f04f38, kw=0x0) at ../Objects/funcobject.c:517
#47 0x00000000004187b3 in PyObject_Call (func=0x2, arg=0x740060, kw=0x7f1505978010) at ../Objects/abstract.c:1861
#48 0x000000000041f658 in instancemethod_call (func=0x1dd48c0, arg=0x2f04f38, kw=0x0) at ../Objects/classobject.c:2519
#49 0x00000000004187b3 in PyObject_Call (func=0x2, arg=0x740060, kw=0x7f1505978010) at ../Objects/abstract.c:1861
#50 0x000000000048fef2 in do_call (f=0x2df6a80, throwflag=) at ../Python/ceval.c:3786
#51 call_function (f=0x2df6a80, throwflag=) at ../Python/ceval.c:3598
#52 PyEval_EvalFrameEx (f=0x2df6a80, throwflag=) at ../Python/ceval.c:2274
#53 0x0000000000492a52 in fast_function (f=0x257d8c0, throwflag=) at ../Python/ceval.c:3661
#54 call_function (f=0x257d8c0, throwflag=) at ../Python/ceval.c:3596
#55 PyEval_EvalFrameEx (f=0x257d8c0, throwflag=) at ../Python/ceval.c:2274
#56 0x000000000049313d in PyEval_EvalCodeEx (co=0x1dbfd50, globals=, locals=, args=0x6, argcount=1, kws=0x28fe240,
kwcount=0, defs=0x1e29a28, defcount=5, closure=0x0) at ../Python/ceval.c:2838
#57 0x0000000000491663 in fast_function (f=0x28fe090, throwflag=) at ../Python/ceval.c:3671
#58 call_function (f=0x28fe090, throwflag=) at ../Python/ceval.c:3596
#59 PyEval_EvalFrameEx (f=0x28fe090, throwflag=) at ../Python/ceval.c:2274
#60 0x000000000049313d in PyEval_EvalCodeEx (co=0x1f32990, globals=, locals=, args=0x3, argcount=1, kws=0x2578738,
kwcount=2, defs=0x173cec0, defcount=2, closure=0x0) at ../Python/ceval.c:2838
#61 0x0000000000491663 in fast_function (f=0x2578580, throwflag=) at ../Python/ceval.c:3671
#62 call_function (f=0x2578580, throwflag=) at ../Python/ceval.c:3596
#63 PyEval_EvalFrameEx (f=0x2578580, throwflag=) at ../Python/ceval.c:2274
#64 0x000000000049313d in PyEval_EvalCodeEx (co=0x7f18d2870198, globals=, locals=, args=0x187b0f8, argcount=2,
kws=0x1e54310, kwcount=0, defs=0x0, defcount=0, closure=0x0) at ../Python/ceval.c:2838
#65 0x00000000004dd709 in function_call (func=0x273e848, arg=0x187b0e0, kw=0x2577ea0) at ../Objects/funcobject.c:517
#66 0x00000000004187b3 in PyObject_Call (func=0x2, arg=0x740060, kw=0x7f1505978010) at ../Objects/abstract.c:1861
#67 0x0000000000490297 in ext_do_call (f=0x25783b0, throwflag=) at ../Python/ceval.c:3855
#68 PyEval_EvalFrameEx (f=0x25783b0, throwflag=) at ../Python/ceval.c:2314
#69 0x0000000000492a52 in fast_function (f=0x24d8df0, throwflag=) at ../Python/ceval.c:3661
#70 call_function (f=0x24d8df0, throwflag=) at ../Python/ceval.c:3596
#71 PyEval_EvalFrameEx (f=0x24d8df0, throwflag=) at ../Python/ceval.c:2274
#72 0x0000000000492a52 in fast_function (f=0x2577160, throwflag=) at ../Python/ceval.c:3661
#73 call_function (f=0x2577160, throwflag=) at ../Python/ceval.c:3596
#74 PyEval_EvalFrameEx (f=0x2577160, throwflag=) at ../Python/ceval.c:2274
#75 0x000000000049313d in PyEval_EvalCodeEx (co=0x26c8a80, globals=, locals=, args=0x187af50, argcount=2, kws=0x0,
kwcount=0, defs=0x0, defcount=0, closure=0x0) at ../Python/ceval.c:2838
#76 0x00000000004dd612 in function_call (func=0x273d230, arg=0x187af38, kw=0x0) at ../Objects/funcobject.c:517
#77 0x00000000004187b3 in PyObject_Call (func=0x2, arg=0x740060, kw=0x7f1505978010) at ../Objects/abstract.c:1861
#78 0x000000000041f658 in instancemethod_call (func=0x273d230, arg=0x187af38, kw=0x0) at ../Objects/classobject.c:2519
#79 0x00000000004187b3 in PyObject_Call (func=0x2, arg=0x740060, kw=0x7f1505978010) at ../Objects/abstract.c:1861
#80 0x0000000000465ca6 in slot_tp_init (self=0x2740850, args=0x2740350, kwds=0x0) at ../Objects/typeobject.c:4943
#81 0x000000000046dfdb in type_call (type=0x24d93d0, args=0x2740350, kwds=0x0) at ../Objects/typeobject.c:436
#82 0x00000000004187b3 in PyObject_Call (func=0x2, arg=0x740060, kw=0x7f1505978010) at ../Objects/abstract.c:1861
#83 0x000000000048fef2 in do_call (f=0x2576dc0, throwflag=) at ../Python/ceval.c:3786
#84 call_function (f=0x2576dc0, throwflag=) at ../Python/ceval.c:3598
#85 PyEval_EvalFrameEx (f=0x2576dc0, throwflag=) at ../Python/ceval.c:2274
#86 0x0000000000492a52 in fast_function (f=0x7da510, throwflag=) at ../Python/ceval.c:3661
#87 call_function (f=0x7da510, throwflag=) at ../Python/ceval.c:3596
#88 PyEval_EvalFrameEx (f=0x7da510, throwflag=) at ../Python/ceval.c:2274
#89 0x000000000049313d in PyEval_EvalCodeEx (co=0x7f18d2870648, globals=, locals=, args=0x0, argcount=0, kws=0x0,
kwcount=0, defs=0x0, defcount=0, closure=0x0) at ../Python/ceval.c:2838
#90 0x0000000000493332 in PyEval_EvalCode (co=0x2, globals=0x740060, locals=0x7f1505978010) at ../Python/ceval.c:494
#91 0x00000000004b2cd8 in run_mod (fp=0x75e010, filename=0x7fffffffd522 "transfer.py", start=, globals=0x7814f0, locals=0x7814f0, closeit=1,
flags=0x7fffffffc5c0) at ../Python/pythonrun.c:1273
#92 PyRun_FileExFlags (fp=0x75e010, filename=0x7fffffffd522 "transfer.py", start=, globals=0x7814f0, locals=0x7814f0, closeit=1,
flags=0x7fffffffc5c0) at ../Python/pythonrun.c:1259
#93 0x00000000004b2f7b in PyRun_SimpleFileExFlags (fp=0x75e010, filename=0x7fffffffd522 "transfer.py", closeit=1, flags=0x7fffffffc5c0) at ../Python/pythonrun.c:879
#94 0x0000000000414542 in Py_Main (argc=2, argv=) at ../Modules/main.c:532
#95 0x00007f18d29fa1a6 in __libc_start_main () from /lib/libc.so.6
#96 0x0000000000413989 in _start ()

Ok… this is ugly, there must be a better way?

libpython – Easier Python Debugging

And there is: David Malcolm from Red Hat has proposed and implemented a plug-in for gdb that makes debugging Python with gdb easier.

Now this plug-in needs gdb 7.2 and Lenny has 6.8, which means a new gdb had to be compiled. The latest FSF gdb 7.2 will do the job. Just be careful to use --prefix=/usr when running ./configure, as gdb somehow loses the ability to load symbols from separate debug libraries if you use your own prefix (the prefix hint and many others were courtesy of the very, very kind people on #gdb on Freenode, especially Jan Kratochvil).

Googling for debugging Python under gdb is made much harder by the fact that gdb has lately chosen Python as its scripting language. Everything gets rather confusing. The next question is where to get the famous libpython.py… you can choose the one that comes with Python 2.7, with Python trunk or with Cython.

Then you finally type in the magic words inside gdb shell:

(gdb) python import libpython
(gdb) py-bt
#4 (unable to read python frame information)
#8 (unable to read python frame information)
#13 (unable to read python frame information)
#16 (unable to read python frame information)
#19 (unable to read python frame information)

Great, then you try just “bt”:

for addr_incr, line_incr in zip(co_lnotab[::2], co_lnotab[1::2]):
TypeError: 'FakeRepr' object is unsubscriptable

Ok, no luck here. After lots of testing I found out that you basically need Python 2.6 to make it work. I’ve written to David Malcolm to see if there are any easy fixes, but I am not holding my breath.

gdbheap – python-friendly memory analyzer

Next came gdbheap, another Fedora-affiliated project. It promises to show you where your Python memory is going, even from core dumps! The web page looks promising and I’ve already gotten the new gdb Python extensions to work, so surely this will work:

(gdb) python import gdbheap
(gdb) heap
Missing debuginfo for glibc
Suggested fix:
debuginfo-install glibc

It seems the instructions are for Fedora, but supposedly Debian’s -dbg packages do the same trick. Digging deeper showed that gdbheap needs a pretty wide-open look into glibc, especially into its memory allocation structures. Symbols for the top-level structures are present in /usr/lib/debug/lib/libc-2.7.so; two of the important ones are “mp_” and “main_arena”. However, the structure of mp_ itself isn’t available in that file, so when gdbheap tries to resolve things like mp_.sbrk_base it fails. The symbols needed are present in a separate library, /usr/lib/debug/libc-2.7.so, but that library isn’t exactly the same as the ordinary libc, so symbols from the former cannot be used for the latter. The bottom line is that I can’t use gdbheap to take a look into my 30GB core dump, because the libc6-dbg package has just some of the symbols needed.

So there was at least some good news here: if gdbheap can give good answers, I’ll start running our problematic daemon with LD_LIBRARY_PATH=/usr/lib/debug, and the next time we get a dump we’ll be able to take a look.
But just in case, I wanted to see how gdbheap works. I attached gdb to a process running the same daemon in normal mode. It was using about 100MB of memory. I typed in the commands again and a progress bar started showing some kind of analysis being done. After two minutes it did indeed explain which types of Python objects were eating the memory. There was a small issue though… I checked the memory usage of gdb itself during this analysis, and for 100MB of resident memory it had eaten… 4GB of memory. Scaling this to the 30GB core dump that I have to analyze, I think it might be feasible by the end of the decade.

Anyway, I am still optimistic. gdbheap is written in Python and at first sight it should be possible to make it just sample the memory space instead of analyzing all of it. But until I get it working on my dump, that’s just a theory.

.gdbinit for gdb making Python bearable

Then there was another tool I’d seen previously: .gdbinit, an init script for gdb that can print the stack trace in a Python-friendly way, with locals and globals. The problem is that for the variables part it relies on the Python binary actually running, so I knew I could only get stack traces.
The issue is that it is very, very Python-version dependent, and naturally it didn’t work out of the box. For example, one of the conditions for deciding whether a function is one from which the Python stack can be traced is:

if $pc > PyEval_EvalFrameEx && $pc < PyEval_EvalCodeEx

This basically compares the program counter ($pc) with the addresses of two specific functions. What it tries to do, in a very clumsy way, is find out whether $pc is inside PyEval_EvalFrameEx. This depends on the order of functions in the Python source file and on the compiler preserving it when generating code. The same goes for the termination condition, which gets into infinite loops when using threads, etc.
For my specific case, I’ve found that I can do this:

if $pc > PyEval_EvalCodeEx && $pc < PyEval_EvalCodeEx+40596

and it works. The memory offset is purely hand-picked; I have no idea how large the function really is. Actually, the right way to fix the problem would be to check whether the “co” variable is available in the local context, but nobody on #gdb knew how to check for that programmatically.
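
For comparison, here is a rough sketch of how the same frame selection might look with gdb’s Python scripting instead of address arithmetic: walk the stack and pick frames by function name. This is not what my .gdbinit does, and I have not run it against the dump. It only assumes frame objects that offer name() and older(), as gdb.Frame does; FakeFrame and fake_stack are hypothetical stand-ins so the walk can be tried outside gdb.

```python
def python_eval_frames(newest_frame, eval_name="PyEval_EvalFrameEx"):
    """Yield frames whose function name equals eval_name, innermost first.

    newest_frame is expected to behave like gdb.Frame: name() returns the
    function name (or None) and older() returns the caller frame (or None).
    """
    frame = newest_frame
    while frame is not None:
        if frame.name() == eval_name:
            yield frame
        frame = frame.older()


class FakeFrame:
    """Hypothetical stand-in for gdb.Frame, only for trying the walk outside gdb."""

    def __init__(self, name, older=None):
        self._name = name
        self._older = older

    def name(self):
        return self._name

    def older(self):
        return self._older


def fake_stack(names):
    """Build a chain of FakeFrames from function names listed innermost first."""
    frame = None
    for name in reversed(names):
        frame = FakeFrame(name, frame)
    return frame


# Inside gdb one would start from a real frame instead, for example:
#     import gdb
#     frames = list(python_eval_frames(gdb.selected_frame()))
stack = fake_stack(["update_refs", "collect", "PyEval_EvalFrameEx",
                    "PyEval_EvalCodeEx", "PyEval_EvalFrameEx", "_start"])
print(len(list(python_eval_frames(stack))))
```

Matching on the function name avoids depending on how the compiler laid out the functions in memory, which is exactly what makes the $pc comparison fragile.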

Anyway, in the end I managed to piece together a .gdbinit file that gets me the following:

(gdb) pystack
/home/neith/lib/python/feedparser.py (237): __getattr__
/usr/lib/python2.5/copy.py (171): deepcopy
/home/neith/lib/python/feedparser.py (683): pop
/home/neith/lib/python/sgmllib.py (96): feed
/home/neith/lib/python/feedparser.py (2675): parse
/home/neith/zdev/neith/aggregator/finish.py (205): parse_sourcefeed
transfer.py (661): mp_process_feed
/usr/lib/python2.5/site-packages/multiprocessing/forking.py (99): __init__
transfer.py (692):


So now I have a culprit for the memory hunger: it’s sgmllib/feedparser. Unfortunately that’s third-party code that is hard to understand and hard to debug. The last time we tried to migrate to a newer version of feedparser, things fell apart. Maybe we have to try that again. Or try swapping the FeedParser for the (hopefully) more mature Jakarta Feedparser or Rome, probably as an RPC service written in Java. Any suggestions?


Debugging python with gdb, final verdict

People are working on projects that might lead us somewhere. But right now, for a developer who needs to figure out where the memory went inside a core dump, it doesn’t look very good. And the integrated debugging experience found in Microsoft’s tools is still far, far away. But people are trying, and that’s something.

Oh, I must also mention a great piece of software I once used to trace a memory hog that was somewhat easier to reproduce: heapy. However, it seems the project is dead.

Your experience / advice / hints

I’d be happy to get more information on methods to debug python code. If you know about something I have missed, please leave a comment, it would be appreciated!

Comments: (4)

Term pruning

Posted by dusan, under Apache Lucene/Solr on September 16th, 2010

At Zemanta we are constantly experimenting with new ideas for improving our service. Most of our experiments yield no gain, but quite often one learns more from failure than from success.

One of the fruitless but illuminating experiments we did lately is term pruning. We observed experimentally that 52% of terms occur in only one document, and that excluding terms that occur only once had no influence on the precision of our recommendations. Our recommendation engine is computationally very demanding, and making it more efficient is a never-ending process. A chance to prune 52% of terms seemed quite promising for increasing the performance of our engine and reducing index size.

Our recommendation engine is based on Apache Lucene/Solr. At a recent Lucene EuroCon conference, Andrzej Bialecki presented a Lucene patch that provides an easy tool for index manipulation. Using this tool we removed all terms occurring in only one document, along with all postings and payloads belonging to such terms. It turned out that the efficiency of our engine did not change, and the index size decreased only slightly (by 1.5%).

In our opinion, this experiment has shown that Lucene is very efficient at storing terms and associated term data (postings & payloads), and that the presence of rarely used terms in the index is not a concern.
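
To illustrate what “terms occurring in only one document” means, here is a small Python sketch on a toy corpus. It has nothing to do with Lucene’s index format or with the patch mentioned above; the function names and the three sample documents are invented for the example.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for every term, the number of documents it appears in."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return df


def prune_rare_terms(docs, min_df=2):
    """Drop every term that appears in fewer than min_df documents."""
    df = document_frequencies(docs)
    keep = {term for term, count in df.items() if count >= min_df}
    pruned = [[t for t in tokens if t in keep] for tokens in docs]
    return pruned, df


# Toy corpus: each document is already tokenized.
docs = [
    "semantic web linked data".split(),
    "linked data tools".split(),
    "web data zemanta".split(),
]

pruned, df = prune_rare_terms(docs)
singleton_share = sum(1 for count in df.values() if count == 1) / len(df)
print(singleton_share)
print(pruned)
```

In this toy corpus half of the distinct terms are singletons, yet they account for only a small part of each document, which hints at why dropping them saves so little space.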

[This post was originally published at The Unreasonable Effectiveness of Data blog]

Comments: (3)

Semantic Tech Conference report

Posted by andraz, under Uncategorized on June 24th, 2010

Here I am, for the third time at the SemTech conference! Having some perspective, maybe a bit of reflection is due.

Has the topic evolved over the years? People in attendance are mostly from big enterprises, DoD contractors and lots and lots of providers of semantic stack solutions of all kinds, mostly for enterprises. The average age is pretty old compared to any other tech conference I visit. The topic stays mostly the same, with some direction toward slightly more non-enterpriseish stuff.

During all these years there has been a small number of companies that don’t declare themselves as capital-letter Semantic Web, but were eager to be adopted as champions of the conference. In 2008 there were a few notable companies presenting: PowerSet, Twine and Siri (named Stealth-Company at the time).

There’s an interesting contrast lurking here. There seems to be less and less (none?) VC activity in the pure Semantic Web space: except for the persistent Mark Greaves, I haven’t seen any other VCs or angels at the conference this year. On the other hand, some companies that just used ‘semantic web thinking’ as part of solving users’ problems have done rather well. Siri exited to Apple in April (200M+), and PowerSet exited to Microsoft in 2008 (100M for that deal seemed low to some, but with the luxury of hindsight it was perfect timing, as PowerSet on its own would have had a very hard time). Twine seems to have failed and was eaten by Evri, with which it shared its main investor. There were probably more startups from that year that somehow faded; the point, however, is that commercial success is possible with the right (non-ideological?) approach.

Linked Data (both open and enterprise) is making the rounds and there seems to be some genuine progress there (for example its use at the BBC, in the ViewChange player, etc). I’ve just sat through the Infochimps and Factual panel led by Josh Dilworth, and it seems people are now genuinely interested in linking their data with external databases. Freebase is integrating many third-party sources too, OpenCalais is touting great SEO results from adding semantic information, etc. Luckily, in most cases the semantic stuff stays hidden from end users, as it does with Zemanta.

In the next post… more thoughts on my (controversial?) presentation: Semantic Web User Interfaces – Do They Have To Be Ugly? Amazingly, it received about a hundred retweets!

Comments: (0)

LinkTV evaluates entity extraction APIs, Zemanta comes out on top

Posted by andraz, under api on May 19th, 2010

This was already posted on our main blog, but since it is fairly technical news, I am reposting it here:

Some interesting news today! Recently LinkTV did a comprehensive study on the quality of semantic APIs and Zemanta came out on top in many respects! Here’s the conclusion on the Entity Extraction API:

In terms of quality and quantity of disambiguated Named Entities returned in these tests, Zemanta was the clear leader of the NLP API field. OpenCalais also returned highly relevant terms, but was lacking in disambiguation features. Of the other APIs tested, only AlchemyAPI returned quality, disambiguated results, though the quantity of entities returned was low.

and on the Related Content API:

Zemanta and Daylife both provide unique API features, resulting in (the potential for) higher quality results than the other article/news APIs tested here.

Link TV
Image via Wikipedia

What’s the origin of the study? LinkTV is preparing a new video player named ViewChange and they did this study to figure out which text extraction API to use. They also evaluated other types of analysis such as news and image search.

Check out the full Entity Extraction & Content API Evaluation! Their conclusions might spare you lots of work.

Comments: (2)

Map-reduce with Disco

Posted by Tomaž Šolc, under Uncategorized on April 29th, 2010

Note: Reposted from the personal blog of Tomaž Šolc.

One of the features of the Zemanta API is image suggestion. For example, if you are writing an article about bowls of petunias, Zemanta will suggest nice photographs of that particular kind of plant life to go with your post.

Some of those suggested images come from English Wikipedia and Wikimedia Commons. And since computer vision just isn’t there yet, Zemanta’s back-end can only learn about the content of the images from text and various other machine-readable data related to each image.

The problem is that while usable data is relatively abundant, it is scattered around the English and Commons wikis. It’s present directly or in various more or less complicated templates that appear on image description pages. Articles that include an image, and the captions they use, also hold clues to the image content. Sometimes it’s even necessary to go through five or more jumps before you can connect a useful piece of metadata, for example between an article including a “Wikimedia Commons has more media related to” box and the actual picture.

Previously, this data extraction was performed in a traditional way: a series of Python scripts read the dumps provided by Wikimedia and stored the information in various MySQL tables. In its last incarnation, this system took around 30 hours to process both wikis. This might not seem like much, but with every new dump something inevitably breaks. Perhaps a critical template has been renamed, or a minor markup change has exposed an odd bug in a Python script. This meant that often two or three sessions were required and a new dump quickly consumed a week’s worth of work.

Old image processing system using Python scripts and MySQL tables.

Two weeks ago I replaced this monster with a new one built upon the map-reduce paradigm that’s all the rage these days. It performs the job in a little over 2 hours and uses the Disco framework. This basically trades indexed MySQL table access (lots of expensive hard-disk seeks) for multiple passes over sorted flat files. These use sequential reads and are thus much faster. In practice, however, there’s still a lot of disk seeking going on, because Disco will sort these huge files on disk (actually using GNU sort behind the curtains). But obviously the performance improvement is still significant.
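For readers unfamiliar with the pattern, here is a minimal sketch of the map, sort and reduce steps in plain Python. It deliberately does not use the Disco API; the record layout, the sample data and the names `map_phase`, `reduce_phase`, `caption_mapper` and `join_captions` are assumptions made up for illustration, and the in-memory `sorted` call stands in for the on-disk sort described above.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records, mapper):
    """Run the mapper over every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def reduce_phase(pairs, reducer):
    """Sort pairs by key, then reduce each run of equal keys in one pass."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, reducer(key, [value for _, value in group])


def caption_mapper(record):
    """Emit the caption an article uses, keyed by the image it describes."""
    article, image, caption = record
    yield image, caption


def join_captions(image, captions):
    """Collect all distinct captions seen for one image."""
    return sorted(set(captions))


# Hypothetical (article, image, caption) records, not real dump data.
records = [
    ("Petunia", "Petunia_bowl.jpg", "a bowl of petunias"),
    ("Garden", "Petunia_bowl.jpg", "petunias in a garden"),
    ("Whale", "Sperm_whale.jpg", "a sperm whale"),
]

result = dict(reduce_phase(map_phase(records, caption_mapper), join_captions))
print(result)
```

Because the reducer only ever sees one sorted run of keys at a time, the same logic works on flat files read sequentially, which is where the speed-up over indexed table lookups comes from.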

I should also mention that Disco took some significant hacking before it became useful, so I can’t really say it’s a mature solution.

New image processing system using map-reduce.

As far as the complexity of Zemanta’s system goes, it hasn’t gone down either. The part of the code that was directly affected by this change went from around 560 lines of Python to a bit over 1100. On the other hand, the boxes in the graph above (representing individual map-reduce jobs) are now much more separated from each other. I guess only time will tell if this will be easier or harder to maintain. One thing is certain: the development cycle has become much faster.

Finally, the time improvement of an order of magnitude starts to look way less impressive when you take into account that the old system used one CPU on one machine while the new one takes two machines with 12 CPUs each. But I guess that doesn’t matter if you have processors sitting idle in the rack.
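A back-of-the-envelope calculation puts rough numbers on that. The sketch below rounds "around 30 hours" and "a little over 2 hours" to 30 and 2, and assumes (probably too pessimistically) that all 24 CPUs are busy for the whole run, so the CPU-hour figure for the new system is an upper bound rather than a measurement.

```python
# Rounded figures from the text; the full-utilisation assumption is mine.
old_hours, old_cpus = 30, 1
new_hours, new_cpus = 2, 2 * 12  # two machines with 12 CPUs each

wall_clock_speedup = old_hours / new_hours  # how much sooner the job finishes
old_cpu_hours = old_hours * old_cpus        # total compute used by the old system
new_cpu_hours = new_hours * new_cpus        # upper bound for the new system

print(wall_clock_speedup, old_cpu_hours, new_cpu_hours)
```

Under these assumptions the job finishes about 15 times sooner, but may use more total CPU time than before.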

Comments: (1)