<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Big Data on keithrozario.com</title><link>https://keithrozario.com/categories/big-data/</link><description>Recent content in Big Data on keithrozario.com</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sun, 05 Apr 2015 20:00:17 +0000</lastBuildDate><atom:link href="https://keithrozario.com/categories/big-data/index.xml" rel="self" type="application/rss+xml"/><item><title>MDeC Private Meeting with ODI</title><link>https://keithrozario.com/2015/04/mdec-meeting-with-odi-open-data/</link><pubDate>Sun, 05 Apr 2015 20:00:17 +0000</pubDate><guid>https://keithrozario.com/2015/04/mdec-meeting-with-odi-open-data/</guid><description>&lt;p style="text-align: justify;">
&lt;p>&lt;img
 src="https://keithrozario.com/uploads/Mdec-Logo.jpg"
 alt="Mdec-Logo"
 
 loading="lazy"
 />&lt;/p>
&lt;p>Earlier this week I attended an MDeC-organized private meeting with Richard Stirling from the &lt;a title="Open Data Institute" href="http://opendatainstitute.org/" target="_blank" rel="noopener noreferrer">Open Data Institute&lt;/a> (ODI). The ODI is an institution that hopes to promote an &amp;lsquo;open data&amp;rsquo; culture, and it was co-founded by a giant of the tech world, Sir Tim Berners-Lee, whom you might remember for inventing a small little thing we call the World Wide Web.&lt;/p>&lt;/p>
&lt;p style="text-align: justify;">The meeting was attended by just a handful of folks, some of whom I recognized from a previous &lt;a title="Seatti" href="http://www.seatti.org/" target="_blank" rel="noopener noreferrer">Seatti&lt;/a> conference, and both the audience and the discussion were focused on Open Data (and Big Data) in Malaysia.&lt;/p>
&lt;p style="text-align: justify;">The conversation was really good, and broadly speaking touched on 3 key topics. Most of this post is a re-hash from my failing and aged memory, but there's a clearer version of the minutes &lt;a title="Sinar Project Meeting with ODI" href="https://sinarproject.hackpad.com/Meeting-ODI-with-stakeholders-KdImFSLAUvr" target="_blank" rel="noopener noreferrer">here&lt;/a> from the amazing people of Sinar Malaysia if you're interested in the specifics.</description></item><item><title>MyProcurement: All government tenders in one Excel file</title><link>https://keithrozario.com/2014/09/myprocurement-all-government-tenders-in-one-excel-file/</link><pubDate>Tue, 16 Sep 2014 22:54:40 +0000</pubDate><guid>https://keithrozario.com/2014/09/myprocurement-all-government-tenders-in-one-excel-file/</guid><description>&lt;p>&lt;img
 src="https://keithrozario.com/uploads/MyPROCUREMENT-Pusat-Maklumat-Perolehan-Kerajaan.png"
 alt="MyProcurement"
 
 loading="lazy"
 />&lt;/p>
&lt;blockquote>&lt;span style="color: #99ccff;">I updated this post on 31-Mar-2015 to incorporate the latest changes and to provide more up-to-date info on the procurement database. I've left everything else intact.&lt;/span>&lt;/blockquote>
Happy birthday Malaysia!! Just how awesome is our country, that we celebrate an Independence Day AND a Malaysia Day, not to mention 2 New Year's Days (or 3 if you count Awal Muharram).
&lt;p>So on that note, I decided to use my IT skills for the good of the country.&lt;/p>
&lt;p>To be honest, my IT skills have never been up to par; my day job is more managing/planning/documenting than actual execution of &amp;lsquo;real&amp;rsquo; IT work. But it was good for me to dust off the ol&amp;rsquo; programming fingers and learn Python to grab some publicly available information and make it more accessible to the less IT-centric members of society.&lt;/p>
&lt;p>Since I had limited time and sub-par skills, I decided to set my sights low and aim to extract all the data from the Malaysian &lt;a title="Myprocurement" href="http://myprocurement.treasury.gov.my/" target="_blank">MyProcurement&lt;/a> portal, which houses all the results of government tenders (and even direct negotiations) in one single website for easy access. The issue I had with the portal, though, was that it only displayed 10 records at a time from its 10,000+ record archive. There was no way to develop insights into the data from the portal directly; you had to extract it first, and the portal provider did not offer a raw data dump to do this.&lt;/p>
&lt;p>So I wrote a simple Python script to extract all the data, and prettified the data in Excel offline. The result is a rather mixed one.&lt;/p>
&lt;p>I was happy that I could at least see which Ministries or Government departments gave out the most contracts, and what the values of those contracts were. The Excel spreadsheet has more than 10,000 tenders with a cumulative value of RM35 billion going back to 2009. The data allowed me to figure out which Ministry gave out the most contracts, and which contracts had the highest and lowest values (including one for RM0.00, and one for just RM96.00). All in all, it was quite informative.&lt;/p>
&lt;p>&lt;img
 src="https://keithrozario.com/uploads/Results_by_ministry_hu_7097349aaa97625.png"
 srcset="
 /uploads/Results_by_ministry_hu_18d57172f201e097.png 480w,
 /uploads/Results_by_ministry_hu_7097349aaa97625.png 768w,
 /uploads/Results_by_ministry_hu_c1117c7aa1d4aab4.png 1024w,
 /uploads/Results_by_ministry.png 1165w"
 sizes="(max-width: 480px) 480px, (max-width: 768px) 768px, (max-width: 1024px) 1024px, 100vw"
 alt="Results_by_ministry"
 
 loading="lazy"
 />&lt;/p></description></item><item><title>The root cause of crime</title><link>https://keithrozario.com/2013/07/the-root-cause-of-crime/</link><pubDate>Wed, 10 Jul 2013 13:04:19 +0000</pubDate><guid>https://keithrozario.com/2013/07/the-root-cause-of-crime/</guid><description>&lt;p>Crime has become a hot-button topic these days, and while a lot of finger-pointing and blame-shifting has been going on in political circles, I think it would be wise to take a step back and try to address the root problem rather than its symptoms.&lt;/p>
&lt;p>A brilliant piece by Evgeny Morozov in Slate points out the following:&lt;/p>
&lt;p>Forget terrorism for a moment. Take more mundane crime. Why does crime happen? Well, you might say that it&amp;rsquo;s because youths don&amp;rsquo;t have jobs. Or you might say that&amp;rsquo;s because the doors of our buildings are not fortified enough. Given some limited funds to spend, you can either create yet another national employment program or you can equip houses with even better cameras, sensors, and locks. What should you do?&lt;/p></description></item><item><title>Data guys versus Lawyers and Politicians</title><link>https://keithrozario.com/2012/11/data-guys-versus-lawyers-and-politicians/</link><pubDate>Thu, 22 Nov 2012 08:00:59 +0000</pubDate><guid>https://keithrozario.com/2012/11/data-guys-versus-lawyers-and-politicians/</guid><description>&lt;p>&lt;img
 src="https://keithrozario.com/uploads/Presidential-Election-predictions-300x147.jpg"
 alt="Presidential Election predictions"
 title="Presidential Election predictions"
 loading="lazy"
 />&lt;/p>
&lt;p>&lt;a title="Nate Silver" href="http://en.wikipedia.org/wiki/Nate_Silver" target="_blank">Nate Silver&lt;/a> is currently the internet darling of the big data folks. Not only did he accurately predict the outcome in 49 states for the US presidential election, he also correctly pointed out that Florida would be a toss-up before eventually leaning towards Obama. That&amp;rsquo;s like predicting a coin toss would land on its side. While all of that may seem remarkable, this isn&amp;rsquo;t a story of a boy genius but rather the dawn of a new age&amp;ndash;an age driven by data.&lt;/p>
&lt;p>Nate isn&amp;rsquo;t alone in this: &lt;a title="2012 election predictions" href="http://www.slate.com/articles/news_and_politics/politics/2012/11/pundit_scorecard_checking_pundits_predictions_against_the_actual_results.html" target="_blank">Slate reports that 2 different pundits got the entire analysis spot-on as well&lt;/a>, not to mention a &lt;a title="3 Man team in NC correctly predict the Elections" href="http://www.businessinsider.com/ppp-election-prediction-nate-silver-obama-romney-2012-11" target="_blank">3-man team in North Carolina&lt;/a> armed only with robo-callers who also made a spot-on prediction. This isn&amp;rsquo;t some savant ability that Nate has; it is just pure hard-core science at work. The people who use the science are the ones making the accurate predictions, while the people who ignore it are left behind.&lt;/p>
&lt;p>And just who got left behind? The usual opinion writers, like Ann Coulter, who predicted Romney would win by a 273-265 margin; Newt Gingrich, who predicted an even bigger margin for Romney; and of course Jim Cramer, who predicted such an insane number that it probably isn&amp;rsquo;t even worth typing here&amp;ndash;but I&amp;rsquo;ll type it anyway. Good ol&amp;rsquo; Jim predicted Obama would win by a &lt;a title="Jim Cramer margin prediction" href="http://www.businessinsider.com/jim-cramer-explains-why-hes-calling-a-blowout-for-obama-2012-11" target="_blank">whopping 440-98 margin&lt;/a>, off by more than 100 points&amp;hellip;&lt;em>but at least he got the winner right, and I&amp;rsquo;m sure predicting the stock market isn&amp;rsquo;t anything like predicting an election, and 100 points means nothing in the stock market.&lt;/em>&lt;/p>
&lt;em></description></item><item><title>Answering the tough questions: Watson vs. Humans</title><link>https://keithrozario.com/2012/08/answering-the-tough-questions-watson-vs-humans/</link><pubDate>Thu, 30 Aug 2012 08:05:19 +0000</pubDate><guid>https://keithrozario.com/2012/08/answering-the-tough-questions-watson-vs-humans/</guid><description>&lt;p>&lt;img
 src="https://keithrozario.com/uploads/17jeopardy_337-span-articleLarge-300x165.jpg"
 alt=""
 title="17jeopardy_337-span-articleLarge"
 loading="lazy"
 />&lt;/p>
&lt;p>IBM have always been on the cutting edge of innovation. They&amp;rsquo;ve moved from being merely a computer company to what is probably the first truly all-encompassing &lt;strong>technology company&lt;/strong>: they don&amp;rsquo;t just make fancy gadgets or shiny thingamajigs, they make actual solutions for real-world problems.&lt;/p>
&lt;p>In 1996, IBM introduced the world to Deep Blue. Kasparov met Deep Blue and wasn&amp;rsquo;t impressed. He had no reason to be: he defeated Deep Blue 4-2 and walked away comfortably.&lt;/p>
&lt;p>However, in 1997, IBM re-introduced the world to the 2nd version of Deep Blue (unofficially named Deeper Blue), and this time Kasparov was beaten&amp;ndash;but not by much. Kasparov is the Tiger Woods, Pele and Michael Jordan of the chess world, and he was beaten by a supercomputer with 11.38 GFLOPS of power.&lt;/p>
&lt;p>It turns out, though, that we had nothing to be afraid of. Chess is after all a pretty simple game when you break it down: the number of possible moves is finite, as is the number of possible scenarios to play out. It&amp;rsquo;s not an easy game to master, but as it turns out, playing chess is infinitely easier than just plain talking.&lt;/p>
&lt;p>In fact, of all the talking games, Jeopardy seems the most difficult. At the end of this post, I will make an argument to show that Jeopardy &amp;ndash; a simple talking game &amp;ndash; is about 6,500 times more difficult than chess (a game we often associate with genius). Turns out Kasparov has to bow to Ken Jennings.&lt;/p></description></item><item><title>Is MAS updating its own Wikipedia page?</title><link>https://keithrozario.com/2012/07/malaysian-airlines-wikipedia-mas/</link><pubDate>Thu, 19 Jul 2012 04:26:32 +0000</pubDate><guid>https://keithrozario.com/2012/07/malaysian-airlines-wikipedia-mas/</guid><description>&lt;p>&lt;img src="http://farm8.staticflickr.com/7267/7510596294_e1c737c963_n.jpg" alt="9M-MPL Boeing 747-400 MAS" />&lt;/p>
&lt;p>Continuing my series on big data and Google BigQuery, I&amp;rsquo;ve decided to share a rather interesting snippet of information regarding our very own Malaysia Airlines and their Wikipedia page.&lt;/p>
&lt;p>First, just to illustrate how important Wikipedia is in general, the &lt;a title="Malaysian Airlines Wikipedia Traffic" href="http://stats.grok.se/en/latest/Malaysia_Airlines" target="_blank">Malaysia Airlines Wikipedia page gets roughly 30,000 hits per month&lt;/a>. That&amp;rsquo;s just one page of Wikipedia getting more hits than my entire website; I can&amp;rsquo;t tell you how frustrated that makes me.&lt;/p>
&lt;p>Having a negative-sounding Wikipedia page is pretty bad for business, particularly if 30,000 potential customers view it every month. That&amp;rsquo;s a web page that needs some serious attention if you&amp;rsquo;re the marketing manager of Malaysia Airlines.&lt;/p>
&lt;p>Unfortunately for MAS (and every business organization there is), Wikipedia has a policy about updating your own Wikipedia page&amp;ndash;&lt;strong>you&amp;rsquo;re not allowed to do it&lt;/strong>. Wikipedia has to keep to its original intention of being an online repository of information that is fair, balanced and neutral. Having marketing gurus or corporate bigwigs updating their own Wikipedia entry isn&amp;rsquo;t exactly in anyone&amp;rsquo;s best interest. However, Wikipedia doesn&amp;rsquo;t strictly enforce the policy and leaves it up to the crowd.&lt;/p>
&lt;p>Fortunately, the crowd have responded: sites like &lt;a title="Wikiscanner" href="wikiscanner.virgil.gr" target="_blank">WikiScanner&lt;/a> allow users to see which IP addresses updated which Wikipedia articles. Some have gone to the extent of correlating those IP addresses to the owners and &lt;a title="Wikipedia: Seeing red" href="http://www.nytimes.com/2007/08/19/technology/19wikipedia.html?pagewanted=all" target="_blank">determining if companies are updating their own Wikipedia pages against the general guidelines&lt;/a>. Let&amp;rsquo;s see if Malaysia Airlines can join that group of companies who&amp;rsquo;ve been slapped on the wrist for changing the Wikipedia pages of their organizations.&lt;/p></description></item><item><title>Wikipedia from a Malaysian perspective</title><link>https://keithrozario.com/2012/07/who-updates-wikipedia-malaysia/</link><pubDate>Wed, 18 Jul 2012 04:00:34 +0000</pubDate><guid>https://keithrozario.com/2012/07/who-updates-wikipedia-malaysia/</guid><description>&lt;p>&lt;img
 src="https://keithrozario.com/uploads/wikipedia_crowdsourcing.png"
 alt=""
 title="wikipedia_crowdsourcing"
 loading="lazy"
 />&lt;/p>
&lt;p>Wikipedia is quite possibly the greatest repository of information mankind has ever seen. It&amp;rsquo;s built around an amazing concept of letting anyone create, document and moderate information in real time, and so far the concept has proven successful&amp;ndash;some may even argue that it&amp;rsquo;s too successful.&lt;/p>
&lt;p>For the past two days, I&amp;rsquo;ve been writing about &lt;a title="Google bigquery" href="http://www.keithrozario.com/2012/07/google-bigquery-wikipedia-dataset-malaysia-singapore.html">BigQuery&lt;/a> and &lt;a title="What is big data" href="http://www.keithrozario.com/2012/07/what-is-big-data.html">Big Data&lt;/a> in general, and for the most part I&amp;rsquo;ve been using the freely available Wikipedia dataset in BigQuery to perform some queries and analysis. The results were so interesting that they warrant a post of their own&amp;ndash;and this is that post!&lt;/p>
&lt;p>For instance, I was curious who Aiman Abmajid was. For those who aren&amp;rsquo;t following the blog, Aiman is the undisputed King of Wikipedia in Malaysia. Aiman has single-handedly updated Malaysian articles on Wikipedia a mind-blowing 13 THOUSAND times&amp;ndash;and that&amp;rsquo;s just the English articles. That&amp;rsquo;s almost 6 times more than his closest Malaysian rival.&lt;/p>
&lt;p>I was intrigued as to who he was and why he was updating so many Wikipedia entries (some more than 900 times per article), and the more I dug, the more intriguing it got.&lt;/p>
&lt;p>A quick Google search brought me to his Wikipedia page, which led me to the following:&lt;/p></description></item><item><title>What is big data</title><link>https://keithrozario.com/2012/07/what-is-big-data/</link><pubDate>Mon, 16 Jul 2012 07:00:13 +0000</pubDate><guid>https://keithrozario.com/2012/07/what-is-big-data/</guid><description>&lt;p>&lt;img
 src="https://keithrozario.com/uploads/big-data.jpg"
 alt="big-data-getting-bigger"
 title="big-data"
 loading="lazy"
 />&lt;/p>
&lt;p>It&amp;rsquo;s obvious that people have gotten bigger these past few decades; what&amp;rsquo;s less obvious is how much data has grown in the past few years. In fact, 90% of the digital data we have today was created in the last 2 years. Put another way, in 2010 we had just 10% of the digital data we have today.&lt;/p>
&lt;p>In 2011, an estimated 1.2 TRILLION gigabytes of data was created. That&amp;rsquo;s roughly 200GB for every man, woman and child in the world&amp;ndash;in just one year. That&amp;rsquo;s like every person in the world watching almost 300 feature-length films over the year, and this is just the average.&lt;/p>
&lt;p>The reason is simple: we now keep digital records of our transactions (e-banking and credit cards), our running patterns, our spending habits and even our wedding photos&amp;ndash;and that&amp;rsquo;s just commercial end-user applications.&lt;/p>
&lt;p>What about corporations that track thousands of data points per second for their manufacturing plants, and supermarkets tracking the purchases of customers? We&amp;rsquo;re creating and gobbling far more data than before, and the trend doesn&amp;rsquo;t look to be stopping. Every day, we create 2.5 quintillion bytes of data, &lt;strong>so much that 90% of the data in the world today has been created in the last two years alone.&lt;/strong>&lt;/p></description></item></channel></rss>