Tuesday, June 5, 2012

TECH SPECIAL: All the World is a Hard Drive



The spectacular technological and business innovations in data storage hide unheralded stars, like Amazon’s data centres and Google’s cheap hard disks, as well as tricky questions, for example, on whether an Indian user is liable under US laws. Here’s a quick guide to an ongoing revolution

    On the evening of April 11, 2012, Mike Krieger, the co-founder of Instagram, gave a talk in San Francisco. What he must have known, though his audience didn’t, was that in less than 24 hours, Mark Zuckerberg would announce that he was buying Instagram for a billion dollars.
Given that he was about to become A Very Rich Man, Krieger hid his excitement well. The talk, called “Scaling Instagram”, was a long and technical one about the challenges of growing the popular photo app. One of the final slides (number 176 of 185) had just two words: “Unprecedented Times”. The next one said: “2 backend engineers can scale a system to 30+ million users.” By the eve of the Facebook deal, Instagram had bumped that number up to five engineers.
If this doesn’t sound quite as remarkable as it should, it’s because we are now inured to Silicon Valley startup stories originating in garages with a couple of geeks, who end up making millions — or a billion. Look at this ‘growth’ in a different way then: it’s as if Steve Jobs and Steve Wozniak built the first Apple computer and within a day or two of building that first prototype, were able to ship several thousand models to customers.
So you might retort, of course, that software is different. You can ‘scale’ bits and bytes in a way you can’t with metal or glass or plastic. Fair enough.
    So imagine this scenario: you create a piece of software which allows millions of users to upload millions of photographs, and to tweak them and tag them and share them. You need someplace to store those photographs, and you need to be able to handle thousands or millions of users swarming over your website or app every day without it going down.
    In a world where competition is intense, users will simply dump you if your app slows down or freezes. So your ‘downtime’ has to be pretty much zero. For users, whether they are in New York, Tokyo, Ankara or Mumbai, you have to be always up and always running, 24 hours a day.
    One of the great, and less-than-heralded-except-in-the-tech-press, innovations of the past decade or so has been the growth of a huge infrastructure of hardware and software which enables a startup like Instagram to create a simple app and have it service millions of users from day one, without spending a small fortune on servers and technicians to maintain them, not to mention the space to keep them.
    Take Pinterest, another ‘hot’ social media app. It handles 18 million users and 410 terabytes of data (that’s over 4 lakh GB), and, as of December 2011, had all of 12 people. How do they do it?
Amazon’s Hidden World
Amazon is better known to the vast majority of us as the world’s largest online retailer, but to the tech community it is also the equivalent of an electric utility. Both Instagram and Pinterest installed and ran their software on Amazon’s ‘cloud’ computing platform.
    All their data is stored on servers and in data centres (essentially vast warehouses with hundreds if not thousands of servers loaded with hard disks) owned and operated by Amazon and rented to companies like Instagram and Pinterest by the hour.
    But Amazon provides not just storage but applications that companies can run in the ‘cloud’ as well. It’s as if you had nothing but a keyboard, a screen, a mouse, and an internet connection, but could run Windows and MS Office without noticing the difference. So ubiquitous has Amazon become that, by one estimate, one in three internet users visits a site run off Amazon’s cloud service at least once every day.
    “The cloud has enabled us to be more efficient, to try out new experiments at a very low cost, and enabled us to grow the site very dramatically while maintaining a very small team,” Pinterest operations engineer Ryan Park told a conference in New York last month.
    “Imagine we were running our data centre, and we had to go through a process of capacity planning and ordering and racking hardware. It wouldn't have been possible to scale fast enough,” he told the conference, according to Techworld.com.
    The key here, though, is not just the ability to rent hardware and software rather than buy them, but the way Amazon prices that service, and what this enables companies like Pinterest to do. Customers pay only for what they use, by the hour. So, as Park pointed out, Pinterest pays about $52 an hour to Amazon during peak hours of the day, and about $15 during the night, when traffic on the app is lower (most of its customers are in the US).
    There are other services that Pinterest uses that add up to a few hundred dollars per month, but even then, paying that much to service 18 million users isn’t a bad deal. And the ability to scale up and down as needed lets Pinterest try out new services easily and at low cost. On the downside, of course, and just like an electricity grid, if Amazon’s cloud services were to be hit by technical snags, large chunks of the Net could go ‘dark’.
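The pay-by-the-hour arithmetic is simple enough to sketch. Here is a minimal illustration using the hourly rates Park quoted; the 12-hour peak/off-peak split and the 30-day month are assumptions made purely for illustration, not figures from Pinterest:

```python
# Illustrative sketch of pay-per-use cloud billing, using the hourly
# rates Ryan Park quoted. The 12-hour peak/off-peak split and 30-day
# month are hypothetical assumptions for this example.
PEAK_RATE = 52.0        # dollars per hour during daytime peaks
OFF_PEAK_RATE = 15.0    # dollars per hour at night
PEAK_HOURS_PER_DAY = 12
OFF_PEAK_HOURS_PER_DAY = 12
DAYS_PER_MONTH = 30

def monthly_compute_bill():
    # With hourly billing, the monthly cost is just hours used times rate,
    # so the bill falls automatically whenever traffic (and usage) falls.
    daily = (PEAK_RATE * PEAK_HOURS_PER_DAY +
             OFF_PEAK_RATE * OFF_PEAK_HOURS_PER_DAY)
    return daily * DAYS_PER_MONTH

print(monthly_compute_bill())  # 24120.0 under these assumed hours
```

The point of the exercise: under a buy-your-own-servers model, capacity has to be provisioned for the peak; under hourly billing, the off-peak hours cost less than a third as much.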
    The other facet of this cheap, pay-for-the-gigabytes-you-need world is, of course, cloud storage for consumers. When Google launched its much-anticipated 5-GB-free storage plan earlier this month — the G-Drive — it jumped onto an already crowded bandwagon occupied by the likes of Box, Dropbox, Mozy and others.
    Today’s free online storage market is such that you could get up to 72 GB of free online storage by signing up with different online providers — and that’s using only one email ID and sticking to some of the bigger players in the market. Incidentally, a major Amazon customer is Dropbox — so when you store your files on the popular cloud service, chances are that you are storing them on a hard disk owned by Amazon in one of its data centres in the US.
Cloud in India
The use of cloud storage, and its logical extension, actually opening the files you store and working with them online, is now widespread. Google for instance, offers businesses its Google Apps for Business, where for $5 per user per month, you get email, data storage and a standard office suite.
    Microsoft, for long dominant on the desktop, now offers Office 365, where a business can rent MS Office, along with email, storage and conferencing starting from about $6 per month. “Customers have seen significant cost savings from the use of Office 365,” claims Sanjay Manchanda, who heads Microsoft’s Information Worker Division in India. “Businesses save on power costs, software costs and the need to buy physical servers.” However, the company did not share details about the number of customers who have adopted the product in India.
    “In India, the biggest customers for cloud services come from technology and banking,” says Dharanibalan Gurunathan, executive, offerings management & development, global technology services, IBM. “They constitute 60-70% of the market.”
    Technology consultancy IDC forecasts that the Indian ‘public’ cloud market (where companies use storage and applications provided by others such as Amazon or IBM) will exceed $2.5 billion by 2015.
    T Srinivasan, managing director (India and SAARC) for IT services company VMware, says, “It’s the choice between buying all that IT infrastructure versus renting it.” VMware’s clients include Chitale Dairy in Maharashtra, which has about 400 employees and produces about 4,00,000 litres of milk a day. The company has just three physical servers, which run 20 ‘virtual’ servers on VMware’s cloud platform.
    But there are limits of course — in India for instance, the quality of bandwidth is an issue for any company wishing to run all its office software off a Google server in another continent, though that has become less of a constraint over the years.
Of Safety & Laws
The bigger problems relate to legal and security concerns, as well as a simple need to ensure that when it comes to vital infrastructure, you alone are in charge and in control. Maruti for instance, runs its own cloud platform for its dealers across 1,700 locations. It is also evaluating the need to outsource some of its IT needs and functions to a ‘public’ cloud.
    “The financial savings seem fairly attractive,” says Rajesh Uppal, chief information officer. “However, our core applications such as those relating to the shop floor will likely stay internal,” he says. “These are critical functions which support decisions that need to be made on a split second basis.”
    The concern with security is high for sectors such as banks since the data they store — financial transactions by customers — is about as sensitive as it gets. “We don’t do anything in the public cloud,” says Anil Jaggia, CIO of HDFC Bank.
    “More than large enterprises, public cloud in India is likely to get more traction and interest from medium and small enterprises, especially startups”, he says. For banks, it is not just the security of customer data which is an issue, but also the legality of it. Some countries and regions (such as the European Union) have strong restrictions, contained in privacy laws, on what kinds of data can be stored outside their borders.
    The term ‘cloud’ storage, or ‘cloud’ computing, is indeed especially misleading when it comes to the legal ground realities. All data is stored somewhere, on some type of physical media, in some part of the world.
    In the case of Dropbox for instance, the company stores customer files on Amazon’s servers “in multiple data centers located across the United States”. This brings your data within the ambit of US laws such as the Patriot Act.
    In its terms of service that every customer signs on to, Dropbox makes it clear that it will hand over customer files to law enforcement “to comply with a law, regulation or compulsory legal request”. Even if you have nothing to hide and don’t particularly care if an American agent flips through your party photos from last night, or reads through your monthly bank statement, it can be annoying, as the Megaupload case showed earlier this year.
    When Megaupload was shut down by US law enforcement, on charges that it stored a large volume of pirated content, many users were denied access to their own personal, non-pirated data and files as well. The Indian government, like many others, regularly files data requests with Google and others for access to customers’ email and other accounts.
    There is another set of companies, apart from banks, for whom data storage is a core function: the biggies of the tech universe — Google and Facebook (among others). For each of these companies, the data they collect about you from your emails, posts and files is a core competitive advantage. And another of the little-recognised but huge technical innovations of the past decade or so has been how companies such as Google manage the millions of gigabytes they store.
Google’s Cheap Machines
Try opening a 60-MB spreadsheet file (if you happen to have one lying around) on your laptop. If your computer is towards the middle of what’s available in the market in terms of power, it’s quite likely that there’s a wait of at least five to ten seconds (quite likely more), before the file loads fully. And no wonder — a 60-MB spreadsheet file can quite easily contain up to a million rows of data.
    Yet in those 10 seconds or so, Google serviced a few hundred thousand, or a million or more search requests. What’s surprising is not that it does this, but the kind of hardware it uses.
    Stephen Arnold, author of a series of studies on Google wrote in 2007: “The hardware in a Google data center can be bought at a local computer store. Google uses the same types of memory, disc drives, fans and power supplies as those in a standard desktop PC.” Google strings together hundreds of thousands of pieces of ‘commodity’ hardware across dozens of data centres across the world.
    The call to use ‘commodity’ hardware rather than top-of-the-line machines is as much a business decision as it is a technical one, and it was made very early on. With the exponential growth in the number of users and the size of the web Google has had to index over the years, the costs of high-end hardware, even for a company buying in bulk, would have quickly spiralled out of control.
    So all of Google’s hardware, or at least most of it, had to be cheap, standardised and widely available from multiple manufacturers — a bonus is that the cost of such hardware has fallen sharply in the past decade.
    Holding the hardware together is a remarkable feat of software programming — a technical innovation that is possibly as important to Google’s success as a company as the famous PageRank system that Larry Page and Sergey Brin developed.
    “It is important to keep in mind that PageRank is important only because it can run quickly in the real world, not in a sterile computer lab illuminated with the blue glow of supercomputers,” says Arnold.
    Hedged in with dozens of patents, the exact details of how this software-cum-hardware infrastructure works is still a secret. It’s not even clear how many data centres Google has around the world. But through technical papers published by Google insiders, researchers have been able to get a fairly deep insight into how the system works. Incidentally, Facebook too uses the same type of hardware as Google in its data centres.
Google to Hadoop
When a user types in a search term, the task is farmed out to not one but thousands of machines. As Baseline, a trade magazine, described it in a study of Google’s technology, asking a single person to search out all the occurrences of a term in a magazine would take a long time. But farm that task out to hundreds of people, each searching through one page, and the time taken to get a result falls sharply.
    But the metaphor can be taken further. If one of those persons drops out, their work can be reassigned to one of the others. Since commodity hardware can and does fail quite often, Google designs its software specifically to work around those failures.
    In 2006, Douglas Merrill, then CIO of Google, told a conference (quoted in Baseline) that at then-prevalent market conditions, “I can get about a 1,000-fold computer power increase at about 33 times lower cost if I go to the failure-prone infrastructure. So if I can do that, I will.”
    The broad principle is to take a task (like an individual search), break it down into smaller tasks, have hundreds if not thousands of individual computers chew away at those smaller tasks, put the results together and serve them up to the user.
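That broad principle — map the work out, group the intermediate results, reduce them to an answer — can be sketched in a few lines of Python. This is only a toy word count run on a single machine, not Google’s actual code; the names and the two-page example are invented for illustration:

```python
from collections import defaultdict

# A toy map-shuffle-reduce word count, illustrating the principle the
# article describes: split a task into pieces, process the pieces
# independently, then combine. Real systems (Google's MapReduce, Hadoop)
# run each phase across thousands of machines, and reassign a piece to
# another machine if the one working on it fails.

def map_phase(page):
    # Each "worker" emits (word, 1) pairs for its own page of text.
    return [(word.lower(), 1) for word in page.split()]

def shuffle(mapped):
    # Group all the counts emitted for the same word together.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # Sum each word's counts to produce the final result.
    return {word: sum(counts) for word, counts in groups.items()}

pages = ["the cloud is a hard drive", "the drive is elsewhere"]
mapped = [pair for page in pages for pair in map_phase(page)]
totals = reduce_phase(shuffle(mapped))
print(totals["the"])  # 2: counted once per page, then summed in reduce
```

Because each page is mapped independently, a failed worker costs only its own page of work, which is exactly why cheap, failure-prone hardware becomes viable.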
    Such a brief description doesn’t begin to describe this system that Google built. But the way Google has managed the millions of gigabytes that it stores has inspired a range of other software projects, such as the open source Hadoop, which is specifically used to handle enormous (millions of GB) sets of data. Companies who need to process such volumes of data (such as pharma companies doing drug research) can use Amazon to store all that data and Hadoop to process it.
    Hadoop is just one element of the continuing revolution in data management, just as G-Drive and Dropbox and others represent the consumer side. Expect more radical innovations — and more tricky questions.

:: Avinash Celestine

