Sunday, January 15, 2012

Big Data

Big Data: Big News
Facebook And Big Data

After reading this, you will appreciate your Facebook stream just a little more.

O'Reilly Radar: What is big data?
Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. ..... cost-effective approaches have emerged to tame the volume, velocity and variability of massive data. Within this data lie valuable patterns and information ...... Today's commodity hardware, cloud architectures and open source software bring big data processing into the reach of the less well-resourced. ...... analytical use, and enabling new products ...... Being able to process every item of data in reasonable time removes the troublesome need for sampling ...... by combining a large number of signals from a user's actions and those of their friends, Facebook has been able to craft a highly personalized user experience and create a new kind of advertising business. It's no coincidence that the lion's share of ideas and tools underpinning big data have emerged from Google, Yahoo, Amazon and Facebook. ....... The emergence of big data into the enterprise brings with it a necessary counterpart: agility. Successfully exploiting the value in big data requires experimentation and exploration. ........ Input data to big data systems could be chatter from social networks, web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government documents, GPS trails, telemetry from automobiles, financial market data, the list goes on. ....... the three Vs of volume, velocity and variety are commonly used to characterize different aspects of big data. ........ Having more data beats out having better models ...... If you could run that forecast taking into account 300 factors rather than 6, could you predict demand better? ......... Many companies already have large amounts of archived data, perhaps in the form of logs, but not the capacity to process it. ...... 
data warehouses or databases such as Greenplum — and Apache Hadoop-based solutions ...... Apache Hadoop.. places no conditions on the structure of the data it can process. ...... First developed and released as open source by Yahoo, it implements the MapReduce approach pioneered by Google in compiling its search indexes. Hadoop's MapReduce involves distributing a dataset among multiple servers and operating on the data: the "map" stage. The partial results are then recombined: the "reduce" stage. ......... Hadoop is not itself a database or data warehouse solution, but can act as an analytical adjunct to one. ....... A MySQL database stores the core data. This is then reflected into Hadoop, where computations occur, such as creating recommendations for you based on your friends' interests. Facebook then transfers the results back into MySQL, for use in pages served to users. ............ the increasing rate at which data flows into an organization — has followed a similar pattern to that of volume. Problems previously restricted to segments of industry are now presenting themselves in a much broader setting. Specialized companies such as financial traders have long turned systems that cope with fast moving data to their advantage. Now it's our turn. ......... Online retailers are able to compile large histories of customers' every click and interaction: not just the final sales. Those who are able to quickly utilize that information, by recommending additional purchases, for instance, gain competitive advantage. The smartphone era increases again the rate of data inflow, as consumers carry with them a streaming source of geolocated imagery and audio data. ......... The importance lies in the speed of the feedback loop, taking data from input through to decision. ........ you wouldn't cross the road if all you had was a five-minute old snapshot of traffic location. ......... "streaming data," or "complex event processing." ...... 
when the input data are too fast to store in their entirety: in order to keep storage requirements practical some level of analysis must occur as the data streams in. ........ At the extreme end of the scale, the Large Hadron Collider at CERN generates so much data that scientists must discard the overwhelming majority of it — hoping hard they've not thrown away anything useful. The second reason to consider streaming is where the application mandates immediate response to the data. Thanks to the rise of mobile applications and online gaming this is an increasingly common situation. ........ The velocity of a system's outputs can matter too. The tighter the feedback loop, the greater the competitive advantage. ....... Rarely does data present itself in a form perfectly ordered and ready for processing. A common theme in big data systems is that the source data is diverse, and doesn't fall into neat relational structures. It could be text from social networks, image data, a raw feed directly from a sensor source. None of these things come ready for integration into an application. .......... the reality of data is messy. Different browsers send different data, users withhold information, they may be using differing software versions or vendors to communicate with you. And you can bet that if part of the process involves a human, there will be error and inconsistency. ....... Is this city London, England, or London, Texas? By the time your business logic gets to it, you don't want to be guessing. ...... a principle of big data: when you can, keep everything. There may well be useful signals in the bits you throw away. ....... documents encoded as XML are most versatile when stored in a dedicated XML store such as MarkLogic. Social network relations are graphs by nature, and graph databases such as Neo4J make operations on them simpler and more efficient. ....... a disadvantage of the relational database is the static nature of its schemas. 
In an agile, exploratory environment, the results of computations will evolve with the detection and extraction of more signals. Semi-structured NoSQL databases meet this need for flexibility: they provide enough structure to organize data, but do not require the exact schema of the data before storing it. ........ three forms: software-only, as an appliance or cloud-based. ...... IT is undergoing an inversion of priorities: it's the program that needs to move, not the data. .... Financial trading systems crowd into data centers to get the fastest connection to source data, because that millisecond difference in processing time equates to competitive advantage. ...... 80% of the effort involved in dealing with data is cleaning it up in the first place ...... data science, a discipline that combines math, programming and scientific instinct. ...... The art and practice of visualizing data is becoming ever more important in bridging the human-computer gap to mediate analytical insight in a meaningful way. ...... advice to businesses starting out with big data: first, decide what problem you want to solve.
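
The "map" and "reduce" stages described in the Hadoop excerpt above can be illustrated with a small word count. This is only a single-process sketch of the idea, not Hadoop's API: the function names (map_stage, shuffle, reduce_stage, word_count) and the sample chunks are made up for illustration, and a real cluster would run the map stage on many servers at once.

```python
from collections import defaultdict
from itertools import chain


def map_stage(chunk):
    """Map: turn one chunk of the dataset into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the partial results by key so each key is reduced once."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_stage(key, values):
    """Reduce: recombine the partial results for one key."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk stands in for the slice of data held by one server
    # (an assumption of this sketch; here everything runs locally).
    mapped = chain.from_iterable(map_stage(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_stage(key, values) for key, values in grouped.items())


# Hypothetical sample data, split into three chunks.
chunks = ["big data big news", "data moves fast", "big data"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped independently, the map work can be spread across machines, and only the small per-key partial results need to be brought together for the reduce step.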

O'Reilly Radar: Big Data: An opportunity in search of a metaphor
Big data as a discipline or a conference topic is still in its formative years. ...... Long-time data enthusiasts watching with mixed emotions as their interest is legitimized, experiencing a feeling not unlike when a band that you've been following for years suddenly becomes popular. ...... Until recently, data was mainly an artifact of business processes. It now takes center stage; organizationally, data has left the IT department and become the responsibility of the product team. ...... "the new oil," "goldrush" and of course "data mining" ...... "data tornado," "data deluge," "data tidal wave" ..... "data exhaust," "firehose," "Industrial Revolution"
O'Reilly Radar: The feedback economy
Companies that employ data feedback loops are poised to dominate their industries..... winning requires two things: being able to collect and analyze information better, and being able to act on that information faster, incorporating what's learned into the next iteration. ...... In an era of information obesity, we need to eat better. There's a reason they call it a feed, after all. ...... When interactions become digital, they become instantaneous, interactive, and easily copied. It's as easy to tell the world as to tell a friend, and a day's shopping is reduced to a few clicks. ...... Teenagers shun e-mail as too slow, opting for instant messages. The digitization of our world means that trips around the OODA loop happen faster than ever, and continue to accelerate. ...... In a society where every person, tethered to their smartphone, is both a sensor and an end node, we need better ways to observe and orient, whether we're at home or at work, solving the world's problems or planning a play date. And we need to be constantly deciding, acting, and experimenting, feeding what we learn back into future behavior. ...... retail traffic, call center volumes, product recalls, or customer loyalty indicators. ..... A modern smartphone can sense light, sound, motion, location, nearby networks and devices, and more, making it a perfect data collector. As consumers opt into loyalty programs and install applications, they become sensors that can feed the data supply chain. ...... In big data, the collection is often challenging because of the sheer volume of information, or the speed with which it arrives, both of which demand new approaches and architectures. ...... Virtualization, for example, allows operators to spin up many machines temporarily, then destroy them once the processing is over. ...... big data gives clouds something to do. ...... Big data may be big. But if it's not fast, it's unintelligible. ..... 
For decades, companies have relied on structured relational databases and data warehouses — many of them can't handle the exploration, lack of structure, speed, and massive sizes of big data applications. ......... natural language processing tries to read unstructured text and deduce what it means: Was this Twitter user happy or sad? Is this call center recording good, or was the customer angry? ...... Just as astronomers use algorithms to scan the night's sky for signals, then verify any promising anomalies themselves, so too can data analysts use machines to find interesting dimensions, groupings, or patterns within the data. Machines can work at a lower signal-to-noise ratio than people. ....... new interfaces and multi-sensory environments that allow an analyst to work alongside the machine, immersed in the data. ....... storage is a combination of cloud and on-premise storage, using traditional flat-file and relational databases alongside more recent, post-SQL storage systems. ..... The best companies tie big data results into everything from hiring and firing decisions, to strategic planning, to market positioning. While it's easy to buy into big data technology, it's far harder to shift an organization's culture. In many ways, big data adoption isn't a hardware retirement issue, it's an employee retirement one. ....... Big data, and its close cousin, cloud computing ...... A big data mindset is one of experimentation, of taking measured risks and assessing their impact quickly. It's similar to the Lean Startup movement, which advocates fast, iterative learning and tight links to customers. But while a small startup can be lean because it's nascent and close to its market, a big organization needs big data and an OODA loop to react well and iterate fast. ........ the big business answer to the lean startup ..... Software is eating the world. Verticals like publishing, music, real estate and banking once had strong barriers to entry. 
Now they've been entirely disrupted by the elimination of middlemen. The last film projector rolled off the line in 2011: movies are now digital from camera to projector. The Post Office stumbles because nobody writes letters, even as Federal Express becomes the planet's supply chain. ........ Companies that get themselves on a feedback footing will dominate their industries, building better things faster for less money. Those that don't are already the walking dead ...... They usher in a new era for humanity, with all its warts and glory. They herald the arrival of the feedback economy. ..... We're moving beyond an information economy. Information on its own isn't an advantage, anyway. Instead, this is the era of the feedback economy
O'Reilly Radar: The "Big Data Now" anthology
O'Reilly Radar: Building data startups: Fast, big, and focused
A new breed of startup is emerging, built to take advantage of the rising tides of data across a variety of verticals and the maturing ecosystem of tools for its large-scale analysis. ...... The most successful of data startups must be fast (with data), big (with analytics), and focused (with services). ..... In 1980, a terabyte of disk storage cost $14 million. Today, it's at $30 and dropping. ..... CPU and storage costs have fallen faster than that of network and disk IO. Thus data is heavy; it gravitates toward centers of storage and compute power in proportion to its mass. Migration to the cloud is the manifest destiny for big data, and the cloud is the launching pad for data startups. ...... analytics is the brains to cloud computing's brawn ..... Because data is heavy, and algorithms are light, one key strategy is to push code deeper to where the data lives .... this is a democratic force that promises to unleash a wave of innovation in the coming decade.