传媒教育网 (Media Education Network)

Thread starter: admin

数据新闻案例集锦 (Data Journalism Case Collection)

51#
Posted by thread starter on 2014-6-22 09:55:42
[Case] @历史解密网站

The latest ranking of Chinese surnames. See where yours ranks!

[Image attachment] Reposts: 255 | Comments: 94 · Today 07:10 via 皮皮时光机

52#
Posted by thread starter on 2014-6-24 19:00:04
[Case]
Exclusive translation | The New York Times' advice on doing data journalism
2014-06-24 · Tencent News · 全媒派



Editor's note: With "data journalism" all the rage today, data has worked its way deep into the media's bones and bloodstream and become a field reporters can no longer avoid. Using data can enrich a story's sources and help reporters dig deeper into topics and extend a story's depth. Yet faced with dry, tedious numbers, reporters who live on fresh information often seem not to know where to start. Derek Willis, a developer at The New York Times' data site, argues instead that "interviewing data" is more interesting than interviewing people. Below is his advice on handling data journalism, delivered at the 2014 journalism industry conference.



The 2014 journalism industry conference was held recently at the University of Maryland's Philip Merrill College of Journalism. On the second day, Derek Willis, a developer at The Upshot, The New York Times' digital data site, gave a talk sharing how to work with data.

The conference was held at the University of Maryland, with AJR (American Journalism Review) as its publishing partner.

Five tips for doing data journalism

On data journalism, Willis said reporters do not need to be experts in the data; all they need is a sharp mind.

He added that working with data is much like interviewing a person. A reporter wants to get to know the subject better and find out what is there; it is the same with data: explore what a dataset means and what it relates to.

At the conference, Willis offered several tips for handling data in journalism:

1. Reporters must always be skeptical of data. The problems in a dataset are often not visible on the surface; they tend to be quite fundamental. So from the start, reporters should work from the assumption that the data may contain errors.

2. Classify the data as early as possible. Be clear about what data you are dealing with. For example, when working with police arrest records, you may need to sort them by the arrestee's age, charge, and address, and if some values are missing, certain categories may have to be dropped. He also noted that tallying data in Excel can cause confusion; for instance, it may turn a police code such as 5-1-1 into the date 5/1/01 (see the sketch after this list for one way to guard against this).

3. Treat data as a source. Unlike a human source, however, data cannot tell you whether the question you are asking is a sensible one. So before working with the data, write down the questions you need answered, and even read them aloud, to check that they are reasonable and appropriate.

4. Make more use of data-filtering tools. Willis suggested abandoning the inefficient Ctrl-F or Command-F shortcut. Filtering is available not only in Excel but is even more powerful in SQL-based programs (also covered in the sketch after this list). Start from the big picture and then narrow the question down. "Because the rules are very specific," he said, "we also have to start from the details."

5. Watch out for changes and conversions to the data, and back it up promptly. The more you type, Willis said, the greater the chance of error.
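As an illustration of tips 2 and 4 (my own sketch, not from Willis's talk), the snippet below reads a hypothetical arrest-records CSV with every column forced to text, so a code like 5-1-1 is never silently coerced into a date, and then filters it with SQL instead of eyeballing it with Ctrl-F. The file name and column names are invented.

```python
import sqlite3
import pandas as pd

# Read every column as text so codes such as "5-1-1" are never coerced into dates.
# "arrests.csv" and its column names are hypothetical.
arrests = pd.read_csv("arrests.csv", dtype=str)

# Push the table into an in-memory SQLite database for SQL-style filtering.
conn = sqlite3.connect(":memory:")
arrests.to_sql("arrests", conn, index=False)

# Filter with a query instead of scanning by eye: arrests of people under 21
# with charge code "5-1-1", grouped by district.
query = """
    SELECT district, COUNT(*) AS n
    FROM arrests
    WHERE CAST(age AS INTEGER) < 21
      AND charge_code = '5-1-1'
    GROUP BY district
    ORDER BY n DESC
"""
print(pd.read_sql_query(query, conn))
```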




Treat data as an interview subject

Government agencies sometimes put data online thinking it will spare them from having to deal with reporters again. But such data is usually hard to interpret, and in those cases reporters need to do additional synthesis and reporting of their own.

The great thing about data is that it lets you treat a news story as a question to investigate rather than a simple statement of facts.

Willis therefore says that with data he sometimes feels more like an outsider than a reporter. When he covered Congress, he used to half-joke that he would rather interview a pile of data than a politician.

The data wave is coming

Willis said many reporters can use data well with little or no formal training. We live in an unprecedentedly data-driven era, yet getting to know tools such as Excel is not hard at all, as long as you are willing to put in the effort.

Either way, the tide of data journalism cannot be resisted. Reporters who never learn how to handle data will find that many stories lie beyond their reach and simply cannot be reported by them.

This article was exclusively translated by 全媒派, a Tencent News product. Please credit the source when reposting.
http://mp.weixin.qq.com/s?__biz=MzA3MzQ1MzQzNA==&mid=201428051&idx=1&sn=305c3ddf77d38b8637c3a6a29abfc669#rd




53#
Posted by thread starter on 2014-6-25 22:08:17
[Case]
数据化管理 (Weibo)

[One chart shows how high the housing vacancy rate really is] China's housing vacancy rate in 2013 was 22.4%. Do you believe it? http://t.cn/RvOB9xX
[Image attachment] Reposts: 13 | Comments: 5 · 24 minutes ago via weibo.com


54#
Posted by thread starter on 2014-6-29 12:00:09
[Data]
@时代迷思

[One chart to grasp what economics is about] This is a must-read chart in economics, and a classic through and through!

[Image attachment] Reposts: 2719 | Comments: 217 · June 28, 20:00 via 皮皮时光机


55#
Posted by thread starter on 2014-6-30 00:09:16
[Case]
Review Article: Big Data and Its Technical Challenges

By H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, Cyrus Shahabi
Communications of the ACM, Vol. 57 No. 7, Pages 86-94
10.1145/2611567








In a broad range of application areas, data is being collected at an unprecedented scale. Decisions that previously were based on guesswork, or on painstakingly handcrafted models of reality, can now be made using data-driven mathematical models. Such Big Data analysis now drives nearly every aspect of society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences.


As an example, consider scientific research, which has been revolutionized by Big Data.1,12 The Sloan Digital Sky Survey23 has transformed astronomy from a field where taking pictures of the sky was a large part of an astronomer's job to one where the pictures are already in a database, and the astronomer's task is to find interesting objects and phenomena using the database. In the biological sciences, there is now a well-established tradition of depositing scientific data into a public repository, and also of creating public databases for use by other scientists. Furthermore, as technology advances, particularly with the advent of Next Generation Sequencing (NGS), the size and number of experimental datasets available is increasing exponentially.13

The growth rate of the output of current NGS methods in terms of the raw sequence data produced by a single NGS machine is shown in Figure 1, along with the performance increase for the SPECint CPU benchmark. Clearly, the NGS sequence data growth far outstrips the performance gains offered by Moore's Law for single-threaded applications (here, SPECint). Note the sequence data size in Figure 1 is the output of analyzing the raw images that are actually produced by the NGS instruments. The size of these raw image datasets themselves is so large (many TBs per lab per day) that it is impractical today to even consider storing them. Rather, these images are analyzed on the fly to produce sequence data, which is then retained.

Big Data has the potential to revolutionize much more than just research. Google's work on Google File System and MapReduce, and subsequent open source work on systems like Hadoop, have led to arguably the most extensive development and adoption of Big Data technologies, led by companies focused on the Web, such as Facebook, LinkedIn, Microsoft, Quantcast, Twitter, and Yahoo!. They have become the indispensable foundation for applications ranging from Web search to content recommendation and computational advertising. There have been persuasive cases made for the value of Big Data for healthcare (through home-based continuous monitoring and through integration across providers),3 urban planning (through fusion of high-fidelity geographical data), intelligent transportation (through analysis and visualization of live and detailed road network data), environmental modeling (through sensor networks ubiquitously collecting data),4 energy saving (through unveiling patterns of use), smart materials (through the new materials genome initiative18), machine translation between natural languages (through analysis of large corpora), education (particularly with online courses),2 computational social sciences (a new methodology growing fast in popularity because of the dramatically lowered cost of obtaining data),14 systemic risk analysis in finance (through integrated analysis of a web of contracts to find dependencies between financial entities),8 homeland security (through analysis of social networks and financial transactions of possible terrorists), computer security (through analysis of logged events, known as Security Information and Event Management, or SIEM), and so on.

In 2010, enterprises and users stored more than 13 exabytes of new data; this is over 50,000 times the data in the Library of Congress. The potential value of global personal location data is estimated to be $700 billion to end users, and it can result in an up to 50% decrease in product development and assembly costs, according to a recent McKinsey report.17 McKinsey predicts an equally great effect of Big Data in employment, where 140,000–190,000 workers with "deep analytical" experience will be needed in the U.S.; furthermore, 1.5 million managers will need to become data-literate. Not surprisingly, the U.S. President's Council of Advisors on Science and Technology recently issued a report on Networking and IT R&D22 that identified Big Data as a "research frontier" that can "accelerate progress across a broad range of priorities." Even popular news media now appreciates the value of Big Data as evidenced by coverage in the Economist,7 the New York Times,15,16 National Public Radio,19,20 and Forbes magazine.9

While the potential benefits of Big Data are real and significant, and some initial successes have already been achieved (such as the Sloan Digital Sky Survey), there remain many technical challenges that must be addressed to fully realize this potential. The sheer size of the data, of course, is a major challenge, and is the one most easily recognized. However, there are others. Industry analysis companies like to point out there are challenges not just in Volume, but also in Variety and Velocity,10 and that companies should not focus on just the first of these. Variety refers to heterogeneity of data types, representation, and semantic interpretation. Velocity denotes both the rate at which data arrive and the time frame in which they must be acted upon. While these three are important, this short list fails to include additional important requirements. Several additions have been proposed by various parties, such as Veracity. Other concerns, such as privacy and usability, still remain.

The analysis of Big Data is an iterative process that involves many distinct phases, each with its own challenges, as shown in Figure 2. Here, we consider the end-to-end Big Data life cycle.



Phases in the Big Data Life Cycle

Many people unfortunately focus just on the analysis/modeling step—while that step is crucial, it is of little use without the other phases of the data analysis pipeline. For example, we must approach the question of what data to record from the perspective that data is valuable, potentially in ways we cannot fully anticipate, and develop ways to derive value from data that is imperfectly and incompletely captured. Doing so raises the need to track provenance and to handle uncertainty and error. As another example, when the same information is represented in repetitive and overlapping fashion, it allows us to bring statistical techniques to bear on challenges such as data integration and entity/relationship extraction. This is likely to be a key to successfully leveraging data that is drawn from multiple sources (for example, related experiments reported by different labs, crowdsourced traffic information, data about a given domain such as entertainment, culled from different websites). These topics are crucial to success, and yet rarely mentioned in the same breath as Big Data. Even in the analysis phase, which has received much attention, there are poorly understood complexities in the context of multi-tenanted clusters where several users' programs run concurrently.

In the rest of this article, we begin by considering the five stages in the Big Data pipeline, along with challenges specific to each stage. We also present a case study (see the sidebar) as an example of the issues that arise in the different stages. We then discuss six crosscutting challenges.

Data acquisition. Big Data does not arise in a vacuum: it is a record of some underlying activity of interest. For example, consider our ability to sense and observe the world around us, from the heart rate of an elderly citizen, to the presence of toxins in the air we breathe, to logs of user-activity on a website or event-logs in a software system. Sensors, simulations and scientific experiments can produce large volumes of data today. For example, the planned square kilometer array telescope will produce up to one million terabytes of raw data per day.

Much of this data can be filtered and compressed by orders of magnitude without compromising our ability to reason about the underlying activity of interest. One challenge is to define these "on-line" filters in such a way they do not discard useful information, since the raw data is often too voluminous to even allow the option of storing it all. For example, the data collected by sensors most often are spatially and temporally correlated (such as traffic sensors on the same road segment). Suppose one sensor reading differs substantially from the rest. This is likely to be due to the sensor being faulty, but how can we be sure it is not of real significance?
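As a toy illustration of such an on-line filter (my sketch, not from the article), the snippet below flags a loop-detector reading that deviates sharply from its spatially correlated neighbors in the same time window; whether the flagged value is a faulty sensor or a genuine incident is left for downstream review. Sensor IDs and speeds are invented.

```python
import statistics

def flag_suspect_readings(readings, k=5.0):
    """Flag readings that deviate sharply from their spatial neighbors.

    `readings` maps sensor_id -> speed observed in the same time window on the
    same road segment. A reading far from the neighborhood median is flagged
    for review rather than silently discarded.
    """
    values = list(readings.values())
    med = statistics.median(values)
    # Median absolute deviation: a robust spread estimate for small sensor groups.
    mad = statistics.median(abs(v - med) for v in values) or 1.0
    return {sid: v for sid, v in readings.items() if abs(v - med) > k * mad}

# Five loop detectors on one segment; one reports 5 mph while the rest read ~60.
window = {"s1": 61.0, "s2": 58.5, "s3": 60.2, "s4": 5.0, "s5": 59.4}
print(flag_suspect_readings(window))   # {'s4': 5.0}
```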

Furthermore, loading of large datasets is often a challenge, especially when combined with on-line filtering and data reduction, and we need efficient incremental ingestion techniques. These might not be enough for many applications, and effective in situ processing has to be designed.

Information extraction and cleaning. Frequently, the information collected will not be in a format ready for analysis. For example, consider the collection of electronic health records in a hospital, comprised of transcribed dictations from several physicians, structured data from sensors and measurements (possibly with some associated uncertainty), image data such as X-rays, and videos from probes. We cannot leave the data in this form and still effectively analyze it. Rather, we require an information extraction process that pulls out the required information from the underlying sources and expresses it in a structured form suitable for analysis. Doing this correctly and completely is a continuing technical challenge. Such extraction is often highly application-dependent (for example, what you want to pull out of an MRI is very different from what you would pull out of a picture of the stars, or a surveillance photo). Productivity concerns require the emergence of declarative methods to precisely specify information extraction tasks, and then optimizing the execution of these tasks when processing new data.
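To make the extraction step concrete, here is a deliberately naive sketch (mine, not the authors') that pulls a few structured fields out of an invented transcribed dictation with regular expressions; real clinical extraction systems are far more robust, and the note text, field names, and patterns are all hypothetical.

```python
import re

# A hypothetical transcribed dictation; the patterns below are illustrative only.
note = "Patient is a 67-year-old male. BP 142/91 mmHg, pulse 78. Prescribed metformin 500 mg."

patterns = {
    "age":    r"(\d{1,3})-year-old",
    "sex":    r"-year-old (male|female)",
    "bp_sys": r"BP (\d{2,3})/\d{2,3}",
    "bp_dia": r"BP \d{2,3}/(\d{2,3})",
    "pulse":  r"pulse (\d{2,3})",
}

record = {}
for field, pat in patterns.items():
    m = re.search(pat, note)
    record[field] = m.group(1) if m else None   # missing fields stay None (incompleteness)

print(record)
# {'age': '67', 'sex': 'male', 'bp_sys': '142', 'bp_dia': '91', 'pulse': '78'}
```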

Most data sources are notoriously unreliable: sensors can be faulty, humans may provide biased opinions, remote websites might be stale, and so on. Understanding and modeling these sources of error is a first step toward developing data cleaning techniques. Unfortunately, much of this is data source and application dependent.

Data integration, aggregation, and representation. Effective large-scale analysis often requires the collection of heterogeneous data from multiple sources. For example, obtaining the 360-degrees health view of a patient (or a population) benefits from integrating and analyzing the medical health record along with Internet-available environmental data and then even with readings from multiple types of meters (for example, glucose meters, heart meters, accelerometers, among others3). A set of data transformation and integration tools helps the data analyst to resolve heterogeneities in data structure and semantics. This heterogeneity resolution leads to integrated data that is uniformly interpretable within a community, as they fit its standardization schemes and analysis needs. However, the cost of full integration is often formidable and the analysis needs shift quickly, so recent "pay-as-you-go" integration techniques provide an attractive "relaxation," doing much of this work on the fly in support of ad hoc exploration.

It is notable that the massive availability of data on the Internet, coupled with integration and analysis tools that allow for the production of derived data, leads to yet another kind of data proliferation, which is not only a problem of data volume, but also a problem of tracking the provenance of such derived data (as we will discuss later).

Even for simpler analyses that depend on only one dataset, there usually are many alternative ways of storing the same information, with each alternative incorporating certain trade-offs. Witness, for instance, the tremendous variety in the structure of bioinformatics databases with information about substantially similar entities, such as genes. Database design is today an art, and is carefully executed in the enterprise context by highly paid professionals. We must enable other professionals, such as domain scientists, to create effective data stores, either through devising tools to assist them in the design process or through forgoing the design process completely and developing techniques so datasets can be used effectively in the absence of intelligent database design.

Modeling and analysis. Methods for querying and mining Big Data are fundamentally different from traditional statistical analysis on small samples. Big Data is often noisy, dynamic, heterogeneous, inter-related, and untrustworthy. Nevertheless, even noisy Big Data could be more valuable than tiny samples because general statistics obtained from frequent patterns and correlation analysis usually overpower individual fluctuations and often disclose more reliable hidden patterns and knowledge. In fact, with suitable statistical care, one can use approximate analyses to get good results without being overwhelmed by the volume.
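A minimal sketch of the "suitable statistical care" point (my own illustration on synthetic data): estimate an aggregate from a random sample and report it with an error bound, instead of scanning everything.

```python
import random
import statistics

# A synthetic stand-in for a dataset too large to scan exhaustively.
population = [random.gauss(50, 12) for _ in range(1_000_000)]

sample = random.sample(population, 10_000)            # analyze a small random sample
mean = statistics.fmean(sample)
stderr = statistics.stdev(sample) / len(sample) ** 0.5

# With suitable statistical care, the approximate answer comes with an error bound.
print(f"approximate mean = {mean:.2f} +/- {1.96 * stderr:.2f} (95% CI)")
```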

Interpretation. Ultimately, a decision-maker, provided with the result of analysis, has to interpret these results. Usually, this involves examining all the assumptions made and retracing the analysis. Furthermore, there are many possible sources of error: computer systems can have bugs, models almost always have assumptions, and results can be based on erroneous data. For all of these reasons, no responsible user will cede authority to the computer system. Rather, she will try to understand, and verify, the results produced by the computer. The computer system must make it easy for her to do so. This is particularly a challenge with Big Data due to its complexity. There are often crucial assumptions behind the data recorded. Analytical pipelines can involve multiple steps, again with assumptions built in. The recent mortgage-related shock to the financial system dramatically underscored the need for such decision-maker diligence—rather than accept the stated solvency of a financial institution at face value, a decision-maker has to examine critically the many assumptions at multiple stages of analysis. In short, it is rarely enough to provide just the results. Rather, one must provide users with the ability both to interpret analytical results obtained and to repeat the analysis with different assumptions, parameters, or datasets to better support the human thought process and social circumstances.




The net result of interpretation is often the formulation of opinions that annotate the base data, essentially closing the pipeline. It is common that such opinions may conflict with each other or may be poorly substantiated by the underlying data. In such cases, communities need to engage in a conflict resolution "editorial" process (the Wikipedia community provides one example of such a process). A novel generation of data workspaces is needed where community participants can annotate base data with interpretation metadata, resolve their disagreements and clean up the dataset, while partially clean and partially consistent data may still be available for inspection.



Challenges in Big Data Analysis

Having described the multiple phases in the Big Data analysis pipeline, we now turn to some common challenges that underlie many, and sometimes all, of these phases, due to the characteristics of Big Data. These are shown as six boxes in the lower part of Figure 2.

Heterogeneity. When humans consume information, a great deal of heterogeneity is comfortably tolerated. In fact, the nuance and richness of natural language can provide valuable depth. However, machine analysis algorithms expect homogeneous data, and are poor at understanding nuances. In consequence, data must be carefully structured as a first step in (or prior to) data analysis.

An associated challenge is to automatically generate the right metadata to describe the data recorded. For example, in scientific experiments, considerable detail regarding specific experimental conditions and procedures may be required in order to interpret the results correctly. Metadata acquisition systems can minimize the human burden in recording metadata. Recording information about the data at its birth is not useful unless this information can be interpreted and carried along through the data analysis pipeline. This is called data provenance. For example, a processing error at one step can render subsequent analysis useless; with suitable provenance, we can easily identify all subsequent processing that depends on this step. Therefore, we need data systems to carry the provenance of data and its metadata through data analysis pipelines.
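As a toy sketch of carrying provenance along (my illustration; the names are invented and this is not a real provenance system), the snippet below threads a list of processing-step labels through a tiny pipeline so that a bad step can later be traced to everything derived from it.

```python
from dataclasses import dataclass, field

@dataclass
class Tracked:
    """A value bundled with the provenance of every step that produced it."""
    value: object
    provenance: list = field(default_factory=list)

    def apply(self, step_name, fn):
        # Record the step while transforming, so any later error found in
        # `step_name` lets us identify all downstream results that depend on it.
        return Tracked(fn(self.value), self.provenance + [step_name])

# Hypothetical pipeline: raw sensor counts -> cleaned -> hourly rate.
raw = Tracked([10, 12, None, 11], ["ingest:sensor_42@2014-07-01"])
clean = raw.apply("drop_missing", lambda xs: [x for x in xs if x is not None])
rate = clean.apply("hourly_mean", lambda xs: sum(xs) / len(xs))

print(rate.value)        # 11.0
print(rate.provenance)   # ['ingest:sensor_42@2014-07-01', 'drop_missing', 'hourly_mean']
```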

Inconsistency and incompleteness. Big Data increasingly includes information provided by increasingly diverse sources, of varying reliability. Uncertainty, errors, and missing values are endemic, and must be managed. On the bright side, the volume and redundancy of Big Data can often be exploited to compensate for missing data, to crosscheck conflicting cases, to validate trustworthy relationships, to disclose inherent clusters, and to uncover hidden relationships and models.

Similar issues emerge in crowdsourcing. While most such errors will be detected and corrected by others in the crowd, we need technologies to facilitate this. As humans, we can look at reviews of a product, some of which are gushing and others negative, and come up with a summary assessment based on which we can decide whether to buy the product. We need computers to be able to do the equivalent. The issues of uncertainty and error become even more pronounced in a specific type of crowdsourcing called participatory-sensing. In this case, every person with a mobile phone can act as a multi-modal sensor collecting various types of data instantaneously (for example, picture, video, audio, location, time, speed, direction, acceleration). The extra challenge here is the inherent uncertainty of the data collection devices. The fact that collected data is probably spatially and temporally correlated can be exploited to better assess their correctness. When crowdsourced data is obtained for hire, such as with Mechanical Turks, the varying motivations of workers give rise to yet another error model.

Even after error correction has been applied, some incompleteness and some errors in data are likely to remain. This incompleteness and these errors must be managed during data analysis. Doing this correctly is a challenge. Recent work on managing and querying probabilistic and conflicting data suggests one way to make progress.

Scale. Of course, the first thing anyone thinks of with Big Data is its size. Managing large and rapidly increasing volumes of data has been a challenging issue for many decades. In the past, this challenge was mitigated by processors getting faster, following Moore's Law. But there is a fundamental shift under way now: data volume is increasing faster than CPU speeds and other compute resources.

Due to power constraints, clock speeds have largely stalled and processors are being built with increasing numbers of cores. In short, one has to deal with parallelism within a single node. Unfortunately, parallel data processing techniques that were applied in the past for processing data across nodes do not directly apply for intranode parallelism, since the architecture looks very different. For example, there are many more hardware resources such as processor caches and processor memory channels that are shared across cores in a single node.

Another dramatic shift under way is the move toward cloud computing, which now aggregates multiple disparate workloads with varying performance goals into very large clusters. This level of sharing of resources on expensive and large clusters stresses grid and cluster computing techniques from the past, and requires new ways of determining how to run and execute data processing jobs so we can meet the goals of each workload cost-effectively, and to deal with system failures, which occur more frequently as we operate on larger and larger systems.

This leads to a need for global optimization across multiple users' programs, even those doing complex machine learning tasks. Reliance on user-driven program optimizations is likely to lead to poor cluster utilization, since, through virtualization, users are unaware of other users' programs. System-driven holistic optimization requires programs to be sufficiently transparent, for example, as in relational database systems, where declarative query languages are designed with this in mind. In fact, if users are to compose and build complex analytical pipelines over Big Data, it is essential they have appropriate high-level primitives to specify their needs.

In addition to the technical reasons for further developing declarative approaches to Big Data analysis, there is a strong business imperative as well. Organizations typically will outsource Big Data processing, or many aspects of it. Declarative specifications are required to enable meaningful and enforceable service level agreements, since the point of outsourcing is to specify precisely what task will be performed without going into details of how to do it.

Timeliness. As data grow in volume, we need real-time techniques to summarize and filter what is to be stored, since in many instances it is not economically viable to store the raw data. This gives rise to the acquisition rate challenge described earlier, and a timeliness challenge we describe next. For example, if a fraudulent credit card transaction is suspected, it should ideally be flagged before the transaction is completed—potentially preventing the transaction from taking place at all. Obviously, a full analysis of a user's purchase history is not likely to be feasible in real time. Rather, we need to develop partial results in advance so that a small amount of incremental computation with new data can be used to arrive at a quick determination. The fundamental challenge is to provide interactive response times to complex queries at scale over high-volume event streams.
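A rough sketch of "developing partial results in advance" (my illustration, with invented card data): keep a small running profile per card so that each new transaction can be scored with a constant amount of incremental work before it completes.

```python
from collections import defaultdict

class CardProfile:
    """Running per-card statistics maintained ahead of time, so each new
    transaction can be scored with a constant amount of incremental work."""
    __slots__ = ("n", "mean", "m2")
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, amount):          # Welford's online mean/variance update
        self.n += 1
        d = amount - self.mean
        self.mean += d / self.n
        self.m2 += d * (amount - self.mean)

    def is_suspicious(self, amount, k=4.0):
        if self.n < 10:                # not enough history to judge
            return False
        std = (self.m2 / (self.n - 1)) ** 0.5 or 1.0
        return abs(amount - self.mean) > k * std

profiles = defaultdict(CardProfile)

def on_transaction(card_id, amount):
    p = profiles[card_id]
    flagged = p.is_suspicious(amount)  # decision available before the transaction completes
    p.update(amount)
    return flagged

for amt in [20, 35, 18, 42, 25, 30, 22, 38, 27, 33]:
    on_transaction("card-1", amt)
print(on_transaction("card-1", 2500))  # True: far outside this card's history
```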

Another common pattern is to find elements in a very large dataset that meet a specified criterion. In the course of data analysis, this sort of search is likely to occur repeatedly. Scanning the entire dataset to find suitable elements is obviously impractical. Rather, index structures are created in advance to find qualifying elements quickly. For example, consider a traffic management system with information regarding thousands of vehicles and local hot spots on roadways. The system may need to predict potential congestion points along a route chosen by a user, and suggest alternatives. Doing so requires evaluating multiple spatial proximity queries working with the trajectories of moving objects. We need to devise new index structures to support a wide variety of such criteria.
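Index structures for such queries can take many forms; as a minimal stand-in (my sketch, not the article's proposal), the snippet below buckets moving-object positions into a uniform grid so a proximity query inspects only nearby cells rather than the whole dataset. Object IDs and coordinates are invented.

```python
from collections import defaultdict
from math import floor, hypot

class GridIndex:
    """A simple uniform-grid spatial index: points are bucketed by cell so a
    proximity query only inspects nearby cells instead of the whole dataset."""
    def __init__(self, cell_size):
        self.cell = cell_size
        self.buckets = defaultdict(list)

    def _key(self, x, y):
        return (floor(x / self.cell), floor(y / self.cell))

    def insert(self, obj_id, x, y):
        self.buckets[self._key(x, y)].append((obj_id, x, y))

    def within(self, x, y, radius):
        r_cells = int(radius // self.cell) + 1
        cx, cy = self._key(x, y)
        for i in range(cx - r_cells, cx + r_cells + 1):
            for j in range(cy - r_cells, cy + r_cells + 1):
                for obj_id, px, py in self.buckets.get((i, j), ()):
                    if hypot(px - x, py - y) <= radius:
                        yield obj_id

# Hypothetical vehicle positions (in km); find vehicles within 1 km of a hot spot.
idx = GridIndex(cell_size=1.0)
for vid, x, y in [("bus-7", 3.2, 4.1), ("car-12", 3.9, 4.3), ("car-98", 9.5, 0.4)]:
    idx.insert(vid, x, y)
print(sorted(idx.within(3.5, 4.0, radius=1.0)))   # ['bus-7', 'car-12']
```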

Privacy and data ownership. The privacy of data is another huge concern, and one that increases in the context of Big Data. For electronic health records, there are strict laws governing what data can be revealed in different contexts. For other data, regulations, particularly in the U.S., are less forceful. However, there is great public fear regarding the inappropriate use of personal data, particularly through linking of data from multiple sources. Managing privacy effectively is both a technical and a sociological problem, which must be addressed jointly from both perspectives to realize the promise of Big Data.

Consider, for example, data gleaned from location-based services, which require a user to share his/her location with the service provider. There are obvious privacy concerns, which are not addressed by hiding the user's identity alone without hiding her location. An attacker or a (potentially malicious) location-based server can infer the identity of the query source from its (subsequent) location information. For example, a user may leave "a trail of packet crumbs" that can be associated with a certain residence or office location, and thereby used to determine the user's identity. Several other types of surprisingly private information such as health issues (for example, presence in a cancer treatment center) or religious preferences (for example, presence in a church) can also be revealed by just observing anonymous users' movement and usage patterns over time. In general, it has been shown there is a close correlation between people's identities and their movement patterns.11 But with location-based services, the location of the user is needed for a successful data access or data collection, so doing this right is challenging.

Another issue is that many online services today require us to share private information (think of Facebook applications), but beyond record-level access control we do not understand what it means to share data, how the shared data can be linked, and how to give users fine-grained control over this sharing in an intuitive, but effective way. In addition, real data are not static but get larger and change over time; none of the prevailing techniques results in any useful content being released in this scenario.

Privacy is but one aspect of data ownership. In general, as the value of data is increasingly recognized, the value of the data owned by an organization becomes a central strategic consideration. Organizations are concerned with how to leverage this data, while retaining their unique data advantage, and questions such as how to share or sell data without losing control are becoming important. These questions are not unlike the Digital Rights Management (DRM) issues faced by the music industry as distribution shifted from sales of physical media such as CDs to digital purchases; we need effective and flexible Data DRM approaches.

The human perspective: Visualization and collaboration. For Big Data to fully reach its potential, we need to consider scale not just for the system but also from the perspective of humans. We have to make sure the end points—humans—can properly "absorb" the results of the analysis and not get lost in a sea of data. For example, ranking and recommendation algorithms can help identify the most interesting data for a user, taking into account his/her preferences. However, especially when these techniques are being used for scientific discovery and exploration, special care must be taken to not imprison end users in a "filter bubble"21 of only data similar to what they have already seen in the past—many interesting discoveries come from detecting and explaining outliers.




In spite of the tremendous advances made in computational analysis, there remain many patterns that humans can easily detect but computer algorithms have a difficult time finding. For example, CAPTCHAs exploit precisely this fact to tell human Web users apart from computer programs. Ideally, analytics for Big Data will not be all computational—rather it will be designed explicitly to have a human in the loop. The new subfield of visual analytics is attempting to do this, at least with respect to the modeling and analysis phase in the pipeline. There is similar value to human input at all stages of the analysis pipeline.

In today's complex world, it often takes multiple experts from different domains to really understand what is going on. A Big Data analysis system must support input from multiple human experts, and shared exploration of results. These multiple experts may be separated in space and time when it is too expensive to assemble an entire team together in one room. The data system must accept this distributed expert input, and support their collaboration. Technically, this requires us to consider sharing more than raw datasets; we must also consider how to enable sharing algorithms and artifacts such as experimental results (for example, obtained by applying an algorithm with specific parameter values to a given snapshot of an evolving dataset).

Systems with a rich palette of visualizations, which can be quickly and declaratively created, become important in conveying to the users the results of the queries in ways that are best understood in the particular domain and are at the right level of detail. Whereas early business intelligence systems' users were content with tabular presentations, today's analysts need to pack and present results in powerful visualizations that assist interpretation, and support user collaboration. Furthermore, with a few clicks the user should be able to drill down into each piece of data she sees and understand its provenance. This is particularly important since there is a growing number of people who have data and wish to analyze it.

A popular new method of harnessing human ingenuity to solve problems is through crowdsourcing. Wikipedia, the online encyclopedia, is perhaps the best-known example of crowdsourced data. Social approaches to Big Data analysis hold great promise. As we make a broad range of data-centric artifacts sharable, we open the door to social mechanisms such as rating of artifacts, leader-boards (for example, transparent comparison of the effectiveness of several algorithms on the same datasets), and induced reputations of algorithms and experts.



Conclusion

We have entered an era of Big Data. Many sectors of our economy are now moving to a data-driven decision making model where the core business relies on analysis of large and diverse volumes of data that are continually being produced. This data-driven world has the potential to improve the efficiencies of enterprises and improve the quality of our lives. However, there are a number of challenges that must be addressed to allow us to exploit the full potential of Big Data. This article highlighted key technical challenges that must be addressed, and we acknowledge there are other challenges, such as economic, social, and political ones, that are not covered in this article but must also be addressed. Not all of the technical challenges discussed here arise in all application scenarios. But many do. Also, the solutions to a challenge may not be the same in all situations. But again, there often are enough similarities to support cross-learning. As such, the broad range of challenges described here makes good topics for research across many areas of computer science. We have collected some suggestions for further reading at http://db.cs.pitt.edu/bigdata/resources. These are a few dozen papers we have chosen on account of their coverage and importance, rather than a comprehensive bibliography, which would comprise thousands of papers.



Acknowledgment

This article is based on a white paper5 authored by many prominent researchers, whose contributions we acknowledge. Thanks to Divyakant Agrawal, Philip Bernstein, Elisa Bertino, Susan Davidson, Umeshwar Dayal, Michael Franklin, Laura Haas, Alon Halevy, Sam Madden, Kenneth Ross, Dan Suciu, Shiv Vaithyanathan, and Jennifer Widom.

H.V.J. was funded in part by NSF grants IIS 1017296, IIS 1017149, and IIS 1250880. A.L. was funded in part by NSF IIS-0746696, NSF OIA-1028162, and NSF CBET-1250171. Y.P. was funded in part by NSF grants IIS-1117527, SHB-1237174, DC-0910820, and an Informatica research award. J.M.P. was funded in part by NSF grants III-0963993, IIS-1250886, IIS-1110948, CNS-1218432, and by gift donations from Google, Johnson Controls, Microsoft, Symantec, and Oracle. C.S. was funded in part by NSF grant IIS-1115153, a contract with LA Metro, and unrestricted cash gifts from Microsoft and Oracle.

Any opinions, findings, conclusions or recommendations expressed in this article are solely those of its authors.


References

1. Computing Community Consortium. Advancing Discovery in Science and Engineering. Spring 2011.

2. Computing Community Consortium. Advancing Personalized Education. Spring 2011.

3. Computing Community Consortium. Smart Health and Wellbeing. Spring 2011.

4. Computing Community Consortium. A Sustainable Future. Summer 2011.

5. Computer Research Association. Challenges and Opportunities with Big Data. Community white paper available at http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf

6. Dobbie, W. and Fryer, Jr. R.G. Getting Beneath the Veil of Effective Schools: Evidence from New York City. NBER Working Paper No. 17632. Issued Dec. 2011.

7. Economist. Drowning in numbers: Digital data will flood the planet—and help us understand it better. (Nov 18, 2011); http://www.economist.com/blogs/dailychart/2011/11/big-data-0

8. Flood, M., Jagadish, H.V., Kyle, A., Olken, F. and Raschid, L. Using data for systemic financial risk management. In Proc. 5th Biennial Conf. Innovative Data Systems Research (Jan. 2011).

9. Forbes. Data-driven: Improving business and society through data. (Feb. 10, 2012); http://www.forbes.com/special-report/data-driven.html

10. Gartner Group. Pattern-Based Strategy: Getting Value from Big Data. (July 2011 press release); http://www.gartner.com/it/page.jsp?id=1731916

11. González, M.C., Hidalgo, C.A. and Barabási, A-L. Understanding individual human mobility patterns. Nature 453 (June 5, 2008), 779–782.

12. Hey, T., Tansley, S. and Tolle, K., eds. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.

13. Kahn, S.D. On the future of genomic data. Science 331, 6018 (Feb. 11, 2011), 728–729.

14. Lazer, D. et al. Computational social science. Science 323, 5915 (Feb. 6, 2009), 721–723.

15. Lohr, S. The age of Big Data. New York Times (Feb. 11, 2012); http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html

16. Lohr, S. How Big Data became so big. New York Times (Aug. 11, 2012); http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html

17. Manyika, J. et al. Big Data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute. May 2011.

18. National Science and Technology Council. Materials Genome Initiative for Global Competitiveness. June 2011.

19. Noguchi, Y. Following the Breadcrumbs to Big Data Gold. National Public Radio (Nov. 29, 2011); http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

20. Noguchi, Y. The Search for Analysts to Make Sense of Big Data. National Public Radio (Nov. 30, 2011); http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

21. Pariser, E. The Filter Bubble: What the Internet Is Hiding From You. Penguin Press, May 2011.

22. PCAST Report. Designing a Digital Future: Federally Funded Research and Development in Networking and Information Technology (Dec. 2010); http://www.whitehouse.gov/sites/default/files/microsites/ostp/pcast-nitrd-report-2010.pdf

23. SDSS-III: Massive Spectroscopic Surveys of the Distant Universe, the Milky Way Galaxy, and Extra-Solar Planetary Systems (Jan. 2008); http://www.sdss3.org/collaboration/description.pdf/



Authors

H. V. Jagadish (jag@umich.edu) is the Bernard A Galler Collegiate Professor of Electrical Engineering and Computer Science at the University of Michigan, Ann Arbor.

Johannes Gehrke (johannes@cs.cornell.edu) is the Tisch University Professor in the Department of Computer Science at Cornell University, Ithaca, NY.

Alexandros Labrinidis (labrinid@cs.pitt.edu) is an associate professor in the Department of Computer Science at the University of Pittsburgh and co-director of the Advanced Data Management Technologies Laboratory.

Yannis Papakonstantinou (yannis@cs.ucsd.edu) is a Professor of Computer Science and Engineering at the University of California, San Diego.

Jignesh M. Patel (jignesh@cs.wisc.edu) is a professor of computer science at the University of Wisconsin, Madison.

Raghu Ramakrishnan (raghu@microsoft.com) is a Technical Fellow and CTO of Information Services at Microsoft, Redmond, WA.

Cyrus Shahabi (shahabi@usc.edu) is a professor of computer science and electrical engineering and the director of the Information Laboratory at the University of Southern California as well as director of the NSF's Integrated Media Systems Center.



Figures

Figure 1. Next-gen sequence data size compared to SPECint.

Figure 2. The Big Data analysis pipeline. Major steps in the analysis of Big Data are shown in the top half of the figure. Note the possible feedback loops at all stages. The bottom half of the figure shows Big Data characteristics that make these steps challenging.



Sidebar: Case Study

Since fall 2010, as part of a contract with Los Angeles Metropolitan Transportation Authority (LA-Metro), researchers at the University of Southern California's (USC) Integrated Media Systems Center (IMSC) have been given access to high-resolution spatiotemporal transportation data from the LA County road network. This data arrives at 46 megabytes per minute and over 15 terabytes have been collected so far. IMSC researchers have developed an end-to-end system called TransDec (for Transportation Decision-making) to acquire, store, analyze and visualize these datasets (see the accompanying figure). Here, we discuss various components of TransDec corresponding to the Big Data flow depicted in Figure 2.

Acquisition: The current system acquires the following datasets in real time:

  • Traffic loop-detectors: About 8,900 sensors located on the highways and arterial streets collect traffic parameters such as occupancy, volume, and speed at the rate of one reading/sensor/min.
  • Bus and rail: Includes information from about 2,036 busses and 35 trains operating in 145 different routes in Los Angeles County. The sensor data contain geospatial location of each bus every two minutes, next-stop information relative to current location, and delay information relative to predefined timetables.
  • Ramp meters and CMS: 1,851 ramp meters regulate the flow of traffic entering highways according to current traffic conditions, and 160 Changeable Message Signs (CMS) give travelers information about road conditions such as delays, accidents, and roadwork zones. The update rate of each ramp meter and CMS sensor is 75 seconds.
  • Event: Detailed free-text format information (for example, number of casualties, ambulance arrival time) about special events such as collisions, traffic hazards, and so on acquired from three different agencies.

Cleaning: Data-cleaning algorithms remove redundant XML headers, detect and remove redundant sensor readings, and so on, in real time using Microsoft's StreamInsight, reducing the 46MB/minute input to 25MB/minute. The result is then dumped as simple tables into the Microsoft Azure cloud platform.

Aggregation/Representation: Data are aggregated and indexed into a set of tables in Oracle 11g (indexed in space and time with an R-tree and B-tree). For example, the data are aggregated to create sketches for supporting a predefined set of spatial and temporal queries (for example, average hourly speed of a segment of northbound I-110).
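As a rough sketch of this aggregation step (not the actual TransDec schema; the table and column names are invented), the snippet below rolls per-minute loop-detector readings up into an hourly table that a predefined query such as "average hourly speed of a segment of northbound I-110" can hit directly.

```python
import sqlite3

# Toy stand-in for the aggregation step: raw per-minute readings are rolled up
# into an hourly sketch per road segment. Names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (segment TEXT, ts TEXT, speed REAL);
    INSERT INTO readings VALUES
        ('I-110_N_seg12', '2014-07-01 08:01', 54.0),
        ('I-110_N_seg12', '2014-07-01 08:31', 22.5),
        ('I-110_N_seg12', '2014-07-01 09:02', 61.0);

    CREATE TABLE hourly_speed AS
    SELECT segment,
           strftime('%Y-%m-%d %H:00', ts) AS hour,
           AVG(speed) AS avg_speed,
           COUNT(*)   AS n_readings
    FROM readings
    GROUP BY segment, hour;
""")
for row in conn.execute(
        "SELECT hour, avg_speed FROM hourly_speed WHERE segment = 'I-110_N_seg12'"):
    print(row)   # ('2014-07-01 08:00', 38.25), ('2014-07-01 09:00', 61.0)
```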

Analysis: Several machine-learning techniques are applied to generate accurate traffic patterns/models for various road segments of LA County at different times of the day (for example, rush hour), different days of the week (for example, weekends) and different seasons. Historical accident data is used to classify new accidents to predict clearance time and the length of induced traffic backlog.

Interpretation: Many things can go wrong in a complex system, giving rise to bogus results. For example, the failures of various (independent) system components can go unnoticed, resulting in loss of data. Similarly, the data format was sometimes changed by one organization without informing a downstream organization, resulting in erroneous parsing. To address such problems, several monitoring scripts have been developed, along with mechanisms to obtain user confirmation and correction.



[Figure: TransDec]


© 2014 ACM 0001-0782/14/07





http://cacm.acm.org/magazines/2014/7/176204-big-data-and-its-technical-challenges/fulltext


56#
Posted by thread starter on 2014-7-5 09:33:12
[Case]
May 7, 2014, by 马金馨
Reflections on teaching the Data Journalism Workshop
The Data Journalism Workshop, organized by 华媒基金会 and funded and supervised by IREX, is currently the only long-running data journalism training on the Chinese mainland that recruits nationwide and across fields. Each session admits 20 participants from around the country, including reporters, editors, programmers, and designers, and each has its own theme: the first session, held in Guangzhou, focused on finance and economics; the second, held in Beijing, on the environment.


There are two regular instructors: Jonathan Stray and me. Jonathan is American, a programmer by background who wrote code for seven or eight years before moving into journalism. In 2010 he joined the Associated Press as Interactive Technology Editor, later received funding to develop the text-analysis tool Overview, and is now a visiting scholar at Columbia University, where he also teaches Computational Journalism. I have worked as social media editor at the South China Morning Post, Chinese editor of the International Journalists' Network (IJNet), and data journalism product assistant at Reuters, and have been running training in data journalism and information visualization since late 2011.


The participants come from very different backgrounds: reporters and editors from market-oriented outlets such as 21世纪经济报道, 南方周末, 东方早报, and 凤凰周刊; editorial staff from regional papers such as 重庆晨报, 新疆都市消费晨报, and 甘肃经济日报; colleagues from officially backed organizations such as Xinhua and China Daily; infographic editors and product managers from portals such as Sina, Tencent, and ifeng.com; and even a Baidu engineer and entrepreneurs who had moved on from traditional media. Recruitment aims to balance these backgrounds, and designers, programmers, and managers are especially welcome. Details of the participants selected for both sessions are on the 华媒基金会 website (first session; second session).

The five-day workshop is a full-time, residential program with a packed schedule; participants jokingly call it hell training. It covers data collection and scraping, data cleaning and understanding, basic statistics, basic HTML/CSS/JavaScript, design principles, and map-based visualization (plus D3 for the programmers, and a train-the-trainers track for those who will share what they learn back at their organizations). Beyond traditional lectures, the teaching combines hands-on tool practice, participant presentations, and group projects. After each nine-to-five day, many participants keep working late into the night on their group projects, and by the last day people who had never written a line of code manage, with help from their teammates, to finish an impressive interactive news piece, which must feel like quite an achievement. There was also plenty of eating, drinking, and fun, which I will skip over here.
[Photo: group photo of some participants from the two workshops]




As an instructor who both interacts directly with the participants and works with the organizers on program design, I came away with a few strong impressions:
  • The prospects for data journalism on the mainland should not be underestimated. The Caixin team has already produced several good data journalism products (for example, its Nobel Prize visualization); Southern Metropolis Daily is trying to break down the walls between teams to streamline its workflow; the portals' original infographic desks are all moving toward interactives (for example, Tencent's foreign ministry spokesperson piece); CCTV's big-data program, backed by ample funding and institutional standing, is hunting for topics it can tackle on a grand scale; 21st Century Media's NID lab is turning out data products (its mobile piece during the MH370 incident drew a lot of attention); and 东方早报's new-media project is busily recruiting for its expansion. I also know of several organizations that have not yet gone public with their data journalism plans. If the second half of 2013 was when data journalism began to be recognized and to sprout in mainland media, then 2014 to 2015 will be a golden period of rapid growth and of survival of the fittest.
  • There is still a wide gap between mainland data journalism and the best European and American outlets, but it will take a path "with Chinese characteristics." Money is not really the problem: 东方早报's 澎湃 app had an initial investment of 300 million yuan, eye-watering even by world-class media standards. Institutional silos are a global problem too, and poor communication between reporters and editors on one side and programmers and designers on the other is universal; otherwise events like Hacks/Hackers would have no market. The "Chinese characteristics" lie mainly in the choice of topics and of tools (many international tools are unavailable on the mainland, for reasons you can guess), and above all in how far innovation is recognized and failure tolerated.
  • For individual reporters and editors, the most common question is: how much do I need to learn? The second most common is: where do I learn it? On the first, if for institutional or resource reasons you cannot become an all-round player, then become someone who knows a bit of everything but has one real specialty; at the very least, learn how to communicate and collaborate well with programmers and designers. On the second, there are plenty of online learning resources, though most are in English; in Chinese, follow 数据新闻网, or simply sign up for the next workshop!

(The next Data Journalism Workshop will be held in late July; for registration details, see the 华媒基金会 website.)

http://djchina.org/2014/05/07/data-journalism-workshop/



57#
Posted by thread starter on 2014-7-8 23:54:32
[Data]
媒体人王晖军 (Weibo)

//@央视小丸子: Combining journalism with data will be a necessary path in the future transformation of news.

@清华史安斌

A U.S. website has launched a series of articles on "teaching data journalism," in which journalism professors share their experience of building data journalism courses. Journalism students can no longer say "I chose journalism because I didn't want to take math classes." By the same logic, "if big data, or all data, cannot be turned into news with public value and information people can use, it becomes bad data." The marriage of journalism and data is inevitable. http://t.cn/RveBaki

[Image attachment]






58#
Posted by thread starter on 2014-7-12 10:12:37
[Case]

祝建华: The Past and Present of Data Journalism

July 10, 2014, 19:21 · Sina Media

The 5th China Media Leaders Lecture Series ran from July 5 to 19, 2014 at Shanghai Jiao Tong University. This year's program invited more than 50 media leaders, front-line editors, reporters, and anchors, and well-known scholars to speak to over 350 students from more than 160 universities at home and abroad about the experience and lessons of media reform and innovation, helping them follow the latest developments in the industry and the academy and deepen their understanding of the media business and of journalism and communication as a discipline. Below is a summary of the lecture given on the morning of July 7 by 祝建华, professor in the Department of Media and Communication at City University of Hong Kong.

In recent years, discussion of big data has gradually taken hold on the mainland, and big data and data journalism have become hot topics among media scholars. 祝建华 used comparisons, charts, and examples to give the students a clear and intuitive introduction to both.

As times have changed, traditional sources of statistics, such as government statistical agencies, finance, astronomy and geography, traditional media, and transportation, have fallen behind, while new sources such as the Internet, mobile networks, smart homes, the Internet of Things, and bioengineering grow ever richer, so that the volume of data is increasing geometrically.

To help the students feel how data collection has changed, 祝建华 recounted his early work in Shanghai measuring television ratings with paper diary cards. He also shared examples of using big data for prediction, such as Baidu's big-data forecast of the college entrance exam essay topic and predictions for the current World Cup, to help the students grasp the value of big data.

After covering big data in detail, 祝建华 used a diagram to trace the evolution of data journalism: from precision journalism, to computer-assisted reporting, to database journalism and data-driven journalism, and finally to today's visual journalism. He stressed that this evolution is cumulative rather than a matter of replacement, and that data journalism predates both the Internet and big data. He also compared precision journalism and computer-assisted reporting in detail, and classified visual journalism into four types: visualization as the body of the story, as the story's theme, as the lead, and as illustration. (王欣)


http://news.sina.com.cn/m/news/roll/2014-07-10/192130501044.shtml
59#
Posted by thread starter on 2014-7-19 10:03:54
[Case]
数据化管理 (Weibo)

[How to hide data in Excel] Someone who cannot hide data cannot build models in Excel. When building a model in Excel you must learn to hide data, including data the end user is not meant to see, the source-data region, intermediate calculations, the calculation logic, data that spoils the template's appearance, and so on (a rough programmatic sketch follows below)... #数据化管理:洞悉零售及电子商务运营# http://t.cn/RPwkSds
[Image attachment] Reposts: 17 | Comments: 3 · 23 minutes ago
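As a rough programmatic counterpart to the tip above (my own sketch, not from the original post), the snippet below uses Python's openpyxl to hide a raw-data sheet and a helper column so that only the presentation sheet is visible to the model's user; the workbook, sheet, and column names are invented.

```python
from openpyxl import Workbook

wb = Workbook()
dashboard = wb.active
dashboard.title = "Dashboard"            # the only sheet the model's user should see
dashboard["A1"] = "Revenue summary"
dashboard["D1"] = "helper calc"          # an auxiliary column we do not want shown

raw = wb.create_sheet("RawData")         # source-data region kept on its own sheet
raw.append(["order_id", "amount"])
raw.append([1001, 259.0])

raw.sheet_state = "hidden"               # hide the raw-data sheet from casual users
dashboard.column_dimensions["D"].hidden = True   # hide the helper column

wb.save("model.xlsx")
```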


60#
Posted by thread starter on 2014-7-20 10:41:21
[Case]
刘洪的围脖 (Weibo)

You can see a society's civilization in its toilets, and its level of development in how people relieve themselves!

@喻国明

Population density of people around the world who defecate in the open. Another applied example of big data research.

[Image attachment] Reposts: 14 | Comments: 5 · 5 minutes ago via iPhone 5s

