Challenges of Big Data analysis

Jianqing Fan, Fang Han, Han Liu

Research output: Contribution to journal › Article

Abstract

Big Data bring new opportunities to modern society and new challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that cannot be detected with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This paper gives an overview of the salient features of Big Data and of how these features drive paradigm change in statistical and computational methods as well as in computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogenous assumptions in most statistical methods for Big Data cannot be validated because of incidental endogeneity. Relying on them can lead to wrong statistical inferences and consequently wrong scientific conclusions.

Original language: English (US)
Pages (from-to): 293-314
Number of pages: 22
Journal: National Science Review
Volume: 1
Issue number: 2
DOI: https://doi.org/10.1093/nsr/nwt032
State: Published - 2014

Fingerprint

Population Characteristics
Sample Size
Noise

Keywords

  • Big Data
  • Data storage
  • Incidental endogeneity
  • Noise accumulation
  • Scalability
  • Spurious correlation

ASJC Scopus subject areas

  • General

Cite this

Fan, J, Han, F & Liu, H 2014, 'Challenges of Big Data analysis', National Science Review, vol. 1, no. 2, pp. 293-314. https://doi.org/10.1093/nsr/nwt032
@article{a81adf2baca740b3bceea53593d01ef4,
title = "Challenges of Big Data analysis",
keywords = "Big Data, Data storage, Incidental endogeneity, Noise accumulation, Scalability, Spurious correlation",
author = "Jianqing Fan and Fang Han and Han Liu",
year = "2014",
doi = "10.1093/nsr/nwt032",
language = "English (US)",
volume = "1",
pages = "293--314",
journal = "National Science Review",
issn = "2053-714X",
publisher = "Oxford University Press",
number = "2",

}

TY - JOUR

T1 - Challenges of Big Data analysis

AU - Fan, Jianqing

AU - Han, Fang

AU - Liu, Han

PY - 2014

Y1 - 2014

KW - Big Data

KW - Data storage

KW - Incidental endogeneity

KW - Noise accumulation

KW - Scalability

KW - Spurious correlation

UR - http://www.scopus.com/inward/record.url?scp=84919389078&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84919389078&partnerID=8YFLogxK

U2 - 10.1093/nsr/nwt032

DO - 10.1093/nsr/nwt032

M3 - Article

C2 - 25419469

AN - SCOPUS:84919389078

VL - 1

SP - 293

EP - 314

JO - National Science Review

JF - National Science Review

SN - 2053-714X

IS - 2

ER -