The activities of Greater Data Science are classified into 6 divisions:
- Data Exploration and Preparation: It is often said that about 80% of the effort in data science goes into diving into messy data to learn the basics of what’s in them, so that the data can be made ready for further exploitation. There are two subactivities:
- Exploration: Also known as ‘Exploratory Data Analysis’ (EDA), it consists of exploring data to sanity-check its most basic properties and to expose unexpected features.
- Preparation: Identify and address anomalies and artifacts contained in datasets. (An artifact is something observed in a scientific investigation or experiment that is not naturally present but occurs as a result of the preparative or investigative procedure.) This step is also called data cleaning and includes tasks such as
- reformatting and recoding the values themselves,
- pre-processing, such as grouping, smoothing, and subsetting.
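The preparation tasks listed above (recoding, subsetting, grouping, smoothing) can be illustrated with a minimal sketch. The records, the field names `city` and `temp_f`, and the helper functions are all hypothetical, chosen only for illustration; real cleaning work would usually rely on a data-frame library.

```python
from statistics import mean

# Hypothetical raw records: inconsistent labels and one missing value.
raw = [
    {"city": " NYC", "temp_f": "68"},
    {"city": "nyc", "temp_f": "70"},
    {"city": "Boston ", "temp_f": ""},
    {"city": "boston", "temp_f": "59"},
    {"city": "NYC", "temp_f": "75"},
]


def recode(record):
    """Reformat and recode one record: normalize the label, parse the number."""
    city = record["city"].strip().lower()
    temp = float(record["temp_f"]) if record["temp_f"] else None
    return {"city": city, "temp_f": temp}


def moving_average(values, window=2):
    """Smooth a sequence with a trailing moving average."""
    return [mean(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]


cleaned = [recode(r) for r in raw]

# Subsetting: keep only records with a usable temperature.
usable = [r for r in cleaned if r["temp_f"] is not None]

# Grouping: collect temperatures per city, preserving record order.
by_city = {}
for r in usable:
    by_city.setdefault(r["city"], []).append(r["temp_f"])

# Smoothing: trailing average over the grouped series for one city.
smoothed_nyc = moving_average(by_city["nyc"])
```

Each step is kept as a separate, named operation so that the cleaning decisions stay visible and can be revisited later.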
- Data Representation and Transformation: Data comes from many different sources in a very wide range of formats. The originally given data may need to be transformed and restructured into a new and more revealing form.
Skills in the following two areas may be required:
- Modern Databases: The scope of today’s data representation includes everything from homely text files and spreadsheets to SQL and NoSQL databases, distributed databases, and live data streams. Data scientists need to know the structures, transformations, and algorithms involved in using all these different representations.
- Mathematical Representations: These are interesting and useful mathematical structures for representing data of special types, including acoustic, image, sensor, and network data. For example, to extract features from acoustic data, one often transforms to the cepstrum or the Fourier transform; for image and sensor data, one uses the wavelet transform or some other multiscale transform (e.g. pyramids in deep learning).
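As a rough illustration of the acoustic example above, the sketch below computes a naive discrete Fourier transform and a real cepstrum (taken here as the inverse transform of the log magnitude spectrum). The synthetic tone, the function names, and the small `eps` guard are assumptions made for the example; production code would use an FFT library rather than this O(n²) loop.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def real_cepstrum(x, eps=1e-12):
    """Inverse DFT of the log magnitude spectrum (eps avoids log(0))."""
    log_mag = [math.log(abs(c) + eps) for c in dft(x)]
    return [c.real for c in idft(log_mag)]


# Synthetic "acoustic" signal: a pure tone with 4 cycles over 32 samples.
n = 32
signal = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]

magnitudes = [abs(c) for c in dft(signal)]
# Search only the first half of the spectrum; the second half mirrors it.
peak_bin = max(range(n // 2), key=lambda k: magnitudes[k])
cepstrum = real_cepstrum(signal)
```

In the transformed representation the tone shows up as a single dominant frequency bin, which is the kind of feature that is hard to read off the raw samples.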
- Computing with Data: Every data scientist should know and use several languages for data analysis and data processing. These can include popular languages like R and Python, but also specific languages for transforming and manipulating text, and for managing complex computational pipelines.
Beyond basic knowledge of languages, data scientists need to keep current on new idioms for efficiently using those languages and need to understand the deeper issues associated with computational efficiency.
Cluster and cloud computing, and the ability to run massive numbers of jobs on such clusters, have become an overwhelmingly powerful ingredient of the modern computational landscape. To exploit this opportunity, data scientists develop workflows that organize work so it can be split across many jobs, run either sequentially or across many machines. Data scientists also develop workflows that document the steps of an individual data analysis or research project.
Finally, data scientists develop packages that abstract commonly-used pieces of workflow and make them available for use in future projects.
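The idea of a workflow that splits work into independent jobs can be sketched as follows. This is only a toy under stated assumptions: threads on one machine stand in for cluster nodes, and the names `split`, `job`, and `run_workflow` are hypothetical. The same decomposition runs either sequentially or concurrently and gives the same answer.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_jobs):
    """Partition data into n_jobs roughly equal contiguous chunks."""
    size, extra = divmod(len(data), n_jobs)
    chunks, start = [], 0
    for i in range(n_jobs):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def job(chunk):
    """One independent job: return a partial sum and a count."""
    return sum(chunk), len(chunk)


def run_workflow(data, n_jobs=4, parallel=True):
    """Split the data, run the jobs, then combine partial results into a mean."""
    chunks = split(data, n_jobs)
    if parallel:
        # Threads stand in for separate machines in this sketch.
        with ThreadPoolExecutor(max_workers=n_jobs) as pool:
            partials = list(pool.map(job, chunks))
    else:
        partials = [job(c) for c in chunks]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


data = list(range(1, 101))
mean_parallel = run_workflow(data, parallel=True)
mean_sequential = run_workflow(data, parallel=False)
```

Because each job returns only a small summary, the combining step stays cheap no matter where the jobs actually ran.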
- Data Modeling: Each data scientist in practice uses tools and viewpoints from both of Leo Breiman’s modeling cultures:
- Generative modeling, in which one proposes a stochastic model that could have generated the data, and derives methods to infer properties of the underlying generative mechanism. This roughly coincides with traditional academic statistics and its offshoots.
- Predictive modeling, in which one constructs methods that predict well over some given data universe, i.e. some very specific concrete dataset. This roughly coincides with modern Machine Learning and its industrial offshoots.
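A toy contrast between the two cultures, under loudly stated assumptions: the data are simulated from a hypothetical line with Gaussian noise, the generative side fits that assumed model by least squares to infer its parameters, and the predictive side uses a 1-nearest-neighbor rule judged only by held-out error. Neither choice is prescribed by the text; they are stand-ins for illustration.

```python
import random

random.seed(0)

# Hypothetical data: y = 2x + 1 plus Gaussian noise (an assumption of this sketch).
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.5)) for x in xs]
train, test = data[:150], data[150:]


def fit_line(points):
    """Generative view: assume a linear model and infer slope and intercept."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


def predict_1nn(points, x):
    """Predictive view: no model of the mechanism, just copy the nearest neighbor."""
    return min(points, key=lambda p: abs(p[0] - x))[1]


# Generative output: estimates of the mechanism's parameters.
slope, intercept = fit_line(train)

# Predictive output: accuracy on data not used for fitting.
mse_1nn = sum((predict_1nn(train, x) - y) ** 2 for x, y in test) / len(test)
```

The first approach reports something about how the data arose; the second reports only how well it predicts on this particular dataset.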
- Data Visualization and Presentation: Data visualization at one extreme overlaps with the very simple plots of EDA (histograms, scatterplots, time series plots), but in modern practice it can be taken to much more elaborate extremes. Data scientists often spend a great deal of time decorating simple plots with additional color or symbols to bring in an important new factor, and they often crystallize their understanding of a dataset by developing a new plot that codifies it. Data scientists also create dashboards for monitoring data processing pipelines that access streaming or widely distributed data. Finally, they develop visualizations to present conclusions from a modeling exercise or Common Task Framework (CTF) challenge.
- Science about Data Science: Data scientists are doing science about data science when they identify commonly occurring analysis and processing workflows, for example by using data about their frequency of occurrence in some scholarly or business domain; when they measure the effectiveness of standard workflows in terms of human time, computing resources, analysis validity, or other performance metrics; and when they uncover emergent phenomena in data analysis, for example new patterns arising in data analysis workflows, or disturbing artifacts in published analysis results.