Understanding data in the age of machine learning, artificial intelligence & Co.
Make the right decisions using skillful analysis and evaluation of data.
With digitalization, not only new challenges, but also new careers are emerging. On this page, you’ll learn more about how businesses are dealing with big data from the perspective of data science.
We will show you the most important subdisciplines of data science and which skills and competencies are necessary to excel in this field.
1. Data Science: At the center of the fourth industrial revolution
2. Data Sourcing – An integral component
3. Data Cleansing – Orderliness and Structure
4. Exploratory Data Analysis (EDA)
5. Data Mining: Uncovering Knowledge
6. Artificial Intelligence (AI) in data science
7. Coding – the most important coding languages in data science
8. Job profile: Data Scientist/Data Analyst
DATA SCIENCE: AT THE CENTER OF THE FOURTH INDUSTRIAL REVOLUTION
Big Data. As part of the digital revolution we are being confronted with a new kind of resource – data.
Companies resisting the transition into the data economy have cause for concern. Thanks to modern technology, computer and cloud applications, and more, businesses of all sizes – from small firms to international corporations – have access to more data than ever. At least in theory, that is.
Big Data by itself doesn’t generate insights. There has to be a way to evaluate these enormous amounts of data.
To draw insightful conclusions, there has to be someone to investigate, analyze, and interpret the data. And perhaps most importantly, specialists have to prepare the data to be used in a useful way.
The interpretation and visualization of data can help management or company leadership make evidence-based decisions – or, better yet, ask deeper questions. In this way, Big Data becomes tangible, comprehensible, and useful: with Data Science.
What is Data Science?
Data science is an interdisciplinary science that seeks to generate insights and knowledge through the collection, study, and evaluation of data.
This knowledge enables businesses to make fact-based decisions and drive competitive optimization, such as streamlining operations, reaching more customers, simplifying logistics solutions, and much more.
Data Science is therefore an applied science: It combines industry-specific approaches with knowledge and methods from statistics, computer science, mathematics, and probability theory.
DATA SOURCING – AN INTEGRAL COMPONENT
Data science would be nothing without data sourcing.
The process of data sourcing entails, on the one hand, extracting data from different primary and secondary sources and, on the other, integrating these data into the company's own data infrastructure. Data sourcing is the first step in making data usable and actionable for business processes.
Primary data are data that the business itself generates. A classic example of primary data is the customer survey.
Secondary data sources come from third parties and can be accessed in diverse ways – data governance and data security are, however, additional concerns when using these kinds of data. Internal (e.g., customer data in CRM programs) and external (from third party data vendors) data are two possible sources of secondary data.
DATA CLEANSING – ORDERLINESS AND STRUCTURE
If you have been following along carefully and perhaps even had a CRM export in your hands – or had the pleasure of processing and analyzing customer survey results – you are likely asking yourself if unfiltered data can be used to generate insights as is. The short answer is: not quite.
Data cleansing, a.k.a. data cleaning or data scrubbing, is the process of documenting and removing inconsistencies (e.g., outlier values) in a dataset. Common sources of unwanted mistakes are structural issues, duplicate datasets, irrelevant data, and the like.
Data integrity – that is, consistent and complete data – is the goal of data cleansing. Here’s why: Only clean data can be subject to meaningful analysis.
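As a rough sketch of these steps – deduplication, removing incomplete rows, and filtering out implausible values – here is a minimal example in plain Python. The records and the plausibility rule for ages are invented for illustration:

```python
def clean(records):
    """Deduplicate, drop incomplete rows, and remove implausible values."""
    # 1. Remove exact duplicates while preserving order.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    # 2. Drop rows with missing values.
    complete = [r for r in unique if all(v is not None for v in r.values())]
    # 3. Remove implausible values (a simple plausibility rule: 0 < age < 120).
    return [r for r in complete if 0 < r["age"] < 120]

raw = [
    {"name": "Ada", "age": 34},
    {"name": "Ada", "age": 34},     # duplicate record
    {"name": "Bo", "age": None},    # missing value
    {"name": "Cy", "age": 29},
    {"name": "Zed", "age": 29},
    {"name": "Err", "age": 5000},   # implausible outlier
]
print(clean(raw))
```

Real projects would use dedicated tooling for this, but the three stages shown here are the same ones the definition above describes.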
EXPLORATORY DATA ANALYSIS (EDA)
Exploratory Data Analysis (EDA) comes before any actual data analysis starts. This approach to getting to know a dataset was developed by the American mathematician John Tukey in the 1970s.
EDA helps to make data comprehensible with visuals. With the help of these visualizations, data science experts can start identifying interesting patterns or optimal analysis methods before they even begin formulating a research question.
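To make this concrete, here is a tiny text-based first look at a dataset: summary statistics plus a crude histogram, standing in for the box plots and scatter plots EDA normally relies on. The order values are invented for illustration:

```python
from statistics import mean, median, quantiles

orders = [12, 15, 14, 13, 80, 15, 16, 12, 14, 13, 15, 14]

print("mean:  ", round(mean(orders), 1))
print("median:", median(orders))
q1, q2, q3 = quantiles(orders, n=4)   # quartiles of the distribution
print("IQR:   ", q3 - q1)

# Text histogram: one '#' per observation in each bucket of width 10.
for lo in range(0, 100, 10):
    bucket = sum(lo <= v < lo + 10 for v in orders)
    print(f"{lo:3d}-{lo + 9:<3d} {'#' * bucket}")
```

Even this crude plot reveals the kind of pattern EDA is after: the mean is pulled far above the median by a single extreme value, which stands out immediately in the histogram.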
DATA MINING: UNCOVERING KNOWLEDGE
In broad terms, Data Mining is a computer-mediated data analysis process. Its primary aim is to identify patterns in datasets.
Careful: Data Mining and Data Sourcing are not interchangeable. Even if both processes ultimately serve the acquisition and analysis of data, data sourcing merely refers to collecting raw data. Data mining, on the other hand, refers to pulling insights from a clean, consistent database. For this reason, Data Mining is also known as “Knowledge Discovery.”
Data Mining Methods and Algorithms
In general, there are four different methods that support generating insights for a broad spectrum of applications – from medicine and research, manufacturing and logistics, all the way to trade, banking, finance, and insurance:
Classification
Exemplary application: Predicting customer interests and consumer behavior
Prediction (Regression and Prognosis)
Exemplary application: Profit predictions
Grouping (Segmenting and Clustering)
Exemplary application: Segmenting a database of newsletter subscribers
Identifying dependencies (Association and Sequence)
Exemplary application: “Customers that bought X also bought Y.”
This is where Data Mining Algorithms come into play. Popular examples of algorithms include decision trees, regression analysis, artificial neural networks as well as association analysis.
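The association idea behind “Customers that bought X also bought Y” can be sketched by simply counting how often pairs of items appear in the same basket. Real association analysis uses algorithms such as Apriori; the baskets below are invented for illustration:

```python
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
]

# Count co-occurrences of every item pair across all baskets.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Frequent pairs suggest rules like "customers who bought bread
# also bought butter".
print(pair_counts.most_common(3))
```

Production systems additionally compute support and confidence for each candidate rule, but the underlying signal is exactly this kind of co-occurrence count.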
Important Subdivision: Predictive (Data) Analytics
As a subdiscipline of Data Mining, Predictive Analytics focuses on making predictions about future events. The goal is using data patterns to minimize business risk and, in turn, maximize business success.
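As a minimal illustration of predictive analytics, the sketch below fits a least-squares line to invented quarterly profit figures and extrapolates one quarter ahead:

```python
quarters = [1, 2, 3, 4, 5, 6]
profit   = [10.0, 11.5, 13.0, 14.5, 16.0, 17.5]  # in thousands, invented

# Ordinary least squares for a single predictor, computed by hand.
n = len(quarters)
mx = sum(quarters) / n
my = sum(profit) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(quarters, profit))
         / sum((x - mx) ** 2 for x in quarters))
intercept = my - slope * mx

# Extrapolate the trend to quarter 7.
forecast_q7 = slope * 7 + intercept
print(f"forecast for Q7: {forecast_q7:.1f}")
```

Real predictive models are, of course, far more sophisticated than a straight line, but the principle is the same: learn a pattern from historical data and project it forward.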
ARTIFICIAL INTELLIGENCE (AI) IN DATA SCIENCE
The concept of artificial intelligence (AI) describes the ability of computers to make decisions independently, similar to how a human would.
Most of the time, when experts talk about AI, they actually mean a subfield called Machine Learning, which relies on artificial neural networks. Computers have not yet reached the point of acting fully independently.
Machine Learning is the primary method used in data mining. But what exactly does it mean?
Machine Learning makes it possible for us to “teach” computers how to extract facts and correlations from data. Over time, the computer “learns” using a predefined set of training data to watch out for specific patterns and assign these patterns to specific kinds of input data.
Machine Learning (Unsupervised Learning)
In the case of unsupervised (machine) learning, a “ground truth,” or predefined output value, remains unknown.
Both supervised and unsupervised learning are learning algorithms that aim to train artificial intelligence to perform a specific task.
Machine Learning (Supervised Learning)
In contrast to unsupervised learning, supervised learning uses known results – labelled training data – to facilitate the machine learning process.
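A toy example of supervised learning is a 1-nearest-neighbour classifier: it labels a new data point using the known results in a labelled training set. The animal measurements below are invented for illustration:

```python
def nearest_neighbour(train, point):
    """Return the label of the training example closest to `point`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda ex: sq_dist(ex[0], point))
    return label

# Labelled training data: (height_cm, weight_kg) -> animal.
train = [
    ((30, 4), "cat"),
    ((35, 5), "cat"),
    ((60, 25), "dog"),
    ((70, 30), "dog"),
]

print(nearest_neighbour(train, (32, 4.5)))  # near the cat examples
print(nearest_neighbour(train, (65, 28)))   # near the dog examples
```

The labels in `train` are the “known results” the text mentions; an unsupervised method would receive only the feature tuples and have to discover the two groups on its own.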
At this point, the concept of pattern recognition is also worth mentioning.
Pattern Recognition describes the ability of a so-called cognitive system (e.g., the human mind) to identify regularities, repetitions, and similarities.
Humans are able to recognize language, images, and faces intuitively. Even when no patterns exist, the human mind continues looking for them. It is indeed pattern recognition, whether it be visual or linguistic, that is critical to our survival. While this faculty is important to our very survival, it can also result in faulty assumptions that take the form of superstitions or prejudices.
So when we say that computers don’t possess the same level of pattern recognition as humans, that holds only for open, real-world contexts. Within a predefined set of data, however, a well-trained computer recognizes patterns and correlations far faster and more efficiently than any human. Bringing that capability to messy, real-world contexts remains one of the main challenges in AI development.
Artificial neural networks (ANN)
For simplicity’s sake, it’s common to talk about AI and data processing in terms of “computers.” In reality, however, AI-facilitated data processing is based on artificial neural networks. The origins of ANNs extend back to the 1940s, and since 2009 they have been making their way into science and industry.
As the name suggests, these networks are built on the same premise as biological neural networks.
Artificial neurons are the building blocks of ANNs. In most ANN models, these neurons are organized into layers. Multi-layered networks consist of an input layer, one or more hidden layers, and an output layer.
Possible applications of artificial neural networks include text and image recognition, error detection such as in early warning systems, image processing, machine translation, synthesis of images and language, medical diagnostics, and, of course, data mining.
Multi-layered ANN and Deep Learning
These specialized methods represent a subset of machine learning and come into play when the ANN contains hidden layers between its input and output layers. This hierarchical organization allows data to be processed literally layer by layer. There is currently no strict definition of how many layers a model must have to count as Deep Learning.
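The layer-by-layer processing can be sketched as a forward pass through a tiny network with one hidden layer. The weights, biases, and inputs below are arbitrary placeholder values:

```python
import math

def sigmoid(x):
    """Common activation function squashing any value into (0, 1)."""
    return 1 / (1 + math.exp(-x))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sums followed by activation."""
    return [sigmoid(sum(w * i for w, i in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

x = [0.5, -1.0]                                            # input layer
hidden = layer(x, [[0.8, -0.2], [0.4, 0.9]], [0.1, -0.3])  # hidden layer
output = layer(hidden, [[1.0, -1.0]], [0.0])               # output layer
print(output)
```

Training a real network means adjusting those weights and biases from data (typically via backpropagation); the forward pass shown here is only the data-flow skeleton.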
CODING – THE MOST IMPORTANT CODING LANGUAGES IN DATA SCIENCE
Without computers, there wouldn’t be data science – that much we know is true. But which coding languages are important, and why?
Rapid development in the field of data science means data scientists have to know more than just a single coding language.
There are at least five coding languages that are typically used for completing tasks in data science such as data processing, analysis, and preparation:
Python (more on that in a second)
Important for developing Machine Learning models
Used for developing scalable Big-Data libraries
For creating and querying databases
Complementary to Python and very versatile
Moreover, other coding languages like Haskell, Matlab, and Perl are increasingly used for various tasks in data science. Their importance and popularity also depend on the region (Germany, USA, China, India).
Python was developed in 1991 as an “all-in-one” coding language. Common applications of Python include the development of web applications, games, and so-called scripts. The libraries and clear-cut syntax of this language named after the British comedy troupe Monty Python make it very important for data scientists.
Since Python was developed under the premise of being a “tidy” coding language, it is capable of supporting multiple coding paradigms (i.e., object oriented, aspect oriented). This flexibility offers coders a lot of freedom when it comes to solving problems.
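As a small illustration of this flexibility, here is the same task – summing the squares of a list of numbers – solved in two of the styles Python supports, imperative and functional:

```python
def total_imperative(values):
    """Imperative style: an explicit loop with mutable state."""
    result = 0
    for v in values:
        result += v * v
    return result

def total_functional(values):
    """Functional style: the same computation as a single expression."""
    return sum(map(lambda v: v * v, values))

print(total_imperative([1, 2, 3]))   # 14
print(total_functional([1, 2, 3]))   # 14
```

Which style to use is largely a matter of readability in context; both produce identical results, which is exactly the freedom the paragraph above describes.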
Additional advantages of Python are its platform-independence and rapid processing of large amounts of data.
JOB PROFILE: DATA SCIENTIST/DATA ANALYST
There is currently a lack of professional data scientists, which bodes well for future job prospects. As a data scientist, you’ll be practicing what Harvard Business Review has called the “sexiest job” of the 21st century.
So what do you need in your repertoire to be a successful data scientist?
In this article, our discussion has been heavy on the technical side. Nevertheless, data science is interdisciplinary: As a data science expert, you won’t only be tinkering with code, but also interacting a lot with people.
Here’s why: Extracting relevant knowledge from business processes resembles detective work. It requires industry-specific knowledge, a business-oriented mindset, and excellent communication skills. Preparing and interpreting complex data for people without a background in coding and math is just as important as sourcing and mining the data.
Even so, coding skills are an obvious prerequisite for getting into data science. Collecting, cleaning, and preparing data requires at the very least basic knowledge in the coding languages above. For creating complex models, additional knowledge in math and statistics is also necessary.
HOW DO I BECOME A DATA SCIENTIST?
Typically, getting into a career in data science requires a university degree in math, statistics, or computer science.
But this isn’t the only way: Since data scientists often need industry-specific knowledge, there are also other opportunities for getting into the field. A mid-career switch or continuing education are both possible. Either with in-person courses or online with E-learning: There’s an option for everyone looking to break into data science.
A willingness to learn and to work in interdisciplinary teams rounds out the list of requirements for a hopeful data scientist.
Data Science courses
Coming sooner or later . . .