Note that this post is heavily inspired by Andrew Ng's content on the topic.

So what's behind the "buzzwords"? Let's jump straight into it.

- The Misconception about AI
- What is Machine Learning?
- Terminology of AI
- What is Data?
- How do you get Data?
- Misuse of Data
- Summary

(Image Source: https://unsplash.com/photos/1K6IQsQbizI)

There is a lot of unnecessary hype around AI, which is mostly due to a common misconception that many people have. Artificial Intelligence can be separated into two parts or two ideas:

**Artificial Narrow Intelligence (ANI)** describes **AIs that are good at one specific task**, the task they are trained and developed for. This can be, for example, an AI system that predicts house prices based on historical data, or the algorithm that recommends YouTube videos to you. Other examples are predictive maintenance, quality control, etc. ANI is a very powerful tool, and it will add a lot of additional value to our society over the coming years. All of the progress that we saw in recent years, and that we constantly hear about in the news, happened in the field of ANI. These catchy news articles lead people to wrongly assume that science made a lot of progress in AGI, but in reality **we only made progress in ANI**.

**Artificial General Intelligence (AGI)** is the end goal of AI: **a computer system that is as smart as, or smarter than, a human**. An AGI could successfully perform any intellectual task that a human being can. This is also the part of AI that raises the most fear in people. They imagine a world where computers are much smarter than humans, where nearly every job is automated, or even Terminator-like scenarios. This is what I mean by unnecessary hype: it leads to irrational fears about the future of humanity, while in reality **we are still many technological breakthroughs away from reaching real AGI**.

You could say that Machine Learning is **the backbone technology of AI**. It uses **statistical techniques to give a computer program the ability to learn from data (i.e., to progressively improve its performance on a specific task) without being explicitly programmed**.

Machine Learning is the tool of AI that caused all the hype and that **enabled nearly all of the value created through AI systems.** It can also be separated into different parts, but one part alone is responsible for roughly 80% of the value created through Machine Learning: **Supervised Learning**.

Supervised learning algorithms simply learn **input (A) to output (B) mappings** by finding relationships within large amounts of data. Imagine that you want to build a system that can classify e-mails into spam and non-spam. You would need to accumulate a lot of "labeled" examples of e-mails, meaning every e-mail comes with a label that tells whether it is spam or not. You would need thousands of such labeled e-mails, which you can then feed into a supervised machine learning algorithm. During training, the algorithm analyzes all the e-mails you gave it and iteratively improves its understanding of which attributes differentiate spam from non-spam e-mails. In this example, the system has to map an e-mail (A) to a label that tells whether the mail is spam or not (B).

Like I said, you train the algorithm by giving it thousands of labeled e-mails. After training on that data, you can give it a completely new e-mail (one the algorithm has never seen before) as input, and it will tell you whether it thinks that e-mail is spam or not.
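The spam example above can be sketched in a few lines of Python. Note that this is a toy illustration with made-up e-mails and a simple Naive Bayes model, not a production spam filter:

```python
# Toy supervised learning: map e-mail text (A) to a spam label (B).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented "labeled" examples: 1 = spam, 0 = not spam.
emails = [
    "win a free prize now", "cheap pills limited offer",
    "claim your free money", "meeting at noon tomorrow",
    "lunch with the team", "project report attached",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()              # turns text into word counts
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)      # "training" on labeled data

# A completely new e-mail the algorithm has never seen before:
prediction = model.predict(vectorizer.transform(["free prize offer"]))[0]
print("spam" if prediction == 1 else "not spam")
```

With thousands of real labeled e-mails instead of six invented ones, the same pattern scales to a genuinely useful classifier.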

Another example is online advertising, where the input is information about a user (A) and the output is a label that tells whether the user will click an ad or not (B). In speech recognition, the input is speech as an audio file (A) and the output is a transcript of what is said in that file (B). Or you give the algorithm an image of a steel plate (A) and it has to tell whether the plate is defective or flawless (B).

This can seem like a quite limiting technology at first glance, but it is incredibly powerful if you find the right application for it. It is the single biggest source of additional value that AI creates for our society. The number of different use cases for this technology seems endless, and people discover new ones every day.

Artificial Intelligence is a very complex field with a lot of terms that can be quite confusing in the beginning. You probably heard about Neural Networks, Deep Learning or Data Science. We will now take a look at the most important terms of AI and uncover their meaning, so that you are able to talk about AI with other people and to think about how you could apply AI within your business.

I am giving you the most commonly used definitions of AI terms, but be aware that **AI is a very opaque field where many terms are used interchangeably** and sometimes inconsistently.

Artificial Intelligence is an **area of computer science that emphasizes the creation of intelligent machines that work and react like humans**. As I already mentioned, when people talk about AI, they mostly mean Artificial General Intelligence (AGI). You should see AI as the whole area, and Machine Learning and Deep Learning as the techniques used to make computers act intelligently.

Machine Learning is **a subfield of AI**: a field of study **that gives computers the ability to learn from data** without being explicitly programmed. With Machine Learning, you basically train a program to perform a certain task. Machine Learning therefore often results in a running AI system, which is basically a piece of software.

Example of a Machine Learning Project:

Imagine that you are a real estate company and that you have a lot of data about houses. You partner with a Machine Learning company to build a system that predicts future house prices. Such a system enables you to make better decisions about which houses to invest in and to figure out the right time to liquidate your investments.
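To make this concrete, here is a minimal sketch of such a price-prediction model using a plain linear regression. The houses, features, and prices below are invented purely for illustration:

```python
# Toy regression: map house features (A) to a price (B).
from sklearn.linear_model import LinearRegression

# Invented historical data. Features: [square meters, number of bedrooms].
X = [[50, 1], [80, 2], [120, 3], [160, 4]]
y = [150_000, 240_000, 360_000, 480_000]  # sale prices

model = LinearRegression().fit(X, y)       # learn the feature-to-price mapping

# Predict the price of a previously unseen house:
predicted = model.predict([[100, 3]])[0]
print(f"Predicted price: {predicted:,.0f}")
```

A real project would use many more features (location, age, condition, ...) and far more data, but the structure, labeled historical examples in and a prediction out, is the same.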

Deep Learning is a sub-part of Machine Learning, and it is responsible for all the media hype and most of the breakthroughs in ANI that we saw in recent years and still see today.

It is basically the same thing as Machine Learning: you give the algorithm labeled data and it learns to predict the labels. **The difference is that you use more modern and more sophisticated algorithms, called Neural Networks**, whereas in traditional Machine Learning you use simpler algorithms.
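To illustrate that the supervised setup stays the same, here is a tiny Neural Network trained on synthetic data with scikit-learn. Real Deep Learning models stack many more layers and use far more data; this is only a sketch:

```python
# Same labeled-data setup as before, but with a (very small) Neural Network.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# Synthetic two-class dataset for illustration.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# One hidden layer of 16 units; "deep" models would stack many layers.
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
net.fit(X, y)

print(f"Training accuracy: {net.score(X, y):.2f}")
```

The interesting part is that nothing about the workflow changed compared to the earlier examples: only the algorithm inside got more sophisticated.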

Due to their complexity, new technical discoveries, and the availability of enough data and computational power, Deep Learning algorithms were able to break previous benchmarks on many tasks and even to outperform humans on some of them (for example, histopathological image analysis, or recommending movies on Netflix).

Although Neural Networks (i.e., Deep Learning algorithms) almost always perform better than traditional algorithms, they have certain disadvantages. If you want to know more about that, check out my post "Pros and Cons of Neural Networks" (https://towardsdatascience.com/hype-disadvantages-of-neural-networks-6af04904ba5b).

You often hear that Neural Networks are built like the human brain or inspired by it, but in reality they have almost nothing to do with it. It is true that they were initially inspired by the brain, but **the details of how they work are completely unrelated to how biological human brains work**.

Note that many people use the terms Deep Learning and Neural Networks interchangeably.

Example of a Deep Learning Project:

A Deep Learning project does not differ that much from a Machine Learning project when you look at it from a high-level view. You just need much more data, more computational power, and highly skilled engineers.

You could say that **Data Science is the science of extracting knowledge and insights from data by analyzing it with statistical methods**, visualizations, etc. The output of a Data Science project is usually a set of insights that help you make better business decisions, such as whether to invest in something, whether to acquire certain equipment, or whether your website should be restructured. These outputs often take the form of presentations or slide decks that summarize conclusions for executives, leaders, or product teams.

Example of a Data Science Project:

Imagine that you are in the online advertising industry. By analyzing your company's sales data, your data scientists found out that the travel industry does not buy many ads from you. As a result, you could shift your sales team's focus toward companies in the travel industry.

Another example:

Imagine that you are running an e-commerce business and you hired a few Data Scientists to get more insight into your business. The outcome of this project could be a slide deck presenting a plan for how to modify pricing in order to increase overall sales, or insights on how to market specific products more efficiently.
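Under the hood, insights like the two examples above often come from simple grouping and aggregation of business data. A hedged sketch, with invented sales records, of the kind of analysis that could surface the "travel industry buys few ads" finding:

```python
# Toy Data Science analysis: aggregate ad sales by customer industry.
import pandas as pd

# Invented sales records for illustration.
sales = pd.DataFrame({
    "industry": ["retail", "travel", "retail", "finance", "travel", "finance"],
    "ad_spend": [12_000, 800, 15_000, 9_000, 500, 11_000],
})

# Total ad spend per industry, smallest first:
by_industry = sales.groupby("industry")["ad_spend"].sum().sort_values()
print(by_industry)
```

The output immediately shows which industry buys the fewest ads; turning that number into a slide and a sales-strategy recommendation is the Data Science part.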

Some people say that AI is a subset of Data Science, and some say it is the other way around. It depends on whom you are talking to, but I would say that Data Science is an interdisciplinary field that uses many tools from AI, Machine Learning, and Deep Learning, while also having its own separate tools. Its goal is mostly to drive business insights.

You probably also heard about other buzzwords, like Reinforcement Learning, Generative Adversarial Networks (GANs), etc. These are just other tools to make AI systems act intelligently or, in other words, to do Machine Learning and sometimes Data Science.

You now know about AI, Machine Learning, Data Science and Deep Learning (i.e., Neural Networks). I hope this gives you a sense of the most common terms used in AI, and that you can start thinking about how these things might apply to your business.

Data can take on many forms: spreadsheets, images, audio, sensor data etc. These are split into two main categories: structured & unstructured data.

**Structured Data (“data that lives in a giant spreadsheet”)**

Structured data is, as its name implies, **data that is stored in a structured format following a pre-defined schema**. It refers to any data that resides in a fixed field within a record or a file. It can be textual or non-textual.

Below you can see an example of structured data from the popular Titanic dataset. It contains information about each passenger who was on the Titanic.
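Since structured data is essentially tabular, a few rows in the spirit of that dataset can be written directly as a pandas DataFrame. The rows below are simplified and the values illustrative, not the actual dataset:

```python
# A few illustrative rows of structured (tabular) data, Titanic-style.
import pandas as pd

passengers = pd.DataFrame({
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss Laina"],
    "Age": [22, 38, 26],
    "Sex": ["male", "female", "female"],
    "Survived": [0, 1, 1],   # the label: did the passenger survive?
})

print(passengers)
```

Every value lives in a fixed field with a known type and meaning, which is exactly what makes this data "structured".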

**Unstructured Data**

Unstructured data is essentially everything else that is not structured via a pre-defined schema. It can be textual or non-textual, but **when people talk about unstructured data, they mostly mean images, videos, audio files, documents etc.**

I already explained what supervised learning is. Since supervised learning is the most commonly used type of Machine Learning, **when people say "data", they mostly mean labeled data**. Example: you have a dataset with photos of 100,000 dogs and cats where each photo has a label, either "cat" or "dog".

Another example is a dataset that contains information about housing prices. Here you would have information about the houses (like square meters, number of bedrooms, location, etc.) and also their price as the label.

You can find many datasets for a lot of problems on the internet (some free, some paid), but most of the time you need to create your own dataset (if you don't already have one) that is specifically tailored to the problem you are trying to solve with AI.

There are three main ways to get data:

**Manual labeling:** Imagine that you want to build a classifier that can detect whether there is a man or a woman in a given picture. To train such a classifier, you would need to create or collect many images of men and women. Then, you need to assign a label to every image: man (label 1) or woman (label 2). You can also pay people to do the labeling work for you (e.g., Amazon Mechanical Turk: mturk.com).

**Observing behaviors:** Imagine that you run an e-commerce business and want to predict when a customer will make a purchase, which in turn enables you to manage your stock better. You could create a dataset by simply observing how users behave on your website and how they make purchases. This would result in a dataset that describes the actions of each user (through variables like time of day, where they clicked, etc.), together with a label: purchase (label 1) or no purchase (label 2).

Another example is that you observe the behavior of machines, which could enable you to predict when they need maintenance.
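Such an observed-behavior dataset might look like the following sketch, where the column names and values are made up for illustration:

```python
# Sketch of a behavior dataset built from website logs (invented values).
import pandas as pd

sessions = pd.DataFrame({
    "hour_of_day":   [9, 14, 21, 11, 23],
    "pages_viewed":  [3, 12,  1,  8,  2],
    "added_to_cart": [0,  1,  0,  1,  0],
    "purchase":      [0,  1,  0,  1,  0],   # the label (B)
})

features = sessions.drop(columns="purchase")  # the inputs (A)
label = sessions["purchase"]
print(features.shape, label.tolist())
```

From here, the features (A) and label (B) can be fed straight into a supervised learning algorithm, exactly as in the spam example earlier.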

**Downloading or acquiring data:** There are many free sources for datasets, like Kaggle. You can also use Google Dataset Search, which works like Google but only for datasets. If you do not find anything, you can look for datasets on a data marketplace or get them from a partner.

(Image Source: https://unsplash.com/photos/1K6IQsQbizI)

Acquiring data may seem simple at first glance, but there is a lot that can go wrong. **In AI & Machine Learning we say: "Garbage in, garbage out", which means nothing more than that you get the quality out of your AI system that you put into it during training.**

Imagine you know that you want to create a specific AI application and you start acquiring data (which you think is useful). Your plan is to accumulate data for two years and then build an AI system. This is very bad practice. The correct way would be to acquire whatever data you are able to get and give it to an AI expert as soon as possible. After some evaluation, they can tell you which parts of it are useful, which parts are completely useless, and what data you should additionally collect. That way, you avoid the risk of acquiring data over two years only to realize that it was the wrong data and you can't do anything with it. To save money and time: evaluate the quality of your data quickly, together with experts.

Another big problem is incorrect labels. Example: cat images labeled as dogs and dog images labeled as cats. This prevents your algorithm from learning what really separates cats from dogs and totally confuses it. The good news is that the problem of incorrect labels becomes less important the more data you have in total. If you have a huge dataset with over 2 million labeled cat and dog images, a few incorrect labels won't hurt its performance.

Another problem is that some people assume that because their company has a lot of data, this data is useful, or that an AI team can make it useful. That is completely wrong. **Although more data is usually better, you can have billions of data entries that are worth nothing, and not even the best AI engineers in the world can create value out of something that has no value.** So please don't throw data at an AI team and assume it will somehow be valuable. You may think this is common sense, but I saw it happen many times in the industry because of a misunderstanding about data and AI. There are even startups that were founded because people thought they possessed useful data when in fact they didn't. Other issues are missing values, multiple types of data (solvable, but costly), and much more.

I hope that this post gave you a solid, high-level introduction to the field of Artificial Intelligence and that you now have a better understanding of how AI works and what it can really do. If you think there is something missing or not explained clearly enough, let me know in the comments. To summarize: you learned about the common misconception about AI (i.e., that people often confuse AGI with ANI) and what Machine Learning and data really are. You are now familiar with the most common terms of the field: Data Science, Deep Learning, AI and Machine Learning. Additionally, you learned where you can get data, how you should not approach data acquisition, and that having a lot of data does not necessarily mean that you can do AI with it.

Markov Solutions can help you. We build AI-powered software solutions & advise companies on the topic.

Just write a message to: **donges@markov-solutions.com**

**Machine Learning Yearning is about structuring the development of machine learning projects. The book contains practical insights that are difficult to find somewhere else, in a format that is easy to share with teammates and collaborators. Most technical AI courses will explain to you how the different ML algorithms work under the hood, but this book teaches you how to actually use them. If you aspire to be a technical leader in AI, this book will help you on your way. Historically, the only way to learn how to make strategic decisions about AI projects was to participate in a graduate program or to gain experience working at a company. Machine Learning Yearning is there to help you quickly acquire this skill, which enables you to become better at building sophisticated AI systems.**

- About the Author
- Introduction
- Concept 1: Iterate, iterate, iterate…
- Concept 2: Use a single evaluation metric
- Concept 3: Error analysis is crucial
- Concept 4: Define an optimal error rate
- Concept 5: Work on problems that humans can do well
- Concept 6: How to split your dataset
- Summary

Andrew Ng is a computer scientist, executive, investor, entrepreneur, and one of the leading experts in Artificial Intelligence. He is the former Vice President and Chief Scientist of Baidu, an adjunct professor at Stanford University, the creator of one of the most popular online courses for machine learning, the co-founder of Coursera.com, and a former head of Google Brain. At Baidu, he was significantly involved in expanding the AI team to several thousand people.

The book starts with a little story. Imagine you want to build the leading cat detector system as a company. You have already built a prototype, but unfortunately your system's performance is not that great. Your team comes up with several ideas on how to improve the system, but you are confused about which direction to follow. You could build the world's leading cat detector platform, or waste months of your time following the wrong direction.

This book is there to tell you how to decide and prioritize in a situation like this. According to Andrew Ng, most machine learning problems will leave clues about the most promising next steps and about what you should avoid doing. He goes on to explain that learning to "read" those clues is a crucial skill in our domain.

**In a nutshell, ML Yearning is about giving you a deep understanding of setting the technical direction of machine learning projects.**

Since your team members could react skeptically when you propose new ways of doing things, he made the chapters very short (1–2 pages), so that your team members can read them in a few minutes and understand the idea behind the concepts. If you are interested in reading this book, note that it is not suited for complete beginners, since it requires basic familiarity with supervised learning and deep learning.

In this post, I will share six concepts of the book, in my own words and based on my understanding.

Ng emphasizes throughout the book that it is crucial to iterate quickly, since machine learning is an iterative process. Instead of thinking about how to build the perfect ML system for your problem, you should build a simple prototype as fast as you can. This is especially true if you are not an expert in the domain of the problem, since it is hard to correctly guess the most promising direction.

You should build a first prototype in just a few days and then clues will pop up that show you the most promising direction to improve the performance of the prototype. In the next iteration, you will improve the system based on one of these clues and build the next version of it. You will do this again and again.

He goes on to explain that the faster you can iterate, the more progress you will make. Other concepts of the book build upon this principle. Note that this is meant for people who just want to build an AI-based application, not do research in the field.

This concept builds upon the previous one, and the reason why you should choose a single-number evaluation metric is very simple: it enables you to quickly evaluate your algorithm, and therefore you are able to iterate faster. Using multiple evaluation metrics simply makes it harder to compare algorithms.

Imagine you have two algorithms. The first has a precision of 94% and a recall of 89%. The second has a precision of 88% and a recall of 95%.

Here, neither classifier is obviously superior, so without a single evaluation metric you would probably have to spend some time figuring it out. The problem is that you lose time on this at every iteration, and it adds up over the long run. You will try a lot of ideas about architecture, parameters, features, etc. A single-number evaluation metric (such as accuracy or the F1 score) enables you to sort all your models according to their performance and quickly decide which one works best. Another way of improving the evaluation process is to combine several metrics into a single one, for example by averaging multiple error metrics.
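For the two hypothetical classifiers above, combining precision and recall into the F1 score (their harmonic mean) immediately makes them comparable:

```python
# Collapse precision and recall into a single number: the F1 score.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

f1_a = f1(0.94, 0.89)  # first algorithm
f1_b = f1(0.88, 0.95)  # second algorithm

print(f"Algorithm 1: F1 = {f1_a:.4f}")
print(f"Algorithm 2: F1 = {f1_b:.4f}")
```

The two scores turn out to be extremely close (both around 0.914, with the first algorithm narrowly ahead), which is exactly the kind of call that would cost you time at every iteration without a single-number metric.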

Nevertheless, there will be ML problems that need to satisfy more than one metric, for example when taking running time into consideration. Ng explains that you should define an "acceptable" running time, which enables you to quickly sort out the algorithms that are too slow, and then compare the remaining ones based on your single-number evaluation metric.

**In short, a single-number evaluation metric enables you to quickly evaluate algorithms, and therefore to iterate faster.**

Error analysis is the process of looking at examples where your algorithm's output was incorrect. For example, imagine that your cat detector mistakes birds for cats, and you already have several ideas on how to solve that issue.

With a proper error analysis, you can estimate how much an idea for improvement would actually increase the system's performance, without investing months of your time implementing the idea only to realize that it wasn't crucial to your system. This enables you to decide which idea is the best to spend your resources on. If you find out that only 9% of the misclassified images are birds, then it does not matter how much you improve your algorithm's performance on bird images, because that can fix at most 9% of your errors.

Furthermore, it enables you to quickly judge several ideas for improvement in parallel. You just need to create a spreadsheet and fill it out while examining, for example, 100 misclassified dev-set images. In the spreadsheet, you create a row for every misclassified image and a column for every improvement idea. Then you go through every misclassified image and mark which ideas would have classified the image correctly.

Afterwards, you know exactly that, for example, with idea 1 the system would have classified 40% of the misclassified images correctly, with idea 2 12%, and with idea 3 only 9%. Then you know that idea 1 is the most promising improvement for your team to work on.
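The spreadsheet procedure can be mimicked with a small pandas table; the idea names and the 0/1 marks below are invented for illustration:

```python
# Error-analysis "spreadsheet": one row per misclassified image,
# one column per improvement idea (1 = this idea would have fixed it).
import pandas as pd

errors = pd.DataFrame({
    "idea_1_fix_blurry": [1, 1, 0, 1, 0],
    "idea_2_fix_birds":  [0, 0, 1, 0, 0],
    "idea_3_fix_night":  [0, 1, 0, 0, 0],
})

# Share of misclassified examples each idea would recover, best first:
recovery = errors.mean().sort_values(ascending=False)
print(recovery)
```

With 100 real dev-set images instead of 5 invented rows, the column means tell you directly which idea has the largest ceiling for improvement.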

Also, once you start looking through these examples, you will probably find new ideas on how to improve your algorithm.

The optimal error rate is helpful to guide your next steps. In statistics, it is also often called the Bayes error rate.

Imagine that you are building a speech-to-text system and you find out that 19% of the audio files you expect users to submit have such dominant background noise that even humans can't recognize what was said in them. If that's the case, you know that even the best system would probably have an error rate of around 19%. In contrast, if you work on a problem with an optimal error rate of nearly 0%, you can hope that your system will do just as well.

It also helps you detect whether your algorithm is suffering from high bias or high variance, which helps you define the next steps to improve it.

But how do we know what the optimal error rate is? For tasks that humans are good at, you can compare your system's performance to that of humans, which gives you an estimate of the optimal error rate. In other cases, it is often hard to define an optimal rate, which is the reason why you should work on problems that humans can do well, which we will discuss in the next concept.

Throughout the book, he explains several times why it is recommended to work on machine learning problems that humans can do well themselves. Examples are speech recognition, image classification, object detection, and so on. There are several reasons for this.

First, it is easier to get or to create a labeled dataset, because it is straightforward for people to provide high accuracy labels for your learning algorithm if they can solve the problem by themselves.

Second, you can use human performance as the optimal error rate that you want to reach with your algorithm. Ng explains that having a reasonable and achievable optimal error rate helps to accelerate the team's progress. It also helps you detect whether your algorithm is suffering from high bias or variance.

Third, it enables you to do error analysis based on your human intuition. If you are building, for example, a speech recognition system and your model misclassifies its input, you can try to understand what information a human would use to get the correct transcription, and modify the learning algorithm accordingly. Although algorithms keep surpassing humans at more and more tasks, you should try to avoid problems that humans can't do well themselves.

To summarize, you should avoid these tasks because it makes it harder to obtain labels for your data, you can’t count on human intuition anymore, and it is hard to know what the optimal error rate is.

Ng also proposes a way to split your dataset. He recommends the following:

**Train Set:** With it, you train your algorithm, and nothing else.

**Dev Set:** This set is there for hyperparameter tuning, for selecting and creating proper features, and for error analysis. It is basically there to make decisions about your algorithm.

**Test Set:** The test set is used to evaluate the performance of your system, but not to make decisions. It is just there for evaluation, and nothing else.

The dev set and test set allow your team to quickly evaluate how well your algorithm is performing. Their purpose is to guide you to the most important changes that you should make to your system.

**He recommends choosing the dev and test sets so that they reflect the data you want to do well on in the future, once your system is deployed.** This is especially true if you expect that data to be different from the data you are training on right now. For example, you are training on normal camera images, but later your system will only receive pictures taken by phones because it is part of a mobile app. This can happen if you don't have access to enough mobile phone photos to train your system. Therefore, you should pick dev and test examples that reflect what you want to perform well on later in reality, rather than the data that you used for training.

**Also, you should choose dev and test sets that come from the same distribution.** Otherwise, there is a chance that your team builds something that does well on the dev set, only to find that it performs extremely poorly on the test data, which you care about the most.
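One common (though by no means the only) way to produce such a split is scikit-learn's `train_test_split`; the split sizes below are illustrative, and randomly splitting a single pool also guarantees that dev and test come from the same distribution:

```python
# Produce a train/dev/test split from one pool of examples.
from sklearn.model_selection import train_test_split

data = list(range(1000))  # stand-in for your labeled examples

# First carve out the test set, then split the rest into train and dev.
rest, test = train_test_split(data, test_size=0.2, random_state=0)
train, dev = train_test_split(rest, test_size=0.25, random_state=0)

print(len(train), len(dev), len(test))  # 600 / 200 / 200 examples
```

Note that when your deployment data differs from your training pool (like the mobile-phone-photo example above), the dev and test sets should instead be drawn from that deployment-like data rather than from the same pool as the training set.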

In this post, you’ve learned about 6 concepts of Machine Learning Yearning. You now know, why it is important to iterate quickly, why you should use a single-number evaluation metrics and what errors analysis is about and why it is crucial. Also, you’ve learned about the optimal error rate, why you should work on problems that humans can do well and how you should split your data. Furthermore, you learned that you should pick the dev and test set data so that they reflect the data which you want to do well on in the future, and that dev and test sets should come from the same distribution. I hope that this post gave you an introduction to some concepts of the book and I can definitely say that it is worth reading.

- Machine Learning Yearning: https://www.mlyearning.org/
- Img "Andrew Ng": taken by the NVIDIA Corporation under the "CC BY-NC-ND 2.0" license. No changes have been made. Link: https://www.flickr.com/photos/nvidia/16841620756
- Img "Metric": https://pixabay.com/de/antrieb-auto-verkehr-stra%C3%9Fe-44276/
- Img "Math Error": https://pixabay.com/de/fehler-mathematik-1966460/

The story of speech recognition is as much about the application of different *approaches* as the development of raw technology, though the two are inextricably linked. Over a period of decades, researchers would conceive of myriad ways to dissect language: by sounds, by structure — and with statistics.

Human interest in recognizing and synthesizing speech dates back hundreds of years (at least!) — but it wasn’t until the mid-20th century that our forebears built something recognizable as ASR.

(Image: IBM Shoebox, 1961)

Among the earliest projects was a "digit recognizer" called Audrey, created by researchers at Bell Laboratories in 1952. Audrey could recognize spoken numerical digits by looking for audio fingerprints called *formants* — the distilled essences of sounds.

In the 1960s, IBM developed Shoebox — a system that could recognize digits *and* arithmetic commands like "plus" and "total". Better yet, Shoebox could pass the math problem to an adding machine, which would calculate and print the answer.

Meanwhile, researchers in Japan built hardware that could recognize the constituent parts of speech like vowels; other systems could evaluate the structure of speech to figure out where a word might end. And a team at University College in England could recognize 4 vowels and 9 consonants by analyzing phonemes, the discrete sounds of a language.

But while the field was taking incremental steps forward, it wasn’t necessarily clear where the path was heading. And then: disaster.

**October 1969** — The Journal of the Acoustical Society of America

The turning point came in the form of a letter written by John R. Pierce in 1969.

Pierce had long since established himself as an engineer of international renown; among other achievements he coined the word *transistor *(now ubiquitous in engineering) and helped launch *Echo I*, the first-ever communications satellite. By 1969 he was an executive at Bell Labs, which had invested extensively in the development of speech recognition.

In an open letter³ published in *The Journal of the Acoustical Society of America, *Pierce laid out his concerns. Citing a “lush” funding environment in the aftermath of World War II and Sputnik, and the lack of accountability thereof, Pierce admonished the field for its lack of scientific rigor, asserting that there was too much wild experimentation going on:

“We all believe that a science of speech is possible, despite the scarcity in the field of people who behave like scientists and of results that look like science.” — J.R. Pierce, 1969

Pierce put his employer’s money where his mouth was: he defunded Bell’s ASR programs, which wouldn’t be reinstated until after he resigned in 1971.

Thankfully there was more optimism elsewhere. In the early 1970s, the U.S. Department of Defense's ARPA (the agency now known as DARPA) funded a five-year program called *Speech Understanding Research*. This led to the creation of several new ASR systems, the most successful of which was Carnegie Mellon University's *Harpy*, which could recognize just over 1000 words by 1976.

Meanwhile, efforts from IBM and AT&T’s Bell Laboratories pushed the technology toward possible commercial applications. IBM prioritized speech transcription in the context of office correspondence, and Bell was concerned with ‘command and control’ scenarios: the precursors to the voice dialing and automated phone trees we know today.

Despite this progress, by the end of the 1970s ASR was still a long ways from being viable for anything but highly-specific use-cases.


A key turning point came with the popularization of *Hidden Markov Models* (HMMs) in the mid-1980s. This approach represented a significant shift “from simple pattern recognition methods, based on templates and a spectral distance measure, to a statistical method for speech processing” — which translated to a leap forward in accuracy.

*A large part of the improvement in speech recognition systems since the late 1960s is due to the power of this statistical approach, coupled with the advances in computer technology necessary to implement HMMs.*

HMMs took the industry by storm — but they were no overnight success. Jim Baker first applied them to speech recognition in the early 1970s at CMU, and the models themselves had been described by Leonard E. Baum in the ‘60s. It wasn’t until 1980, when Jack Ferguson gave a set of illuminating lectures at the Institute for Defense Analyses, that the technique began to disseminate more widely.

The success of HMMs validated the work of Frederick Jelinek at IBM’s Watson Research Center, who since the early 1970s had advocated for the use of statistical models to interpret speech, rather than trying to get computers to mimic the way humans digest language: through meaning, syntax, and grammar (a common approach at the time). As Jelinek later put it: “Airplanes don’t flap their wings.”
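The statistical idea Jelinek championed can be illustrated with a toy forward algorithm for an HMM: the likelihood of an observation sequence is computed by summing over every possible hidden-state path that could have produced it. All states, transition, and emission probabilities below are invented for illustration; real ASR models are far larger and operate on acoustic features, not letters.

```python
# Toy forward algorithm for a Hidden Markov Model: the likelihood of an
# observation sequence is the sum over all hidden-state paths.
# Every number here is made up for illustration.

states = ["vowel", "consonant"]
start_p = {"vowel": 0.5, "consonant": 0.5}
trans_p = {
    "vowel":     {"vowel": 0.3, "consonant": 0.7},
    "consonant": {"vowel": 0.6, "consonant": 0.4},
}
emit_p = {
    "vowel":     {"a": 0.8, "t": 0.2},
    "consonant": {"a": 0.1, "t": 0.9},
}

def forward(observations):
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[prev] * trans_p[prev][s] for prev in states)
               * emit_p[s][obs]
            for s in states
        }
    return sum(alpha.values())

print(forward(["a", "t", "a"]))  # likelihood of the sound sequence a-t-a
```

No template matching, no grammar rules: just probabilities estimated from data, which is exactly the shift the field made in the 1980s.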

These data-driven approaches also facilitated progress that had as much to do with industry collaboration and accountability as individual eureka moments. With the increasing popularity of statistical models, the ASR field began coalescing around a suite of tests that would provide a standardized benchmark to compare to. This was further encouraged by the release of shared data sets: large corpuses of data that researchers could use to train and test their models on.

In other words: finally, there was an (imperfect) way to measure and compare success.

**November 1990**, Infoworld

For better and worse, the 90s introduced consumers to automatic speech recognition in a form we’d recognize today. Dragon Dictate launched in 1990 for a staggering $9,000, touting a dictionary of 80,000 words and features like natural language processing (see the *Infoworld* article above).

These tools were time-consuming (the article claims otherwise, but Dragon became known for prompting users to ‘train’ the dictation software to their voice), and they required users to speak in a stilted manner: Dragon could initially recognize only 30–40 words a minute; people typically talk around four times faster than that.

But it worked well enough for Dragon to grow into a business with hundreds of employees, and customers spanning healthcare, law, and more. By 1997 the company introduced Dragon NaturallySpeaking, which could capture words at a more fluid pace — and, at $150, a much lower price-tag.

Even so, there may have been as many grumbles as squeals of delight: to the degree that there is consumer skepticism around ASR today, some of the credit should go to the over-enthusiastic marketing of these early products. But without the efforts of industry pioneers James and Janet Baker (who founded Dragon Systems in 1982), the productization of ASR may have taken much longer.

**November 1993**, IEEE Communications Magazine

Nearly 25 years after J.R. Pierce’s letter was published, the IEEE published a follow-up titled *Whither Speech Recognition: the Next 25 Years*⁵, authored by two senior employees of Bell Laboratories (the same institution where Pierce worked).

The latter article surveys the state of the industry circa 1993, when the paper was published — and serves as a sort of rebuttal to the pessimism of the original. Among its takeaways:

- The key issue with Pierce’s letter was his assumption that in order for speech recognition to become useful, computers would need to comprehend what words *mean*. Given the technology of the time, this was completely infeasible.
- In a sense, Pierce was right: by 1993 computers had a meager understanding of language — and in 2018, they’re still notoriously bad at discerning meaning.
- Pierce’s mistake lay in his failure to anticipate the myriad ways speech recognition can be useful, even when the computer doesn’t know what the words actually mean.

The *Whither* sequel ends with a prognosis, forecasting where ASR would head in the years after 1993. The section is couched in cheeky hedges (“We confidently predict that at least one of these eight predictions will turn out to have been incorrect”) — but it’s intriguing all the same. Among their eight predictions:

- “By the year 2000, more people will get remote information via voice dialogues than by typing commands on computer keyboards to access remote databases.”
- “People will learn to modify their speech habits to use speech recognition devices, just as they have changed their speaking behavior to leave messages on answering machines. **Even though they will learn how to use this technology, people will always complain about speech recognizers.**”

In a forthcoming installment in this series, we’ll be exploring more recent developments and the current state of automatic speech recognition. Spoiler alert: neural networks have played a starring role.

But neural networks are actually as old as most of the approaches described here — they were introduced in the 1950s! It wasn’t until the computational power of the modern era (along with much larger data sets) that they changed the landscape.

But we’re getting ahead of ourselves. Stay tuned for our next post on Automatic Speech Recognition by following Descript on Medium, Twitter, or Facebook.

Timeline via Juang & Rabiner

- Recurrent Neural Networks
- Basics RNNs and Speech Recognition
- Connectionist Temporal Classification
- Summary

This post requires knowledge about Recurrent Neural Networks. If you aren’t familiar with this kind of Neural Network, I encourage you to check out my article about them (click here).

Nevertheless, I will give you a little recap:

RNNs are the state-of-the-art algorithm for sequential data. They have an internal, time-dependent state (memory) due to the so-called “feedback loop”. Because of that, they are good at predicting what’s coming next based on what happened before. This makes them well suited for problems that involve sequential data, like speech recognition, handwriting recognition and so on.

To better understand RNNs, we have to look at what makes them different from a usual (feedforward) Neural Network. Take a look at the following example.

Imagine that we put the word "SAP" into a feedforward Neural Network (FFNN) and into a Recurrent Neural Network (RNN), and that they process it one character after the other. By the time the FFNN reaches the letter "P", it has already forgotten about "S" and "A", but an RNN hasn’t. This is due to the different ways they process information, which the two images below illustrate.

In a FFNN, information only moves in one direction (from the input, through the hidden layers, to the output layer), so it never touches a node twice. This is the reason why feedforward Neural Networks can’t remember what they previously received as input – they only retain what they learned during training.

In contrast, a RNN cycles information through a loop. When making a decision, a RNN takes into consideration the current input and also what it has learned from the previous inputs.
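This feedback loop can be sketched in a few lines of Python: a hypothetical scalar "network" whose hidden state mixes the current input with the previous state. The weights are invented constants, and the usual nonlinearity (e.g. tanh) is omitted for brevity, so this is a sketch of the idea, not a trained model.

```python
# Toy sketch of the RNN feedback loop: the hidden state h carries
# information from earlier characters forward, so by the time the network
# sees "P" it still "remembers" "S" and "A".

def rnn_step(x, h, w_x=0.5, w_h=0.8):
    # the new state depends on the current input AND the previous state
    return w_x * x + w_h * h

h = 0.0                                  # empty memory at the start
for x in [ord(c) / 100 for c in "SAP"]:  # crude numeric encoding
    h = rnn_step(x, h)

print(h)  # the final state reflects the whole sequence, not just "P"
```

A feedforward network has no `h` to pass along: each input would be scored in isolation, which is exactly the limitation described above.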

Speech Recognition is the task of translating spoken language into text by a computer. The problem is that you have an acoustic observation (some recording as an audio file) and you want to have a transcription of it, without manually creating it.

You know that Recurrent Neural Networks are well suited for tasks that involve sequential data. And because speech is sequential data, in theory you could train an RNN with a dataset of acoustic observations and their corresponding transcriptions. But as you probably already guessed, it isn’t that easy. That is because basic, also called canonical, Recurrent Neural Networks require aligned data. This means that each character (not each word!) needs to be aligned to its location in the audio file. Just take a look at the image below and imagine that every character would need to be aligned to its exact location.

Of course, this is tedious work that no one wants to do, and only a few organizations could afford to hire enough people to do it, which is why very few datasets like this exist.

Fortunately, we have Connectionist Temporal Classification (CTC), which is a way around not knowing the alignment between the input and the output. CTC is simply a loss function that is used to train Neural Networks, like Cross-Entropy and so on. It is used for problems where obtaining aligned data is an issue, like Speech Recognition.

Like I said, with CTC there is no need for aligned data. That is because it can assign a probability to any label, given an input. This means it only requires an audio file as input and a corresponding transcription. But how can CTC assign a probability to any label, given just an input? CTC is "alignment-free": it works by **summing over the probability of all possible alignments between the input and the label**. To understand that, take a look at this naive approach:

Here we have an input of size 9, and the correct transcription of it is "iPhone". We force our system to assign an output character to each input step and then we collapse the repeats, which results in the output. But this approach has two problems:

1. It does not make sense to force every input step to be aligned to some output, because we also need to account for silence within the input.
2. There is no way to produce words that have two identical characters in a row, like the word "Kaggle". With this approach, we could only produce "Kagle" as output.

There is a way around that, called the "blank token". It does not mean anything and simply gets removed before the final output is produced. Now the algorithm can assign a character or the blank token to each input step. There is one rule: to allow double characters in the output, a blank token must be between them. With this simple rule, we can also produce outputs with two identical characters in a row.

Here is an illustration of how this works:

1.) The CTC network assigns the character with the highest probability (or no character) to every input step.

2.) Repeats without a blank token in between get merged.

3.) Lastly, the blank token gets removed.

The CTC network can then give you the probability of a label for a given input, by summing the probabilities of all alignments that collapse to that label across the time-steps.
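This summing can be made concrete with a brute-force sketch: enumerate every possible per-step output, collapse repeats and remove blanks as in the steps above, and add up the probabilities of the paths that collapse to the target label. The alphabet, sequence length, and per-step probabilities below are invented; real CTC implementations use dynamic programming instead of enumeration.

```python
from itertools import product

# Brute-force illustration of CTC's "sum over all alignments".
BLANK = "-"
alphabet = ["a", "b", BLANK]

# probability the network assigns to each symbol at each of 3 time-steps
step_probs = [
    {"a": 0.6, "b": 0.1, BLANK: 0.3},
    {"a": 0.3, "b": 0.2, BLANK: 0.5},
    {"a": 0.1, "b": 0.7, BLANK: 0.2},
]

def collapse(path):
    out, prev = [], None
    for ch in path:
        if ch != prev:        # merge repeats without a blank between them
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != BLANK)  # then drop the blanks

def label_probability(label):
    total = 0.0
    for path in product(alphabet, repeat=len(step_probs)):
        if collapse(path) == label:
            p = 1.0
            for t, ch in enumerate(path):
                p *= step_probs[t][ch]
            total += p
    return total

print(round(label_probability("ab"), 3))  # sums paths like aab, a-b, ab-
```

For the label "ab", five paths survive the collapse (aab, abb, ab-, a-b, -ab), and their probabilities are added together; that sum is what the CTC loss maximizes during training.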

In this post, we’ve reviewed the main facts about Recurrent Neural Networks and learned why canonical RNNs aren’t well suited for unaligned sequence problems. Then we discussed what Connectionist Temporal Classification is, how it works, and why it enables RNNs to tackle tasks like Speech Recognition.


- **The Waterfall Methodology**
- **Agile & the Agile Manifesto**
- **The Scrum Methodology**
- **Team Roles**
- **Product Owner**
- **Scrum Master**
- **Development Team**
- **Scrum Artifacts**
- **Product Backlog**
- **Sprint Backlog**
- **Product Increment**
- **Summary**

The Waterfall Methodology is one of the oldest and most traditional methods of managing the development of software applications. It splits the Software Development Lifecycle (SDLC) into 6 different stages. Waterfall is a linear approach, where you can only proceed to the next stage once the current stage is completely finished, which is why it is called the Waterfall Methodology. If you wanted to go back to a previous stage, for example to change the design during the deployment phase, you would need to go through every stage that comes after the design stage again.

Although the Agile approach is more commonly applied in the industry, the Waterfall Methodology is still used, because implementing a Waterfall model is a straightforward process due to its step-by-step nature.

You can see the six stages below:

**I. Requirement Analysis**

Here, the requirements for the application are analyzed and documented. These documents are the baseline for creating the software, and the requirements are split into functional and non-functional requirements. As with each of the 6 stages, this stage gets reviewed & signed off before proceeding to the next phase.

**II. System Design **

At this stage, the design team creates a blueprint for the software, using the requirement documents of stage one. They create high- and low-level design documents, which also get reviewed and signed off before proceeding to the next phase.

**III. Implementation**

In the implementation stage, developers convert the designs into actual software. This stage will result in a working software program.

**IV. Testing**

Here, the software gets tested to ensure that all requirements are fulfilled and that it works flawlessly. This also helps to identify errors and deviations from the actual requirements of stage 1. Testing also provides stakeholders with information about the quality of the software.

**V. Deployment**

In this stage, the software becomes a real-world application. Software deployment comprises all of the activities that make a software system available for use. This involves preparing the application to run and operate in a specific environment, making it scalable, and optimizing its performance. This can be done either manually or through automated systems. Because every piece of software is unique, this is more of a *general process* that has to be customized according to the software’s specific requirements and characteristics.

**VI. Maintenance**

In the Maintenance stage, the software gets passed to the maintenance team, which regularly updates it and fixes issues. They also develop new functionality enhancements to the system, to further improve performance or other attributes.

This is how the Waterfall Model works. Like I said, the problem with the Waterfall Methodology is that if any requirements change, the team has to move back to the requirement analysis stage and go through all subsequent stages again. The same would happen if the design or something else changed. This makes it difficult and inconvenient to make changes afterwards. In practice, the design and development processes are rarely sequential, because new requirements can surface at any point, necessitating changes in design, which in turn results in new development tasks. Another disadvantage is that no working software is produced until the mid/late stages. This is a problem because if the investors decide to shut down the project, you could be 60% through the whole project and still have no working software to show. This would be different with the Agile approach, which will be discussed in the next section. Because of these issues, Waterfall can be a risky and uncertain approach, depending on the project, and it is therefore not well suited for large and complex projects.

Its advantages are that it is easy to understand and implement, that its phases don’t overlap, that it works well for relatively small projects, and that it is easy to manage.

The term "Agile" comprises different software development methods that all work very similarly and in accordance with the Agile Manifesto. Note that Agile and its corresponding methods are not as prescriptive as it may appear. Most of the time, the individual teams figure out what works for them and adjust the Agile system accordingly. Teams that work Agile are cross-functional and self-organized, and they take working software as their measure of progress.

Agile methods are all iterative approaches that build software incrementally, instead of trying to deliver it all at once near the end. Agile works by breaking down project requirements into little bits of user functionality, which are called user stories. These are then prioritized and delivered continuously in short cycles called iterations. To achieve the highest customer satisfaction, working Agile means setting a focus on quality. In Agile teams, development team members have the responsibility to solve problems, organize and assign tasks, and create a code architecture that is modular, flexible, extensible, and suits the nature of the team.

In 2001, software and development experts created a statement of values for successful software projects, which is known as the agile manifesto. Agile methods are all based on the four main values of this Manifesto, which you can see below.

We are uncovering better ways of developing software by doing it and helping others do it. Through this work, we have come to value:

Individuals and interactions over processes and tools

Working software over comprehensive documentation

Customer collaboration over contract negotiation

Responding to change over following a plan

That is, while there is value in the items on the right, we value the items on the left more.

The people who wrote the Agile Manifesto also created twelve principles of Agile software development, which support the manifesto’s values. These make the practical implementation of Scrum within a company easier to manage.

Scrum is one of the most used and well known agile frameworks out there. At its foundation, Scrum can be applied to nearly any project.

First of all, in Scrum the project team is separated into three different roles: the Product Owner, who is responsible for the product that will be created; the Scrum Master, who is responsible for the implementation of and compliance with the Scrum rules and processes within the company; and lastly the Development Team, the actual programmers and experts who create the product.

Secondly, Scrum has 5 events that ensure transparency, inspection and adaptation. The first of these events is the Sprint, which is at the heart of Scrum. A sprint is a cycle that should always be the same length and should not take longer than four weeks. Every sprint has clear goals about what should be accomplished during its timeframe, and the sprint’s progress is tracked on the Scrum board. This board is split into "To Do", "Build", "Test" and "Done", where each task is placed as a sticky note and moved from "To Do" onward until it reaches "Done". You can see such a board below.

At the end of every sprint, a new functionality for the final product should be finished and delivered to the customer. Then there is the Sprint Planning, where the goals for every sprint are set; this is of course done before a sprint starts. There is also the Daily Scrum, where the last 24 hours are discussed and the next 24 hours are planned; this takes about 15 minutes. There is also the Sprint Review, in which the last sprint gets evaluated and the developed functionality (product increment) gets tested; this happens at the end of every sprint. Lastly, there is the Sprint Retrospective, also held after every sprint, which has the goal of improving the sprints in general.

To make this clearer, here is the general procedure of a sprint:

Each sprint starts with a Sprint Planning, which is facilitated by the Scrum Master and attended by the Product Owner, the Development Team and (optionally) other stakeholders. They sit down together and select high-priority items from the Product Backlog that the Development Team can commit to delivering in a single sprint. The selected items are known as the Sprint Backlog. The Development Team works only on the items in the Sprint Backlog for the duration of the sprint; new issues usually must wait for the next sprint. The Daily Scrum is a short standup meeting attended by the Scrum Master, the Product Owner and the Development Team. The Sprint Review often includes a demo to stakeholders, and the Sprint Retrospective is an examination of what went well, what could be improved, and so on; its goal is to make each sprint more efficient and effective than the last. At the end of the sprint, completed items are packaged for release (note that some teams release more often than this), and any incomplete items are returned to the Product Backlog.

**1.) Product Owner **

The Product Owner is always a single person, who is fully responsible for the product. He writes the requirements that the final product has to fulfill into the so-called "Product Backlog", which we will discuss later on.

The Product Owner represents, to some degree, the customers of the product and therefore their needs. But this is only one part of his role: he is also the bridge between the actual customer and the development team, and is therefore responsible for quality assurance of the final product. He is also responsible for considering the needs and ideas of the development team, so that everyone feels comfortable and knows that their opinion is taken into account with regard to the Product Backlog.

**2.) Scrum Master **

It is the job of the Scrum Master to help the Product Owner and the Development Team develop and maintain good habits that are in line with the Scrum Methodology.

The Scrum Master is responsible for the implementation of and compliance with the Scrum rules and processes within the company. This involves making sure that everyone understands the Scrum framework and works in line with it; basically, he improves the overall understanding of Scrum within the company. If the company’s people understood Scrum perfectly, he would be nearly unnecessary. The Scrum Master also acts as a bridge between the Scrum team and the stakeholders. He isn’t part of any hierarchy within the company and therefore has a good general overview. He helps the Product Owner with the planning and execution of Scrum events, and can support the Development Team with the so-called "Sprint Backlog", which we will discuss later on. He also moderates the Sprint Planning, which is attended by the Product Owner and the Development Team, as well as the Daily Scrum meeting.

**3.) The Development Team **

The Development Team consists of the developers and technical experts who are responsible for creating the product increment of a sprint (we will also discuss this later on). Like I said, in Agile, and therefore also in Scrum, teams are self-organized. The Development Team is also responsible for creating the "Sprint Backlog" and updating it daily. During a sprint, the Development Team constantly works toward achieving the goals of the current sprint. The team meets daily for the so-called "Daily Scrum", where the last 24 hours are discussed and the next 24 hours are planned; this takes about 15 minutes. It is also the responsibility of the Development Team to present the new product increment during the Sprint Review. If an expectation is not fulfilled, which only the Product Owner can decide, the whole team is responsible for it. The Development Team also takes part in the Sprint Retrospective.

Scrum is based on empirical process management, which has three main parts. These are the following:

**Transparency: **Everyone involved in an agile project knows what is going on at the moment and how the overall progress is going. In other words, everyone is completely transparent about what they are doing.

**Inspection: **Everyone involved in the project should frequently evaluate the product. For example, the team openly and transparently shows the product at the end of each sprint.

**Adaptation: **This means having the ability to adapt based on the results of the inspection.

Now we will discuss the three different Artifacts of Scrum. An Artifact is a tangible by-product produced during the product development.

**1.) Product Backlog **

The Product Backlog is an ordered list of all the possible requirements for the product. The Product Owner is responsible for the Product Backlog: for its access, its content, the order of its contents, and their prioritization. Note that every product has only one Product Backlog and only one Product Owner, even if there are many development teams. It is important to understand that a Product Backlog can never be "complete", since it is constantly evolving along with the product. Typical entries are the description, order and priority of a functionality, and the value that the functionality provides for the product. Since markets and technology in general are constantly evolving, the requirements for the product, and therefore the Product Backlog, are constantly changing. The Scrum Team (Product Owner & Development Team) improves the Backlog together, which is of course a continuous process.

The priority of a requirement within the Product Backlog plays an important role: the higher it is, the more clearly and precisely the requirement is formulated. This is because high-priority requirements are the ones that will soon move to the Sprint Backlog and therefore need to be clear enough to work on.

**2.) Sprint Backlog **

The Sprint Backlog consists of the Product Backlog entries with the highest priority, given the currently available capacity within the company.

It is the plan of which functionalities are included in the next product increment: basically, the list of goals that the Development Team needs to achieve within the next sprint. Therefore it includes the required tasks that need to be done and the criteria the functionalities need to fulfill. It should also be clear enough to make the team’s progress visible during the Daily Scrum meeting, which is why it is constantly evolving and gets updated regularly. The members of the Development Team are the only ones allowed to work on the Sprint Backlog, which they do during their normal work. During the Sprint Review, the Sprint Backlog is used to verify the goals.

**3.) Product Increment**

At the end of every sprint review, the product increment gets evaluated. It has to fulfill all the requirements from all previous sprints, which means it has to be compatible with them. This is because every product increment is just one part of the whole final product and therefore has to match to the other parts. It is like a puzzle, where a product increment represents one of the different pieces.

If this were not the case, the whole product might no longer work. Also, every product increment, which is a new, updated version of the product, must be fully functional at the end of the sprint so that the Product Owner can deliver it to the customers at any time. Note that a product increment doesn’t have to be delivered, but it has to be ready to be delivered.

In this post you’ve learned about the traditional Waterfall Methodology, its advantages & disadvantages and why it sometimes can be a risky and uncertain approach, depending on the project. Furthermore, we’ve discussed what the term Agile means and looked at the Agile Manifesto, which is at the core of every Agile framework. You now know that Agile frameworks take working software as their measure of progress and you are familiar with the general procedure of a sprint. Lastly, we took a detailed look at the Scrum Methodology, including the different Roles of the project team, the three main parts of empirical process management and the three artifacts of Scrum.

- What is Logistic Regression?
- How it works
- Logistic VS. Linear Regression
- Advantages / Disadvantages
- When to use it
- Multiclass Classification
- one-versus-all (OvA)
- one-versus-one (OvO)

- Other Classification Algorithms
- Summary

Like many other machine learning techniques, it is borrowed from the field of statistics, and despite its name, it is not an algorithm for regression problems, where you want to predict a continuous outcome. Instead, Logistic Regression is the go-to method for binary classification. It gives you a discrete binary outcome: either 0 or 1. To put it in simpler words, its outcome is either one thing or another.

A simple example of a Logistic Regression problem would be an algorithm used for cancer detection that takes a screening picture as input and should tell whether a patient has cancer (1) or not (0).

Logistic Regression measures the relationship between the dependent variable (our label, what we want to predict) and one or more independent variables (our features), by estimating probabilities using its underlying logistic function.

These probabilities must then be transformed into binary values in order to actually make a prediction. This is where the logistic function, also called the sigmoid function, comes in. The sigmoid function is an S-shaped curve that can take any real-valued number and map it to a value between 0 and 1, but never exactly at those limits. These values between 0 and 1 are then transformed into either 0 or 1 using a threshold classifier.
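These two steps (sigmoid, then threshold) can be sketched as follows; the weights, bias, and feature values below are invented for illustration, not a trained model:

```python
import math

def sigmoid(z):
    # maps any real number into (0, 1), never exactly reaching the limits
    return 1 / (1 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    # weighted sum of the features, squashed into a probability,
    # then turned into a 0/1 label by the threshold classifier
    z = sum(w * x for w, x in zip(weights, features)) + bias
    prob = sigmoid(z)
    return int(prob >= threshold), prob

label, prob = predict([2.0, -1.0], weights=[0.8, 0.4], bias=-0.5)
print(label, round(prob, 3))  # 1 0.668
```

Raising or lowering `threshold` trades off false positives against false negatives, which matters in applications like the cancer-detection example above.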

The picture below illustrates the steps that logistic regression goes through to give you your desired output.

Below you can see how the logistic function (sigmoid function) looks like:

We want to maximize the likelihood that a random data point gets classified correctly, which is called Maximum Likelihood Estimation. Maximum Likelihood Estimation is a general approach to estimating parameters in statistical models. You can maximize the likelihood using an optimization algorithm: Newton’s Method is one such algorithm, and can be used to find the maximum (or minimum) of many different functions, including the likelihood function. Instead of Newton’s Method, you could also use Gradient Descent.
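As a sketch of the Gradient Descent option, here is a toy fit of a one-feature logistic regression on invented data, descending the negative log-likelihood (the data, learning rate, and iteration count are all arbitrary choices for illustration):

```python
import math

# four made-up training points: inputs xs with binary labels ys
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    # gradients of the average negative log-likelihood w.r.t. w and b
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

# the fitted model assigns low probability to x-values labeled 0
# and high probability to x-values labeled 1
print(sigmoid(w * 0.0 + b) < 0.5, sigmoid(w * 3.0 + b) > 0.5)
```

Newton’s Method would use second-derivative information to converge in far fewer iterations, but the update loop has the same shape.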

You may be asking yourself what the difference between logistic and linear regression is. Logistic regression gives you a discrete outcome, but linear regression gives a continuous outcome. A good example of a continuous outcome would be a model that predicts the value of a house. That value will always differ based on parameters like the house’s size or location. A discrete outcome will always be one thing (you have cancer) or another (you have no cancer).

It is a widely used technique because it is very efficient, does not require too many computational resources, it’s highly interpretable, it doesn’t require input features to be scaled, it doesn’t require any tuning, it’s easy to regularize, and it outputs well-calibrated predicted probabilities.

Like linear regression, logistic regression does work better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other. Therefore Feature Engineering plays an important role in regards to the performance of Logistic and also Linear Regression. Another advantage of Logistic Regression is that it is incredibly easy to implement and very efficient to train. I typically start with a Logistic Regression model as a benchmark and try using more complex algorithms from there on.

Because of its simplicity and the fact that it can be implemented relatively easy and quick, Logistic Regression is also a good baseline that you can use to measure the performance of other more complex Algorithms.

A disadvantage is that we can’t solve non-linear problems with logistic regression, since its decision surface is linear. Just take a look at the example below, which has 2 binary features.

It is clearly visible that we can’t draw a line that separates these 2 classes without a huge error. Using a simple decision tree would be a much better choice here.

Logistic Regression is also not one of the most powerful algorithms out there and can easily be outperformed by more complex ones. Another disadvantage is its high reliance on a proper presentation of your data: logistic regression is not a useful tool unless you have already identified all the important independent variables. Since its outcome is discrete, Logistic Regression can only predict a categorical outcome. It is also an algorithm known for its vulnerability to overfitting.

As I already mentioned, Logistic Regression separates your input into two „regions“ by a linear boundary, one for each class. Therefore your data needs to be linearly separable, like the data points in the image below:

In other words: you should think about using logistic regression when your Y variable takes on only two values (i.e. when you are facing a binary classification problem). Note that you could also use Logistic Regression for multiclass classification, which will be discussed in the next section.

Some algorithms can deal with predicting multiple classes by themselves, like Random Forest classifiers or the Naive Bayes classifier. Others can’t, like Logistic Regression, but with some tricks you can predict multiple classes with it too.

Let’s discuss the most common of these “tricks” using the example of the MNIST dataset, which contains images of handwritten digits, ranging from 0 to 9. This is a classification task where our algorithm should tell us which digit is in an image.

With the One-versus-All (OvA) strategy, you train 10 binary classifiers, one for each digit. This simply means training one classifier to detect 0s, one to detect 1s, one to detect 2s and so on. When you then want to classify an image, you just look at which classifier has the highest decision score.

With the One-versus-One (OvO) strategy, you train a binary classifier for every pair of digits. This means training a classifier that can distinguish between 0s and 1s, one that can distinguish between 0s and 2s, one that can distinguish between 1s and 2s etc. If there are N classes, you need to train N × (N − 1) / 2 classifiers, which is 45 in the case of the MNIST dataset (N = 10).

When you then want to classify an image, you need to run each of these 45 classifiers and see which class wins the most duels. One big advantage of this strategy is that each classifier only needs to be trained on the part of the training set containing the two classes it distinguishes between. Algorithms like Support Vector Machine classifiers don’t scale well to large datasets, which is why in this case using a binary classification algorithm like Logistic Regression with the OvO strategy would do better, because it is faster to train many classifiers on small datasets than one classifier on a large dataset.

For most algorithms, sklearn recognizes when you use a binary classifier for a multiclass classification task and automatically runs the OvA strategy. There is one exception: when you use a Support Vector Machine classifier, it automatically runs the OvO strategy.
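As a hedged sketch of both strategies, sklearn also lets you request them explicitly via its `OneVsRestClassifier` and `OneVsOneClassifier` wrappers. The bundled 8×8 digits dataset stands in for MNIST here, and the solver settings are just illustrative choices:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_digits(return_X_y=True)  # 8x8 digit images, classes 0-9
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One-versus-All: 10 binary classifiers, one per digit
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
# One-versus-One: 10 * 9 / 2 = 45 binary classifiers, one per pair of digits
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

print(len(ova.estimators_), len(ovo.estimators_))  # 10 45
print(round(ova.score(X_test, y_test), 2))
```

Passing a bare `LogisticRegression` a multiclass `y` would trigger sklearn's automatic strategy selection; the wrappers simply make the choice explicit.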

Other common classification algorithms are Naive Bayes, Decision Trees, Random Forests, Support Vector Machines, k-nearest neighbors and many others. We will discuss them in future blog posts, but don’t feel overwhelmed by the number of Machine Learning algorithms that are out there. It is better to know 4 or 5 algorithms really well and to concentrate your energy on feature engineering, but this is also a topic for future posts.

In this post, you have learned what Logistic Regression is and how it works. You now have a solid understanding of its advantages and disadvantages and know when you can use it. Also, you have discovered ways to use Logistic Regression to do multiclass classification with sklearn and why it is a good baseline to compare other Machine Learning algorithms with.


**Table of Contents:**

- Why is it important?
- Confusion Matrix
- Precision and Recall
- F-Score
- Precision/Recall Tradeoff
- Precision/Recall Curve
- ROC AUC Curve and ROC AUC Score
- Summary

Evaluating a classifier is often much more difficult than evaluating a regression algorithm. A good example is the famous MNIST dataset that contains images of handwritten digits from 0 to 9. If we wanted to build a classifier that detects 6s, the algorithm could classify every input as a non-6 and still get about 90% accuracy, because only about 10% of the images within the dataset are 6s. This is a major issue in machine learning and the reason why you need to look at several evaluation metrics for your classification system.

First, you can take a look at the confusion matrix, which is also known as the error matrix. It is a table that describes the performance of a supervised machine learning model on test data for which the true values are known. Each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class (some sources use the reversed convention). It is called a „confusion matrix“ because it makes it easy to spot where your system is confusing two classes.

Below you can see the output of the „confusion_matrix()“ function of sklearn, used on the MNIST dataset.

Each row represents an actual class and each column represents a predicted class.

The first row is about the non-6 images (the negative class): 53459 images were correctly classified as non-6s (called true negatives), while the remaining 623 images were wrongly classified as 6s (false positives).

The second row represents the actual 6 images: 473 were wrongly classified as non-6s (false negatives) and 5445 were correctly classified as 6s (true positives). Note that a perfect classifier would be right 100% of the time, which means it would have only true positives and true negatives.

A confusion matrix gives you a lot of information about how well your model does, but there is a way to get even more, like computing the classifier’s precision. This is basically the accuracy of the positive predictions, and it is typically viewed together with the recall, which is the ratio of positive instances that are correctly detected.

Fortunately, sklearn provides built-in functions to compute both of them:
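Since the original code snippet isn’t shown here, below is a sketch that rebuilds hypothetical label arrays matching the counts quoted above and computes both metrics with sklearn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels reproducing the counts from the post:
# 53459 true negatives, 623 false positives, 473 false negatives, 5445 true positives
y_true = np.concatenate([np.zeros(54082), np.ones(5918)])   # actual: non-6s, then 6s
y_pred = np.concatenate([np.zeros(53459), np.ones(623),     # predictions for the non-6s
                         np.zeros(473), np.ones(5445)])     # predictions for the 6s

print(confusion_matrix(y_true, y_pred))
# row 0: true negatives, false positives
# row 1: false negatives, true positives

print(round(precision_score(y_true, y_pred), 2))  # 5445 / (5445 + 623) ≈ 0.9
print(round(recall_score(y_true, y_pred), 2))     # 5445 / (5445 + 473) ≈ 0.92
```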

Now we have a much better evaluation of our classifier. The precision tells us that when the model predicts a 6, it is right about 90% of the time, and the recall tells us that it detects about 92% of the actual 6s. But there is still a better way!

You can combine precision and recall into one single metric, called the F-score (also called the F1-score). The F-score is really useful if you want to compare two classifiers. It is computed using the harmonic mean of precision and recall, which gives much more weight to low values. As a result, a classifier only gets a high F-score if both recall and precision are high. You can easily compute the F-score with sklearn.

Below you can see that our model gets an F-score of roughly 90%:
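A sketch of the F-score computation, once by hand and once via sklearn, again using hypothetical labels that reproduce the confusion-matrix counts from the post:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels matching the counts quoted in the confusion matrix above
y_true = np.concatenate([np.zeros(54082), np.ones(5918)])
y_pred = np.concatenate([np.zeros(53459), np.ones(623),
                         np.zeros(473), np.ones(5445)])

p = 5445 / (5445 + 623)   # precision
r = 5445 / (5445 + 473)   # recall
print(round(2 * p * r / (p + r), 2))       # harmonic mean computed by hand
print(round(f1_score(y_true, y_pred), 2))  # same value via sklearn
```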

But unfortunately, the F-score isn’t the holy grail and has its tradeoffs. It favors classifiers that have similar precision and recall. This is a problem because you sometimes want high precision and sometimes high recall. The thing is that increasing precision results in decreasing recall and vice versa. This is called the precision/recall tradeoff, and we will cover it in the next section.

To illustrate this tradeoff a little bit better, I will give you examples of when you want a high precision and when you want a high recall.

**high precision:**

You probably want a high precision if you trained a classifier to detect videos that are suited for children. This means you want a classifier that may reject a lot of videos that would actually be suited for kids, but never shows you a video that contains adult content, i.e. it only shows safe videos (a high precision).

**high recall:**

An example where you would need a high recall is a classifier that detects people who are trying to break into a building. It would be fine if the classifier has only 25% precision (which would result in some false alarms), as long as it has 99% recall and alarms you nearly every time someone tries to break in.

To understand this tradeoff even better, we will look at how the SGDClassifier makes its classification decisions on the MNIST dataset. For each image it has to classify, it computes a score based on a decision function, and it classifies the image as one class (when the score is bigger than the threshold) or another (when the score is smaller than the threshold).

The picture below shows digits ordered from the lowest (left) to the highest score (right). Let’s suppose you have a classifier that should detect 5s and the threshold is positioned in the middle of the picture (at the central arrow). Then you would spot 4 true positives (actual 5s) and one false positive (actually a 6) to the right of it. That positioning of the threshold would result in 80% precision (4 out of 5), but out of the six actual 5s in the picture it would only identify 4, so the recall would be 67% (4 out of 6). If you moved the threshold to the right arrow, you would get a higher precision but a lower recall, and vice versa if you moved it to the left arrow.

The trade-off between precision and recall can be observed using the precision-recall curve, and it lets you spot which threshold is the best.

Another way is to plot the precision and recall against each other:

In the image above you can clearly see that the recall falls off sharply at a precision of around 95%. Therefore you probably want to select a precision/recall tradeoff before that point, maybe at around 85%. The two plots above enable you to choose a threshold that gives you the best precision/recall tradeoff for your current machine learning problem. If you want, for example, a precision of 85%, you can look at the first plot and see that you would need a threshold of around -50,000.
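Such a threshold can also be found programmatically with sklearn’s `precision_recall_curve`. In this sketch the bundled 8×8 digits dataset stands in for MNIST, and the classifier and cross-validation settings are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)
y_6 = (y == 6)  # binary task: 6 vs. not-6

# Decision scores for every image, obtained via cross-validation
scores = cross_val_predict(SGDClassifier(random_state=42), X, y_6,
                           cv=3, method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_6, scores)

# First index where the precision reaches at least 85%
idx = int(np.argmax(precisions >= 0.85))
print(round(float(precisions[idx]), 2), round(float(recalls[idx]), 2))
```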

The ROC curve is another tool to evaluate and compare binary classifiers. It has a lot of similarities with the precision/recall curve, although it is quite different. It plots the true positive rate (also called recall) against the false positive rate (ratio of incorrectly classified negative instances), instead of plotting the precision versus the recall.

Of course, we also have a tradeoff here: the higher the true positive rate, the more false positives the classifier produces. The red line in the middle represents a purely random classifier, so your classifier’s curve should stay as far away from it as possible.

The ROC curve also provides a way to compare two classifiers with each other, by measuring the area under the curve (called the AUC). This is the ROC AUC score. Note that a classifier that is 100% correct would have a ROC AUC of 1, while a completely random classifier would have a score of 0.5. Below you can see the output for the MNIST model:
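A sketch of computing the ROC AUC score with sklearn, again on the bundled 8×8 digits dataset as a stand-in for MNIST (classifier and cv settings are assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)
y_6 = (y == 6)  # binary task: 6 vs. not-6

scores = cross_val_predict(SGDClassifier(random_state=42), X, y_6,
                           cv=3, method="decision_function")
print(round(roc_auc_score(y_6, scores), 3))  # 1.0 = perfect, 0.5 = random
```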

We definitely learned a lot of stuff. We learned how to evaluate classifiers and which tools to use. We also learned how to select the right precision/recall tradeoff and how to compare different classifiers with the ROC AUC curve or score. Additionally, we now know how to create a classifier with virtually any precision we want, and that this is not as desirable as it sounds, because high precision is not very useful in combination with a low recall. So the next time you hear someone talk about a precision or accuracy of 99%, you know that you should ask them about the other metrics we discussed in this post.

https://en.wikipedia.org/wiki/Confusion_matrix

https://github.com/Donges-Niklas/Classification-Basics/blob/master/Classification_Basics.ipynb

http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/


**Table of Contents:**

- Introduction
- Mathematical Objects
- Scalar
- Vector
- Matrix
- Tensor

- Computational Rules
- Matrix-Scalar Operations
- Matrix-Vector Multiplication
- Matrix-Matrix Addition and Subtraction
- Matrix-Matrix Multiplication

- Matrix Multiplication Properties
- Commutative
- Associative
- Distributive
- Identity Matrix

- Inverse and Transpose
- Summary
- Resources

Linear Algebra is a continuous form of mathematics and is applied throughout science and engineering because it allows you to model natural phenomena and to compute them efficiently. Because it is a form of continuous and not discrete mathematics, a lot of computer scientists don’t have much experience with it. Linear Algebra is also central to almost all areas of mathematics, like geometry and functional analysis.

Its concepts are a crucial prerequisite for understanding the theory behind Machine Learning, especially if you are working with Deep Learning algorithms. You don’t need to understand Linear Algebra before you get started with Machine Learning, but at some point you will want to gain a better intuition for how the different Machine Learning algorithms really work under the hood. This will help you to make better decisions during a Machine Learning system’s development. So if you really want to be a professional in this field, you will not get around mastering the parts of Linear Algebra that are important for Machine Learning.

In Linear Algebra, data is represented by linear equations, which are presented in the form of matrices and vectors. Because of that, you are mostly dealing with matrices and vectors rather than with scalars (we will cover these terms in the following section). When you have the right libraries, like NumPy, at your disposal, you can compute complex matrix multiplications very easily with just a few lines of code. Note that this blog post ignores concepts of Linear Algebra that are not important for Machine Learning.

A scalar is simply just a single number. For example 24.

A Vector is an ordered array of numbers and can be written as a row or a column. A Vector has just a single index, which can point to a specific value within the Vector. For example, V2 refers to the second value within the Vector, which is „-8“ in the yellow picture above.

A Matrix is an ordered 2D array of numbers and it has two indices. The first one points to the row and the second one to the column. For example, M23 refers to the value in the second row and the third column, which is „8“ in the yellow picture above. A Matrix can have any number of rows and columns. Note that a vector is also a Matrix, but with only one row or one column.

The Matrix in the example in the yellow picture is a 2-by-3 matrix (rows × columns). Below you can see another example of a Matrix along with its notation:

A Tensor is an array of numbers arranged on a regular grid with a variable number of axes. A third-order tensor, for example, has three indices, where the first one points to the row, the second to the column and the third to the axis. For example, V232 points to the second row, the third column, and the second axis. This refers to the value 0 in the right tensor in the picture below:

It is the most general term for all of the concepts above, because a Tensor is a multidimensional array and can be a vector or a matrix, depending on the number of indices it has. For example, a first-order tensor would be a vector (1 index), a second-order tensor is a matrix (2 indices), and tensors with three or more indices are called higher-order tensors.

If you multiply, divide, subtract or add a scalar with a matrix, you just do this mathematical operation with every element of the matrix. The image below shows that perfectly for the example of multiplication:
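In NumPy this is plain element-wise broadcasting; the matrix below is made up for illustration:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])

# Every element of the matrix gets the operation applied to it
print(A * 2)  # [[2 4] [6 8]]
print(A + 2)  # [[3 4] [5 6]]
```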

Multiplying a matrix by a vector can be thought of as multiplying each row of the matrix by the column of the vector. The output will be a vector that has the same number of rows as the matrix. The image below shows how this works:

To better understand the concept, we will go through the calculation of the second image. To get the first value of the resulting vector (16), we take the numbers of the vector we want to multiply with the matrix (1 and 5), and multiply them with the numbers of the first row of the matrix (1 and 3). This looks like this:

1*1 + 3*5 = 16

We do the same for the values within the second row of the matrix:

4*1 + 0*5 = 4

And again for the third row of the matrix:

2*1 + 1*5 = 7
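The same calculation can be checked with NumPy, using the matrix and vector from the worked example above:

```python
import numpy as np

# The matrix and vector from the worked example above
A = np.array([[1, 3],
              [4, 0],
              [2, 1]])
v = np.array([1, 5])

print(A @ v)  # [16  4  7], one entry per row of the matrix
```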

Here is another example:

And here is some kind of cheat sheet:

Matrix-Matrix Addition and Subtraction is fairly easy and straightforward. The requirement is that the matrices have the same dimensions, and the result will be a matrix that also has the same dimensions. You just add or subtract each value of the first matrix with its corresponding value in the second matrix. The picture below shows what I mean:
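The same element-wise behavior in NumPy, with made-up matrices:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Element-wise: both matrices must have the same dimensions
print(A + B)  # [[ 6  8] [10 12]]
print(A - B)  # [[-4 -4] [-4 -4]]
```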

Multiplying two Matrices together isn’t that hard either if you know how to multiply a matrix by a vector. Note that you can only multiply Matrices together if the number of columns of the first matrix matches the number of rows of the second matrix. The result will be a matrix that has the same number of rows as the first matrix and the same number of columns as the second matrix. It works as follows:

You simply split the second matrix into column vectors and multiply the first matrix separately with each of these vectors. Then you put the results into a new matrix (without adding them up!). The image below explains this step by step:

And here is again some kind of cheat sheet:
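You can verify the column-splitting view with NumPy (the matrices are made up for illustration):

```python
import numpy as np

A = np.array([[1, 3],
              [4, 0]])
B = np.array([[2, 1],
              [0, 5]])

# The full matrix product...
print(A @ B)  # [[ 2 16] [ 8  4]]

# ...equals multiplying A by each column of B and stacking the results
cols = [A @ B[:, j] for j in range(B.shape[1])]
print(np.column_stack(cols))  # same matrix
```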

Matrix Multiplication has several properties that allow us to bundle a lot of computation into one Matrix multiplication. We will discuss them one by one below. We will start by explaining these concepts with Scalars and then with Matrices because this gives you a better understanding.

Scalar multiplication is commutative, but matrix multiplication is not. This means that when we are multiplying scalars, 7 * 3 is the same as 3 * 7. But when we multiply matrices with each other, A * B isn’t the same as B * A.

Scalar and Matrix Multiplication are both associative. This means that the scalar multiplication 3 * (5 * 3) is the same as (3 * 5) * 3, and that the matrix multiplication A * (B * C) is the same as (A * B) * C.

Scalar and Matrix Multiplication are also both distributive. This means that 3 * (5 + 3) is the same as 3*5 + 3*3, and that A * (B + C) is the same as A*B + A*C.

The identity Matrix is a special kind of matrix but first, we need to define what an identity is. The number 1 is an identity because everything you multiply with 1 is equal to itself. Therefore every Matrix that is multiplied by an Identity Matrix is equal to itself. For example, Matrix A times its Identity-Matrix is equal to A.

You can spot an Identity Matrix by the fact that it has ones along its diagonal and that every other value is zero. It is also a „square matrix“, meaning that its number of rows matches its number of columns.

We previously discussed that Matrix multiplication is not commutative but there is one exception, namely if we multiply a Matrix by an identity matrix. Because of that, the following equation is true: **A * I = I * A = A**
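All four properties can be checked numerically; the example matrices below are arbitrary:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
C = np.array([[2, 0], [0, 2]])
I = np.eye(2, dtype=int)  # the 2x2 identity matrix

print(np.array_equal(A @ B, B @ A))                # False: not commutative
print(np.array_equal(A @ (B @ C), (A @ B) @ C))    # True: associative
print(np.array_equal(A @ (B + C), A @ B + A @ C))  # True: distributive
print(np.array_equal(A @ I, A) and np.array_equal(I @ A, A))  # True: identity
```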

The Matrix inverse and the Matrix transpose are two special kinds of Matrix properties. Again, we will start by discussing how these properties relate to real numbers and then how they relate to Matrices.

First of all, what is an inverse? A number multiplied by its inverse is equal to 1. Note that every number except 0 has an inverse. If you multiply a Matrix by its inverse, the result is the identity matrix. The example below shows what the inverse of scalars looks like:

But not every Matrix has an inverse. You can only compute the inverse of a Matrix if it is a square matrix and if it is invertible at all. Discussing which Matrices have an inverse is unfortunately out of the scope of this post.

Why do we need an Inverse? Because we can’t divide Matrices. There is no concept of dividing by a Matrix but we can multiply a Matrix by an Inverse, which results in the same thing.

The image below shows a Matrix that gets multiplied by its own inverse, which results in a 2 by 2 identity matrix.

You can easily compute the inverse of a Matrix (if it has one) using NumPy. Here’s the link to the documentation: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.inv.html.
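A minimal example of `numpy.linalg.inv`, with a made-up invertible matrix:

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
A_inv = np.linalg.inv(A)

# A matrix times its inverse gives the identity matrix (up to floating-point error)
print(np.allclose(A @ A_inv, np.eye(2)))  # True
```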

And lastly, we will discuss the Matrix transpose property. This is basically the mirror image of a Matrix along a 45-degree axis. It is fairly simple to get the transpose of a Matrix: its first column becomes the first row of the transpose, its second column becomes the second row, and so on. An m*n Matrix is simply transformed into an n*m Matrix. Also, the element Aij of A is equal to the element Aji of its transpose. The image below illustrates that:
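A quick NumPy check of the transpose, on a made-up matrix:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])  # a 2x3 matrix

print(A.T)                   # its 3x2 transpose
print(A[0, 2] == A.T[2, 0])  # True: the element A_ij equals the element A_ji of A.T
```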

In this post, you learned about the mathematical objects of Linear Algebra that are used in Machine Learning. You also learned how to multiply, divide, add and subtract these mathematical objects. Furthermore, you have learned about the most important properties of Matrices and why they enable us to make more efficient computations. On top of that, you have learned what inverse and transpose Matrices are and what you can do with them. Although there are also other parts of Linear Algebra used in Machine Learning, this post gave you a proper introduction to the most important concepts.

Deep Learning (book) – Ian Goodfellow, Yoshua Bengio, Aaron Courville

https://machinelearningmastery.com/linear-algebra-machine-learning/

Andrew Ng’s Machine Learning course on Coursera

https://en.wikipedia.org/wiki/Linear_algebra

https://www.mathsisfun.com/algebra/scalar-vector-matrix.html

https://www.aplustopper.com/understanding-scalar-vector-quantities/

https://machinelearning-blog.com/2017/11/04/calculus-derivatives/


**Table of Contents:**

- Introduction to Data Types
- Categorical Data
- Nominal
- Ordinal

- Numerical Data
- Discrete
- Continuous
- Interval
- Ratio

- Why Data Types are important?
- Statistical methods
- Summary

Having a good understanding of the different data types, also called measurement scales, is a crucial prerequisite for doing Exploratory Data Analysis (EDA), since you can use certain statistical measurements only for specific data types.

You also need to know which data type you are dealing with to choose the right visualization method. Think of data types as a way to categorize different types of variables. We will discuss the main types of variables and look at an example for each. We will sometimes refer to them as measurement scales.

Categorical data represents characteristics. Therefore it can represent things like a person’s gender, language etc. Categorical data can also take on numerical values (Example: 1 for female and 0 for male). Note that those numbers don’t have mathematical meaning.

Nominal values represent discrete units and are used to label variables that have no quantitative value. Just think of them as „labels“. Note that nominal data has no order; therefore, if you changed the order of its values, the meaning would not change. You can see two examples of nominal features below:

The left feature, which describes a person’s gender, would be called „dichotomous“, a type of nominal scale that contains only two categories.

Ordinal values represent discrete and ordered units. Ordinal data is therefore nearly the same as nominal data, except that its ordering matters. You can see an example below:

Note that the difference between Elementary and High School is not the same as the difference between High School and College. This is the main limitation of ordinal data: the differences between the values are not really known. Because of that, ordinal scales are usually used to measure non-numeric features like happiness, customer satisfaction and so on.

We speak of discrete data if its values are distinct and separate. In other words: We speak of discrete data if the data can only take on certain values. This type of data can’t be measured but it can be counted. It basically represents information that can be categorized into a classification. An example is the number of heads in 100 coin flips.

You can check whether you are dealing with discrete data by asking the following two questions: Can you count it, and can it be divided up into smaller and smaller parts? On the contrary, if the data can be measured but not counted, we speak of continuous data.

Continuous data represents measurements; its values can’t be counted, but they can be measured. An example would be the height of a person, which you can only describe by using intervals on the real number line.

**Interval Data**

Interval values represent ordered units that have the same difference between them. Therefore we speak of interval data when we have a variable that contains numeric values that are ordered and where we know the exact differences between the values. A good example would be a feature that contains the temperature of a given place, like you can see below:

The problem with interval data is that it doesn’t have a „true zero“. In regard to our example, that means there is no such thing as no temperature. With interval data we can add and subtract, but we cannot multiply, divide or calculate ratios. Because there is no true zero, a lot of descriptive and inferential statistics can’t be applied.

**Ratio Data**

Ratio values are also ordered units that have the same difference between them. They are the same as interval values, with the difference that they do have an absolute zero. Good examples are height, weight, length and so on.

Data types are an important concept because statistical methods can only be used with certain data types. You have to analyze continuous data differently than categorical data, otherwise it would result in a wrong analysis. Therefore, knowing the type of data you are dealing with enables you to choose the correct method of analysis.

We will now go over every data type again, but this time with regard to which statistical methods can be applied. To properly understand what we will now discuss, you have to understand the basics of descriptive statistics. If you don’t know them, you can read my blog post (9 min read) about it: https://towardsdatascience.com/intro-to-descriptive-statistics-252e9c464ac9. Other tools that aren’t discussed in that post will be explained here.

When you are dealing with nominal data, you collect information through:

Frequencies: The Frequency is the rate at which something occurs over a period of time or within a dataset.

Proportion: You can easily calculate the proportion by dividing the frequency by the total number of events. (e.g how often something happened divided by how often it could happen)

Percentage: I think this one doesn’t need an explanation.

Visualization Methods: To visualize nominal data you can use a pie chart or a bar chart.
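A small Python sketch of frequencies, proportions and percentages, computed for a hypothetical nominal feature (the language values below are made up):

```python
from collections import Counter

# Hypothetical nominal feature: preferred language of ten survey respondents
languages = ["en", "de", "en", "fr", "en", "de", "en", "fr", "en", "en"]

freq = Counter(languages)  # frequencies: how often each category occurs
total = sum(freq.values())
proportions = {lang: count / total for lang, count in freq.items()}
percentages = {lang: f"{count / total:.0%}" for lang, count in freq.items()}

print(freq["en"], freq["de"], freq["fr"])  # 6 2 2
print(proportions["en"])                   # 0.6
print(percentages["en"])                   # 60%
```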

When you are dealing with ordinal data, you can use the same methods as with nominal data, but you also get access to some additional tools. You can summarize your ordinal data with frequencies, proportions and percentages, and you can visualize it with pie and bar charts. Additionally, you can use percentiles, the median, the mode and the interquartile range to summarize your data.

When you are dealing with continuous data, you have the most methods available to describe your data. You can summarize it using percentiles, the median, the interquartile range, the mean, the mode, the standard deviation, and the range.

Visualization Methods:

To visualize continuous data, you can use a histogram or a box-plot. With a histogram, you can check the central tendency, variability, modality, and kurtosis of a distribution. Note that a histogram can’t show you if you have any outliers. This is why we also use box-plots.

In this post, you discovered the different data types that are used throughout statistics. You learned the difference between discrete and continuous data and what nominal, ordinal, interval and ratio measurement scales are. Furthermore, you now know which statistical measurements you can use for which data type and which visualization methods are appropriate. This enables you to do a big part of an exploratory analysis on a given dataset.

- https://en.wikipedia.org/wiki/Statistical_data_type
- https://www.youtube.com/watch?v=hZxnzfnt5v8
- http://www.dummies.com/education/math/statistics/types-of-statistical-data-numerical-categorical-and-ordinal/
- https://www.isixsigma.com/dictionary/discrete-data/
- https://www.youtube.com/watch?v=zHcQPKP6NpM&t=247s
- http://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/
- https://study.com/academy/lesson/what-is-discrete-data-in-math-definition-examples.html

**Table of Contents:**

- Introduction
- Normal Distribution
- Central Tendency
- mean
- mode
- median

- Measures of Variability
- range
- interquartile range

- Variance and Standard Deviation
- Modality
- Skewness
- Kurtosis
- Summary

Doing a descriptive statistical analysis of your dataset is absolutely crucial. A lot of people skip this part and therefore lose a lot of valuable insights about their data, which often leads to wrong conclusions. Take your time and carefully run descriptive statistics and make sure that the data meets the requirements to do further analysis.

But first of all, we should go over what statistics really is:

*Statistics is a branch of mathematics that deals with the collection, organization, analysis and interpretation of data.*

Within statistics, there are two main categories:

**1. Descriptive Statistics:** In descriptive statistics, you are describing, presenting, summarizing and organizing your data (population), either through numerical calculations or graphs or tables.

**2. Inferential Statistics:** Inferential statistics are produced by more complex mathematical calculations and allow us to infer trends and make assumptions and predictions about a population based on a study of a sample taken from it.

The normal distribution is one of the most important concepts in statistics since nearly all statistical tests require normally distributed data. It basically describes what large samples of data look like when they are plotted. It is sometimes called the “bell curve“ or the “Gaussian curve“.

Inferential statistics and the calculation of probabilities require that a normal distribution is given. This basically means, that if your data is not normally distributed, you need to be very careful what statistical tests you apply to it since they could lead to wrong conclusions.

**A normal distribution is given if your data is symmetrical, bell-shaped, centered and unimodal.**

In a perfect normal distribution, each side is an exact mirror of the other. It should look like the distribution on the picture below:

You can see on the picture that the distribution is bell-shaped, which simply means that it is not heavily peaked. Unimodal means that there is only one peak.

In statistics, we have to deal with the mean, the mode and the median. These are also called measures of „central tendency“. They are just three different kinds of „averages“ and certainly the most popular ones.

**The mean is simply the average** and considered the most reliable measure of central tendency for making assumptions about a population from a single sample. Central tendency determines the tendency for the values of your data to cluster around its mean, mode, or median. The mean is computed by the sum of all values, divided by the number of values.

**The mode is the value or category that occurs most often within the data.** Therefore a dataset has no mode if no number is repeated or no category occurs more than once. It is possible that a dataset has more than one mode, but I will cover this in the „Modality“ section below. The mode is also the only measure of central tendency that can be used for categorical variables, since you can’t compute, for example, the average of the variable „gender“. You simply report categorical variables as counts and percentages.

**The median is the “middle” value or midpoint in your data** and is also called the “50th percentile”. Note that the median is much less affected by outliers and skewed data than the mean. An example: imagine you have a dataset of housing prices that range mostly from $100,000 to $300,000 but contains a few houses that are worth more than 3 million dollars. These expensive houses will heavily affect the mean, since it is the sum of all values divided by the number of values. The median will not be heavily affected by these outliers, since it is only the “middle” value of all data points. Therefore, the median is a much better suited statistic to report about your data.

In a normal distribution, these measures all fall at the same midline point. This means that the mean, mode and median are all equal.
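
A minimal sketch in pandas (the house-price numbers below are made up for illustration) shows all three measures and how the mean reacts to an outlier while the median does not:

```python
import pandas as pd

# mostly "ordinary" prices plus one extreme outlier (in thousands of dollars)
prices = pd.Series([120, 150, 150, 200, 250, 300, 3200])

mean = prices.mean()      # pulled far up by the single outlier
median = prices.median()  # robust "middle" value
mode = prices.mode()      # most frequent value(s), returned as a Series

print(round(mean, 2))  # 624.29
print(median)          # 200.0
print(mode[0])         # 150
```

Note that `mode()` returns a Series, because a dataset can have more than one mode.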

The most popular variability measures are the range, interquartile range (IQR), variance, and standard deviation. These are used to measure the amount of spread or variability within your data.

**The range describes the difference between the largest and the smallest points in your data.**

The interquartile range (IQR) is a measure of statistical dispersion between the upper (75th) and lower (25th) quartiles.

**While the range measures where the beginning and end of your data points are, the interquartile range is a measure of where the majority of the values lie.**
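
Both can be computed in a couple of lines with pandas (the sample values are arbitrary; note that `quantile()` interpolates linearly by default):

```python
import pandas as pd

data = pd.Series([2, 4, 4, 5, 7, 8, 9, 11, 12, 20])

value_range = data.max() - data.min()            # range: largest minus smallest
iqr = data.quantile(0.75) - data.quantile(0.25)  # IQR: 75th minus 25th percentile

print(value_range)  # 18
print(iqr)          # 6.25
```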

The difference between the standard deviation and the variance is often a little bit hard to grasp for beginners, but I will explain it thoroughly below.

The standard deviation and the variance also measure, like the range and IQR, how spread apart our data is (i.e. the dispersion). Both are derived from the mean.

The variance is computed by finding the difference between every data point and the mean, squaring these differences, summing them up, and then taking the average of those squares.

The squares are used during the calculation because they weight outliers more heavily than points near the mean, and they prevent differences above the mean from canceling out differences below the mean.

The problem with the variance is that, because of the squaring, it is not in the same unit of measurement as the original data.

Let’s say you are dealing with a dataset that contains centimeter values. Your variance would be in squared centimeters and therefore not the best measurement.

This is why the standard deviation is used more often: it is simply the square root of the variance, and because of that, it is in the original unit of measurement.

Let’s look at an example that illustrates the difference between variance and standard deviation:

Imagine a data set that contains centimeter values between 1 and 15, which results in a mean of 8. Squaring the difference between each data point and the mean and averaging the squares renders a variance of 18.67 (squared centimeters), while the standard deviation is 4.3 centimeters.
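
This example can be reproduced in a few lines with NumPy (using the population variance, i.e. dividing by n):

```python
import numpy as np

data = np.arange(1, 16)  # centimeter values from 1 to 15
mean = data.mean()       # 8.0

variance = ((data - mean) ** 2).mean()  # average squared deviation, in cm²
std = np.sqrt(variance)                 # back in centimeters

print(mean)                # 8.0
print(round(variance, 2))  # 18.67
print(round(std, 1))       # 4.3
```

Be aware that pandas’ `var()` and `std()` default to the sample versions (dividing by n − 1), which give slightly larger values; pass `ddof=0` to get the population versions shown here.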

When you have a low standard deviation, your data points tend to be close to the mean. A high standard deviation means that your data points are spread out over a wide range.

Standard deviation is best used when the data is unimodal. In a normal distribution, approximately 34% of the data points lie between the mean and one standard deviation above or below the mean. Since a normal distribution is symmetrical, 68% of the data points fall between one standard deviation above and one standard deviation below the mean. Approximately 95% fall between two standard deviations below the mean and two standard deviations above the mean. And approximately 99.7% fall between three standard deviations above and three standard deviations below the mean.

The picture below illustrates that perfectly.
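
As a quick sanity check (assuming NumPy is available), you can verify this 68-95-99.7 rule empirically by sampling from a standard normal distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=100_000)

# fraction of points within 1, 2, and 3 standard deviations of the mean
within_1 = np.mean(np.abs(sample) <= 1)  # ≈ 0.68
within_2 = np.mean(np.abs(sample) <= 2)  # ≈ 0.95
within_3 = np.mean(np.abs(sample) <= 3)  # ≈ 0.997

print(round(within_1, 2), round(within_2, 2), round(within_3, 3))
```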

With the so-called “z-score”, you can check how many standard deviations below (or above) the mean a specific data point lies. In pandas, you can compute the standard deviation with the “std()” function. To better understand the concept of a normal distribution, we will now discuss the concepts of modality, symmetry and peakedness.
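
A z-score is simply the distance from the mean divided by the standard deviation; a minimal sketch with made-up values:

```python
import pandas as pd

data = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

mean = data.mean()      # 5.0
std = data.std(ddof=0)  # population standard deviation: 2.0

# z-score: how many standard deviations each point lies from the mean
z_scores = (data - mean) / std

print(z_scores[7])  # (9 - 5) / 2 = 2.0, i.e. two standard deviations above
```

Note that `ddof=0` is passed explicitly here because pandas’ `std()` defaults to the sample standard deviation (dividing by n − 1).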

**The modality of a distribution is determined by the number of peaks it contains.** Most distributions have only one peak but it is possible that you encounter distributions with two or more peaks.

The picture below shows visual examples of the three types of modality:

Unimodal means that the distribution has only one peak, i.e. only one frequently occurring score, clustered at the top. A bimodal distribution has two values that occur frequently (two peaks), and a multimodal distribution has three or more frequently occurring values.

**Skewness is a measurement of the asymmetry of a distribution.**

Therefore it describes how much a distribution differs from a normal distribution, either to the left or to the right. The skewness value can be either positive, negative or zero. Note that a perfect normal distribution would have a skewness of zero because the mean equals the median.

Below you can see an illustration of the different types of skewness:

**We speak of a positive skew if the data is piled up on the left**, which leaves the tail pointing to the right.

**A negative skew occurs if the data is piled up on the right**, which leaves the tail pointing to the left. Note that positive skews are more frequent than negative ones.

A good measurement of the skewness of a distribution is Pearson’s skewness coefficient, which provides a quick estimate of a distribution’s symmetry. To compute the skewness in pandas, you can simply use the “skew()” function.
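
A short sketch with made-up right-skewed values, showing both pandas’ built-in `skew()` and Pearson’s second skewness coefficient, 3 × (mean − median) / standard deviation:

```python
import pandas as pd

# right-skewed data: most values are small, a long tail points to the right
data = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 9, 15])

skewness = data.skew()  # pandas' sample skewness

# Pearson's second skewness coefficient: positive when the mean exceeds the median
pearson_skew = 3 * (data.mean() - data.median()) / data.std()

print(skewness > 0)      # True: positive (right) skew
print(pearson_skew > 0)  # True: mean (4.75) is above median (3.5)
```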

**Kurtosis** measures whether your dataset is heavy-tailed or light-tailed compared to a normal distribution. Data sets with high kurtosis have heavy tails and more outliers, while data sets with low kurtosis tend to have light tails and fewer outliers. Note that a histogram is an effective way to show both the skewness and kurtosis of a data set because you can easily spot if something is wrong with your data. A probability plot is also a great tool because normally distributed data would simply follow the straight line.

You can see both for a positively skewed dataset in the image below:

A good way to mathematically measure the kurtosis of a distribution is Fisher’s measure of kurtosis (also called excess kurtosis).

Below you can see the three most common types of kurtosis. Note that the area under a probability density function curve is always 1, no matter what type of kurtosis you are dealing with.

A normal distribution is called mesokurtic (red line) and has an excess kurtosis of or around zero. A platykurtic distribution (blue line) has negative excess kurtosis, and its tails are very thin compared to the normal distribution. Leptokurtic distributions (green line) have positive excess kurtosis (i.e. a plain kurtosis greater than 3); their fat tails mean more extreme outliers, while the sharp peak means the bulk of the data is clustered closely around the mean.

If you have already recognized that a distribution is skewed, you don’t need to calculate its kurtosis, since the distribution is already not normal. In pandas, you can view the kurtosis simply by calling the “kurtosis()” function.
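
A sketch comparing the three types on simulated data (note that pandas’ `kurtosis()` reports Fisher’s excess kurtosis, so a normal distribution comes out near zero):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

normal = pd.Series(rng.normal(size=50_000))           # mesokurtic
heavy = pd.Series(rng.standard_t(df=3, size=50_000))  # leptokurtic (fat tails)
flat = pd.Series(rng.uniform(-1, 1, size=50_000))     # platykurtic (thin tails)

# excess kurtosis: ~0 for normal, positive for heavy tails, negative for light tails
print(round(normal.kurtosis(), 1))  # ≈ 0
print(heavy.kurtosis() > 0)         # True
print(flat.kurtosis() < 0)          # True
```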

This post gave you a proper introduction to descriptive statistics. You learned what a Normal Distribution looks like and why it is important. Furthermore, you gained knowledge about the three different kinds of averages (mean, mode and median), also called the Central Tendency. Afterwards, you learned about the range, interquartile range, variance and standard deviation. Then we discussed the three types of modality and that you can describe how much a distribution differs from a normal distribution in terms of Skewness. Lastly, you learned about Leptokurtic, Mesokurtic and Platykurtic distributions.
