Day 1: Data Types, Rectangular Data and Estimates of Location

Yize Zang
3 min read · Jan 18, 2021

Data science is one of the hottest topics today, and by now even a subject with classic methods of its own, yet its seeds date back about 60 years. Traditional statistics uses a small, computable portion of data, obtained by sampling, to infer the characteristics of large populations. John W. Tukey was the first to forge computer science and statistics together, in 1962, when he published the paper “The Future of Data Analysis”. Fifteen years later, Tukey published Exploratory Data Analysis, a book that became a foundation of data science as we know it today.

On Exploratory Data Analysis (EDA), Practical Statistics for Data Scientists lays out eight concepts, as follows:

Exploratory Data Analysis

Today, I’ll go over the first three: data types of structured data, rectangular data and estimates of location.

Mind map of data types, shape of data and estimates of location

Yes, structured data. Where does unstructured data fit in? That’s another huge question, and this book (written by two statisticians and one computer scientist) focuses on structured data. We care about data types because assigning a type to each column is usually the first thing to do when we obtain a dataset and load it into pandas, if the types aren’t easily inferred by the package. Why is this crucial to understand? Imagine that all the elements in column A and column B are 10-digit integers. What analysis would you apply to these two columns? Min, max, mean, median? Count distinct? The difference between column A and column B? All are possible. A 10-digit integer could just be a very large number, or an ordinal value (whose sequence matters), or a timestamp. You will know what analysis (and even what feature engineering) to apply only after you know what data type it is.
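Here’s a minimal sketch of that idea (the values are made up, not from the book): the same 10-digit integers read very differently depending on the dtype we assign in pandas.

```python
import pandas as pd

raw = pd.DataFrame({
    "A": [1609459200, 1612137600, 1614556800],   # hypothetical 10-digit integers
    "B": [4100000000, 4100000123, 4100000456],
})

# Treat column A as Unix timestamps (seconds since 1970-01-01)...
as_time = pd.to_datetime(raw["A"], unit="s")
print(as_time)

# ...or treat both columns as plain numbers and summarize them.
print(raw["B"].describe())           # min, max, mean, etc.
print(raw["B"].nunique())            # count distinct
print((raw["B"] - raw["A"]).head())  # difference of the two columns
```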

In most cases, we convert raw data into a rectangular shape, whether the source is structured numbers, text, time series, images or something else. We call this rectangular data a data frame. It’s fun to see how people from different backgrounds name the same elements of a data frame differently. Here are synonyms for a couple of terms:

Synonyms of the same object by a Data Scientist, a Statistician and a Computer Scientist
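As a small, hedged sketch (the columns and values are invented for illustration), this is what “rectangular data” usually ends up looking like in pandas, whatever the raw source was:

```python
import pandas as pd

df = pd.DataFrame({
    "sale_date": pd.to_datetime(["2021-01-05", "2021-01-12"]),  # time
    "zip_code": ["98101", "98052"],                             # text
    "price": [550_000, 730_000],                                # numbers
})

# Each row is a record (a "case" or "sample" to a statistician);
# each column is a feature (a "variable" or "attribute").
print(df.shape)   # (rows, columns) — the rectangle
print(df.dtypes)
```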

Once we have a data frame, the real work of making estimates begins. The first kind of estimate concerns numbers at special locations: quantiles, percentiles and the median (a fancy way of saying “center”). When a column is sorted, the numbers at the two extremes could be outliers, values very different from the majority of the column. For example, most townhouses in the Seattle area sell in the range of $400k to $800k; if you see one sold for $2M, that’s almost certainly an outlier. In some cases outliers are removed because a model is sensitive to them, such as Logistic Regression. But outliers can also be the focus of interest, as in anomaly detection, a technique widely used in finance for fraud detection and risk operations.
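Here’s a quick sketch with made-up Seattle-style townhouse prices: the basic location estimates, plus one common rule of thumb (an assumption on my part, not the book’s prescription) that flags values beyond 1.5 × IQR from the quartiles as potential outliers. The $2M sale stands out immediately.

```python
import numpy as np

prices = np.array([420_000, 515_000, 560_000, 610_000, 655_000,
                   700_000, 745_000, 790_000, 2_000_000])

print(np.mean(prices))                  # pulled upward by the outlier
print(np.median(prices))                # robust "center"
print(np.percentile(prices, [25, 75]))  # quartiles (percentiles/quantiles)

# Flag anything beyond 1.5 * IQR from the quartiles as a potential outlier.
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
print(outliers)                         # the $2M townhouse
```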

There’s nothing difficult here, right? The easy journey will last a while (before you are completely lost).
