How To Learn Data Science The Boring Way

Data Science has been called the sexiest job of the 21st century, and it is without doubt one of the most cutting-edge sectors you could decide to work in. It’s an exciting subject, full of innovation, and it is also a job that pays well. However, learning Data Science is not all fun and games: it is a hard topic, and learning it comes with many obstacles. The recent fashion is to learn Data Science predominantly from web-based resources, such as the e-learning courses offered on Coursera. Online courses are a great resource, but I believe there’s a lot to be missed if we rely exclusively on them for our Data Science preparation.

I will argue that the core of a Data Science curriculum should still come from textbooks, because neither online classes nor blogs teach you all the basics in a formal manner without leaving gaps in your knowledge. By contrast, there are plenty of textbooks written by great academics that treat every aspect of Data Science in much greater depth, leaving out no important detail. A student who wants to learn Data Science independently, as an autodidact, should follow the same journey that an academic scholar does. You don’t need to enroll in a university program, because these resources are publicly available, but skipping university classes doesn’t mean you should take shortcuts and learn from lesser resources. Sure, going through books may be less exciting than taking online courses (hence the provocative title of this post: “How To Learn Data Science The Boring Way”), but in the end you will be much better prepared and will outcompete your online-learning counterparts, trust me.

After you’ve learned the theory from the proper books, you will have to practice on pet projects of your own. Some online classes do guide you through examples of how to apply what you learned, but those examples are far too controlled: you just follow the track that was laid out for you, so you’re not really learning how to do it on your own. Most Data Science online classes use the same old examples, like the Titanic dataset and the MNIST dataset. Those are fine but trite. You would learn a lot more by taking on a new problem, perhaps with data you’re personally interested in. Remember that the internet is packed with information, and you can get data for your pet project from any source you like.

Now let’s dive into the constituent disciplines of Data Science. For each of them I will suggest the books I found most valuable in my own Data Science journey. Bear with me and learn Data Science the boring way!

1) Discovering Statistics Using R

Statistics is the first skill you need to work on as an aspiring Data Scientist, and R is the language of choice for it. This book by Andy Field is one of the most accessible resources for learning Statistics as a beginner, but it also contains more advanced material for those who want to explore the subject in greater depth. Andy has a unique explanation style that will make your learning process easy and straightforward. “Discovering Statistics Using R” doesn’t really belong in this article, as its writing style isn’t boring at all, but it’s certainly a great place to start. With this book you can kill two birds with one stone, as it is both an introduction to the field of statistics and an introduction to the R programming language.

2) R for Data Science: Import, Tidy, Transform, Visualize, and Model Data

The previous book introduces the R programming language through its well-known statistical capabilities. There is much more to R than just statistics, though: a big part of doing data science is preparing the data before doing any analysis. That’s why I suggest this book. “R for Data Science” is the best guide to learning how to import and clean data with R so that it is suitable for your subsequent analyses. It covers the most important R packages used for data manipulation, the so-called “tidyverse”. Many of the tidyverse packages were developed by Hadley Wickham, who is one of the authors of this book, so you have the chance to learn how the packages work from their own developer. The other author, Garrett Grolemund, works at RStudio, the main development environment for the R programming language.

Many beginners in Data Science focus all their efforts on Statistics and Machine Learning and forget to improve their data cleaning skills. One of the reasons for this problem is the use of pre-processed datasets for practice, datasets that were already prepared for you. In real-life scenarios you will rarely encounter datasets like that: most of the time you will have to clean the data yourself, and most online classes avoid this subject entirely. “R for Data Science” is a great resource to bridge this gap.

3) Advanced R by Hadley Wickham

The previous two books are already a good introduction to the R language. However, if you feel you need some additional knowledge and want to dive into more advanced material, “Advanced R” is the way to go. This book, like the previous one, is written by Hadley Wickham, who’s probably the most experienced R developer around. “Advanced R” is very useful if you want to learn to solve problems in R in the most efficient way, and to understand the peculiarities of R compared to other programming languages and what makes it special.

4) Learning Python by Mark Lutz

Data Scientists generally use two programming languages: R and Python. R is more focused on Statistics and Machine Learning and is a rather specialized programming language. Python, instead, is a general-purpose programming language that is used in many different fields, and it is one of the most popular programming languages in the world. Python is elegant, efficient and intuitive; you will love it. The book I suggest isn’t the most popular choice because it is very long, but I believe it is the most solid book on the topic. If you need a reference on the language, you can be sure that nothing is left out of “Learning Python” by Mark Lutz: it’s a complete manual and it is very well written. You may start learning Python with more concise resources, but in that case you will eventually run into obstacles. There are two possible approaches to overcoming those obstacles. One is to address them when they appear, researching issue by issue and asking questions on Stack Overflow. The other is to take your time and study a complete manual, so that you will already have the more complex issues covered. The second approach is what I suggest “Learning Python” for.

5) Bayesian Data Analysis by Andrew Gelman

Bayesian Statistics is becoming increasingly relevant and popular in the scientific community. It is a very powerful tool, but it is also a hard subject to study, and Bayesian inference tends to be counterintuitive to most people. That is why it’s absolutely necessary to learn it seriously and dedicate enough time to be sure you clearly understand it. Gelman’s “Bayesian Data Analysis” is the reference textbook for Bayesian statistics. You can’t go wrong with it: it’s the book that Bayesian statisticians around the world use to teach their classes.
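
To give a feel for why Bayesian reasoning can be counterintuitive, here is a minimal Python sketch of Bayes’ theorem applied to a diagnostic test. The function name `posterior_probability` and the numbers (1% prevalence, 95% sensitivity, 90% specificity) are made up for illustration and are not taken from the book.

```python
def posterior_probability(prior, sensitivity, specificity):
    """Return P(condition | positive test) using Bayes' theorem."""
    # P(positive | condition) * P(condition)
    true_positive = sensitivity * prior
    # P(positive | no condition) * P(no condition)
    false_positive = (1.0 - specificity) * (1.0 - prior)
    # Normalize by the total probability of a positive test
    return true_positive / (true_positive + false_positive)


# Hypothetical numbers, chosen only to illustrate the effect of a low prior
posterior = posterior_probability(prior=0.01, sensitivity=0.95, specificity=0.90)
print(f"P(condition | positive test) = {posterior:.3f}")
```

With these assumed numbers the posterior stays below 10%, even though the test looks accurate, because the condition is rare to begin with. Most people guess a much higher value, which is exactly the kind of intuition a serious textbook helps you correct.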

6) Pattern Recognition and Machine Learning

Machine Learning can be broadly divided into two subjects: traditional machine learning and deep learning. The first is an umbrella term for all machine learning algorithms that do not involve deep neural networks, such as the algorithms regularly used for linear and logistic regression. “Pattern Recognition and Machine Learning” by Bishop starts from the basics, explaining probability theory and information theory, and then explores the majority of Machine Learning algorithms, explaining in detail how they work underneath. Like most books I suggest, this is a very complete resource, and it can be used as a reference manual to consult every time you need a detailed description of an algorithm you have to implement for your projects. I think this is the best resource for all the algorithms making up the “traditional machine learning” set. For completeness it also covers neural networks, but I think there are better resources for learning about Deep Learning, in particular the following book.

7) Deep Learning by Ian Goodfellow

Deep Learning is the most cutting-edge sector of Data Science and a research subject still in active development, which is why there are few books that treat it in a satisfying way. “Deep Learning” by Ian Goodfellow is, to my knowledge, the only complete textbook available to date, and it is an extraordinary book. This masterpiece is a much-needed light on a topic that is so new and that few people are able to teach in such an understandable way. Deep Learning is the backbone of most modern Artificial Intelligence applications, so if you want to dive into the field of AI, it is a necessary subject to learn. No other resource can currently give you the knowledge that the brilliant scientist Ian Goodfellow (one of the leading minds in deep learning research) was able to put together in this book, which I strongly suggest.

8) Speech and Language Processing

Another great textbook, “Speech and Language Processing” is an introduction to the field of Natural Language Processing. This book covers all the theory around the subject, which is also still a field of very active research. Data Science tasks are usually about analysing numerical data, whereas Natural Language Processing is the practice of programming ways for computers to understand information presented in human language. Natural Language Processing is central to Artificial Intelligence: it is the basis of technologies like digital assistants and chatbots, and it is also important for search engines.
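
As a toy illustration of what turning human language into something a program can work with might look like, here is a minimal Python sketch that splits a sentence into word tokens and counts them (a “bag of words”). The function names `tokenize` and `bag_of_words` and the sample sentence are hypothetical, and the regular expression is a deliberately naive assumption. Real systems described in the book go far beyond this.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    # Naive rule: a token is a run of letters or apostrophes
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Count how many times each token occurs in the text."""
    return Counter(tokenize(text))


# Hypothetical sample sentence
sentence = "The cat sat on the mat. The mat was warm."
counts = bag_of_words(sentence)
print(counts.most_common(3))
```

Once text is reduced to counts like these, it becomes numerical data that the statistical and machine learning tools from the earlier books can be applied to.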

9) Natural Language Processing with Python

While the previous book covers the theory of Natural Language Processing, “Natural Language Processing with Python” covers the practice. The book is a guide to the Natural Language Toolkit (NLTK), a Python library rich in language processing functions and, in my view, one of the most complete available.

10) Artificial Intelligence: A Modern Approach

“Artificial Intelligence: A Modern Approach” is the leading textbook in the field of AI. It covers everything from the beginnings of this research field, through its most important historical milestones, up to the current state of AI. The content of this book is very broad: it ranges from the logical underpinnings of artificial intelligence to models of perception, models of decision making and even robotics.

If you successfully go through these books, you will be better prepared than most Data Scientists. Stay strong, don’t give up and keep on studying. Remember that learning is a lifelong process: not something to complete and be done with, but a state of mind of constant improvement that you must keep throughout your entire life. Precious books will be your loyal companions. Best of luck!
