Machine Learning for Big Data and Text Processing

Machine learning methods drive much of modern data analysis across engineering, sciences, and commercial applications. For example, search engines, recommender systems, advertisers, and financial institutions employ machine learning algorithms for content recommendation, predicting customer behavior, compliance, or risk. Much of today's data is available in primarily textual form, requiring effective tools for using unstructured and semi-structured text. 

This course examines a suite of key machine learning tools and their applications, including predictive analysis. We will discuss key insights underlying the tools, what kinds of problems they can and cannot solve, how they can be applied effectively, and what issues are likely to arise in practical applications.

Lead Instructor(s): 

Regina Barzilay
Tommi Jaakkola
Stefanie Jegelka


Jun 18, 2018 - Jun 22, 2018

Course Length: 

5 Days

Course Fee: 

  • Registration opening soon

It is highly recommended that you apply for a course at least 6-8 weeks before the start date to guarantee there will be space available. After that date you may be placed on a waitlist. Courses may be cancelled up to 4 weeks before the start date if enrollment is insufficient. If you are able to access the online application form, then registration for that particular course is still open.

This course has limited enrollment. Apply early to guarantee your spot.

Participant Takeaways: 

  • Understand broad opportunities for automation with machine learning
  • Be able to formulate/set up problems as machine learning tasks
  • Outline key aspects of practical problems that are likely to impact performance
  • Assess which types of methods are likely to be useful for a given class of problems
  • Understand strengths and weaknesses of "on-line" learning algorithms
  • Be able to discuss scaling issues (amount of data, dimensionality, storage, and computation)
  • Follow the process of applying machine learning methods in practice, and foresee likely hurdles and possible remedies
  • Understand modern natural language processing tools, formulations, and problems
  • Grasp what predictive analytics often does not provide
  • Understand current machine learning trends and opportunities that they bring

Who Should Attend: 

The course is designed to operate simultaneously on two levels, intuitive and more formal, describing key concepts, formulations, algorithms, and practical examples for professionals whose work interfaces with data analysis in different ways and on different levels.

  • At the managerial level, the course provides a vision and understanding of the many opportunities, costs, and likely performance hurdles in predictive modeling, especially as they pertain to large amounts of textual (or similar) data.
  • For professionals whose work involves data hands-on, the course aims to provide a deeper understanding and sharper intuitions about what is possible, what is not, and which methods to consider in what contexts.
  • For everyone, the course provides the ability to see problems as machine learning problems and to discuss ways to approach them.

The course assumes an undergraduate degree in computer science or another technical area such as statistics, physics, or electrical engineering, with exposure to vectors and matrices and basic concepts of probability. A high-level understanding of programming (thinking in terms of programs) will be helpful.

Computer Requirements:

Laptops are required for this course. Tablets will not be sufficient for the computing activities performed in this course.

Program Outline: 

Day 1: (5.5 hours)

  • Overview of machine learning (1 hour)
  • Features, feature vectors, linear classifiers (2 hours)
  • On-line learning algorithms (1 hour)
  • Practicum (1.5 hours)

Day 2: (6.5 hours)

  • Non-linear classification and regression (2 hours)
  • Overfitting, regularization, generalization (1 hour)
  • Collaborative filtering, recommender problems (2 hours)
  • Practicum (1.5 hours)

Day 3: (6.5 hours)

  • Neural networks, deep learning (3 hours)
  • Dense vector representations (2 hours)
  • Practicum (1.5 hours)

Day 4: (6.5 hours)

  • Recurrent neural networks (2 hours)
  • Unsupervised learning, mixtures (3 hours)
  • Practicum (1.5 hours)

Day 5: (5.5 hours)

  • Reinforcement learning (2 hours)
  • Practical guide to machine learning (2 hours)
  • Practicum (1.5 hours)

Course Schedule: 

Registration is Monday morning, 9:00 - 9:30 am.

Class runs 10:00 am - 5:00 pm on Monday, 9:00 am - 5:00 pm Tuesday through Thursday, and 9:00 am - 4:00 pm on Friday.

2017 schedule:

10:00 Introduction: Overview of machine learning
10:45 Discussion and coffee (30 min)
11:15 Features, linear classification
Noon lunch break (1h)
13:00 Features, linear classification
14:30 Discussion and coffee (30 min)
15:00 On-line algorithms
15:30 Practicum
17:00 END

 9:00 Non-linear classification and regression
10:30 Discussion and coffee
11:00 Overfitting, regularization, generalization
Noon lunch break (1h)
13:00 Collaborative filtering, recommender problems
15:00 Discussion and coffee
15:30 Practicum 
17:00 END

 9:00 Neural networks and deep learning
10:30 Discussion and coffee
11:00 Neural networks and deep learning
Noon lunch break (1h)
13:00 Neural networks and deep learning
14:30 Discussion and coffee
15:00 Dense vector representations
15:30 Practicum
17:00 END

 9:00 Unsupervised learning
10:30 Discussion and coffee
11:00 Unsupervised learning, mixtures
Noon lunch break (together)
13:00 EM and spectral methods
14:00 Discussion and coffee
14:30 Spectral methods
15:30 Practicum
17:00 END

 9:00 Reinforcement learning
10:30 Discussion and coffee
11:00 Deep reinforcement learning
Noon lunch break (1h)
13:00 Practical guide to ML
14:00 Discussion and coffee
14:30 Application of ML to Healthcare
15:30 Course Closing



This course takes place on the MIT campus in Cambridge, Massachusetts. We can also offer this course for groups of employees at your location. Please complete the Custom Programs request form for further details.


Fundamentals: Core concepts, understandings, and tools (40%)
Latest Developments: Recent advances and future trends (30%)
Industry Applications: Linking theory and real-world (30%)

Delivery Methods: 

Lecture: Delivery of material in a lecture format (75%)
Discussion or Groupwork: Participatory learning (25%)


Introductory: Appropriate for a general audience (60%)
Specialized: Assumes experience in practice area or field (30%)
Advanced: In-depth explorations at the graduate level (10%)