Prairie.Code() Sessions tagged machine learning

Node.js, ML, K8s and Unethical Face Recognition

How nice would it be to remember everyone’s name? What if you could walk into a room and know everyone’s Twitter handle? What if you could give each person a score to decide whether they are worth a conversation? Kubernetes is increasingly used for deploying applications, and it can also be used for machine learning workloads. In this talk, the speaker will demonstrate how to use Node.js, a touch of machine learning and a sprinkle of Kubernetes to recognize people in a crowd. The talk covers the technologies behind this demo, which was inspired by the show Black Mirror. It’s about the tech... and also why you shouldn't build it.

Speaker

Joel Lord

Developer Advocate, Red Hat

Curating quality datasets for Machine Learning

In contemporary machine learning, “data is the new oil”. For state-of-the-art ML algorithms to work their magic, they need a strong foundation of relevant data. Large volumes of raw data are available on the web, and what we need are the skills to identify and extract meaningful datasets from it. This talk presents the most fundamental aspect of machine learning, dataset curation, which often does not get the attention it deserves. It also walks the audience through the process of constructing good-quality datasets, as done in formal settings, with a simple hands-on Python example. The goal is to underline the importance of data, especially well-prepared data, and the effect it has on building smart learning algorithms.

Outline

Introduction

  • Popularity of Machine Learning & Applications
  • Significance of honing dataset building skills
  • Importance in Academia: Expanding the domains open to research, solving novel problems using ML, leading research efforts in this domain, etc.
  • Importance in Industry: Lots of raw data available but no exact dataset for training purposes, proactively identifying which data to log to solve specific problems, etc.

Finding data source(s)

  • Guided Search based on a problem definition: Identifying essential data signals
  • Unguided Search with no problem definition in mind: Dealing with ambiguity
  • Tips on identifying data sources.

Data Extraction - Hands-On Example (Audience-level & Time-constraint dependent)

  • Live Python example implemented via Jupyter Notebook
  • Use of Python tools: Beautiful Soup and Selenium
  • Step-by-step process to plan data extraction
  • Nitty-gritty details about tools and the extraction code itself
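
To give a feel for the extraction step, here is a minimal sketch, not the speakers' notebook code. The talk names Beautiful Soup and Selenium; this sketch swaps in Python's standard-library `html.parser` so it runs without extra installs. The HTML fragment, the `item` class, and the names `LinkExtractor` and `extract_records` are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real run would fetch HTML from a chosen source.
SAMPLE_HTML = """
<ul>
  <li class="item"><a href="/a">First post</a></li>
  <li class="item"><a href="/b">Second post</a></li>
  <li class="ad"><a href="/x">Sponsored</a></li>
</ul>
"""


class LinkExtractor(HTMLParser):
    """Collect title/url pairs for links inside <li class="item"> elements."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._in_item = False   # True while inside a wanted <li>
        self._href = None       # href of the link currently being read
        self._text = []         # text chunks of the current link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            self._in_item = attrs.get("class") == "item"
        elif tag == "a" and self._in_item:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that belongs to a link we decided to record.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.records.append({"title": title, "url": self._href})
            self._href = None
        elif tag == "li":
            self._in_item = False


def extract_records(html):
    """Parse an HTML string and return the list of extracted records."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.records
```

Calling `extract_records(SAMPLE_HTML)` returns one dictionary per wanted list item and skips the entry marked as an ad, which mirrors the planning step of deciding which page elements carry the signal before writing any extraction code.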

Dataset Preparation

  • Cleaning
  • Anonymizing
  • Standardizing
  • Structuring
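
The four preparation steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the method from the talk or the book: the sample records, the field names, the `YYYY/MM/DD` input date format, and the salted-hash pseudonymization are all hypothetical choices.

```python
import hashlib
import re

# Hypothetical raw records, as they might come out of an extraction step.
RAW = [
    {"user": "alice@example.com", "text": "  Great   talk!\n", "date": "2021/03/05"},
    {"user": "bob@example.com", "text": "", "date": "2021/03/06"},
]


def clean(text):
    """Cleaning: collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def anonymize(identifier, salt="demo-salt"):
    """Anonymizing: replace an identifier with a truncated salted SHA-256 digest.

    This is pseudonymization only; stronger guarantees need more than a hash.
    """
    digest = hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()
    return digest[:12]


def standardize_date(date):
    """Standardizing: convert an assumed YYYY/MM/DD date to ISO 8601 (YYYY-MM-DD)."""
    year, month, day = date.split("/")
    return f"{int(year):04d}-{int(month):02d}-{int(day):02d}"


def prepare(records):
    """Structuring: emit rows with a fixed set of fields, dropping empty texts."""
    rows = []
    for record in records:
        text = clean(record["text"])
        if not text:
            continue  # nothing left after cleaning, so skip the record
        rows.append(
            {
                "user_id": anonymize(record["user"]),
                "text": text,
                "date": standardize_date(record["date"]),
            }
        )
    return rows
```

Running `prepare(RAW)` keeps only the first record, since the second has no text once cleaned. Each step is a separate function so that it can be replaced independently, for example with a different anonymization scheme.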

Conclusion and Takeaways

  • Reiterating the need for good and reliable datasets: Laying the strong foundation of ML
  • Pointers on why and how to choose among data extraction techniques for a given application, keeping their pros & cons in mind
  • Some personal anecdotes and recommendations for different use cases as an ML Engineer

The workshop is based on the book we authored, Sculpting Data for ML: The first act of Machine Learning.

Speakers

Jigyasa Grover

Machine Learning Engineer, Twitter, Inc.