Inspired by a desire to understand why more engineers weren’t using Ruby for their machine learning (ML) projects, I embarked on a journey to determine if I could build a project to do something non-trivial using ML in Ruby. Once again, it turns out that you can indeed use Ruby to do amazing things! By leveraging various libraries, building machine learning models becomes a breeze.
There is a large array of tools, libraries, and resources available that help facilitate the construction of ML models for Rubyists, and the list keeps growing. There has never been a better time to dive into machine learning and build interesting projects. It is clear that AI has dramatically shaped the software industry. As such, it makes sense that Rubyists will also partake in these rapidly evolving technologies and create tools and applications never before thought possible!
Resources
- GitHub repository
- Check out Andrew Kane's blog for more information about how to use ML in Ruby
- Awesome Machine Learning with Ruby Resources
[00:00] (upbeat music)
[00:04] - Welcome. We had a little bit of technical difficulties
[00:07] but things seem to be rolling right now, so...
[00:10] Thank you for coming to my talk.
[00:14] Oops, wrong way.
[00:16] Hi, I'm Landon.
[00:18] I'm a Senior Monkey Patcher at Test Double.
[00:21] That is a name I came up with for myself.
[00:24] I'm a senior software consultant at Test Double.
[00:28] If you'd like to reach out to me, I'm on LinkedIn,
[00:31] Mastodon, and the Bird app.
[00:34] So the reason I'm giving this talk is
[00:39] about several months ago, maybe a year ago,
[00:42] I've been thinking about it for a while.
[00:44] I was thinking about machine learning, AI, and Ruby
[00:48] and a lot of people are doing machine learning and Ruby,
[00:52] or sorry, machine learning and Python.
[00:54] And I was curious why isn't anyone doing machine learning
[00:58] and Ruby?
[00:59] Like that's what I want to do, that's my native language.
[01:02] I don't want to have to write Python, right?
[01:04] Can I get a clap for that? I kept hearing that.
[01:06] Yeah, I don't want to have to write Python every
[01:09] time I want to do something in my main coding language.
[01:12] So I want to use Ruby.
[01:14] So this talk is going to walk
[01:17] through an entire project that I did
[01:19] and I have a gift for you all at the end.
[01:21] But I'm going to walk through the entire project
[01:24] and kind of present to you how to go
[01:27] about doing machine learning projects
[01:29] because I want you to be able to do it
[01:31] in Ruby and not have to learn a bunch of Python.
[01:34] So to kind of start us off
[01:36] this is sort of like the agenda for the talk.
[01:38] So I'm going to set up a problem.
[01:41] We're going to collect a little bit of data,
[01:43] we're going to do some data preparation.
[01:44] We're going to train our own machine learning model.
[01:47] So for many of you,
[01:48] this is going to be the first time doing that,
[01:50] and then we're going to make some predictions.
[01:52] So before we get to that, I want to talk about two things.
[01:56] I want to talk about tools and I want to talk about libraries.
[02:01] So as developers, one of our main tools is our code editor.
[02:08] But when you're doing data science work,
[02:12] one of the main tools is going to be Jupyter Notebooks,
[02:15] which is a program that lets you kind of
[02:18] build out your data science project
[02:21] in a way that's like re-shareable.
[02:23] And you can also execute code so it kind of runs top down.
[02:27] So this is an example, sorry.
[02:30] And so traditionally, a Jupyter notebook
[02:33] has Python in it.
[02:35] So you write your Python code in the notebook
[02:37] and then you can execute the code in the notebook.
[02:42] I'll click down so you can actually see the notebook there.
[02:45] And so it'll be Python,
[02:46] but here we're going to execute Ruby in it.
[02:49] And we're using a tool called iRuby to do that.
[02:54] So here we're doing some basic addition
[02:57] in Ruby and then we have
[03:00] I defined a method that just prints hello world.
[03:03] And you can just do that sequentially.
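(For reference, the notebook cells described here amount to plain Ruby that iRuby runs cell by cell:)

```ruby
# Cell 1: some basic addition
sum = 1 + 2
sum # => 3

# Cell 2: define a method that just prints hello world
def hello_world
  puts 'Hello, world!'
end

# Cell 3: cells run top down, so later cells can call
# methods defined in earlier ones
hello_world
```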
[03:07] In Jupyter notebook,
[03:08] you can also have some really cool visualization tools.
[03:10] So here this is the only bit of like Python code
[03:14] that will be in this presentation
[03:15] but I'm calling a Python library that does
[03:19] some visualization stuff.
[03:20] There's also some visualization Ruby gems as well.
[03:24] But I just want to show you like, hey,
[03:25] you can have some visualizations
[03:27] so you can kind of download this file
[03:30] and like show your business stakeholders
[03:32] and kind of show them a whole project that you've done.
[03:36] So next I want to talk a little bit about libraries.
[03:39] So for this machine learning project
[03:42] I'm using three libraries, one's called numo
[03:45] one's called daru and one's called rumale.
[03:49] Numo is a numerical, N-dimensional array class
[03:52] for fast data processing and easy manipulation.
[03:56] Daru is a gem that gives you a data structure
[04:01] called a data frame, which allows you to do analysis
[04:05] manipulation and visualization in data.
[04:07] So I'm not sure how familiar you are with Python
[04:11] but numo and daru have synonymous Python libraries,
[04:17] I guess, called NumPy and Pandas.
[04:20] So those are replacements for those.
[04:22] And then rumale is a gem that allows you to
[04:28] use different machine learning algorithms.
[04:31] So first we're going to set up the problem.
[04:34] So I want to predict the weather
[04:36] 'cause I think that's super cool.
[04:38] And specifically I want to predict the max temperature
[04:42] for a weather data set.
[04:43] So first we need to collect our data.
[04:47] So I went online and I found a data set
[04:50] from the National Centers for Environmental Information
[04:53] and they have a ton of weather data that you can download,
[04:57] you can use.
[05:00] And specifically I downloaded the weather data set
[05:03] for the Atlanta airport and it goes back
[05:06] to like 1960 something 'cause I thought it'd be cool
[05:11] like we're all in Atlanta and so let's predict
[05:16] the max temperature for some given input.
[05:20] The next step is data preparation.
[05:23] So now that we have our data, we're going to prepare it
[05:26] and we're going to import that data into our Jupyter notebook.
[05:31] And then we're going to note the rows in the columns.
[05:33] We'll see that there's about 20,000 rows
[05:36] and there's like 48 columns there.
[05:39] And the next line is just duplicating that data.
[05:44] So when you're working on a data science project,
[05:46] you want to pull in your data
[05:47] and there's going to be a lot of changes
[05:49] that you're going to make to that data.
[05:51] You don't want to actually change the data that
[05:52] you're importing 'cause you might have to reference
[05:54] that later.
[05:55] So say you have 48 columns and you drop a bunch
[05:58] of them and you only have five columns left
[06:00] you might want to reference those other columns
[06:02] but you drop them so they're not there.
[06:04] So I'm making a duplication that I can work
[06:06] off of and continue working on my project.
[06:09] So we're actually going to do that.
[06:10] I'm going to drop, go up for a minute,
[06:12] I'm going to drop all the rows,
[06:15] sorry, all the columns except five.
[06:18] So the data set that I got from the website shows
[06:24] that there are like five core values that they define.
[06:26] So I'm going to use these five core values
[06:29] to kind of simplify my project for this example.
[06:31] And I'm going to use these as the predictors to predict
[06:34] the future max temperature.
[06:37] So I'm going to go ahead,
[06:40] I'm going to create a new data frame,
[06:42] and I'm going to drop all the other columns
[06:44] and then this dot head method will just look
[06:46] at the top five rows in that data frame and you can,
[06:51] it's basically like the CSV file's columns and data.
[06:56] So it just kind of gives you
[06:57] like an overview of what the data looks like.
[07:02] And as part of this data cleanup,
[07:05] this data processing,
[07:07] a lot of times we're going to have to clean up the data.
[07:09] So you can't just use the data that you get
[07:12] and just throw it into a machine learning algorithm.
[07:14] There's a lot more work you have to do
[07:16] and that work takes a lot of time.
[07:18] So sometimes you have to manage or handle missing values.
[07:22] You'll have like nils, you have to decide
[07:25] well, do I just want to make the nil a zero
[07:27] but that's going to really throw off my data set, right?
[07:29] Or do I want to just drop the nil rows or do
[07:32] I want to try to do something called imputing
[07:35] where you can take an average
[07:36] of all the values for like that specific column
[07:39] and just like drop it in there.
[07:41] There's a lot of nuance there and you're going to
[07:43] have to decide how you're going to want to handle it.
[07:45] Sometimes you'll have outliers that are going to
[07:47] like really throw off your data set.
[07:49] So you might have one through a hundred
[07:50] and then you might have a million
[07:51] and that's going to affect how your model performs
[07:54] and it's going to over-optimize
[07:56] for this outlier data point.
[07:59] And you don't want that.
[08:00] You're going to have to handle that.
[08:01] Sometimes you're going to have malformed data,
[08:03] you're going to have misspellings.
[08:05] Sometimes you're going to have duplicate rows
[08:08] in your data as well
[08:09] and you're going to have to handle that as well.
[08:11] So this is me.
[08:17] I had to clean up some data using the daru library.
[08:20] So this is actually, I'm just dropping the nil rows.
[08:25] The code here is a little bit gnarly
[08:27] and I'm not very happy with it.
[08:29] There's different data frame libraries
[08:32] that just give you a really nice function
[08:33] that you can just drop those nil values.
[08:37] But I didn't use those this time.
[08:40] Whew, now I'm tired.
[08:42] Data cleaning turns out to be a lot of work.
[08:45] So much so that there's a name for it.
[08:48] It's the 80/20 rule for data scientists,
[08:53] and basically it says that you're going to spend
[08:55] about 80% of your time cleaning up the data,
[08:59] doing all that data manipulation stuff that I talked about.
[09:03] And you're going to spend about 20%
[09:04] of your time like building models
[09:06] and trying different models and doing everything else.
[09:09] So it's very time consuming, it's very tedious.
[09:12] But the good thing is we're already 80% of the way there.
[09:15] So the last 20% we're going to train our model
[09:18] and make those predictions.
[09:22] So as we go about training our models,
[09:26] we're going to have to split the dataset
[09:27] before we're able to train.
[09:30] So about 80% of the dataset is going to be used
[09:34] for training data and about 20%
[09:36] of the dataset is going to be used for testing.
[09:40] So what's the difference
[09:41] between the training dataset and the testing dataset?
[09:44] Well, so the training data is going to be used
[09:49] to train the model and then you're going to need a way
[09:52] to like validate that it works
[09:55] and you're going to want some data points to like put
[09:57] into your model to kind of test it.
[09:59] So that's what the testing dataset is for.
[10:06] So what I did here is I just split the data into two.
[10:12] This looks really complicated.
[10:13] I basically took the first 80%
[10:15] of the rows and I said that's going to be my training dataset.
[10:21] And I said the last 20% are just going to be
[10:23] my testing dataset.
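(Conceptually the split is just slicing the rows; in plain Ruby, with a stand-in array:)

```ruby
rows = (1..100).to_a # stand-in for the prepared dataset

split_at = (rows.size * 0.8).floor # the 80% boundary
train = rows[0...split_at]         # first 80% for training
test  = rows[split_at..]           # last 20% for testing

train.size # => 80
test.size  # => 20
```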
[10:26] And since we're using linear regression
[10:28] I want to talk a little bit about that.
[10:29] That's like the model I chose.
[10:31] So there's a lot of different models that you can choose.
[10:33] I'm picking linear regression
[10:34] 'cause I think it's a little bit simpler to understand
[10:37] 'cause I think some of us have had like maybe exposure
[10:40] to some algebra.
[10:42] You might, I guess I'll read this.
[10:46] Linear regression is an attempt to model the relationship
[10:49] between two variables
[10:50] by fitting a linear equation to the observed data.
[10:55] So you may remember this equation.
[10:57] Does anybody know what this is?
[10:59] - [Speaker] Slope?
[11:00] - If you don't, your teachers yelled at you, failed you.
[11:03] No, just kidding. Yeah, slope.
[11:04] This is the equation for a line.
[11:08] So y equals mx plus b; someone said slope.
[11:11] So the m is the slope and the b is the y intercept.
[11:15] So I prefer it written this way
[11:21] because it kind of pulls upon sort of our intuition
[11:25] as developers where we program with functions and methods
[11:29] and I see f of x equals mx plus b, that's just a function
[11:34] and I can put some x value in which, oh, our x values
[11:37] turn out to be all the data that we want to use to
[11:40] predict some other value, which is our y value.
[11:44] So if you can imagine like all the data that we have
[11:48] we're going to put into that x and out
[11:50] it's going to pop some prediction.
[11:52] For this example, it's technically multiple linear regression
[11:56] 'cause we have multiple x values, not just one.
[11:59] And those are going to be the column, the five data points
[12:02] that we kind of separated the precipitation
[12:06] and the snowfall and things like that.
[12:08] We're going to use those to predict the max temperature.
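(For the single-predictor case, the best-fit m and b can be computed directly with ordinary least squares; here's a tiny pure-Ruby sketch using made-up points that lie exactly on y = 2x + 1. The multiple-x version is what the library handles for us:)

```ruby
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0] # points on the line y = 2x + 1

x_mean = xs.sum / xs.size
y_mean = ys.sum / ys.size

# slope m = covariance(x, y) / variance(x)
numerator   = xs.zip(ys).sum { |x, y| (x - x_mean) * (y - y_mean) }
denominator = xs.sum { |x| (x - x_mean)**2 }

m = numerator / denominator # => 2.0
b = y_mean - m * x_mean     # => 1.0
```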
[12:13] So imagine
[12:15] and this isn't actually what our data set looks like
[12:17] but imagine our data set, if we plotted it
[12:20] it looks sort of like this, right?
[12:23] It kind of has this like linear pattern.
[12:26] So if we're doing a linear regression model
[12:30] and we want to plot a line,
[12:33] we have to plot the line somehow through this data
[12:37] so that it is close to all the different points.
[12:41] And that's not really necessarily the best way to do this.
[12:45] Like there are other machine learning models
[12:48] that can kind of trace through
[12:51] all the different data points, you know,
[12:53] and have really fine tuned predictions.
[12:56] So I'm going to leave that to all of you to kind of look
[12:59] at my project and tear it apart and be like, oh,
[13:02] I found a different model that works better
[13:04] than what the presenter presented.
[13:06] So, and I would love for you to message me
[13:08] and throw it in my face and say like, look what I did.
[13:11] I would love that.
[13:14] So this line, this straight line that minimizes the distance
[13:19] between all the data points,
[13:21] this is called the best-fit line.
[13:26] So in order to build the linear regression model,
[13:30] it's super, super simple.
[13:32] So this is basically all the code
[13:35] that you need to train your model.
[13:37] So you're taking all of those x values, the precipitation,
[13:40] the minimum temperature, I think it was like the snowfall
[13:45] and you're shoving them into the x value and then the
[13:50] the model's going to fit your data
[13:54] and produce a linear regression model
[13:56] and you're going to be able to use that to do predictions.
[13:59] So we have our model, we're done.
[14:02] Okay, now I can go home
[14:05] hope you had a great Rails Conf, see you next year.
[14:09] So that's basically it.
[14:11] Now where does Rails kind of come
[14:14] into this, building applications?
[14:16] You showed me this really interesting project
[14:18] and you did a bunch of stuff like what does that mean?
[14:20] Like how can I use this in my app?
[14:22] Well, we're going to use it to make predictions
[14:26] and this is the line of code that we're going to,
[14:32] we have our test data and we're going to put our test data
[14:35] into this predict function and it's going to pop
[14:40] out that y value, remember y equals mx plus b
[14:45] or the way I like to write it is f of x equals mx plus b.
[14:48] So we shove all those x values, our predictors, into it,
[14:52] those are called independent variables.
[14:55] And then out pops a dependent variable which is
[14:58] the prediction that we want.
[15:00] So theoretically if you're writing a Rails app
[15:04] and you did all these steps and then you have your model,
[15:07] you can wrap this code right here in some sort of method
[15:12] and call it anytime you wanted to predict something
[15:15] for your users on your Rails app.
[15:17] So I think that's really nifty.
[15:21] So we set up our problem, we collected some data,
[15:25] we have some data preparation
[15:27] we trained the model and we made some predictions.
[15:30] So really that's all there is to it.
[15:36] So I just want to thank some folks,
[15:38] I want to thank Test Double.
[15:40] Andrew Kane is someone who, he's been working a lot
[15:46] on machine learning in the Ruby space.
[15:49] You should check out his blog.
[15:51] And then I've been taking some courses
[15:53] from Great Learning just to kind of build
[15:55] out Python projects and like been trying to
[16:00] figure out how to adapt it towards Ruby.
[16:03] So I also have a present for you.
[16:06] I told you all
[16:07] I have something special to kind of give away.
[16:09] So I published the project onto GitHub
[16:14] and you can kind of look at it.
[16:16] So my goal ultimately for this talk is
[16:19] that people can download this, look at it, tweak it,
[16:22] and kind of use their own data sets that they have
[16:26] in their work or just fun data sets that
[16:29] they found, and to see that like,
[16:32] this project that I have really isn't that complicated
[16:36] and I'm hoping that you'll use your own data sets
[16:40] and tweak it and do something interesting with it
[16:43] because ultimately, the only way we're going to
[16:46] get more machine learning into Ruby and Rails
[16:51] is if all of you start working on projects
[16:53] and you really don't need some sort of PhD to do this stuff.
[16:59] I think there's like academic side
[17:01] and there's a place for that.
[17:02] But then there's a place for all of us who just
[17:04] want to like tinker around and play with things.
[17:06] So I hope you help me out with that.
[17:11] And that's all I had.
[17:13] So Test Double, we have an email list
[17:16] that you can sign up for; I wanted you to check that out.
[17:20] And I'm just going to leave a little bit of time for like
[17:22] Q and A and questions as I know a lot of folks have time
[17:26] for that or are interested.
[17:27] So I see three here, I don't know who was first?
[17:30] 1, 2, 3
[17:33] - [Audience Member] When we were doing the prediction
[17:35] was the x value
[17:37] like the day of the year
[17:38] and the y value is the temperature? I guess...
[17:42] - Is that the prediction line?
[17:45] Okay. Yeah, so the x value is all the values
[17:48] that we kind of set aside.
[17:50] So that would've been the
[17:53] let's see if I go here, that would've been these values.
[18:00] So it's the precipitation, snowfall,
[18:02] snow depth, and the minimum and maximum temperature.
[18:07] You could use the minimum and maximum temperature,
[18:08] but basically, I just reduced the number
[18:12] of parameters that we're using to predict
[18:14] just to kind of simplify it.
[18:16] There's not as many things there
[18:17] and you can see it kind of in the project.
[18:20] I think one piece I sort of missed was
[18:22] like I was also using the maximum temperature to
[18:25] predict the future maximum temperature
[18:28] for like the next day.
[18:29] So I kind of took the like,
[18:32] for today, there's a max temperature
[18:35] for today and then tomorrow there will be a max temperature.
[18:38] But I kind of like took tomorrow's max temperature
[18:40] from the historical data
[18:42] and like kind of moved it upward in the data set.
[18:45] You'll see that in the project
[18:46] and I'm using the max temperature from the day
[18:48] before to predict the max temperature for the next day.
[18:51] 'Cause those seem to be like slightly correlated.
[18:53] Like if it's 60 degrees today,
[18:55] it's either going to be like, you know,
[18:57] maybe 60 or 62 the next day.
[18:59] Also, this isn't like a perfect science and full disclaimer
[19:02] like a lot of the things that I did here were to
[19:06] kind of present like what a project would look like.
[19:09] I would not use linear regression model
[19:12] for like a real like forecasting sort of thing like this,
[19:16] I'd use something different.
[19:18] There's different things you can use.
[19:21] - [Audience Member] Can we still use that?
[19:22] Yeah, yeah, so the question is like how
[19:24] do you actually like use the model?
[19:26] Like, you know, is there a way to like persist it?
[19:29] If I recall correctly
[19:31] like I haven't gotten to that part yet.
[19:32] I think there's a way to like
[19:33] there's definitely a way to export the model.
[19:38] But in your app also, if you train the model
[19:41] and you're using it, it's going to be the same model.
[19:45] You're not going to have to retrain it.
[19:46] The only time you're going to have to retrain it is, you know,
[19:48] you deploy the model, you get new data,
[19:51] you want to optimize your model to be a little bit better
[19:53] and you can kind of like redeploy it.
[19:56] But yeah, I probably need to look more
[20:00] into like the exporting and actually like inputting
[20:03] into like the Rails app, but it shouldn't be too hard.
[20:06] There's like a way to export the model and things like that.
[20:08] So a lot of that's going to be in
[20:10] like the daru documentation and Andrew Kane site as well.
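(On persisting: a common approach in the Ruby ML ecosystem is Ruby's built-in Marshal; here's a sketch with a stand-in object and a hypothetical file name, since whether your specific model serializes cleanly is worth verifying against the library docs:)

```ruby
# Stand-in for a trained model; any Marshal-able object works the same way
model = { weights: [2.0, 1.0] }

# Save the trained model once...
File.binwrite('max_temp_model.dat', Marshal.dump(model))

# ...then load it later (e.g. at app boot) without retraining
restored = Marshal.load(File.binread('max_temp_model.dat'))
restored == model # => true
```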
[20:17] I think I'm at time, but thank you.
[20:20] If you have any questions, you can talk to me after.