A Long Overdue Galvanize Update

It’s been a while since I blogged about Galvanize, and a lot has happened in that time!

In the midst of working on my capstone project, I was offered a position as a Data Scientist in Residence (DSR) with Galvanize upon graduation. I’ll be serving as something like a TA for the new cohort of future data scientists. Having just been through the program myself, I (along with my fellow DSR) am well positioned to help these students through the DSI. I’m thrilled to have a (paid) position where I can continue to deepen my data science skills. It’s a 3-month contract, so I’m mindful of continuing to look for what comes next.

blackboard with math

Moar math!

I’ve heard from former DSRs that they learned just as much as DSRs as they had as students, if not more. I’m excited to see all the course content another time, not to mention learning through teaching! You really have to understand a topic to explain it to someone else. I can tell I’ll also get a lot of practice reading other people’s code, which I understand is how you spend much of your time as a data scientist in industry.

This also meant that my last weeks in the DSI were a bit more relaxed. As a rule, DSRs don’t participate in hiring day with their cohort (since we have jobs for the next few months), which also took some pressure off my project. But don’t worry, I did finish it! I’ll post a link once I have the website up and running again; I took it down for now to save some money on hosting. I observed hiring day, where my fellow cohort-mates knocked it out of the park with their project presentations.

The final week of the program was focused on technical interviews– and was quite intimidating. Technical interviews are unlike any other interview process I’ve experienced, and many questions are pretty challenging. Whiteboarding is also a skill of its own, and one I know I’ll need to get more comfortable with in the future.

I’m hopeful that I’ll be able to keep blogging my way through my continued time with Galvanize– I know I dropped off a bit there towards the end, but here’s to re-establishing good habits!

Entering the Project Phase

We’ve finished up the structured curriculum at Galvanize and are now working on our capstone projects. I did a bit of flip-flopping in picking a project topic. While I’m (obviously) really interested in healthcare applications of data science, there are fewer open data sources available in that space. I poked around a few available datasets but ultimately decided on a non-healthcare project that I think will challenge me more and give me more opportunity to show off my newly developed skills. Also, it’s food-related, which is probably right up there with healthcare on my interests list.

table set for brunch with waffles on plate

Yum!

For the past week, I worked on building a cross-domain recommender system that will allow a user to input one of their favorite restaurants in SF and receive a recipe they might enjoy. So far, I have a simple model that computes the similarity between the given restaurant’s menu and all the recipes in my database, based on text analysis of the menu’s item descriptions and the recipes’ ingredient lists. I also have a Flask app up and running (only on my local machine so far, so no links to it yet, but soon!). I’m pretty happy with my progress this week, as I still have the upcoming week to play around with the model. I’d like to try some more sophisticated text mining techniques that will hopefully result in better recommendations.
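
If you’re curious what that core similarity computation looks like, here’s a minimal sketch of the idea using scikit-learn’s TF-IDF vectorizer and cosine similarity. (The menu and recipe text below are made up, and my real pipeline involves more cleaning steps.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up data: one text "document" for the restaurant's menu,
# and one per candidate recipe
menu = "crispy pork belly with pickled vegetables and fresh herbs"
recipes = [
    "pork shoulder, vinegar, cabbage, carrots, cilantro",
    "flour, sugar, butter, eggs, vanilla extract",
    "tofu, soy sauce, ginger, scallions, jasmine rice",
]

# Learn one shared vocabulary across the menu and all recipes
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform([menu] + recipes)

# Compare the menu (row 0) against every recipe (rows 1 and up)
scores = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
best = scores.argmax()
print(f"Recommended recipe #{best} (similarity {scores[best]:.2f})")
```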

If you’re interested in building simple web apps or websites, I definitely recommend checking out Bootstrap templates. See Start Bootstrap for some free, downloadable templates. With just a tiny bit of HTML and CSS knowledge, you can customize the templates and make your site look really slick. It took me maybe an hour to go from a barebones, text-only website to a nicely formatted, image-rich design. Can’t wait to share the finished product!

Halfway there!

I’m officially halfway through the Data Science Immersive (DSI) program at Galvanize. I actually had the past week off from classes as a chance to review and solidify what we’ve learned so far, and to start thinking about what I’d like to do for my capstone project. The capstone project serves as (a start to) a data science portfolio. We will eventually present our projects to recruiters from various tech companies, so it’s really a chance to show off what I’ve learned over the course of the DSI.

portfolio

I’m mulling over a few different ideas. Ideally, I’d love to do a health-themed project, as that’s a topic I’m both interested in and have subject matter expertise in. Unfortunately, finding good datasets can be a challenge. I’d previously been advised to get experience working with patient-level or claims data, but there are obviously fewer open sources of patient data, since it’s a pretty clear privacy issue. CMS (Centers for Medicare & Medicaid Services) does have some limited, de-identified datasets available, so perhaps I can find an interesting question to answer with that data. It’s funny: it feels like a bit of a backwards approach (starting with a dataset instead of a question), but it’s reasonable given our time constraints. We have about two and a half weeks to complete our projects, so we can’t get too hung up on building unique datasets.


Prepping for Simple Image Recognition

This week at Galvanize, we covered a variety of topics, from web scraping to clustering techniques. I want to focus on dimensionality reduction today, as it’s a challenging but crucial technique for working with real-world data.

This week also marked our first foray into working with text and image data. Up to this point, we’d always started from nice tabular, numerical data. Machine learning algorithms really only understand numbers, so we must first translate our text or images into something the machine understands. We want to give our algorithm a feature matrix– really, just something resembling a spreadsheet, with our features as columns across the top and each data point as a row in the spreadsheet. Each ‘cell’ would then be a number representing that data point’s value for that feature. How would you turn an image into such a table?

It turns out you break each image down into pixels and assign each pixel a value corresponding to its shade. Each pixel location becomes a feature. For grayscale images, the numeric value in each row of our table is how light or dark that specific pixel is in that row’s image, usually on a scale from 0 (black) to 255 (white). You can imagine this turns images into huge amounts of data: for one little 8×8 pixel image, you now have 64 numerical features.
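
To make that concrete, here’s a minimal sketch using scikit-learn’s built-in digits dataset. (Note that its little 8×8 images happen to use shades 0–16 rather than 0–255.)

```python
from sklearn.datasets import load_digits

digits = load_digits()      # 8x8 grayscale images of handwritten digits
image = digits.images[0]    # shape (8, 8): a grid of pixel shades (0-16 here)
row = image.ravel()         # shape (64,): one row of our feature matrix

print(image.shape, "->", row.shape)  # (8, 8) -> (64,)
print(digits.data.shape)             # (1797, 64): every image, pre-flattened
```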

Text and image data quickly snowball into thousands of features, which is a problem for some of our models. In fact, it’s such a problem that it’s known as the “curse of dimensionality.” To address this, there are methods to identify and extract the most important components, or transformed features, and use those in your model.

MNIST, a classic machine learning dataset for image recognition.

We illustrated this with the MNIST dataset, a classic dataset for image recognition. MNIST is a bunch of handwritten digits, as seen above in a sample of 100 such images. It’s easy for us to look at any given image and recognize it as a 2 or a 4, but how could we train a computer to do this?

You might recognize this as a classification problem (is the image class ‘2’, class ‘3’, class ‘4’, etc.?). Our goal in one exercise this week was to perform just the pre-processing required before one could use a classifier. We used a dimensionality reduction technique called principal component analysis (PCA) to pull out the most important transformed features, or components, from the dataset.

The image below shows what happens when you project the transformed dataset (here using only 0s–5s) down into 2 dimensions, so it can be easily visualized. Each point is one image, colored by its true label, or class. The x- and y-axes are transformed features derived from our original feature matrix of pixels. One drawback of PCA is that the transformation makes the features challenging to interpret: they are no longer the columns from our original feature matrix, but some weird combination of them that we can’t easily name.

mnist

After PCA, the data projected into 2 dimensions

Already, you’ll notice that the 4s are showing up near each other, the 0s are grouped away from the 4s with little overlap, while the 2s and 3s overlap a lot; those digits are visually much more similar than a 0 and a 4, no? If we were to cluster this dataset, you can imagine getting pretty decent results from using only these first 2 principal components.
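
If you want to play with this yourself, here’s a minimal sketch along the same lines, using scikit-learn’s bundled 8×8 digits dataset (a small stand-in for the full MNIST) restricted to the digits 0–5:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits(n_class=6)       # only the digits 0-5
X, y = digits.data, digits.target     # X: (n_samples, 64) pixel features

# Project the 64 pixel features down to the top 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Scatter the projection, colored by each image's true digit label
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=10)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.colorbar(label="digit")
plt.show()
```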

Applying Data Science to Business Problems

At this point, I am officially 1/3 of the way through the Galvanize Data Science Immersive. It’s amazing to think about how much I’ve learned in just a few weeks. My programming skills are certainly leaps and bounds above where they were when I started, in large part due to spending hours coding each and every day. Practice really does pay off!

We spent this week building on the algorithms we learned last week (mainly decision trees). We learned how to make better predictions by combining multiple models into what are called ensemble methods: random forests and bagged or boosted trees. While I won’t delve into the details at this point, the big picture is that adding up so-called “weak learners” (that is, models that are only slightly better than random guessing) lets you emerge with a better-performing predictive model.
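
As a tiny illustration of that idea (on a made-up toy dataset, not anything from class), you can compare a single weak learner against a boosted ensemble of many of them:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A made-up toy dataset, just to compare the two approaches
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A depth-1 tree (a "decision stump") is a classic weak learner
stump = DecisionTreeClassifier(max_depth=1)

# AdaBoost's default base learner is exactly such a stump; boosting
# adds up a couple hundred of them into one stronger model
ensemble = AdaBoostClassifier(n_estimators=200, random_state=0)

print("stump:   ", cross_val_score(stump, X, y).mean())
print("ensemble:", cross_val_score(ensemble, X, y).mean())
```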

A piece of advice I’ve gleaned, relevant for anyone interested in getting into data science: a solid understanding of linear algebra will help you when it comes to implementing machine learning algorithms. Thinking about the shape of your data at every step can save you a lot of painful debugging. You can of course use existing Python libraries like scikit-learn that will take care of much of this for you, but to really understand what’s going on under the hood, matrix multiplication (and also some calculus) is very helpful.
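
Here’s a toy example of what I mean by thinking about shapes, with random numbers standing in for real data:

```python
import numpy as np

n, d = 100, 5              # 100 observations, 5 features
X = np.random.rand(n, d)   # feature matrix: shape (n, d)
w = np.random.rand(d)      # one weight per feature: shape (d,)

y_hat = X @ w              # (n, d) @ (d,) -> (n,): one prediction per row
assert y_hat.shape == (n,)

# Getting the order wrong fails loudly instead of silently:
# w @ X raises a ValueError, because (d,) and (n, d) don't align
```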

line graph of profits

Testing different classifiers on a churn problem

One highlight for me this week was applying what we’ve learned to a concrete business problem. We worked through an example of predicting churn for a telecommunications company, and then building profit curves for various approaches to the modeling problem. Basically, we assigned real costs or benefits to the model’s correct and incorrect predictions. I think it’s so important for data scientists to have insight into both the math/science and the business perspectives. Similarly, I’ve heard from data scientists working in industry that much of their job is communicating results to non-data scientists. The skill this requires is not to be overlooked.
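
To give a flavor of that kind of exercise, here’s a minimal sketch of a profit calculation: score everyone with a model, intervene on those above a threshold, and tally the dollar value of right and wrong calls at each threshold. (All the dollar figures and probabilities below are invented, not from our assignment.)

```python
import numpy as np

def profit(y_true, y_prob, threshold, benefit_tp=50, cost_fp=-10):
    """Profit if we intervene on everyone scored above the threshold.

    Invented economics: retaining a true churner is worth $50,
    bothering a happy customer costs $10, doing nothing costs $0.
    """
    y_pred = y_prob >= threshold
    tp = np.sum(y_pred & (y_true == 1))  # churners we caught
    fp = np.sum(y_pred & (y_true == 0))  # happy customers we bothered
    return tp * benefit_tp + fp * cost_fp

# Toy labels and predicted churn probabilities
y_true = np.array([1, 0, 1, 0, 0, 1])
y_prob = np.array([0.9, 0.4, 0.7, 0.2, 0.6, 0.8])

# Sweeping the threshold traces out the profit curve for one model
for t in np.linspace(0, 1, 5):
    print(f"threshold {t:.2f}: profit ${profit(y_true, y_prob, t)}")
```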


Making Predictions from Data

This week at Galvanize we are truly starting to dig into the fun part of data science — making data-driven predictions! We covered some simple and common regression and classification models– linear and logistic regression, k-nearest neighbors, and decision trees so far. Lost yet? I’ll try to break it down.

Regression: trying to predict a numeric value for a given observation. Say I know the income, profession, and credit limit of a person. I might predict their average credit card balance (a dollar amount) based on that information.

Classification: trying to predict which category a specific observation falls into. For a silly example, if you know the height and weight of an animal, you might predict if it is a dog or a horse. More realistically, when you apply for a loan, a bank is likely to look at your income, profession, credit rating, etc. to predict if you are likely to default on your loan (Yes/No).
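
In scikit-learn, the two look almost identical in code; the difference is whether the model predicts a number or a category. A quick sketch with made-up numbers:

```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Regression: (income, credit limit) -> average card balance in dollars
X_reg = [[40_000, 3_000], [85_000, 12_000], [120_000, 20_000]]
balances = [400, 1_500, 2_600]
reg = LinearRegression().fit(X_reg, balances)
print(reg.predict([[60_000, 8_000]]))   # some dollar amount

# Classification: (height cm, weight kg) -> 'dog' or 'horse'
X_clf = [[30, 8], [35, 12], [160, 500], [170, 550]]
labels = ["dog", "dog", "horse", "horse"]
clf = KNeighborsClassifier(n_neighbors=1).fit(X_clf, labels)
print(clf.predict([[150, 480]]))        # ['horse']
```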

pug dog in sweater

Dogs are more likely to wear sweaters.

You make either type of prediction by first building a model (maybe a mathematical equation) based on some set of training data. Training data are sets of observations (e.g., people or animals) for which you know the answer. Once you’ve built a model that you like, you test it against new data that the model hasn’t seen before. You still know the answers for this data, so you can compare what your model predicted to the true answer for each observation. This gives you a sense of how good your model is at predicting the right answer.
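
That train-then-test workflow is only a few lines with scikit-learn. A minimal sketch on one of its built-in datasets:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the labeled data; the model never sees it in training
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Compare predictions on the held-out set against the known answers
print("accuracy on unseen data:", model.score(X_test, y_test))
```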

Once you have a model that’s performing well, you could release it into the wild and try to make predictions on brand new data. Hopefully, you can learn some actionable information from your model–  like whether you should pet or ride this animal, or whether to extend a loan to this person.

Google’s DeepMind and Healthcare

I’d filed away a Fast Company article to read later, its headline proclaiming that Google would be applying artificial intelligence to healthcare problems. Upon returning to read it, I was a bit disappointed to see how speculative the article was. Basically, Google acquired a company with a messaging app for hospital staff that streamlines communication. It’s been hinted that Google might apply artificial intelligence tools to help identify patients at risk of kidney failure whom a clinician might not deem at risk.

stethoscope and smartphone

Joining forces?

Even if machine learning were applied to predict which patients might be at risk of kidney failure, that’s not really the end of the story. Once patients are identified, there have to be effective interventions to help them, and perhaps more importantly, the entire system where this is occurring needs to allow for these predictions and interventions to take place. Looking at the big picture, is identification of at-risk patients and communication among clinicians the true ‘problem’ that needs to be solved? Or is there a different systemic problem or bottleneck that is truly responsible for delaying care? While reading the article, I found myself nodding in agreement with this excerpt:

[S]ome health experts fear that this kind of technology is just putting a Band-Aid on a broken system… “Some people have this utopian plan that you can sprinkle some AI on a broken health system and make things better,” says Jordan Shlain, a Bay Area-based doctor and entrepreneur who has advised the NHS.

Overall, I’m really excited about the idea of using data and machine learning to improve care, but it’s important to be realistic about where these tools can help. I think the promise to fix “broken systems” is overinflated. Artificial intelligence might help us identify at-risk patients, make better diagnoses, or select specific treatment plans, but at the end of the day, healthcare systems are built by and made up of people– and I’m not sure machines can fix those systems.

P(late post | busy weekend) = 1.0

Another week done at Galvanize! It’s been a full week and a busy weekend, so I’m posting a bit later than I’d like. Also, my jokes are getting nerdier (see this post’s title), though I’m OK with that. This week was all about…

Statistics

Photo credit: LendingMemo

…Statistics! I was excited about this since I’ve had some stats classes in the past. My classes had generally focused on frequentist statistics, which is just one approach to statistical thinking. I’d previously learned basic inferential statistics and hypothesis testing, so I was (nerdily) most excited to learn more about Bayesian statistics. I’ve mostly approached statistics within the context of public health research or drug development, so thinking about the parallels to, say, web traffic was new to me.

Some of the highlights for me this week were learning about bootstrapping, getting some practice with Bayesian methods and calculating Bayesian posteriors, and learning about different approaches to A/B testing. The final assignment of the week was to work on the “multi-armed bandit” problem, where we explored different ways that one might test versions of a webpage to determine which generates the most clicks. It’s pretty cool to see how you can set up an experiment that will automatically converge on the version with the best click-through rate… assuming you’ve implemented your algorithms correctly.
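
For flavor, here’s a minimal sketch of one classic strategy for this problem, Thompson sampling: each version of the page keeps a Beta posterior over its click-through rate, and you serve whichever version’s posterior sample comes out highest. (The “true” rates below are invented for the simulation.)

```python
import random

true_rates = [0.04, 0.05, 0.07]   # invented CTRs; version 2 is secretly best
clicks = [0, 0, 0]                # clicks observed per version
views = [0, 0, 0]                 # times each version was shown

for _ in range(10_000):
    # Sample each version's CTR from its Beta(1 + clicks, 1 + misses) posterior
    samples = [random.betavariate(1 + clicks[i], 1 + views[i] - clicks[i])
               for i in range(3)]
    choice = samples.index(max(samples))

    # Show that version and observe a (simulated) click or no click
    views[choice] += 1
    clicks[choice] += random.random() < true_rates[choice]

print("views per version:", views)   # traffic converges on the best version
```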

I’ll also point out that each week, we work on 5 individual assignments and 5 pair assignments, where we practice pair programming with another person in our cohort. This has been a great way to 1) meet everyone in our cohort and 2) learn from each other. Each day, you might end up with someone who knows more or less about the particular topic than you, or sometimes the pair is pretty evenly matched. Either way, you definitely learn a lot from each other! I’ve enjoyed this aspect of the program, especially knowing that pair programming is a popular approach in industry.

The Adventure Begins!

My cohort and I celebrated surviving our first week of the Galvanize Data Science bootcamp! I know I was excited and nervous at the beginning of the week, not knowing how quickly we would move through topics, what the days would feel like, or who would be sitting in the classroom alongside me. It feels great to have answered many of those questions– especially meeting and getting to know my fellow bootcampers!

The focus this week was software engineering, including object-oriented programming and using a few key Python libraries. Big picture, I think my most important lesson learned is the importance of planning before starting to code. Thinking through the structure of how you want to address the problem and what you want to build really can make writing the actual code a lot smoother.

coffee and campfire

Lesson 1: There’s very little camping in bootcamp

One piece of advice that I am hoping to take to heart for the next 12 weeks is reflecting each day on “what do I know now that I didn’t know yesterday?” I’m not sure if I’ll keep documenting that here, but at least for week 1, I identified the following (sometimes small) victories:

Day 1: I used sys.argv with ease to create a program I could pass arguments to when calling it from the terminal.

Day 2: I felt my understanding of classes/OOP really grew today. I found the assignment to implement an interactive blackjack game really challenging but rewarding. At the end of the night, I had at least a basic, functional game up and running.

Day 3: I have a much better understanding of the different types of joins. These had definitely tripped me up before when using SQL.

Day 4: I now actually understand how to filter specific columns or rows from a Pandas DataFrame. (Pandas is a popular Python library for data analysis.) I’ve played around with Pandas before, but always used a lot of trial and error to select the column or row I care about. See the sketch below.
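
Since the Pandas one comes up constantly, here’s a minimal sketch of the selection patterns I mean (the DataFrame is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["waffles", "pho", "tacos"],
    "price": [9, 12, 3],
    "rating": [4.5, 4.8, 4.9],
})

cheap = df[df["price"] < 10]    # filter rows with a boolean mask
names = df["name"]              # select a single column (a Series)
subset = df.loc[df["rating"] > 4.6, ["name", "price"]]  # rows and columns at once
print(subset)
```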

Next week we move onto probability and statistics. I’ve taken a few statistics classes in the past, so I’m very interested to see how much is review and how much is new!

Why health tech needs MPHs

While reading this Fast Company article about health care tech companies, I was struck by the following quote:

“The tech community isn’t used to dealing with studies, FDA approval, publications, and reimbursement…[but] the tech community wants things to happen fast. Obviously that doesn’t work in health care.”

I believe new technology and innovative approaches will be a net positive for the health care industry, but I think this article highlights the need for health tech companies to listen to and learn from the current state of the industry. People with industry experience are going to be invaluable partners: yes, even experience in the very industry these companies are trying so desperately to disrupt.

One of the most important skills/values I learned during my MPH degree was the importance of an evidence base. If you’re going to set policies or recommendations that affect how thousands (or millions) of people receive care, you need to be as confident as possible that you’re recommending the right things. One professor’s quote that I’ve never forgotten: “If a doctor makes a mistake, he or she might be responsible for the death of the patient. When a public health professional makes a mistake, they could be responsible for thousands of deaths.”

No more snake oil please.

Motivated by that principle, my classmates and I spent our time learning how to collect and interpret data to find meaning; to evaluate new research by critically reading the methods sections of academic papers; to understand who sets medical practice standards, recommends preventive measures or screening, and monitors food safety, drugs, and devices; and to navigate some of the complexities of who pays for care and who determines what gets paid for.

I think all of the above skills and ways of thinking will be really useful to emerging health tech companies, and I hope they’ll value that input. Not to say they aren’t already, but based on the above article, it certainly seems like it could be time for more MPH graduates to migrate into health tech.