In our last adventure we used some Artificial Intelligence natural language modeling to predict 100+ start-ups which, based solely on their company categorization and descriptions, had profiles matching start-ups that succeeded in the past. It’s by no means a sure thing, but it’s an interesting data set. Can we narrow that list down?
We started our last session using BigML.com to do more traditional or “supervised” machine learning to find patterns in a list of several thousand funded companies listed at CrunchBase.com. Today we’ll go back to BigML.com to train a model on that dataset and then match its predictions against our same natural language dataset to see if we can find some commonalities.
Here’s a quick tip. For every test you do you should identify a challenge, an overview of a solution, and a specific predictive field or variable. Here are a few examples I’m working on for future projects:
- Challenge: Maximize the time of account reps and find the best leads for the sales team to go after.
Solution: Test to see if explicit data (form data they fill out) or implicit data (leads browsing the website) is a better predictor of successful sources.
Predictive field: Signed-up as a binary 1 or 0. Use 2 datasets with split testing to see best results.
- Challenge: Where should a political campaign best spend its resources?
Solution: Model data across a state’s congressional races. Find patterns of donations and expenditures to see which elements best predict a win.
Predictive field: Which categories of political expenditures (the money that campaigns spend to elect candidates) are the best predictors of success. Model to a binary 1 or 0 field around win or loss.
For the model currently upon us here’s the breakdown of what we’re trying to accomplish:
- Challenge: What are the characteristics of a successful acquisition for companies which have an investment of $2MM or more?
- Solution: Pull down data from CrunchBase.com. Model elements of those datasets, test and match against a dataset.
- Predictive Field: “Status”, modeled to “was acquired” or “closed”.
Take the datasets we worked with before, but this time combine them so that the “was acquired” and the “closed” data are in one dataset. From there you can choose “1-CLICK TRAINING | TEST” from the arrow pull-down on the dataset. This will automatically create 2 new datasets: one with a random 80% of the data and one with the remaining 20%. We’ll model on the 80% and then evaluate against the other 20% to check our modeling.
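If you want to see what that split amounts to outside of BigML, here’s a minimal sketch in plain Python. This is not BigML’s implementation; the function name `train_test_split`, the seed, and the placeholder rows are all assumptions made up for illustration.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for the combined "was acquired" + "closed" dataset.
companies = [{"name": f"company-{i}", "Status_no": i % 2} for i in range(10)]

train, test = train_test_split(companies)
print(len(train), len(test))
```

Fixing the seed means you get the same 80% and 20% every time you rerun the split, which makes later comparisons between models fair.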
Here is a list of the fields I pulled down from crunchbase.com for reference: Company Name, Category Groups, Headquarters, Location, Description, Crunchbase Rank, Founded Date, Years-til-acq, Number of Articles, Categories, Number of Founders, Number of Employees, Number of Funding Rounds, Last Funding Amount, Last Funding Type, Last Equity Funding Amount, Total Equity Funding Amount, Total Funding Amount, Last Funding Date, Last-fund-to-acq-yrs, Number of Lead Investors, Number of Investors, Status_no, Status
The fields in bold I had to produce with a little bit of math in Excel, and the Status_no field is a binary field derived from the Status field. That’s our target field.
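I did this step in Excel, but the same idea can be sketched in a few lines of Python. The mapping below assumes the Status values are exactly “was acquired” and “closed”; the function name `add_status_no` and the example rows are hypothetical.

```python
# Map the text status to the binary target: 1 = "was acquired", 0 = "closed".
STATUS_TO_BINARY = {"was acquired": 1, "closed": 0}


def add_status_no(rows):
    """Return copies of rows with a binary Status_no field derived from Status."""
    labeled = []
    for row in rows:
        status = row["Status"].strip().lower()
        if status not in STATUS_TO_BINARY:
            continue  # skip rows without a known outcome, e.g. "operating"
        labeled.append({**row, "Status_no": STATUS_TO_BINARY[status]})
    return labeled


# Hypothetical rows standing in for the two CrunchBase exports.
acquired = [{"Company Name": "Example A", "Status": "Was Acquired"}]
closed = [{"Company Name": "Example B", "Status": "Closed"}]

combined = add_status_no(acquired + closed)
print(combined)
```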
By creating the binary status field we can do some quick calculations to assess some correlations between simple fields. Below, the Y-axis represents this field with 1 = “was acquired” and 0 = “closed”. Let’s do some screenshot testing of some variables on the x-axis. Note the darker the color the more funding the start-up had.
Remember that the closer the Pearson and Spearman correlation numbers get to 1 (or -1), the stronger the association between the 2 variables; values near 0 mean little or no association. In our first case it appears that the “number of articles” published about a start-up company did not correlate with whether or not it was acquired:
Again, none of these data points seem to show a significant correlation. Of course, we could do all of this in Excel, but for the big crunching, to see patterns across all of these elements, we need machine learning.
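For anyone who wants to check those two correlation numbers by hand, here’s a rough sketch using only the standard library. Pearson measures linear association; Spearman is simply Pearson computed on the ranks of the values. The article counts and outcomes at the bottom are invented placeholders, not CrunchBase data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented example: article counts (x) against the binary Status_no (y).
articles = [3, 40, 12, 7, 55, 1, 20, 9]
status_no = [1, 0, 1, 0, 1, 0, 0, 1]
print(round(pearson(articles, status_no), 3), round(spearman(articles, status_no), 3))
```

With a binary y-axis like ours, both numbers tend to be modest even when a field matters, which is part of why the tree-based model can tell a different story than these simple graphs.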
Building a standard model from the dataset we can see some drill-down on these fields that has some meat to it:
Even though the number of employees had little impact in our simple correlation graph, it appears that it might be the most important factor when finding a path to predict an acquisition. We can also see that the number of investors might be a good predictive path.
Once we have this model down we can evaluate it against the 20% dataset which was created above.
The results have some good news and bad news. We’ll go into that at a later time, but for now we’re going to press on and run the model as a batch prediction against our fresh database of 1000 companies that are still in an “operating” status.
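If you’re curious what an evaluation boils down to, here’s a minimal sketch that compares predicted labels to the real ones. BigML reports its own set of measures; the function name `evaluate` and the toy labels below are made up for illustration and are not the results from this model.

```python
def evaluate(actual, predicted, positive=1):
    """Count hits and misses, then derive accuracy, precision, and recall."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Of the companies we predicted as acquired, how many really were?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the companies really acquired, how many did we catch?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy labels only: 1 = "was acquired", 0 = "closed".
actual = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

metrics = evaluate(actual, predicted)
print(metrics)
```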
Click on “Batch Prediction” and then download the results.
The model result I came back with found 142 companies with some prediction of success.
Let’s map that against last week’s list of 100+ companies chosen by the natural language approach:
- Aidoc https://www.crunchbase.com/organization/aidoc
- Artivatic Data Labs https://www.crunchbase.com/organization/artivatic-data-labs
- Cymulate https://www.crunchbase.com/organization/cymulate
- Element Data https://www.crunchbase.com/organization/element-data
- Finxact https://www.crunchbase.com/organization/finxact
- Idwall https://www.crunchbase.com/organization/idwall
- Infonesia https://www.crunchbase.com/organization/infonesia
- Magenta Therapeutics https://www.crunchbase.com/organization/magenta-therapeutics
- MealPal https://www.crunchbase.com/organization/mealpal
- Numerated https://www.crunchbase.com/organization/numerated
- OnTruck https://www.crunchbase.com/organization/ontruck
- Rulai https://www.crunchbase.com/organization/rulai
- StayAbode https://www.crunchbase.com/organization/stayabode
- SuperFlex https://www.crunchbase.com/organization/superflex
- Swingvy https://www.crunchbase.com/organization/swingvy
- Tenzo https://www.crunchbase.com/organization/tenzo
- Timeular https://www.crunchbase.com/organization/timeular
- Atavium Viracta Therapeutics https://www.crunchbase.com/organization/viracta-therapeutics
- Virtualitics https://www.crunchbase.com/organization/virtualitics
- Zenjob https://www.crunchbase.com/organization/zenjob
- Zinc https://www.crunchbase.com/organization/zinc-3
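Matching two lists of company names like this is easy to do by hand for a few dozen rows, but here’s a small sketch of the same idea in Python. It assumes both lists are plain company names and matches them after lowercasing and trimming whitespace; the function names and the two “Hypothetical” entries are placeholders, not companies from either list.

```python
def normalize(name):
    """Lowercase and collapse whitespace so small formatting differences match."""
    return " ".join(name.lower().split())


def overlap(model_picks, nlp_picks):
    """Return names from model_picks that also appear in nlp_picks, sorted A-Z."""
    nlp_keys = {normalize(name) for name in nlp_picks}
    matched = {}
    for name in model_picks:
        key = normalize(name)
        if key in nlp_keys and key not in matched:
            matched[key] = name.strip()  # keep the first spelling we saw
    return sorted(matched.values(), key=str.lower)


# Tiny illustrative inputs; the real lists had 142 and 100+ companies.
model_picks = ["Aidoc", "Zenjob ", "Hypothetical Co"]
nlp_picks = ["zenjob", "AIDOC", "Another Hypothetical"]

both = overlap(model_picks, nlp_picks)
print(both)
```

Matching on names alone can miss companies that are spelled differently in the two exports, so matching on the CrunchBase URL instead would likely be more reliable if both datasets carry it.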