Appendix D: Tidying the Data

Available Artifacts

The following artifacts are currently available from five annual offerings of the program. The artifacts from the first cohort (2012) are incomplete, but some summary data are available that can help in developing an understanding of that cohort. The 2012 summary data will be incorporated into the narrative where appropriate. The artifacts from the most recent cohort (2016) are still being assembled.

Data Available by Cohort as of July 24, 2016

| Artifact | 2012 | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|
| Mentor Bios | NO | YES | YES | YES | NO |
| Speaker Bios | NO | YES | YES | YES | NO |
| Judges Bios | NO | YES | NO | YES | NO |
| Mentor Meeting Agendas | NO | YES | NO | NO | NO |
| Session Agendas | NO | YES | YES | YES | NO |
| Technology Descriptions | NO | YES | YES | YES | NO |
| Applications | NO | YES | YES | YES | YES |
| Application Results | YES | YES | YES | YES | YES |
| Attendance Results | YES | YES | YES | YES | YES |
| Resumes | NO | YES | YES | YES | YES |
| Business Plans | YES | YES | YES | YES | YES |
| Team Assignments | YES | YES | YES | YES | YES |
| Business Plan Scoring Sheets | NO | YES | YES | YES | NO |
| Investor Presentations | YES | YES | YES | YES | YES |
| Investor Plan Scoring Sheets | NO | YES | YES | YES | NO |
| Participant Surveys | Partial | YES | YES | YES | YES |

Application Data

Applications were collected through an online form. The form evolved over time, with questions added or updated before each new cohort, so the exports had to be collected and wrangled into a common structure before analysis.

Application Data Code Book

Applications Table

Field Data Type Description
Identifier character Unique identifier concatenated from (Cohort + P + Entry ID)
TeamID [1] character Program Team Assignment concatenated from (Cohort + T + Team ID)
Path [1] character Path to case node in NVivo.
Cohort [1] character Year of cohort (YYYY).
Accepted [1] logical Was participant accepted to program?
Decision [1] logical Did participant attend program?
Finished [1] logical Did participant finish program?
Entry.Id character Unique identifier assigned by application system.
Team [1][2] character Program Team Assignment. Assigned each team an ID (A:H).
First [2] character First Name
Middle [2] character Middle Name
Last [2] character Last Name
Phone [2] character Participant’s Phone Number
Phone2 [2] character Participant’s Alternate Phone Number
Email [2] character Participant’s Email Address
Email2 [2] character Participant’s Alternate Email Address
Address1 character First line of address. (2014+)
Address2 character Second line of address. (2014+)
city character Name of City (2014+)
State character Two-letter State code (2014+)
Zip character Mailing address zip code (2014+)
Country character Mailing address country (2014+)
Degree factor Participant’s Highest Degree Earned (Doctorate, Master, Bachelor, Associate, HS)
Experience character A description of the field of work/educational experience.
Q4 factor How did you hear about the program? (Check All That Apply)
Q4a   Checked: Program website
Q4b   Checked: Past participant
Q4c   Checked: Facebook/Twitter/LinkedIn
Q4d   Checked: Email
Q4e   Checked: OTL newsletter
Q4f   Checked: Newspaper article
Q4g   Checked: Word of mouth
Q4h   Checked: Other
Q4Other character Fill in the blank for Other. (2014+)
Q5 factor What is your primary goal for participation in the program? (Check All That Apply)
Q5a   Checked: Gain self confidence
Q5b   Checked: Start a company
Q5c   Checked: Networking
Q5d   Checked: Entrepreneurship training
Q5e   Checked: Attending seminars/workshops
Q5f   Checked: All of the above.
Q5g   Checked: Other
Q5h   Checked: Job/career opportunities (2016)
Q5i   Checked: Valuable knowledge and skills (2016)
Q5j   Checked: Interest in technology commercialization (2016)
Q5Other character Fill in the blank for “Other”. (2014+)
Q7 factor Do you have regular access to a computer? (Yes/No)
Q7a-f factor Do you have access to the following software? (2014)
Q7a   Checked: Internet
Q7b   Checked: Microsoft Word
Q7c   Checked: Excel
Q7d   Checked: PowerPoint
Q7e   Checked: Email
Q7f   Checked: Other
Q7g character Fill in the blank for “Other”. (2014)
Q8 factor Have you been involved with any new discoveries that have been patented by the University of Florida? (Yes/No)
Q8a character If you answered “Yes” to the previous question, please briefly describe the technology and your affiliation. (Ex: inventor, graduate student, post-doc, other)
Q9 character Attach a copy of your Resume in PDF format. (Limit 3 pages.)

[1] These fields were added to aid in data analysis.
[2] These fields will be redacted in the final data set.
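
The Identifier and TeamID fields follow the concatenation pattern given in the table. The following is a minimal sketch in base R, assuming a small data frame whose column names (Cohort, Entry.Id, Team) mirror the code book. The rows are invented for illustration and are not taken from the data set.

```r
# Hypothetical rows; values are invented for illustration only.
apps <- data.frame(
  Cohort   = c("2014", "2014", "2015"),
  Entry.Id = c("12", "13", "7"),
  Team     = c("A", "B", NA),
  stringsAsFactors = FALSE
)

# Identifier = Cohort + "P" + Entry ID
apps$Identifier <- paste0(apps$Cohort, "P", apps$Entry.Id)

# TeamID = Cohort + "T" + Team ID; stays NA when no team was assigned
apps$TeamID <- ifelse(is.na(apps$Team), NA_character_,
                      paste0(apps$Cohort, "T", apps$Team))

print(apps[, c("Identifier", "TeamID")])
```

Guarding against a missing team keeps applicants who were never placed on a team from receiving a code such as "2015TNA".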

**Notes:**

  • Some records in the original data set are duplicates, created when a participant completed the application twice. The last application submitted (highest Entry ID) was retained for data analysis. The earlier submissions are marked as “Duplicate” in the full data set for reference and are removed prior to data analysis.
  • For some years the “Entry ID” numbers do not start at 1. The missing entries may have been removed by the program organizers, or they may have been test submissions made before the application was published.
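
The duplicate rule described in these notes (keep the submission with the highest Entry ID, flag the rest) can be sketched in base R. This is an illustration only: the source does not say how repeat applicants were matched, so the Email column is an assumed matching key, and the rows are invented.

```r
# Hypothetical applications; rows 1 and 2 come from the same person.
apps <- data.frame(
  Entry.Id = c(3, 5, 4, 6),
  Email    = c("a@example.org", "a@example.org", "b@example.org", "c@example.org"),
  Accepted = c("Yes", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Highest Entry.Id per applicant (Email is an assumed matching key).
latest <- ave(apps$Entry.Id, apps$Email, FUN = max)

# Flag earlier submissions rather than deleting them, so the full data set
# still documents that they existed.
apps$Accepted[apps$Entry.Id < latest] <- "Duplicate"

# Rows kept for analysis.
kept <- apps[apps$Accepted != "Duplicate", ]
print(kept)
```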

Compiling the Data

The application data were provided in the form of an Excel spreadsheet. For some cohorts (2014 & 2015) there is a raw data spreadsheet as exported from the online application and a separate spreadsheet with additional information that was used in the admissions decision-making process. For other cohorts (2013, 2016), it appears as if the application export spreadsheet was edited to add information and document the decision-making process.

Notes:

  • For 2012, the applications were collected on paper forms and e-mailed. Unfortunately, the 2012 forms have been lost and are not available for analysis.
  • In some cases, information had to be compiled from other data sources to complete the data set.

  • Team assignments and completion data were gathered and verified through the attendance worksheets. Teams were assigned an ID (A:H) so that they can be anonymized in the final data set. Team letter assignments were recorded on the original attendance spreadsheets. This portion of the data collection process was completed by hand using an Excel spreadsheet.
  • Each cohort’s data were stored in an individual spreadsheet and then exported to a .csv file for import into R.

Preparing the Data for Analysis

In this first R block, each cohort’s .csv file is read into its own dataframe. Empty values are coded as “NA”, and records (rows) marked as duplicates are removed. Once the individual dataframes are complete, they are combined into a single dataframe. After tidying and anonymizing, this dataframe is stored as ‘Applicants.csv’ for use in the quantitative analysis.

## Reading data and combining into one dataframe.
## This block does not need to execute after initial data tidying.

library(dplyr)  # filter()

# Read Applicants data from cohort application files and remove observations marked duplicate.
Applicants12 <- read.csv("2012Applicants.csv", header = TRUE, sep = ",", na.strings = c("", "NA"))
Applicants12 <- filter(Applicants12, Accepted != "Duplicate")

Applicants13 <- read.csv("2013Applicants.csv", header = TRUE, sep = ",", na.strings = c("", "NA"))
Applicants13 <- filter(Applicants13, Accepted != "Duplicate")

Applicants14 <- read.csv("2014Applicants.csv", header = TRUE, sep = ",", na.strings = c("", "NA"))
Applicants14 <- filter(Applicants14, Accepted != "Duplicate")

Applicants15 <- read.csv("2015Applicants.csv", header = TRUE, sep = ",", na.strings = c("", "NA"))
Applicants15 <- filter(Applicants15, Accepted != "Duplicate")

Applicants16 <- read.csv("2016Applicants.csv", header = TRUE, sep = ",", na.strings = c("", "NA"))
Applicants16 <- filter(Applicants16, Accepted != "Duplicate")

# Merge Applicants from individual cohort Applicants tables into one 'Applicants' table.
Applicants <- rbind(Applicants16, Applicants15)
Applicants <- rbind(Applicants, Applicants14)
Applicants <- rbind(Applicants, Applicants13)
Applicants <- rbind(Applicants, Applicants12)

str(Applicants)
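
The same read, filter, and combine steps can be written once and applied to every cohort file. The sketch below uses base R only and writes two tiny stand-in files to a temporary folder so that it runs on its own. The helper name read_cohort and the file contents are invented for illustration.

```r
# Write two tiny stand-in files to a temporary folder (invented contents).
tmp <- tempfile("cohorts")
dir.create(tmp)
writeLines(c("Identifier,Accepted,Degree", "2015P1,Yes,PhD", "2015P2,Duplicate,"),
           file.path(tmp, "2015Applicants.csv"))
writeLines(c("Identifier,Accepted,Degree", "2016P1,No,MBA"),
           file.path(tmp, "2016Applicants.csv"))

# Hypothetical helper: read one cohort file and drop rows marked "Duplicate".
read_cohort <- function(path) {
  df <- read.csv(path, header = TRUE, na.strings = c("", "NA"),
                 stringsAsFactors = FALSE)
  # %in% keeps rows where Accepted is NA, whereas
  # dplyr::filter(Accepted != "Duplicate") would drop them.
  df[!(df$Accepted %in% "Duplicate"), ]
}

files <- sort(list.files(tmp, pattern = "Applicants\\.csv$", full.names = TRUE))
Applicants <- do.call(rbind, lapply(files, read_cohort))
rownames(Applicants) <- NULL
str(Applicants)
```

Whether rows with a missing Accepted value should be kept is a judgment call for the analyst. The sketch keeps them so that the choice is visible.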

Tidying the Data

After reviewing the data with ‘str(Applicants)’, it is apparent that the factor levels are not consistent for all variables. This is likely because different spellings were used for some response options as the application form was edited. This block cleans up the inconsistencies for each factor and shortens some of the factor labels so that they are easier to manipulate and display in tables.

## Cleaning up factor labels so that analysis will be consistent.
## This block does not need to execute after initial data tidying.

library(stringr)  # str_replace_all(), fixed()

# Create degree indicators that are one word in length.
# fixed() matches each label literally; as a regular expression, "." matches any character.
# "M.B.A." is replaced before "B.A." so the shorter label cannot match inside the longer one.
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("GED or High school diploma"), "HS")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("undergrad"), "HS")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("High school graduate"), "HS")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("Associate's degree"), "Associate")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("A.A."), "Associate")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("Master's degree"), "Master")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("M.B.A."), "Master")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("MBA"), "Master")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("M.S."), "Master")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("M.A."), "Master")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("J.D."), "Master")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("Bachelor's degree"), "Bachelor")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("B.A."), "Bachelor")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("B.S."), "Bachelor")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("B.F.A."), "Bachelor")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("Ph.D."), "Doctorate")
Applicants$Degree <- str_replace_all(Applicants$Degree, fixed("PhD"), "Doctorate")
Applicants$Degree <- factor(Applicants$Degree, levels = c("HS", "Associate", "Bachelor", "Master", "Doctorate"), ordered = TRUE)

# Update entries for questions which are parsing into multiple factors.
Applicants$Q5a <- str_replace_all(Applicants$Q5a, fixed("Gain self confidence"), "Gain self-confidence")
Applicants$Q5a <- factor(Applicants$Q5a)
Applicants$Q5f <- str_replace_all(Applicants$Q5f, fixed("All of the above."), "All of the above")
Applicants$Q5f <- factor(Applicants$Q5f)
# The parentheses must be matched literally, which requires fixed().
Applicants$Q8 <- str_replace_all(Applicants$Q8, fixed("Yes (Please answer Question 11.)"), "Yes")
Applicants$Q8 <- str_replace_all(Applicants$Q8, fixed("Yes (Please answer Question 9.)"), "Yes")

# Remove duplicate factors and ensure variables are factors.
Applicants$Accepted <- factor(Applicants$Accepted, levels = c("Yes", "No"), ordered = TRUE)
Applicants$Decision <- factor(Applicants$Decision, levels = c("Yes", "No"), ordered = TRUE)
Applicants$Finished <- factor(Applicants$Finished, levels = c("Yes", "No"), ordered = TRUE)
Applicants$Q8 <- factor(Applicants$Q8)
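
Label clean-up of this kind is sensitive to how the search pattern is interpreted. By default, both stringr and base R treat a pattern as a regular expression, in which parentheses form a group and a period matches any character. The sketch below uses base R gsub() as a stand-in for str_replace_all() to show the difference, followed by a whole-value lookup as an alternative. The sample values are invented.

```r
q8 <- c("Yes (Please answer Question 9.)", "No")

# As a regular expression, "(" and ")" form a group, so the literal
# parentheses in the data are never matched and nothing is replaced.
as_regex <- gsub("Yes (Please answer Question 9.)", "Yes", q8)

# fixed = TRUE treats the pattern as plain text.
as_fixed <- gsub("Yes (Please answer Question 9.)", "Yes", q8, fixed = TRUE)

# A whole-value lookup avoids partial matches such as "B.A." inside "M.B.A.".
lookup  <- c("B.A." = "Bachelor", "M.B.A." = "Master", "Ph.D." = "Doctorate")
degrees <- c("M.B.A.", "B.A.", "Ph.D.", "Certificate")
recoded <- unname(lookup[degrees])  # values without an entry become NA

print(as_regex)
print(as_fixed)
print(recoded)
```

The lookup only recodes values that match an entry exactly, so free-text answers that merely contain a degree abbreviation would need separate handling.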

Anonymizing the Data Set

The final data set has been cleansed of all identifying information so that it is an anonymous data set for analysis. Only columns with respondent data are included; all participant-identifying information is removed, and all references to the program name and university are replaced. The dataframe is saved as ‘Applicants.csv’. For all future iterations of analysis, the data will be loaded from ‘Applicants.csv’, and the previous blocks will not execute (‘eval=FALSE’). This anonymized data set is the only version of the data that will be made available for analysis via GitHub.

**Notes:**

  • All references to the program name, the university, and other identifiable indicators have been replaced with generic terms surrounded by asterisks (for example, *program*, *university*, *company*). This process was completed manually in Excel by reading through applicant responses to find all identifying data and replacing it with generic placeholders.
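
Because the replacement was done by hand, a scripted scan can serve as a second check that no identifying term was missed. The following is a minimal sketch in base R, not the procedure used for this data set. The terms and responses are invented placeholders, and a real list would have to be assembled by the analyst.

```r
# Invented free-text responses; the first has already been anonymized.
responses <- c("I learned about *program* at *university*.",
               "My lab at Example State University filed a patent.")

# Hypothetical list of identifying terms to search for.
terms   <- c("Example State University", "Example Program")
pattern <- paste(terms, collapse = "|")

# Flag responses that still contain an identifying term.
flagged <- grepl(pattern, responses, ignore.case = TRUE)

# Replace one term with its generic placeholder.
cleaned <- gsub("Example State University", "*university*", responses, fixed = TRUE)

print(responses[flagged])
print(cleaned)
```

A scan like this only finds the terms it is given, so it complements a manual read-through rather than replacing it.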

# Choosing the columns to be included in the anonymized data set.
Applicants <- select(Applicants, Identifier, TeamID, Cohort, Accepted, Decision, Finished, Team, Degree, Experience, contains("Q"))

# Save the file for later analysis.
write.csv(Applicants, file = "Applicants.csv")

Summative Evaluation

This code can be used to tidy individual years or the aggregated data. To analyze an individual year, comment out the lines that combine the data sets.

  • Summative_CodeBook.xlsx contains the variable names (question numbers), the text of the corresponding question, and an inventory of the years in which each question appeared on the survey.
  • Summative.csv contains the combined survey data.
  • Individual files 2013_Summative.csv through 2016_Summative.csv hold the data for each year.
  • The finished dataframe Summative.Rda has been exported for use in the analysis.

Compiling the Data

  • For cohorts 2013 – 2015 the surveys were completed in paper-and-pencil format. A matching survey was created in Qualtrics, and the responses were entered by hand and exported as a .csv file.
  • For 2016, the survey responses were collected in SurveyMonkey and exported as a .csv file.
  • Survey responses for 2012 are missing from the available data.

While many of the questions in the 2016 survey are identical to those in 2013 – 2015, the export from SurveyMonkey was substantially different. To prepare the data for use in this analysis, variable names had to be entered by hand to match those from the Qualtrics import. Likert scale responses all used the same 5-point scale but were configured to use different indicators. These differences are noted in the code book.
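
When two exports record the same 5-point scale with different indicators, one option is to map the text labels onto the numeric codes by position. The sketch below is illustrative only. The label wording and the sample responses are assumptions, and the actual indicators are the ones recorded in the code book.

```r
# Assumed label set, ordered from 1 to 5; not taken from the survey itself.
likert_labels <- c("Strongly Disagree", "Disagree", "Neutral",
                   "Agree", "Strongly Agree")

# Invented text responses as they might appear in a text-based export.
text_export <- c("Agree", "Strongly Agree", "N/A", "Neutral")

# match() returns the position of each response in the label set,
# and NA for anything that is not one of the five labels.
as_code <- match(text_export, likert_labels)
print(as_code)
```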

Summative Data Code Book

# Read the data file.

library(dplyr)  # bind_rows(), rename()
library(xlsx)   # read.xlsx()

# Cohort 2016 has 40 completers, and 42 surveys.
S2016 <- read.csv("./data/2016_Summative.csv", sep = ",", header = TRUE)
nrow(S2016)

## [1] 42

# Cohort 2015 has 47 completers, 5 missing surveys.
S2015 <- read.csv("./data/2015_Summative.csv", sep = ",", header = TRUE)
nrow(S2015)

## [1] 42

# Cohort 2014 has 42 completers, 8 missing surveys.
S2014 <- read.csv("./data/2014_Summative.csv", sep = ",", header = TRUE)
nrow(S2014)

## [1] 34

# Cohort 2013 has 41 completers, two did not attend 11/12 session, 6 missing surveys.
S2013 <- read.csv("./data/2013_Summative.csv", sep = ",", header = TRUE)
nrow(S2013)

## [1] 33

# Rename column one and convert it to character.
# "ï..V1" is how the first header reads when the file begins with a byte-order mark.
colnames(S2013)[colnames(S2013) == "ï..V1"] <- "V1"
colnames(S2014)[colnames(S2014) == "ï..V1"] <- "V1"
colnames(S2015)[colnames(S2015) == "ï..V1"] <- "V1"
colnames(S2016)[colnames(S2016) == "ï..V1"] <- "V1"
S2013[,1] <- as.character(S2013[,1])
S2014[,1] <- as.character(S2014[,1])
S2015[,1] <- as.character(S2015[,1])
S2016[,1] <- as.character(S2016[,1])

# Review the data file variables.
# str(S2015)

# Read the code book.
CodeBook <- read.xlsx("./data/Summative_CodeBook.xlsx", sheetIndex = 1, header = TRUE)
# CodeBook = data.table(CodeBook)

# View the code book.
# head(CodeBook)

The following code block converts the numerical response for Question 6, “Who was your mentor?”, to the team codes. These codes are assigned to maintain the anonymity of teams and mentors. The 2016 mentor responses were updated by hand because the data were encoded using full text descriptions.

# Question 6 Factor: "Who was your mentor?"
# Note this question uses the TeamID which matches the TeamID in the Applicants.csv file.

# 2016 Mentors
# These were updated using Excel.  The SurveyMonkey version of the datafile included all names and technologies.  To maintain anonymity, they were updated before importing the data into this analysis.

# 2015 Mentors
S2015$Q6[S2015$Q6 == "1"] <- "2015TF"
S2015$Q6[S2015$Q6 == "2"] <- "2015TG"
S2015$Q6[S2015$Q6 == "3"] <- "2015TD"
S2015$Q6[S2015$Q6 == "4"] <- "2015TA"
S2015$Q6[S2015$Q6 == "5"] <- "2015TC"
S2015$Q6[S2015$Q6 == "6"] <- "2015TB"
S2015$Q6[S2015$Q6 == "7"] <- "2015TU"  # This is unknown in the survey. n=0
S2015$Q6[S2015$Q6 == "8"] <- "2015TE"
S2015$Q6 <- factor(S2015$Q6, levels = c("2015TA", "2015TB", "2015TC", "2015TD", "2015TE", "2015TF", "2015TG", "2015TU"), ordered = TRUE)

# 2014 Mentors
S2014$Q6[S2014$Q6 == "1"] <- "2014TA"
S2014$Q6[S2014$Q6 == "2"] <- "2014TG"
S2014$Q6[S2014$Q6 == "3"] <- "2014TF"
S2014$Q6[S2014$Q6 == "4"] <- "2014TB"
S2014$Q6[S2014$Q6 == "5"] <- "2014TC"
S2014$Q6[S2014$Q6 == "6"] <- "2014TD"
S2014$Q6[S2014$Q6 == "7"] <- "2014TU"  # This is unknown in the survey. n=0
S2014$Q6[S2014$Q6 == "8"] <- "2014TE"
S2014$Q6 <- factor(S2014$Q6, levels = c("2014TA", "2014TB", "2014TC", "2014TD", "2014TE", "2014TF", "2014TG", "2014TU"), ordered = TRUE)

# 2013 Mentors
S2013$Q6[S2013$Q6 == "1"] <- "2013TD"
S2013$Q6[S2013$Q6 == "2"] <- "2013TC"
S2013$Q6[S2013$Q6 == "3"] <- "2013TA"
S2013$Q6[S2013$Q6 == "4"] <- "2013TG"
S2013$Q6[S2013$Q6 == "5"] <- "2013TB"
S2013$Q6[S2013$Q6 == "6"] <- "2013TF"
S2013$Q6[S2013$Q6 == "7"] <- "2013TE"
S2013$Q6[S2013$Q6 == "8"] <- "2013TU"  # This is unknown in the survey. n=0
S2013$Q6 <- factor(S2013$Q6, levels = c("2013TA", "2013TB", "2013TC", "2013TD", "2013TE", "2013TF", "2013TG", "2013TU"), ordered = TRUE)
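
The same recoding can be expressed as a lookup vector, which keeps each year's mapping in one place. The sketch below reuses the 2015 mapping shown above. The vector of survey responses is invented for illustration.

```r
# 2015 mapping from the block above: survey choice -> team code.
map2015 <- c("1" = "2015TF", "2" = "2015TG", "3" = "2015TD", "4" = "2015TA",
             "5" = "2015TC", "6" = "2015TB", "7" = "2015TU", "8" = "2015TE")

# Invented responses, including one unanswered item.
q6 <- c(4, 1, 8, NA, 4)

# Index the lookup by the response converted to text; NA stays NA.
team <- factor(unname(map2015[as.character(q6)]),
               levels = sort(unname(map2015)))

print(table(team, useNA = "ifany"))
```

Because the lookup is applied in a single step, a newly assigned code can never be mistaken for a numeric response that is still waiting to be recoded.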

Combine the individual survey response files into one complete Summative dataframe. This will allow cumulative data analysis as well as comparisons across cohorts.

# Combine files into one Summative dataframe
Summative <- bind_rows(S2015, S2014)

Summative <- bind_rows(Summative, S2013)
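
bind_rows() matches columns by name and fills any column that is missing from one input with NA, which matters here because not every question appeared in every year. The following base R sketch illustrates that behavior with a hypothetical helper, bind_fill, and two invented data frames. It is not a replacement for bind_rows().

```r
# Hypothetical helper: stack two data frames, adding NA columns where needed.
bind_fill <- function(a, b) {
  all_cols <- union(names(a), names(b))
  for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
  for (col in setdiff(all_cols, names(b))) b[[col]] <- NA
  rbind(a[all_cols], b[all_cols])
}

# Invented survey fragments: Q35 exists only in the second one.
x <- data.frame(ID = c("a", "b"), Q6 = c("2015TA", "2015TB"),
                stringsAsFactors = FALSE)
y <- data.frame(ID = "c", Q35 = "text", stringsAsFactors = FALSE)

out <- bind_fill(x, y)
print(out)
```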

The following code block converts the numerical response items into factor response items that will be easier to analyze. This block only affects the 2013 – 2015 data, as the 2016 data already contain the factored response items for these questions.

# Question 13 Factor: "Time allotted each week for speakers was:"
Summative$Q13[Summative$Q13 == "1"] <- "Too Short"
Summative$Q13[Summative$Q13 == "2"] <- "About Right"
Summative$Q13[Summative$Q13 == "3"] <- "Too Long"
Summative$Q13 <- factor(Summative$Q13, levels = c("Too Short", "About Right", "Too Long"), ordered = TRUE)

# Question 14 Factor: "Time allotted each week for teamwork was:"
Summative$Q14[Summative$Q14 == "1"] <- "Too Short"
Summative$Q14[Summative$Q14 == "2"] <- "About Right"
Summative$Q14[Summative$Q14 == "3"] <- "Too Long"
Summative$Q14 <- factor(Summative$Q14, levels = c("Too Short", "About Right", "Too Long"), ordered = TRUE)

# Question 19 Factor: "How many hours a week on average did your team meet outside of the program?"
Summative$Q19[Summative$Q19 == "1"] <- "1-2 hrs"
Summative$Q19[Summative$Q19 == "2"] <- "1-2 hrs"
Summative$Q19[Summative$Q19 == "3"] <- "3-5 hrs"
Summative$Q19[Summative$Q19 == "4"] <- "3-5 hrs"
Summative$Q19[Summative$Q19 == "5"] <- "More than 5 hrs"
# The levels must match the three labels assigned above; any other label would become NA.
Summative$Q19 <- factor(Summative$Q19, levels = c("1-2 hrs", "3-5 hrs", "More than 5 hrs"), ordered = TRUE)

# Question 20 Factor: "The program's duration was:"
Summative$Q20[Summative$Q20 == "1"] <- "Too Short"
Summative$Q20[Summative$Q20 == "2"] <- "About Right"
Summative$Q20[Summative$Q20 == "3"] <- "Too Long"
Summative$Q20 <- factor(Summative$Q20, levels = c("Too Short", "About Right", "Too Long"), ordered = TRUE)

# Question 21 integer: Convert to range of values
# cut() bins the numeric hours directly. Comparing against quoted numbers would
# compare text, and labels assigned by one step would be overwritten by the next.
Summative$Q21 <- cut(as.numeric(Summative$Q21), breaks = c(-Inf, 2, 5, Inf),
                     labels = c("1-2 hrs", "3-5 hrs", "More than 5 hrs"), ordered_result = TRUE)

# Question 22 Factor: "Would you recommend this program to other women?"
Summative$Q22[Summative$Q22 == "1"] <- "Yes"
Summative$Q22[Summative$Q22 == "2"] <- "No"
Summative$Q22[Summative$Q22 == "3"] <- "Unsure"
Summative$Q22 <- factor(Summative$Q22, levels = c("Yes", "No", "Unsure"), ordered = TRUE)

# Question 24 Factor: "What is your highest level of education completed?"
Summative$Q24[Summative$Q24 == "1"] <- "HS"
Summative$Q24[Summative$Q24 == "2"] <- "Associate"
Summative$Q24[Summative$Q24 == "3"] <- "Bachelor"
Summative$Q24[Summative$Q24 == "4"] <- "Master"
Summative$Q24[Summative$Q24 == "5"] <- "PhD"
Summative$Q24[Summative$Q24 == "6"] <- "Other"
Summative$Q24 <- factor(Summative$Q24, levels = c("HS", "Associate", "Bachelor", "Master", "PhD", "Other"), ordered = TRUE)

# Question 29 Factor: "What is your area of expertise?"
Summative$Q29[Summative$Q29 == "1"] <- "Finance"
Summative$Q29[Summative$Q29 == "2"] <- "Business"
Summative$Q29[Summative$Q29 == "3"] <- "Science"
Summative$Q29[Summative$Q29 == "4"] <- "Engineering"
Summative$Q29[Summative$Q29 == "5"] <- "Computer/IT"
Summative$Q29[Summative$Q29 == "6"] <- "Marketing/Communications/Design"
Summative$Q29[Summative$Q29 == "7"] <- "Other"
Summative$Q29 <- factor(Summative$Q29, levels = c("Finance", "Business", "Science", "Engineering", "Computer/IT", "Marketing/Communications/Design", "Other"), ordered = TRUE)

# Question 30 Factor: "What is your age?"
Summative$Q30[Summative$Q30 == "1"] <- "18-24"
Summative$Q30[Summative$Q30 == "2"] <- "25-34"
Summative$Q30[Summative$Q30 == "3"] <- "35-44"
Summative$Q30[Summative$Q30 == "4"] <- "45-54"
Summative$Q30[Summative$Q30 == "5"] <- "55-64"
Summative$Q30[Summative$Q30 == "7"] <- "65-74"
Summative$Q30[Summative$Q30 == "6"] <- "75 or older"
# "65-74" is listed as a level so that those responses are not converted to NA.
Summative$Q30 <- factor(Summative$Q30, levels = c("18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75 or older"), ordered = TRUE)

# Question 31 Factor: "Choose the answer that best describes your current situation:"
Summative$Q31[Summative$Q31 == "1"] <- "married or in a committed relationship with no children"
Summative$Q31[Summative$Q31 == "2"] <- "married or in a committed relationship with grown children (18+)"
Summative$Q31[Summative$Q31 == "3"] <- "married or in a committed relationship with school aged children (5-18)"
Summative$Q31[Summative$Q31 == "4"] <- "married or in a committed relationship with younger child/ren (under 5)"
Summative$Q31[Summative$Q31 == "5"] <- "single with no children"
Summative$Q31[Summative$Q31 == "6"] <- "single parent with grown children (18+)"
Summative$Q31[Summative$Q31 == "7"] <- "single parent with school aged children (5-18)"
Summative$Q31[Summative$Q31 == "8"] <- "single parent with younger child/ren (under 5)"
Summative$Q31 <- factor(Summative$Q31, ordered = TRUE)

# Question 32 Factor: "Which of the following best represents your racial or ethnic heritage?"
Summative$Q32[Summative$Q32 == "1"] <- "Non-Hispanic White or Euro-American"
Summative$Q32[Summative$Q32 == "2"] <- "Black, Afro-Caribbean, or African American"
Summative$Q32[Summative$Q32 == "3"] <- "Latino or Hispanic American"
Summative$Q32[Summative$Q32 == "4"] <- "East Asian or Asian American"
Summative$Q32[Summative$Q32 == "5"] <- "South Asian or Indian American"
Summative$Q32[Summative$Q32 == "6"] <- "Middle Eastern or Arab American"
Summative$Q32[Summative$Q32 == "7"] <- "Native American or Alaskan Native"
Summative$Q32[Summative$Q32 == "8"] <- "Other"
Summative$Q32 <- as.factor(Summative$Q32)

# Question 33 Factor: "What was your total household income before taxes during the past 12 months?"
Summative$Q33[Summative$Q33 == "1"] <- "Less than $25,000"
Summative$Q33[Summative$Q33 == "2"] <- "$25,000 to $34,999"
Summative$Q33[Summative$Q33 == "3"] <- "$35,000 to $49,999"
Summative$Q33[Summative$Q33 == "4"] <- "$50,000 to $74,999"
Summative$Q33[Summative$Q33 == "5"] <- "$75,000 to $99,999"
Summative$Q33[Summative$Q33 == "6"] <- "$100,000 to $149,999"
Summative$Q33[Summative$Q33 == "7"] <- "$150,000 or more"
Summative$Q33 <- factor(Summative$Q33, levels = c("Less than $25,000", "$25,000 to $34,999", "$35,000 to $49,999", "$50,000 to $74,999", "$75,000 to $99,999", "$100,000 to $149,999", "$150,000 or more"), ordered = TRUE)

# Question 34 Factor: "Please circle the option(s) that best describe(s) your current situation. Ok to choose more than one if applicable."
Summative$Q34[Summative$Q34 == "1"] <- "Master's student"
Summative$Q34[Summative$Q34 == "2"] <- "MBA student"
Summative$Q34[Summative$Q34 == "3"] <- "MD student"
Summative$Q34[Summative$Q34 == "4"] <- "PhD student"
Summative$Q34[Summative$Q34 == "5"] <- "Postdoc"
Summative$Q34[Summative$Q34 == "6"] <- "Unemployed (not a student)"
Summative$Q34[Summative$Q34 == "7"] <- "Work part-time (not a student)"
Summative$Q34[Summative$Q34 == "8"] <- "Employed at a technology startup"
Summative$Q34[Summative$Q34 == "9"] <- "Employed at a non-technology startup"
Summative$Q34[Summative$Q34 == "10"] <- "Employed at a technology non-startup company"
Summative$Q34[Summative$Q34 == "11"] <- "Employed at a non-technology non-startup company"
Summative$Q34[Summative$Q34 == "12"] <- "Owned my own technology business"
Summative$Q34[Summative$Q34 == "13"] <- "Owned my own non-technology business"
Summative$Q34 <- as.factor(Summative$Q34)

# Update Likert scale questions to represent NA as NA (Qualtrics exported it as #6)

Summative$Q1_1[Summative$Q1_1 == "6"] <- NA
Summative$Q4_1[Summative$Q4_1 == "6"] <- NA
Summative$Q7_1[Summative$Q7_1 == "6"] <- NA
Summative$Q7_2[Summative$Q7_2 == "6"] <- NA
Summative$Q7_3[Summative$Q7_3 == "6"] <- NA
Summative$Q9_1[Summative$Q9_1 == "6"] <- NA
Summative$Q9_2[Summative$Q9_2 == "6"] <- NA
Summative$Q9_3[Summative$Q9_3 == "6"] <- NA
Summative$Q9_4[Summative$Q9_4 == "6"] <- NA
Summative$Q9_5[Summative$Q9_5 == "6"] <- NA

# Add 2016 data to Summative
Summative <- bind_rows(Summative, S2016)

# Rename ID Column
Summative <- rename(Summative, ID = V1)

# Rename Cohort column and factor
Summative <- rename(Summative, Cohort = Q1)
Summative$Cohort <- factor(Summative$Cohort, levels = c("2013", "2014", "2015", "2016"), ordered = TRUE)

# Rename Degree column and clean up factor levels
Summative <- rename(Summative, Degree = Q24)
Summative$Degree[Summative$Degree == "High School Diploma"] <- "HS"
Summative$Degree[Summative$Degree == "Master's Degree"] <- "Master"
Summative$Degree[Summative$Degree == "Bachelor's Degree"] <- "Bachelor"
Summative$Degree[Summative$Degree == "Ph.D."] <- "PhD"
Summative$Degree <- factor(Summative$Degree, levels = c("HS", "Associate", "Bachelor", "Master", "PhD", "Other"), ordered = TRUE)

# Rename Team Column, Survey ID, Discipline, Age, Relationship, Race, Income, Employment
Summative <- rename(Summative, Team = Q6)
Summative <- rename(Summative, PaperID = Q25)
Summative <- rename(Summative, Discipline = Q29)
Summative <- rename(Summative, Age = Q30)
Summative <- rename(Summative, Relationship = Q31)
Summative <- rename(Summative, Race = Q32)
Summative <- rename(Summative, Income = Q33)
Summative <- rename(Summative, Employment = Q34)

# Rename Questions. Use TEXT to identify open-ended response items.
Summative <- rename(Summative, Q1 = Q1_1)
Summative <- rename(Summative, Q4 = Q4_1)
Summative <- rename(Summative, TEXT2 = Q2)
Summative <- rename(Summative, TEXT5 = Q5)
Summative <- rename(Summative, TEXT8 = Q8)
Summative <- rename(Summative, TEXT10 = Q10)
Summative <- rename(Summative, TEXT12 = Q12)
Summative <- rename(Summative, TEXT35 = Q35)
Summative <- rename(Summative, TEXT17 = Q17)
Summative <- rename(Summative, TEXT23 = Q23)
Summative <- rename(Summative, TEXT24 = Q24_TEXT)
Summative <- rename(Summative, TEXT26 = Q26)
Summative <- rename(Summative, TEXT29 = Q29_TEXT)

# Saving the data for later analysis.
write.csv(Summative, file = "data/Summative.csv")
save(Summative, file = "data/Summative.Rda")
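
Binning a count of hours into ranges, as is done for Question 21, depends on the comparison being numeric. When a number is compared with a quoted number, R converts both sides to text and compares them character by character. The sketch below shows that pitfall and a numeric alternative using cut(). The hours are invented for illustration.

```r
# Invented weekly hours, including a missing response.
hours <- c(1, 2, 3, 5, 6, 10, NA)

# The number is converted to text, so "10" is compared with "2"
# character by character and sorts first.
text_compare <- 10 <= "2"

# cut() bins on the numeric value: (-Inf, 2], (2, 5], (5, Inf).
binned <- cut(hours, breaks = c(-Inf, 2, 5, Inf),
              labels = c("1-2 hrs", "3-5 hrs", "More than 5 hrs"),
              ordered_result = TRUE)

print(text_compare)
print(table(binned, useNA = "ifany"))
```

cut() assigns every value in a single pass, so a label written by one step cannot be picked up and relabeled by a later comparison.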