Part A: Data Analysis
You have been provided a dataset. A description of the variables is below. (This data has been generated using statistical software; it does not represent real measurements.) Import it into R. Then briefly answer each question below. Be sure to save a script file with any R code you need to answer these questions.
Variables:
ability = score on an IQ test
race = reported race (white or nonwhite)
earnings = annual earnings
birthplace = state of birth
- How many observations are there? How many variables? Be sure to specify which is the number of observations and which is the number of variables.
- What kind of dataset is this: Time series, cross-sectional, or panel? Briefly defend your answer in one or two sentences.
- Produce a five-number summary of each quantitative variable.
- For each quantitative variable, state whether its distribution is symmetric, skewed left, or skewed right. Write a sentence or two to defend your answer for each variable.
- The proper choice of data visualization technique depends in part on whether a variable is categorical or quantitative. Using an appropriate visualization technique, produce a graph showing the distribution of EARNINGS. Write a sentence or two explaining why this is a proper choice of visualization technique.
- The proper choice of data visualization technique depends in part on whether a variable is categorical or quantitative. Using an appropriate visualization technique, produce a graph showing the distribution of BIRTHPLACE. Write a sentence or two explaining why this is a proper choice of visualization technique.
- Assume your dataset is representative of a broader population. What is the 95% confidence interval for the population mean of ABILITY?
- What is the correlation coefficient for EARNINGS and ABILITY? Then, in words, provide an intuitive explanation of what this value of the correlation coefficient tells us about the relationship between these two variables.
- Estimate the simple regression with EARNINGS as the dependent variable and ABILITY as the independent variable. Report the slope coefficient (b1) from this regression. Then, in words, provide an intuitive explanation of what this value of the slope coefficient tells us about the relationship between these two variables.
- What is the predicted value of EARNINGS for an observation with ABILITY = 150? Provide at least one reason you might want to be cautious about trusting this estimate of the EARNINGS of an individual with this level of ABILITY.
- Construct a scatterplot of EARNINGS and ABILITY. Then describe the relationship between these variables in a sentence or two. Does the relationship you see in the scatterplot match the relationship implied by the regression and correlation coefficient?
- In the population, for every person born in South Dakota, there are 4 people born in Kansas, 4 born in Iowa, 2 born in Nebraska, and 5 born in Colorado. Is this dataset representative of the population? How, if at all, does the answer affect your analysis of this data? Explain briefly.
- What is the interquartile range of ABILITY? If two people's ABILITY scores differ by an amount equal to the interquartile range, what is the expected difference in their EARNINGS? Use your regression results to answer the second question.
- Conduct a formal hypothesis test of whether mean EARNINGS differs between the two groups defined by RACE. Be sure to lay out each of the 5 steps in our hypothesis testing procedure.
- Suppose your data is a representative sample of the population of interest. Let 𝑝 be the population proportion, defined as the share of individuals for whom RACE = WHITE. Suppose you wanted to use your data to test the null hypothesis that 𝑝 = 0.3. You plan to use a significance level of 0.05. What values of 𝑝̂ will fall into the rejection region for this test?
- Consider the hypothesis test laid out in the question immediately above. Suppose the “true” value of the population parameter is 𝑝 = 0.25. Calculate the statistical power of this test. Then describe in a sentence or two what the statistical power tells us.
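For reference on the last two questions, the rejection region and power of a test for a proportion can be written in general form. This is only a sketch under assumptions the questions do not state: a two-sided alternative, the normal approximation to the sampling distribution of 𝑝̂, and the symbol n for the number of observations in your dataset. Here p_0 = 0.3 is the null value, p_1 = 0.25 is the assumed true value, c_L and c_U are the lower and upper cutoffs, and \Phi is the standard normal CDF.

```latex
% Rejection region at significance level alpha = 0.05 (z_{0.025} is about 1.96)
\[
\text{Reject } H_0 \text{ if } \hat{p} < c_L \text{ or } \hat{p} > c_U,
\qquad
c_{L}, c_{U} = p_0 \mp z_{\alpha/2}\sqrt{\frac{p_0(1-p_0)}{n}}
\]

% Power when the true proportion is p_1: the probability that \hat{p}
% lands in the rejection region, using the standard error under p_1
\[
\text{Power} =
\Phi\!\left(\frac{c_L - p_1}{\sqrt{p_1(1-p_1)/n}}\right)
+ 1 -
\Phi\!\left(\frac{c_U - p_1}{\sqrt{p_1(1-p_1)/n}}\right)
\]
```

Note that the cutoffs use the standard error computed under the null value p_0, while the power calculation uses the standard error computed under p_1.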
Part B: Short Answer
For each question below, provide a brief answer. No answer should be longer than 250 words.
- Suppose you are interested in whether preschool is helpful to children. You obtain observational data on a group of children who attended preschool and a group of children who did not attend preschool. The data is a random, representative sample of the population. All children were given an identical standardized test. A hypothesis test shows statistically significant evidence that children who attended preschool earned better scores on the test. a.) Does this data prove that preschool improves children's test scores? If so, explain why. If not, provide at least one alternative explanation for the result. b.) Briefly describe how you might design an experiment to determine whether preschool improves children's test scores.
- A survey of students found a negative correlation between the weekly hours of T.V. watched and the weekly hours spent exercising. One student explained that reducing the hours of T.V. watched (cause) would result in students sleeping longer and having more energy to exercise (effect). Give another explanation with hours of exercise as the cause and hours of T.V. watched as the effect.
- A college department wants to learn about the jobs that its alumni are working in. The department posts an online survey on its website and invites alumni to complete it. The survey includes demographic questions and questions about their jobs (current and past). Describe a problem that could arise with this survey.
- In your own words, explain the difference between a parameter and a statistic. Give an example of each.