Overview
About the Initiative Evaluation
The Initiative evaluation aims to help Oregon Community Foundation and The Ford Family Foundation understand and document the impacts of the Initiative, to support internal learning and adaptation during the Initiative, and to share what we learn with out-of-school time providers, other foundations and other stakeholders. All of this supports the ultimate goal of the Initiative: to narrow the opportunity gap.
The evaluation is managed by Oregon Community Foundation research staff in partnership with additional evaluation and out-of-school time experts—including Carrie Furrer at Portland State University’s Center for Improvement of Child and Family Services and Corey Newhouse at Public Profit—as well as an evaluation advisory group comprising out-of-school time and evaluation experts. Throughout the evaluation, these experts and program staff have vetted our plans, participated in sense-making and reflection activities to help us interpret and use our findings, and provided feedback on draft materials.
See the full list of folks we want to thank for their contributions in Acknowledgments.
Explore our list of Resources & References.
Evaluation design
The evaluation was designed to provide both formative/process and summative/impact findings (i.e., to support Initiative implementation and to assess its effectiveness). We took a utilization-focused approach, prioritizing the collection and sharing of information that would be most useful to the Initiative team and participating programs. Though we did not begin with a developmental approach in mind, we did end up employing some aspects of that approach as well, including more rapid-cycle feedback and participatory sense-making processes.
Beginning in 2018, the evaluation team also benefited from participation in the Equitable Evaluation Initiative, which influenced some of our analysis and dissemination decisions, as we aimed not only to support learning through the evaluation, but also to further equity.
The evaluation was organized around three main questions:
- How and how well was the Initiative designed and implemented to meet its goals?
- How and how well did participating programs implement high-quality out-of-school-time programming to support success for middle school students of color, students from under-resourced rural communities, and students from low-income families?
- How and how much did the Initiative and its participating programs contribute to positive youth, parent, organizational and community outcomes?
Data collection and analysis
To answer these questions, the evaluation team engaged in a wide range of data collection activities, including:
- Literature reviews about out-of-school time, the Initiative’s core components and other relevant topics such as social and emotional learning (2014–2020).
- Annual out-of-school time staff interviews for almost all programs (2014–2020).
- Observations of dozens of program sessions (2014–2017).
- Ten focus groups with parents and caregivers (2015–2016).
- Two rounds of interviews with out-of-school time stakeholders (including other funders and researchers) to learn more about the out-of-school time field in Oregon and beyond (2015, 2017).
- Support for program quality data collection through the Youth Program Quality Assessment (Youth PQA) tools (2014–2020).
- Support for eight programs that implemented a photovoice project with several dozen students to explore and document their identity in relationship to their community, school and out-of-school time program (2016).
- Support for participating programs in collecting and submitting student-level data about program participation as well as student surveys about social and emotional learning and program experiences (2014–2017).
- Use of student participation data from 2014 through 2017 (divided into three entry cohorts based on when they began programming) to track students’ short- and long-term educational outcomes and compare them to those of similar peers. This analysis used administrative data available through the Oregon Department of Education, including standardized test scores in math and English language arts, attendance, discipline referrals and freshman credit attainment (2014–2020).
- Support for planning and reflection relating to learning community activities, collecting participant feedback, and conducting discussion sessions modeled on after-action reviews to help staff continuously improve implementation (2014–2020).
All data was collected with the consent of students, their caregivers and programs.
Throughout the Initiative, the evaluation team provided summaries of our internal analysis of data collection efforts, usually with an emphasis on whatever would most help Initiative staff improve Initiative and learning community implementation. Evaluation team members also supported program capacity-building—particularly in evaluation methodology—through sessions offered at learning community convenings and webinars, as well as through program-specific coaching on student-level data collection.
Copies of the tools and instructions developed for the evaluation are available on request from kleonard@oregoncf.org.
Data sources
The evaluation team interviewed program staff and leaders in almost every year of their Initiative participation. The configurations and details varied each year, depending on the Initiative team’s information needs and the interests of participating programs. As the Initiative progressed, we conducted small-group interviews with programs in their third or later years to encourage peer learning. More than 100 individual and group interviews were conducted over the course of the evaluation.
These interviews were important opportunities for the evaluation team to build understanding and rapport with program leaders and staff, and always included time for programs to provide feedback and ask questions about the Initiative. Findings from staff interviews were one of the most useful sources of information for the Initiative team in its efforts to continuously improve the learning community and other Initiative components. Whenever possible, we shared our notes or summaries with interviewees and program leaders.
Interviews were held separately from grant renewal decision-making processes. After each round of interviews, the evaluation team conducted thematic qualitative analysis and provided a summary of aggregate findings to Foundation staff, describing similarities and differences across programs without compromising confidentiality and with a focus on sharing information that would help the Initiative team understand and support participating programs. Most of these interviews happened at program locations during site visits that also included program observations.
Program observations took place in the first several years of the Initiative. Dozens of observations were conducted by the evaluation team, both during the school year and over the summer. These observations were one of the most important sources of the evaluation team’s understanding of what happens during programming. They provided early indications of themes and patterns that would resonate throughout later analysis, including the ways that programs were promoting a sense of belonging, confidence and other social and emotional skills, and they familiarized us with the communities and contexts in which each program works. The evaluation team took extensive notes during observations, using thematic qualitative analysis to code, digest and synthesize findings that were used primarily to describe what programming looks like, how it varies, and how programs are working toward positive impacts for students.
The evaluation team conducted 10 focus groups with parents and caregivers during the 2015–2016 school year. This was an optional evaluation component for participating programs; we recruited a strategic sample of programs (e.g., by geography and program type) but encouraged only those with adequate capacity and interest to participate. We worked closely with program leaders to identify and recruit parents and caregivers and to coordinate space, catering and child care as needed. We provided informational flyers and offered our contact information for any parents or caregivers with questions.
During each focus group, we explored how parents work to support their students, how programs impact students and families, and how families are or aren’t engaged by the programs. In doing so, we captured constructive feedback that we shared with program leaders along with a more comprehensive summary of the discussion. Thematic qualitative analysis generated aggregate findings that we could share while ensuring confidentiality.
In 2015 and 2017, the evaluation team conducted telephone interviews with roughly a dozen out-of-school time stakeholders to learn more about the out-of-school time field in Oregon and beyond. Participants included other funders and researchers as well as out-of-school time experts. The interviews helped the evaluation team get to know Oregon's policy and funding landscape for out-of-school time, explore participant perspectives on the strengths and challenges of out-of-school programs in our state, and identify opportunities specific to the Initiative given its goals and potential role in Oregon’s out-of-school time ecosystem. As with our other qualitative data collection efforts, we completed thematic qualitative analysis and provided summaries to the Initiative team and to interviewees.
Program quality data was captured through the Youth Program Quality Assessment (Youth PQA) tools adapted for use in Oregon.
Get more information about these tools and the program quality assessment process in Improving Out-of-School Time Program Quality.
The Initiative team did not use program quality assessment scores to make funding decisions, nor did the evaluation team monitor score changes over time to determine whether improvement was happening. Positioning program leaders and staff as the primary users of that data—and supporting their use of the data to inform improvement planning rather than drive it—has helped ensure authentic engagement in the process, resulting in meaningful program improvement. We have found that this focus on the process—rather than the scores—is critical.
That said, the evaluation team has explored aggregate assessment scores and improvement plans to look for patterns and trends. For example, it was helpful for learning community planning to see when youth voice and reflection were an area of potential growth and of interest to participating programs. However, these patterns and trends weren’t consistent enough for us to draw any conclusions about program quality.
We also don’t expect to see substantial score increases within the typical three-year period, because we know that over the first couple of years programs are just starting to understand the process and can often become harsher critics of their own work. Seeing assessment score changes is also unlikely when different locations, components and staff are observed each year; it is rare that a program has adequate observations of similar programming to draw conclusions from year-to-year score changes. However, because reviewing the previous years’ scores can generate conversations about how practice varies and help programs plan for improvement, programs do have access to their historical scores through an online database.
We have examined relationships between external quality assessment data and other data sets, including educational data. This was a purely exploratory exercise, using data we knew was likely gathered too early in the quality assessment process to merit extensive or conclusive analysis. Nevertheless, we did see some intriguing patterns, including a positive relationship between composite scores on the “supportive environment” domain of the Youth PQA and some educational outcomes, including regular attendance. In other words, students who attended school regularly went to programs with higher quality ratings for a supportive environment. Unfortunately, our results were inconsistent across the Youth PQA domains and cohort groups, and our analyses did not allow for causal conclusions.
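To illustrate the type of exploratory check described above, the short sketch below relates a continuous quality score to a binary outcome such as regular attendance using a point-biserial correlation. It is illustrative only: the data are simulated, the variable names are hypothetical, and this is not the analysis code used in the evaluation.

```python
import numpy as np
from scipy import stats

# Simulated (hypothetical) data: each student's program-level composite score
# on the "supportive environment" domain and whether the student attended
# school regularly (1 = yes, 0 = no).
rng = np.random.default_rng(42)
supportive_env = rng.uniform(2.5, 5.0, 500)
regular_attendance = rng.binomial(1, 0.7, 500)

# Point-biserial correlation between a binary and a continuous variable.
r, p = stats.pointbiserialr(regular_attendance, supportive_env)
print(f"r = {r:.2f}, p = {p:.3f}")  # describes association only, not causation
```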
Photovoice is a participatory evaluation method that engages students in personal sharing through photographs and the written word. It is especially useful in situations where there is an uneven power dynamic, such as in adult-led programs for youth (Strack et al., 2004). With support and expertise from Oregon photographer and arts educator Julie Keefe, the evaluation team provided instruction, training and resources to help eight programs guide a few dozen students through photovoice projects in 2015–2016. These projects asked students to use photography and narrative to describe how they see themselves and how others view them in school, in their out-of-school time program and in their community. Finished products highlighted youth commitment to school, peers as an important support network, the ways youth are perceived internally versus externally, and hope for the future. The evaluation team used these direct statements from youth to complement quantitative data and analyses.
This was an entirely optional evaluation component for participating programs; we recruited a strategic sample of programs (e.g., by geography and program type) but encouraged only those with adequate capacity and interest to participate. Most worked with a smaller subset of their students. Photovoice projects were not integral to programming—they did not replace other curricula, for example. Rather, they were a tool to help program staff and the evaluation team understand more about how students are feeling and thinking about themselves in relation to the program and world around them.
In the first few years of the Initiative, more than 1,300 students in 21 participating programs completed surveys about their social and emotional learning using an adapted version of the Engagement, Motivations and Beliefs Survey created by Youth Development Executives of King County and validated by the American Institutes for Research (Naftzger, 2016). To support this effort, the evaluation team provided all necessary materials and extensive instruction so that program staff could administer the survey directly.
The survey asked students to reflect on their academic identity, mindsets, interpersonal skills and cultural identity, and on how much their programs supported them through academics, belonging and engagement. Each section relating to social and emotional learning included multiple statements as part of a scale. Students were asked whether these statements were not at all true (1), somewhat true (2), mostly true (3) or completely true (4). Scale averages combine the response values for five to seven statements in each topic area. Wherever findings in this report describe students as being “in agreement” with a survey statement, we have combined mostly true and completely true responses.
Because we did not have a large or consistent enough group of respondents year over year, analysis of survey results was limited to a single response per student. In cases where students completed surveys over multiple years or across different programs in the same year, one survey per student was selected at random.
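As a concrete illustration of these scoring and selection rules, the sketch below computes a scale average, the "in agreement" recode and a random single-survey-per-student selection with pandas. The column names and values are hypothetical; this is not the evaluation team's actual code.

```python
import pandas as pd

# Hypothetical survey extract: one row per completed survey, with item
# responses coded 1 = not at all true ... 4 = completely true.
surveys = pd.DataFrame({
    "student_id": [101, 101, 102, 103],
    "year":       [2015, 2016, 2015, 2016],
    "belong_1":   [3, 4, 2, 4],
    "belong_2":   [4, 4, 1, 3],
    "belong_3":   [3, 3, 2, 4],
})

belonging_items = ["belong_1", "belong_2", "belong_3"]

# Scale average: the mean of the item responses in a topic area.
surveys["belonging_scale"] = surveys[belonging_items].mean(axis=1)

# "In agreement" with a statement = mostly true (3) or completely true (4).
surveys["agrees_belong_1"] = surveys["belong_1"] >= 3

# Keep one randomly selected survey per student when a student completed
# surveys in multiple years or programs (shuffle, then keep the first row
# per student).
one_per_student = (
    surveys.sample(frac=1, random_state=2017)
           .groupby("student_id", as_index=False)
           .first()
)

print(surveys[["student_id", "belonging_scale", "agrees_belong_1"]])
print(one_per_student[["student_id", "year"]])
```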
Student survey data was linked to demographic data from the Oregon Department of Education. Respondents largely mirrored the demographics of students in the Initiative. A strong majority were eligible for free or reduced-price meals (85%); however, fewer American Indian/Alaska Native students were represented in the survey (4%) compared to their participation in the Initiative (25%). Students surveyed were 48% Latino/x, 31% Black/African American, 4% Native Hawaiian or Pacific Islander, and 2% Asian. In addition, 18% of the students qualified for special education services and 18% were English language learners. Students were fairly evenly split among the sixth, seventh and eighth grades.
Our analysis of the student survey highlights basic associations and differences in agreement between groups of students and their programs (e.g., academically focused versus culturally specific). These associations do not account for additional differences between the groups being compared. For example, students in programming focused on social and emotional learning may be different from students in other types of programming, and our analysis does not control for those differences. Still, these findings highlight areas for further exploration and analysis.
The 21 programs that started participating in the Initiative in either 2013 or 2014 submitted data about students and their program participation for the 2014–2015, 2015–2016 and 2016–2017 school years. Data tracked included student demographics and monthly hours of participation. The evaluation team provided simple and secure tools for data collection and submission as well as training and one-on-one coaching to support programs in capturing and submitting the data periodically each year. Consent forms were provided in several languages to ensure that parents and caregivers understood and agreed to the use of student data in the evaluation. Each year, the evaluation team gave each program a summary of the data submitted, which provided a synopsis programs could use in their own reporting as well as an opportunity to quality-check the data.
Collecting and submitting student participation data was a substantial lift for participating programs, many of which did not have electronic data systems and had to transfer information by hand from paper applications and sign-in sheets each month. Within a couple of years, we realized that this was not sustainable for the participating programs or the evaluation team.
In 2017, we decided not to continue collecting student-level data from participating programs and deepened our focus on supporting learning and improvement for the Initiative team and participating programs. At that point, we had captured enough student-level data to have a sufficiently large and diverse group of students to complete our analysis. We also knew that it would be at least 2020 before we would find out whether even the oldest students included in our data were graduating at the same or greater rates than their peers (in fact, we didn’t end up getting that data until 2021 due to pandemic-related delays). Ultimately, we learned a lot about what data was helpful to programs and realized that it would be far more valuable to focus on program quality data and other tools most immediately useful to program staff.
We also realized that we didn’t need to capture student-level data to see program impacts or to understand the Initiative’s effectiveness. Not only was the national research base growing in support of out-of-school time programming, but we also had plenty of evidence of its value, and the value of the Initiative, from the perspective of students, families and program staff—a strong set of equally valid qualitative data.
To measure educational outcomes for participating students, the evaluation team used administrative data from the Oregon Department of Education, including daily attendance, standardized test scores in math and English language arts, suspension/expulsion discipline referrals, the freshman “on track” indicator, and high school completion. We tracked three cohorts of students through the 2018–2019 school year based on the year they first started participating in Initiative-supported programming (2014–2015, 2015–2016 and 2016–2017). High school completion data was collected based on the 2019–2020 reporting year.
For each cohort, we used probabilistic matching to identify participating students in the Oregon Department of Education average daily membership (ADM) datafile based on student name, date of birth, gender, grade, race and school. In total, we were able to identify 86% of OCF students in the average daily membership datafile (see Initiative participation samples by cohort, below). Students in the statewide datafile who were not identified as participating students were placed in the comparison pool. Students from the previous cohorts (2014–2015 and 2015–2016) were excluded from the comparison pool of subsequent cohorts (2015–2016 and 2016–2017).
Second, we used 1:1 propensity score matching to select a sample of students from the comparison pool for each cohort based on a number of student characteristics (e.g., free or reduced-price meal eligibility, race, grade level, gender, language of origin, and English language learner status) and the county in which they attended school. Across all three cohorts, 72% of participating students were included in the matched comparison analyses.
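For readers interested in the mechanics, the sketch below shows one common way to implement 1:1 nearest-neighbor propensity score matching with scikit-learn. It is a simplified stand-in rather than the evaluation team's actual procedure: the data are simulated, the covariate names are hypothetical, and a full implementation would also match without replacement and check covariate balance.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Simulated student-level file: 1 = Initiative participant, 0 = comparison pool.
rng = np.random.default_rng(0)
n = 2000
students = pd.DataFrame({
    "participant":  rng.integers(0, 2, n),
    "frl_eligible": rng.integers(0, 2, n),  # free or reduced-price meals
    "ell":          rng.integers(0, 2, n),  # English language learner
    "grade":        rng.integers(6, 9, n),
    "female":       rng.integers(0, 2, n),
})
covariates = ["frl_eligible", "ell", "grade", "female"]

# Step 1: estimate each student's propensity to participate from the covariates.
model = LogisticRegression(max_iter=1000)
model.fit(students[covariates], students["participant"])
students["pscore"] = model.predict_proba(students[covariates])[:, 1]

# Step 2: for each participant, find the comparison student with the closest
# propensity score (with replacement, for simplicity).
treated = students[students["participant"] == 1]
control = students[students["participant"] == 0]

nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
_, idx = nn.kneighbors(treated[["pscore"]])
matched_controls = control.iloc[idx.ravel()]

print(len(treated), "participants matched to", len(matched_controls), "comparison records")
```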
Initiative participation samples by cohort
Total, Oregon Department of Education average daily membership datafile match, comparison match & test score data in first program year.
| | 2014–15 cohort | 2015–16 cohort | 2016–17 cohort | Total |
|---|---|---|---|---|
| OCF participation data | 1,594 | 1,365 | 655 | 3,614 |
| Matched in ADM datafile (% total) | 1,328 (83%) | 1,225 (90%) | 568 (87%) | 3,121 (86%) |
| Matched with comparison student (% total) | 1,169 (73%) | 1,026 (75%) | 423 (65%) | 2,612 (72%) |
| Had English language arts standardized test scores in first program year (% total) | 1,119 (70%) | 1,023 (75%) | 420 (64%) | 2,559 (71%) |
| Had math standardized test scores in first program year (% total) | 1,113 (70%) | 1,026 (75%) | 423 (65%) | 2,559 (71%) |
| Had freshman on-track indicator (% total) | 1,083 (68%) | 981 (72%) | 219 (33%) | 2,283 (63%) |
Using all available data (up to five years of follow-up for the 2014–2015 cohort), we used linear mixed modeling to estimate whether trends were statistically different for participating students than for students in the matched comparison group. Key outcomes were calculated for each school year as the proportion of students who 1) met or exceeded grade-level standards for math; 2) met or exceeded grade-level standards for reading; 3) attended school regularly, or at least 90% of the time; and 4) received any type of discipline referral.
Both linear and quadratic (to account for trends with a curve) models were evaluated; final models were selected based on best fit (smallest –2 restricted log likelihood). All models included linear variables: Time, Program (Initiative=1 vs. Comparison=0) and an interaction term, Time*Program.
Quadratic models also included variables for quadratic time (Time²) and the interaction for quadratic time and program (Time²*Program). Sample sizes varied according to the outcome of interest (for example, see the table above for differences in sample sizes for English language arts and math standardized test outcomes during the first program year).
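To make the model specifications concrete, here is a minimal sketch of the linear and quadratic mixed models using statsmodels, with a random intercept per student. The data are simulated and the column names are hypothetical; the evaluation team's analysis may have been run in different software, so this is an illustration of the approach rather than the actual code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated long-format file: one row per student per school year.
rng = np.random.default_rng(1)
n_students, n_years = 300, 4
long_df = pd.DataFrame({
    "student_id": np.repeat(np.arange(n_students), n_years),
    "time":       np.tile(np.arange(n_years), n_students),
    "program":    np.repeat(rng.integers(0, 2, n_students), n_years),
})
# Outcome: 1 = met the benchmark that year (e.g., attended regularly).
long_df["outcome"] = rng.binomial(1, 0.5 + 0.05 * long_df["time"])

# Linear model: Time, Program (1 = Initiative, 0 = comparison) and Time*Program.
linear = smf.mixedlm("outcome ~ time * program", long_df,
                     groups=long_df["student_id"]).fit(reml=True)

# Quadratic model adds quadratic time and its interaction with program.
quadratic = smf.mixedlm(
    "outcome ~ time * program + I(time ** 2) + I(time ** 2):program",
    long_df, groups=long_df["student_id"]).fit(reml=True)

# Choose the better-fitting model by smallest -2 restricted log likelihood.
for name, fit in [("linear", linear), ("quadratic", quadratic)]:
    print(name, round(-2 * fit.llf, 1))
```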
The following tables compile findings from the linear mixed models for each cohort and each of the four key educational outcomes.
Attendance: Tests of fixed effects
Linear model; values are F statistics for tests of fixed effects.

| Cohort | Time | Program | Program*Time |
|---|---|---|---|
| 2014–15 | 789.77*** | 6.07* | 5.74* |
| 2015–16 | 367.14*** | 0.05 | 0.76 |
| 2016–17 | 79.83*** | 0.17 | 1.25 |

Significance levels: † p<.10, * p<.05, ** p<.01, *** p<.001.
English language arts: Tests of fixed effects
Quadratic model in 2014–2015 and 2015–2016; linear model in 2016–2017. Values are F statistics for tests of fixed effects.

| Cohort | Time (linear) | Time (quadratic) | Program | Program*Time (linear) | Program*Time (quadratic) |
|---|---|---|---|---|---|
| 2014–15 | 107.31*** | 116.34*** | 1.38 | 2.96† | 1.85 |
| 2015–16 | 1.68 | 1.61 | 2.33 | 3.39† | 3.29† |
| 2016–17 | 0.28 | – | 0.43 | 3.56† | – |
Significance levels: † p<.10, * p<.05, ** p<.01, *** p<.001.
Math: Tests of fixed effects
Quadratic model in 2014–2015 and 2015–2016; linear model in 2016–2017. Values are F statistics for tests of fixed effects.

| Cohort | Time (linear) | Time (quadratic) | Program | Program*Time (linear) | Program*Time (quadratic) |
|---|---|---|---|---|---|
| 2014–15 | 212.12*** | 103.89*** | 4.26* | 4.83* | 3.08† |
| 2015–16 | 0.09 | 3.26† | 1.39 | 0.47 | 0.53 |
| 2016–17 | 8.48** | – | 0.11 | 0.17 | – |
Significance levels: † p<.10, * p<.05, ** p<.01, *** p<.001.
Discipline: Tests of fixed effects
Quadratic model in 2014–2015 and 2015–2016; linear model in 2016–2017. Values are F statistics for tests of fixed effects.

| Cohort | Time (linear) | Time (quadratic) | Program | Program*Time (linear) | Program*Time (quadratic) |
|---|---|---|---|---|---|
| 2014–15 | 33.41*** | 34.84*** | 0.20 | 0.12 | 0.12 |
| 2015–16 | 111.81*** | 65.44*** | 0.62 | 2.45 | 2.57 |
| 2016–17 | 77.66*** | – | 0.61 | 5.54* | – |
Significance levels: † p<.10, * p<.05, ** p<.01, *** p<.001.
Freshman “on track” data was available for a single point in time (ninth-grade year, on track or not), and we used 2020–2021 five-year high school completion data for students who were in eighth grade in the 2014–2015 cohort (dropout, still in school, graduated or completed). Using chi-squared analysis, we then compared outcomes for participating students to outcomes in the matched comparison groups.
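As a simple illustration of these chi-squared comparisons, the sketch below uses a hypothetical 2x2 table of made-up counts, not the actual evaluation data.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = Initiative vs. comparison group,
# columns = on track vs. not on track at the end of ninth grade.
table = [[80, 20],
         [70, 30]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.3f}")
```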
Standardized test scores in math and English language arts are complicated to analyze longitudinally beyond middle school because ninth and 10th graders typically do not take these tests. In our data, sample sizes decreased precipitously as students left middle school. The longitudinal statistical models handled the missing data by estimating the effect of Initiative programming for all students. Sometimes, the model estimates for all students didn’t exactly match what we saw in pairwise comparisons, which used only actual data. For example, the 2014–2015 entry cohort’s second year of follow-up data (school year 2016–2017) only included test scores from those who started Initiative programming as sixth graders (i.e., eighth graders in school year 2016–2017). Their actual performance differed from model estimates that predicted trends for all students. There were no statistically significant program effects for any entry cohort in math and English language arts. Although there were some marginally significant findings, we did not find clear patterns in the data. For these reasons, we did not report longitudinal findings for meeting grade-level benchmarks in math and English language arts.
In advance of the complex statistical modeling, we calculated descriptive statistics for each educational outcome in the Initiative and comparison groups for each cohort (chi-squared analysis). We also disaggregated educational outcomes by student characteristics, Initiative program/grantee, program characteristics, program dosage (quintiles of number of hours spent in programming), and program quality (2015–2016 and 2016–2017 cohorts). Chi-squared and Cramer’s V tests were used to analyze differences between groups. Moreover, we analyzed average differences in social and emotional learning outcomes within each dichotomized outcome group using t-tests.
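For reference, here is a small sketch of how a chi-squared test with a Cramér's V effect size and the within-group t-tests could be computed; the counts and scores are hypothetical, not evaluation data.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: program type (rows) by a dichotomized
# outcome such as regular attendance (columns: met, did not meet).
table = np.array([[120, 60],
                  [ 90, 80],
                  [ 70, 40]])

chi2, p, dof, _ = stats.chi2_contingency(table)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi-square = {chi2:.2f}, p = {p:.3f}, Cramer's V = {cramers_v:.2f}")

# t-test comparing mean social and emotional learning scale scores between
# students who did and did not meet a dichotomized outcome (hypothetical).
met_outcome = [3.2, 3.5, 3.8, 3.1, 3.6, 3.4]
did_not_meet = [2.9, 3.0, 3.4, 2.8, 3.1, 3.2]
t, p_t = stats.ttest_ind(met_outcome, did_not_meet)
print(f"t = {t:.2f}, p = {p_t:.3f}")
```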
We explored a number of different ways to calculate educational outcomes, including percentage of days present, one-year growth in standardized test scores, and almost meeting grade-level benchmarks. All of these outcomes were analyzed using the linear mixed modeling described above. We found that dichotomous outcomes produced similar results and were easier to interpret and explain.
We also explored whether there were longitudinal trends in educational outcomes for Initiative versus Comparison groups for different student subsets (gender, race, free or reduced-price meal status, grade level, English language learner status, and special education). Also known as moderation analysis, this technique effectively disaggregates data trends over time by program status and student groups. Finding statistically significant moderation effects requires a great deal of statistical power (i.e., large sample sizes or large differences between groups), which was missing for many of our analyses. Moreover, the propensity score matching technique ensured that the Initiative sample was matched with a comparison sample overall, but not necessarily within particular subsets of students.
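In modeling terms, the moderation analysis adds a student-group indicator and its interactions to the mixed models described earlier. The sketch below shows the three-way interaction specification with simulated data and hypothetical column names; it is illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated long-format data with a binary student-group indicator
# (e.g., 1 = member of the subgroup of interest, 0 = not).
rng = np.random.default_rng(2)
n_students, n_years = 300, 4
df = pd.DataFrame({
    "student_id": np.repeat(np.arange(n_students), n_years),
    "time":       np.tile(np.arange(n_years), n_students),
    "program":    np.repeat(rng.integers(0, 2, n_students), n_years),
    "group":      np.repeat(rng.integers(0, 2, n_students), n_years),
})
df["outcome"] = rng.binomial(1, 0.6, len(df))

# Three-way interaction: does the Time*Program trend differ by student group?
moderated = smf.mixedlm("outcome ~ time * program * group", df,
                        groups=df["student_id"]).fit(reml=True)
print(moderated.summary())
```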
For the most part, we did not find statistically significant moderated effects. When we followed up on statistically significant moderated effects that appeared in more than one cohort, we often found that the effect diminished once we controlled for baseline differences between Initiative and Comparison students within the particular group (e.g., Latino/x students). We did not include these findings in this report because they were inconclusive and did not tell a clear story about Initiative programming.
Evolution of the Initiative evaluation
As the Initiative evolved, so did the evaluation. Though our overarching goals and core evaluation questions stayed the same, two things in particular drove shifts in the evaluation: deepening program quality work, and increased understanding of out-of-school time impacts coupled with a galvanized focus on addressing the opportunity gap.
As work on program quality improvement intensified within the learning community and our understanding of what information was most valuable to programs and Initiative leaders deepened, we shifted evaluation priorities to match. Beginning with the third round of programs, funded in late 2016, we ended student-level data collection in order to focus on supporting program quality work and related capacity-building. We continued to interview staff in person or by phone/video at least annually to capture feedback on the Initiative’s design as well as qualitative input about its impact on programs, staff and students. We also continued to track the students who participated in the first three school years, following them into high school to see what we could learn about their educational outcomes.
Oregon Community Foundation’s embrace of the opportunity gap frame in 2016 helped articulate the through-line between out-of-school time programs and the achievement gap, validating the Initiative’s focus on providing high-quality opportunities for the youth most likely to experience the opportunity gap. This coupled well with our shifting thinking about how to frame and measure program impacts. Together, these factors are pushing our work—and encouraging others to push their work—toward measures of student progress that connect more closely to what is happening in programming and to measures of conditions that support students (e.g., program quality). This is effectively shifting the burden of demonstrating progress from youth to the adults who control the systems and environments in which we hope students will thrive.
Download a brief, printer-friendly PDF on reframing out-of-school time impacts.
We also adapted our methods and our assistance to participating programs based on their needs and interests. For example, in the early stages of the Initiative, foundation and program staff were interested in understanding more about program impacts on social and emotional skill development. While there was a sense that program staff were supporting social and emotional learning and that students were accordingly developing these skills, a lot of questions remained about what competencies to focus on and how best to measure these skills in youth. The evaluation team identified Youth Development Executives of King County’s Youth Engagement, Motivation and Beliefs Survey as a promising tool and worked with participating program leaders to adapt and implement it. As the tool was administered each year, we adapted our guidance and support to programs—as well as our reporting to them of their survey results—to make the process more manageable and the results more useful.