Proving the impact of your learning initiatives

How do we evaluate the impact of a learning program in a way that isolates its effect on performance from other confounding factors? In short, how do we uncover the truth?

In 2020, companies in the United States spent $82.5 billion on learning and development, according to the “2020 Training Industry Report,” even as COVID-19 began to take a toll on expenditures. That kind of investment in people, technology and programs is substantial but hardly surprising. A PwC report, “Talent Trends 2019: Upskilling for a Digital World,” found that 79 percent of CEOs worldwide are concerned that a lack of essential skills in their workforce is threatening the future growth of their organization. The investment made to close those gaps inevitably places pressure on learning organizations to show a return and demonstrate how they drive value in KPI and ROI terms. If we’re accountable as learning leaders, it is fair to ask whether a new program or approach directly contributed to a lift in revenue, a drop in support calls or an increase in customer satisfaction. As we all strive to meet the challenges of work and learning in a pandemic-stricken world, rapid experimentation with new approaches has become mandatory, and the questions of what works and what doesn’t have become even more pressing.

But answering those questions is harder than it seems. Over the years, industry reports from sources like ATD’s “Evaluating Learning: Getting to Measurements that Matter” have repeatedly indicated that while most organizations routinely capture Level 1 (reaction) and Level 2 (learning) data, Level 3 (behavior) and Level 4 (results) data are obtained by just 35 percent of organizations. That this finding has remained largely unchanged since the survey was first launched in 2009 suggests we may have hit a ceiling of sorts.

It’s not for lack of interest. Other reports show a strong desire among leaders to evaluate better and more broadly. According to LinkedIn’s “2020 Workplace Learning Report,” learning leaders said their most important area of strategic focus for 2020 was evaluating the effectiveness of learning programs. The same survey revealed that one in five learning leaders said that demonstrating the value of learning was among their top three challenges in the near term.

There is a genuine desire within learning organizations not just to move the needle on KPIs, but to continuously innovate and improve as a team, and evaluation is the best tool we have to do that.

Why is learning’s impact so hard to accurately measure?

Measuring the true impact of learning solutions is hard for many reasons, including a lack of data, time or resources to conduct evaluation, and, perhaps the biggest reason of all, the difficulty of isolating learning as the cause of an improvement.

Historically, one of the biggest challenges in assessing the impact of learning programs has been accessing the performance data necessary to make comparisons. That has shifted in recent years, however, as learning platforms capture better data, and business intelligence and finance teams have become more willing partners in pulling the data we need.

Still, even with better access to performance data, the method frequently used to quantitatively measure the results of a newly deployed program is itself limiting. A new program is rolled out to its target audience and KPIs are measured over some period of time. If a change is seen in the data — say, revenue, error rates or customer satisfaction scores — that change is attributed to the new program. But should it be? Unfortunately, this approach is problematic.

The inherent challenge in adhering to a Kirkpatrick-style evaluation approach is that the further you move up the levels, away from the learning event itself, the harder it is to isolate the event’s impact from what researchers call confounding variables: factors outside your control that directly affect the data you’re trying to measure.

Example: sales onboarding

Let’s look at a simple scenario. A sales team is onboarding 500 new hires. Results from the usual onboarding have been acceptable, but a new and improved version of the program is planned that will incorporate upgraded content, activity enhancements, a smarter delivery blend and new approaches to follow-up coaching. The new program is deployed, and performance data are tracked over a six-month period following the program launch. The results for one key measure are charted below and clearly show an overall improvement (D).

Based on these results, can we conclude the program was effective? It’s tempting to say yes, but the truth is we don’t know. Let’s look at why that is.

When data hide the truth: the danger of correlations

It is nearly impossible to deploy a new training program in a vacuum. There are all kinds of things happening inside and outside an organization that affect performance. Returning to our sales example below, notice the occurrence of initiatives unrelated to L&D — we’ll call these “outside events” — and the noticeable change in performance following each.

[Chart: sales onboarding performance over the six months following launch, with two outside events marked and a noticeable change in performance following each]

Maybe the new sales program is happening at the same time sales leadership is making big changes to how it recruits, assigns markets, targets customers, or even what products and services are sold. These events may be having (in fact, they are intended to have) a direct effect on performance that we need to factor out before we can say confidently that our new program is better (or worse) than whatever we’re comparing it with. Is the change in performance due to the new program, other factors or both?

The false positive: While it’s certainly possible that the new onboarding program had a significant impact on performance, it’s equally possible that it had little to no benefit and that some or all of the performance lift is due to those outside events. The point is we don’t know, because the lift only correlates with the deployment of the new program; we have no idea whether the program caused any of it.

The false negative: Similarly, what if you saw little to no change at all but suspect that outside events are in play that may actually have harmed performance (e.g., poor hires, an unpopular change in the compensation plan, an economic downturn)? In that case, you’d be seeing a correlation that falsely shows your learning program had weak effects, or even failed, when it didn’t. You see the problem.

Are we saying correlations have no value? No, not necessarily. A strong lift in performance, taken together with other data (e.g., changes in Level 1 and Level 2 data, focus groups, anecdotal feedback from frontline leaders), would give us some basis for taking a correlation more seriously. That’s legitimate, and sometimes it will be the best we can do. But the fact is we don’t really know, and by relying on correlation we risk concealing the true value of our programs while making innovation more difficult.

So how do we evaluate the impact of a learning program in a way that isolates its effect on performance from other confounding factors? In short, how do we uncover the truth?

The power of pilot testing

The most conclusive way to determine that a new learning program caused (or contributed to) a change in performance is to compare outcomes between what we’ll call a pilot group and a group not using the new program, or what we’ll call a control group. Think of it as comparing a new drug against an alternative drug, or no drug at all (e.g., a placebo).

Returning to our sales example, let’s say the new program was rolled out to a pilot group of 25 individuals, while the remainder went through the usual, preexisting program. We see that the pilot group (line A) outperformed the control group (line B) over the same six-month period. We know that the difference in performance is due strictly to the learning strategy employed, because that’s the only difference between the groups.

By comparing a pilot group with a control group, we can clearly measure the program’s effects while filtering out the likely effects of those two outside events (along with any number of other hidden factors we’re not even aware of). Notice that both groups see performance change immediately following the outside events, yet the pilot group still outperformed the control group during that time, thereby proving its impact. Had we simply rolled out the new program to everyone, we would see just one line in our graph and have no idea what, if any, performance gain was due to the new program.
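As a rough illustration of that comparison, here is a minimal sketch in Python, using entirely hypothetical monthly numbers and variable names, of how the pilot-versus-control lift might be computed once KPI data for both groups are in hand:

```python
# Minimal sketch: comparing a pilot group's KPI against a control group's.
# All monthly figures below are hypothetical placeholders, not real data.

pilot_monthly_kpi = [102, 108, 115, 113, 121, 128]    # e.g., avg. revenue per rep, months 1-6
control_monthly_kpi = [101, 104, 106, 103, 109, 112]  # same KPI for reps on the existing program

def average(values):
    """Simple mean of a list of monthly KPI values."""
    return sum(values) / len(values)

pilot_avg = average(pilot_monthly_kpi)
control_avg = average(control_monthly_kpi)

# The lift attributable to the new program is the gap between the two groups,
# since outside events hit both groups at roughly the same time.
absolute_lift = pilot_avg - control_avg
relative_lift = absolute_lift / control_avg

print(f"Pilot average:   {pilot_avg:.1f}")
print(f"Control average: {control_avg:.1f}")
print(f"Lift: {absolute_lift:.1f} ({relative_lift:.1%})")
```

The same comparison can be run month by month, which often reveals when the new program’s effect begins to show up.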

There are several important considerations for running a pilot:
  • Participants in a pilot group should be a representative subset of the overall target audience (preferably randomized to the extent possible; see the sketch after this list) and compared against some or all of the remaining target audience.
  • During the pilot, performance data should be captured and tracked for both groups long enough for the desired effects to emerge, allowing time for the full solution, including any coaching or support elements, to play out.
  • Consider adding a second pilot group to the mix if you’re comparing two variations of a new program.
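To make the first consideration concrete, here is a minimal sketch, again in Python with hypothetical identifiers, of how the 500 new hires from our example might be randomly split into a pilot group of 25 and a control group:

```python
import random

# Hypothetical roster of the 500 new hires (IDs are placeholders).
new_hires = [f"rep_{i:03d}" for i in range(1, 501)]

PILOT_SIZE = 25
random.seed(42)  # fixed seed so the split can be reproduced and audited

# Shuffle the roster, then take the first 25 as the pilot group;
# everyone else goes through the existing onboarding program as the control group.
shuffled = random.sample(new_hires, k=len(new_hires))
pilot_group = shuffled[:PILOT_SIZE]
control_group = shuffled[PILOT_SIZE:]

print(f"Pilot group:   {len(pilot_group)} reps")
print(f"Control group: {len(control_group)} reps")
```

In practice you may want to stratify the split by region, role or tenure so the pilot group stays representative of the overall audience.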

What would adding a “baseline group” get us?

Let’s adjust our scenario and say that sales leadership wants to boost prospecting skills and is comparing the results of a vendor-supplied program against a new, internally developed program. This would be program versus program, like our previous scenario, but what if we also track the performance of those not going through either program, a “baseline” group?

Results could look like this:

[Chart: six-month results for the vendor program group, the internal program group and the baseline group, showing differences D1, D2 and D3]

By including performance data for individuals who did not participate in either program, we can see not only which program performed better (D1), but also how much of the overall change in performance was due strictly to each program (D2 and D3, respectively). Without this baseline data, we only know the degree to which one program is better than another; we don’t know what share of the overall gain was caused by the programs at all. By comparing pilot results against this baseline group, we know with certainty.
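To show how the baseline comparison separates those effects, here is a minimal sketch in Python with invented six-month averages; which delta is labeled D2 versus D3 is an assumption made purely for illustration:

```python
# Hypothetical six-month average results for the prospecting KPI.
# The numbers are placeholders purely for illustration.
baseline_avg = 100.0  # no program
vendor_avg = 112.0    # vendor-supplied program (pilot group 1)
internal_avg = 119.0  # internally developed program (pilot group 2)

# D1: how much better one program performed than the other.
d1_program_vs_program = internal_avg - vendor_avg

# D2 and D3: the gain of each program over the baseline group,
# i.e., the share of the overall change caused by each program itself.
d2_vendor_vs_baseline = vendor_avg - baseline_avg
d3_internal_vs_baseline = internal_avg - baseline_avg

print(f"D1 (internal vs. vendor):   {d1_program_vs_program:+.1f}")
print(f"D2 (vendor vs. baseline):   {d2_vendor_vs_baseline:+.1f}")
print(f"D3 (internal vs. baseline): {d3_internal_vs_baseline:+.1f}")
```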

When to consider this evaluation method

Due to the time needed for outcome data to emerge following delivery of learning programs, this approach is, in itself, an investment, not to mention a test of patience. It’s not necessary or practical to use this approach for every program or event; not every individual learning intervention that gets rolled out is expected, by itself, to “move the needle” at a measurable, KPI level. We need to be selective about running this kind of evaluation, and it may be best reserved for when certain conditions are met. You might ask these questions:
  • Is the stated goal of the program or strategy to directly influence one or more performance measures? All programs hope to influence improvement, but only some solutions are genuinely expected, by themselves, to result in measurable performance benefits. If your program is in the latter group, it may be a candidate for this approach.
  • Would the program’s deployment mean a substantial investment or change-management effort? Definitely consider this method when evaluating a new program that, if it were deployed broadly, would require a big change in investment, delivery approach, time to competency, hiring cycles and so forth.
  • Can you get the right data partners? Use this approach when outcome data can be gathered over time and analyzed in partnership with other business units. Lou Tedrick, VP of Global Learning and Development at Verizon, says, “The key is having strong relationships with business partners who are open to sharing performance data and with finance who can help us monetize the value of KPIs. By looking at both, we can measure the impact of learning on the business and the financial ROI.”
  • Are there other big changes happening beyond L&D? Consider this approach when you know in advance that the organization is rolling out changes that could conceal learning’s impact and you want to isolate its effects.

The benefits

Programs intended to shift attitudes or improve soft skills are notoriously difficult to measure, yet they are increasingly the focus of L&D initiatives. With this approach, it doesn’t matter what type of gains you’re targeting (leadership, sales, customer service, diversity and inclusion) because you’re comparing outcomes for groups whose only difference is the learning program they completed. You have isolated your program’s effects regardless of how soft or nonquantitative the subject matter might be.
There’s also no limit to what you might compare. This method could be used to evaluate:
  • New programs
  • Programs with updated content, interactions and activities
  • Alternate delivery approaches for the same program
  • Alternate approaches to post-training support, coaching or mentoring
Better information for better decisions

To be clear, this approach is not trying to follow tightly controlled experimental design or show what statisticians call “statistical significance.” We’re simply applying an experimental method and mindset to arrive at a much cleaner, less biased set of data that leaders can use to make better decisions and, by extension, to improve accountability for learner performance and establish the real impact of programs.

Learning leaders, like all leaders, need to be accountable for results, and this method can be instrumental in that accountability. But more than that, if learning leaders are going to be innovators, they need to know what works and what doesn’t. With this approach, you can confidently answer the question, “How well did it work?”