Sunday, April 27, 2014

Mobile assessment comes of age + research update

The idea of administering employment tests on mobile devices is not new.  But serious research into it is in its infancy.  This is to be expected for at least two reasons: (1) historically it has taken a while for enough data on a new technology to accumulate for analysis (although this is changing), and (2) it takes a while for researchers to get through the arcaneness of publishing (this, to my knowledge, isn't changing, but please prove me wrong).

Readers interested in the topic have benefited from articles elsewhere, but we're finally at a point where good research is being published on this topic.  Case in point: the June issue of the International Journal of Selection and Assessment.

The first article on this topic in this issue, by Arthur, Doverspike, Munoz, Taylor, & Carr, studied data from over 3.5 million applicants who completed unproctored internet-based tests (UIT) over a 14-month period.  And while the percentage that completed them on mobile devices was small (2%), it still yielded data on nearly 70,000 applicants.

Results?  Some in line with research you may have seen before, but some may surprise you:

- Mobile devices were (slightly) more likely to be used by women, African-Americans and Hispanics, and younger applicants.  (Think about that for a minute!)

- Scores on a personality inventory were similar across platforms.

- Scores on a cognitive ability test were lower for those using mobile devices.  Without access to the entire article, I can only speculate on proffered reasons, but it's interesting to think about whether this is a reflection of the applicants or the platform.

- Tests of measurement invariance found equivalence across platforms (which basically means the same thing(s) appeared to be measured).

So overall, I think this is promising for including a mobile component in UITs.
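As a back-of-the-envelope illustration of how a platform score gap like the cognitive ability finding above gets quantified, here's a standardized mean difference (Cohen's d) computed over invented numbers -- these are not the article's data, just a sketch:

```python
from statistics import mean, stdev

def cohens_d(group1, group2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(group1) - mean(group2)) / pooled_sd

# Hypothetical cognitive ability scores (made up for illustration)
pc_scores = [52, 55, 49, 60, 58, 51, 54, 57]
mobile_scores = [48, 50, 45, 53, 51, 47, 49, 52]

print(round(cohens_d(pc_scores, mobile_scores), 2))  # → 1.58
```

With real applicant data the interesting question is exactly the one raised above: whether a nonzero d reflects the applicants who choose mobile or the platform itself.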

The next article, by Morelli, Mahan, and Illingworth, also looked at measurement invariance of mobile versus non-mobile (i.e., PC-delivered) internet-based tests, with respect to four types of assessment: cognitive ability, biodata, a multimedia work simulation, and a text-based situational judgment test.  Data were gathered from nearly 600,000 test-takers in the hospitality industry who were applying for maintenance and customer-facing jobs in 2011 and 2012 (note the different job types).  Nearly 25,000 of these applicants took the assessment on mobile devices.

Results?  The two types of administrations appeared to be equivalent in terms of what they were measuring.  However, interestingly, mobile test-takers did worse on the SJT portion.  The authors reasonably hypothesize this may be due to the nature of the SJT and the amount of attention it may have required compared to the other test types.  (btw this article appears to be based on Morelli's dissertation, which can be found here--it's a treasure trove of information on the topic)

Again, overall these are promising results for establishing the measurement equivalence of mobile assessments.  What does this all mean?  It suggests that unproctored tests delivered using mobile devices are measuring the same things as tests delivered using more traditional internet-based methods.  It also looks like fakability or inflation may be a non-issue (compared to traditional UIT).  This preliminary research means researchers and practitioners should be more confident that mobile assessments can be used meaningfully.

I agree with others that this is only the beginning.  In our mobile and app-reliant world, we're only scratching the surface not only in terms of research but in terms of what can be done to measure competencies in new--and frankly more interesting--ways.  Not to mention all the interesting (and important) associated research questions:

- Do natively developed apps differ in measurement properties--and potential--compared to more traditional assessments simply delivered over mobile?

- How does assessment delivery model interact with job type?  (e.g., may be more appropriate for some, may be better than traditional methods for others)

- What competencies should test developers be looking for when hiring?  (e.g., should they be hiring game developers?)

- What do popular apps, such as Facebook (usage) and Candy Crush (score), measure--if anything?

- Oh, and how about: does mobile assessment impact criterion-related validity?

Lest you think I've forgotten the rest of this excellent issue...

- MacIver, et al. introduce the concept of user validity, which uses test-taker perceptions to focus on ways we can improve assessments, score interpretation, and the provision of test feedback.

- Bing, et al. provide more evidence that contextualizing personality inventory items (i.e., wording the items so they more closely match the purpose/situation) improves the prediction of job performance--beyond noncontextual measures of the same traits.

- On the other hand, Holtrop, et al. take things a step further and look at different methods of contextualization.  Interestingly, this study of 139 pharmacy assistants found a decrease in validity compared to a "generic" personality inventory!

- A study by Ioannis Nikolaou in Greece of social networking websites (SNWs) found that job seekers still use job boards more than SNWs, that SNWs may be particularly effective for passive candidates (!), and that HR professionals found LinkedIn to be more effective than Facebook.

- An important study of applicant withdrawal behavior by Brock Baskin, et al., that found withdrawal tied primarily to obstructions (e.g., distance to test facility) rather than minority differences in perception.

- A study of Black-White differences on a measure of emotional intelligence by Whitman, et al., that found (N=334) Blacks had higher face validity perceptions of the measure, but Whites performed significantly better.

- Last, a study by Vecchione that compared the fakability of implicit personality measures to explicit personality measures.  Implicit measures are somewhat "hidden" in that they measure attitudes or characteristics using perceptual speed or other tools to discover your typical thought patterns; you may be familiar with Project Implicit, which has gotten some media coverage.  Explicit measures are, as the name implies, more obvious items--in this case, about personality aspects.  In this study of a relatively small number of security guards and semiskilled workers, the researchers found the implicit measure to be superior in terms of fakability resistance.  (I wonder how the test-takers felt?)

That's it for this excellent issue of IJSA, but in the last few months we also got some more great research care of the March issue of the Journal of Applied Psychology:

- An important (but small N) within-subjects study by Judge, et al. of the stability of personality at work.  They found that while traits exhibited stability across time, there were also deviations that were explained by work experiences such as interpersonal conflict, which has interesting implications for work behavior as well as measurement.  In addition, the authors found that individuals high in neuroticism exhibited more variation in traits over time compared to those who were more emotionally stable.  You can find an in press version here; it's worth a read, particularly the section beginning on page 47 on practical implications.

- Smith-Crowe, et al. present a set of guidelines for researchers and practitioners looking to draw conclusions from tests of interrater agreement, whose underlying assumptions are rarely true in practice.

- Another interesting one: Wille & De Fruyt investigate the reciprocal relationship between personality and work.  The researchers found that while personality shapes occupational experiences, the relationship works in both directions and work can become an important source of identity.

- Here's one for you assessment center fans: this study by Speer, et al. adds to the picture through findings that ratings taken from exercises with dissimilar demands actually had higher criterion-related validity than ratings taken from similar exercises!

- Last but not least, presenting research findings in a way that is understandable to non-researchers poses an ongoing--and important--challenge.  Brooks et al. present results of their study that found non-traditional effect size indicators (e.g., a common language effect size indicator) were perceived as more understandable and useful when communicating results of an intervention.  Those of you who have trained or consulted for any length of time know how important it is to turn correlations into dollars or time (or both)!
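A common language effect size turns a group difference into a probability statement, e.g., the chance that a randomly chosen person from the intervention group outscores a randomly chosen person from the comparison group.  Here's a minimal sketch using McGraw and Wong's CL statistic, assuming roughly normal score distributions (the numbers are invented, not from the Brooks et al. study):

```python
from math import erf, sqrt

def common_language_es(mean1, sd1, mean2, sd2):
    """P(random draw from group 1 > random draw from group 2),
    assuming both groups are normally distributed and independent."""
    z = (mean1 - mean2) / sqrt(sd1**2 + sd2**2)
    return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z

# Invented example: trained vs. untrained employees on a performance measure
p = common_language_es(mean1=80, sd1=10, mean2=70, sd2=10)
print(f"A randomly chosen trained employee outscores an untrained one "
      f"about {p:.0%} of the time")  # → about 76% of the time
```

"A trained employee beats an untrained one 76% of the time" tends to land with stakeholders in a way that "d = 0.71" simply doesn't--which is the paper's point.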

That's it for now!