Definitions of Critical Thinking
One of the most debatable features of critical thinking is what constitutes critical thinking—its definition. Table 1 shows definitions of critical thinking drawn from the frameworks reviewed in Markle et al. (2013). The different sources of the frameworks (e.g., higher education and workforce) focus on different aspects of critical thinking. Some foreground the reasoning process specific to critical thinking, while others emphasize its outcomes, such as whether it can be used for decision making or problem solving. Notably, none of the frameworks referenced in the Markle et al. paper offers an actual assessment of critical thinking based on the group's definition. For example, in the case of the VALUE (Valid Assessment of Learning in Undergraduate Education) initiative, part of the AAC&U's LEAP campaign, the VALUE rubrics were developed to serve as generic guidelines for faculty members designing their own assessments or grading activities. This approach provides great flexibility to faculty and accommodates local needs. However, it also raises concerns about reliability in how faculty members apply the rubrics. A recent AAC&U research study found that percent agreement in scoring was fairly low when multiple raters scored the same student work using the VALUE rubrics (Finley, 2012). For example, when the critical thinking rubric was applied, the rate of perfect agreement across multiple raters using four scoring categories was only 36%.
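To make the agreement statistic concrete, the sketch below computes the rate of exact agreement among raters applying a four-level rubric; the scores are invented for illustration and are not AAC&U data.

```python
from itertools import combinations

def exact_agreement_rate(ratings_by_artifact):
    """Fraction of rater pairs assigning identical rubric levels, pooled over artifacts."""
    pair_total = pair_agree = 0
    for ratings in ratings_by_artifact:
        for a, b in combinations(ratings, 2):
            pair_total += 1
            pair_agree += (a == b)
    return pair_agree / pair_total

# Hypothetical data: three raters score four student artifacts on a 4-level rubric.
scores = [(3, 3, 2), (1, 2, 2), (4, 4, 4), (2, 3, 2)]
print(exact_agreement_rate(scores))  # → 0.5
```

Exact agreement is a deliberately strict criterion; adjacent-agreement or chance-corrected indices such as Cohen's kappa would give a more forgiving picture of the same ratings.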
In addition to the frameworks discussed by Markle et al. (2013), there are other influential research efforts on critical thinking. Unlike those frameworks, these research efforts have led to commercially available critical thinking assessments. For example, in a study sponsored by the American Philosophical Association (APA), Facione (1990b) spearheaded an effort to identify a consensus definition of critical thinking using the Delphi method, an expert consensus approach. For the APA study, 46 members recognized as having experience or expertise in critical thinking instruction, assessment, or theory shared reasoned opinions about critical thinking. The experts were asked to provide their own lists of the skill and dispositional dimensions of critical thinking. After rounds of discussion, the experts reached agreement on the core cognitive dimensions of critical thinking: (a) interpretation, (b) analysis, (c) evaluation, (d) inference, (e) explanation, and (f) self-regulation—making it clear that a person does not have to be proficient at every skill to be considered a critical thinker. The experts also reached consensus on the affective, dispositional components of critical thinking, such as “inquisitiveness with regard to a wide range of issues,” “concern to become and remain generally well-informed,” and “alertness to opportunities to use CT [critical thinking]” (Facione, 1990b, p. 13). Two decades later, the approach AAC&U took to define critical thinking was heavily influenced by the APA definitions.
Halpern also led a noteworthy research and assessment effort on critical thinking. In her 2003 book, Halpern defined critical thinking as
…the use of those cognitive skills or strategies that increase the probability of a desirable outcome. It is used to describe thinking that is purposeful, reasoned, and goal directed—the kind of thinking involved in solving problems, formulating inferences, calculating likelihoods, and making decisions, when the thinker is using skills that are thoughtful and effective for the particular context and type of thinking task. (Halpern, 2003, p. 6)
Halpern's approach to critical thinking has a strong focus on the outcome or utility aspect of critical thinking, in that critical thinking is conceptualized as a tool to facilitate decision making or problem solving. Halpern recognized several key aspects of critical thinking, including verbal reasoning, argument analysis, assessing likelihood and uncertainty, making sound decisions, and thinking as hypothesis testing (Halpern, 2003).
These two research efforts, led by Facione and Halpern, gave rise to two commercially available assessments of critical thinking, the California Critical Thinking Skills Test (CCTST) and the Halpern Critical Thinking Assessment (HCTA), respectively, which are described in detail in the following section on existing assessments. Interested readers are also referred to research on constructs that overlap with critical thinking, such as argumentation (Godden & Walton, 2007; Walton, 1996; Walton, Reed, & Macagno, 2008) and reasoning (Carroll, 1993; Powers & Dwyer, 2003).
Existing Assessments of Critical Thinking
Multiple Themes of Assessments
As with the multifaceted definitions offered for critical thinking, critical thinking assessments also tend to capture multiple themes. Table 2 presents some of the most popular assessments of critical thinking, including the CCTST (Facione, 1990a), the California Critical Thinking Disposition Inventory (CCTDI; Facione & Facione, 1992), the Watson–Glaser Critical Thinking Appraisal (WGCTA; Watson & Glaser, 1980), the Ennis–Weir Critical Thinking Essay Test (Ennis & Weir, 1985), the Cornell Critical Thinking Test (CCTT; Ennis, Millman, & Tomko, 1985), the ETS® Proficiency Profile (EPP; ETS, 2010), the Collegiate Learning Assessment+ (CLA+; Council for Aid to Education, 2013), the Collegiate Assessment of Academic Proficiency (CAAP; CAAP Program Management, 2012), and the HCTA (Halpern, 2010). The last column in Table 2 shows how critical thinking is operationally defined in these widely used assessments. The assessments overlap in a number of key themes, such as reasoning, analysis, argumentation, and evaluation. They also differ along a few dimensions, such as whether critical thinking should include decision making and problem solving (e.g., CLA+, HCTA, and the California Measure of Mental Motivation [CM3]), be integrated with writing (e.g., CLA+), or involve metacognition (e.g., CM3).
| Assessment | Publisher | Item format | Delivery | Time | Length | How critical thinking is operationalized |
| --- | --- | --- | --- | --- | --- | --- |
| California Critical Thinking Disposition Inventory (CCTDI) | Insight Assessment (California Academic Press) | Selected-response (Likert scale—extent to which students agree or disagree) | Online or paper/pencil | 30 min | 75 items (seven scales: 9–12 items per scale) | Contains seven scales of critical thinking disposition: (a) truth-seeking, (b) open-mindedness, (c) analyticity, (d) systematicity, (e) confidence in reasoning, (f) inquisitiveness, and (g) maturity of judgment (Facione, Facione, & Sanchez, 1994) |
| California Critical Thinking Skills Test (CCTST) | Insight Assessment (California Academic Press) | Multiple-choice (MC) | Online or paper/pencil | 45 min | 34 items (vignette based) | Returns scores on the following scales: (a) analysis, (b) evaluation, (c) inference, (d) deduction, (e) induction, and (f) overall reasoning skills (Facione, 1990a) |
| California Measure of Mental Motivation (CM3) | Insight Assessment (California Academic Press) | Selected-response (4-point Likert scale: strongly disagree to strongly agree) | Online or paper/pencil | 20 min | 72 items | Measures and reports scores in the following areas: (a) learning orientation, (b) creative problem solving, (c) cognitive integrity, (d) scholarly rigor, and (e) technological orientation (Insight Assessment, 2013) |
| Collegiate Assessment of Academic Proficiency (CAAP) Critical Thinking | ACT | MC | Paper/pencil | 40 min | 32 items (includes four passages representative of issues commonly encountered in a postsecondary curriculum) | Measures students' skills in analyzing elements of an argument, evaluating an argument, and extending arguments (CAAP Program Management, 2012) |
| Collegiate Learning Assessment+ (CLA+) | Council for Aid to Education (CAE) | Performance task (PT) and MC | Online | 90 min (60 min for PT; 30 min for MC) | 26 items (one PT; 25 MC) | The PTs measure higher order skills including (a) analysis and problem solving, (b) writing effectiveness, and (c) writing mechanics; the MC items assess (a) scientific and quantitative reasoning, (b) critical reading and evaluation, and (c) critiquing an argument (Zahner, 2013) |
| Cornell Critical Thinking Test (CCTT) | The Critical Thinking Co. | MC | Computer based (using the software) or paper/pencil | 50 min (can also be administered untimed) | Level X: 71 items; Level Z: 52 items | Level X is intended for students in Grades 5–12+ and measures (a) induction, (b) deduction, (c) credibility, and (d) identification of assumptions; Level Z is intended for students in Grades 11–12+ and measures those four skills plus (e) semantics, (f) definition, and (g) prediction in planning experiments (The Critical Thinking Co., 2014) |
| Ennis–Weir Critical Thinking Essay Test | Midwest Publications | Essay | Paper/pencil | 40 min | Nine-paragraph essay/letter | Measures the following areas of critical thinking competence: (a) getting the point, (b) seeing reasons and assumptions, (c) stating one's point, (d) offering good reasons, (e) seeing other possibilities, and (f) responding appropriately to and/or avoiding argument weaknesses (Ennis & Weir, 1985) |
| ETS Proficiency Profile (EPP) Critical Thinking | ETS | MC | Online and paper/pencil | About 40 min (full test is 2 h) | 27 items (standard form) | Measures a student's ability to (a) distinguish between rhetoric and argumentation in a piece of nonfiction prose, (b) recognize assumptions and the best hypothesis to account for information presented, (c) infer and interpret a relationship between variables, and (d) draw valid conclusions based on information presented (ETS, 2010) |
| Halpern Critical Thinking Assessment (HCTA) | Schuhfried Publishing, Inc. | Forced choice (MC, ranking, or rating of alternatives) and open-ended | Computer based | Untimed; typically 60–80 min (Form S1) or 20 min (Form S2) | 25 scenarios of everyday events (five per subcategory); Form S1 contains both open-ended and forced-choice items, Form S2 all forced-choice items | Measures five critical thinking subskills: (a) verbal reasoning skills, (b) argument and analysis skills, (c) skills in thinking as hypothesis testing, (d) using likelihood and uncertainty, and (e) decision-making and problem-solving skills (Halpern, 2010) |
| Watson–Glaser Critical Thinking Appraisal tool (WGCTA) | Pearson | MC | Online and paper/pencil | Standard (Forms A and B): 40–60 min if timed; short form: 30 min if timed | Standard: 80 items; short form: 40 items | Composed of five tests: (a) inference, (b) recognition of assumptions, (c) deduction, (d) interpretation, and (e) evaluation of arguments. Each test contains both neutral and controversial reading passages and scenarios encountered at work, in the classroom, and in the media. Although there are five tests, only the total score is reported (Watson & Glaser, 2008a, 2008b) |
| Watson–Glaser II | Pearson | MC | Online and paper/pencil | 40 min if timed | 40 items | Measures and provides interpretable subscores for three contemporary, business-relevant critical thinking skill domains: the ability to (a) recognize assumptions, (b) evaluate arguments, and (c) draw conclusions (Watson & Glaser, 2010) |
The majority of the assessments exclusively use selected-response items such as multiple-choice or Likert-type items (e.g., CAAP, CCTST, and WGCTA). EPP, HCTA, and CLA+ use a combination of multiple-choice and constructed-response items (though the essay is optional in EPP), and the Ennis–Weir test is an essay test. Given the limited testing time, only a small number of constructed-response items can typically be used in a given assessment.
Test and Scale Reliability
Although constructed-response items have great face validity and have the potential to offer authentic contexts in assessments, they tend to have lower levels of reliability than multiple-choice items for the same amount of testing time (Lee, Liu, & Linn, 2011). For example, according to a recent report released by the sponsor of the CLA+, the Council for Aid to Education (Zahner, 2013), the reliability of the 60-min constructed-response section is only .43. The test-level reliability is .87, largely driven by the reliability of CLA+'s 30-min short multiple-choice section.
Because of the multidimensional nature of critical thinking, many existing assessments include multiple subscales and report subscale scores. The main advantage of subscale scores is that they provide detailed information about test takers' critical thinking ability. The downside, however, is that subscale scores are often hampered by unsatisfactory reliability and a lack of distinction between scales. For example, the CCTST reports a score on overall reasoning skills and subscale scores on five aspects of critical thinking: (a) analysis, (b) evaluation, (c) inference, (d) deduction, and (e) induction. However, Leppa (1997) reported that the subscales have low internal consistency, from .21 to .51, much lower than the reliabilities (.68 to .70) reported by the authors of the CCTST (Ku, 2009). Similarly, the WGCTA provides subscale scores on inference, recognition of assumptions, deduction, interpretation, and evaluation of arguments; studies found that the internal consistency of some of these subscales was low and varied widely, from .17 to .74 (Loo & Thorpe, 1999). Additionally, there was no clear evidence of distinct subscales: a meta-analysis of 60 published studies recovered a single-component structure (Bernard et al., 2008). Studies have also reported unstable factor structure and low reliability for the CCTDI (Kakai, 2003; Walsh & Hardy, 1997; Walsh, Seldomridge, & Badros, 2007).
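Part of the subscale-reliability problem is mechanical: reliability falls as a scale gets shorter, a relationship captured by the Spearman–Brown prophecy formula. The numbers below are illustrative only, not estimates from the cited studies.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is shortened or lengthened by length_factor."""
    k = length_factor
    return k * reliability / (1 + (k - 1) * reliability)

# Illustrative: a 34-item test with overall reliability .70, carved into
# a 7-item subscale (length factor 7/34).
print(round(spearman_brown(0.70, 7 / 34), 2))  # → 0.32
```

Even a respectable full-test reliability thus implies subscale reliabilities in the range the critics report, before any lack of scale distinctness is considered.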
Comparability of Forms
For reasons such as test security and construct representation, most assessments employ multiple forms. The comparability of these forms is another source of concern. For example, Jacobs (1999) found that Form B of the CCTST was significantly more difficult than Form A, and other studies also found low comparability between the two forms (Bondy, Koenigseder, Ishee, & Williams, 2001).
Table 3 presents some of the more recent validity studies of existing critical thinking assessments. Most studies focus on the correlation of critical thinking scores with scores on other general cognitive measures. For example, critical thinking assessments have shown moderate correlations with general cognitive assessments such as the SAT® or GRE® tests (e.g., Ennis, 2005; Giancarlo, Blohm, & Urdan, 2004; Liu, 2008; Stanovich & West, 2008; Watson & Glaser, 2010), as well as moderate correlations with course grades and GPA (Gadzella et al., 2006; Giancarlo et al., 2004; Halpern, 2006; Hawkins, 2012; Liu & Roohr, 2013; Williams et al., 2003). A few studies have examined the relationship of critical thinking to behaviors, job performance, or life events. Ejiogu, Yang, Trent, and Rose (2006) found that WGCTA scores correlated moderately and positively with job performance (corrected r = .32 to .52). Butler (2012) examined the external validity of the HCTA and concluded that those with higher critical thinking scores experienced fewer negative life events than those with lower critical thinking skills (r = −.38).
| Study | Assessment | Sample | N | Key findings |
| --- | --- | --- | --- | --- |
| Butler (2012) | HCTA | Community college students, state university students, and community adults | 131 | Significant moderate correlation with the real-world outcomes of critical thinking inventory (r(131) = −.38), meaning those with higher critical thinking scores reported fewer negative life events |
| Ejiogu et al. (2006) | WGCTA Short Form | Analysts in a government agency | 84 | Significant moderate correlations, corrected for criterion unreliability, ranging from .32 to .52 with supervisory ratings of job performance behaviors; highest correlations were with analysis and problem solving (r(68) = .52) and with judgment and decision making (r(68) = .52) |
| Ennis (2005) | Ennis–Weir Critical Thinking Essay Test | Undergraduates in an educational psychology course (Taube, 1997) | 198 | Moderate correlation with WGCTA (r(187) = .37); low to moderate correlations with personality assessments ranging from .24 to .35; low to moderate correlations with SAT verbal (r(155) = .40), SAT quantitative (r(155) = .28), and GPA (r(171) = .28) |
| | | Malay undergraduates with English as a second language (Moore, 1995) | 60 | Correlations with SAT verbal (pretest: r(60) = .34, posttest: r(60) = .59), TOEFL® (pre: r(60) = .35, post: r(60) = .48), ACT (pre: r(60) = .25, post: r(60) = .66), TWE® (pre: r(60) = −.56, post: r(60) = −.07), and SPM (pre: r(60) = .41, post: r(60) = .35) |
| | | 10th-, 11th-, and 12th-grade students (Norris, 1995) | 172 | Low to moderate correlations with WGCTA (r(172) = .28), CCTT (r(172) = .32), and the Test on Appraising Observations (r(172) = .25) |
| Gadzella et al. (2006) | WGCTA Short Form | State university students (psychology, educational psychology, and special education undergraduate majors; graduate students) | 586 | Low to moderately high significant correlations with course grades ranging from .20 to .62 (r(565) = .30 for total group; r(56) = .62 for psychology majors) |
| Giddens and Gloeckner (2005) | CCTST; CCTDI | Baccalaureate nursing program in the southwestern United States | 218 | Students who passed the NCLEX had significantly higher total critical thinking scores on the CCTST entry test (t(101) = 2.5*, d = 1.0), CCTST exit test (t(191) = 3.0**, d = .81), and CCTDI exit test (t(183) = 2.6**, d = .72) than students who failed the NCLEX |
| Halpern (2006) | HCTA | Study 1: Junior and senior students from high school and college in California | 80 high school, 80 college | Moderate significant correlations with the Arlin Test of Formal Reasoning (r = .32) for both groups |
| | | Study 2: Undergraduate and second-year master's students from California State University, San Bernardino | 145 undergraduates, 32 master's | Moderate to moderately high correlations with the Need for Cognition scale (r = .32), GPA (r = .30), SAT Verbal (r = .58), SAT Math (r = .50), and GRE Analytic (r = .59) |
| Giancarlo et al. (2004) | CM3 | 9th- and 11th-grade public school students in northern California (validation study 2) | 484 | Statistically significant correlations between four CM3 subscales (learning, creative problem solving, mental focus, and cognitive integrity) and measures of mastery goals (r(482) = .09 to .67), self-efficacy (r(482) = .22 to .47), SAT9 Math (r(379) = .18 to .33), SAT9 Reading (r(387) = .13 to .43), SAT9 Science (r(380) = .11 to .22), SAT9 Language/Writing (r(382) = .09 to .17), SAT9 Social Science (r(379) = .09 to .18), and GPA (r(468) = .19 to .35) |
| | | 9th- to 12th-grade all-female college preparatory students in Missouri (validation study 3) | 587 | Statistically significant correlations between four CM3 subscales (learning, creative problem solving, mental focus, and cognitive integrity) and PSAT Math (r(434) = .15 to .37), PSAT Verbal (r(434) = .20 to .31), PSAT Writing (r(291) = .21 to .33), the PSAT selection index (r(434) = .23 to .40), and GPA (r(580) = .21 to .46) |
| Hawkins (2012) | CCTST | Students enrolled in undergraduate English courses at a small liberal arts college | 117 | Moderate significant correlation between total score and GPA (r = .45); moderate significant subscale correlations with GPA ranged from .27 to .43 |
| Liu and Roohr (2013) | EPP | Community college students from 13 institutions | 46,402 | Students with higher GPA and more credit hours performed higher on the EPP than students with lower GPA and fewer credit hours; GPA was the strongest significant predictor of critical thinking (β = .21, η² = .04) |
| Watson and Glaser (2010) | WGCTA | Undergraduate educational psychology students (Taube, 1997) | 198 | Moderate significant correlations with SAT Verbal (r(155) = .43), SAT Math (r(155) = .39), GPA (r(171) = .30), and Ennis–Weir (r(187) = .37); low to moderate correlations with personality assessments ranging from .07 to .33 |
| | | Three semesters of freshman nursing students in eastern Pennsylvania (Behrens, 1996) | 172 | Moderately high significant correlations with fall semester GPA ranging from .51 to .59 |
| | | Education majors in an educational psychology course at a southwestern state university (Gadzella, Baloglu, & Stephens, 2002) | 114 | Significant correlation between total score and GPA (r = .28); significant correlations between the five WGCTA subscales and GPA ranging from .02 to .34 |
| Williams et al. (2003) | CCTST; CCTDI | First-year dental hygiene students from seven U.S. baccalaureate universities | 207 | Significant correlations between the CCTST and CCTDI at baseline (r = .41) and at second semester (r = .26); significant correlations between the CCTST and knowledge, faculty ratings, and clinical reasoning ranging from .24 to .37 at baseline and from .23 to .31 at second semester; for the CCTDI, significant correlations with knowledge, faculty ratings, and clinical reasoning ranged from .15 to .19 at baseline, and with clinical reasoning (r = .21) at second semester; the CCTDI was a more consistent predictor of student performance (4.9–12.3% variance explained) than traditional predictors such as age, GPA, and number of college hours (2.1–4.1% variance explained) |
| Williams, Schmidt, Tilliss, Wilkins, and Glasnapp (2006) | CCTST; CCTDI | First-year dental hygiene students from three U.S. baccalaureate dental hygiene programs | 78 | Significant correlation between the CCTST and CCTDI at baseline (r = .29); significant correlations between the CCTST and NBDHE Multiple-Choice (r = .35) and Case-Based tests (r = .47) at baseline and at program completion (r = .30 and .33, respectively); significant correlations between the CCTDI and NBDHE Case-Based at baseline (r = .25) and at program completion (r = .40); the CCTST was a more consistent predictor of student performance on both NBDHE Multiple-Choice (10.5% variance explained) and NBDHE Case-Based scores (18.4% variance explained) than traditional predictors such as age, GPA, and number of college hours |
Our review of validity evidence for existing assessments revealed that the quality and quantity of research support vary considerably across assessments. Common problems include insufficient evidence of distinct dimensionality, unreliable subscores, noncomparable test forms, and unclear evidence of differential validity across groups of test takers. In a review of the psychometric quality of existing critical thinking assessments, Ku (2009) observed that studies conducted by researchers unaffiliated with a test's authors tend to report lower psychometric quality than studies conducted by the authors and their affiliates.
For future research, a component of validity missing from many existing studies is the incremental predictive validity of critical thinking. As Kuncel (2011) pointed out, evidence is needed to clarify how well critical thinking skills predict desirable outcomes (e.g., job performance) beyond what is predicted by other general cognitive measures. Without controlling for other types of general cognitive ability, it is difficult to evaluate the unique contribution that critical thinking skills make to various outcomes. For example, the Butler (2012) study did not control for any measure of participants' general cognitive ability, leaving room for the alternative explanation that other aspects of general cognitive ability, rather than critical thinking, contributed to participants' life success.
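The incremental-validity question is typically examined with hierarchical regression: enter a general cognitive measure first, then test how much a critical thinking score adds to R². A minimal sketch on synthetic data (all values invented; `g`, `ct`, and `outcome` are hypothetical measures):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
g = rng.normal(size=n)                    # general cognitive ability
ct = 0.6 * g + 0.8 * rng.normal(size=n)   # critical thinking, correlated with g
outcome = 0.5 * g + 0.2 * ct + rng.normal(size=n)  # e.g., a job-performance rating

def r_squared(X, y):
    """R^2 from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - np.var(y - X @ beta) / np.var(y)

r2_step1 = r_squared(g.reshape(-1, 1), outcome)          # Step 1: g alone
r2_step2 = r_squared(np.column_stack([g, ct]), outcome)  # Step 2: g + critical thinking
print(f"incremental R^2 from critical thinking: {r2_step2 - r2_step1:.3f}")
```

If the increment were near zero after controlling for g, the critical thinking score would add little predictive value, which is precisely the kind of evidence Kuncel calls for.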
Challenges in Designing Critical Thinking Assessment
Authenticity Versus Psychometric Quality
A major challenge in designing a critical thinking assessment is striking a balance between authenticity and psychometric quality. Most current assessments rely on multiple-choice items to measure critical thinking. The advantages of such assessments lie in their objectivity, efficiency, high reliability, and low cost. Typically, within the same amount of testing time, multiple-choice items provide more information about what test takers know than constructed-response items do (Lee et al., 2011). Wainer and Thissen (1993) reported that scoring 10 constructed-response items cost about $30, whereas scoring enough multiple-choice items to achieve the same level of reliability cost only 1¢. Although multiple-choice items cost less to score, they typically cost more to develop than constructed-response items. That said, the overall cost structure of multiple-choice versus constructed-response items depends on the number of scores derived from a given item over its life cycle.
Studies also show high correlations between multiple-choice and constructed-response items measuring the same constructs (Klein et al., 2009). Rodriguez (2003) investigated construct equivalence between the two item formats through a meta-analysis of 63 studies and concluded that the formats are highly correlated when measuring the same content: the mean correlation was about .95 with item-stem equivalence and .92 without. The Klein et al. (2009) study compared the construct validity of three standardized assessments of college learning outcomes (i.e., EPP, CLA, and CAAP), including critical thinking; the school-level correlation between a multiple-choice and a constructed-response critical thinking test was .93.
Given that constructed-response items can be more expensive to score and that multiple-choice items can measure the same constructs equally well in some cases, one might argue for using only multiple-choice items and discarding constructed-response items. However, constructed-response items make it possible to create more authentic contexts and to assess students' ability to generate, rather than merely select, responses. In real-life situations that call for critical thinking, no answer choices are provided; people are expected to come up with their own options and determine which is preferable given the question at hand. Research has long established that the ability to recognize is different from the ability to generate (Frederiksen, 1984; Lane, 2004; Shepard, 2000). For critical thinking, then, constructed-response items may be a better proxy for real-world scenarios than multiple-choice items.
We agree with researchers who call for multiple item formats in critical thinking assessments (e.g., Butler, 2012; Halpern, 2010; Ku, 2009). Constructed-response items alone cannot meet psychometric standards because of their low internal consistency, one type of reliability; a combination of item formats offers the potential for an assessment that is both authentic and psychometrically sound.
Instructional Value Versus Standardization
Another challenge in designing a standardized critical thinking assessment for higher education is attending to the assessment's instructional relevance. Faculty members are sometimes concerned about the limited relevance of results from general student learning outcomes assessments, as these assessments tend to be created in isolation from curriculum and instruction. For example, although most institutions consider critical thinking a necessary skill for their students (AAC&U, 2011), few offer courses specifically designed to foster it. Therefore, even if assessment results show that students at a particular institution lack critical thinking skills, no specific department, program, or faculty member would claim responsibility, which greatly limits the practical use of the results. It is important to identify the common goals of general higher education and translate them into the design of learning outcomes assessments. The VALUE rubrics created by AAC&U (Rhodes, 2010) are good examples of how a common framework can be created to align expectations about college students' critical thinking skills. While attending to instructional relevance, one should also keep in mind the inherent tension between instructional relevance and standardization. Standardized assessment offers comparability and generalizability across institutions and across programs within an institution. An assessment designed to closely reflect the objectives and goals of a particular program will have great instructional relevance and will likely offer rich diagnostic information about students in that program, but it may not serve as a meaningful measure of outcomes for students in other programs.
When designing an assessment for critical thinking, it is essential to find that balance point so the assessment results bear meaning for the instructors and provide information to support comparisons across programs and institutions.
Institutional Versus Individual Use
Another concern is whether the assessment should be designed to provide results for institutional or individual use, a decision with implications for psychometric considerations such as reliability and validity. For an institution-level assessment, results need only be reliable at the group level (e.g., major, department), whereas for an individual assessment, results must be reliable at the level of the individual test taker. Typically, more items are required to achieve acceptable individual-level reliability than institution-level reliability. When assessment results are used only at an aggregate level, as they currently are by most institutions, the validity of the scores is in question because students may not expend maximum effort when answering the items. Student motivation on low-stakes assessments has long been a source of concern. A recent study by Liu, Bridgeman, and Adler (2012) confirmed that motivation plays a significant role in student performance on low-stakes learning outcomes assessments in higher education: conclusions about students' learning gains in college could vary significantly depending on whether they are motivated to take the test. If possible, the assessment should be designed to provide reliable information about individual test takers, which would allow them to benefit directly from the test (e.g., by obtaining a certificate of achievement). The increased stakes may help boost students' motivation when taking such assessments.
General Versus Domain-Specific Assessment
Critical thinking has been defined as a generic skill in many existing frameworks and assessments (e.g., Bangert-Drowns & Bankert, 1990; Ennis, 2003; Facione, 1990b; Halpern, 1998). On one hand, many educators and philosophers believe that critical thinking is a set of skills and dispositions that can be applied across specific domains (Davies, 2013; Ennis, 1989; Moore, 2011). These generalists depict critical thinking as an enabling skill similar to reading and writing and argue that it can be taught outside the context of a specific discipline. On the other hand, the specifists hold that critical thinking is a domain-specific skill and that the critical thinking required for nursing, for example, would be very different from that practiced in engineering (Tucker, 1996). To date, much of the debate remains at the theoretical level, with little empirical evidence confirming either the generality or the specificity of critical thinking (Nicholas & Labig, 2013), and the empirical findings that do exist are mixed. Powers and Enright (1987) surveyed 255 faculty members in six disciplinary domains to understand the kinds of reasoning and analytical abilities required for successful performance at the graduate level. They found that some general skills, such as “reasoning or problem solving in situations in which all the needed information is not known,” were valued by faculty in all domains (p. 670). Despite this consensus on some skills, faculty members across subject domains differed markedly in their perceptions of the importance of other skills. For example, “knowing the rules of formal logic” was rated highly important for computer science but not for other disciplines (p. 678).
Tuning USA is one of the efforts that considers critical thinking in a domain-specific context. Tuning USA is a faculty-driven process that aims to align goals and define competencies at each degree level (i.e., associate's, bachelor's, and master's) within a discipline (Institute for Evidence-Based Change, 2010). For Tuning USA, there are goals to foster critical thinking within certain disciplinary domains, such as engineering and history. For example, for engineering students who work on design, critical thinking suggests that they develop “an appreciation of the uncertainties involved, and the use of engineering judgment” (p. 97) and that they understand “consideration of risk assessment, societal and environmental impact, standards, codes, regulations, safety, security, sustainability, constructability, and operability” at various stages of the design process (p. 97).
In addition, there is insufficient empirical evidence showing that, as a generic skill, critical thinking is distinguishable from other general cognitive abilities measured by validated assessments such as the SAT and GRE tests (see Kuncel, 2011). Kuncel, therefore, argued that instead of being treated as a generic skill, critical thinking is more appropriately studied as a domain-specific construct. This view may be correct, or at least plausible, but it likewise requires empirical evidence demonstrating that critical thinking is a domain-specific skill. It is true that examples of critical thinking offered by members of the nursing profession may be very different from those cited by engineers, but content knowledge plays a significant role in this distinction. Would it be reasonable to assume that skillful critical thinkers can be successful when they transfer from one profession to another, given sufficient content training? Whether and how content knowledge can be disentangled from higher order critical thinking skills, as well as from other cognitive and affective faculties, awaits further investigation.
Despite the debate over the nature of critical thinking, most existing critical thinking assessments treat this skill as generic. Apart from the theoretical reasons, it is much more costly and labor-intensive to design, develop, and score a critical thinking assessment for each major field of study. If assessments are designed only for popular domains with large numbers of students, students in less popular majors are deprived of the opportunity to demonstrate their critical thinking skills. From a score user perspective, because of the interdisciplinary nature of many jobs in the 21st century workforce, many employers value generic skills that can be transferable from one domain to another (AAC&U, 2011; Chronicle of Higher Education, 2012; Hart Research Associates, 2013), which makes an assessment of critical thinking in a particular domain less attractive.
Total Versus Subscale Scores
Another challenge related to critical thinking assessment is whether to offer subscale scores. Given the multidimensional nature of the critical thinking construct, it is a natural tendency for assessment developers to consider subscale scores for critical thinking. Subscale scores have the advantage of offering detailed information about test takers' performance on each of the subscales and also have the potential to provide diagnostic information for teachers or instructors if the scores are going to be used for formative purposes (Sinharay, Puhan, & Haberman, 2011). However, one should not lose sight of the psychometric requirements when offering subscale scores. Evidence is needed to demonstrate that there is a real and reliable distinction among the subscales. Previous research reveals that for some of the existing critical thinking assessments, there is a lack of support for the factor structure on which the reported subscale scores are based (e.g., CCTDI; Kakai, 2003; Walsh & Hardy, 1997; Walsh et al., 2007).
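The requirement that subscales be reliably distinct can be illustrated with the classical-test-theory formula for the reliability of a difference score: when two subscales are highly correlated, the difference between them is mostly measurement error. The sketch below is a minimal illustration; the reliability and correlation values are invented, not taken from any actual critical thinking assessment.

```python
# Reliability of a difference score under classical test theory,
# assuming equal subscale variances. Illustrative values only.

def difference_score_reliability(rho_x: float, rho_y: float, r_xy: float) -> float:
    """Reliability of the difference X - Y, given the reliabilities of
    subscales X and Y and the correlation between them."""
    return ((rho_x + rho_y) / 2 - r_xy) / (1 - r_xy)

# Two subscales that are each individually reliable (0.80)
# but correlate 0.70 with each other:
rel_diff = difference_score_reliability(0.80, 0.80, 0.70)
print(f"Reliability of the difference score: {rel_diff:.3f}")
# The difference score is far less reliable (about 0.33) than either
# subscale, so reporting the two separately adds little trustworthy
# diagnostic information about a test taker's relative strengths.
```

This is one reason evidence for a distinct factor structure matters: if subscales correlate too highly, profile differences among them cannot be interpreted with confidence.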