Beyond nutrient-based food indices: a data mining approach to search for a quantitative holistic index reflecting the degree of food processing and including physicochemical properties

Anthony Fardet; Sanaé Lakhssassi; Aurélien Briffaz

doi:10.1039/C7FO01423F


DOI: 10.1039/C7FO01423F (Paper) Food Funct., 2018, 9, 561-572

Anthony Fardet *^a, Sanaé Lakhssassi ^a and Aurélien Briffaz ^b
^aUniversité Clermont Auvergne, INRA, UNH, Unité de Nutrition Humaine, CRNH Auvergne, F-63000 Clermont-Ferrand, France. E-mail: anthony.fardet@clermont.inra.fr; Fax: +33(0)4 73 62 47 55; Tel: +33(0)473 62 47 04
^bCIRAD, UMR Qualisud, TA B-95/16, 73 rue J-F. Breton, F- 34398 Montpellier Cedex 5, France

Received 13th September 2017 , Accepted 18th December 2017

First published on 18th December 2017

Abstract

Processing has major impacts on both the structure and composition of food and hence on nutritional value. In particular, high consumption of ultra-processed foods (UPFs) is associated with increased risks of obesity and diabetes. Unfortunately, existing food indices only focus on food nutritional content while failing to consider either food structure or the degree of processing. The objectives of this study were thus to link non-nutrient food characteristics (texture, water activity (a_w), glycemic and satiety (fullness factor, FF) potentials, and shelf life) to the degree of processing; search for associations between these characteristics and nutritional composition; search for a holistic quantitative technological index; and determine quantitative rules for a food to be defined as UPF using data mining. Among the 280 foods most widely consumed by the elderly in France, 139 solid/semi-solid foods were selected for textural and a_w measurements, and classified according to three degrees of processing. Our results showed that minimally-processed foods were less hyperglycemic, more satiating, had a better nutrient profile, higher a_w, shorter shelf life, lower maximum stress, and higher energy at break than UPFs. Based on 72 food variables, multivariate analyses differentiated foods according to their degree of processing. Then technological indices including food nutritional composition, a_w, FF and textural parameters were tested against technological groups. Finally, a LIM score (nutrients to limit) ≥8 per 100 kcal and a number of ingredients/additives >4 are relevant, but not sufficient, rules to define UPFs. We therefore suggest that food health potential should be first defined by its degree of processing.

Introduction

Up to now, food health potential has mainly been evaluated by scientists based on indices that indicate which nutrients to encourage and which to limit, e.g., the Nutrient-Rich Food Index¹ or the NDS (Nutrient Density Score) and LIM (LIMited nutrient score) indices.² However, two food matrices with identical nutritional composition but different structures may not have the same health potential due to differences in nutrient bioavailability and satiety.³ In other words, despite identical composition, one calorie of a given food is not necessarily the equivalent of one calorie of another food. For example, foods may have a different glycemic index due to different degrees of starch gelatinization, food density, food matrix disintegration or particle size,⁴ as has been clearly demonstrated for apple,⁵ bread,⁶ wheat grain,⁷ and plantain.⁸ Therefore, although nutritional composition tables may be useful in providing an overview of which macro-nutrients and micronutrients the food supplies to the organism, they fail to mention nutrient bioavailability or satiety. Yet the scientific literature clearly shows that this is a fundamental aspect of food health potential.^4,9

In practice, food health potential needs to be defined by both the structure of the food (qualitative aspect) and nutrient composition (quantitative aspect).⁴ The problem today is that very few data are available on the structure of foods (i.e., density, hardness, thickness, porosity, water activity (a_w), water holding capacity, viscosity, etc.), and no table of structural food characteristics exists. What is more, only a few studies have linked these characteristics with health potential in animals¹⁰ and humans.^4,11 Given the increasing consumption of ultra-processed foods, there is therefore an urgent need to collect physical and physical-chemical food characteristics, and to link them with health effects in humans.

Recently, Brazilian epidemiologists published a new classification of foods based on their degree of processing (international NOVA classification),^12,13 and showed that populations consuming the most ultra-processed foods (UPF) present the highest prevalence of metabolic deregulations such as obesity,¹⁴ lipid dysregulation¹⁵ and metabolic syndrome,¹⁶ together with the worst nutrient profiles.^17,18 Such metabolic deregulations may then constitute the first stages of more serious diseases such as type 2 diabetes, cardiovascular diseases and some cancers.¹⁹ Monteiro et al. defined UPF as “formulations mostly of cheap industrial sources of dietary energy and nutrients plus additives, using a series of processes (hence ‘ultra-processed’). Altogether, they are energy-dense, high in unhealthy types of fat, refined starches, free sugars and salt, and poor sources of protein, dietary fiber and micronutrients. UPF are made to be hyper-palatable and attractive, with long shelf-life, and able to be consumed anywhere, any time. Their formulation, presentation and marketing often promote overconsumption”.²⁰ In the end, ultra-processing therefore negatively impacts both food structure and nutrient composition, leading to unstructured, fractionated and recombined energy-dense and micronutrient-poor foods.

In agreement with the Brazilian studies, it has been shown that the more processed the food, the less satiating it is, the higher its glycemic index and the worse its nutrient profile.^21,22 In a recent study, we showed that the elderly French population tends to consume more minimally-processed and processed foods than ultra-processed foods.²¹ This is probably because they acquired the habit of eating whole foods and home-prepared meals in childhood and have kept it throughout adulthood, and because retired people have more time to cook fresh foods.

The objectives of this study were therefore to: (i) link textural parameters, water activity, and other non-nutrient food characteristics with the degree of processing; (ii) search for associations between these non-nutrient characteristics and nutritional composition; (iii) search for a holistic technological food index including both food structure parameters and nutrient composition in relation to the degree of processing; and (iv) go beyond the qualitative international NOVA classification to determine quantitative rules for a food to be defined as ultra-processed.

Materials and methods

Data selection

We used already available data on participants more than 65 years old in the French NutriNet-Santé study.^21,23 The ongoing Internet-based NutriNet-Santé French cohort was launched in 2009. A dedicated secure Web site (http://www.etude-nutrinet-sante.fr) is used for subject enrollment, data collection and follow-up. Adults aged 18 years or older with Internet access are eligible for recruitment, which mostly relies on recurrent multimedia campaigns. Each enrolled subject gives his/her informed consent and provides an electronic signature. The NutriNet-Santé cohort received ethical approval (IRB INSERM no. 0000388FWA00005831 and CNIL 908450 and 909216).

Each subject completes a set of five baseline questionnaires, providing information about sociodemographic and lifestyle characteristics (health/risk behaviors), physical activity, anthropometrics, health status, and diet. For the present analysis, we made the following selection: volunteers aged 65 years or older, residing in metropolitan France, who had filled in and submitted at least three 24 h dietary records in the first two years after inclusion in the cohort.

Food selection

As reported in our previous study, the 280 generic foods most widely consumed by this population (>5%, i.e., foods consumed by at least 5% of the population) were selected.²¹ Among them, some semi-solid and all liquid foods were removed and 139 solid foods were selected for textural and a_w measurements. Both compression and shear measurements could only be conducted on 117 foods, as some foods were too small (e.g., oat flakes, quinoa) or too liquid (e.g., yoghurt) for such measurements. The percentage consumption of these 117 foods (i.e., the proportion of individuals in our sample who consumed the food), the average daily quantity consumed (g day⁻¹), their nutrient composition (n = 55 including energy, carbohydrates, fiber, minerals and trace elements, vitamins, water, etc.: see ESI Table 1†), glycemic (glycemic index, GI and glycemic glucose equivalents, GGE) and satiety (fullness factor, FF) potentials, nutrient density score (NDS) and limited nutrient score (LIM) indices, shelf life, degree of processing (according to the international NOVA classification), together with data obtained in the present study, are presented in ESI Table 1.† Except for the specific measurements carried out in this study, the methods used to collect other food data, i.e., NOVA technological groups, NDS, LIM, GI/GGE, and FF indices, are described in detail in our previous paper.²¹ Among the 139 selected foods, 73, 34 and 32 were classified in NOVA technological groups G1 (minimally-processed), G2 (processed) and G3 (ultra-processed), respectively (corresponding to 59, 26 and 32 foods in ESI Table 1,† because some foods were not analyzed for textural and a_w measurements), as reported previously.²¹

Textural analyses

Compression and shear measurements were chosen as representative of the action of molars (crushing) and incisors (shearing) during chewing.

Food preparation. To ensure the same treatment of all the samples, for both types of rheological measurements (compression and shear), the (cooked) food was cut into parallelepipeds with standard dimensions (width: 1.0 cm; thickness: 2.0 cm) but the length was adjusted to the shape of the food. Measurements were made on ready-to-eat foods just prior to mouthing. Foods such as cooked meats, fish and vegetables and pasta were prepared on site using the optimum cooking time for each product.

Measurements. Compressive and shear forces were measured using an Instron-type universal testing machine (Elancourt, France). Real time data acquisition was performed using Bluehill 3 software (Elancourt, France).

Compression measurements. This test consists of measuring the compression force (N) required to crush a sample. The cell used for the compression test is made of a fixed lower part with a groove in the top in which the sample is placed, and a movable upper part connected directly to a force sensor. The moving part descends vertically until the sample is crushed (50 mm per minute). The force is recorded at two deformation levels (20% and 80%), together with the maximum force.

A minimum of 10 compression measurements were made for each food matrix, and more if the food had a heterogeneous structure, such as shallots, chorizo, meat and sausages (n = 15), Swedish crispbread (n = 12), and oranges and clementines (n = 20). The compression test made it possible to calculate three parameters: (1) the maximum stress (MS, N cm⁻²), i.e., the maximum force divided by the cross-section of the sample; (2) the stress at 20% deformation (N cm⁻²), i.e., the force at 20% deformation of the sample divided by its cross-section; and (3) the stress at 80% deformation (N cm⁻²), i.e., the force at 80% deformation of the sample divided by its cross-section.
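Each of the three compression parameters is a force divided by the sample cross-section. The following is a minimal sketch, assuming the loaded section is the 1.0 cm × 2.0 cm face of the standard parallelepiped (the text does not state which face bears the load) and using made-up force readings rather than measured data. The function name `stress` is hypothetical.

```python
def stress(force_n: float, section_cm2: float) -> float:
    """Return stress in N cm^-2: force divided by the loaded cross-section."""
    if section_cm2 <= 0:
        raise ValueError("section must be positive")
    return force_n / section_cm2


# Assumed section: width 1.0 cm x thickness 2.0 cm (illustrative choice only).
SECTION_CM2 = 1.0 * 2.0

# Hypothetical force readings (N) for one sample, not measured data.
readings = {"at_20pct": 9.0, "at_80pct": 24.4, "maximum": 51.4}

# One stress value per reading: 20% deformation, 80% deformation, maximum.
stresses = {name: stress(force, SECTION_CM2) for name, force in readings.items()}
print(stresses)
```

In practice the section would be measured per sample, since only width and thickness were standardized and the length was adjusted to the shape of the food.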

Shear measurements. The shear test consists of measuring the force required to shear a sample, i.e., to cut it in half. Shear forces were measured using a Warner-Bratzler type cell, made up of a fixed plate with a slot in it at the bottom and a movable shearing blade at the top. The sample was placed on the plate perpendicular to the slot and the movable blade descended vertically to shear the sample (50 mm per minute). Ten shear measurements were performed per sample except for mussels, chorizo, meat dough and sausages, for which 15 measurements were taken. The parameters measured by the shear test were: (1) the maximum strength (N), i.e., the maximum value of the force during the shear test; (2) the shear or stress resistance (N cm⁻²), i.e., the maximum force divided by the cross-section of the sample; (3) the energy at break (EB, J), i.e., the area under the force–distance curve until the sample breaks; (4) the energy at the maximum strength (J), i.e., the area under the force–distance curve up to the displacement at maximum force; and (5) the movement at maximum strength (mm), i.e., the displacement of the tool up to the maximum force.
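The two energy parameters are areas under the force–distance curve, which can be approximated numerically. The sketch below uses the trapezoidal rule on a made-up curve; the function name `energy_joules`, the sample points, and the choice to treat the end of the recorded curve as the break point are all assumptions for illustration, not the procedure implemented in the acquisition software.

```python
def energy_joules(displacement_mm, force_n):
    """Area under the force-distance curve (trapezoidal rule), in joules."""
    if len(displacement_mm) != len(force_n) or len(force_n) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    area_n_mm = 0.0
    for i in range(1, len(force_n)):
        step = displacement_mm[i] - displacement_mm[i - 1]
        area_n_mm += 0.5 * (force_n[i] + force_n[i - 1]) * step
    return area_n_mm / 1000.0  # 1 N mm = 0.001 J


# Hypothetical shear curve: force rises to a peak, then the sample breaks.
displacement = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]  # mm
force = [0.0, 3.0, 6.0, 9.0, 5.0, 0.0]          # N

peak = force.index(max(force))                   # index of the maximum force
maximum_strength = force[peak]                   # N
movement_at_max_strength = displacement[peak]    # mm
energy_at_max_strength = energy_joules(displacement[: peak + 1], force[: peak + 1])
energy_at_break = energy_joules(displacement, force)
```

With these illustrative points the curve encloses 46 N mm in total and 27 N mm up to the peak, i.e., 0.046 J and 0.027 J.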

Because some foods were too small and/or semi-solid, e.g., oat flakes, yogurts, sweet maize, chocolate mousse, cooked quinoa and brown rice, shear measurements were carried out on only 117 foods.

Water activity

The a_w measurements were carried out at 20 °C using a LabStart-a_w Water Activity Meter (Aw Sprint TH-500, Novasina, Switzerland) designed to determine the a_w at adjustable constant temperature. The temperature in the measuring chamber can be set between 0 and 50 °C, with an accuracy of ±0.2 °C. The a_w measuring range was between 0.05 and 1.00. The reproducibility of the a_w measurements was ±0.005. Water activity was measured with an accuracy of ±0.01.

For each product, the samples were fragmented and placed in a dry plastic cup filled to two-thirds, and the cup was placed in the a_w measuring chamber of the apparatus at a temperature of 20 °C. Each a_w measurement was replicated three times. Measurement time depended on the characteristics of the sample. The a_w-meter was connected to NOVOLOG software (Novasina), which makes it possible to visualize and save the a_w curve in real time. When equilibrium was reached between the product and the relative humidity of the measuring chamber air, the a_w was considered stable and equal to that of the sample placed in the cup.

Other food data

The number of ingredients and/or additives used in the manufacture of foods, as well as their shelf life, may be linked to their degree of processing. Therefore, for each food, the list of ingredients and/or the number of additives (N_i/a) were recorded directly from the packages of the purchased products, and the shelf life of the different foods was determined from the StillTasty database (http://www.stilltasty.com/).

A priori and supervised quantitative technological index models

The technological index (TI) model included food nutrient indices, composition and physical-chemical/textural characteristics, according to the following general equation:

TI = f(LIM, NDS, FF, a_w, textural measurements).

The number of ingredients, shelf life and the glycemic index (GI) were not included in the equation because the NOVA classification is firstly based on the number of ingredients, and shelf life and GI were not available for all selected foods.

In our model, TI increases with the degree of processing. Therefore, as LIM was generally considered as increasing with the degree of processing, and NDS, FF and a_w as decreasing with it, TI becomes:

TI = (LIM/NDS) × (1/FF) × (1/a_w) × f(textural measurements).

Having no prior knowledge of how the textural measurements (n = 8) change with the degree of processing, we selected the textural measurements for the TI model as follows: first, those that best discriminated the NOVA technological groups were retained; then, if some of the retained measurements were significantly correlated with each other, only one was finally selected.

Data mining: statistical and machine learning analyses

After controlling for normality (Lilliefors and Kolmogorov–Smirnov tests), median NDS, LIM, GI, GGE, FF, a_w, textural parameters and TI values of each NOVA technological group (G1, G2 and G3) were compared using the Kruskal–Wallis test for non-parametric data (comparison of ranks) followed by a post hoc Dunn's test for multiple comparisons. Median TI values of UPFs and non-UPFs (G1 and G2 values were clustered in one group) were then compared using the Wilcoxon–Mann–Whitney test for non-parametric data.
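To make the rank-based comparison concrete, the sketch below computes the Kruskal–Wallis H statistic for three groups in plain Python. It is an illustration under stated assumptions, not the SPAD implementation: the helper names are hypothetical, the data are made up, no tie correction is applied, the post hoc Dunn's test is omitted, and the P value relies on the chi-square approximation, whose survival function for 2 degrees of freedom is exp(−H/2).

```python
import math


def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_3(g1, g2, g3):
    """H statistic (no tie correction) and approximate P for three groups."""
    groups = [g1, g2, g3]
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    p = math.exp(-h / 2)  # chi-square survival function, 2 degrees of freedom
    return h, p


# Hypothetical a_w values for four foods per NOVA group (not measured data).
g1_aw = [0.99, 0.98, 0.97, 0.96]
g2_aw = [0.95, 0.94, 0.93, 0.92]
g3_aw = [0.91, 0.90, 0.89, 0.88]

h, p = kruskal_wallis_3(g1_aw, g2_aw, g3_aw)
print(f"H = {h:.3f}, P = {p:.4f}")
```

With such small groups the chi-square approximation is rough; an exact or permutation P value would be preferable in a real analysis.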

Multivariate analyses (Principal Component Analysis, PCA, and Hierarchical Cluster Analysis, HCA) were performed on the 117 foods having a measured value for the 72 variables (consumption profile, n = 2; nutritional composition, n = 55; technological groups (n = 3; G1–G3); FF; number of ingredients; number of additives; a_w; compression measurements, n = 3; shear measurements, n = 5; NDS and LIM indices), giving a “117 × 72” matrix. Then PCA was applied to foods having GI and GGE values, i.e., a “36 × 74” matrix. Due to the absence of data for several foods, shelf life was excluded from PCA analysis.

Decision trees and Bayesian networks were applied to the “117 × 71” matrix to define rules for foods belonging to UPF (G3). Groups G1 and G2 were clustered as non-UPF to compare them with UPF. The variables “number of ingredients” and “number of additives” were removed from the analyses since they are already used to allocate foods to the NOVA technological groups. Several machine learning algorithms were tested to find the best model for food classification for both non-UPF and UPF: Chi-square automatic interaction detector (CHAID), classification and regression trees (CART) and C4.5 algorithms for decision tree analyses; and naive, tree and forest algorithms for Bayesian network analyses. For Bayesian network algorithms, because we are dealing with a supervised learning problem, we used a discretization method that accounts for the variable to be predicted. In this context, we used the Minimum Description Length Principle Cut (MDLPC) method,²⁴ which is the best known in machine learning. For each algorithm, 82% of the foods were used for the learning sample (including a pruning sample), and the remaining 18% for the test sample (model validation). Algorithm efficiency was evaluated through “recall”, “precision” and “overfitting” parameters.
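Recall and precision for the UPF class follow the usual definitions: recall is the share of true UPF that the model finds, and precision is the share of predicted UPF that really are UPF. The sketch below computes both on made-up labels for a held-out sample; the function name, the labels and the rounding used to size the test sample are assumptions, and the “overfitting” parameter is not reproduced because its definition is not given in the text.

```python
def precision_recall(actual, predicted, positive="UPF"):
    """Precision and recall for the `positive` class from paired label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# 82% / 18% split of the 117 foods (rounding rule assumed for illustration).
n_foods = 117
n_test = round(0.18 * n_foods)   # foods held out for validation
n_learn = n_foods - n_test       # foods for learning and pruning

# Hypothetical test sample: 6 UPF and 15 non-UPF, with one miss and one
# false alarm. These labels are invented, not the study's confusion matrix.
actual = ["UPF"] * 6 + ["non-UPF"] * 15
predicted = ["UPF"] * 5 + ["non-UPF"] + ["UPF"] + ["non-UPF"] * 14

precision, recall = precision_recall(actual, predicted)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```

A high recall with lower precision means that few UPF are missed but some non-UPF are wrongly flagged as UPF.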

All statistical and data mining analyses were performed using Coheris Analytics SPAD9.0 software (Coheris, Suresnes, France). For all tests, a P value <0.05 indicated a significant effect.

Results

The NOVA technological groups, consumption profile, nutritional composition, functional and nutritional properties, and textural characteristics of the 117 foods are listed in ESI Table 1.†

Effect of food processing on food properties

Most of the parameters measured significantly discriminated G1 (minimally-processed foods) from G2 (processed foods) and G3 (ultra-processed foods), but the distinction between G2 and G3 groups was less clear, except for N_i/a and movement at maximum strength (Table 1). Minimally-processed foods (G1 group) had significantly higher a_w (+5 and +8%, respectively), nutrient density score (NDS) (+53 and +68%, respectively), fullness factor (FF) (+36 and +42%, respectively), energy at break (+40 and +39%, respectively), and energy at the maximum strength (+46 and +36%, respectively), and significantly shorter shelf life (−775 and −43%, respectively) and lower N_i/a (−75 and −100%, respectively), maximum stress (−106 and −146%, respectively), limited nutrient (LIM) score (−585 and −628%, respectively), glycemic index (GI) (−46 and −33%, respectively) and glycemic glucose equivalents (GGE) (−271 and −414%, respectively) than processed foods and UPF (G2 and G3 groups, respectively). For other textural parameters, there was no significant difference, although G1 foods tended to exhibit lower stress at 80% deformation than G3 foods (food group effect, P = 0.095). Finally, only N_i/a and movement at maximum strength significantly distinguished processed foods (G2 group) from ultra-processed foods (G3 group).

Table 1 Influence of processing on median food properties^a

Food property	G1 minimally processed	G2 processed	G3 ultra-processed	P^b	Number of foods
a_w	0.991*	0.942**	0.912**	1.4 × 10⁻¹⁵	139
Shelf-life (in days)	4*	35**	7**	0.0003	110
N_i/a	1*	4**	11***	9.9 × 10⁻²⁶	136
Maximum stress (N cm⁻²)	25.7*	52.9^,*	63.2**	0.032	139
Stress at 20% deformation (N cm⁻²)	4.5*	5.1*	2.2*	0.370	139
Stress at 80% deformation (N cm⁻²)	12.2*	21.9*	20.5*	0.095	138
Maximum strength (N)	8.7*	11.7*	7.9*	0.194	120
Energy at break (J)	0.057*	0.034**	0.035**	0.009	120
Shear/stress resistance (N cm⁻²)	8.4*	13.2*	7.7*	0.231	120
Energy at the maximum strength (J)	0.028*	0.015^,*	0.018**	0.024	120
Movement at maximum strength (mm)	7^,*	7**	10*	0.023	120
NDS	9.6*	4.5**	3.1**	7.0 × 10⁻¹¹	139
LIM score	0.4*	23.8**	25.5**	2.1 × 10⁻¹¹	139
FF	3.3*	2.1**	1.9**	1.0 × 10⁻¹⁴	139
GI	48*	70**	64**	0.001	45
GGE (g per 100 g)	7*	26**	36**	0.000	45
a_w, water activity; FF, fullness factor; G1–G3, NOVA technological groups (see Methods); NDS, nutrient density score; N_i/a: number of ingredients and/or additives; LIM, mean percentage of the maximal recommended values for 3 disqualifying nutrients; GGE, glycemic glucose equivalent; GI, glycemic index.
^a Values are medians, calculated from ESI Table 1.
^b P-Values from Kruskal–Wallis's test for non-parametric data followed by a post hoc Dunn's test for means multiple comparison. Medians with different superscripts in the same row are significantly different (P < 0.05).
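Several of the percentage differences quoted in the text can be checked against the medians in Table 1. The sketch below assumes that each difference is expressed relative to the G1 median, i.e., (G1 − Gx)/G1 × 100. This convention is an inference that reproduces the figures reported for a_w, NDS and FF; it is not a formula stated in the text, and it may not apply to every row. The function name is hypothetical.

```python
def pct_diff_vs_g1(g1: float, gx: float) -> float:
    """Difference between G1 and another group, as a percentage of the G1 median."""
    return (g1 - gx) / g1 * 100.0


# Medians (G1, G2, G3) copied from Table 1.
medians = {
    "a_w": (0.991, 0.942, 0.912),
    "NDS": (9.6, 4.5, 3.1),
    "FF": (3.3, 2.1, 1.9),
}

# Rounded percentage differences of G1 versus G2 and versus G3.
differences = {
    name: (round(pct_diff_vs_g1(g1, g2)), round(pct_diff_vs_g1(g1, g3)))
    for name, (g1, g2, g3) in medians.items()
}
print(differences)
```

For these three rows the sketch gives 5 and 8 for a_w, 53 and 68 for NDS, and 36 and 42 for FF.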

Associations between the degree of food processing, nutrient and consumption profiles, matrix properties, and glycemic and satiety impacts

Due to the lack of data on the glycemic impact of some foods, PCAs were performed both without (Fig. 1A and B) and with glycemic indices (ESI Fig. 1A and B†).


	Fig. 1 A–B: Principal component analysis loading (A) and score (B) plots derived from the “117 (food items) × 72 (food variables)” matrix (the PC1 × PC2 plane represents 34% of total variance). The 72 active variables are shown on the loading plot.

PCA without glycemic indices (n = 117 foods). The PC1 × PC2 plane of the PCA represented almost 34% of food variability of the 72 variables considered (Fig. 1B). On the loading plot (Fig. 1A) two orthogonal arrows indicate absence of correlation (correlation coefficients close to 0), two opposite arrows indicate negative correlations (correlation coefficients close to −1), and two similar arrows indicate positive correlations. All correlation coefficients between variables and their significance (P < 0.05) are listed in ESI Table 2.†
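The reading rule for the loading plot can be stated numerically: the cosine of the angle between two variable arrows approximates their correlation. The sketch below uses hypothetical two-dimensional loadings; the approximation is only as good as the share of variance captured by the plotted plane, which is limited here.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two 2-D loading vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    norm = math.hypot(u[0], u[1]) * math.hypot(v[0], v[1])
    if norm == 0:
        raise ValueError("zero-length vector has no direction")
    return dot / norm


# Hypothetical arrows on the PC1 x PC2 plane (not values from Fig. 1A).
similar = cosine((1.0, 0.0), (2.0, 0.0))      # same direction
orthogonal = cosine((1.0, 0.0), (0.0, 3.0))   # right angle
opposite = cosine((1.0, 0.0), (-0.5, 0.0))    # opposite direction

print(similar, orthogonal, opposite)
```

The three cases return 1.0, 0.0 and −1.0, matching positive correlation, absence of correlation and negative correlation, respectively.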

The degree of processing (NOVA variable) was clearly significantly and negatively correlated with a_w (R = −0.42), water content (R = −0.70), FF (R = −0.65), energy at break (J) (R = −0.29) and consumption profiles (R = −0.38 for consumption in % population and −0.30 for consumption in g day⁻¹), and significantly and positively correlated with LIM score (R = +0.63), N_i/a (R = +0.81 for N_i and +0.62 for N_a), and maximum stress (N cm⁻²) (R = +0.25). Concerning textural parameters, maximum stress was significantly and positively correlated with stress at 20% and 80% deformation, and maximum strength was significantly and positively correlated with shear/stress resistance, energy at break and energy at maximum strength. Lastly, maximum stress was significantly and positively correlated with maximum strength and shear/stress resistance, and negatively with movement at maximum strength.

Concerning correlations between nutrient composition variables and other parameters, maximum stress was significantly and positively correlated with total carbohydrates, plant proteins, plant lipids and fiber. The same was observed for shear/stress resistance and maximum strength parameters, but to a lesser extent.

PCA was also performed without the NOVA groups to test the importance of this variable on correlations of other food variables (results not shown). Results produced approximately the same plots, indicating that including the NOVA variable did not bias other correlations.

The score plot shows the 117 foods tested (Fig. 1B). Foods can be separated according to two clear axes representing the degree of processing (from minimally-processed foods to UPF) and the origin of the food groups (plant-based versus animal-based foods). Foods in the lower right-hand side of the plot are more processed than those in the upper left-hand side. According to these two axes, animal-based foods are clearly distinguished from minimally-processed plant-based foods, starchy foods, i.e., processed cereal-based products (pastries and bakery products), and confectionary.

PCA with glycemic indices (n = 36 foods). The PC1 × PC2 plane of PCA represented almost 48% of food variability for the 74 variables considered (ESI Fig. 1B†). Clearly the more processed the foods, the higher their GI, GGE, N_i/a, LIM score and added sugars. Otherwise NOVA groups and consumption profiles (% and g day⁻¹) correlate poorly with textural parameters. Textural parameters are all positively correlated with each other, except for movement at maximum strength. Fiber, plant proteins and plant lipids are also positively and significantly correlated with main textural parameters.

Concerning ready-to-eat foods, minimally-processed plant-based foods all clustered on the right-hand side of the plot while more processed foods clustered on the left-hand side.

HCA (n = 117 foods). Hierarchical cluster analysis (HCA) clusters foods with similar profiles regarding the 72 variables measured. The first separator clearly clustered foods according to their degree of processing, and distinguished unprocessed/minimally-processed and processed foods (Class 1: G1 and G2) from processed foods/UPF (Class 2: G2 and G3) (Fig. 2). Then, in Class 1, foods clustered in two other classes: animal-based and bakery foods, and plant-based foods. In Class 2, foods also clustered in two other classes: pastries, confectionary and bakery products, and processed animal-based foods.


	Fig. 2 Plot of the hierarchical classification derived from the “117 (food items) × 73 (food variables)” matrix. The higher the dissimilarity, the more the nutritional profile of the ready-to-eat foods differs.

More specifically, crunchy and crispy bakery foods, i.e., whole-meal toast, puffed cereal patties, wasa-type bread, croutons, Swedish crispbread, whole-meal rusk, crackers, craquotte-type rusks and breadsticks, clustered together. The same is true for confectionary, biscuits, and pastries; for white, black, whole-meal, soft and Viennese breads; and for cheeses and processed meat. The proximity of raw carrots, black radish and asparagus with red and white meat-based products (upper left-hand side of HCA) was unexpected.

Development of a quantitative and holistic technological index: an a priori approach

Maximum stress (MS, N cm⁻²), energy at break (EB, J) and energy at the maximum strength (J) were the most discriminating textural measurements for G1, G2 and G3 groups (Table 1). Since EB and energy at the maximum strength were significantly correlated (ESI Table 2†), only EB was selected because it is the most closely related to the action of the incisors. Because MS tends to increase and EB to decrease with the degree of processing (Table 1), the TI model becomes:

TI = (LIM/NDS) × (1/FF) × (1/a_w) × (MS/EB).
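The index can be evaluated directly from the six quantities in the equation. The sketch below is a minimal illustration with made-up food values; the function name `technological_index` and both example foods are hypothetical. Note that the medians in Table 2 are medians of per-food TI values, so they cannot be recovered by plugging the medians of Table 1 into this function.

```python
def technological_index(lim, nds, ff, a_w, ms, eb):
    """TI = (LIM/NDS) x (1/FF) x (1/a_w) x (MS/EB).

    LIM and MS sit in the numerator; NDS, FF, a_w and EB sit in the
    denominator, so TI rises with the degree of processing.
    """
    for name, value in (("NDS", nds), ("FF", ff), ("a_w", a_w), ("EB", eb)):
        if value <= 0:
            raise ValueError(f"{name} must be positive to compute TI")
    return (lim / nds) * (1.0 / ff) * (1.0 / a_w) * (ms / eb)


# Hypothetical foods (illustrative values only, not from ESI Table 1).
ti_less_processed = technological_index(
    lim=1.0, nds=10.0, ff=2.5, a_w=0.8, ms=20.0, eb=0.05
)
ti_more_processed = technological_index(
    lim=24.0, nds=4.0, ff=2.0, a_w=0.5, ms=60.0, eb=0.03
)

print(round(ti_less_processed, 2), round(ti_more_processed, 2))
```

Because the index is a product of ratios, it spans several orders of magnitude, which is one reason rank-based tests suit its comparison across groups.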

Next, several combinations of these variables were tested by removing parameters one by one to search for the TI that best distinguished the three technological groups (G1–G3) and UPF from non-UPF (G1 and G2 clustered in one group). TIs were ranked from the highest to the lowest level of differentiation for both comparisons (Table 2). As expected, N_i/a clearly distinguished non-UPF from UPF, since it is the basis for defining the G1–G3 groups (results not shown). Consequently, N_i/a was not included in the other models. Although the values of all TI models increased with the degree of processing (except LIM/a_w), significant differences were only found between G1 and G2–G3 (Table 2). All the models included the LIM score, and the most discriminating TI was [LIM/NDS × (1/EB)]. Adding MS, FF or a_w to [LIM/NDS × (1/EB)], or removing EB from it, did not substantially change the degree of differentiation.

Table 2 Relations between NOVA technological groups and the median values of technological indices^a

Technological indices (TI)	G1 minimally processed	G2 processed	G3 ultra-processed	P^b	Non-UPF G1 + G2	P^c
(LIM/NDS) × (1/EB)	0.45*	126.38**	164.60**	4.4 × 10⁻¹⁸	2.30	1.9 × 10⁻⁹
(LIM/NDS) × (MS/EB)	17.28*	4207.16**	7927.89**	7.7 × 10⁻¹⁸	79.43	4.2 × 10⁻⁹
(LIM/NDS) × (1/FF) × (MS/EB)	3.65*	1949.43**	3854.28**	7.8 × 10⁻¹⁸	25.91	4.3 × 10⁻⁹
LIM/NDS	0.03*	4.19**	7.50**	8.2 × 10⁻¹⁸	0.11	1.5 × 10⁻⁹
(LIM/NDS) × (1/a_w)	0.03*	7.57**	8.40**	8.5 × 10⁻¹⁸	0.11	4.9 × 10⁻⁹
(LIM/NDS) × (1/(FF × a_w))	0.01*	3.57**	4.29**	9.4 × 10⁻¹⁸	0.04	7.0 × 10⁻⁹
(LIM/NDS) × (1/(FF × a_w)) × (MS/EB)	3.78*	2059.97**	5998.75**	1.0 × 10⁻¹⁷	26.01	8.4 × 10⁻⁹
(LIM/NDS) × (1/FF)	0.01*	2.87**	4.10**	1.1 × 10⁻¹⁷	0.04	3.0 × 10⁻⁹
(LIM/NDS) × (1/a_w) × (MS/EB)	17.39*	4776.87**	11624.52**	1.1 × 10⁻¹⁷	79.74	9.8 × 10⁻⁹
LIM/a_w	0.33*	27.41**	27.13**	1.7 × 10⁻¹⁷	2.32	8.1 × 10⁻⁸
LIM × (MS/EB)	223.61*	22142.93**	28142.30**	2.4 × 10⁻¹⁷	1028.12	2.8 × 10⁻⁸
(LIM/NDS) × (1/(FF × a_w) × MS)	0.22*	176.33**	325.57**	6.5 × 10⁻¹⁶	1.45	1.0 × 10⁻⁷
(LIM/NDS) × MS	0.94*	274.69**	546.24**	2.3 × 10⁻¹⁵	3.91	8.4 × 10⁻⁸
(LIM/NDS) × (1/a_w) × MS	0.94*	408.90**	638.23**	2.3 × 10⁻¹⁵	3.96	2.2 × 10⁻⁷
LIM × (1/a_w) × MS	17.53*	2176.02**	1969.38**	9.7 × 10⁻¹⁴	40.81	2.9 × 10⁻⁶
a_w, water activity; EB, energy at break (J); FF, fullness factor; G1–G3, NOVA technological groups (see Methods); NDS, nutrient density score; MS, maximum stress (N cm⁻²); N_i/a: number of ingredients and additives; LIM, mean percentage of the maximal recommended values for 3 disqualifying nutrients; UPF, ultra-processed foods.
^a Values are medians, calculated from ESI Table 1.
^b P-values from the Kruskal–Wallis test for non-parametric data followed by a post hoc Dunn's test for multiple comparisons. Medians with different superscripts in the same row are significantly different (P < 0.05).
^c P-values from the Wilcoxon–Mann–Whitney test for non-parametric data. P < 0.05 indicates that median values for the G1 + G2 group were significantly different from median values for the G3 group (UPF).

Due to the absence of a significant difference between G2 and G3 for all TI models, we then compared non-UPFs (G1 and G2 clustered in one group) with UPFs (G3) medians. In all TI models UPF median values were significantly different from non-UPFs median values (Table 2). The highest differences were found for [LIM/NDS × (1/EB)] and [LIM/NDS] TI models.

Rules to define an ultra-processed product: an a posteriori data mining approach

Among decision tree and Bayesian network algorithms C4.5 (83% precision and overfitting only +9 foods) and Tree (75% precision and overfitting −8 foods) algorithms, respectively, were the most efficient in classifying UPF (n = 32 foods used for analysis) versus non-UPF (n = 85 foods used for analysis) (Table 3: see Recall, Precision and Overfitting parameters).

Table 3 Recall, precision and overfitting (accuracy) of decision trees and Bayesian network machine learning algorithms

	Decision tree algorithms			Bayesian network algorithms		
	CHAID	CART	C4.5	Naive	Tree	Forest
Recall	67%	83%	83%	100%	100%	100%
Precision	57%	71%	83%	60%	75%	67%
Overfitting	10	11	9	−8	−8	−12
CHAID, Chi-square automatic interaction detector; CART, classification and regression trees.

C4.5 decision tree. First, water activity and textural measurements were not included in the rules for defining UPF versus non-UPF. Among the 117 foods, 60 had a LIM score <7.97 (Fig. 3). Of these 60 foods, 98% (n = 59) were non-UPF. Of the 57 remaining foods, with a LIM score ≥7.97, 26 were non-UPF and 31 were UPF. Among the 26 non-UPF, 9 had an SFA content ≥16.35 g per 100 g and 17 had an SFA content <16.35 g per 100 g. All 31 UPF had an SFA content <16.35 g per 100 g. Among the 17 non-UPFs, 5 had a vitamin E content <0.165 mg per 100 g. All 31 UPFs had a vitamin E content ≥0.165 mg per 100 g. To summarize the rules for defining UPF, they were defined first by a LIM score ≥7.97, second by an SFA content <16.35 g per 100 g, third by a vitamin E content ≥0.165 mg per 100 g, and finally by an iodine content <14.5 mg per 100 g.
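The summary rules can be written as a single conjunction of threshold tests. The sketch below flattens the UPF branch of the tree into one function using the thresholds quoted above; the function name and the three example foods are hypothetical, the iodine unit is taken as printed, and the sketch does not reproduce the full C4.5 tree or its leaf proportions. As the text stresses, meeting these rules is relevant but not sufficient to establish that a food is UPF.

```python
def meets_upf_rules(lim_score, sfa_g, vitamin_e_mg, iodine):
    """True if a food satisfies all four summary rules for the UPF branch.

    Contents are per 100 g; thresholds are those reported for the C4.5 tree.
    """
    return (
        lim_score >= 7.97
        and sfa_g < 16.35
        and vitamin_e_mg >= 0.165
        and iodine < 14.5
    )


# Hypothetical foods (illustrative values only, not from ESI Table 1).
biscuit_like = dict(lim_score=25.5, sfa_g=9.0, vitamin_e_mg=1.2, iodine=3.0)
butter_like = dict(lim_score=40.0, sfa_g=50.0, vitamin_e_mg=2.0, iodine=3.0)
apple_like = dict(lim_score=0.4, sfa_g=0.1, vitamin_e_mg=0.2, iodine=0.5)

results = {
    "biscuit_like": meets_upf_rules(**biscuit_like),  # passes all four tests
    "butter_like": meets_upf_rules(**butter_like),    # fails the SFA test
    "apple_like": meets_upf_rules(**apple_like),      # fails the LIM test
}
print(results)
```

The butter-like example shows why the SFA split matters: a high LIM score alone would otherwise flag fat-rich non-UPF foods.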


	Fig. 3 Plot of the C4.5 decision tree algorithm for ultra-processed foods (UPF) (brown) versus non-UPF (blue). SFA, saturated fatty acids.

Tree Bayesian network. The redder the arrow, the more important the variable in defining ultra-processed foods, and conversely for yellow arrows (Fig. 4). Thus, SFA content was the most important variable in defining UPF, followed by LIM score, NDS, iodine content, FF, water content, a_w, lipid content, kcal content, etc.; the least important variables were first vitamin K, then vitamin E and omega 3 fatty acids, etc. As with the C4.5 decision tree algorithm, textural variables were poorly represented.

Important variables may also be affected by other variables: for example, SFA content was affected by MUFA content (see Table in the upper right corner of Fig. 4). Thus, when MUFA content was <0.1575 g per 100 g, 94% of non-UPF had a SFA content <0.23 g per 100 g, and 40% of UPF had a SFA content <0.23 g per 100 g. The LIM score was affected by kcal content: when kcal content was <215 per 100 g, 89% of non-UPF had a LIM score <7.35, and only 17% of UPF had a LIM score <7.35.
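
One way to read these conditional statements is as a share computed within a class after restricting to foods below the threshold of the influencing variable. The sketch below illustrates that reading on hypothetical records; the function name, field names and values are assumptions and do not reproduce the study's data or the Bayesian network software.

```python
def conditional_share(records, group, parent, parent_max, child, child_max):
    """Percentage of foods in `group` with child < child_max,
    among those foods of the group with parent < parent_max.
    Returns None when no food meets the parent condition."""
    subset = [r for r in records if r["group"] == group and r[parent] < parent_max]
    if not subset:
        return None
    hits = sum(1 for r in subset if r[child] < child_max)
    return 100.0 * hits / len(subset)


# Hypothetical records (illustration only).
foods = [
    {"group": "non-UPF", "kcal": 50, "lim": 1.2},
    {"group": "non-UPF", "kcal": 120, "lim": 3.0},
    {"group": "non-UPF", "kcal": 200, "lim": 9.1},
    {"group": "non-UPF", "kcal": 400, "lim": 10.0},
    {"group": "UPF", "kcal": 150, "lim": 12.5},
    {"group": "UPF", "kcal": 210, "lim": 6.0},
]

share_non_upf = conditional_share(foods, "non-UPF", "kcal", 215, "lim", 7.35)
share_upf = conditional_share(foods, "UPF", "kcal", 215, "lim", 7.35)
print(share_non_upf, share_upf)
```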


	Fig. 4 Plot of the tree Bayesian network algorithm for UPF. The redder the arrow, the more important the variable in defining ultra-processed foods, and conversely for yellow arrows. a_w, water activity; Chol, cholesterol; Lip, total lipids; MUFA, mono-unsaturated fatty acid; Na, sodium; NDS, nutrient density score; P, phosphorus; Prot, total proteins; PUFA, poly-unsaturated fatty acid; Se, selenium; SFA, saturated fatty acids; UPF, ultra-processed foods; Zn, zinc.

Discussion

First, our results showed that minimally-processed foods were less hyperglycemic, more satiating, had a better nutrient profile, had higher a_w, shorter shelf-life, lower maximum stress (N cm⁻²), and higher energy at break (J) than ultra-processed foods (UPF). Second, multivariate analyses satisfactorily distinguished foods according to their degree of processing, which is notable given the very high number of food variables considered (n = 72). Third, as a proof of concept, we showed that it is possible to develop a quantitative and holistic TI that accounts for both food nutritional composition and textural parameters. Finally, although this needs to be confirmed in more foods, in the context of our sample of selected foods, we showed that N_i/a > 4 and a LIM score ≥8 per 100 kcal are relevant, but not sufficient, rules for a food to be defined as UPF.

Based on the same food database as that used in our previous studies,^21,22 the results of the present study confirm that UPFs are less satiating, more hyperglycemic and have a less satisfactory nutrient profile than non-UPFs. As expected, UPFs also have a lower a_w and longer shelf-life: indeed, UPFs are generally designed by the agro-food industry primarily for long storage, and a low a_w is one of the parameters that enable this expectation to be fulfilled. Our ranges of a_w values, e.g., vegetables (0.976–1.000), fruit (0.981–0.992), breads (0.891–0.994), oleaginous nuts (0.425), cheeses (0.895–0.997), red and white meats (0.983–0.990), crispy bakery products (0.381–0.435), etc. (ESI Table 1†), are in good agreement with those reported in the exhaustive previously published a_w table by food groups.²⁵

Concerning textural parameters, the distinction between food groups was less clear than for other food data, with no significant difference in stress at 20/80% deformation (N cm⁻²) or in shear/stress resistance (N cm⁻²), probably due to the more heterogeneous textures of foods in group G3, and perhaps the still too few foods in G2 (n = 31) and G3 (n = 27) compared to G1 (n = 59). However, when G1 and G2 were clustered as non-UPF and compared to UPF, the effects became significant – or at the limit of significance – with medians of 9.7 and 7.7 for shear resistance (N cm⁻²), respectively (P = 0.034, results not shown); with medians of 9.0 and 7.9 for maximum strength (N), respectively (P = 0.050, results not shown); and with medians of 6.4 and 9.6 for movement at maximum strength (mm), respectively (P = 0.075, results not shown); showing that UPFs tend to be less resistant to shear and easier to break than non-UPFs.

No comprehensive table is available for textural characteristics of foods, only some scattered data that can be extracted from a few studies. In the first study, fracture strain (%) and maximum force (N) were measured on three snack foods using a texture measuring instrument (different from Instron), i.e., fried chickpea batter drops (that can be ranked as non-UPF in G2), extruded-cooked corn balls and puffed rice (that can be ranked as UPF in G3).²⁶ Extruded-cooked corn and puffed rice exhibited higher maximum force (at least +58%) and lower fracture strain (at least −47%) than fried chickpea batter drops, in agreement with our magnitudes comparing UPF versus non-UPF. In another study, hardness (equivalent to maximum stress in our study) and fracturability (equivalent to shear resistance in our study) were measured on 29 foods, mostly UPFs, commercial sweet and savory snacks, and confectionery, except a few non-UPF products such as peanuts, fresh California carrots and canned peaches.²⁷ As in our study, fresh carrots had the highest fracturability score compared to snacks, but also among the highest hardness values, which differed from our relative compression values, in which fresh foods showed lower maximum stress than UPFs. The higher energy at break of minimally-processed foods may be due to their natural fiber (plant-based foods) and/or protein (both plant- and animal-based foods) networks that offer resistance to sectioning. This could explain why – when using HCA – the profiles of raw carrot, black radish and asparagus were close to those of white and red meat-based foods: indeed, their textural profiles were very similar (ESI Table 1†), probably due to the presence of natural protein and fibrous networks that are resistant to shear and break forces. In UPFs, such natural networks are generally unstructured or have been removed by refining, resulting in foods that are easier to section.

Overall, our results show that non-UPFs tend to be more compressible (role of molars during chewing) but more difficult to section (role of incisors during chewing), although these results need to be confirmed on more foods. In a previous study, we showed that the elderly French population had a relatively healthy diet and consumed more minimally-processed foods (i.e., non-UPFs) than UPFs, suggesting that foods that are resistant to sectioning do not pose a chewing problem for this specific population.²¹ Beyond a culture-based greater preference for natural foods in this population, another possible explanation is that the elderly take more care of their incisors than of their molars, which are less visible. However, the preference for softer foods needs to be confirmed.

Otherwise, the degree of food processing appears to be a very good discriminator of our selected foods using either PCA or HCA, and a better one than classification according to the usual food groups, such as raw plant-based foods, animal-based foods, bakery products, confectionery and pastries. As discussed in a previous paper,²⁸ these results strongly suggest that foods should first be classified according to degree of processing, and second according to usual food groups, and not the reverse. Notably, UPFs were very well distinguished from minimally-processed foods.

In addition to the number of ingredients and additives (N_i/a), which is one of the primary parameters used in the NOVA classification to distinguish minimally-processed and processed foods from UPFs, other important parameters can be used to distinguish foods according to their degree of processing: thus, in our sample of 31 UPFs, the rules for defining them were a LIM score ≥8 per 100 kcal, a SFA content <16.4 g per 100 g and a vitamin E content >0.17 mg per 100 g. The LIM score thus appears to be a better discriminator than textural parameters, such as those tested in this study, or a_w. Similar results were obtained by Darmon et al. for 148 ready-to-eat foods, in which most processed foods (equivalent to the G2 and G3 technological groups in our study) were defined by a LIM score >7.5 per 100 kcal and a NDS score <5 per 100 kcal.²⁹ However, in the study by Darmon et al., the processed foods that matched those in our G2 group were not well distinguished from UPFs, as defined in our study, probably because NDS and LIM scores do not include the number of ingredients as in the NOVA classification. In addition, ‘light’ products with a low LIM score may be wrongly considered as non-UPFs. Consequently, although a rule based on a LIM score threshold of around 7.5–8 per 100 kcal can link a food to ultra-processing, it is not sufficient to objectively define a UPF.

Beyond qualitative classification and rules, we propose a holistic TI including food structure, functional nutritional properties and composition, i.e., NDS and LIM scores, FF, a_w and textural parameters to quantify the degree of processing. This is the first time such an index has been proposed, as all previous indices were only based on nutrient composition.^1,29,30 In our study, leaving aside N_i/a, overall most TI models performed similarly, and the most discriminating TI model was [(LIM/NDS) × (1/EB)]. Adding MS, FF and a_w values to this model did not increase its strength. Therefore, in the context of our 117 selected foods, N_i/a and the [(LIM/NDS) × (1/EB)] ratio are the best determinants of solid UPF.
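
To make the [(LIM/NDS) × (1/EB)] model concrete, the sketch below computes this index for a few foods and compares group medians. The function name and all input values are hypothetical and serve only to illustrate the formula; no threshold or statistical test from the study is implied.

```python
from statistics import median


def technological_index(lim, nds, eb):
    """TI = (LIM / NDS) * (1 / EB).

    lim: LIM score, nds: nutrient density score, eb: energy at break (J).
    NDS and EB must be strictly positive for the ratio to be defined.
    """
    if nds <= 0 or eb <= 0:
        raise ValueError("NDS and EB must be > 0")
    return (lim / nds) * (1.0 / eb)


# Hypothetical (LIM, NDS, EB) triplets, for illustration only.
non_upf = [(2.0, 10.0, 0.5), (4.0, 8.0, 1.0), (1.0, 5.0, 0.25)]
upf = [(10.0, 2.0, 0.5), (8.0, 4.0, 0.25), (12.0, 3.0, 1.0)]

median_non_upf = median(technological_index(*f) for f in non_upf)
median_upf = median(technological_index(*f) for f in upf)
print(median_non_upf, median_upf)
```

With this formulation, a high LIM score, a low NDS and a low energy at break all push the index upward, which is the direction expected for UPFs.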

There could be at least two explanations for the difficulty of textural parameters to strongly discriminate foods according to technological groups: (1) the discrepancy between the number of foods in each NOVA group, i.e., 59, 27 and 31 in G1, G2 and G3 groups, respectively; (2) compression and shear forces may not strongly reflect the degree of processing of our 117 selected foods. Other physical–chemical parameters that could be related to unstructuration of the food matrix during processing might also be measured in the future, e.g., density, hardness and/or water-holding. An important issue would probably be to find an indicator able to discriminate natural food networks, such as fibrous and protein networks, from those encountered in UPFs. For carbohydrate-rich foods, the GI could probably be considered as a good indicator to include in the TI because it reflects glucose bioavailability, and indirectly matrix unstructuration.

To be still more holistic, one could also consider including in the TI formulae other functional nutritional properties such as food PRAL and ORAC/FRAP indices, which reflect the acidifying³¹ and antioxidant^32,33 potentials of foods, respectively. Finally, it would be useful to define a TI threshold above which a food can be considered as UPF.

In conclusion, although more foods still need to be tested, our results suggest that it is possible to define a holistic quantitative TI to characterize the degree of processing. In addition, our results also show that MS and EB are only partially involved in differentiating UPFs from non-UPFs. Although MS and EB did differentiate UPF from minimally-processed foods, machine learning analyses did not include them in the rules for the classification of a food as ultra-processed. However, due to the infinite recombination of ingredients found in UPFs, the latter generally exhibit a very wide range of textures, from very hard (like in hard candy) to very soft and/or liquid texture (like dairy desserts and sodas), suggesting that texture is not the only – and certainly not the most relevant – characteristic to take into consideration to objectively quantify the degree of processing. Therefore, in the context of our study, i.e., 117 foods tested (very few semi-solid foods, and no liquid or ‘light’ foods) and two types of textural measurement, the combination of N_i/a and the [(LIM/NDS) × (1/EB)] ratio appears for now to be the most relevant determinant of UPFs. It remains that measuring and quantifying the degradation of the “matrix effect” in foods with increasing process intensities appears to be a tough challenge. In the end, we suggest that food health potential should be first defined by its degree of processing.

Abbreviations

a_w	Water activity
EB	Energy at break
FF	Fullness factor
FRAP	Ferric reducing antioxidant power
G1–G3	NOVA technological groups 1 (minimally-processed), 2 (processed) and 3 (ultra-processed)
GGE	Glycemic glucose equivalents
GI	Glycemic index
HCA	Hierarchical cluster analysis
LIM	LIMited nutrient score
MS	Maximum strength
NDS	Nutrient density score
N_i/a	Number of ingredients/additives
ORAC	Oxygen radical absorbance capacity
PCA	Principal component analysis
PRAL	Potential renal acid load
SR	Shear resistance
TI	Technological index
UPF	Ultra-processed food

Financial support

This study was financially supported by the French National Research Agency-ANR (AlimaSSenS project no. 14-CE20-0003-01). The NutriNet-Santé is supported by the French Ministry of Health (DGS), the French Institute for Health Surveillance (InVS), the National Institute for Prevention and Health Education (INPES), the Foundation for Medical Research (FRM), the National Institute for Health and Medical Research (INSERM), the National Institute for Agricultural Research (INRA), the National Conservatory of Arts and Crafts (CNAM), and the University of Paris 13.

Authorship

The present study was developed by A. F.: A. F. formulated the research question, designed and carried out the study, collected and analyzed the data, and took the lead in writing the manuscript. S. L. carried out all textural and a_w measurements, collected other food data, and helped analyze the data and write the manuscript. A. B. also helped analyze physical–chemical data, design technological indices and write the manuscript. All authors reviewed and approved the final manuscript.

Conflicts of interest

The authors declare no conflict of interest.

Acknowledgements

The authors thank Raphaël Favier, Stephane Portanguen and Pierre-Sylvain Mirade from the QuaPA unit (Quality of meat products, INRA Clermont-Ferrand/Theix) for scientific and technical help in structural food measurements, Mélanie Petera (Unit of Human Nutrition, INRA Clermont-Ferrand/Theix) for help with checking statistical analyses, and Daphné Goodfellow (UMR Qualisud, Cirad Montpellier) for carefully checking and improving the English of the manuscript.

References

1. A. Drewnowski and V. Fulgoni, Nutr. Rev., 2008, 66, 23–39.
2. M. Maillot, E. L. Ferguson, A. Drewnowski and N. Darmon, J. Nutr., 2008, 138, 1107–1113.
3. A. Fardet, J. Nutr. Health Food Eng., 2014, 1, 31.
4. A. Fardet, Food Funct., 2015, 6, 363–382.
5. G. B. Haber, K. W. Heaton, D. Murphy and L. F. Burroughs, Lancet, 1977, 2, 679–682.
6. P. Burton and H. J. Lightowler, Br. J. Nutr., 2006, 96, 877–882.
7. S. H. Holt and J. B. Miller, Eur. J. Clin. Nutr., 1994, 48, 496–502.
8. A. Giraldo Toro, O. Gibert, A. Briffaz, J. Ricci, D. Dufour, T. Tran and P. Bohuon, Carbohydr. Polym., 2016, 147, 426–435.
9. L. Chambers, Nutr. Bull., 2016, 41, 277–282.
10. K. Nojima, H. Ikegami, T. Fujisawa, H. Ueda, N. Babaya, M. Itoi-Babaya, K. Yamaji, M. Shibata and T. Ogihara, Diabetes Res. Clin. Pract., 2006, 74, 1–7.
11. A. Fardet, I. Souchon and D. Dupont, Structure des aliments et effets nutritionnels, Quae edn, 2013.
12. J.-C. Moubarac, D. C. Parra, G. Cannon and C. A. Monteiro, Curr. Obes. Rep., 2014, 3, 256–272.
13. C. A. Monteiro, G. Cannon, J. C. Moubarac, A. P. Martins, C. A. Martins, J. Garzillo, D. S. Canella, L. G. Baraldi, M. Barciotte, M. L. Louzada, R. B. Levy, R. M. Claro and P. C. Jaime, Public Health Nutr., 2015, 18, 2311–2322.
14. M. L. Louzada, L. G. Baraldi, E. M. Steele, A. P. Martins, D. S. Canella, J. C. Moubarac, R. B. Levy, G. Cannon, A. Afshin, F. Imamura, D. Mozaffarian and C. A. Monteiro, Prev. Med., 2015, 81, 9–15.
15. F. Rauber, P. D. B. Campagnolo, D. J. Hoffman and M. R. Vitolo, Nutr., Metab. Cardiovasc. Dis., 2015, 25, 116–122.
16. L. F. Tavares, S. C. Fonseca, M. L. Garcia Rosa and E. M. Yokoo, Public Health Nutr., 2012, 15, 82–87.
17. J.-C. Moubarac, M. Batal, M. L. Louzada, E. Martinez Steele and C. A. Monteiro, Appetite, 2016, 108, 512–520.
18. M. L. Louzada, A. P. Martins, D. S. Canella, L. G. Baraldi, R. B. Levy, R. M. Claro, J. C. Moubarac, G. Cannon and C. A. Monteiro, Rev. Saude Publica, 2015, 49, 38.
19. A. Fardet and Y. Boirie, Nutr. Rev., 2014, 72, 741–762.
20. C. A. Monteiro, G. Cannon, J.-C. Moubarac, R. B. Levy, M. L. C. Louzada and P. C. Jaime, Public Health Nutr., 2018, 21, 5–17.
21. A. Fardet, C. Méjean, H. Labouré, V. A. Andreeva and G. Féron, Food Funct., 2017, 8, 651–658.
22. A. Fardet, Food Funct., 2016, 7, 2338–2346.
23. C. Julia, S. Péneau, C. Buscail, R. Gonzalez, M. Touvier, S. Hercberg and E. Kesse-Guyot, BMJ Open, 2018, 21, 27–37.
24. U. M. Fayyad and K. B. Irani, Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, Mach. Learn., 1993, 1022–1027.
25. S. J. Schmidt and A. J. Fontana, in Water Activity in Foods, Blackwell Publishing Ltd, 2008, pp. 407–420.
26. R. Ravi, B. S. Roopa and S. Bhattacharya, J. Text. Stud., 2007, 38, 135–152.
27. Y. K. Peng, X. Z. Sun, L. Carson and C. Setser, J. Text. Stud., 2002, 33, 135–148.
28. A. Fardet, E. Rock, J. Bassama, P. Bohuon, P. Prabhasankar, C. Monteiro, J.-C. Moubarac and N. Achir, Adv. Nutr., 2015, 6, 629–638.
29. N. Darmon, F. Vieux, M. Maillot, J.-L. Volatier and A. Martin, Am. J. Clin. Nutr., 2009, 89, 1227–1236.
30. A. Drewnowski, Am. J. Clin. Nutr., 2010, 91, 1095S–1101S.
31. T. Remer and F. Manz, J. Am. Diet. Assoc., 1995, 95, 791–797.
32. X. Wu, G. R. Beecher, J. M. Holden, D. B. Haytowitz, S. E. Gebhardt and R. L. Prior, J. Agric. Food Chem., 2004, 52, 4026–4037.
33. M. H. Carlsen, B. L. Halvorsen, K. Holte, S. K. Bøhn, S. Dragland, L. Sampson, C. Willey, H. Senoo, Y. Umezono, C. Sanada, I. Barikmo, N. Berhe, W. C. Willett, K. M. Phillips, D. R. Jacobs and R. Blomhoff, Nutr. J., 2010, 9, 3.

Footnote

† Electronic supplementary information (ESI) available. See DOI: 10.1039/c7fo01423f
