Cross-Database Characterization of Flavonoids and Phenolic Acids: Integrating Drug-likeness Metrics, Molecular Interactions, and Dietary Sources
Christmas Maria Vidal de Barros Rêgo, Zafirah Muhammad Rahman, Anna Paula Aguiar, Tatiane Fabiane Ferreira dos Santos, Sergio Senar, Luciana Aparecida Campos, Ovidiu Constantin Baltatu

TL;DR
This study evaluates the drug potential and food sources of flavonoids and phenolic acids using a unified database approach.
Contribution
A novel integrative framework combining drug-likeness metrics, molecular interactions, and dietary sources for flavonoids and phenolic acids.
Findings
Isoflavones showed the best drug-likeness profiles with a mean QED score of 0.62.
Flavonoids had higher binding affinities and targeted more proteins than phenolic acids.
Herbs and spices were identified as the richest sources of these compounds.
Abstract
Background: Flavonoids and phenolic acids are recognized for their diverse therapeutic potential, yet their translation into clinical applications remains limited by varying bioavailability and fragmented characterization across databases. A systematic integrative approach is needed to comprehensively evaluate these compounds’ drug-likeness properties based on computational metrics, molecular interactions, and dietary sources within a unified framework. Methods: We analyzed 954 compounds (715 flavonoids, 239 phenolic acids) by integrating data from PhytoHub, Phenol-Explorer, ChEMBL, and FoodDB databases. Drug-likeness was assessed using established metrics, including QED (Quantitative Estimate of Drug-likeness) and DataWarrior drug-likeness scores. Molecular interaction patterns were characterized through ChEMBL activity data, and food source distributions were systematically mapped…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12
Figure 13
Figure 14
Figure 15
Figure 16
Figure 17
Figure 18
Figure 19
Figure 20
Figure 21
Figure 22
Figure 23
Figure 24
Figure 25
Figure 26
Figure 27- —Anima Institute—AI
- —National Council for Scientific and Technological Development
- —Alfaisal University
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsPhytochemicals and Antioxidant Activities · Phytoestrogen effects and research · Computational Drug Discovery Methods
1. Introduction
Flavonoids and phenolic acids constitute major classes of plant-derived secondary metabolites extensively investigated for their wide-ranging biological activities and potential health benefits. Found in fruits, vegetables, tea, and wine, these compounds demonstrate diverse biological effects through multiple cellular mechanisms [1,2]. Their therapeutic relevance stems from their ability to modulate key cellular signaling pathways and inhibit specific enzymes such as xanthine oxidase and cyclo-oxygenase [3]. Experimental studies have indicated their potential in various health conditions, including cardiovascular diseases, neurodegenerative disorders, and metabolic conditions [4,5].
Phenolic acids and flavonoids exhibit distinct molecular interaction patterns that influence their biological activities. Previous studies have shown that flavonoids generally demonstrate interactions with a broader range of molecular targets and exhibit higher binding affinities compared to phenolic acids [6]. While flavonoids show strong binding affinity with DNA repair proteins and hormone receptors, phenolic acids exhibit focused interactions with membrane transporters, suggesting complementary roles in cellular processes [7,8,9,10].
Flavonoid supplementation has been shown to improve cardiovascular health by reducing inflammation, enhancing endothelial function, and reducing oxidative stress, thus reducing the risk of atherosclerosis [2,11]. However, the therapeutic efficacy observed in controlled settings often fails to translate effectively in clinical practice, largely due to the pharmacokinetic limitations [12]. The clinical application of flavonoids and phenolic acids faces significant challenges due to their limited bioavailability. Key limiting factors include poor aqueous solubility, extensive first-pass metabolism, and limited gastrointestinal absorption [13,14,15].
Recent advances in delivery systems, particularly nanotechnology-based approaches, show promise in addressing these limitations, but optimizing these compounds for therapeutic applications requires a systematic understanding of their molecular properties, interaction patterns with biological targets, and distribution across food sources [16,17].
Despite considerable evidence supporting the therapeutic potential of flavonoids and phenolic acids, information on these compounds is dispersed across multiple specialized databases, each providing distinct but complementary data: PhytoHub and Phenol-Explorer for compound identification and classification, ChEMBL for molecular interactions and bioactivity data, and FoodDB for dietary source distribution. Current approaches often evaluate these compounds using data from individual databases in isolation, without systematically integrating drug-likeness assessment, molecular interaction patterns, and natural source distribution. Additionally, while established drug-likeness metrics such as the Quantitative Estimate of Drug-likeness (QED) and DataWarrior drug-likeness scores provide valuable frameworks for pharmaceutical evaluation, their application to flavonoids and phenolic acids has not been systematically compared across compound subclasses.
This study aimed to address these gaps by developing an integrative cross-database characterization framework for flavonoids and phenolic acids. Specifically, we sought to: (1) systematically characterize 954 compounds across flavonoid and phenolic acid subclasses using multiple database sources; (2) evaluate drug-likeness using established metrics and assess their complementary nature; (3) analyze molecular interaction patterns and target selectivity differences between compound classes using ChEMBL bioactivity data; and (4) map compound distribution across dietary sources to identify food matrices with high flavonoid and phenolic acid content.
2. Methods
2.1. Data Collection and Molecular Dataset Preparation
PhytoHub (version 1.4) and PhenolExplorer (version 3.6) databases were selected as primary sources for compound identification and chemical characterization, yielding a dataset of 715 flavonoids and 239 phenolic acids (n = 954). All compounds were systematically characterized using standard International Chemical Identifier (InChI) keys, molecular registry numbers, and systematic nomenclature, and classified into structural classes and subclasses based on their chemical features.
The characterized compounds were cross-referenced with the ChEMBL database (version 36) to identify compounds with existing molecular registry numbers (molregno). For compounds that had InChI keys but lacked ChEMBL molecular registry numbers, structural similarity searches were conducted within ChEMBL to identify and analyze the physicochemical properties of analogous compounds.
Lastly, the FoodDB database (version 1.0) was employed to map these compounds to their dietary sources. This step facilitated the identification of food matrices and concentration ranges for the characterized flavonoids and phenolic acids.
2.2. Computational Analysis and Molecular Property Metrics Calculations
Molecular property analysis was performed using RDKit (version 2023.03.1) in Python (version 3.8). Using canonical SMILES representations of the chemical structures, eight molecular descriptors were calculated to assess compliance with Lipinski’s Rule of 5 (Ro5) and Veber Rules:
- Molecular weight (MW): Calculated based on atomic composition, representing the sum of atomic weights in the molecule. MW serves as a fundamental indicator of size-dependent membrane permeation and diffusion characteristics.
- LogP (Octanol-Water Partition Coefficient): Computed using RDKit’s implementation of Crippen’s method, representing the logarithm of the partition coefficient between n-octanol and water. This parameter quantifies lipophilicity, a critical determinant of membrane permeability.
- Hydrogen bond donors (HBD): Calculated as the sum of -OH and -NH groups in the molecule. HBD influences protein binding interactions and membrane permeation potential through hydrogen bonding capacity.
- Hydrogen bond acceptors (HBA): Determined by the total count of oxygen and nitrogen atoms. HBA complements HBD in characterizing a molecule’s capacity for hydrogen bonding, affecting both protein interactions and membrane passage.
- Rotatable bonds: calculated as the number of single bonds capable of free rotation (excluding terminal bonds). This parameter quantifies molecular flexibility, with fewer rotatable bonds (≤10) correlating with better oral bioavailability due to reduced entropy loss upon binding.
- Topological polar surface area (TPSA): Calculated using the Ertl et al. method as the sum of surface contributions from polar atoms (primarily oxygen and nitrogen) in Å^2^ [18]. This parameter serves as a key predictor of membrane permeability, with lower values (≤140 Å^2^) typically indicating better membrane passage.
- Aromatic rings: Determined using RDKit’s aromaticity perception algorithms. This parameter quantifies the number of aromatic systems, reflecting potential for π-π stacking interactions and molecular stability.
- Chiral centers: Identified using RDKit’s stereochemistry detection algorithms. The number of chiral centers indicates three-dimensional structural complexity and potential stereochemical influences on biological interactions.
Two sets of criteria were used to assess molecular property metrics related to drug-likeness: Lipinski’s Rule of 5 (Ro5) and Veber Rules. Ro5 dictates that a compound should ideally have a molecular weight no greater than 500 Daltons, a LogP value no greater than 5, no more than 5 hydrogen bond donors, and no more than 10 hydrogen bond acceptors. Compounds were considered Ro5 compliant if they violated no more than one of these rules. The Veber Rules stipulate that a compound should have 10 or fewer rotatable bonds and a topological polar surface area (TPSA) of 140 Å^2^ or less. Full compliance with Veber Rules required meeting both criteria.
2.3. Drug-likeness Assessment—QED and DataWarrior Scores
Drug-likeness was evaluated using two established computational metrics: the Quantitative Estimate of Drug-likeness (QED) score and the DataWarrior Drug-likeness score. Physicochemical parameters (molecular weight, LogP, hydrogen bond donors/acceptors, rotatable bonds, topological polar surface area) were calculated using RDKit (version 2023.03.1) in Python to support drug-likeness interpretation.
QED scores were calculated using the RDKit implementation, which combines eight molecular descriptors weighted according to the methodology described by Bickerton et al. [19]. QED calculation required complete canonical SMILES representations parseable by RDKit; compounds with missing or incompatible structural data were excluded (n = 349 of 954). DataWarrior accepts broader input formats, including InChI and molfiles, enabling drug-likeness calculation for a larger subset (n = 847).
Drug-likeness was assessed using DataWarrior software (version 5.5.0), accessing the Open Molecules database. The DataWarrior algorithm evaluates molecular substructure fragments, summing fragment contributions and normalizing by the square root of the total number of substructures [20]. Positive values indicate drug-like properties, while negative values suggest building block-like characteristics [21].
2.4. Molecular Interaction Analysis
Molecular interaction data were extracted from the ChEMBL database (version 33) for compounds with established molecular registry numbers (molregno). Bioactivity data were retrieved using the ChEMBL web services API, filtering for compound–protein binding activities with reported activity values.
ChEMBL activity scores, expressed as pChEMBL values (−log_10_ of the molar IC_50_, EC_50_, Ki, or Kd), were used as standardized measures of binding affinity. Only activity records meeting the following criteria were included: (1) defined target relationship (direct interaction), (2) standard activity types (IC_50_, EC_50_, Ki, Kd), and (3) activity values with specified units (nM or μM). Activity values were converted to pChEMBL scores, where higher values indicate stronger binding affinity.
Protein targets were identified using ChEMBL target identifiers and classified according to their biological function (e.g., kinases, transporters, nuclear receptors, DNA repair proteins). For each compound-target pair, the highest reported activity score was retained to avoid redundancy from multiple assays. Target diversity was quantified as the number of unique protein targets with reported binding data for each compound class.
Differences in molecular interaction strengths between flavonoids and phenolic acids were assessed using two-tailed independent samples t-tests. Distribution characteristics were visualized using violin plots with embedded box plots showing median, interquartile range, and 1.5× interquartile range. Statistical significance was set at p < 0.05.
2.5. Food Source Mapping and Concentration Analysis
Food source data were systematically extracted from the FoodDB database (version 1.0) using compound identifiers (InChI keys and common names) established through PhytoHub and Phenol-Explorer cross-referencing.
Compound concentrations in food matrices were retrieved and standardized to mg/kg (equivalent to mg/100 g × 10) to enable cross-food comparison. For compounds with multiple reported concentration values from different studies or analytical methods, minimum, maximum, and median concentrations were calculated. Only concentration values with defined units and quantifiable amounts (above detection limits) were included in the analysis.
Food sources were categorized into eight major food groups based on FoodDB classification and common nutritional categorization: herbs and spices, fruits, vegetables, beverages (non-tea), teas, cereals and cereal products, nuts and seeds, soy and soy products. Subclasses classifications followed botanical taxonomy (e.g., Rosaceae fruits, Brassicaceae vegetables) where applicable.
Total flavonoid content per food source was calculated as the sum of all individual compound concentrations. For each food group, the following parameters were determined: (1) total concentration (sum of all compound concentrations), (2) compound diversity (number of unique compounds detected), (3) primary compound identification (compound contributing the highest percentage to total concentration), and (4) concentration range (minimum to maximum individual food source concentrations). Hierarchical analysis stratified food sources into concentration tiers: high-content (>7000 mg/kg), medium-content (2000–7000 mg/kg), and lower-content (500–2000 mg/kg).
2.6. Statistical Analysis and Visualization
Statistical analyses were performed using Python’s NumPy (v1.21) and Pandas (v1.3) libraries. Distribution analyses included calculation of means, standard deviations, and compliance percentages. Kernel Density Estimation (KDE) plots were generated using Seaborn (v0.11.2) with bandwidth selection using Scott’s rule. Visualizations were created using Matplotlib (v3.4.2). Statistical computations used 64-bit floating-point precision, employing complete case analysis for handling missing values.
3. Results
3.1. Datasets Composition: Distribution and Overlap of Chemical Identifiers Across Flavonoid and Phenolic Acid Classes
Chemical identifiers for flavonoids and phenolic acids were sourced from three major databases: PhytoHub, PhenolExplorer, and ChEMBL. Chemical structure identification was achieved through two standardized identifiers: International Chemical Identifiers (InChI Keys) and Molecular Registry Numbers. The distribution of chemical identifiers across phenolic compound classes revealed varying levels of structural annotation and database coverage (Figure 1).
Among flavonoids, flavonols represented the largest subclass with 324 unique compounds, of which 187 (58%) had Registry Numbers in ChEMBL. Isoflavonoids and flavones followed with 198 and 156 unique compounds, respectively, showing similar proportions of ChEMBL-indexed structures (52% and 54%). Notably, anthocyanins, despite being a smaller subclass with 89 compounds, had the highest proportion of Registry Numbers (71%). In contrast, chalcones showed the lowest database coverage, with only 38% of their 45 unique structures having Registry Numbers.
For phenolic acids, hydroxycinnamic acids emerged as the predominant subclass with 167 unique compounds, of which 103 (62%) were indexed in ChEMBL. Hydroxybenzoic acids showed a similar pattern with 89 unique compounds and 58% ChEMBL coverage. Overall, this analysis demonstrates substantial variation in the database coverage of different phenolic subclasses, with anthocyanins showing the highest representation in ChEMBL and chalcones the lowest.
During cross-database verification, three compounds—Lambertianin C (2805.91 g/mol), Sanguiin H-6 (1871.28 g/mol), and Punicalagin (1084.72 g/mol)—were identified as misclassified hydroxybenzoic acids in Phenol-Explorer and showed molecular weights that deviated substantially from the typical range of phenolic acids (mean: 357.41 ± 176.96 g/mol). Structural analysis revealed that these compounds are actually ellagitannins, a subclass of hydrolyzable tannins, rather than simple phenolic acids.
3.2. Drug-likeness Assessment Using Established Metrics
Drug-likeness was evaluated using two established computational metrics: the Quantitative Estimate of Drug-likeness (QED) score and the DataWarrior Drug-likeness score.
QED Score Distribution
QED scores were calculated for 349 compounds with complete structural data. The analysis revealed a distinctive bimodal distribution with a mean of 0.48 ± 0.24 (Figure 2A). QED scores showed a slight negative skew (skewness = −0.155), with an interquartile range from 0.25 to 0.69, suggesting the presence of two major compound populations with distinct drug-like characteristics. This bimodality likely reflects the structural diversity within the dataset, particularly in terms of molecular complexity and pharmaceutical relevance.
The top-performing compounds by QED score included Sativanone (0.940), Dihydroformononetin (0.910), and Violanone (0.900), approaching the theoretical maximum QED value of 1.0. The median QED score of 0.530 suggests that a substantial portion of the analyzed compounds possess favorable drug-like properties according to this metric.
DataWarrior Drug-Likeness Score Distribution
DataWarrior drug-likeness scores were determined for 847 compounds, displaying a notably left-skewed distribution (skewness = −5.364) with a mean of −2.46 ± 4.38 (Figure 2B). The broad range of scores, from a minimum of −70.14 to a maximum of 2.65, with a median of −0.98, highlights the heterogeneous nature of the compound set. Positive values indicate drug-like properties, while negative values suggest building block-like characteristics.
Notably, 25% of the compounds showed favorable drug-likeness scores above 0.05, identifying a substantial subset of compounds with favorable computed profiles suggestive of drug-like characteristics. Modified flavonoids demonstrated superior drug-likeness properties, with 3′-O-Methyl-(-)-epicatechin 4′-O-sulfate achieving the highest score (2.648), followed by Genistein-7-O-glucuronide-4′-sulfate (2.562). This suggests that specific structural modifications, particularly sulfation and glucuronidation, may enhance the drug-like properties of flavonoid compounds.
3.3. Correlation Analysis of Drug-likeness Metrics
The correlation matrix (Figure 3) reveals distinct relationships between drug-likeness scores and molecular descriptors across the compound library. QED and DataWarrior drug-likeness scores showed moderate positive correlation (r = 0.41). This correlation was calculated using the 349 compounds with both QED and DataWarrior scores available. The subset with complete data for both metrics may be biased toward structurally simpler compounds, and correlation results should be interpreted within this context. QED exhibited negative correlations with molecular weight (r = −0.39), hydrogen bond donors (r = −0.45), hydrogen bond acceptors (r = −0.37), and topological polar surface area (r = −0.43). Strong positive correlations were observed between molecular weight and both hydrogen bond acceptors (r = 0.67) and TPSA (r = 0.64). Hydrogen bond donors and acceptors were also strongly correlated (r = 0.62).
3.4. Compound Classification by Drug-likeness Profiles
Analysis of drug-likeness metrics across compound subclasses revealed distinct patterns (Table 1). Compounds were stratified based on their QED and DataWarrior scores to identify subclasses with favorable pharmaceutical profiles.
Isoflavones demonstrated the most favorable drug-likeness profiles, with the highest mean QED score (0.62 ± 0.18) and positive DataWarrior scores (0.42 ± 2.1). This subclass also showed the best compliance with Lipinski parameters, with moderate molecular weight (364.4 Da), acceptable LogP (1.60), and hydrogen bonding capacity within optimal ranges. In contrast, flavan-3-ols and flavonols showed lower drug-likeness scores, primarily due to higher molecular weights, excessive hydrogen bond donors, and elevated polar surface areas.
Table 2 presents the top 10 compounds identified by each established drug-likeness metric, including their 2D chemical structures to facilitate rapid structural assessment. The QED analysis identified isoflavones and flavanones as top performers, with sativanone achieving the highest score (0.940). These compounds share structural features conducive to oral bioavailability: moderate molecular weight, balanced lipophilicity, and limited hydrogen bonding capacity. The DataWarrior analysis highlighted the importance of specific structural modifications. Sulfated and glucuronidated flavonoid conjugates dominated the top rankings, suggesting that these phase II metabolite forms may possess enhanced drug-like properties compared to their parent aglycones.
The QED analysis identified isoflavones and flavanones as top performers, with sativanone achieving the highest score (0.940). These compounds share structural features conducive to oral bioavailability: moderate molecular weight, balanced lipophilicity, and limited hydrogen bonding capacity. The DataWarrior analysis highlighted the importance of specific structural modifications. Sulfated and glucuronidated flavonoid conjugates dominated the top rankings, suggesting that these phase II metabolite forms may possess enhanced drug-like properties compared to their parent aglycones.
3.5. Differential Molecular Interaction Patterns Between Flavonoids and Phenolic Acids
Analysis of molecular interaction strengths revealed distinct binding patterns between flavonoids and phenolic acids across their protein targets (Supplementary sheet ‘molTargetDiseaseEdges’ in Excel file ‘flavonoidTables Manuscript.xlsx’). Flavonoids exhibited significantly higher overall interaction strengths (mean ChEMBL activity score: 7.26 ± 1.09) compared to phenolic acids (6.98 ± 0.94; p = 0.014, two-tailed t-test). The distribution of activity scores showed greater variability for flavonoids (range: 6.01–9.70) than phenolic acids (range: 6.01–9.00), suggesting more diverse interaction patterns. Flavonoids demonstrated exceptional binding affinity with DNA repair proteins, particularly the Bloom syndrome protein (activity score: 9.70), and the thyroid hormone receptor beta-1 (activity score: 9.22). In contrast, phenolic acids showed more concentrated interaction patterns, with their highest affinities consistently observed for membrane transporters, specifically the SLCO1B1 and SLCO1B3 proteins (activity scores: 9.00). This specialized targeting of transport proteins by phenolic acids suggests a potential role in cellular uptake and distribution mechanisms. The broader distribution of flavonoid interaction strengths, evidenced by the wider violin plot (Figure 4), indicates greater molecular promiscuity compared to phenolic acids. This characteristic may explain their diverse biological effects reported in previous studies. The higher median activity score for flavonoids (6.89) compared to phenolic acids (6.69) further supports their broader interaction potential with biological targets.
Flavonoids demonstrated interactions with a broader range of targets (67 unique targets) compared to phenolic acids (33 targets), with higher overall binding affinities across their target spectrum (Table 3).
3.6. Distribution of Flavonoids Across Food Sources
The analysis of flavonoid content across diverse food sources revealed distinctive patterns in total concentrations, with values ranging from 500 to 14,500 mg/kg (Figure 5; Supplementary sheet ‘parentFlavonoidsInFoods’ in Excel file ‘flavonoidTables Manuscript.xlsx’). These concentration values represent chemical abundance in food matrices, not bioavailable or nutritionally efficacious amounts; actual nutritional impact depends on bioaccessibility, absorption efficiency, and metabolic fate, which vary substantially among compounds and individuals. Evening primrose emerged as the predominant source, containing approximately 14,500 mg/kg of total flavonoids, substantially higher than other analyzed sources. Concentration values represent single measurements reported in FoodDB; the relationship between total concentrations and flavonoid subclass composition across food groups is presented in Figure 6. This was followed by mango (9000 mg/kg) and red wine (7500 mg/kg), establishing the top tier of flavonoid-rich sources. Tea varieties demonstrated consistently high concentrations at approximately 7200 mg/kg, indicating their significance as a reliable flavonoid source.
The distribution of flavonoid content demonstrated clear categorical patterns. Herbs and spices emerged as particularly rich sources, with evening primrose, hyssop (5500 mg/kg), and parsley (2500 mg/kg) representing the highest concentrations within this category. Fruit sources showed notable variation, with mango, common grape (4500 mg/kg), and sweet orange (4000 mg/kg) containing the highest concentrations among fruits. Beverages, particularly red wine and tea, constituted another significant category, consistently showing high flavonoid content above 7000 mg/kg.
A hierarchical analysis revealed three distinct tiers of flavonoid content. The high-content tier (>7000 mg/kg) included evening primrose, mango, red wine, and tea varieties. The medium-content tier (2000–7000 mg/kg) encompassed several herbs and fruits, including hyssop, common grape, sweet orange, and parsley. The lower-content tier (500–2000 mg/kg) included a diverse range of sources such as bilberry, Mexican oregano, common walnut, and black raspberry, demonstrating that even foods with relatively lower concentrations can contribute meaningfully to dietary flavonoid intake.
Notably, certain food categories showed consistent patterns within their groups. Herbs and spices generally maintained high concentrations, while fruits demonstrated more variable content. Cereals and nuts, represented by common buckwheat (4000 mg/kg) and common walnut (1500 mg/kg, respectively, showed moderate flavonoid content. These findings suggest that optimal dietary flavonoid intake might be achieved through a diverse diet incorporating multiple food categories, with particular emphasis on herbs, spices, and specific fruits.
3.7. Distribution and Composition of Bioactive Compounds Across Food Groups
Analysis of flavonoid distribution revealed distinct patterns of compound accumulation and diversity across major food groups (Figure 6).
Herbs and spices emerged as the richest source of flavonoids, with a total concentration of 45,516 mg/kg, characterized by a predominance of quercetin (32.74%) and substantial compound diversity (49 unique compounds). This was followed by fruits, which exhibited the second-highest total concentration (40,490 mg/kg) and demonstrated the greatest compound diversity with 62 unique flavonoids, primarily composed of gallic acid (22.23%).
Teas represented the third most concentrated source (37,101 mg/kg), showing a more specialized profile dominated by ent-gallocatechin 3-gallate (11.02%). Notably, cereals and cereal products contained significant flavonoid levels (14,350 mg/kg), with quercetin 3-rutinoside as the primary compound (27.88%). Vegetables, while showing moderate total concentrations (8935 mg/kg), maintained considerable compound diversity with 45 unique flavonoids. Beverages displayed the highest compound specificity among all food groups, with 2-phenylethanol constituting 95.04% of total flavonoids (7963 mg/kg). Nuts and soy products, while containing lower total concentrations (4859 and 1762 mg/kg, respectively), showed distinct compound profiles, with nuts rich in 2-phenylethanol (32.93%) and soy characterized by isoflavones (36.91%).
These findings highlight the differential accumulation patterns of flavonoids across food groups, revealing both specialized (e.g., beverages) and diverse (e.g., fruits, herbs and spices) compound profiles. The data suggest that dietary sources of flavonoids vary not only in total concentration but also in compound composition and diversity, which may have implications for their biological activities and potential health benefits.
4. Discussion
The analysis of 954 phenolic compounds (715 flavonoids, 239 phenolic acids) integrated data from multiple databases to examine their drug-likeness characteristics, molecular interactions, and food source distribution. This integrative approach addresses the fragmentation of phytochemical data across specialized databases and provides a comprehensive characterization framework for these bioactive compounds.
Chemical identifier analysis revealed differential representation across subclasses, with anthocyanins showing 71% representation in ChEMBL compared to chalcones (38%), reflecting database cataloging patterns and research prioritization rather than inherent biological significance. For phenolic acids, hydroxycinnamic acids showed 62% ChEMBL coverage, consistent with their extensive documentation in nutritional research literature [22,23]. These coverage variations highlight the importance of integrating multiple databases to achieve comprehensive compound characterization, as reliance on a single source may introduce systematic biases in compound evaluation.
Structural analysis identified three high-molecular-weight compounds (Lambertianin C, Sanguiin H-6, and Punicalagin) incorrectly categorized as hydroxybenzoic acids in Phenol-Explorer, which molecular examination confirmed as ellagitannins. This classification discrepancy is consistent with previously documented challenges in polyphenol categorization [24,25], as noted by Neveu et al. [26] and Rothwell et al. [27]. Such misclassifications underscore the need for systematic structural verification when integrating data across databases and highlight the value of computational approaches in identifying taxonomic inconsistencies.
Drug-likeness evaluation using established computational metrics revealed complementary assessment patterns. The QED score distribution exhibited a distinctive bimodal pattern, indicating two predominant structural subclasses with differentiated physicochemical profiles. This bimodality likely reflects the inherent structural diversity within flavonoid and phenolic acid subclasses, particularly differences in glycosylation patterns and molecular complexity. The observation is consistent with principal component analyses reported in the recent literature [28], suggesting that these compound subclasses occupy distinct regions of chemical space with respect to pharmaceutical properties.
DataWarrior drug-likeness scores displayed a notably left-skewed distribution. Notably, 25% of compounds showed favorable drug-likeness scores above 0.05, identifying a substantial subset with promising pharmaceutical properties. Modified flavonoids demonstrated superior drug-likeness profiles, with sulfated and glucuronidated conjugates (e.g., 3′-O-Methyl-(-)-epicatechin 4′-O-sulfate, Genistein-7-O-glucuronide-4′-sulfate) achieving the highest scores. This suggests that specific phase II metabolite forms may possess improved computational drug-likeness scores compared to their parent aglycones, though experimental validation is required, an observation with implications for prodrug design and metabolite-based therapeutic strategies.
Correlation analysis between QED and DataWarrior scores revealed a moderate positive relationship, indicating that while both metrics evaluate drug-likeness, they capture complementary aspects of molecular properties through their distinct algorithmic approaches. QED demonstrated expected negative correlations with molecular weight, hydrogen bond donors, hydrogen bond acceptors, and topological polar surface area, consistent with its penalty function for excessive values in these parameters. The moderate correlation strengths suggest that the combined application of multiple drug-likeness metrics provides more comprehensive compound evaluation than single-metric approaches, supporting their complementary use in phytochemical assessment [29,30].
The moderate correlation between QED and DataWarrior scores reflects their distinct algorithmic foundations. QED integrates eight weighted physicochemical descriptors, penalizing deviations from optimal drug-like ranges [19], while DataWarrior evaluates substructure fragment frequencies derived from approved drugs [20]. Consequently, compounds may score favorably on one metric but not the other, depending on their structural features. Isoflavones demonstrate favorable scores on both metrics, whereas flavonols and anthocyanins show lower QED scores primarily due to higher molecular weights and hydrogen bonding capacity, features less penalized by DataWarrior’s fragment-based approach. For practical compound screening, we recommend a complementary dual-metric strategy: compounds ranking favorably on both QED and DataWarrior represent priority candidates for pharmaceutical development, while compounds scoring well on only one metric warrant case-by-case evaluation considering specific structural features and intended applications. This approach leverages the complementary strengths of both algorithms to minimize false negatives in drug-likeness assessment.
Compound evaluation across subclasses revealed isoflavones as demonstrating the most favorable drug-likeness profiles, with moderate molecular weight, acceptable LogP, and hydrogen bonding capacity within optimal ranges. In contrast, more complex subclasses such as flavan-3-ols and flavonols showed lower drug-likeness scores, primarily due to higher molecular weights, excessive hydrogen bond donors, and elevated polar surface areas. Top-performing individual compounds included sativanone, dihydroformononetin, and violanone, which warrant further investigation based on their favorable quantitative parameters and documented bioactivity [28,31]. The level of prior characterization among top-ranked compounds varies considerably. Diosmin (DataWarrior rank #4; DrugBank ID: DB08995) is approved in Europe and Asia for chronic venous insufficiency [32], while naringenin and liquiritigenin (QED ranks #10 and #9) are commercially available dietary supplements [33]. Quercetin and genistein are also extensively studied. In contrast, sativanone (QED rank #1), violanone (QED rank #3), and dihydroformononetin (QED rank #2) remain relatively underexplored despite favorable drug-likeness profiles. The concordance between computational drug-likeness predictions and real-world applications of several top-ranked compounds suggests that integrating multiple in silico metrics with molecular interaction and dietary source data may facilitate the identification of candidates for pharmaceutical development or dietary supplementation strategies.
Molecular interaction analysis identified statistically significant differences in binding affinities between flavonoids and phenolic acids. Flavonoids demonstrated interactions with a broader range of molecular targets (67 unique proteins versus 33 for phenolic acids), with preferential binding to DNA repair proteins (Bloom syndrome protein: 9.70) and hormone receptors (thyroid hormone receptor beta-1: 9.22). In contrast, phenolic acids exhibited selective interaction with membrane transporters (SLCO1B1 and SLCO1B3: 9.00), suggesting distinct biochemical functions [34,35,36,37]. This differential target selectivity has implications for their respective biological activities and potential therapeutic applications, with flavonoids showing broader predicted pharmacological potential based on target diversity and phenolic acids demonstrating more focused cellular transport modulation.
The differential target selectivity observed provides mechanistic insight into the distinct biological activities reported for these compound subclasses. Flavonoid interactions with DNA repair proteins, particularly the Bloom syndrome protein, align with their documented anti-tumor activities. The Bloom syndrome protein is a RecQ helicase essential for genomic stability, and its modulation has been implicated in cancer cell sensitization to DNA-damaging agents [38]. Similarly, flavonoid binding to thyroid hormone receptor beta-1 may contribute to their reported metabolic effects, as this receptor regulates lipid metabolism and thermogenesis [39]. The high interaction strength with MAP kinase ERK2 and c-Jun N-terminal kinase 1 further supports anti-inflammatory mechanisms, since these kinases mediate pro-inflammatory signaling cascades [40]. In contrast, the selective interaction of phenolic acids with membrane transporters SLCO1B1 and SLCO1B3 has distinct functional implications. These organic anion-transporting polypeptides mediate hepatic uptake of endogenous and exogenous compounds, suggesting that phenolic acids may modulate cellular absorption processes and potentially influence drug-nutrient interactions [41]. This transporter selectivity may also facilitate phenolic acid accumulation in hepatocytes, providing a mechanistic basis for their documented hepatoprotective effects [42].
The broader distribution of flavonoid interaction strengths indicates greater molecular promiscuity compared to phenolic acids. This characteristic carries both advantages and limitations. Multi-target engagement may be beneficial for complex diseases involving multiple dysregulated pathways, potentially explaining the pleiotropic health effects attributed to flavonoid-rich diets [43]. However, promiscuity also increases the probability of off-target effects and complicates mechanistic interpretation, as attributing biological outcomes to specific target interactions becomes challenging. Consequently, flavonoids may be better suited for preventive nutritional applications rather than precision therapeutics, while the more selective phenolic acid profile may offer advantages for targeted pharmacological interventions. Comparative analysis identified three shared targets between both compound classes (SLCO1B1, SLCO1B3, and p53); however, interaction patterns differed, with flavonoids showing higher binding affinities and more numerous compounds targeting these shared proteins. This suggests that while both compound classes can interact with similar molecular targets, their binding characteristics and target selectivity profiles remain distinct, potentially enabling complementary therapeutic applications.
Food source analysis mapped compounds across concentration ranges from 500 to 14,500 mg/kg, providing quantitative data for dietary source evaluation. Herbs and spices contained the highest total flavonoid content (45,516 mg/kg), followed by fruits (40,490 mg/kg) and teas (37,101 mg/kg) [44,45,46]. Evening primrose was identified as the highest individual source, containing 14,500 mg/kg, followed by mango at 9000 mg/kg and red wine at 7500 mg/kg. Compound distribution analysis identified specific accumulation patterns across food groups: herbs and spices contained high quercetin percentages (32.74%) across 49 compounds; fruits exhibited the highest compound diversity (62 unique flavonoids) with gallic acid predominating (22.23%); and beverages showed selective compound profiles with 2-phenylethanol comprising 95.04% of total flavonoids. These quantitative distribution data provide an empirical basis for dietary source selection to achieve specific flavonoid intake targets and inform nutritional intervention strategies [44,47]. Importantly, high phytochemical concentration does not equate to dietary relevance. Foods with lower absolute concentrations but higher consumption frequency (e.g., apples, tea) may contribute more substantially to total intake than concentrated but rarely consumed sources (e.g., evening primrose), underscoring the need to consider both concentration and dietary patterns. To illustrate practical applicability, consider prioritizing compounds for anti-inflammatory dietary intervention. The framework identifies isoflavones as optimal candidates based on favorable drug-likeness, documented NF-κB pathway interactions, and accessible dietary sources. Cross-referencing with food source data identifies soy products as concentrated isoflavone sources (36.91% of total flavonoids). Among specific compounds, naringenin and liquiritigenin combine high QED scores (>0.87) with anti-inflammatory target profiles, suggesting these as priority candidates for further investigation. This stepwise approach—integrating drug-likeness, target profiles, and dietary accessibility—demonstrates how the framework supports evidence-based compound prioritization.
Several limitations should be considered when interpreting these findings. First, drug-likeness predictions based on computational metrics require experimental validation through in vitro permeability assays and in vivo bioavailability studies [48,49,50]. The established metrics employed (QED, DataWarrior, Lipinski, Veber) were developed primarily for synthetic pharmaceuticals and may systematically underestimate the pharmaceutical potential of natural products. Classical rules such as Lipinski’s Ro5 were derived from oral drug libraries and do not account for alternative absorption mechanisms utilized by many flavonoids, including active transport via glucose transporters (SGLT1) and organic anion transporters [51]. Notably, natural products and their derivatives frequently violate Ro5 criteria yet demonstrate significant in vivo bioactivity [52]. In our dataset, 47% of compounds violated at least one Ro5 criterion, yet many of these possess well-documented biological activities, suggesting that drug-likeness metrics should be interpreted as initial screening tools rather than definitive exclusion criteria. QED scores are weighted toward oral drug optimization, potentially underestimating compounds absorbed via active transport mechanisms. DataWarrior’s fragment-based approach may penalize novel scaffolds underrepresented in approved drug databases. Additionally, ChEMBL activity data derive from heterogeneous assay conditions across laboratories, introducing variability that may affect cross-study comparisons. In addition, gut microbiota extensively metabolize flavonoids and phenolic acids through glycoside hydrolysis, C-ring cleavage, and dehydroxylation, generating metabolites with altered bioactivity profiles [53,54]. Inter-individual variation in microbiome composition may explain inconsistent clinical outcomes for flavonoid interventions [55]. Additionally, food processing substantially influences compound content and bioaccessibility: thermal processing may degrade heat-sensitive flavonoids while fermentation can hydrolyze glycosides to more bioavailable aglycones [56]. The food source concentrations reported herein represent raw or minimally processed foods; actual dietary exposure will vary with preparation methods. Second, molecular interaction data from ChEMBL represent in vitro binding affinities that may not directly translate to in vivo biological effects due to factors including bioavailability, tissue distribution, and metabolic conversion. Third, food source quantification relies on reported concentrations subject to variation from environmental, agricultural, and analytical factors [57,58], and actual dietary intake depends on food preparation methods and matrix effects that may influence compound stability and bioaccessibility. Future studies should prioritize experimental validation of solubility, permeability, and LogP for the highest-ranked candidates identified herein. Notably, for several top-ranked compounds (e.g., naringenin, quercetin), published experimental LogP values show good agreement with computed values, providing preliminary support for the predictive validity of the computational approach [59].
In conclusion, this integrative cross-database approach provides a systematic characterization framework for flavonoids and phenolic acids, combining established drug-likeness metrics, molecular interaction analysis, and dietary source mapping. The methodology bridges cheminformatics, molecular pharmacology, and food science to establish a comprehensive foundation for evaluating these bioactive compounds. The complementary nature of different drug-likeness metrics supports their combined application in compound assessment, while the molecular interaction and food source data provide context for understanding biological activity and dietary relevance. This analytical framework may be extended to other natural product classes, including terpenoids, alkaloids, and carotenoids, provided that sufficient structural data are available in compatible databases.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Jubaidi F.F. Zainalabidin S. Taib I.S. Hamid Z.A. Budin S.B. The potential role of flavonoids in ameliorating diabetic cardiomyopathy via alleviation of cardiac oxidative stress, inflammation and apoptosis Int. J. Mol. Sci.202122509410.3390/ijms 2210509434065781 PMC 8151300 · doi ↗ · pubmed ↗
- 2Zahra M. Abrahamse H. George B.P. Flavonoids: Antioxidant powerhouses and their role in nanomedicine Antioxidants 20241392210.3390/antiox 1308092239199168 PMC 11351814 · doi ↗ · pubmed ↗
- 3Hasnat H. Shompa S.A. Islam M.M. Alam S. Richi F.T. Emon N.U. Ashrafi S. Ahmed N.U. Chowdhury M.N.R. Fatema N. Flavonoids: A treasure house of prospective pharmacological potentials Heliyon 202410 e 2753310.1016/j.heliyon.2024.e 2753338496846 PMC 10944245 · doi ↗ · pubmed ↗
- 4Ciumărnean L. Milaciu M.V. Runcan O. VesaȘ.C. Răchișan A.L. Negrean V. PernéM.-G. Donca V.I. Alexescu T.-G. Para I. The effects of flavonoids in cardiovascular diseases Molecules 202025432010.3390/molecules 2518432032967119 PMC 7571023 · doi ↗ · pubmed ↗
- 5Geng Q. Yan L. Shi C. Zhang L. Li L. Lu P. Cao Z. Li L. He X. Tan Y. Therapeutic effects of flavonoids on pulmonary fibrosis: A preclinical meta-analysis Phytomedicine 202413215580710.1016/j.phymed.2024.15580738876010 · doi ↗ · pubmed ↗
- 6Yuan D. Guo Y. Pu F. Yang C. Xiao X. Du H. He J. Lu S. Opportunities and challenges in enhancing the bioavailability and bioactivity of dietary flavonoids: A novel delivery system perspective Food Chem.202443013711510.1016/j.foodchem.2023.13711537566979 · doi ↗ · pubmed ↗
- 7Kanakis C.D. Tarantilis P.A. Polissiou M.G. Diamantoglou S. Tajmir-Riahi H.A. An overview of DNA and RNA bindings to antioxidant flavonoids Cell Biochem. Biophys.200749293610.1007/s 12013-007-0037-217873337 · doi ↗ · pubmed ↗
- 8D’Arrigo G. Gianquinto E. Rossetti G. Cruciani G. Lorenzetti S. Spyrakis F. Binding of Androgen- and Estrogen-Like Flavonoids to Their Cognate (Non)Nuclear Receptors: A Comparison by Computational Prediction Molecules 202126161310.3390/molecules 2606161333799482 PMC 8001607 · doi ↗ · pubmed ↗
