
Method for Developing Object and Image Recognition Systems Using Transfer Learning in Computer Vision

Written By

Lenin Huayta Flores, Lucio Ticona Carrizales, Lenin Samir Huayta Poma and Betty Campos Segales

Submitted: 03 June 2025 Reviewed: 24 June 2025 Published: 07 August 2025

DOI: 10.5772/intechopen.1011744


From the Edited Volume

Transfer Learning - Unlocking the Power of Pretrained Models [Working Title]

Dr. Pier Luigi Mazzeo and Associate Prof. Alessandro Bruno


Abstract

Object and image recognition through transfer learning is a key area in computer vision and deep learning. This process encompasses several fundamental stages for developing efficient models. It begins with data collection, ensuring a representative and high-quality dataset with proper annotation and regulatory compliance. Next, preprocessing enhances image quality using techniques such as normalization and segmentation. Subsequently, the use of pretrained deep neural networks leverages prior knowledge to optimize learning. During the training phase, the model fine-tunes its parameters through iterative optimization to identify relevant patterns. In the validation phase, the model’s performance is evaluated, allowing for hyperparameter tuning to improve generalization and robustness across different scenarios. Finally, in the recognition phase, the trained model makes predictions on new data, enabling automated decision-making or actions. This structured approach enhances recognition accuracy and model generalization, ensuring robust performance across various computer vision applications.

Keywords

  • computer vision
  • convolutional neural networks
  • image preprocessing
  • image recognition
  • transfer learning

1. Introduction

In the current context of artificial intelligence (AI), image recognition systems have gained increasing relevance in various fields, such as medicine, security, agriculture, and industry, thanks to advances in deep learning. Transfer learning (TL) reuses a pretrained architecture for a new problem [1, 2] or for a similar task. Pretrained models exist for a wide range of tasks: classification (AlexNet, ConvNeXt, MaxVit, RegNet, ShuffleNet V2, SwinTransformer, Wide ResNet); semantic segmentation (DeepLabV3, FCN (fully convolutional network), and LRASPP (Lite Reduced Atrous Spatial Pyramid Pooling)); object detection (Faster R-CNN (faster region-based convolutional neural network), FCOS (fully convolutional one-stage object detection), RetinaNet, SSD (single shot detector), and SSDlite); video classification (Video MViT (multiscale vision transformer), Video ResNet, Video S3D, and Video SwinTransformer); optical flow (RAFT (Recurrent All-Pairs Field Transforms)); instance segmentation (Mask R-CNN (mask region-based convolutional neural network)); keypoint detection (Keypoint R-CNN); and quantized models (Quantized GoogLeNet, Quantized MobileNet V2, Quantized MobileNet V3, Quantized ShuffleNet V2, Quantized InceptionV3, Quantized ResNet, and Quantized ResNeXt). Transfer learning is particularly useful when data are scarce, private, or costly to obtain or label, as it enables leveraging knowledge from a broader domain [2, 3]. There are multiple approaches to transfer learning: reusing only the network structure; using both the architecture and the pretrained weights, either partially adjusting them (e.g., final layers) or fully fine-tuning all layers; or adding layers to adapt the model to the new task. This technique optimizes computational resources and improves performance on problems with limited or small-scale data [4]. This chapter presents a comprehensive methodological approach for the development of an image recognition system aimed at computer vision (CV) tasks, based on modern deep learning techniques.
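
As a minimal sketch of this reuse (assuming PyTorch and torchvision 0.13 or later are installed; the 10-class head is a hypothetical target task), a pretrained classification backbone can be loaded and adapted as follows:

import torch
from torchvision import models

# Load a ResNet-50 pretrained on ImageNet (weights are downloaded once).
weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights)

# Option 1: reuse only the architecture as a template (random weights).
scratch_model = models.resnet50(weights=None)

# Option 2: reuse architecture and weights, replacing the classifier head
# for a new task with, e.g., 10 classes.
model.fc = torch.nn.Linear(model.fc.in_features, 10)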

The proposed methodology consists of six sequential phases: data collection, image preprocessing, pretraining, training, validation, and recognition. Each of these phases fulfills a specific and complementary role within the overall process. The data collection phase includes the precise definition of the task, acquisition and labeling of datasets, as well as verifying their quality and compliance with legal and ethical requirements. Image preprocessing aims to optimize the model’s input conditions through techniques such as resizing, normalization, contrast adjustment, segmentation, and data augmentation. Next, the pretraining phase incorporates previously trained models by leveraging transfer learning or feature extraction, allowing for reduced training time and improved performance. The training phase focuses on pattern learning by DCNNs (deep convolutional neural networks), using optimization algorithms such as gradient descent. The model validation phase evaluates performance with metrics like accuracy, precision, and recall, using methods like cross-validation and hyperparameter tuning to prevent overfitting.

Finally, the recognition phase involves applying the trained model to real-world contexts, enabling high-accuracy image identification and classification. This chapter details each stage, offering a practical and conceptual guide for those developing intelligent computer vision systems.

2. Phases

The six phases comprise a sequential methodological process that guides the development of image recognition systems based on deep learning and transfer learning. These phases are data collection, image preprocessing, pretraining, training, validation, and recognition, as shown in Figure 1. It is worth mentioning that, although they are presented in an orderly manner, the completion of these phases is not necessarily rigid or strict. In practice, the process may require iterations between phases, on-the-fly adjustments, or even the omission of some stages depending on the context, the nature of the problem, data availability, technical resources, or the specific objectives of the project.

Figure 1.

Method for developing object recognition systems using transfer learning in computer vision.

2.1 Data collection

A dataset of images is a structured collection of visual representations, categorized into one or more classes, stored locally or in the cloud, and designed to be processed and used for training machine learning (ML) models. In this phase, the image dataset [5, 6, 7, 8, 9] containing examples or data related to the task at hand is obtained. The framework begins with defining the task and designing the dataset to align it with existing resources, followed by data acquisition and immediate legal compliance verification. Subsequently, data quality is evaluated and labeling adjustments are made, applying dataset augmentation if the information is insufficient. Finally, each stage is continuously documented and iterated as necessary, ensuring efficient, ethical, and effective use of available data and models while minimizing redundancy in the process. Data collection can be carried out through a logistical activity (e.g., setting up traffic cameras), a technical activity (software connected to a digital catalog database), or a commercial activity (purchasing an image archive) [10]. Closely related topics are data sharing and reuse [11] and dataset discovery and exploration [12].

Task definition involves understanding the image recognition task to be performed: whether it is object recognition [13, 14, 15, 16] (locating multiple instances with bounding boxes), image classification [17, 18] (assigning a global label), or face detection [19, 20, 21, 22, 23, 24, 25] (identifying human faces, possibly extending to facial recognition). Each variant requires different technical approaches, from neural network architectures (YOLO (You Only Look Once), ResNet (Residual Network), MTCNN (multi-task cascaded convolutional neural network)) to specialized datasets (COCO (Common Objects in Context), ImageNet, WIDER Face), and determines the choice of metrics and computational resources needed for effective implementation.

The design of the image dataset, or acquiring a representative dataset for the task, is a fundamental stage in developing computer vision systems. This process involves the careful collection of images from diverse sources, such as digital cameras, scanners, mobile devices, specialized sensors, or public databases [26] available in academic repositories and specialized platforms. Additionally, when existing data are insufficient or do not meet specific project requirements, it may be necessary to manually generate images [27] using techniques like controlled photography, three-dimensional (3D) rendering, or image synthesis via generative algorithms. It is crucial that the acquired dataset is diverse, balanced (for example, via SMOTE (Synthetic Minority Oversampling Technique) [28]), and representative [29, 30] of the real-world conditions where the system will operate, considering factors such as lighting variations, angles, resolutions, and possible noise or artifacts, so that trained models can generalize properly in practical scenarios.
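
As an illustrative sketch of class balancing (assuming the imbalanced-learn package; the toy 8 × 8 images are stand-ins for real data), note that SMOTE operates on flattened feature vectors, not on raw image tensors:

import numpy as np
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset: 90 images of class 0, 10 of class 1 (8x8 grayscale).
rng = np.random.default_rng(0)
X = rng.random((100, 8, 8, 1))
y = np.array([0] * 90 + [1] * 10)

X_flat = X.reshape(len(X), -1)                 # SMOTE expects 2D feature vectors
X_res, y_res = SMOTE(random_state=42).fit_resample(X_flat, y)
X_res = X_res.reshape(-1, 8, 8, 1)             # back to image shape
print(np.bincount(y_res))                      # classes are now balanced: [90 90]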

Data acquisition is the process of finding datasets to train models [31]. Three approaches are identified [32]: Data discovery involves searching and indexing datasets in repositories such as corporate data lakes or the web using tools like DataHub, Google Fusion Tables, or Kaggle, which combine collaborative analysis, public management, and access to massive repositories. Search specializes in locating data either in corporate lakes or via HTML (Hypertext Markup Language) table extraction. Data augmentation complements existing datasets by integrating external sources or generating new data through crowdsourcing or synthetic data [31, 33]. Finally, data generation creates datasets from scratch using crowdsourcing (with procedural or declarative tasks) or synthetic techniques, such as generative adversarial networks (GANs) [34], reinforcement learning policies, or specific methods (Semantic Clustering and Partitioning (SCPN)/Semantic Extraction and Relation (SEAR) in natural language processing (NLP)) [32, 33].

Consent and legal compliance entail that collecting visual data involving people or objects in public or private spaces must be conducted under strict adherence to legal frameworks related to information protection and privacy [35, 36]. When collected data allow direct or indirect identification of individuals, obtaining informed, free, and explicit consent is essential, following principles of legality, transparency, and accountability. Complementary measures, such as data minimization [37, 38], anonymization [39, 40], and access controls, must be considered to ensure image processing is proportional, secure, and consistent with predefined purposes.

Ensuring the quality of collected data involves cleaning the data by removing defective images (artifacts, low resolution, distortions); correcting labeling errors by verifying that each label exactly corresponds to the visual content and by fixing ambiguities or inconsistencies; and normalizing data [41] by standardizing formats, sizes, color spaces, and statistical distributions to guarantee consistency during training. Data quality [42, 43, 44] and labeling accuracy are critical, as errors in this phase can propagate during training and negatively impact the model’s final performance.

Labeling adjustments are made because effectively training a deep learning model usually requires prior data labeling [4, 45, 46]. This process assigns specific labels or categories [47] to each image in the dataset, enabling the model to learn patterns through supervised learning. When building knowledge bases from the web, labeling initially assumes that extracted facts are correct. Techniques include using existing labels [31]; crowd-based methods such as active learning and collective collaboration [32]; weak supervision, which generates imperfect labels at scale (e.g., data programming using multiple labeling functions instead of individual ones) [31, 48]; and data extraction leveraging existing knowledge bases to obtain entity attributes. These strategies balance volume, accuracy, and efficiency, adapting to contexts with initially scarce labels.

Dataset augmentation is performed if the existing image dataset is small and it is necessary to increase its size to improve model performance; this includes collecting more data, generating synthetic data, or increasing diversity within the dataset [2, 49, 50, 51].

Dataset documentation is very important and should include information about its origin, label structure, and any other relevant details so others can understand and use the dataset [52, 53].

2.2 Image preprocessing

This second phase consists of the manipulation and enhancement of images before they are used in model training or other applications.

Resizing adjusts the dimensions of images while maintaining the aspect ratio [54], defined as the width of the image divided by its height.

Normalization ensures that pixel values are within a specific range, such as [0, 1] or [−1, 1] [55]. Normalization helps standardize data and speeds up the training process.
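
A minimal sketch of these two steps, assuming OpenCV and NumPy (the file name and the 224-pixel target width are illustrative choices):

import cv2
import numpy as np

image = cv2.imread("input.jpg")                   # BGR image loaded from disk

# Resize to a target width while preserving the aspect ratio (width/height).
target_w = 224
aspect = image.shape[1] / image.shape[0]          # width / height
resized = cv2.resize(image, (target_w, int(round(target_w / aspect))))

# Normalize pixel values from [0, 255] to [0, 1], then optionally to [-1, 1].
norm01 = resized.astype(np.float32) / 255.0
norm11 = norm01 * 2.0 - 1.0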

Adjusting the contrast and brightness of images improves the visibility of details [56]; this is useful when working with images of variable quality.

Cropping removes pixels from an image’s edges [57]; sometimes, it is necessary to crop the region of interest to focus on a specific part of the image for recognition tasks.

Rotation and transformation involve aligning all images to the same orientation or applying geometric transformations to correct distortions. In image processing, the rotation anchor point is the top-left corner at (0, 0) [58]. An image is rotated by an angle θ by transforming the (x, y) coordinates of each pixel relative to the origin [59, 60].

Morphological transformations, focused on the shape and structure of image features [61], include basic operations like erosion, which shrinks objects by removing edge pixels, and dilation, which expands objects by adding pixels [62]. Opening smooths contours and removes protrusions; closing fills holes and connects regions [63]. Advanced techniques such as the hit-or-miss transformation (specific shape detection), the morphological gradient (difference between dilation and erosion to highlight edges), and top-hat transforms (extracting small elements: white for bright, black for dark) optimize structure analysis [62, 64]. Algorithms like boundary extraction, hole filling, skeletonization (structural representation via iterative erosions), and convex hull (convexity approximation) allow describing shapes and patterns [65]. Operations such as thinning (narrowing structures without losing connectivity) and pruning (removal of spurious pixels) refine results, while morphing (interpolation between images) gradually transforms structures [66]. Applicable to binary and grayscale images, these techniques are essential in pattern recognition, medical image processing, and computer vision [67].

Geometric transformations rearrange image pixels through basic operations such as translation (displacement along the x and y axes using matrices) [54, 57, 68], rotation (coordinate adjustment with trigonometric formulas for angle θ) [58, 60], scaling (resizing while preserving aspect ratio) [54, 69], flipping (reflection about an axis) [58, 70], and cropping (extracting subregions using window and plop operators) [66]. Techniques like interpolation (nearest neighbor, bilinear, and bicubic) adjust resolution while preserving visual quality [71, 72], whereas advanced transformations like affine (combination of rotation, scaling, shear, and translation in homogeneous coordinates) [61] and projective (mapping quadrilaterals for perspective correction) [73] allow complex manipulation. Methods like deformation [61, 72], polar mapping [66], and barrel/pincushion transformations (non-linear radial distortion) [66] adapt images to specific geometric systems.

Image arithmetic, consisting of addition, subtraction, and multiplication with values clipped to [0, 255] [54, 68], and masking (focusing on specific areas) [54, 74] optimize data for analysis, while techniques like Procrustes alignment (translation, scaling, and rotation to preserve shape) [61] ensure accuracy in computer vision applications. These operations, based on matrix algebra and coordinate systems [60, 66], are essential in image processing to improve quality, adaptability, and computational efficiency.
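
The following sketch illustrates rotation about the image center and the basic morphological operations, assuming OpenCV (the file name, the 30-degree angle, and the 3 × 3 kernel are illustrative):

import cv2
import numpy as np

image = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)
h, w = image.shape

# Rotate by 30 degrees about the image center (borders are filled with 0).
M = cv2.getRotationMatrix2D((w / 2, h / 2), 30, 1.0)
rotated = cv2.warpAffine(image, M, (w, h))

# Basic morphological operations on a binarized version of the image.
_, binary = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
kernel = np.ones((3, 3), np.uint8)
eroded = cv2.erode(binary, kernel)                               # shrink objects
dilated = cv2.dilate(binary, kernel)                             # expand objects
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)        # remove protrusions
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)       # fill small holes
gradient = cv2.morphologyEx(binary, cv2.MORPH_GRADIENT, kernel)  # dilation - erosion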

Data augmentation creates modified images (rotations, shifts, brightness) to boost diversity and improve model generalization. Models trained with data augmentation on minority classes and transfer learning provide the highest classification accuracies [49]. There are many methods for data augmentation, among which Kumar [75] mentions deriving latent semantics, entity augmentation, and data integration. Advanced data augmentation techniques include adversarial training [76], synthetic data generation with GANs [34] or variational autoencoders (VAEs) [77], advanced geometric transformations, semantic-based augmentations, Mixup [78], CutMix [79], and controlled noise-based data augmentation.
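
A brief sketch of an augmentation pipeline, assuming torchvision transforms over PIL input images; the Mixup helper follows the convex-combination formulation of [78], with α = 0.2 as an illustrative choice and one-hot float labels:

import torch
from torchvision import transforms

# Randomized geometric and photometric augmentations applied at training time.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])

# Mixup [78]: convex combination of two images and their one-hot labels.
def mixup(x1, y1, x2, y2, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2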

Color correction involves adjusting the color balance of images to correct lighting and color representation problems [80, 81].

Image histogram transformation works on an abstraction of an image in which the frequency of each image value (contrast/brightness/intensity) is determined [59, 71, 74, 82]. A histogram shows the distribution of pixel intensities in an image, visualized as a graph that summarizes intensity levels [74].

One-dimensional (1D) histograms, or grayscale image histograms, graphically represent the distribution of pixel values, where the x-axis indicates intensities (0–255) and the y-axis their frequencies [74, 82]. In color images, histograms can analyze individual channels (RGB (Red, Green, Blue), HSL (Hue, Saturation, Lightness)) or two-dimensional (2D)/3D combinations, though the latter require reduced quantization to avoid complexity [64, 71]. Masks allow focusing on specific regions by calculating partial histograms to adjust contrast or segment objects [74].

Histogram equalization improves contrast by redistributing values through the cumulative distribution function (CDF), idealizing a uniform histogram [64, 83]. Techniques like histogram matching transform values to align distributions with a desired target, which is useful in color correction [83]. Additionally, histogram comparison and backprojection facilitate retrieving similar images or detecting objects through probability analysis [64, 71]. These tools are essential in digital processing to optimize quality, segmentation, and comparative analysis. In short, histogram transformation redistributes the intensity levels of an image to highlight certain features.
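
As a sketch of histogram computation, global equalization, and masked (partial) histograms with OpenCV (the file name and region of interest are illustrative):

import cv2
import numpy as np

gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# 1D histogram: frequency of each intensity level 0-255.
hist = cv2.calcHist([gray], [0], None, [256], [0, 256])

# Global histogram equalization: redistributes intensities via the CDF.
equalized = cv2.equalizeHist(gray)

# A mask restricts the histogram to a region of interest (a centered box here).
h, w = gray.shape
mask = cv2.rectangle(np.zeros((h, w), np.uint8),
                     (w // 4, h // 4), (3 * w // 4, 3 * h // 4), 255, -1)
masked_hist = cv2.calcHist([gray], [0], mask, [256], [0, 256])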

Regarding filtering and smoothing, noise can be caused by factors such as poor lighting or distortion during image acquisition; filtering techniques are used to remove unwanted details and to improve, smooth, or enhance features in images. Filters modify the appearance of an image through operations in the spatial domain (direct manipulation of pixels) or the frequency domain (transforms like Fourier).

In the spatial domain, linear filters (weighted combinations of neighboring pixels) include the mean filter (smoothing, but blurs edges) [61] and the Gaussian filter (noise reduction with a kernel based on σ) [67], while non-linear filters like the median filter (removes “salt and pepper” noise) [61] and order filters (max/min to enhance bright/dark points) [67] preserve edges. Edge detection filters use derivatives: Sobel and Prewitt (horizontal/vertical gradients) [84], Canny (non-maximum suppression and hysteresis thresholding) [85], and Laplacian (second derivative for zero crossings) [61].

In the frequency domain, low-pass filters (smoothing: ideal, Butterworth, Gaussian) [60] and high-pass filters (edge enhancement: Butterworth, Gaussian) [86] adjust spectral components, while techniques like CLAHE (contrast-limited adaptive histogram equalization) [87] and homomorphic filtering (illumination/reflectance correction via logarithm and Fourier) [88] enhance contrast in complex images. Advanced methods like the Frangi filter (detection of tubular structures via Hessian eigenvalues) [89] and unsharp masking (sharpening by subtracting a smoothed version) [61] optimize applications in computer vision, medical processing, and image restoration.
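
A sketch of common spatial filters and CLAHE with OpenCV (kernel sizes, σ, and thresholds are illustrative):

import cv2

gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

blurred = cv2.GaussianBlur(gray, (5, 5), 1.5)          # Gaussian smoothing
denoised = cv2.medianBlur(gray, 5)                     # removes salt-and-pepper noise
edges = cv2.Canny(gray, 100, 200)                      # hysteresis thresholds 100/200
lap = cv2.Laplacian(gray, cv2.CV_64F)                  # second-derivative edges
sobel_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)   # horizontal gradient

# CLAHE: contrast-limited adaptive histogram equalization on 8x8 tiles.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)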

Artifact removal involves eliminating unnecessary or irrelevant data from the image, such as watermarks, text, or unwanted artifacts.

Binarization consists of converting an image into a binary image where pixels are classified as black or white. This is useful for object segmentation applications.

Segmentation refers to separating objects of interest from the background or other objects in the image. Segmentation is useful when specific objects need to be identified and analyzed. It divides an image into homogeneous regions based on attributes such as discontinuities (edges) or similarities (intensity, texture) to facilitate interpretation [61, 62, 90]. Edge-based methods use operators like Sobel, Prewitt, Canny, and Scharr to identify abrupt intensity transitions [62, 64], while histogram-based approaches apply global thresholding (Otsu, variance maximization; Kapur, maximum entropy) or generalized entropies (Tsallis, Rényi) to separate objects and background [90, 91, 92]. In environments with variable lighting, adaptive thresholds (Niblack, Sauvola, Bernsen) dynamically adjust thresholds based on local statistics (mean, standard deviation, range), optimizing segmentation in subregions [82, 93, 94]. These techniques, fundamental in computer vision, enable applications such as Optical Character Recognition (OCR), medical diagnosis, and texture analysis, balancing accuracy and computational efficiency.
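
A sketch of global (Otsu) and adaptive thresholding with OpenCV (the window size and constant are illustrative):

import cv2

gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Global Otsu threshold: chosen automatically by maximizing between-class variance.
otsu_t, otsu_bin = cv2.threshold(gray, 0, 255,
                                 cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Adaptive (local) threshold: per-neighborhood Gaussian-weighted mean,
# robust to uneven illumination (31x31 window, constant C = 5).
adaptive = cv2.adaptiveThreshold(gray, 255,
                                 cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 31, 5)

# Edge-based segmentation cue: Canny edge map.
edges = cv2.Canny(gray, 50, 150)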

2.3 Pretraining

A pretrained model is a deep learning model trained on large datasets for general tasks like image or text recognition, ready to be adapted for various uses.

Regarding transfer learning, large pretrained models exhibit a remarkable ability to generalize and transfer knowledge across different datasets and tasks, making them highly effective tools for a wide range of applications. Transfer learning fine-tunes a pretrained model for a specific task or smaller dataset [76, 95]. Instead of training the model from scratch, the weights and parameters of the pretrained model are adjusted to adapt it to the new task. This is especially useful when the new dataset is small or when there is insufficient time to train a model from scratch, since the pretrained model has already captured useful general features. Transfer learning techniques can be implemented in various ways. One option is to use only the base architecture as a template, without incorporating the previously learned weights, which corresponds to the feature extraction approach. Another option is to leverage both the structure and pretrained weights of the original model, known as direct utilization. In this latter case, different strategies are possible: adjusting only certain layers, such as the output layer, to adapt them to the new domain; performing full fine-tuning by modifying all the model parameters during retraining; or expanding the architecture by adding new specialized layers at the end, which is associated with the multitask learning approach. Transfer learning methodologies are categorized into four approaches: feature extraction, fine-tuning, domain adaptation, and multitask learning [76]; the technique has proven successful in domains like computer vision and natural language processing [95].

Feature extraction uses a pretrained convolutional neural network’s (CNN’s) learned features as input for a new dataset or model without fine-tuning the entire model [16, 96].

In some cases, pretrained models are also used directly without fine-tuning, especially if the tasks are general ones for which the model was originally trained, such as object recognition in images or natural language processing. We can potentially improve the estimation of a local model by relying on the global model without using it directly; this idea can be called soft transfer learning, as opposed to hard transfer learning, where the global model is used directly without adjustments [97].

Fine-tuning focuses on adapting the upper layers while preserving the generic features learned in the initial layers from the source task, thereby reducing overfitting [76]. Fine-tuning a model trained on a source domain helps reduce the amount of data and training time required in a new target domain by avoiding training from scratch and decreasing data collection costs [95], significantly improving classification accuracy [96]. Fine-tuning adjusts a pretrained model’s weights for a task, improving performance but needing more computation. Feature extraction uses pretrained weights without changes, saving resources for small data [76, 96]; ablation studies help determine which layers should be fine-tuned depending on the case.
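
A sketch contrasting the two strategies (feature extraction versus partial fine-tuning) on a torchvision ResNet-50, assuming a hypothetical 10-class target task:

import torch
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Feature extraction: freeze every pretrained layer, train only a new head.
for param in model.parameters():
    param.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # new 10-class head

# Fine-tuning: additionally unfreeze the last residual block (layer4).
for param in model.layer4.parameters():
    param.requires_grad = True

# Optimize only trainable parameters, with a small fine-tuning learning rate.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)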

Domain adaptation reduces differences between source and target domains so models work well across them [95, 98]. Rule adaptation reuses general knowledge from previous tasks to facilitate new ones, while domain adaptation aligns datasets with similar distributions to enable the use of a single classifier across different contexts [96].

Multitask learning trains a model on multiple related tasks simultaneously, leading to better generalization and performance by sharing representations—particularly useful when data per task are limited [96]. However, it faces challenges, such as overfitting, catastrophic forgetting, and task interference; therefore, in NLP, separate models are often preferred for each task [99]. Multitask transfer learning improves performance and efficiency by sharing representations across related tasks in real-world applications.

2.4 Training

In this phase, a model is taught to recognize patterns, features, or classes in images.

The training dataset consists of labeled images, where each label indicates the class or category to which the image belongs. For training with reduced datasets, Lu and Li [100] propose a method combining data augmentation with transfer learning, allowing convolutional neural networks (CNNs) to avoid the overfitting problem and achieve excellent classification accuracy.

The choice of an appropriate model, such as a DCNN, is important. The model has layers of neurons that will be trained to learn patterns in images. Model selection involves balancing approximation and estimation errors, adding complexity due to the transfer distance between source and target distributions, which varies by hypothesis class [101].

The model’s parameters are initialized with random values [102], which are then adjusted during training so that the model can make accurate predictions.

In optimization, a loss function measures the gap between model predictions and true labels in training.

The goal is to minimize loss by iteratively adjusting model parameters using optimization algorithms like gradient descent. In transfer learning, when pretrained models are reused and adapted to a new classification task, the cross-entropy loss function remains one of the most commonly used, especially in multi-class classification tasks. This function combines LogSoftmax (logarithm of the softmax function) and NLLLoss (negative-log-likelihood loss), just like in training from scratch [21].

The training process involves iterations through the training set. In each iteration, the model processes one or several images, makes predictions, and adjusts its parameters based on the calculated loss. This process repeats many times (epochs) until the model converges or a predefined stopping criterion is met.
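
A compact sketch of this loop in PyTorch, assuming model is the network prepared in the pretraining phase and train_loader is a torch.utils.data.DataLoader over the labeled training set (both names are placeholders); the criterion is the cross-entropy loss discussed above:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # combines LogSoftmax and NLLLoss [21]
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(10):                        # the epoch count is a hyperparameter
    for images, labels in train_loader:        # mini-batches of labeled images
        optimizer.zero_grad()                  # clear gradients from the last step
        loss = criterion(model(images), labels)  # gap between predictions and labels
        loss.backward()                        # backpropagate the loss
        optimizer.step()                       # gradient-descent parameter update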

When necessary, hyperparameters—learning rate, batch size, or architecture—can be adjusted [101].

2.5 Validation

Validation is performed after model training and aims to evaluate the model’s performance and generalization capability on unseen data.

The validation dataset is an independent set containing images not used during training; it is used to tune hyperparameters and prevent overfitting. These images are also labeled with their corresponding categories, while a separate test set evaluates the final model performance.

Predictions on the validation set are made using the trained model. Each image is passed through the model, which generates class predictions or continuous values, depending on the specific task.

The analysis and interpretation of predictions aims to explain the model’s decisions, detect failures, and validate that it uses relevant features through interpretability tools. Some interpretability techniques include Grad-CAM (Gradient-weighted Class Activation Mapping) [103], saliency maps, LIME (Local Interpretable Model-Agnostic Explanations) [104], SHAP (SHapley Additive Explanations) [105], feature importance in models such as Random Forest or XGBoost, PDPs (partial dependence plots), and integrated gradients.

Performance evaluation consists of comparing the model’s output with the true labels in the validation set. The model’s performance is assessed using metrics appropriate for the task, such as accuracy, precision, recall, F1-score, among others [8], and these reflect the quality of the model’s predictions on unseen data.
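
A sketch of these metrics with scikit-learn, using toy labels in place of real validation outputs (macro averaging, which weights all classes equally, is one reasonable choice for multi-class tasks):

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# y_true: ground-truth labels; y_pred: model predictions on the validation set.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1-score :", f1_score(y_true, y_pred, average="macro"))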

Model hyperparameters may need adjustment if the performance on the validation set is unsatisfactory. Such adjustments can include changes in learning rate, model architecture, batch size, among others.

Validation helps detect and prevent overfitting, where the model fits training data too closely and fails on new data. Ahmed [1], Dawson et al. [7], and Alhanaee et al. [22] use image augmentation to expand training data, improving generalization and reducing overfitting. Selectively fine-tuning the top layers while sharing the initial layers with the source task reduces overfitting and favors better feature reuse [76].

In some cases, cross-validation is used, a technique that involves splitting the dataset into multiple partitions (folds) [1, 6, 7, 97] and performing multiple rounds of training and validation. This provides a more robust estimate of the model’s performance. Cross-validation allows evaluating the statistical significance between models, eliminating dependence on chance [6].
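
A sketch of stratified k-fold splitting with scikit-learn (the arrays are stand-ins for image features and labels; training and evaluation would run inside the loop):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape(100, 1)   # stand-in for image features or indices
y = np.array([0] * 50 + [1] * 50)    # stand-in labels

# 5-fold stratified cross-validation: each fold preserves class proportions.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Train on X[train_idx]/y[train_idx]; validate on X[val_idx]/y[val_idx].
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} validation")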

The model is saved once it achieves satisfactory performance, for later use in the recognition stage.

2.6 Recognition

Recognition involves using a pretrained model to identify objects, patterns, or features in images, prerecorded videos, or real-time video.

First, there must be a pretrained model; this model has learned to identify patterns and features in images, thanks to the training phase. The trained model can be used directly for the target task or a new task [98]. The model’s performance may degrade due to changes in the network, traffic patterns, usage behavior, or the incorporation of new devices [95].

The input data provide images for the model to recognize. These images can be captured in real-time through a camera or loaded from a file.

Before feeding the images to the model, preprocessing may be necessary; this includes normalization [55], resizing [54], and in some cases segmentation [61, 62, 90] or noise removal [60, 61, 67, 84, 85, 86, 87, 88].

The model processes the images and generates predictions; in most cases, it provides a class label or a probability associated with each class, depending on whether you are performing image classification or an object detection task.

Predictions may require post-processing depending on the application; this could include removing low-confidence predictions, grouping similar predictions, or generating additional information from the predictions, such as bounding boxes in the case of object detection.
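
A sketch of the full inference path, assuming a torchvision ResNet-50, its bundled preprocessing transforms, and a hypothetical input file; low-confidence predictions are filtered in post-processing:

import torch
from torchvision import models
from torchvision.io import read_image

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()          # resize + normalization used in training

img = read_image("photo.jpg")              # illustrative input file
batch = preprocess(img).unsqueeze(0)       # add batch dimension

with torch.no_grad():                      # inference only, no gradients
    probs = model(batch).softmax(dim=1)[0]

# Post-processing: keep only the top-5 predictions above a confidence threshold.
conf, idx = torch.topk(probs, k=5)
for c, i in zip(conf, idx):
    if c >= 0.10:
        print(weights.meta["categories"][i], float(c))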

The results are presented in a meaningful way for the end user; this consists of displaying class labels, locations of detected objects in the image, or any other relevant information.

Based on the model’s predictions, automated decisions can be made, or information can be provided to a human user for informed decision-making.

Regarding iteration, the image recognition system can operate continuously, providing real-time predictions and improving over time as more data are collected and the model is recalibrated in real-world applications.

3. Conclusion

The comprehensive methodology for the development of image/object recognition systems based on deep learning is structured into six sequential phases: data collection, preprocessing, pretraining, training, validation, and recognition. By leveraging transfer learning with pretrained architectures, computational resources are optimized and performance is improved in scenarios with limited data, as is the case in specialized domains, while prioritizing ethical and legal aspects in the acquisition, labeling, and anonymization of datasets. The preprocessing phase integrates advanced visual manipulation techniques to ensure quality and diversity in the model inputs, while training and validation employ optimization algorithms, evaluation metrics, and strategies against overfitting. Finally, the recognition phase applies the model in real-world contexts, demonstrating its capability to classify, detect objects, or segment images with high accuracy. This systemic approach, supported by technical examples and practical considerations, offers a robust guide for researchers and professionals, facilitating the implementation of efficient and scalable solutions in computer vision, with cross-disciplinary applications in medicine, security, industry, and more, thereby contributing to the advancement of adaptive and ethical artificial intelligence systems.

Conflict of interest

The authors declare that they have no conflict of interest.

Acronyms

1D

one-dimensional

2D

two-dimensional

3D

three-dimensional

AI

artificial intelligence

CNN

convolutional neural network

COCO

common objects in context

CV

computer vision

DCNN

deep convolutional neural network

GAN

generative adversarial networks

ML

machine learning

MTCNN

multi-task cascaded convolutional neural networks

NLP

natural language processing

TL

transfer learning

YOLO

you only look once

References

  1. Ahmed MR. Leveraging convolutional neural network and transfer learning for cotton plant and leaf disease recognition. International Journal of Image, Graphics and Signal Processing. 2021;10:47
  2. Fayaz S, Ahmad Shah SZ, Ud din NM, et al. Advancements in data augmentation and transfer learning: A comprehensive survey to address data scarcity challenges. Recent Advances in Computer Science and Communications (Formerly Recent Patents on Computer Science). 2024;17:14-35
  3. Teixeira AC, Ribeiro J, Morais R, et al. A systematic review on automatic insect detection using deep learning. Agriculture. 2023;13:713
  4. Elsen I, Ferrein A, Schiffer S. Anomaly Detection in Metal-Textile Industries. London, UK: IntechOpen; 2025. DOI: 10.5772/intechopen.1008411
  5. Mutawa AM, Alnajdi S, Sruthi S. Transfer learning for diabetic retinopathy detection: A study of dataset combination and model performance. Applied Sciences. 2023;13:5685
  6. Luna-Jiménez C, Griol D, Callejas Z, et al. Multimodal emotion recognition on RAVDESS dataset using transfer learning. Sensors. 2021;21:7665
  7. Dawson HL, Dubrule O, John CM. Impact of dataset size and convolutional neural network architecture on transfer learning for carbonate rock classification. Computers and Geosciences. 2023;171:105284
  8. Khan A, Hassan B, Khan S, et al. DeepFire: A novel dataset and deep transfer learning benchmark for forest fire detection. Mobile Information Systems. 2022;2022:5358359
  9. Mei X, Liu Z, Robson PM, et al. RadImageNet: An open radiologic deep learning research dataset for effective transfer learning. Radiology: Artificial Intelligence. 2022;4:e210315
  10. Lakshmanan V, Görner M, Gillard R. Practical Machine Learning for Computer Vision. Sebastopol, CA: O’Reilly Media, Inc.; 2021
  11. Chapman A, Simperl E, Koesten L, et al. Dataset search: A survey. VLDB Journal. 2020;29:251-272
  12. Paton NW, Chen J, Wu Z. Dataset discovery and exploration: A survey. ACM Computing Surveys. 2023;56:1-37
  13. Funabashi S, Yan G, Hongyi F, et al. Tactile transfer learning and object recognition with a multifingered hand using morphology specific convolutional neural networks. IEEE Transactions on Neural Networks and Learning Systems. 2022;35(6):7587-7601. DOI: 10.1109/TNNLS.2022.3215723
  14. Moghimi MK, Mohanna F. Reliable object recognition using deep transfer learning for marine transportation systems with underwater surveillance. IEEE Transactions on Intelligent Transportation Systems. 2022;24:2515-2524
  15. Fan Z, Shi L, Liu Q, et al. Discriminative fisher embedding dictionary transfer learning for object recognition. IEEE Transactions on Neural Networks and Learning Systems. 2021;34:64-78
  16. Murali PK, Wang C, Lee D, et al. Deep active cross-modal visuo-tactile transfer learning for robotic object recognition. IEEE Robotics and Automation Letters. 2022;7:9557-9564
  17. Kabakus AT, Erdogmus P. An experimental comparison of the widely used pre-trained deep neural networks for image classification tasks towards revealing the promise of transfer-learning. Concurrency and Computation: Practice and Experience. 2022;34:e7216
  18. Bai J. Research and application of deep learning based on transfer learning in image classification tasks. In: 2024 IEEE 4th International Conference on Power, Electronics and Computer Applications (ICPECA). Piscataway, NJ: IEEE; 2024. pp. 1292-1297
  19. Narayan K, Nair NG, Xu J, et al. Petalface: Parameter efficient transfer learning for low-resolution face recognition. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 2025. pp. 804-814. DOI: 10.1109/WACV61041.2025.00088. Preprint: arXiv:2412.07771
  20. Kute R, Vyas V, Anuse A. Transfer learning for face recognition using fingerprint biometrics. Journal of King Saud University - Engineering Sciences. 2021;33(5):1-6. DOI: 10.1016/j.jksues.2021.07.011
  21. Chambino LL, Silva JS, Bernardino A. Multispectral face recognition using transfer learning with adaptation of domain specific units. Sensors. 2021;21:4520
  22. Alhanaee K, Alhammadi M, Almenhali N, et al. Face recognition smart attendance system using deep transfer learning. Procedia Computer Science. 2021;192:4093-4102
  23. Sreekala K, Cyril CPD, Neelakandan S, et al. Capsule network-based deep transfer learning model for face recognition. Wireless Communications and Mobile Computing. 2022;2022:2086613
  24. Elaggoune H, Belahcene M, Bourennane S. Hybrid descriptor and optimized CNN with transfer learning for face recognition. Multimedia Tools and Applications. 2022;81:9403-9427
  25. Kwak N, Kim D. Study on masked face detection and recognition using transfer learning. International Journal of Advanced Culture Technology. 2022;10:294-301
  26. Yi R, Tian H, Gu Z, et al. Towards artistic image aesthetics assessment: A large-scale dataset and a new method. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE; 2023. pp. 22388-22397
  27. Mondal T, Mendoza R, Drumetz L. Physics informed and data driven simulation of underwater images via residual learning. arXiv preprint arXiv:2402.05281. 2024. DOI: 10.48550/arXiv.2402.05281
  28. Mansourifar H, Shi W. Deep synthetic minority over-sampling technique. arXiv preprint arXiv:2003.09788. 2020. DOI: 10.48550/arXiv.2003.09788
  29. Mekonnen KA. Balanced face dataset: Guiding StyleGAN to generate labeled synthetic face image dataset for underrepresented group. arXiv preprint arXiv:2308.03495. 2023. DOI: 10.48550/arXiv.2308.03495
  30. Katare D, Noguero DS, Park S, et al. Analyzing and mitigating bias for vulnerable classes: Towards balanced representation in dataset. arXiv preprint arXiv:2401.10397. 2024. DOI: 10.48550/arXiv.2401.10397
  31. Whang SE, Lee J-G. Data collection and quality challenges for deep learning. Proceedings of the VLDB Endowment. 2020;13:3429-3432
  32. Roh Y, Heo G, Whang SE. A survey on data collection for machine learning: A big data-AI integration perspective. IEEE Transactions on Knowledge and Data Engineering. 2019;33:1328-1347
  33. Kumar V. Data Collection. 2020. Available from: https://mgcub.ac.in/pdf/material/20200429021914369e0bded0.pdf
  34. Esan DO, Owolawi PA, Tu C. Generative adversarial networks: Applications, challenges, and open issues. In: Deep Learning - Recent Findings and Research. London, UK: IntechOpen; 2023. DOI: 10.5772/intechopen.113098
  35. Acharya S, Paul P. The future of medical imaging: Ensuring ethical and legal compliance. In: Smart Medical Imaging for Diagnosis and Treatment Planning. Boca Raton, FL: Chapman and Hall/CRC; 2025. pp. 221-243
  36. Park H, Oh H, Choi JK. A consent-based privacy-compliant personal data-sharing system. IEEE Access. 2023;11:95912-95927
  37. Tiwari S, Raja R, Wadawadagi RS, et al. Emerging biometric modalities and integration challenges. In: Online Identity - An Essential Guide. London, UK: IntechOpen; 2024. DOI: 10.5772/intechopen.1003148
  38. Majeed A, Khan S, Hwang SO. Group privacy: An underrated but worth studying research problem in the era of artificial intelligence and big data. Electronics. 2022;11:1449
  39. Siminiuc S. Advanced Methods of Data Anonymization in Medical Research. 2025. Available from: http://repository.utm.md/handle/5014/29236
  40. Gopireddy RR. Data anonymization techniques: Ensuring privacy in big data analytics. European Journal of Advances in Engineering and Technology. 2020;7:68-74
  41. Zhao M, Chen J. Sequential classification of hyperspectral images. In: Hyperspectral Imaging in Agriculture, Food and Environment. London, UK: IntechOpen; 2018. DOI: 10.5772/intechopen.73160
  42. Chernov Y. Data quality measurement based on domain-specific information. In: Data Integrity and Data Governance. London, UK: IntechOpen; 2022. DOI: 10.5772/intechopen.106939
  43. Vancauwenbergh S. Data quality management. Science Recent Advances. 2019;1:15
  44. Christiaanse R. Quality 4.0: Data quality and integrity - a computational approach. In: Six Sigma and Quality Management. London, UK: IntechOpen; 2022. DOI: 10.5772/intechopen.108213
  45. Baral P, Yang N, Weng N. IoT device identification using device fingerprint and deep learning. In: Deep Learning and Reinforcement Learning. London, UK: IntechOpen; 2023. DOI: 10.5772/intechopen.111554
  46. Rangel JC, Cruz E, Cazorla M. Automatic understanding and mapping of regions in cities using Google street view images. Applied Sciences. 2022;12:2971
  47. Rajput AS, Rajput DAS, Shukla DS, et al. Cutting-Edge Artificial Intelligence in Agriculture: DeepFusionNet Model for Tomato Leaf Disease Classification. 2024. Available at SSRN: 4942525. DOI: 10.2139/ssrn.4942525
  48. Parameswaran AG, Garcia-Molina H, Park H, et al. CrowdScreen: Algorithms for filtering data with humans. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. 2012. pp. 361-372
  49. Saini M, Susan S. Data augmentation of minority class with transfer learning for classification of imbalanced breast cancer dataset using Inception-V3. In: Pattern Recognition and Image Analysis: 9th Iberian Conference, IbPRIA 2019, Madrid, Spain, July 1-4, 2019, Proceedings, Part I. Cham: Springer; 2019. pp. 409-420
  50. Damaševičius R. Introductory chapter: Current state and achievements of data augmentation. In: Deep Learning - Recent Findings and Research. London, UK: IntechOpen; 2024. pp. 1-10. DOI: 10.5772/intechopen.112284
  51. Jiang W, Zhang Y, Zheng S, et al. Data augmentation in human-centric vision. Vicinagearth. 2024;1:8
  52. Yang X, Liang W, Zou J. Navigating dataset documentation in ML: A large-scale analysis of dataset cards on Hugging Face. In: NeurIPS 2023 Workshop on Regulatable ML. 2023. DOI: 10.48550/arXiv.2401.13822
  53. Pushkarna M, Zaldivar A, Kjartansson O. Data cards: Purposeful and transparent dataset documentation for responsible AI. In: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. New York, NY: Association for Computing Machinery (ACM); 2022. pp. 1776-1826
  54. González-Acuña RG, Chaparro-Romo HA, Melendez-Montoya I. Optics and Artificial Vision. Bristol: IOP Publishing; 2021. Available from: https://iopscience.iop.org/book/mono/978-0-7503-3707-6
  55. Wankhede D, Athanikar A, Borude Y, et al. An experimental analysis of machine learning model in the context of sales forecasting: A Walmart dataset as a case study. GRENZE International Journal of Engineering and Technology. 2024;10:1048-1057
  56. Rajesh PJ, Balambica V, Achudhan M. Automated gear inspection using image processing and machine learning techniques. In: 2024 4th International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE). Piscataway, NJ: IEEE; 2024. pp. 1643-1648
  57. Parkin A. Computing Colour Image Processing. Cham: Springer; 2018. DOI: 10.1007/978-3-319-74076-8
  58. Chung BWC. Pro Processing for Images and Computer Vision with OpenCV. Berkeley, CA: Apress; 2017. DOI: 10.1007/978-1-4842-2775-6
  59. Demaagd K, Oliver A, Oostendorp N, et al. Practical Computer Vision with SimpleCV: The Simple Way to Make Technology See. Sebastopol, CA: O’Reilly Media, Inc.; 2012
  60. Distante A, Distante C. Handbook of Image Processing and Computer Vision: Volume 2: From Image to Pattern. Cham: Springer; 2020. DOI: 10.1007/978-3-030-42374-2
  61. Solomon C, Breckon T. Fundamentals of Digital Image Processing: A Practical Approach with Examples in Matlab. Chichester: John Wiley & Sons; 2011
  62. Thanki RM, Kothari AM. Digital Image Processing Using SCILAB. Cham: Springer; 2019. DOI: 10.1007/978-3-319-89533-8
  63. McAndrew A. A Computational Introduction to Digital Image Processing. Boca Raton, FL: CRC Press; 2016
  64. Mordvintsev A, Abid K. OpenCV-Python Tutorials Documentation. San José, CA: OpenCV Project; 2016
  65. Gonzalez RC, Woods RE. Digital Image Processing. 4th ed. New York, NY: Pearson Education; 2017
  66. Kinser JM. Image Operators: Image Processing in Python. Boca Raton, FL: CRC Press; 2018. DOI: 10.1201/9780429451188
  67. Chityala R, Pudipeddi S. Image Processing and Acquisition Using Python. Boca Raton, FL: Chapman and Hall/CRC; 2020. DOI: 10.1201/9780429243370
  68. Asad H, Shrimali VR, Singh N. The Computer Vision Workshop: Develop the Skills You Need to Use Computer Vision Algorithms in Your Own Artificial Intelligence Projects. Birmingham: Packt Publishing; 2020
  69. Pajankar A. Python 3 Image Processing: Learn Image Processing with Python 3, NumPy, Matplotlib, and Scikit-Image. New Delhi: BPB Publications; 2019
  70. Tyagi V. Understanding Digital Image Processing. Boca Raton, FL: CRC Press; 2018. DOI: 10.1201/9781315123905
  71. Dawson-Howe K. A Practical Introduction to Computer Vision with OpenCV. Chichester: John Wiley and Sons Ltd.; 2014
  72. Szeliski R. Computer Vision: Algorithms and Applications. Cham: Springer Nature; 2022. DOI: 10.1007/978-3-030-34375-9
  73. Glasbey CA, Mardia KV. A review of image-warping methods. Journal of Applied Statistics. 1998;25:155-171
  74. Rosebrock A. Practical Python and OpenCV. 2nd ed. PyImageSearch; 2016. 160 p
  75. Kumar DRV. Data Collection. Saarbrücken, Germany: LAP LAMBERT Academic Publishing; 2024
  76. Majeed APPA, Engelbrecht A. Transfer Learning: Leveraging the Capability of Pre-trained Models across Different Domains. Norderstedt, Germany: BoD - Books on Demand; 2025. DOI: 10.5772/intechopen.114788
  77. Pinheiro Cinelli L, Araújo Marins M, Barros da Silva EA, et al. Variational autoencoder. In: Variational Methods for Machine Learning with Applications to Deep Networks. Cham: Springer; 2021. pp. 111-149
  78. Zhang H, Cisse M, Dauphin YN, et al. Mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. 2018. DOI: 10.48550/arXiv.1710.09412
  79. Yun S, Han D, Oh SJ, et al. CutMix: Regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway, NJ: IEEE; 2019. pp. 6023-6032
  80. Lai YL, Ang TF, Bhatti UA, et al. Color correction methods for underwater image enhancement: A systematic literature review. PLoS One. 2025;20:e0317306. DOI: 10.1371/journal.pone.0317306
  81. Guerra JP, Cuevas F. Application of digital image processing techniques for agriculture: A review. In: Digital Image Processing: Advanced Technologies and Applications. London, UK: IntechOpen; 2024. DOI: 10.5772/intechopen.1004767
  82. Burger W, Burge MJ. Principles of Digital Image Processing: Fundamental Techniques. London: Springer; 2010. DOI: 10.1007/978-1-84800-191-6
  83. Gavet Y, Debayle J. Image Processing Tutorials with Python®. Lyon, France: Spartacus-Idh; 2019. Available from: https://hal-emse.ccsd.cnrs.fr/emse-02469242v1
  84. Sánchez FJ. Medición y Análisis de las Variaciones en el Nivel de un Modelo Físico Empleando Imágenes. Azcapotzalco, Mexico: UAM; 2009
  85. Rebaza JV. Detección de bordes mediante el algoritmo de Canny. 2007;4:1-8
  86. Zhang Y-J. A Selection of Image Processing Techniques: From Fundamentals to Research Front. Boca Raton, FL: CRC Press; 2022. DOI: 10.1201/9781003241416
  87. Zuiderveld K. Contrast limited adaptive histogram equalization. In: Graphics Gems IV. San Diego, CA: Academic Press Professional; 1994. pp. 474-485
  88. Distante A, Distante C. Handbook of Image Processing and Computer Vision: Volume 1: From Energy to Image. Cham: Springer; 2020. DOI: 10.1007/978-3-030-38148-6
  89. Frangi AF, Niessen WJ, Vincken KL, et al. Multiscale vessel enhancement filtering. In: Medical Image Computing and Computer-Assisted Intervention - MICCAI’98. Berlin, Heidelberg: Springer; 1998. pp. 130-137
  90. Cervantes H, Miguel S. Algoritmos metaheurísticos para la segmentación de imágenes. Madrid: Universidad Complutense de Madrid; 2019. Available from: https://hdl.handle.net/20.500.14352/17180
  91. De Albuquerque MP, Esquef IA, Mello ARG. Image thresholding using Tsallis entropy. Pattern Recognition Letters. 2004;25:1059-1065
  92. Oliva D, Cuevas E. Advances and Applications of Optimised Algorithms in Image Processing. Cham: Springer; 2017. DOI: 10.1007/978-3-319-48550-8
  93. Sauvola J, Pietikäinen M. Adaptive document image binarization. Pattern Recognition. 2000;33:225-236
  94. Bernsen J. Dynamic thresholding of grey-level images. In: Proceedings of the 8th International Conference on Pattern Recognition. Piscataway, NJ: IEEE; 1986. pp. 1251-1255
  95. Vandikas K, Moradi F, Larsson H, et al. Transfer Learning and Domain Adaptation in Telecommunications. London, UK: IntechOpen; 2024. DOI: 10.5772/intechopen.114932
  96. Wei X. Transfer Learning for Non-Invasive BCI EEG Brainwave Decoding. London, UK: IntechOpen; 2024. DOI: 10.5772/intechopen.115124
  97. Hellum O, Pedersen LH, Rønn-Nielsen A. How global is predictability? The power of financial transfer learning. SSRN Electronic Journal. 2023. DOI: 10.2139/ssrn.4620157
  98. Xu W, He J, Shu Y. Transfer learning and deep domain adaptation. Advanced Applied Deep Learning. 2020;45:600-620. DOI: 10.5772/intechopen.94072
  99. Pilault J, Elhattami A, Pal C. Conditionally adaptive multi-task learning: Improving transfer learning in NLP using fewer parameters and less data. arXiv preprint arXiv:2009.09139. 2020. DOI: 10.48550/arXiv.2009.09139
  100. Lu C, Li W. Ship classification in high-resolution SAR images via transfer learning with small training dataset. Sensors. 2018;19:63
  101. Hanneke S, Kpotufe S, Mahdaviyeh Y. Limits of model selection under transfer learning. In: The Thirty Sixth Annual Conference on Learning Theory. PMLR; 2023. pp. 5781-5812
  102. Zhong X, Ban H. Pre-trained network-based transfer learning: A small-sample machine learning approach to nuclear power plant classification problem. Annals of Nuclear Energy. 2022;175:109201
  103. Selvaraju RR, Das A, Vedantam R, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision. 2020;128:336-359. DOI: 10.1007/s11263-019-01228-7
  104. Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. pp. 1135-1144
  105. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems. 2017;30:4765-4774
