A Generic Workflow for Bioprocess Analytical Data: Screening Alignment Techniques and Analyzing their Effects on Multivariate Modeling

UV chromatographic data in combination with multivariate data analysis (MVDA) have been extensively used for bioprocess monitoring. However, such data are usually subject to shifts along the retention time axis and require preprocessing. Misaligned UV chromatographic data result in inconsistent MVDA models. Numerous preprocessing techniques are available, each varying in the number of meta-parameters to optimize, complexity and computational time. Therefore, we aimed at developing a generic workflow to screen for preprocessing techniques. We chose four datasets with increasing complexity containing UV chromatographic data from reverse-phase and size exclusion HPLC. We aligned all four datasets using three preprocessing techniques, namely the icoshift, PAFFT and RAFFT algorithms. We chose several statistical tools to validate the performance of the preprocessing techniques and to screen for meta-parameters. We validated the performance of the preprocessing techniques in terms of data preservation, complexity and computational time, and identified the optimal ranges of meta-parameters for each dataset. Finally, we established principal component analysis (PCA) models to evaluate the chosen alignment technique. In summary, this study presents a generic workflow to validate the alignment of chromatographic data using statistical tools.


INTRODUCTION
UV chromatography is a powerful tool, extensively used in bioprocess analytical techniques for quantitative and qualitative analysis [1,2]. The main advantages of UV chromatography are short analysis time, the ability to generate large amounts of data containing process information, a wide variety of column chemistries and high precision. However, UV chromatographic data are prone to shifts along the retention time, which render subsequent automation and establishment of modeling techniques cumbersome or even impossible. Particularly in biochemical assays done with label-free LC analysis, alignment of the various analyte profiles to their respective retention times is of utmost importance [3,4]. HPLC is often coupled with different techniques for biochemical analysis [5][6][7]. Automating such assays to extract valuable process information in real time necessitates correcting misalignments in peak profiles. In the past decades, various alignment techniques have been used to correct shifts along the retention time. Peak alignment is necessary for peak identification and quantification, but more importantly for automation and application of subsequent chemometric models, such as principal component analysis (PCA), hierarchical cluster analysis (HCA) and partial least squares (PLS). For establishing such multivariate models, the chromatographic dataset must contain information about the changes in the process, which are associated with changes in the UV chromatograms. In other words, the retention time of a particular compound must not vary across different samples, as otherwise the predictive ability of the model is compromised [8,9]. A typical UV chromatogram with retention time shifts is shown in Figure 1.
Various peak alignment approaches to correct misalignments in retention time have been proposed in the literature. Most alignment techniques require a reference chromatogram and additional meta-parameters for misalignment correction. These meta-parameters are dependent on the dataset and have to be screened in a case-by-case approach [10]. Various target functions for alignment are also used, with the most common being the Pearson correlation coefficient [11], Euclidean distance [12], fast Fourier transform (FFT) cross correlation [13] and other even more sophisticated methods. In general, peak alignment techniques use one of three correction methods, namely shifting, insertion/deletion and polynomial models. A more detailed collection of various alignment techniques, their mode of function and relevant meta-parameters has been published recently [14].
Although different alignment techniques are available, generic, generally accepted criteria for choosing an alignment technique for processing UV chromatographic data are not available. The three main challenges with aligning chromatographic data are 1) choosing a relevant reference spectrum, 2) defining meta-parameters and 3) data preservation. A more detailed description of these challenges is shown in Table 1.
The reference spectrum, to which all other spectra are aligned, plays a critical role in the overall performance of the alignment technique [10]. It is important that the reference spectrum represents all peaks in the entire dataset. Different approaches have been reported for calculating the reference spectrum, the most common being the average (mean) or median of the entire dataset [15]. In addition to the reference spectrum, each peak alignment technique requires its own meta-parameters, such as segment length or allowed shifts [16], which are defined prior to the alignment. These meta-parameters depend on both the alignment technique and the dataset used and thus have to be screened. For multivariate modeling, the peak shape and intensity must not change during the alignment procedure, otherwise important information from the dataset is lost.
In this study, we established statistical tools to screen for meta-parameters using correlation analysis, explained variance and peak factor. We compared the performances of three peak alignment techniques on four UV chromatographic datasets of different complexity based on the determined meta-parameters. We compared the peak alignment techniques with the determined meta-parameters based on alignment correlation, peak factor and visualization using heat maps and 2D plots. We chose three peak alignment techniques which use FFT cross correlation as target function, namely the interval correlation optimized shifting (icoshift) algorithm [13,17], peak alignment by FFT (PAFFT) and recursive alignment by FFT (RAFFT) [18]. We chose them for their reported low computational times and their lower complexity in terms of meta-parameters in comparison to warping algorithms [15]. We investigated different reference spectrum selection techniques for peak alignment and defined the optimal reference spectrum based on the highest correlation of the reference spectrum to each individual spectrum. Furthermore, we analyzed PCA models, established on the best and worst aligned UV chromatographic datasets and the original dataset, to highlight the impact of the peak alignment method on the multivariate models. Finally, we present a generic workflow for screening meta-parameters as well as choosing and evaluating different peak alignment methods for UV chromatographic data.

UV chromatographic datasets
Datasets 1 and 2: UV chromatographic data from size exclusion (SE-) HPLC: Samples from four different E. coli cultivations were used for analyzing protein purity through SEC. UV chromatographic data at 280 nm were acquired using a modular HPLC device (PATfinder™) purchased from BIAseparations (Slovenia). The setup comprised an autosampler (Optimas), a pump (Azura P 6.1L) and a UV detector (Azura MWD 2.1 L). The samples were loaded onto a Superdex 75 10/300 GL size exclusion chromatography (SEC) column purchased from GE Healthcare (Germany). A loading buffer with 20 mM potassium phosphate,

Table 1
Challenge | Requirements
Choosing a reference spectrum | Reference spectrum must represent all peaks in the UV spectrum.
Defining meta-parameters | Meta-parameters are usually defined on a case-by-case basis, since they are dependent on each peak alignment technique. The meta-parameters determined for a chosen dataset affect peak alignment.
Data preservation | Peak alignment technique must not change peak shape, intensity and other important attributes which contain process information.

The varying complexity of the datasets arises from the chromatographic method used. For clarity, the SEC-HPLC datasets (1 and 2) render broad, Gaussian (or 'bell') shaped peaks, whereas the RP-HPLC datasets are characterized by needle shaped peaks. Furthermore, the number of peaks varies enormously between the SEC and RP-HPLC datasets. Therefore, four datasets with varying complexity were considered for this study. Exemplary chromatograms to highlight the complexity in all four datasets considered in this study are shown in Figure 2.

Reference spectrum selection
The reference spectrum is usually selected based on a priori knowledge of the dataset. The reference spectrum must be representative of the (most) significant peaks in a dataset, which is important for extracting process information using multivariate models. Often, the reference spectrum is either calculated as the mean or median of the entire dataset or chosen as the latest sample in the sequence, which usually contains the highest number of peaks [20,21]. Skov et al. [10] proposed a selection criterion for identifying the reference spectrum by calculating the product of correlation coefficients between the chosen reference spectrum and each individual sample. The reference spectra and the rationale for selecting them are shown in Table 2.
Although mean and median measures contain significant peak information, they can be biased towards a few peaks with high maxima. Thus, we opted for a bi-weighted mean approach, which imposes a bias correction to keep the highest peak intensities from dominating the peak alignment procedure. The maximum of all chromatograms in the dataset captures all maximum values or significant peak information and was therefore also considered as a reference spectrum. In total, seven different reference spectra were used for identifying the optimal reference spectrum for further peak alignment methods.
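The screening of candidate reference spectra can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names are hypothetical, only five of the seven candidates are built (the bi-weighted mean is omitted), and ranking by the mean Pearson correlation is an assumption made for illustration. The product of correlation coefficients proposed in [10] could be substituted.

```python
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def candidate_references(chromatograms):
    """Build candidate reference spectra (rows = samples in injection order)."""
    columns = list(zip(*chromatograms))  # one tuple per retention time point
    return {
        "mean": [mean(c) for c in columns],
        "median": [median(c) for c in columns],
        "maximum": [max(c) for c in columns],
        "first injection": list(chromatograms[0]),
        "last injection": list(chromatograms[-1]),
    }


def rank_references(chromatograms):
    """Rank candidates by their mean correlation to the individual samples.

    Assumption: the mean correlation is used as the ranking score.
    """
    scores = {
        name: mean(pearson(ref, c) for c in chromatograms)
        for name, ref in candidate_references(chromatograms).items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_references(data)` on a list of equally long chromatograms returns the candidate names with their scores, best first. The full distribution of correlations per candidate can be inspected instead of the mean when boxplots are preferred.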
150 mM sodium chloride, pH 7.0 was used. The flow velocity was kept constant at 0.5 mL/min. The UV chromatograms at 280 nm from four different E. coli cultivations, comprising 24 samples with 9001 data points per chromatogram, are termed Dataset 1.
Samples from downstream unit operations, in particular protein refolding, from E. coli bioprocesses were used for analyzing product yield and purity through SEC. The HPLC setup and analysis conditions were the same as for Dataset 1. The UV chromatograms acquired at 280 nm, comprising 15 samples with 12001 data points per chromatogram, are termed Dataset 2.
Datasets 3 and 4: UV chromatographic data from reverse-phase (RP-) HPLC: Samples of corn steep liquor (CSL), which is used as media supplement for Penicillium chrysogenum cultivations [14], were analyzed for vitamin composition using a reverse-phase HPLC column (Acclaim PA; Thermo Fisher Scientific, USA). The HPLC setup (Ultimate 3000; Thermo Fisher Scientific, USA) comprised a pump (LPG-3400SD), an autosampler (CTC autosampler), a column oven (TCC-3000SD) and a diode array detector (DAD 3000). Samples were loaded with 25 mM potassium phosphate buffer, pH 3.5 and eluted with acetonitrile. A more detailed explanation of the data acquisition procedure is published elsewhere [19]. The flow rate was kept constant at 1 mL/min. The UV chromatograms at 260 nm, acquired for vitamin composition analysis of sixteen different CSL media stocks, are termed Dataset 3, which comprises 16 samples with 4800 data points each.
Samples from four different E. coli cultivations were used for quantifying metabolite concentrations using an RP-HPLC column (Supelcogel C-610 H, Thermo Fisher Scientific, USA). Samples were loaded with a running buffer comprising 0.1% phosphoric acid in distilled water. The flow rate was kept constant at 0.5 mL/min. The HPLC setup was the same as for Dataset 3.
The UV chromatograms at 210 nm, acquired for quantifying metabolite concentrations in the E. coli cultivations, are termed Dataset 4, which comprises 51 samples with 9001 data points each. For all datasets, samples were centrifuged and filtered prior to injection and a sample volume of 10 µL was injected.

Peak alignment techniques
Three different peak alignment methods were tested in this study. The main properties of the different alignment techniques are shown in Table 3.
All individual chromatograms which had buffer artefact peaks were considered as outliers and removed, based on the Hotelling's T² statistic from PCA models on the raw chromatographic dataset, prior to the peak alignment procedures.

Icoshift:
The icoshift algorithm was initially developed for 1D NMR data [17], but it has also been used for UV chromatographic data (e.g. [1,13]). The icoshift algorithm splits each UV chromatogram into segments and aligns these segments to the corresponding segments in the reference spectrum by shifting them sideways to achieve maximum cross-correlation. It is driven by an FFT engine for simultaneous alignment and has been shown to outperform warping algorithms (e.g. COW; [13]). The main advantage of icoshift is its shifting procedure, where the number of shifts of a particular segment can be determined either automatically by the algorithm or by the user. In common warping algorithms, the search for the shift parameter is tedious as it is powered by dynamic programming (e.g. dynamic time warping (DTW); [20]). Other advantages of the algorithm include high computational efficiency, user-defined segments and the option to fill in missing values (e.g. through interpolation) [17]. The algorithm is available from [22].
In this study, the number of segments was set between 1 (indicating the entire chromatogram of a sample as a segment) and the total number of data points in the datasets (e.g. 4799 segments for Dataset 3). The maximum number of shifts allowed was not fixed and the algorithm was allowed to shift until it found the best fit. The chosen values for the different meta-parameters for icoshift are shown in the supplementary information (Table S1). Missing parts on segment edges were replaced by repeating the value of the segment edge.
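The segment-wise shifting described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the icoshift implementation: the function names are hypothetical, segments are of equal size, and the cross-correlation is evaluated by a direct sum over integer lags rather than through an FFT (both give the same optimum for a given lag range, the FFT is merely faster). Gaps at segment edges are filled by repeating the edge value, as in this study.

```python
def shift_with_edge_fill(segment, lag):
    """Shift right by `lag` points (left if negative); pad with the edge value.

    Assumes abs(lag) < len(segment).
    """
    n = len(segment)
    if lag > 0:
        return [segment[0]] * lag + list(segment[:n - lag])
    if lag < 0:
        return list(segment[-lag:]) + [segment[-1]] * (-lag)
    return list(segment)


def best_lag(reference, segment, max_lag):
    """Integer lag in [-max_lag, max_lag] that maximizes the cross-correlation."""
    def score(lag):
        shifted = shift_with_edge_fill(segment, lag)
        return sum(r * s for r, s in zip(reference, shifted))

    # On ties, prefer the smallest absolute shift.
    return max(range(-max_lag, max_lag + 1), key=lambda lag: (score(lag), -abs(lag)))


def align_by_segments(reference, sample, n_segments, max_lag):
    """Align `sample` to `reference` segment by segment."""
    n = len(sample)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and the number of points")
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    aligned = []
    for start, stop in zip(bounds, bounds[1:]):
        ref_seg, seg = reference[start:stop], sample[start:stop]
        lag = best_lag(ref_seg, seg, min(max_lag, len(seg) - 1))
        aligned.extend(shift_with_edge_fill(seg, lag))
    return aligned
```

With `n_segments=1` the whole chromatogram is shifted as one block. Larger values allow local corrections but increase the risk of edge artifacts at segment boundaries.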
PAFFT: Similar to icoshift, the PAFFT algorithm also corrects misalignments by shifting the segments to achieve the highest correlation. The optimal shift size is determined by sliding the segment of a sample over the corresponding segment in the reference spectrum to achieve maximum correlation. PAFFT does not allow addition of missing values with zeros or interpolations, therefore possible endpoint contamination (by addition of interpolated values) in the chosen segments may occur. On the other hand, since no extra data points are added to the UV chromatographic data, no artifacts are generated. Additionally, PAFFT provides an option to limit the number of shifts of a particular segment. PAFFT also uses the FFT engine for peak alignment. Since two meta-parameters need to be defined, we used a simple two-factorial screening design for exploring the optimal meta-parameter combinations. The number of segments was chosen between 1 (corresponding to all data points in each chromatogram) and 1/16 of the chromatogram length (where the entire chromatogram is split into 16 parts, with each segment containing a number of data points that depends on the dataset). The number of times the chromatograms were split (16) was chosen arbitrarily and can be changed. The number of shifts allowed by PAFFT depends on the complexity of the dataset, that is, on peak properties such as retention time and peak width. We therefore assumed a maximum shift corresponding to 1 min in the retention time. Five combinations of shifts and segments based on the experimental design were chosen for the PAFFT algorithm and are shown in Table S1. The algorithm for PAFFT can be downloaded from [23].
RAFFT: RAFFT is an extensively used peak alignment method which also uses FFT cross correlation for peak alignment [16,18]. In contrast to PAFFT, the RAFFT algorithm splits the entire spectrum into smaller segments for identifying the highest correlation. The maximum number of shifts allowed for each segment is specified by the user. At the beginning of the alignment procedure, the largest segment is selected for alignment, and this segment is gradually broken down into smaller segments until either the highest correlation is achieved or the maximum number of allowed shifts is reached. RAFFT has also been shown to be faster than other warping algorithms [16]. In this study, the maximum number of shifts allowed was fixed based on the retention time, as for PAFFT. We assumed that a segment comprising a few peaks should not shift by more than 1 min of the retention time. Therefore, we chose fixed values of 61, 121, 181, 241 and 301 shifts, corresponding to 0.2, 0.4, 0.6, 0.8 and 1 min in retention time. The algorithm for RAFFT can be downloaded from [23].

Table 2 (partial)
Reference spectrum | Rationale
First injection | Represents all peaks at the beginning of the process
Last injection | Represents all peaks at the end of the process

Evaluation criteria
Correlation analysis: Correlation of the aligned samples from each peak alignment method with the chosen reference spectrum renders similarity measures. If all peaks in the sample dataset are aligned precisely to the reference spectrum, we obtain a correlation value of 1. However, this measure is only a rough estimate of the alignment procedure and depends entirely on the reference spectrum selection.
Explained variance: The explained variance calculated from the PCA model can be used to evaluate the performance of the alignment method. Perfectly aligned chromatograms have a higher variance explained in the first principal components in comparison to misaligned data. Therefore, the sum of the explained variance of the first principal component(s) was calculated for all aligned datasets by establishing PCA models on all datasets. The explained variance in combination with the correlation analysis indicates the optimal setting for a given peak alignment method.
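As a rough illustration of this criterion, the fraction of variance captured by the first principal component can be computed without a PCA package. The sketch below is an assumption-laden simplification: the function name is hypothetical, the data are only mean-centered (any scaling applied by the PCA software used in this study is not reproduced), and the leading eigenvalue is obtained by power iteration on the small sample-by-sample Gram matrix.

```python
def explained_variance_pc1(chromatograms, n_iter=500):
    """Fraction of total variance explained by the first principal component.

    Rows are samples, columns are retention time points.
    """
    n = len(chromatograms)
    col_means = [sum(col) / n for col in zip(*chromatograms)]
    centered = [[v - m for v, m in zip(row, col_means)] for row in chromatograms]

    # Gram matrix of the centered data; it shares its nonzero eigenvalues
    # with the (unnormalized) covariance matrix.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered] for ri in centered]
    total = sum(gram[i][i] for i in range(n))
    if total == 0:
        raise ValueError("all chromatograms are identical")

    # Power iteration, started from the Gram row with the largest diagonal entry.
    start = max(range(n), key=lambda i: gram[i][i])
    vec = list(gram[start])
    for _ in range(n_iter):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        length = sum(v * v for v in nxt) ** 0.5
        vec = [v / length for v in nxt]

    # Rayleigh quotient of the unit vector gives the leading eigenvalue.
    eigenvalue = sum(
        v * sum(g * w for g, w in zip(row, vec)) for v, row in zip(vec, gram)
    )
    return eigenvalue / total
```

For three chromatograms with one peak at the same position but different heights the function returns a value close to 1, whereas the same peaks placed at three different positions give a clearly lower value.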
Peak factor: Skov et al. proposed the peak factor as a measure for analyzing the performance of peak alignment techniques [10]. The peak factor measures absolute changes in the chromatographic data due to peak alignment procedures. This is relevant because the alignment technique must not modify the actual data, as any changes affect the subsequent multivariate models. The peak factor is calculated by comparing the Euclidean length (norm) of a UV chromatogram before and after alignment. For warping algorithms such as DTW, peaks from the original data have been reported to be distorted [14]. However, if there is no change in the peak shape, the peak factor has a value of 1.
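A possible way to compute such a measure is sketched below. The formula is our reading of the peak factor, namely the mean over all samples of 1 - min(c, 1)^2, where c is the relative change of the Euclidean norm of a chromatogram caused by alignment. It should be checked against the definition in [10] before use, and the function names are hypothetical.

```python
from math import sqrt


def euclidean_norm(x):
    """Euclidean length of a chromatogram."""
    return sqrt(sum(v * v for v in x))


def peak_factor(original, aligned):
    """Mean of 1 - min(c, 1)**2 over all samples.

    c is the relative change of the norm of each chromatogram.
    Assumes every original chromatogram has a nonzero norm.
    """
    terms = []
    for before, after in zip(original, aligned):
        norm_before = euclidean_norm(before)
        change = abs(euclidean_norm(after) - norm_before) / norm_before
        terms.append(1.0 - min(change, 1.0) ** 2)
    return sum(terms) / len(terms)
```

A pure shift that keeps all points of a peak inside the chromatogram leaves the norm unchanged and gives a peak factor of 1, while a chromatogram whose peaks are lost entirely contributes 0.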
Computational time: Although this measure may not be decisive for the peak alignment methods used in this study, owing to their fast computations, we included it for applicability. Chromatographic and spectroscopic data have been successfully used for bioprocess monitoring [24][25][26], which necessitates fast preprocessing techniques to be on par with bioprocess dynamics [27]. Warping algorithms often have very high computational times [28]. Initially, we considered including dynamic multi-way warping (DMW) as a peak alignment method in this study. However, DMW rendered a 1,000-fold higher computational time (data not shown) than icoshift, PAFFT and RAFFT and hence was not included. Nevertheless, it is practical for the user to have an overview of the time invested in a particular peak alignment method. Therefore, we calculated the computational time for the chosen peak alignment procedures. We performed all analyses on a stand-alone PC with an Intel i5-3330 @ 3.00 GHz processor and 8 GB RAM.
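Timing an alignment routine can be done with a small helper such as the following Python sketch. The helper name and the choice of reporting the best of several repeats are assumptions made for illustration. They do not describe how the times in this study were measured.

```python
import time


def time_alignment(align, *args, repeats=3):
    """Best wall-clock time in seconds of `align(*args)` over several repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        align(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

Any callable can be passed, for example `time_alignment(my_align, reference, sample)`, where `my_align` stands for a user-supplied alignment function.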
Visualization: Visual inspection of datasets renders better understanding of peak alignment methods and contributes to further improvement of the alignment procedure by optimization of meta-parameters. Heat maps were used in this study for visualizing the UV chromatographic data based on their intensities. Strong misalignments can be easily identified using heat maps. For ease of visualization, 2D plots of the original and best alignment were generated to give the user a clear overview of the alignment procedure.

Multivariate models
As an application example, PCA models were developed on the original (misaligned) and the 'best' aligned datasets. In general, the PCA models are used to assess the impact of different peak alignment techniques on chemometric models. In short, PCA is an exploratory technique which decomposes the entire chromatographic dataset into a few latent principal components. Each sample is represented as a score and is projected onto the principal components based on its similarities to or differences from other samples. The resulting score plots from the PCA model can be used to identify possible groupings or similarities between samples in the UV chromatographic data.

Software
All data analyses were done using MATLAB R2016a (Mathworks, US). The PCA models were established in SIMCA v13.0 (Umetrics, Sweden).

RESULTS AND DISCUSSION
In this study we developed a methodology to screen for meta-parameters and to choose a peak alignment technique based on different evaluation criteria such as correlation analysis, peak factor and computational time. Four UV chromatographic datasets with varying number of samples, complexity and data volume were analyzed in this study to show the generic applicability of our workflow.

Reference spectrum selection
Seven reference spectra were generated and correlated to each UV chromatogram from all datasets. The correlation coefficients from all datasets and their respective reference spectrum are shown as boxplots in Figure 3. The line inside the box indicates the absolute correlation of the chosen reference spectrum to all four datasets.
It is interesting to note that the first and last injections cannot be used as reference spectrum in any of the datasets. In Datasets 1 and 2, it is clear that the first and the last injections were not representative of all peak information. Similarly, the peak information in the first and last injections represents different vitamin compositions in Dataset 3 and different metabolite profiles in Dataset 4, and renders the lowest correlation. This can be explained by the changes in analyte concentrations over process time, which indicate release (appearance of new peaks) and/or utilization (disappearance of existing peaks) over time. Since the reference spectrum calculated as the arithmetic mean of the UV chromatograms of all samples rendered the highest correlation for Datasets 1-4, it was chosen as the optimal reference spectrum.

Evaluation criteria
Correlation analysis: Three peak alignment methods were chosen based on their FFT cross correlation for high throughput analysis and less complexity in comparison to warping algorithms. Peak alignment was done using the chosen reference spectrum from respective datasets and correlation analysis was done between the reference spectrum and aligned datasets. For each peak alignment method five different meta-parameter constraints were used. The results from the correlation analysis for all four Datasets are shown in Figure 4.
All the chosen methods with the chosen meta-parameters achieved high correlations above 0. In Dataset 1, it is interesting to note that the RAFFT algorithm has overall lower standard deviations (as indicated with the error bars) in comparison to icoshift or PAFFT algorithms. This can be explained by complete shifts of the chromatogram in the RAFFT algorithm rather than dividing the chromatographic data into segments as in icoshift and PAFFT algorithms.
For Dataset 3 (Figure 4), the correlation coefficients of the selected reference spectrum and icoshift increased with higher intervals to be shifted, but started to decline after 1200 intervals. This indicates that the optimum intervals to be shifted using icoshift algorithm should be close to 1200 intervals. Interestingly, with PAFFT algorithm we can see ...
Explained variance: The explained variance was calculated using a PCA model on the dataset and used to indicate the degree of alignment. The explained variance from the principal components for all alignment methods and their chosen meta-parameters are shown in Figure 5.
Aligned chromatograms explain higher variance in the first PC of a PCA model; therefore, the higher the explained variance, the better the alignment of the dataset. For Dataset 1, the results from the explained variance are in agreement with the results achieved in correlation analysis. The RAFFT algorithm between 121-301 shifts rendered the highest explained variance for Dataset 1. It is more interesting to note in Dataset 2, the peak alignment ...
Peak factor: The peak factor indicates net changes in the aligned chromatograms in comparison to the original chromatogram. The optimal peak factor is '1', corresponding to 'no change'. The peak factors for almost all meta-parameter settings and peak alignment methods for Dataset 1 were higher than 0.96 (icoshift: 1500 intervals). This could be due to endpoint contaminations. For Dataset 2, all peak alignment methods resulted in a peak factor of 1, indicating no loss of information or distortion of peaks. The peak factors for Datasets 3 and 4 were higher than 0.97, which indicates that the peak alignment methods used did not alter the chromatographic information significantly (less than 3%). As mentioned earlier, peak shapes are altered mainly when warping or interpolation functions are integrated into the peak alignment procedure. For the shift-based algorithms employed in this study, however, little to no peak distortion is to be expected.
Computational time: Comparing all four datasets, the computational time clearly increases with the number of samples. The PAFFT algorithm always rendered the lowest computational time for all the datasets considered in this study. However, it has to be noted that the PAFFT algorithm performed worse in terms of correlation and explained variance than the other peak alignment procedures. Warping algorithms usually require about 100-fold more computational time than the FFT correlation methods used in this study [10].
Overall, all algorithms used in this study took less than 5 seconds for the peak alignment procedure.

Visualization:
Heat maps or 2D plots can be used to visualize the alignment results. In heat maps, the intensities of the significant peaks are highlighted and possible misalignments are identified. Furthermore, any improvement of a peak alignment method based on a different set of meta-parameters can be directly seen. The heat maps and 2D plots of the chosen methods for Datasets 1-4, showing the unaligned dataset and the best alignments achieved, are shown in Figure S1. The heat maps of the original and best aligned datasets clearly highlight the misalignments in the raw dataset and the alignment efficiency of the algorithm. The 2D plots show the efficiency of the alignment procedure, where one can clearly see the improvement in peak alignment. Finally, any outliers in the UV chromatographic data (e.g. buffer peaks) can be easily identified by visualizing peak distortions using heat maps.
From all these results, we can see that the correlation analysis and explained variance gave similar indications of peak alignment performance for the chosen meta-parameters. The peak factor gave similar results and indicated no interference with the peak properties, and thereby no loss of information. The correlation analysis and explained variance indicated RAFFT with 181 shifts for Dataset 1 and icoshift with 1 interval for Dataset 2. For Dataset 3, the PAFFT algorithm with a segment size of 300 and 61 shifts performed better than all other peak alignment algorithms used in this study, whereas for Dataset 4, the icoshift algorithm with 1 interval, considering the whole chromatogram, outperformed all other algorithms. It is clear that no gold standard preprocessing technique is available globally for all datasets. However, such a generic strategy must be used to screen for different preprocessing techniques to avoid misleading multivariate models. In order to describe deviations in modeling results, we chose the original datasets, the best aligned datasets and the worst aligned datasets for establishing multivariate models.

Multivariate models
PCA models were established on the 'best' alignments and worst alignments achieved from the peak alignment technique which was identified from all datasets. PCA models render different model variables such as scores and loadings which can be used to extract relevant information from the input datasets. In PCA, the closer the scores are to each other the more similar they are, with respect to the principal components. We analyzed the performance of the peak alignment technique based on the trends in score plots from the PCA models. The score scatter plots from the PCA model from Datasets 1-4 are shown in Figure 6A-6D.
In Figure 6, the score plot of the original data shows a wide spread of scores, each representing a UV chromatogram. In the best alignment, we can see a compact trend where samples similar to each other are projected closer together. This is further highlighted by the score plot from the worst alignment, where the scores are even more scattered than the scores from the original data, showing strong dissimilarities. We can see a clear improvement between the original dataset and the best alignment with respect to clustering in the score plot, highlighting the peak alignment performance. Similarly, we can clearly see similarities between the original and worst aligned datasets for all datasets. In Figure 6, the original and worst aligned datasets yield almost identical results, as suggested by the very similar results in the evaluation criteria (i.e. 36.1% and 36.8% explained variance for the original and worst aligned datasets). It is interesting to note that in Dataset 4, the best and worst alignments were achieved with the same algorithm (icoshift) with different meta-parameters (intervals). This further highlights the significance of meta-parameters in peak alignment procedures and the subsequent data driven models.

CONCLUSION
UV chromatographic data are prone to shifts along the retention time, which require preprocessing prior to establishing multivariate models. In this study, we established a generic strategy for screening and validating different preprocessing techniques for UV chromatographic data. We chose different peak alignment techniques with different meta-parameters to evaluate their performance on four datasets. We analyzed the performance using different statistical tools to identify the optimal peak alignment technique and its meta-parameter ranges. The evaluation with statistical tools illustrated that peak alignment techniques, even though similar in correction methods and target functions, can render different results. The complexity and the number of samples of each dataset also have an impact on the peak alignment procedure. Therefore, it is safe to hypothesize that the performance of the peak alignment technique depends on the initial, raw dataset and that no global standard exists for all datasets. The meta-parameters of the chosen peak alignment technique affect the model results, as highlighted by the score scatter plots from the PCA models. In summary, the proposed methodology was used to choose the reference spectrum, screen for meta-parameter ranges and validate the results using data driven models. The generic methodology can be used for different chromatographic datasets and has a modular setup which allows incorporation of any peak alignment technique and any statistical tool as evaluation criterion. We envision applying the proposed workflow also to spectroscopic data, which are usually hampered by peak and baseline shifts.