This paper offers a sympathetic review of the environment and studies of tourism big data, including its general features, types, analytical techniques and research progress. It uses the case of the TSE model to illustrate the challenges of conducting tourism big data research. In doing so, it argues that the key to using tourism big data is to make good use of heterogeneous data structures from multiple sources. This requires sophisticated techniques and innovative analytical frameworks that allow us to explore potential mechanisms that are embedded in our world but hard to identify and verify by traditional methods.
Tourism big data; Multiple sources; Heterogeneous structure; TSE model
The rise of big data in tourism
There has long been a plethora of data on tourism, but turning it into useful information and using it to inform wise decisions has been a persistent challenge. The crossing of international borders commonly involves the completion of customs and immigration forms, and registration in hotels usually requires the sharing of personal information. Airlines and credit card companies commonly amass a great deal of personal information, but much of it is proprietary and not available for academic research, although it may be analyzed carefully by the companies that collect it. Some information, such as national tourism statistics, is published regularly, providing useful background information for research and practice.
These data have some deficiencies. For example, a visitor to three Caribbean islands on one trip would be counted three times. However, if the limitations are appreciated, such data can be helpful in answering questions about what is happening, but it seldom provides insights into why things are occurring. Also, the data are often reported at a scale, often national, that differs from the local scale at which many more specific questions arise, and much research is undertaken. Thus, researchers and agencies frequently spend a great deal of time, money and effort collecting data that are specific to their needs. For example, some years ago, Parks Canada (The Canadian Parks Service) possessed reports on more than one hundred visitor surveys conducted on its behalf concerning the activities of those who visited the properties that they manage.
While we still face problems and limitations in making full use of the ‘old’ traditional data, a new data environment is gradually emerging in tourism. With the rapid progress of information and communication technologies, more convenient transportation facilities and an all-encompassing cyberspace, a huge volume of new data has become storable in digital form and exchangeable. The new challenge is that these data come from multiple sources with heterogeneous structures that cannot be easily combined and analyzed.
Tourism is one of the key industries that generate such big data. There is currently a growing trend of using networked smart mobile devices to book travel products, review travel experiences, and share and communicate with others. This results in huge amounts of unstructured and fragmented data, which we call the “big data” of tourism. The mobile phone and the internet have not only revolutionized the ways in which tourists make decisions, with far-reaching implications for tourism itself; they have also resulted in the production of new types of data that can be used to address and provide insights into both old and new questions. For example, blogs can provide information not only on what people do, but also on why they do it and how they evaluate their experiences, among many other pertinent subjects. Of course, such data are not perfect, and they have inherent biases embedded in them, such as biases related to income, age and education that influence the uptake of novel technologies. However, if such issues are understood, the large volume of data, the continual addition of new entries and easy accessibility are inherent advantages.
Currently, tourism big data has been applied to four domains. Firstly, tourism big data analysis has been assisting smart tourism development in many destinations and scenic spots. It is widely used in building smart tourism platforms, smart process management, customer profiling, market trend analysis and so on. The second field of application is measuring and evaluating sentiment, attitude, satisfaction and other psychological content. For instance, online tourism business operators often use tourism big data to optimize their website interfaces, product ranges and pricing. Thirdly, content innovation via big data is welcomed by the industry. It involves content analysis of online texts, including travel notes, online tourist reviews and weblogs. In this way, tourism practitioners have more ways to understand tourist characteristics and travel experiences and, further, to forecast market demand and improve their service. Big data opens up more possibilities for both the tourism industry and tourism research. However, how to define the scale, accuracy, and reliability of big data, how to integrate and utilize these multi-source data, and how to capture precise and valuable information from the massive data generated by tourists on the Internet are still questions to be answered by the tourism industry and researchers. Reviewing the current research progress will help us to better understand big data in the tourism area.
The rest of the paper offers a sympathetic review of the environment and studies of tourism big data. It is organized sequentially around the general features, types, analytical techniques and research progress of tourism big data. It ends with a summary and a brief discussion of the future of tourism big data applications and studies.
Types of tourism big data
Various online tourism resources provide both researchers and practitioners with a more comprehensive and effective way to study tourists. The problem is that these data may come from different sources with heterogeneous structures, which poses a great challenge for both researchers and industry practitioners who want to use tourism big data. Currently, there are three main types of big data in tourism: E-commerce data, User-Generated Content (UGC), and temporal-spatial behavior data.
E-commerce data: A tourism business transaction, like booking a hotel or buying an attraction ticket, now easily takes place through online tourism websites or platforms. Tourism E-commerce changes the traditional trading patterns of the tourism industry; at the same time, it generates more valuable data.
Tourism E-commerce data refers to the reservation records or transaction data held by online travel agencies. It is more structured and highly private; it is not commonly open to the public, since it carries high direct commercial value for tourism enterprises. It helps tourism practitioners to forecast market demand and apply targeted marketing strategies. For example, when online travel agencies try to predict the purchase volume of a hotel and set the optimal price for the following peak period, the amount reserved in advance over the year can be one of the powerful indicators. By constructing mathematical models, current studies have verified the value of these exclusive data in analyzing both tourists’ consumption characteristics or satisfaction, and the performance or reputation of tourism organizations [1, 2].
User Generated Content (UGC): With the rapid growth of online social media, professional travel websites, travel forums, and blogs, tourists like to share and post their experiences on these platforms during and after their trips. Online User Generated Content (UGC) is a crucial form of electronic word of mouth (e-WOM) in this era.
There are several types of UGC data. Tourist online reviews, such as short comments or travel notes on scenic spots, hotels or flights, are one of the main types. For example, tourists tend to rate destinations or review tourism services on TripAdvisor, the leading travel community; in turn, these contents may become reliable sources for other tourists when they prepare their travels. In addition, tourists’ daily sharing on social media, especially on Facebook, Twitter, Instagram, and YouTube, is also a primary resource.
UGC receives special attention from researchers for its concision, timeliness, and large volume. Compared with other types of tourism big data, it is unstructured and has no predetermined theme, which may provide limited commercial value but richer information. In addition, it has a high degree of interconnection, which means that it is more accessible. Scholars mainly use these data to study tourists’ satisfaction, perceived credibility, brand value, market prediction and implications for tourism enterprises’ performance [3-5].
Temporal-spatial behavior data: Tourists’ temporal-spatial behavior data is becoming another abundant source of tourism big data. The study of tourists’ spatial and temporal behavior patterns is based on the methodology of Time Geography. This kind of method monitors and records tourists’ movements in time and space directly, using digital tracking technologies such as GPS devices.
Temporal-spatial behavior data is advantageous for collecting information on individuals’ activity processes, and thus for capturing the demand changes and overall evaluation of a group [7, 8]. It is more reliable and valid for understanding the mobility of tourists and further improving the quality of tourists’ experience.
Analytical techniques of tourism UGC data
The volume, variety, velocity, and veracity of big data make manual processing difficult and call for new analytical techniques. The majority of current studies of tourist online reviews are based on objective or quantifiable elements such as hotel quality measurement, service rating, pricing and ordering information. Given the limited length of this paper, it focuses on the techniques for tourism UGC textual data as an example, although plenty of methods are applied in exploring the other two types of tourism big data. Textual UGC is a large body of subjective and qualitative new data in the tourism industry, and aspects such as sentiment, attitude and emotional inclination are still underexplored. Current research, based on methodology from Sociology and Information Science, has developed certain analytical techniques to capture sentiment orientation and emotional features [9-11].
The key to conducting textual UGC analysis is to capture the sentiment of the texts. The precursor of sentiment analysis is affective computing, which processes information together with Natural Language Processing (NLP). These early studies segment the text manually or by computer, and then determine the emotional features based on the type and frequency of vocabulary. As more emotive and subjective texts appear, related research has developed from simple computing to sentiment analysis at the level of sentences, paragraphs, and articles. But manual operation of NLP costs a huge amount of time. Hence various techniques have been developed. Two of the most used approaches are machine learning and lexicon filtering.
Machine learning method: The machine learning method conducts opinion mining and sentiment analysis through prior training on vocabulary, which requires a large corpus and long-term training. Its common approaches include both supervised and unsupervised types [9, 10].
The supervised machine learning approach is based on feature extraction: it trains a classifier using annotated data or labeled corpora, and uses the trained classifier to judge new text. Its common techniques include the Support Vector Machine (SVM) model, the Naïve Bayes model, and the N-gram model [11, 12]. Taking Naïve Bayes as an example, it is a probabilistic classifier that assigns new data to one of several groups using Bayes’ theorem. Scholars have already verified that these different supervised models can all obtain about 80% accuracy and do not differ significantly when the training sample is large enough. On the other hand, the unsupervised machine learning approach basically refers to cluster analysis. This method places data or items in the same group because they are much more similar to each other than to items in other groups. K-means and other statistical models are the main techniques in this approach.
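To make the supervised approach concrete, the following is a minimal sketch of a multinomial Naïve Bayes sentiment classifier written from scratch. It is an illustration under stated assumptions rather than the method of any study cited here: the toy training reviews, the whitespace tokenization, the add-one smoothing, and the function names `train_naive_bayes` and `classify` are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_reviews):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    doc_counts = Counter()               # number of reviews per label
    word_counts = defaultdict(Counter)   # word frequencies per label
    vocabulary = set()
    for text, label in labeled_reviews:
        tokens = text.lower().split()
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest posterior log-probability."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: log P(label)
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Log likelihood: log P(token | label) with add-one smoothing
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy annotated corpus, invented for illustration only.
training_data = [
    ("clean room friendly staff", "positive"),
    ("great view lovely breakfast", "positive"),
    ("dirty room rude staff", "negative"),
    ("noisy street broken shower", "negative"),
]
model = train_naive_bayes(training_data)
print(classify("friendly staff and great breakfast", model))
print(classify("dirty shower and rude staff", model))
```

In practice the annotated corpus would contain many thousands of reviews, and the choice of labels for that corpus is exactly where the reference-standard problem discussed in this section arises.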
Tourism research has widely used the machine learning method because of its reliability and maturity. However, current studies tend to use the positive and negative ratings on online travel websites as the reference standard for the learning corpus that trains the classifier. This may lead to overly optimistic results because of the public’s “social positive tendencies”. At the same time, these ratings and reviews come from the same group of tourists, which introduces a certain degree of circular argument. The logic of the algorithm remains unknown after the learning, so the machine learning approach cannot explain why tourists give more positive or negative reviews on particular subjects in relation to the specifics of tourism activities.
Lexicon filtering method: The lexicon filtering method categorizes the sentiment of a sentence by locating every emotion-related word on the positive-negative spectrum of a pre-set lexicon and aggregating those positions. In order to determine the emotional inclination of a corpus composed of travel logs and online reviews, scholars have already built various sentiment lexicons according to different theories of sentiment structure.
Sentiment lexicons: Lexicon filtering approaches rely on comprehensive sentiment lexicons, which can be built either automatically or semi-automatically. The process first selects a series of typical positive and negative words as seed words, then finds the synonyms and antonyms of these seed words using specific vocabulary resources, and finally calculates the relation between the new words and the seed words [17, 18]. For instance, WordNet-Affect has analyzed 4787 emotional words in total, which can be categorized into four basic emotions, namely happiness, sadness, anger, and fear.
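The seed-word procedure can be sketched as a breadth-first expansion over synonym and antonym links. This is a simplified illustration, not the construction used by any lexicon named in this paper: the tiny `SYNONYMS` and `ANTONYMS` dictionaries are invented stand-ins for a real vocabulary resource, and the "relation" between a new word and the seeds is reduced here to inherited or flipped polarity, whereas real systems compute graded similarity scores.

```python
from collections import deque

# Hypothetical vocabulary resource, invented for illustration: each word maps
# to its synonyms or antonyms. A real study would query a lexical database.
SYNONYMS = {
    "good": ["pleasant", "fine"],
    "pleasant": ["enjoyable"],
    "bad": ["poor"],
    "poor": ["shabby"],
}
ANTONYMS = {
    "good": ["bad"],
    "pleasant": ["unpleasant"],
}


def expand_lexicon(seeds, max_depth=2):
    """Propagate polarity from seed words through synonym and antonym links.

    seeds maps each seed word to +1 (positive) or -1 (negative).
    Returns a dict mapping every reached word to its inferred polarity.
    """
    lexicon = dict(seeds)
    queue = deque((word, 0) for word in seeds)
    while queue:
        word, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the allowed number of hops
        polarity = lexicon[word]
        # Synonyms inherit the polarity; antonyms receive the opposite one.
        neighbours = [(w, polarity) for w in SYNONYMS.get(word, [])]
        neighbours += [(w, -polarity) for w in ANTONYMS.get(word, [])]
        for neighbour, inferred in neighbours:
            if neighbour not in lexicon:  # the first label found is kept
                lexicon[neighbour] = inferred
                queue.append((neighbour, depth + 1))
    return lexicon


lexicon = expand_lexicon({"good": 1, "bad": -1})
for word, polarity in sorted(lexicon.items()):
    print(word, polarity)
```

The `max_depth` limit reflects a common design concern: polarity inferred many hops away from a seed word becomes less trustworthy, which is one reason semi-automatic construction keeps a human in the loop.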
Current lexicon filtering approaches are mostly based on simple word-frequency counting and have yet to consider semantic rules. To address this, some semantic models or rule-based linguistic models have been developed experimentally. For example, Umigon, based on term lists and heuristics, provides a comprehensive system that can not only handle negations, elongated words, and hashtags, but also consider semantic features such as time or subjectivity [9, 20]. The Tourist Sentiment Evaluation (TSE) model, one of the effective lexicon filtering methods, considers both the tourism context and the specific linguistic context. It is presented as a case in the following section.
Case: Tourist Sentiment Evaluation model: Considering that most lexicon filtering approaches are mainly applied to explain netizens’ emotional inclination and public opinion, it is necessary to develop a specific lexicon with semantic rules suitable for the tourism context. The Tourist Sentiment Evaluation (TSE) model has been proposed to solve the above problems.
The TSE model was developed by a research team from Sun Yat-sen University [5, 16, 21]. It aims to explore the sentiment evaluation of tourists based on UGC in the Chinese context. The basic logic of the TSE model is the lexicon filtering method, which assigns a specific polarity to each word and matches these emotional words automatically to determine the sentiment characteristics of each text. Building this model involves three important steps:
• Firstly, build the tourism-specific lexicon. The lexicon is built upon the HowNet dictionary, which includes six basic types of words: positive sentiment, negative sentiment, positive evaluation, negative evaluation, magnitude of degree, and viewpoints. The team further revised the tourism lexicon by reading a large number of travel logs and tourist online reviews. In total, 317 positive words and 185 negative words were selected, covering six major aspects: scenic spot, catering, transportation, accommodation, entertainment, and shopping. Compared with the positive and negative emotional lexicons in the HowNet dictionary, this manually screened lexicon added 298 new words; only 40% of the selected words overlapped with the existing emotional words in the HowNet dictionary. The final, complete tourist sentiment lexicon contains altogether 3,507 positive words and 3,365 negative words.
• The second step is defining lexicon filtering rules that represent semantic logic. Three types of semantic logic, namely magnitude adverbs, negative adverbs, and adversatives, are set with different coefficients.
• Finally, set the emotional multiplier. Studies show that the public prefers to express positive emotion on social media, because it is easier to obtain social recognition. Hence, judging sentiment polarity directly by simple computing would introduce deviation and exaggerate the share of positive reviews. In order to reduce this potential error, the model experimentally determines three emotional multipliers: 3 times, 4 times, and 5 times. This means that a review is judged to be positive only if its positive score is 3, 4, or 5 times higher than its negative score, depending on the multiplier chosen.
By the above definition, six scoring rules are further set and applied to evaluate tourist reviews, as shown in Table 1. The study has also found that Rule C2 is slightly more significant and reliable than the other scoring rules.
| Rule | Specification | Judgment |
|---|---|---|
| Direct Scoring | The rating of each review on the website. | Positive review: Scoring 4-5; Neutral review: Scoring 3; Negative review: Scoring 1 |
| Rule A | Word frequency statistics of each review. | Positive review: Positive words > negative words; Neutral review: Positive words = negative words; Negative review: Positive words < negative words |
| Rule B | Considering Rule A and semantic logics. | Positive review: Positive scorings > negative scorings; Neutral review: Positive scorings = negative scorings; Negative review: Positive scorings < negative scorings |
| Rule C1 | Considering Rule B and the emotional multipliers (3 times) | |
| Rule C2 | Considering Rule B and the emotional multipliers (4 times) | |
| Rule C3 | Considering Rule B and the emotional multipliers (5 times) | |
Table 1: The specification and judgment of 6 scoring rules.
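The scoring rules can be illustrated with a short sketch. Everything specific in it is an assumption made for this example and is not taken from the TSE model: the English toy lexicon, the coefficient values for magnitude adverbs and adversatives, the treatment of a modifier as applying to the next sentiment word, the use of "at least k times" as the Rule C threshold, and the function names `score_review`, `label` and `is_positive_rule_c`. The TSE model itself works with a Chinese tourism lexicon and its own coefficients.

```python
# Toy English lexicon and coefficients, all invented for illustration.
POSITIVE = {"beautiful", "clean", "friendly", "delicious"}
NEGATIVE = {"crowded", "dirty", "expensive", "rude"}
MAGNITUDE = {"very": 2.0, "slightly": 0.5}  # magnitude adverbs scale a word
NEGATION = {"not", "never"}                  # negative adverbs flip polarity
ADVERSATIVES = {"but", "however"}            # text after these weighs more
ADVERSATIVE_WEIGHT = 1.5


def score_review(text, use_semantics=True):
    """Return (positive_score, negative_score) for one review.

    With use_semantics=False this is plain word counting (Rule A);
    with use_semantics=True the semantic adjustments apply (Rule B).
    """
    positive = negative = 0.0
    clause_weight = 1.0
    scale, negated = 1.0, False  # modifiers waiting for the next sentiment word
    for token in text.lower().replace(",", " ").replace(".", " ").split():
        if use_semantics and token in ADVERSATIVES:
            clause_weight = ADVERSATIVE_WEIGHT
        elif use_semantics and token in MAGNITUDE:
            scale = MAGNITUDE[token]
        elif use_semantics and token in NEGATION:
            negated = True
        elif token in POSITIVE or token in NEGATIVE:
            # A negated positive word counts as negative, and vice versa.
            is_positive = (token in POSITIVE) != negated
            weight = scale * clause_weight
            if is_positive:
                positive += weight
            else:
                negative += weight
            scale, negated = 1.0, False  # modifiers are consumed
    return positive, negative


def label(positive, negative):
    """Rules A and B: compare the two scores directly."""
    if positive > negative:
        return "positive"
    if positive < negative:
        return "negative"
    return "neutral"


def is_positive_rule_c(positive, negative, multiplier=4):
    """Rule C: positive only if the positive score reaches multiplier x negative."""
    return positive > 0 and positive >= multiplier * negative


review = "The beach is very beautiful and clean, but the restaurant is not friendly."
rule_a = score_review(review, use_semantics=False)
rule_b = score_review(review, use_semantics=True)
print("Rule A:", rule_a, label(*rule_a))
print("Rule B:", rule_b, label(*rule_b))
print("Rule C2 positive?", is_positive_rule_c(*rule_b, multiplier=4))
```

On this toy review, plain counting sees three positive words and no negative ones, the semantic rules turn "not friendly" into a weighted negative score, and the multiplier then withholds the positive label. This mirrors the stated purpose of the emotional multiplier, which is to offset the tendency toward positive expression on social media.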
The TSE model relies on comprehensive tourist sentiment lexicons and sets of rules to identify the sentiment characteristics of tourists. It has been applied in several studies, visualizing the emotional images of Chinese tourists in different tourist cities in China and in other countries, such as Australia and Sri Lanka. Overall, the TSE model is an innovative attempt at combining sentiment analysis and big-data techniques in a feasible way. It can help researchers access a large volume of real-time data on a specific tourism destination in a highly efficient manner.
The research progress of tourism big data
Using big data is increasingly important because it offers a substantial alternative source of information that could not be reached before, and the development of analytical techniques enables research to reach a comprehensive and deep understanding. Over the past decades, a growing number of tourism studies have achieved significant progress along different dimensions, including the perspectives of tourists, enterprises, and destinations.
The angle of tourists: Tourism researchers have mainly explored tourists’ satisfaction, demand motivation, and travel characteristics by utilizing the tourism big data and the related analytical methods [3,4,14].
Customer experience is a mature and productive theme in these studies. For example, some scholars have tried to identify what types of factors make a hotel guest happy or unhappy. Xiang, et al. collected online reviews of 529 hotels from the Expedia website and analyzed them using K-means cluster analysis, one of the machine learning methods. They found that hotels with unique or salient traits can satisfy their guests, while negative factors related to cleanliness or maintenance make them unhappy. Similar studies have been applied to hospitality services with a larger amount of online data. A comprehensive study comes from Radojevic, et al., who analyzed a total of 2,067,370 online reviews covering 6,768 hotels in 47 capital cities in Europe. This research found that hotel star rating is the most influential factor in tourist satisfaction, and it also identified 8 contributory factors that differ from those in previous research.
Analyzing customer preference to predict potential demand is another important topic from the angle of tourists. Different types of online tourism data are widely utilized. For instance, Yang, et al. found that web traffic volumes play a significant role in accurately predicting customer demand for hotels. These studies capture the travel characteristics of tourists.
The angle of tourism enterprises: Research has verified that tourism enterprises actively use social media to communicate with tourists. Social media is widely used in the areas of marketing, product distribution, and crisis management [24-26]. Hence, the angle of tourism enterprises has become another important study perspective [1,24].
This research urges tourism firms to make good use of online platforms. The research led by Ye and Law demonstrates that there is a significant correlation between hotel online reviews and hotel sales as well as reservations. They even quantified the correlation by pointing out that a 10% increase in online reviews leads to an increase in hotel online reservations of up to 5% [27,28].
Regarding the relationship between the performance of tourism enterprises and their social media activities, the study of Kim, et al. has already confirmed a positive relationship between them. Another study further indicated that the overall rating is the most critical factor in predicting hotel performance, followed by the factor “response to negative reviews”. Online tourist reviews have become an accessible channel for evaluating enterprises’ quality and competence.
The angle of destination: Current studies have a broad view of tourism online reviews on tourists and enterprises, while the analytical scope of these studies is restricted within the service industries, particularly in the hospitality industry, with limited attention to online reviews related to scenic spots as well as tourism activities at a destination or region level.
Taking the destination as the research subject, a study conducted by Pan, et al. concentrated on identifying the attractive factors of Charleston based on travel blogs. They found that historic charm, beaches, and water activities are the major strengths of this city, while weather, infrastructure, and fast-service restaurants have a negative impact on the destination. By contrast, Zhou, et al. used 49,000,000 tagged photos as their database and processed them on a cloud computing platform. Their approach can reliably evaluate the popularity of destinations and scenic spots, and further compare the differences in travel patterns and preferences between locals and tourists.
In addition, some research has focused on how Destination Marketing Organizations (DMOs) use online platforms as an effective channel for marketing destinations. Uşaklı, et al. implemented content analysis on 3,546 tourist postings, involving the DMOs of 50 European countries and 4 main social media platforms. Their results revealed that these DMOs may not take advantage of social media to market the destination effectively. They only used it as a traditional marketing tool, while ignoring its available function of mitigating potential customer problems.
There is no doubt that using big data can reduce interference effects compared with traditional methods based on questionnaire surveys, such as the potential misguidance of an interviewer or the limitations of questionnaire design. Among these studies of big data in tourism, tourists and tourism enterprises have become the two main study subjects, and several critical topics in these areas have already been explored, such as tourist satisfaction, sentiment features, behavioral characteristics, and the performance or reputation of a firm. However, research from the perspective of the destination is still at an early stage of exploration, and it calls for more attention and analysis.
The merits of using tourism big data are obvious. The first one is comprehensiveness. In principle, sampling is not an issue with big data: whichever data you want to collect, the full set of data is there. The only problem is whether you have sufficient techniques to get them. These data are also fully generated by users, whether intentionally or unintentionally, and are available without any precondition or assumption for generating them. Secondly, big data can be a rigorous source of data to compare with traditional tourism data collected from other sources. It can be regarded as a new information flow. The real-time feature is the third characteristic of big data: it can dynamically capture information at any moment, and can even capture live data as they are being generated. Tourism big data, undoubtedly, is positively affecting the tourism industry and research with its superior characteristics.
The engagement between tourism and big data inspires many more convenient and efficient methods and more innovative research themes. A variety of online tourism platforms provides abundant and growing data sources, and constantly updated analytical techniques make sophisticated analysis achievable and visual. Although scholars have conducted various studies to evaluate and analyze tourists, tourism enterprises, and destinations, big data in tourism is still at an early stage, and it shows a bright and broad prospect for the future. By reviewing the general features, types, analytical techniques and research progress, this paper argues that the key to using tourism big data is to make good use of data from multiple sources with heterogeneous structures. This requires sophisticated techniques and innovative analytical frameworks that allow us to explore potential mechanisms that are embedded in our world but hard to identify and verify by traditional methods.
Looking at future research on tourism big data from the perspective of tourists and tourism enterprises, social media and online tourism are changing traditional travel patterns and appear in tourists’ pre-trip, during-trip, and post-trip phases. In return, tourists’ active engagement offers an opportunity for tourism enterprises to identify visitors’ travel characteristics and explore their preferences further in order to provide satisfactory service and a fascinating experience. With the popularization of we-media software and travel apps on mobile devices, new business models are being incubated in this industry.
From the perspective of the destination, tourism big data can be a considerable marketing and communication channel for shaping the brand. Furthermore, smart tourism and new applications of big data in scenic areas will constantly advance the quality and efficiency of destination management. They can provide tourists with a comfortable and intelligent experience; they can also improve the ability of scenic areas both to manage real-time crises effectively and to predict tourist numbers accurately. Overall, tourism big data has very promising potential for application in the future.
Humanity and Social Science Foundation of Ministry of Education of China, (19YJAZH060).
Citation: Chen H, Liu Y, Chen K (2021) Big Data in Tourism: General Issues and Challenges. J Tourism Hospit.S4: 003.
Received: 19-Jul-2021 Published: 09-Aug-2021, DOI: 10.35248/2167-0269.21.s4.003
Copyright: © 2021 Chen H, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.