INTRODUCTION
With the advent of the Fourth Industrial Revolution, efforts to apply artificial intelligence (AI) and machine learning in the medical field are actively underway [1,2]. In particular, imaging diagnosis, disease diagnosis, and prediction using clinical data and genomic big data are the medical applications of AI that currently receive the most attention [3,4]. AI technologies associated with natural language understanding (NLU) are also being used in healthcare [5]. Conversational AI is an application of NLU and refers to AI technology that can talk to people, including chatbots and virtual agents [6].
Unlike written text-based chatbots, a computer system that can communicate by voice is called a spoken dialog system (SDS) [7]. Unlike command-and-control speech systems, which simply answer requests and cannot sustain a conversation, an SDS can maintain conversational continuity over long periods of time. SDSs are already part of everyday life through in-home AI speakers, such as Amazon Alexa (Amazon, Seattle, WA, USA) [8]. Moreover, conversational AI is being applied in various medical fields, such as patient education, medical appointments, and voice-based electronic medical record (EMR) creation [7,9]. Recent attempts have been made to use conversational AI to collect medical data, such as patient-reported outcomes, health status checks and tracking, and remote home monitoring [7,10].
In assessing patients with spinal disease, the doctor-patient dialogue about pain is the first step in diagnosis, and a pain questionnaire is the most important tool during follow-up after treatment or spine surgery. The purpose of our study was to develop an SDS for a pain questionnaire for patients with spinal diseases. We aimed to evaluate user satisfaction and validate the performance accuracy of the system among medical staff and patients. This study is a preliminary study for the development of an interactive medical robot; based on its results, a follow-up study on a robot-based interactive questionnaire is planned.
MATERIALS AND METHODS
1. Development of the Pain Questionnaire Protocol
First, a pain questionnaire protocol for an SDS was developed, divided into preoperative and postoperative pain questionnaires to assess the outcomes of patients undergoing spine surgery. The questionnaire consisted of questions designed to reflect the actual conversation between medical staff and patients; the items were created based on the questions that medical staff usually ask during rounds of inpatients. The protocol included questions about the location, type, influencing factors, intensity, time of onset, and duration of pain. In addition, questions about the patient's psychological state, regarding mood, anxiety, and sleep quality, were included as indirect indicators of pain. In the postoperative questionnaire, the pain items were replaced with items about pain at the surgical site, and a question about whether the patient's preoperative pain had improved was added; the questions about psychological status were the same as the preoperative questions. Each question was structured in a closed-question format so that the pain questionnaire system could easily process the patients' responses. The developed pain questionnaire protocol is shown in Table 1.
2. Collection of Dialogue Dataset for the SDS
To build a database of patients' various expressions for NLU, real doctor-patient dialogue sets were collected. The study was approved by the Institutional Review Board (IRB No. 1905-023-079), and informed consent was obtained from all patients. A total of 1,314 dialogue sets were collected from 100 hospitalized patients who underwent spinal surgery between September 2019 and August 2021; one dialogue set was defined as one question and one answer. The age range was 22–82 years (mean, 62.6 years), and 47 patients were male. There were 48 spinal stenosis, 13 herniated disc, 13 spinal infection, 11 spinal tumor, 8 spinal deformity, 4 spine trauma, and 3 myelopathy cases.
Three doctors asked inpatients questions naturally, following the pain questionnaire protocol during rounds, and the conversations were recorded with a voice recorder. The preoperative pain questionnaire was used the day before surgery, and the postoperative pain questionnaire was used between 3 and 7 days after surgery. The recordings were transcribed to text and stored in a database for NLU. Additionally, virtual conversations among the researchers were collected, and a total of 2,000 dialogue sets were used for the database.
3. Development of the SDS
The SDS was structured as shown in Fig. 1. The patient's spoken response was entered into the speech recognition module and converted into text data, which served as the input to the NLU module. The NLU module understood the user's intention by analyzing the intents, named entities, and keywords in the user's answers. The output of the NLU module was then passed to the dialog management module, which managed the flow of conversation between the user and the SDS; it searched the database for information to be given to the user and outputted the content needed for system utterance. The system utterance was automatically generated as text by a natural language generation (NLG) module and then fed into a speech synthesis module, which completed utterance generation by outputting the result as a voice the user could understand. The pain questionnaire SDS was developed using Python 3.8 for Windows 10. IBM Watson Text-to-Speech (IBM, Armonk, NY, USA) was used for the utterances of the SDS. NLU was performed using IBM Watson Assistant and KoNLPy to understand the patient's intent from the text data [11]. Speech recognition was performed using the Google Cloud Speech-to-Text module (Google, Mountain View, CA, USA).
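To make the module chain concrete, the following is a minimal sketch of one conversational turn, assuming the cloud services named above. The client calls follow the public google-cloud-speech and ibm-watson Python SDKs, but the credentials, voice name, and the NLU/dialog-management stubs are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of one SDS turn (illustrative wiring, not the authors' code).
from google.cloud import speech                        # speech recognition (STT)
from ibm_watson import TextToSpeechV1                  # speech synthesis (TTS)
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

stt_client = speech.SpeechClient()
tts_client = TextToSpeechV1(authenticator=IAMAuthenticator("<api-key>"))
tts_client.set_service_url("<service-url>")            # placeholder credentials

def recognize(audio_bytes):
    """Speech recognition module: patient's voice -> text."""
    config = speech.RecognitionConfig(language_code="ko-KR")
    audio = speech.RecognitionAudio(content=audio_bytes)
    response = stt_client.recognize(config=config, audio=audio)
    return response.results[0].alternatives[0].transcript

def nlu(text):
    """NLU module (stub): Watson Assistant and KoNLPy would analyze
    intents, named entities, and keywords here."""
    return "pain_location", {"site": text}

def dialog_manager(intent, entities):
    """Dialog management module (stub): picks the next system utterance."""
    return "How severe is the pain?"

def synthesize(text):
    """Speech synthesis module: system utterance text -> voice."""
    result = tts_client.synthesize(
        text, accept="audio/wav", voice="<korean-voice>"  # placeholder voice
    ).get_result()
    return result.content

def sds_turn(patient_audio):
    """One turn: STT -> NLU -> dialog management -> (NLG) -> TTS."""
    answer_text = recognize(patient_audio)
    intent, entities = nlu(answer_text)
    next_utterance = dialog_manager(intent, entities)  # NLG folded into DM here
    return synthesize(next_utterance)
```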
After analyzing the dialogue datasets obtained from the patients, the patients' intents expressing the character of pain and psychological state were classified into 95 intents in IBM Watson Assistant. A total of 1,229 expression examples were registered as user examples under these intents, and a total of 770 examples for timing, duration, and influencing factors were registered as named entities.
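For concreteness, intents and entities in Watson Assistant (v1) workspaces are structured as shown in the fragment below; the entry names, Korean expressions, and synonyms are hypothetical examples, not taken from the study's actual 95 intents.

```python
# Illustrative fragment of a Watson Assistant workspace definition
# (hypothetical example entries).
workspace_fragment = {
    "intents": [
        {
            "intent": "pain_character_tingling",
            "examples": [                          # patients' expression examples
                {"text": "다리가 저려요"},           # "My leg tingles"
                {"text": "찌릿찌릿한 느낌이에요"},    # "It feels like pins and needles"
            ],
        },
    ],
    "entities": [
        {
            "entity": "pain_timing",               # timing/duration/influencing factors
            "values": [
                {"value": "아침", "synonyms": ["아침에", "기상 직후"]},  # "morning"
                {"value": "걸을 때", "synonyms": ["보행 시"]},          # "when walking"
            ],
        },
    ],
}
```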
Fig. 2 shows the conversation flow of the pain questionnaire SDS implementing the questionnaire protocol in Table 1. The SDS starts by entering a unique identification number (UID) that anonymizes the patient's personal information and stores it in the virtual EMR. When the UID is entered, the SDS checks whether the UID exists in the database; if it exists, the SDS starts asking questions after repeating the previous questionnaire's summary. During the questionnaire, the SDS checks whether answers to the 10 questions have been obtained from the patient. If proper information has not been obtained, the SDS repeats the question until it is; when the patient's answer is not recognized, the SDS utters similar questions rather than repeating the same question. When the patient's answer is properly recognized, the Q-learning status is updated to determine the next question, and the SDS checks whether all answers have been obtained. When they have, the SDS utters the summarized result and finishes the questionnaire after saving the results as text in the virtual EMR. An example of the questionnaire results transmitted to the virtual EMR is shown in Fig. 3.
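A minimal sketch of this control flow is given below, assuming a simple slot-filling loop over the questions. The question texts, paraphrase lists, and I/O helpers are hypothetical stand-ins, and the Q-learning update is reduced to a placeholder comment.

```python
import random

def speak(text):
    """Stand-in for the TTS module."""
    print("SDS:", text)

def recognize_answer():
    """Stand-in for the STT + NLU modules; returns None when unrecognized."""
    return input("Patient: ") or None

def summarize(answers):
    return "Summary: " + ", ".join(f"{k}={v}" for k, v in answers.items())

# Hypothetical protocol: each slot pairs a main question with paraphrases
# to utter when the previous attempt was not recognized.
QUESTIONS = {
    "location":  ["Where does it hurt?", "Which part of your body is painful?"],
    "intensity": ["How severe is the pain?", "How would you score the pain?"],
    # ... 8 more slots in the actual 10-question protocol
}

def run_questionnaire(uid, database):
    """Sketch of the Fig. 2 flow: UID check, slot filling, summary, save."""
    if uid in database:                        # returning patient:
        speak(summarize(database[uid]))        # repeat the previous summary
    answers = {}
    for slot, paraphrases in QUESTIONS.items():
        while slot not in answers:
            speak(random.choice(paraphrases))  # vary wording instead of repeating
            reply = recognize_answer()
            if reply is not None:
                answers[slot] = reply
                # placeholder for the Q-learning status update that chooses
                # the next question in the real system
    speak(summarize(answers))
    database[uid] = answers                    # save to the virtual EMR
    return answers
```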
Supplementary video clip 1 shows an actual conversation between the SDS and a participant.
4. Validation and User Satisfaction for the SDS
User satisfaction and the performance accuracy of the developed pain questionnaire SDS were evaluated. Validation of the SDS was performed in 3 user groups, doctors, nurses, and patients, with 10 volunteer participants in each group. The study was approved by the Institutional Review Board of Pusan National University Hospital (IRB No. 2012-010-097), and informed consent was obtained from all participants. The participants were pretrained to engage in routine conversations rather than simple short-answer exchanges; they were given basic information about the purpose of the study and the SDS, and we helped them adapt to conversing with the SDS. The mean ages of the doctors, nurses, and patients were 35.3 years (range, 25–47 years), 31.2 years (range, 21–58 years), and 64.0 years (range, 48–82 years), respectively. The male-to-female ratios among doctors, nurses, and patients were 9:1, 10:0, and 5:5, respectively. Validation was performed with the participant sitting on a bed in an inpatient ward. The SDS ran on a laptop placed on a bed table; pressing the start button initiated the conversation automatically. The SDS first asked a question about pain, recognized the answer, and followed up with further questions. After the last question and answer, the SDS uttered the summarized result to the user and ended the program. Immediately after the test, the participants completed a user satisfaction questionnaire about the SDS, which consisted of 10 items, including the accuracy of the SDS's voice, the degree of similarity to human conversation, and overall satisfaction, rated on a 7-point Likert scale [7].
5. Statistical Analysis
To verify the performance accuracy of the SDS, the recognition errors in patients' answers, summary errors, the causes of these errors, and omissions in the summarized comments were analyzed. User satisfaction and accuracy were compared between the participant groups using 1-way analysis of variance with post hoc Tukey honestly significant difference analysis. A p-value of < 0.05 was considered statistically significant.
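As a sketch of this analysis, the fragment below runs a 1-way ANOVA followed by Tukey's HSD using scipy and statsmodels; the satisfaction scores are synthetic, generated only to illustrate the procedure, and are not the study's results.

```python
# Illustrative analysis pipeline with synthetic data (not study results).
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Hypothetical 7-point Likert satisfaction scores, 10 raters per group.
doctors  = rng.integers(4, 8, size=10)
nurses   = rng.integers(5, 8, size=10)
patients = rng.integers(3, 8, size=10)

f_stat, p_value = f_oneway(doctors, nurses, patients)   # 1-way ANOVA
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")

scores = np.concatenate([doctors, nurses, patients])
groups = ["doctor"] * 10 + ["nurse"] * 10 + ["patient"] * 10
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))    # post hoc Tukey HSD
```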
DISCUSSION
Conversational AI is increasingly being used in the healthcare field [6,9]. Conversational AI, such as voice chatbots and voice assistants, can provide primary medical education services that answer common questions based on knowledge databases; for example, if people ask about first aid for a fever or insect bites, the SDS can explain the treatment by voice [12]. Recently, hospitals have been actively introducing doctor appointment services using chatbots [13]. Currently, the most actively researched field is document automation through voice recognition [7,14]. Speech recognition technology can dramatically reduce the time doctors and nurses spend writing medical records by automatically entering data into the records; it has been reported that this technology reduces the burden and fatigue experienced by doctors and nurses and increases the time spent caring for patients [15]. In addition, conversational AI can be used to automate patient data collection, as the SDS used in this study can collect important medical history and patient-reported outcomes. It can also be used for remote home monitoring [9,10,15].
The term "voice-based conversational AI" is used interchangeably with "chatbot" and "voice assistant"; however, the more specialized term is "spoken dialogue system." An SDS can be defined as dialog software that can communicate with people by voice [16]. An SDS combines several language technologies, including speech recognition, NLU, and NLG. Dialogue systems can be broadly classified into 4 categories depending on whether the dialogue is open or closed and whether the system is based on a retrieval or a generative model [17]. A retrieval model-based dialogue system conducting a closed conversation responds to a specific topic with a premade answer; the pain questionnaire SDS is based on a retrieval model that allows a closed conversation. In many dialogue systems, the user initiates the conversation, and the conversation flow is determined by the user requesting information from the dialogue system [16]. In contrast, the pain questionnaire SDS in our study takes the initiative in the conversation, asking for and processing information from the patients.
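In its simplest form, a retrieval model of this kind is just a mapping from recognized intents to premade responses, as in the toy sketch below; the intent names and responses are hypothetical, not entries from the deployed system.

```python
# Toy retrieval model: recognized intent -> premade answer
# (hypothetical intent names and responses).
PREMADE_ANSWERS = {
    "greeting":      "Hello, I will ask you a few questions about your pain.",
    "pain_improved": "I am glad your pain has improved.",
}

def retrieve_response(intent):
    """Closed conversation: unknown intents fall back to a re-prompt."""
    return PREMADE_ANSWERS.get(intent, "Could you say that again, please?")
```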
This SDS was developed to be mounted on a medical assistant robot that provides medical services to inpatients, especially those undergoing spinal surgery, since pre- and postoperative pain assessments in these patients are the most important items for diagnosis and treatment follow-up. Accordingly, the conversation flow of the SDS followed the pre- and postoperative pain assessments for inpatients with spinal diseases. Although the SDS was developed with a focus on inpatients, the general content of the conversation means it could also be used for first outpatient visits or remote monitoring.
In the user satisfaction evaluation of the SDS, there was no statistical difference in satisfaction among the 3 groups, but the satisfaction of nurses was slightly higher than that of doctors and patients. In the nurse group, there were no summary errors, so the overall accuracy was high, and the expectations of a group with a high actual workload regarding the use of the SDS may also have been reflected. On the other hand, doctors showed relatively low satisfaction, presumably because the accuracy of the SDS did not meet their expectations, as they require a high level of information accuracy. As for the patients, their mean age was relatively high, so unfamiliarity with digital systems may have contributed to the low score. In particular, on item Q1, patients showed significantly lower satisfaction than the medical staff, suggesting that their understanding of the SDS's questions may have been low; the question content and delivery should therefore be upgraded to be easier for elderly patients to understand.

In the performance evaluation of the SDS, recognition errors were significantly more frequent in the patient group. This high error rate may reflect the many unstructured utterances that occurred because patients' answers were long, specific, and varied, as in routine conversation. In addition, patients' voices tended to be low in volume and unclear, making recognition errors more likely. In contrast, because of their prior training for natural conversation, doctors and nurses tended to intentionally give clear and simple answers that the SDS could recognize. There were also cases in which users could not predict when the SDS would finish speaking and answered before the end of the question; it is therefore necessary to improve usability by adding system feedback that lets patients anticipate the end point of the SDS utterance. Finally, when users gave an answer containing multiple contents, the SDS recognized only one of them. For example, when answering about the location of pain, some users complained of pain in several locations, including the back, buttocks, and legs, but the SDS recognized only one of the 3 pain sites. This is because the SDS fills each slot by selecting only one keyword from the user's answer; the SDS should be upgraded to recognize such answers, for example as sketched below.
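One possible upgrade is to let the location slot hold every matching keyword instead of only the first. The sketch below illustrates the idea with a hypothetical keyword list; it is not part of the deployed system.

```python
# Hypothetical multi-keyword slot filling for pain location; the keyword
# list is illustrative, not the system's actual vocabulary.
PAIN_SITES = {"허리": "back", "엉덩이": "buttock", "다리": "leg", "목": "neck"}

def extract_pain_sites(answer):
    """Return every pain site mentioned, not just the first keyword."""
    return [site for keyword, site in PAIN_SITES.items() if keyword in answer]

# "My back hurts, and my buttocks and legs tingle."
print(extract_pain_sites("허리가 아프고 엉덩이랑 다리가 저려요"))
# -> ['back', 'buttock', 'leg']
```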
To improve the overall accuracy of the SDS, the current voice recognition technology must be significantly improved. Despite the rapid development of voice recognition, its recognition rate is still 80% or less, which is not adequate for medical information that requires high accuracy [18]. In addition, it is important to build up the vocabulary and sentences covering patients' expressions by collecting more dialog sets from actual conversations between medical staff and patients. However, because doctor-patient dialog is protected by the patient's right to privacy, collecting a large number of dialog sets is challenging, unlike general dialog sets that can easily be obtained from the internet.
To date, no commercialized conversational AI for collecting medical information through voice conversations with patients has been developed. Conversational AI for the collection of medical information could reduce the time and effort required of medical staff by automating the questionnaire at the first outpatient visit. It is also expected to be applied in telemedicine and remote patient monitoring, which are receiving increasing interest due to the recent coronavirus disease 2019 pandemic. In particular, for older patients, the collection of patient-reported outcomes using text-based chatbots or apps is limited by presbyopia and difficulty in using smart devices; remote monitoring using conversational AI would therefore be especially useful in elderly patients. If clinical decision-supporting AI and conversational AI are combined in the future, they could be applied to software in medical devices for diagnosis, treatment, and prevention, beyond collecting medical information [9].
An SDS can be used for remote pain monitoring of spinal patients by automating pain questionnaires and can shorten consultation time by automating initial consultations. The collection of pain information can then be automated through follow-up of the patient before and after surgery, which can help in tracking the patient's prognosis. By performing additional pain questionnaires frequently, alongside the pain evaluations made by medical staff during rounds, pain can be evaluated more often while reducing the medical staff's workload.
A limitation of this study is the small number of test subjects; thus, there may be bias in the evaluation of user satisfaction and performance accuracy. Nevertheless, our study reports the first development of conversational AI for a spinal pain questionnaire. It can also provide an important starting point and reference for future related research, as our findings validate the system's accuracy and user satisfaction among real patients and medical staff. In the future, we hope to improve the SDS and evaluate user satisfaction and performance accuracy in a large sample of patients.