Mining Social Media Big Data for Health

Advances in information technology (IT) and big data are affecting nearly every facet of the public and private sectors. Social media platforms are one example of such advances: their nature allows users to connect, collaborate, and debate on any topic with comparative ease. The result is a large volume of user-generated content that, if properly mined and analyzed, could help the public and private health care sectors improve the quality of their products and services while reducing costs. The users of these platforms are the key to these improvements, as their feedback will help improve health solutions.

The advantages of harvesting big social media data include readily available bulk data and continuous monitoring, which cuts response time to a minimum. This stands in contrast to traditional social science surveys, which require subject participation and therefore yield very limited sample sizes and long gaps between rounds of data collection.

Despite technical advances in mining and analytic tools, challenges remain in capturing, analyzing, and interpreting big health data and converting it into actionable public health solutions. These challenges include accounting for users’ offline behavior and addressing privacy concerns over the use of acquired data.


There are many methods for analyzing the content of social media networks. Gathering raw data entails finding specific tag words relating to the health topic of interest, which can range from pharmaceutical drugs to pandemics. The next step is to analyze the most frequently used words within the topics of interest in the social media network. These frequently used words give the data gatherers an idea of which topics are being discussed. Researchers may also come across words that require incorporating a specialized medical dictionary into their text analysis. The last step depends on the nature of the research: studying viral outbreaks would require geographical information, while studying side effects would require examining symptoms and public notices from regulatory agencies, in addition to developing medical lexicons and incorporating them into the text analysis.
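The first two steps above (filtering posts by tag words, then counting frequent words) can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the tag words, the stop-word list, and the sample posts are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical tag words for the health topic of interest (assumption).
TAG_WORDS = {"flu", "fever", "cough"}
# A tiny stop-word list; a real study would use a fuller one.
STOP_WORDS = {"a", "and", "the", "i", "my", "is", "have",
              "with", "of", "to", "in", "for", "this"}

def tokenize(post):
    """Lowercase a post and split it into word tokens."""
    return re.findall(r"[a-z']+", post.lower())

def filter_by_tags(posts, tags=TAG_WORDS):
    """Keep only posts that mention at least one tag word."""
    return [p for p in posts if tags & set(tokenize(p))]

def top_words(posts, n=5):
    """Return the n most frequent non-stop words across the posts."""
    counts = Counter(
        tok for p in posts for tok in tokenize(p) if tok not in STOP_WORDS
    )
    return counts.most_common(n)

# Made-up posts standing in for collected social media content.
posts = [
    "I have the flu and a fever",
    "Great game last night",
    "Fever and cough for three days",
    "This cough is the worst",
]
relevant = filter_by_tags(posts)
frequent = top_words(relevant, 2)
print(len(relevant), frequent)
```

Exact-match filtering like this misses misspellings and slang, which is one reason the text points to specialized medical dictionaries for the later steps.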

Another method is to model the social network itself using data analysis software. Such a model lets us look at the two components of the network: the nodes (representing either individuals or organizations) and the edges (connecting the nodes based on ties such as friendship, kinship, or shared interests). Constructing a model allows us to visualize the information and study the network’s internal dynamics (how users disseminate knowledge) and how bonds form over time.
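As a rough sketch of the node-and-edge model, the following Python builds an undirected network from a list of ties, counts each node's connections, and uses a breadth-first search to estimate how many hops a piece of information needs to reach each member. The user names and ties are hypothetical, and dedicated tools such as NodeXL offer far richer analysis than this.

```python
from collections import defaultdict, deque

def build_network(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def degree(graph):
    """Number of direct connections per node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

def spread_steps(graph, source):
    """Breadth-first search: hops needed for a post from `source` to reach each node."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops

# Hypothetical individuals and one organization; an edge stands for any tie.
edges = [("ana", "ben"), ("ben", "clinic"), ("clinic", "dee"), ("ana", "clinic")]
graph = build_network(edges)
degrees = degree(graph)
hops_from_ana = spread_steps(graph, "ana")
print(degrees)
print(hops_from_ana)
```

Here the organization node has the most connections, so it is the natural hub for disseminating information in this toy network.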

The ultimate goal of mining big data in real time is to develop a greater understanding of the interplay among factors affecting health outcomes. This will help consumers, companies, and health policy leaders make informed health care decisions, ranging from personal health choices to resource allocation during a pandemic.

The fundamental method of data mining and analysis is divided into four steps: collection, breakdown, keywords, and patterns. Table 1 summarizes these steps and lists a few of the most popular data and text mining tools. The first step is to collect the data from various social media networks. The second step is to break down the data to search for specific words and phrases on the topic of interest (e.g., research on diabetes might search for insulin, test strips, and similar terms). This task requires a natural language processing lexicon specific to the illness and treatment being studied. The third step is to search for and identify keywords and phrases that link to specific products and services. For this task, current tools make use of existing controlled vocabularies developed specifically for the biomedical sciences (e.g., the Unified Medical Language System (UMLS), maintained by the National Library of Medicine, which incorporates structured medical classification lists such as ICD-10 and SNOMED CT); these allow the mined data to be mapped to relevant, computer-readable medical terms. Advanced methods must also account for where a keyword is placed within the context of a post. The last step is to discover patterns within the keywords and their connections to certain words and phrases, and to convert those patterns into readable data (e.g., customer satisfaction or dissatisfaction with pharmaceutical drugs and services).

Table 1: Data Analytics
The process of data analytics, extraction, and analysis.

Step | Task | Tools and methods
Collection | Analyze social media outlets (from micro-blog posts to professional forums). | RapidMiner, Spinn3r, NodeXL, Weka, Tanagra, Xpresso, Textalytics, etc.
Breakdown | Large posts and complex responses are broken down into keywords and phrases. | NLTK, OpenNLP, Stanford NLP, cTAKES, REEL, ClearNLP, spaCy, General Architecture for Text Engineering (GATE), etc.
Keywords | Find certain words and phrases to identify reviews of products or services. |
Patterns | Search for patterns within the keywords that correlate to responses that resulted in changes of the targeted product or service. | Classification and clustering, probabilistic methods, graph and network analysis
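
The four steps in Table 1 can be illustrated end to end with a small, self-contained Python sketch. Everything here is an assumption made for illustration: the posts are invented, the drug names are placeholders, and the lexicon labels are made-up stand-ins for a controlled vocabulary (they are not real UMLS identifiers). The pattern step simply counts how often a drug and a symptom appear in the same post.

```python
import re
from collections import Counter
from itertools import product

# Step 1 (collection): stand-in posts; a real study would pull these from a platform.
POSTS = [
    "Started DrugA last week, terrible headache since",
    "druga gave me a headache and nausea",
    "DrugB works great, no problems",
]

# Lexicon for step 3: informal terms mapped to made-up concept labels.
# These labels are placeholders, not real UMLS identifiers.
LEXICON = {
    "druga": "DRUG:A",
    "drugb": "DRUG:B",
    "headache": "SYMPTOM:HEADACHE",
    "nausea": "SYMPTOM:NAUSEA",
}

def breakdown(post):
    """Step 2: split a post into lowercase word tokens."""
    return re.findall(r"[a-z]+", post.lower())

def keywords(tokens):
    """Step 3: keep tokens found in the lexicon, as concept labels."""
    return {LEXICON[t] for t in tokens if t in LEXICON}

def patterns(posts):
    """Step 4: count how often each drug and symptom share a post."""
    pairs = Counter()
    for post in posts:
        concepts = keywords(breakdown(post))
        drugs = sorted(c for c in concepts if c.startswith("DRUG:"))
        symptoms = sorted(c for c in concepts if c.startswith("SYMPTOM:"))
        pairs.update(product(drugs, symptoms))
    return pairs

pair_counts = patterns(POSTS)
print(pair_counts)
```

Plain co-occurrence ignores context: a post saying "no headache on DrugA" would be counted the same way as a complaint, which is why keyword placement within a post matters for more advanced methods.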

Building on this basic approach, researchers have studied various topics to understand consumer tendencies and act on information drawn from social network data. A common characteristic of these approaches is to first apply a natural language processing program that extracts relevant information. The big data then becomes more manageable, allowing further analysis and modeling to extract pertinent data.

The health care sector can use this information to develop action plans for recalling or improving products and services. Health care providers and pharmaceutical companies can evaluate the level of satisfaction (or dissatisfaction) with their services among patients. This data can also give doctors feedback from their peers and patients to help improve treatment plans. Additionally, patients can evaluate and draw on other consumers’ knowledge before making crucial healthcare decisions.

Social Media in Health Care

Social media use in the public and private sectors has expanded rapidly. Consumer groups and professional sites have increased direct accessibility between groups and individuals. Direct engagement between the public and private sectors creates opportunities to develop solutions quickly and efficiently. Various businesses and government agencies have invested in social media and seen results [1]. The returns on these investments ranged from improved customer service to improved communication with the public.

In 2009, the Centers for Disease Control and Prevention (CDC) took advantage of social media posts in which users reported everything from possible symptoms to claims of a possible outbreak of the H1N1 virus. In addition to traditional avenues, the CDC used social media to engage the public directly. Both the public and the federal government benefitted from this approach. The general public quickly acquired information to make decisions, such as identifying symptoms and locations to avoid due to a possible outbreak. The government mobilized resources intelligently and prevented mass panic. A 24-hour informational hotline was created to supplement press briefings for media and health alert networks; daily postings to the CDC 2009 H1N1 website, Facebook, and Twitter; and partnerships with other organizations to reach additional audiences [2]. In the wake of the pandemic, the CDC created the “Predict the Influenza Season Challenge” competition, which encouraged participants to develop modeling tools that predicted seasonal flu activity based solely on information gathered from social media networks [3].

Depression, a growing global health problem, is receiving increasing attention from researchers. Social media networks allow patients diagnosed with depression to share their thoughts and connect with other patients and doctors. Researchers argue that monitoring these networks in real time will give health policymakers a more accurate overview of depression levels in their populations than traditional annual surveys. The methods used to collect information include identifying (and cross-referencing) key words used in clinical depression circles and developing (and training) models to predict the user’s ‘mood’ in a post [4-6].
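To make the idea of training a model to predict the mood of a post more concrete, here is a minimal sketch of one common text classification technique, multinomial naive Bayes with add-one smoothing. It is not the method used in the cited studies, which this article does not describe in detail. The training posts and the labels "low" and "positive" are invented for illustration, and a classifier this small has no clinical validity.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a post and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.label_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

# Made-up training posts with made-up labels, for illustration only.
train_texts = [
    "feeling hopeless and tired again",
    "so tired of everything, can't sleep",
    "had a great day with friends",
    "happy and excited about the trip",
]
train_labels = ["low", "low", "positive", "positive"]

model = NaiveBayes().fit(train_texts, train_labels)
first = model.predict("tired and hopeless")
second = model.predict("great trip with friends")
print(first, second)
```

In practice, the training labels would come from validated sources such as clinical questionnaires, and the word lists would be cross-referenced against terms used in clinical depression circles, as described above.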

Depression, if left untreated, leads to multiple mental and physical health problems. Suicide is one risk for depressed patients. Researchers have been studying how to predict suicide risk by identifying content in social media networks that serves as a ‘red flag’ for suicidal tendencies [7]. These methods would contribute to more rigorous suicide screening and prevention programs that would greatly benefit counselors and patients.

Pharmacovigilance is another area that can benefit from big data mining. Adverse drug reactions (ADRs) are the biggest risk associated with taking pharmaceutical drugs. Agencies have taken steps to monitor the ADRs of drugs on the market while developing programs that monitor social media networks to enhance their alert systems for drug side effects [8]. These methods allow consumers in social media networks to voice their concerns over ADRs and bring them to the attention of regulatory agencies and pharmaceutical companies.


The nature of social media, from the specific vocabulary used among its members to the inconsistent content of posts, makes data analysis difficult. Consequently, several methods have been developed and launched to solve the technical challenges associated with gathering and analyzing data collected from social media. Researchers have developed tools for mining data by topic and for assessing outliers that reflect either unusual side effects or misinformation.

Users’ offline behavior is one limitation of these studies. The offline environment (relationships, socioeconomic status, physical environment) can provide valuable clues about how and why users came to be diagnosed with clinical depression. Offline behavior can also provide clues to the progression of a disease outbreak. One way to obtain this information is to analyze users’ other social media accounts and their web search histories in combination with their health records [6].

The above solution gives rise to privacy concerns. Users may be concerned with how their online and offline data will be used, even though most of their online data is available on public websites. Researchers can alleviate these fears by making sure that patients understand the nature of the research and by giving them the option of withdrawing from the study. Further, researchers can take steps to ensure the anonymity of the users.
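One possible step toward protecting user identity is to pseudonymize records before analysis. The sketch below, using only Python's standard `hmac`, `hashlib`, and `re` modules, replaces the account name with a keyed hash and masks mentions, links, and e-mail addresses in the post body. The record layout, the key, and the masking patterns are assumptions for illustration. Note that pseudonymization alone does not guarantee anonymity, since the remaining text can still identify a person.

```python
import hashlib
import hmac
import re

def pseudonymize(user_id, secret_key):
    """Replace a user ID with a keyed hash so the same user always maps to the same token."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def scrub_text(text):
    """Mask e-mail addresses, URLs, and @mentions inside the post body."""
    text = re.sub(r"\S+@\S+\.\S+", "[email]", text)
    text = re.sub(r"https?://\S+", "[url]", text)
    text = re.sub(r"@\w+", "[user]", text)
    return text

def anonymize(record, secret_key):
    """Return a copy of the record with the user ID hashed and the text scrubbed."""
    return {
        "user": pseudonymize(record["user"], secret_key),
        "text": scrub_text(record["text"]),
    }

# The key is a placeholder; in practice it would be stored apart from the data.
KEY = b"replace-with-a-secret-key"
record = {
    "user": "@jane_doe",
    "text": "Thanks @dr_lee, see https://example.org or mail me at jd@example.org",
}
safe = anonymize(record, KEY)
print(safe)
```

Using a keyed hash rather than a plain hash means that someone without the key cannot recompute the token from a known account name, while researchers can still group posts by the same (unnamed) user.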

The ‘digital divide,’ a social and economic divide that restricts access to information and communication technology, is another challenge. Samples drawn from social media, although vast, leave out people without access and can therefore either underestimate or overestimate the spread of diseases. This can lead to mistakes at the policy level that negatively affect the public. Health care organizations should be aware that, while consumer feedback from social media networks can be helpful, the digital divide should also encourage them to seek offline sources to confirm the information derived from social media networks.
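A small worked example shows how an unrepresentative sample can skew an estimate, and how offline information can correct it. The sketch below applies post-stratification, a standard survey technique chosen here only for illustration, in which each group's rate is weighted by that group's share of the real population. All numbers are invented, and the population shares would have to come from an offline source such as a census.

```python
def raw_rate(sample):
    """Unweighted rate: all sampled users pooled together."""
    cases = sum(c for c, _ in sample.values())
    users = sum(u for _, u in sample.values())
    return cases / users

def weighted_rate(sample, population_share):
    """Post-stratified rate: weight each group's rate by its population share."""
    return sum(
        population_share[group] * cases / users
        for group, (cases, users) in sample.items()
    )

# Made-up numbers: (users reporting symptoms, users sampled) per age group.
# Younger users are over-represented online in this toy example.
sample = {"under_40": (80, 800), "40_plus": (40, 200)}
# Made-up population shares, assumed to come from an offline source.
population_share = {"under_40": 0.5, "40_plus": 0.5}

unweighted = raw_rate(sample)
adjusted = weighted_rate(sample, population_share)
print(round(unweighted, 3), round(adjusted, 3))
```

In this toy case the pooled sample suggests a 12% rate, while the weighted estimate is 15%, because the group with the higher rate is under-represented online. Weighting cannot help with groups that are absent from the sample altogether.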

Another challenge is access to the data itself: some social media platforms allow only restricted access to user posts through their public APIs, the interfaces through which raw data is downloaded programmatically.
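Restricted access often shows up in practice as paging and rate limits. The following sketch shows one way a collector might page through results and back off when throttled. It does not use any real platform's API: the `fetch_page` callable, the `RateLimited` exception, and the cursor scheme are all hypothetical, and a fake client stands in for the network so the example runs on its own.

```python
import time

class RateLimited(Exception):
    """Raised by the (hypothetical) client when the platform throttles a request."""

def fetch_all(fetch_page, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Collect every page from `fetch_page(cursor) -> (posts, next_cursor)`.

    A throttled request is retried with exponential backoff.
    `next_cursor` is None on the last page.
    """
    posts, cursor = [], None
    while True:
        for attempt in range(max_retries):
            try:
                page, next_cursor = fetch_page(cursor)
                break
            except RateLimited:
                sleep(base_delay * 2 ** attempt)
        else:
            raise RuntimeError("gave up after repeated rate limiting")
        posts.extend(page)
        if next_cursor is None:
            return posts
        cursor = next_cursor

# A fake client standing in for a real platform API (assumption).
PAGES = {None: (["post 1", "post 2"], "p2"), "p2": (["post 3"], None)}
calls = {"n": 0}

def fake_fetch(cursor):
    calls["n"] += 1
    if calls["n"] == 2:  # throttle the second request once
        raise RateLimited()
    return PAGES[cursor]

delays = []  # record requested waits instead of actually sleeping
all_posts = fetch_all(fake_fetch, sleep=delays.append)
print(all_posts, delays)
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; a real collector would leave the default `time.sleep` in place and follow the platform's published limits.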


Social media networks, when combined with big data applications and health policymaking, will require a broad framework that enables the development of smart public health applications, resulting in high-quality health delivery and reduced costs. Interpreting posts in context will require the development of more advanced lexicons of formal language, of specific terms based on forum topics, and of informal dictionaries that clarify potentially confusing posts for those outside the network. Social media platforms may require several networks that give users multiple viewing options, to ensure up-to-date information in the event that one network stops working. The large amount of data available through social media offers the promise of discovering associations and understanding patterns and trends that may help healthcare stakeholders adjust clinical pathways.

Moreover, healthcare stakeholders can use social media networks to greatly expand their communication and engagement with consumers, giving the latter a greater say in product and service development. This will be a crucial precursor to the development of highly personalized health solutions.

The aforementioned solutions and advances can lead to further advances in post-marketing and intelligence gathering on products and services from formal (federal government) and informal (social media platforms) sources, allowing stakeholders to respond more rapidly to consumer sentiment and remain competitive.

With social media implemented as part of a broader IT strategy, public health leaders, working with their international counterparts and the private sector, can proactively develop solutions that lead to further advances in research and development. This will be crucial as we enter the era of personalized medicine, where every patient will require a unique solution based on his or her physiological and genomic data. The investments that the private and public sectors make in social media (and IT) to engage and work with users will go a long way toward improving the quality of products and services while reducing healthcare costs.

To conclude, social media network data mining and analysis in the era of global healthcare will greatly benefit patients, national and international health agencies, and the private sector in the development and execution of smart public health policies that ensure higher-quality healthcare delivery at lower costs.


  1. Keckley, L., “Social Networks in Health Care.” Deloitte Center for Health Solutions, Deloitte LLP, New York, NY, 2010.
  2. “The 2009 H1N1 Pandemic: Summary Highlights, April 2009-April 2010.”
  3. “CDC Competition Encourages Use of Social Media to Predict Flu.”
  4. Park, M., Chiyoung, C., Meeyoung, C., “Depressive Moods of Users Captured in Twitter,” In Proc. ACM SIGKDD Workshop on Healthcare Informatics (HI-KDD).
  5. Larsen, M., Boonstra, T., Batterham, P., O’Dea, B., Paris, C., Christensen, H., “We Feel: Mapping Emotion on Twitter,” Journal of Biomedical and Health Informatics, Vol. 19, No. 4, pp. 1246-1252.
  6. De Choudhury, M., Counts, S., Horvitz, E., “Social Media as a Measurement Tool of Depression in Populations,” Proceedings of the 5th Annual ACM Web Science Conference, pp. 47-56.
  7. Burnap, P., Colombo, G., Scourfield, J., “Machine Classification and Analysis of Suicide-Related Communication on Twitter,” Proceedings of the 26th ACM Conference on Hypertext & Social Media. Association for Computing Machinery, pp. 75-84.
  8. Sarker, A., Ginn, R., Nikfarjam, A., O’Connor, K., Smith, K., Jayaraman, S., Upadhaya, T., Gonzalez, G., “Utilizing Social Media data for Pharmacovigilance: A Review,” Journal of Biomedical Informatics, Vol. 54, pp. 202-212.