Question
· December 4, 2023

Custom Application Metric

I made a custom application metric, imported it to the USER namespace and used:

set status = ##class(SYS.Monitor.SAM.Config).AddApplicationClass("historymonitor.errorSensor", "USER")

to add it. When I do 'w status' it returns 1, so it is added, but I still can't see the custom metric in the api/monitor/metrics endpoint, even though I added %DB_USER to the application roles for api/monitor.

Does anyone know where the problem might be that the metrics endpoint still doesn't show my metric?

Article
· December 1, 2023 · about 13 min read

"What's taking so long?" - Process Sampling for Performance Analysis

When there's a performance issue, whether for all users on the system or a single process, the shortest path to understanding the root cause is usually to understand what the processes in question are spending their time doing.  Are they mostly using CPU to dutifully march through their algorithm (for better or worse); or are they mostly reading database blocks from disk; or mostly waiting for something else, like LOCKs, ECP or database block collisions?

Tools to help answer the questions above have always been available in various forms. You start with ^JOBEXAM or the Management Portal's Process view to see a process's Routine, its State, and other tidbits, refreshing frequently to get a sense of what is dominating the process's time. You might then use ^mgstat or ^GLOSTAT to measure total system throughput, or use ^LOCKTAB or ^BLKCOL to see if there are sources of LOCK conflicts or block collisions, though it's not always clear how observations at this level reflect on the processes in question.  Lower-level tools like 'iris stat' or OS-level profiling can provide more direct evidence, but involve making inferences about what's going on inside the database kernel. Debuggers and ^%SYS.MONLBL can surely answer a lot of these questions but usually aren't appropriate for use on working production systems.

I created ^PERFSAMPLE to make narrowing in on the root cause of performance issues in the wild quicker and more straightforward. It's been available in InterSystems IRIS since version 2021.1. PERFSAMPLE samples the state of a set of processes at high frequency, then sorts and counts the sampled data along various dimensions: the process's current routine and namespace, its state string (e.g. GSETW), whether the state is one that indicates waiting or using CPU, the wait state within the database kernel if any, and the PID being sampled (if multiple). The UI then allows you to see the sorted values for each dimension and dig into them in an order of your choosing.
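Conceptually, that sort-and-count across dimensions works like building a pivot table over the samples. A toy sketch of the aggregation (illustrative only — invented sample values, and not how PERFSAMPLE is actually implemented):

```python
from collections import Counter

# Each sample records one dimension tuple: (using_cpu, state, wait_state).
# These values are made up purely for illustration.
samples = [
    ("yes", "RUN",  None),
    ("yes", "GGET", None),
    ("no",  "GGET", "diskio"),
    ("yes", "RUN",  None),
]

def breakdown(samples, dims, filters=()):
    """Count samples along the next dimension, restricted to the filter path.

    `dims` is the chosen dimension hierarchy (tuple indices); `filters` is the
    path already drilled into, as (dimension, value) pairs.
    """
    matching = [s for s in samples if all(s[d] == v for d, v in filters)]
    level = dims[len(filters)]
    counts = Counter(s[level] for s in matching)
    total = len(samples)
    # Report as %-total, like PERFSAMPLE's default display.
    return {k: 100.0 * v / total for k, v in counts.items()}

print(breakdown(samples, dims=[0, 1]))                       # {'yes': 75.0, 'no': 25.0}
print(breakdown(samples, dims=[0, 1], filters=[(0, "yes")]))  # drill into Using CPU? [yes]
```

Drilling in simply adds another (dimension, value) pair to the filter path, mirroring pressing '+' in the UI.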

Using PERFSAMPLE doesn't change the behavior of the processes being sampled. It samples information that each process always stores in shared memory, so it has no impact on their performance, and is therefore safe to use on a live system. The process running PERFSAMPLE itself does of course consume CPU - more as the sample rate or number of processes to sample is increased - but never more than a single CPU thread.

I'm hopeful that this tool might offer you a little more insight into the performance of your application and help make the most of InterSystems IRIS. 

A Simple Single-Process Example

Take a simple example of one process that is performing slowly. We'll sample it and perhaps start by looking at what routines are seen most in the samples. In other words, what routine is it spending the most time executing? Is that expected for this application, or is it surprising? Then we might look at the most common State (as ^JOBEXAM or %SYS.ProcessQuery would report). Is it mostly doing global references (e.g. GGET or GSET), doing device IO (READ/WRITE), waiting on a lock (LOCKW), etc.? Maybe it's mostly doing global references, and so we can look at the Kernel Wait State to see if it's mostly waiting or not, and if it is, for what: disk reads, block collisions, journal writes, another internal resource, etc. ^PERFSAMPLE lets you aggregate these dimensions of analysis in a hierarchy you choose, like a pivot table.

Here's what PERFSAMPLE looks like for one process sampled while doing some application activity. We'll look at it first in two dimensions: Using CPU? -> State. Using CPU? says whether the sampled state indicates that the process is running, or at least could be, assuming CPU is available at the system level, as opposed to waiting for something else.

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 11:26:46
8842 samples  |  CPULoad* 0.91
-----------------------------'?' for help-------------------------------
Using CPU? [100 %-total]
 > yes                [90.7 %-total]
   no                 [9.33 %-total] 

So this process was spending 90.7% of its time in states where we'd expect it to be using CPU (and indeed this matches the true measure of its CPU time at the operating system level). Now, digging into that 90.7%, we find the following states.

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 11:26:46
8842 samples  |  CPULoad* 0.91
-----------------------------'?' for help-------------------------------
Using CPU? [yes] -> Process State [90.7 %-total]
 > RUN                [67.0 %-total]
   GGET               [15.6 %-total]
   GDEF               [5.77 %-total]
   GORD               [1.82 %-total]
   LOCK               [0.509 %-total]

Here we see that some of its CPU time is spent accessing globals (getting values, $order, etc), but it's mostly in other application logic (the general RUN state).  What about the time where it wasn't using CPU?  We go back and dig into Using CPU? [no].

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 11:26:46
8842 samples  |  CPULoad* 0.91
-----------------------------'?' for help-------------------------------
Using CPU? [no] -> Process State [9.33 %-total]
 > GDEF               [4.89 %-total]
   GGET               [4.42 %-total]
   GORD               [0.0226 %-total]

We see that the time when it wasn't using CPU was all in global accesses, but that doesn't tell us why, so we go back up to Using CPU? [no] and add the Kernel Wait State.

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 11:26:46
8842 samples  |  CPULoad* 0.91
-----------------------------'?' for help-------------------------------
Using CPU? [no] -> Kernel Wait State [9.33 %-total]
 > diskio             [9.33 %-total]

Now we can see that this portion of its time was reading database blocks from disk.  

So, what's our conclusion in this simple example? This process is spending roughly 10% of its time reading the database from disk, 20% of its time doing the rest of what's required for accessing globals, and 70% in other logic. This particular example shows a reasonable mix that suggests it might be performing about as expected given the application algorithm. If that's too slow, we'll need to understand the application code that it's running and perhaps look for opportunities for improvement or parallelization. If, on the other hand, we had seen that this process was dominated by the diskio wait state, questions about global buffer configuration and underlying storage hardware would come to mind, along with considering opportunities for parallelization or $prefetchon in the application.

In either case, the immediate next step in data collection might be that we end up back in ^JOBEXAM to see exactly what globals it's referencing, but now better informed with the shape of its performance profile.  Or we might even decide to use ^TRACE (a new utility in 2021.2+) to follow the exact sequence of global references that it's doing and at what lines of application code.

Multiple Processes

PERFSAMPLE can sample multiple or all processes, and the PID from which each sample came is available as a dimension of analysis. So, for example, choosing to analyze Using CPU? -> PID would show the highest CPU users, and Routine -> PID would get the top routines and then the top processes found running each of them. Choosing to analyze the dimensions in the opposite order, with PID first, allows you to see the data for the other dimensions sorted separately for each individual process out of the multiple processes sampled.

Here's what PERFSAMPLE looks like after sampling all processes on the system under some particular application load. I chose the option to ignore samples where the process appears idle (READ, HANG, etc.) to filter out processes that aren't likely to be interesting. As highlighted below, the captured samples are then only 27.4% of the total. We'll start by looking at the Routines that were running in our samples. At the same time, note the CPULoad metric, which simply expresses the average number of jobs that were Using CPU? [yes] across the samples; if the system had a sufficient number of CPUs to schedule all these jobs, and if the performance was fairly uniform across the sample period, this would closely match the number of CPU threads reported busy for IRIS processes at the OS level (e.g. if this system had 8 cores with hyperthreading, the OS might show about 25% utilization: ~4 CPU threads utilized on average out of 16).

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 12:05:26
103655 events in 378240 samples [27.4 %-total]  |  CPULoad* 3.96
Multiple jobs included: 9456 samples per job
-----------------------------'?' for help-------------------------------
Routine [27.4 %-total]
 > RECOVTEST          [25.0 %-total]
   JRNDMN             [2.38 %-total]
   c                  [0.0402 %-total]
   shell              [0.0280 %-total]
   %SYS.WorkQueueMgr  [0.00132 %-total]
   %SYS.Monitor.AbstractSensor.1          [0.000529 %-total]
   SYS.Database.1     [0.000529 %-total]
   %Library.ResultSet.1                   [0.000264 %-total]
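The headline numbers in that display are simple ratios over the raw sample counts; checking the arithmetic with the figures shown above:

```python
samples_total = 378240   # all samples taken across every process
events = 103655          # samples that passed the non-idle filter

# The 27.4 %-total shown in the header:
pct_total = 100.0 * events / samples_total

# CPULoad is the average number of jobs seen in a Using CPU? [yes] state
# per sample tick; with 16 CPU threads (8 hyperthreaded cores), a CPULoad
# of 3.96 corresponds to roughly 25% OS-level utilization, as noted above.
cpuload = 3.96
utilization_pct = 100.0 * cpuload / 16

print(round(pct_total, 1), round(utilization_pct, 2))  # 27.4 24.75
```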

When we're sampling every process on the system, the display above, expressed as percentages of all samples, isn't always the most helpful view. The system could have a large number of largely idle processes, with the bulk of the application activity being only a small percentage of the total. We can press 'c' to cycle the display to show the counts as a percentage of the subset.

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 12:05:26
103655 events in 378240 samples [27.4 %-total]  |  CPULoad* 3.96
Multiple jobs included: 9456 samples per job
-----------------------------'?' for help-------------------------------
Routine [27.4 %-total]
 > RECOVTEST          [91.1 %-subset]
   JRNDMN             [8.67 %-subset]
   c                  [0.147 %-subset]
   shell              [0.102 %-subset]
   %SYS.WorkQueueMgr  [0.00482 %-subset]
   %SYS.Monitor.AbstractSensor.1          [0.00193 %-subset]
   SYS.Database.1     [0.00193 %-subset]
   %Library.ResultSet.1                   [0.000965 %-subset]

Or, we can press 'c' again to display the counts in terms of the number of processes they represent. It's simply the number of matching samples divided by the number of samples per process, but it's useful because, if the performance was fairly uniform across the sample period, this can closely match the number of processes actually observed in that state at any one time. Pressing 'c' can also cycle to raw counts. Here's what the top two routines look like in those alternative displays.

Routine [27.4 %-total]
 > RECOVTEST          [9.98 jobs]
   JRNDMN             [0.951 jobs]
...
Routine [103655]
 > RECOVTEST          [94396]
   JRNDMN             [8991]

Note that the JRNDMN data point is immediately interesting in this view. We know there's only one journal daemon, and with the sampled job count very close to 1 (0.951 jobs), it was seen as non-idle in almost every sample - remember we told PERFSAMPLE to ignore samples that looked idle (and if we hadn't, it would of course be exactly 1). So we immediately learn that there was substantial journal activity. While there are much more direct ways to measure journal activity if we were looking for it, it's the sort of detail that can jump out when we slice the samples in a certain way.
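As a quick check, the jobs view is just the raw count divided by the per-job sample count (9456 samples per job in this run):

```python
samples_per_job = 9456

# Raw counts taken from the display above.
recovtest_jobs = 94396 / samples_per_job   # ~9.98 jobs running RECOVTEST
jrndmn_jobs = 8991 / samples_per_job       # ~0.951 jobs: the single journal
                                           # daemon, non-idle in nearly every sample

print(round(recovtest_jobs, 2), round(jrndmn_jobs, 3))  # 9.98 0.951
```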

Now, let's focus in on that RECOVTEST routine that was dominating 90% of the non-idle samples. In a real application, the routine names alone would be more telling and might immediately point you to an area of interest, but in my simple example, the load I generated was indeed almost all from this one large routine, so we need to look further into what it's doing. With the '>' cursor pointing at RECOVTEST, we'll press '+' and add the State dimension...

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 12:05:26
103655 events in 378240 samples [27.4 %-total]  |  CPULoad* 3.96
Multiple jobs included: 9456 samples per job
-----------------------------'?' for help-------------------------------
Routine [RECOVTEST] -> Process State [25.0 %-total]
 > GSETW              [55.1 %-subset]
   RUN                [23.1 %-subset]
   BSETW              [7.75 %-subset]
   GGETW              [6.82 %-subset]
   GSET               [3.76 %-subset]
   INCR               [1.71 %-subset]
   BSET               [0.767 %-subset]
   INCRW              [0.407 %-subset]
   GGET               [0.316 %-subset]
   LOCK               [0.279 %-subset]

In the above we see 55% of this routine's time was spent in GSETW, which means it's doing a global SET, but the W means that it's sleeping, waiting for something (see the class reference for %SYS.ProcessQuery's State property). We press '+' again and add the Kernel Wait State. Notice that we're still looking just under these samples of the RECOVTEST routine in the GSETW state; we may be interested in going up to start a new analysis with Kernel Wait State as the top dimension, but for now we're looking only for the explanation of this one particular set of data points.

PERFSAMPLE for Local Process Activity.  11.00s at 12/01/2023 12:05:26
103655 events in 378240 samples [27.4 %-total]  |  CPULoad* 3.96
Multiple jobs included: 9456 samples per job
-----------------------------'?' for help-------------------------------
Routine [RECOVTEST] -> Process State [GSETW] -> Kernel Wait State [13.8 %-total]
 > inusebufwt         [99.9 %-subset]
   resenqPer-BDB      [0.0577 %-subset]

The 'inusebufwt' state (see the ^PERFSAMPLE documentation) means that this process was waiting due to block collisions: the block that this process wanted to modify was momentarily in use by another process, so this one had to wait. Either multiple processes are SETting, KILLing or fetching the same global variable name (global subscript) simultaneously, or there's a "false sharing" pattern where different subscripts, modified and fetched simultaneously, happen to be colocated in the same block. Returning in PERFSAMPLE to start a new analysis of Kernel Wait State -> Routine would show all routines that were found in the 'inusebufwt' state. From there, inspection of the application code and use of ^BLKCOL or ^TRACE would identify the global references that were contending, while ^REPAIR would let you see which subscripts are colocated in the blocks in question.

Sampling ECP Requests to the Data Server

PERFSAMPLE includes a special sampling mode for ECP Data Servers. When run on an ECP Data Server and the Sample ECP Server Requests option is used, it samples the incoming ECP requests that the data server is currently processing, including the global or lock name and its subscripts. This can be very helpful in understanding what application activity contributes the most to the load on the data server from the ECP Application Servers.  It also samples the process state of the ECP server daemon processing the request, so that the State and Kernel Wait State are available just as in the above examples.

Question
· November 29, 2023

How can we set the properties of a package in VSCode?

In the new versions of IRIS, Studio is going to be deprecated. In Studio, when editing classes, there is an option to add information at the package level, via the "Package Information" option that shows this dialog:
 

 

Is there an option in VSCode to add/edit this package information? If not, how can one add/edit this information without Studio?

Thanks.

Question
· November 28, 2023

Converting API Monitor Metrics to JSON: Addressing Carriage Return and Line Feed Challenges in InterSystems IRIS Integration

Hello,

First of all thanks for your help, time, and answers.

We would like to know what we are doing wrong and how we could improve and fix it.

We need to convert the API Monitor metrics, which are a string with this format:

iris_cache_efficiency 13449.122
iris_cpu_pct{id="CSPDMN"} 0
iris_cpu_pct{id="ECPWorker"} 0

[...]

iris_wdwij_time 11
iris_wd_write_time 8
iris_wij_writes_per_sec 0

 

To JSON.

We would expect them to look like a normal JSON as follows:

{
"iris_cache_efficiency": "13449.122",
"iris_cpu_pct{id='CSPDMN'}": "0",

[...]
"iris_wij_writes_per_sec": 0
}
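The intended transformation is mechanical: swap the double quotes inside label selectors for single quotes, split each line on its last space, and emit the name/value pairs as a JSON object. A sketch of that logic, for reference (in Python rather than ObjectScript; `metrics_to_json` is illustrative and keeps all values as strings):

```python
import json

def metrics_to_json(raw: str) -> str:
    """Convert Prometheus-style 'name value' lines to a JSON object string."""
    pairs = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        # The value is everything after the last space; the name may itself
        # contain spaces-free label selectors like {id="CSPDMN"}.
        name, _, value = line.rpartition(" ")
        # Double quotes inside the selector would break the JSON key, so
        # swap them for single quotes, as in the expected output above.
        pairs[name.replace('"', "'")] = value
    return json.dumps(pairs)

example = 'iris_cache_efficiency 13449.122\niris_cpu_pct{id="CSPDMN"} 0'
print(metrics_to_json(example))
```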

 

We have currently developed a REST Operation which gets them and tries to convert them from the raw string to a JSON format:

Class Operaciones.REST.MetricasApiMonitorv01r00 Extends EnsLib.REST.Operation
{

Parameter INVOCATION = "Queue";
Method obtenerMetricas(pRequest As Mensajes.Request.Metricas.ObtenerRequest, pResponse As Mensajes.Response.Metricas.ObtenerResponse) As %Library.Status
{
	//Create the HTTP Request and Response
	Set httpRequest=##class(%Net.HttpRequest).%New()
	set tResponse  = ##class(%Net.HttpResponse).%New()
	set pResponse = ##class(Mensajes.Response.Metricas.ObtenerResponse).%New()

	// Get the URL from the one configured in the Production
	;set URL		= ..Adapter.URL
	set URL = "http://[Ip]:[Port]/api/monitor/metrics"
	$$$LOGINFO("URL: "_URL)

	//Send to the external system
	set tSC=httpRequest.Get(URL,0) 
	$$$LOGALERT("tSC: "_$System.Status.GetErrorText(tSC))

	//Throw an exception if there was an error
	if $$$ISERR(tSC){
			$$$ThrowOnError(tSC)
	}
	
	set tResponse = httpRequest.HttpResponse
	
	set linea = ""
	//Read the response
	while (tResponse.Data.AtEnd = 0) {
		set linea = linea_tResponse.Data.Read()
	}
	$$$LOGINFO("linea: "_linea)
	
	set lineaSinComillasDobles = $REPLACE(linea,"""","'")
	$$$LOGINFO("lineaSinComillasDobles: "_lineaSinComillasDobles)
	
	set lineaConDobleComillaDosPuntosEnMedio = $REPLACE(lineaSinComillasDobles," ",""": ")
	$$$LOGINFO("lineaConDobleComillaDosPuntosEnMedio: "_lineaConDobleComillaDosPuntosEnMedio)
	
	$$$LOGALERT("$FIND(lineaConDobleComillaDosPuntosEnMedio,$CHAR(13,10)): "_$FIND(lineaConDobleComillaDosPuntosEnMedio,$CHAR(13,10)))
	
	set lineaConComasAlFinal = $REPLACE(lineaConDobleComillaDosPuntosEnMedio, $CHAR(13,10),",")
	$$$LOGINFO("lineaConComasAlFinal: "_lineaConComasAlFinal)
	
	set pResponse.resultado = "{"_lineaConComasAlFinal_"}"
	Quit pResponse
}

XData MessageMap
{
<MapItems>
  <MapItem MessageType="Mensajes.Request.Metricas.ObtenerRequest">
    <Method>obtenerMetricas</Method>
  </MapItem>  
  </MapItems>
}

}

 

However, we do not know how to replace the Carriage Return and Line Feed with a comma and a double quote.

We have tried:

	$$$LOGALERT("$FIND(lineaConDobleComillaDosPuntosEnMedio,$CHAR(13,10)): "_$FIND(lineaConDobleComillaDosPuntosEnMedio,$CHAR(13,10)))
	
	set lineaConComasAlFinal = $REPLACE(lineaConDobleComillaDosPuntosEnMedio, $CHAR(13,10),",")
	$$$LOGINFO("lineaConComasAlFinal: "_lineaConComasAlFinal)

 

However, the $FIND outputs "0", so we think it does not find a Carriage Return and Line Feed at all.

Moreover, the $REPLACE has no effect.

The response currently shows:

{iris_cache_efficiency": 13492.868
iris_cpu_pct{id='CSPDMN'}": 0
iris_cpu_pct{id='CSPSRV'}": 1

[...]

iris_wdwij_time": 24
iris_wd_write_time": 23
iris_wij_writes_per_sec": 0}

 

This is what is output in the Visual Trace:

 

How could we convert the API Monitor Metrics to JSON in an effective way?

What have we done wrong?

How could we improve and fix our code to accomplish this need?

 

In addition we have also read:

https://community.intersystems.com/post/replace-carriage-return-linefeed...

https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls...

https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls...

https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls...

 

Thanks for your help

Article
· November 28, 2023 · about 3 min read

InterLang: Technical Implementation of Streamlit Chatbot and LangChain FHIR Tool

Overview

In our previous post, we discussed the motivation for developing a chatbot agent with access to FHIR resources. In this post, we will dive into the high-level design aspects of integrating a Streamlit-based chat interface with a Java SpringBoot backend, and enabling a LangChain agent with access to FHIR (Fast Healthcare Interoperability Resources) via APIs.


 

Connecting Streamlit to Java SpringBoot

The Streamlit application acts as the frontend, providing an interactive chat interface. It communicates with the Java SpringBoot backend, which handles processing and responding to user inputs. Here's a high-level overview of this integration:

  1. User Input Handling: The Streamlit app captures user input through a chat interface. This input is then sent to the Java SpringBoot backend via an HTTP POST request.
response = requests.post(BASE_URL, json={"user_input": prompt})
  2. Backend Processing: The Java SpringBoot application receives this input, processes it, and generates a response. This process may involve accessing and manipulating FHIR resources.
  3. Response Delivery: The response from the backend is sent back to the Streamlit app. Streamlit then displays this response in the chat interface, providing an interactive experience for the user.
  4. State Management: Streamlit maintains a session state to keep track of the chat history. This allows for a persistent conversation flow, which is crucial for a chatbot experience.
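The steps above can be sketched, on the Streamlit side, roughly as follows. This is illustrative only: `handle_prompt`, the `reply` field, the endpoint URL, and the plain-list history are assumptions, not the project's actual code; in the real app, `post` would be `requests.post` and the history would live in `st.session_state`.

```python
def handle_prompt(prompt, history, post):
    """One chat turn: capture input, call the backend, record the reply.

    `post` is injectable (e.g. requests.post) so the round-trip can be
    exercised without a running server; the URL below is a placeholder.
    """
    base_url = "http://localhost:8080/chat"                  # assumed endpoint
    history.append({"role": "user", "content": prompt})      # 1. capture input
    response = post(base_url, json={"user_input": prompt})   # 2. backend call
    reply = response.json().get("reply", "")                 # 3. read response
    history.append({"role": "assistant", "content": reply})  # 4. update state
    return reply
```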

 

LangChain Agent with FHIR Resources via API

The LangChain agent, integrated into the Java SpringBoot backend, leverages FHIR resources to provide healthcare-related information and services. The key components of this setup are:

 

FHIR Tools Service: This Java-based service is responsible for interacting with FHIR resources. It includes functionalities like fetching patient data, creating observations, and managing goals.

    @Tool("Fetches a FHIR Patient resource using a GET request by ID")
    Patient getPatientById(String patientId) {
        return client.read()
                .resource(Patient.class)
                .withId(patientId)
                .execute();
    }

 

Integration with LangChain: The LangChain agent, utilizing GPT-4, is integrated into the SpringBoot application. This agent has access to the FHIR Tools service, allowing it to perform actions like retrieving patient information or updating records based on user input.

this.assistant = AiServices.builder(Assistant.class)
                .chatLanguageModel(chatModel)
                .tools(this.fhirTools)
                .chatMemory(MessageWindowChatMemory.withMaxMessages(10))
                .build();

 

Handling Conversational Context: The agent uses a MessageWindowChatMemory to maintain the context of the conversation. This is crucial for understanding the flow of the conversation and providing relevant responses.

 

 

API Key Management: Both the FHIR service and the OpenAI GPT-4 model require API keys for authentication. These keys are managed securely within the SpringBoot application.

 


Video Demo

Demo

 

Authors

*   Zacchaeus Chok
*   Varun Swaminathan
*   Gabriel Yang
