Article
· June 15, 2023 · About a 6 minute read

LangChain InterSystems PDF to Interview Questions and FlashCards

A demonstration for the current Grand Prix contest that uses a more complex parameterized prompt template to test the AI.

Interview Questions

Suppose documentation exists, and a recruitment consultant wants to quickly challenge candidates with technical questions relevant to a role.

Can they automate the creation of a list of questions and answers from the available documentation?

Interview Answers and Learning

One of the most effective ways to cement new facts into accessible long-term memory is phased recall.

In essence, you take a block of text and reorganize it into a series of self-contained Questions and Facts.

Now imagine two questions:

  • What day of the week is the trash-bin placed outside for collection?
  • When is the marriage anniversary?

Quickly recalling correct answers can mean a happier life!!

Recalling the answer to each question IS the mechanism that reinforces a fact in memory.

Phased Recall re-asks each question with longer and longer time gaps as long as the correct answer is recalled.
For example:

  • You consistently get the right answer: The question is asked again tomorrow, in 4 days, in 1 week, in 2 weeks, in 1 month.
  • You consistently get the answer wrong: The question will be asked every day until it starts to be recalled.
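
The schedule in the two bullets above can be sketched in a few lines of Python. This is only an illustration of the idea: the gap list comes from the example above, while the function name `next_gap_days` and the "back to daily after a wrong answer" rule are assumptions made for the sketch. Anki's real scheduler is more elaborate.

```python
from typing import Tuple

# Gaps (in days) from the example above: tomorrow, 4 days, 1 week, 2 weeks, 1 month.
GAPS_DAYS = [1, 4, 7, 14, 30]

def next_gap_days(streak: int, answered_correctly: bool) -> Tuple[int, int]:
    """Return (new_streak, days_until_next_review).

    streak counts the consecutive correct recalls before this answer.
    """
    if not answered_correctly:
        # A wrong answer drops the question back to daily review.
        return 0, 1
    # Each correct recall moves one step along the schedule and
    # stays on the longest gap once the end of the list is reached.
    gap = GAPS_DAYS[min(streak, len(GAPS_DAYS) - 1)]
    return streak + 1, gap
```

A question answered correctly for the second time in a row (`streak=1`) would come back in 4 days; a wrong answer at any point returns it to a 1 day gap.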

When you can easily see which answers are challenging, it is productive to rework them to make them more memorable.

There is a free software package called Anki that provides this full phased recall process for you.

If you can automate the creation of questions and answers into a text file, Anki will create new flashcards for you.

Hypothesis

We can use LangChain to transform InterSystems PDF documentation into a series of Questions and answers to:

  • Make interview questions and answers
  • Make Learner Anki flash cards

Create new virtual environment

mkdir chainpdf

cd chainpdf

python -m venv .

scripts\activate 

pip install openai
pip install langchain
pip install wget
pip install lancedb
pip install tiktoken
pip install pypdf

set OPENAI_API_KEY=[ Your OpenAI Key ]

python

Prepare the docs

import zipfile
import wget

# download the documentation archive
url='https://docs.intersystems.com/irisforhealth20231/csp/docbook/pdfs.zip'
wget.download(url)
# extract docs
with zipfile.ZipFile('pdfs.zip','r') as zip_ref:
  zip_ref.extractall('.')

Extract PDF text

from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.prompts.prompt import PromptTemplate
from langchain import OpenAI
from langchain.chains import LLMChain

# Limit the input for this example to two document sets.
# The documentation site shows that:
# GCOS = Using ObjectScript
# RCOS = ObjectScript Reference
pdfFiles=['./pdfs/pdfs/GCOS.pdf','./pdfs/pdfs/RCOS.pdf']

# The prompt will be large, and space must be left for the answer to be constructed.
# Therefore keep each input chunk small.
text_splitter = CharacterTextSplitter(
    separator = "\n\n",
    chunk_size = 200,
    chunk_overlap  = 50,
    length_function = len,
)

# split document text into chunks
documentsAll=[]
for file_name in pdfFiles:
  loader = PyPDFLoader(file_name)
  pages = loader.load_and_split()
  # Strip unwanted padding
  for page in pages:
    del page.lc_kwargs
    # remove non-breaking spaces left over from PDF extraction
    page.page_content=page.page_content.replace('\xa0','')
  documents = text_splitter.split_documents(pages)
  # Ignore the cover pages
  for document in documents[2:]:
    # skip table of contents
    if '........' in document.page_content:
      continue
    documentsAll.append(document)

Prep search template

_GetDocWords_TEMPLATE = """From the following documents create a list of distinct facts.
For each fact create a concise question that is answered by the fact.
Do NOT restate the fact in the question.

Output format:
Each question and fact should be output on a separate line delimited by a comma character
Escape every double quote character in a question with two double quotes
Add a double quote to the beginning and end of each question
Escape every double quote character in a fact with two double quotes
Add a double quote to the beginning and end of each fact
Each line should end with {labels}

The documents to reference to create facts and questions are as follows:
{docs}
"""

PROMPT = PromptTemplate(
     input_variables=["docs","labels"], template=_GetDocWords_TEMPLATE
)

llm = OpenAI(temperature=0, verbose=True)
chain = LLMChain(llm=llm, prompt=PROMPT)

Process each document and place output in file

# open an output file (UTF-8 so non-ASCII characters in the answers can be written)
with open('QandA.txt','w',encoding='utf-8') as file:
  # iterate over each text chunk
  for document in documentsAll:
    # set the label for Anki flashcard
    source=document.metadata['source']
    if 'GCOS.pdf' in source:
      label='Using ObjectScript'
    else:
      label='ObjectScript Reference'
    output=chain.run(docs=document,labels=label)
    file.write(output+'\n')
    file.flush()

 

Some retry and forced-close messages appeared during the loop.

I expect this is the OpenAI API limiting requests to enforce fair use.

Alternatively, a local LLM could be used instead.
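
If the retry messages ever turn into hard failures, one option is to wrap each call in a small retry helper with growing waits. The sketch below is an assumption about how that could look, not something the original run used: `call_with_retry`, `attempts` and `base_delay` are names invented for illustration, and a real script would catch only the rate-limit error types rather than every exception.

```python
import time

def call_with_retry(func, *args, attempts=5, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func, retrying after an exception with exponentially growing waits."""
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # wait 1s, 2s, 4s, ... (scaled by base_delay) before trying again
            sleep(base_delay * (2 ** attempt))
```

In the loop above this would be used as `output = call_with_retry(chain.run, docs=document, labels=label)`.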

Examine the output file

"What are the contexts in which ObjectScript can be used?", "You can use ObjectScript in any of the following contexts: Interactively from the command line of the Terminal, As the implementation language for methods of InterSystems IRIS object classes, To create ObjectScript routines, and As the implementation language for Stored Procedures and Triggers within InterSystems SQL.", Using ObjectScript,
"What is a global?", "A global is a sparse, multidimensional database array.", Using ObjectScript,
"What is the effect of the ##; comment on INT code line numbering?", "It does not change INT code line numbering.", Using ObjectScript,
"What characters can be used in an explicit namespace name after the first character?", "letters, numbers, hyphens, or underscores", Using ObjectScript
"Are string equality comparisons case-sensitive?", "Yes" Using ObjectScript,
"What happens when the number of references to an object reaches 0?", "The system automatically destroys the object.",Using ObjectScript
Question: "What operations can take an undefined or defined variable?", Fact: "The READ command, the $INCREMENT function, the $BIT function, and the two-argument form of the $GET function.", Using ObjectScript,  a

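Although the model made a good attempt at the requested format, some lines deviate from it: the label is not quoted, a comma is missing, or stray "Question:" and "Fact:" prefixes appear.

After a manual review, I picked some questions and answers to continue the experiment.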
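
Part of this clean-up could probably be scripted. The following is a minimal sketch, assuming that the first two double-quoted strings on a line are always the question and the fact; `normalize_line` is a hypothetical helper, not part of the original experiment. It re-emits each line as three fully quoted fields and returns `None` for lines that still need a human.

```python
import csv
import io
import re

# A double-quoted field in which embedded quotes are doubled ("").
QUOTED = re.compile(r'"((?:[^"]|"")*)"')

def normalize_line(line, label):
    """Return a clean 3-field CSV row, or None when the line needs manual review."""
    fields = [m.replace('""', '"') for m in QUOTED.findall(line)]
    if len(fields) < 2:
        return None
    buf = io.StringIO()
    # QUOTE_ALL quotes every field and doubles any embedded quotes again.
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerow([fields[0], fields[1], label])
    return buf.getvalue().rstrip("\r\n")
```

Because the label is supplied by the caller, unquoted labels, trailing commas and stray text after the fact are simply discarded.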

Importing FlashCards into Anki

Reviewed text file:

"What are the contexts in which ObjectScript can be used?", "You can use ObjectScript in any of the following contexts: Interactively from the command line of the Terminal, As the implementation language for methods of InterSystems IRIS object classes, To create ObjectScript routines, and As the implementation language for Stored Procedures and Triggers within InterSystems SQL.", "Using ObjectScript",
"What is a global?", "A global is a sparse, multidimensional database array.", "Using ObjectScript",
"What is the effect of the ##; comment on INT code line numbering?", "It does not change INT code line numbering.", "Using ObjectScript",
"What characters can be used in an explicit namespace name after the first character?", "letters, numbers, hyphens, or underscores", "Using ObjectScript"
"Are string equality comparisons case-sensitive?", "Yes", "Using ObjectScript",
"What happens when the number of references to an object reaches 0?", "The system automatically destroys the object.","Using ObjectScript"
"What operations can take an undefined or defined variable?", "The READ command, the $INCREMENT function, the $BIT function, and the two-argument form of the $GET function.", "Using ObjectScript"

Creating new Anki card deck

Open Anki and select File -> Import

 

Select the reviewed text file

Optionally create a new Card Deck for "Object Script"

A basic card type is fine for this format

 

The import dialog mentioned a "Field 4", so the records should be checked for extra fields.
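
A likely cause, though this is an assumption rather than something confirmed in Anki, is that some lines in the reviewed file end with a trailing comma, which a CSV parser reads as an empty fourth field. A short check with Python's `csv` module can list such rows before importing; `rows_with_unexpected_fields` is a hypothetical helper name.

```python
import csv
import io

def rows_with_unexpected_fields(text, expected=3):
    """Yield (line_number, field_count) for rows that do not have `expected` fields."""
    # skipinitialspace lets the parser accept a space after each comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for number, row in enumerate(reader, start=1):
        if row and len(row) != expected:
            yield number, len(row)
```

For example, `list(rows_with_unexpected_fields(open('QandA.txt', encoding='utf-8').read()))` would report every row whose field count differs from three.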

Anki import success

Let's Study

Now choose the reinforcement schedule

Happy Learning !!

References

Anki software is available from https://apps.ankiweb.net/

Please note: this post is outdated.
InterSystems Official
· June 14, 2023

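June 13, 2023 - Advisory: Increased Process Memory Usage

InterSystems has corrected a defect that causes increased process memory usage in InterSystems IRIS products.

Affected versions:
  InterSystems IRIS                      2022.2, 2022.3, 2023.1.0
  InterSystems IRIS for Health   2022.2, 2022.3, 2023.1.0
  HealthShare Health Connect   2022.2, 2022.3, 2023.1.0
  Healthcare Action Engine         2022.1

Affected platforms: All

Problem details:
When $Order, $Query, or Merge is executed against a local variable, the memory consumed by the process's local variable table increases. In most environments this has no adverse effect, but environments with a large number of processes, or environments that strictly limit the maximum memory per process, may be affected. In addition, some processes may encounter <STORE> errors.

Resolution:
This problem is resolved by correction IDs DP-423127 and DP-423237.
These corrections will be included in all future versions.

In addition, the previously published InterSystems IRIS 2023.1.0.229.0 has been updated to InterSystems IRIS 2023.1.0.235.1, which includes these corrections.

On request, the corrections can be provided individually as a patch for the EM release you are currently using. If you need a patch for your system, please have your version and license key information ready and contact the InterSystems Customer Support Center. If you have questions about this advisory, please contact the InterSystems Customer Support Center.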

Question
· June 14, 2023

TCP Adapter - Local Interface setting

Forgive me but our System Administrator who knows how the networking works is OOO...

How does IRIS know which local adapters are available to populate in an Inbound or Outbound TCP Adapter Object? We recently moved from HealthShare Health Connect 2018.1.3 to IRIS HealthShare Health Connect 2022.1. When we migrated, we moved the VIP over to the new box and set it at the hardware level.

On RedHat, when I run ifconfig, I see two ens192 adapters:

ens192: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 8900
        inet xxxxxx  netmask xxxxx broadcast xxxxxxxx
        inet6 xxxxxxx  prefixlen 64  scopeid 0x20<link>
        ether xxxxxx  txqueuelen 1000  (Ethernet)
        RX packets 2844737404  bytes 489525499847 (455.9 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2890261627  bytes 6219593374601 (5.6 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens192:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 8900
        inet xxxxxxx  netmask xxxxxx broadcast xxxxxx
        ether xxxxx txqueuelen 1000  (Ethernet)

The ens192:1 entry represents the VIP. So we set the Local Interface in our connections that use the VPN to the address of ens192:1. However, the networking team says they are seeing packets with the IP address of ens192, not the VIP.

If we have the Local Interface set to the correct IP address that it should represent, why would the network folks see the other IP address in the packet as we are trying to troubleshoot a new VPN connection?
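
For comparison outside IRIS, the general socket-level idea behind choosing a local interface is to bind the outgoing socket to a specific source address before connecting. The sketch below shows this with Python's standard library; it is only a generic illustration and makes no claim about how the IRIS TCP adapter implements its Local Interface setting. `connect_from` is a hypothetical helper name.

```python
import socket

def connect_from(local_ip, remote_host, remote_port, timeout=5.0):
    """Open a TCP connection whose source address is bound to local_ip."""
    # Port 0 lets the operating system pick any free source port.
    return socket.create_connection(
        (remote_host, remote_port), timeout=timeout, source_address=(local_ip, 0)
    )
```

Calling `getsockname()` on the returned socket shows the source address the operating system actually used, which can be compared with what the network team sees on the wire.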

Article
· June 14, 2023 · About a 2 minute read

LangChain Ghost in the PDF

Posing a question to consider during the current Grand Prix competition.

I wanted to share an observation about using PDFs with LangChain.

When loading the text out of a PDF, I noticed there was an artifact of gaps within some of the words extracted.

For example (note the gaps inside words such as "Adapti ve" and "e xtension"):

Adapti ve Analytics is an optional e xtension that pro vides a b usiness-oriented, virtual data model layer\nbetween InterSystems IRIS and popular Business Intelligence (BI) and Artificial Intelligence (AI) client tools. It includes\nan intuiti ve user interf ace for de veloping a data model in the form of virtual cubes  where data can be or ganized, calculated\nmeasures consistently defined, and data fields clearly named. By ha ving a centralized common data model, enterprises\nsolve the problem of dif fering definitions and calculations to pro vide their end users with one consistent vie w of b usiness\nmetrics and data characterization.

I was concerned this would affect:
1) The quality of document search for related content
2) The ability of OpenAI model to generate answers

What might be needed to stitch these words back together to improve things?

Could this use a word dictionary?

What would be the risk of joining two separate words together?
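
One naive way to explore the dictionary idea is sketched below. It joins two adjacent fragments only when the joined form is a known word and at least one fragment is not a word on its own, which is one possible guard against gluing two real words together. The function name `stitch_fragments` and the tiny vocabulary are assumptions for illustration; the sketch handles only two-fragment splits and collapses all whitespace to single spaces.

```python
def stitch_fragments(text, vocabulary):
    """Greedily join adjacent fragments when the joined form is a known word."""
    tokens = text.split()
    result = []
    i = 0
    while i < len(tokens):
        current = tokens[i]
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            # Ignore surrounding punctuation and case for the dictionary lookup.
            joined = (current + nxt).strip(".,;:()").lower()
            both_are_words = (
                current.strip(".,;:()").lower() in vocabulary
                and nxt.strip(".,;:()").lower() in vocabulary
            )
            if joined in vocabulary and not both_are_words:
                result.append(current + nxt)
                i += 2
                continue
        result.append(current)
        i += 1
    return " ".join(result)
```

With a vocabulary containing "adaptive" and "extension", the fragment "Adapti ve" becomes "Adaptive", while a pair such as "a part" is left alone even if "apart" is also in the vocabulary.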

Pushing ahead, the unanticipated outcome was:

  • It didn't make a difference to either the document search or the ability to generate answers.

I suspect this is down to the way OpenAI encoding and tokenization operate.
The number of tokens is typically higher than the number of words,
so tokens already behave like "partial" words that follow one another.
Thus the spaces in the middle of words didn't affect the answer.

Please share your experiences of Ghosts / Curious effects when using LangChain with IRIS.

Article
· June 13, 2023 · About a 2 minute read

OEX mapping #2

Technology Strategy

When I started this project I set myself limits.
Although a wide range of almost ready-to-use modules exists in various languages,
and although IRIS has excellent facilities and interfaces to make use of them,
I decided to solve the challenge "totally internal", using just embedded Python, SQL, and ObjectScript.
No Java, no Node.js, no Angular, no PEX, ... you name it.
The combination of embedded Python and SQL is preferred. ObjectScript is just my last resort.

I was especially impressed by how easy it was to read an HTTPS page with Python.
On the other hand, I left the Unit Test, Global Merge, and Object Property Setter in COS.

Add on after 1st release

It was rather shocking that the initial load took about 50 min to end up with 730 records.
So a kind of QUICK preload was added. In practice, only the first page of the directory, and during a contest
possibly the 2nd page, holds new entries. The rest is almost static, not to say frozen.

Loading pages 1 and 2 is mostly sufficient to get all new packages.
Loading DETAILS for the few newcomers then takes hardly any time.
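
The QUICK idea can be outlined in Python roughly as follows. This is a sketch of the approach described above, not the package's actual code: `quick_preload`, `fetch_page` and `fetch_details` are hypothetical names, and the two fetch functions are passed in so the outline stays independent of how pages are really read.

```python
def quick_preload(fetch_page, fetch_details, known, pages=(1, 2)):
    """Load only the first directory pages and fetch details for unseen packages.

    fetch_page(n) returns the package names listed on directory page n,
    fetch_details(name) returns the detail record for one package,
    known maps already loaded package names to their detail records.
    """
    added = []
    for page in pages:
        for name in fetch_page(page):
            if name not in known:
                # Only new entries cost a detail request.
                known[name] = fetch_details(name)
                added.append(name)
    return added
```

Since the remaining pages are almost static, the number of detail requests stays close to the number of genuinely new packages.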

Collecting results with SQL is an easy exercise, but pivoting a cube is a bit more comfortable,
so today I added classic IRIS Analytics to my package.
It is enabled in namespace USER and is named OEX, as is the first pivot to start with.

After starting the container, the Unit Test leaves a test set of page 1 with ~30 records,
which is also the initial content of the cube.

-

If you decide to run a completely fresh load, it is up to you to rebuild the cube in Analytics Architect.

With the QUICK variant, the final step is a rebuild of the cube, and you get this result.

So whether you use SQL or Analytics is your decision.

I count on your votes in the contest.
 
